High-throughput prediction of small molecule binding affinities to ABL1, HSP90, and CDK2 using gradient boosting and machine learning on the BELKA dataset

Vedant Shrinivas Sagare *

Dublin High School, Dublin, Alameda County, California.
 
Research Article
World Journal of Advanced Research and Reviews, 2024, 24(01), 2426–2434
Article DOI: 10.30574/wjarr.2024.24.1.3068
 
Publication history: 
Received on 27 August 2024; revised on 23 October 2024; accepted on 26 October 2024
 
Abstract: 
Drug development is burdened by a vast search space of possible drug-like molecules and by resource-intensive conventional screening methods. This work applies machine learning to predict the binding affinity of small molecules to specific protein targets, a key step in modern drug development. The aim is to make drug discovery more efficient and accurate using the Big Encoded Library for Chemical Assessment (BELKA) dataset, which comprises 133 million small molecules screened against three protein targets: tyrosine-protein kinase ABL1, heat shock protein 90 (HSP90), and cyclin-dependent kinase 2 (CDK2). A LightGBM model was developed for affinity prediction using molecular descriptors derived from the SMILES representation of each molecule. The data were split into training and test sets, and features were extracted with RDKit, which calculated the molecular weight and the numbers of hydrogen bond donors and acceptors for each molecule. The model achieved an overall average precision score of 0.84, indicating strong predictive power. Average precision was highest for ABL1 (0.88), with more moderate scores for HSP90 (0.83) and CDK2 (0.81). Feature importance analysis showed that molecular weight and hydrogen bonding capacity were among the most influential features in the model's predictions. These results suggest that LightGBM can be a powerful tool for accelerating drug discovery, given its accurate and efficient prediction of binding interactions; potential improvements include incorporating more complex molecular features and 3D descriptors.
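
The reported metric can be illustrated with a minimal sketch. Average precision ranks molecules by predicted score and averages the precision observed at the rank of each true binder. The function names `average_precision` and `macro_average`, and the toy labels and scores, are assumptions made for illustration and are not taken from the article; ties between scores are not handled specially. The last lines show only that the unweighted mean of the three per-target scores quoted in the abstract equals the overall figure of 0.84, which is not necessarily how the author computed it.

```python
from typing import Dict, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of precision@k taken at the rank k of each true binder."""
    # Rank molecules from highest to lowest predicted binding score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # With no true binders the metric is undefined; return 0.0 by convention here.
    return precision_sum / hits if hits else 0.0


def macro_average(per_target: Dict[str, float]) -> float:
    """Unweighted mean of per-target scores."""
    return sum(per_target.values()) / len(per_target)


# Toy example (hypothetical data): binders sit at ranks 1 and 3.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(f"{average_precision(toy_labels, toy_scores):.4f}")  # 0.8333

# Per-target average precision values as quoted in the abstract.
reported = {"ABL1": 0.88, "HSP90": 0.83, "CDK2": 0.81}
print(f"{macro_average(reported):.2f}")  # 0.84
```

In the toy example the precision is 1/1 at the first binder and 2/3 at the second, giving (1 + 2/3) / 2 ≈ 0.8333.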
 
Keywords: 
Small molecule-protein interactions; Machine learning; LightGBM; Molecular descriptors; SMILES; Binding affinity
 