Current research demonstrated that
machine learning can be adopted to build up high-accurate predictive
models in drugs/cyclodextrins (CDs) systems. Molecular descriptors
of compounds and experimental conditions were employed as inputs,
while complexation free energy as outputs. Results showed that the
light gradient boosting machine provided significantly improved
predictive performance over random forest and deep learning.
The mean absolute error was 1.38 kJ/mol and squared correlation
coefficient was 0.86. In conclusion, the developed predictive
models were able to quickly and accu-rately predict the
solubilizing capacity of CD systems.
Figure 1: the statistic of the dataset.

Figure 2: the performance of the model.

Table 1: the parameters and statistics.
Item | Value |
---|---|
Training set number: | 2318 |
Validation set number: | 339 |
Test set number: | 335 |
Training set R2/RMSE | 0.86/1.94 |
Validation set R2/RMSE | 0.87/1.63 |
Test set R2/RMSE | 0.86/1.83 |

Zhao Q, Ye Z, Su Y, et al. Predicting complexation performance between cyclodextrins and guest molecules by integrated machine learning and molecular modeling techniques[J].
Acta Pharmaceutica Sinica B, 2019, 9(6): 1241-1252.
Amorphous solid dispersion (SD)
is an effective solubilization technique for water-insoluble drugs. However,
physical stability issue of solid dispersions still heavily hindered the
development of this technique. Traditional stability experiments need
to be tested at least three to six months, which is time-consuming and unpredictable.
In this research, a novel prediction model for physical stability of solid dispersion
formulations was developed by machine learning techniques. 646 stability data points
were collected and described by over 20 molecular descriptors.
All data was classified into the training set (60%), validation set (20%),
and testing set (20%) by the improved maximum dissimilarity algorithm (MD-FIS).
Eight machine learning approaches were compared and random forest (RF) model
achieved the best prediction accuracy.
Figure 1: the statistic of the dataset.

Table 1: the parameters and statistics.
Item | 3 month | 6 month |
---|---|---|
Training set: | 407 | 407 |
Validation set: | 121 | 121 |
Test set: | 121 | 121 |
Training set ACC | 0.951 | 0.941 |
Validation set ACC | 0.833 | 0.817 |
Test set ACC | 0.9 | 0.842 |

Han R, Xiong H, Ye Z, et al. Predicting physical stability of solid dispersions by machine learning techniques[J].
Journal of Controlled Release, 2019, 311: 16-25.
The drug-phospholipid complexation technique is widely used as an increasingly common method
to improve the bioavailability of drugs. This project aims to predict the complexation
rate of phospholipid complex by machine learning methods. 363 drug/phospholipid complex
formulations were collected from the literature. The datasets were trained by several popular
machine learning methods including RandomForest, SVM, KNN, Naive Bayes, Decision Tree, XGBoost, LightGBM.
By analysis and compare these methods, the RandomForest got better performance.
The predictive accuracy and AUC of RandomForest
was 0.772/0.824 on training set(5-fold cross validation), 0.808/0.902 for the test set, respectively.
Figure 1: the statistic of the phospholipids in dataset.

Table 1: the parameters and statistics.
Item | Combination |
---|---|
Training set: | 290 |
Test set: | 73 |
Training set ACC/AUC | 0.772/0.824 |
Test set ACC/AUC | 0.808/0.902 |

Gao H, Ye Z, Dong J, et al. Predicting drug/phospholipid complexation by the lightGBM method[J].
Chemical Physics Letters, 2020, 747: 137354.
Nanocrystals have exhibited great advantages for solving the dissolution issue of water insoluble drugs.
In this study, the machine learning models for the three nanocrystal preparation methods were
constructed for the nanocrystal size and PDI predictions. The results demonstrated
that random forest (RF) exhibited well predictive performance for the ball wet milling (BWM),
high-pressure homogenization (HPH) methods and anti solvent method (ASP).
Here, by submitting a molecule, we can predict both size and PDI for the three methods.
Figure 1: the statistic of the antisolvent dataset.

Figure 2: the statistic of the HPH dataset.

Figure 3: the statistic of the BWM dataset.

Table 1: the statistics of the models.
Dataset | Size(Qcv2/Rt2) | PDI(Qcv2/Rt2) |
---|---|---|
Antisolvent: | 0.434/0.419 | 0.593/0.518 |
HPH: | 0.585/0.554 | 0.786/0.690 |
BWM: | 0.699/0.786 | 0.837/0.796 |

He Y, Ye Z, Liu X, et al. Can machine learning predict drug nanocrystals?[J].
Journal of Controlled Release, 2020, 322: 274-285.
Self-emulsifying drug delivery system (SEDDS), a thermodynamic stable nano formulation, is consists of
oil,
surfactant, co-surfactant and APIs. For oral administration,
drugs are dissolved in the SEDDS instead of solid state,
which benefit to the absorption in the Gi tract.
Here, a pseudo-ternary phase diagram predicting model was constructed by random forest algorithm
and this model could predict the self-emulsion status of the formulation under specified conditions.
Figure 1: the statistic of the oli types in the dataset.

Figure 2: the statistic of the surfactant types in the dataset.

Figure 3: the statistic of the cosurfactant types in the dataset.

Table 1: the statistics of the models.
ACC(5-fold CV/Test) | AUC(5-fold CV/Test) | SE(5-fold CV/Test) | SP(5-fold CV/Test) |
---|---|---|---|
0.925/0.926 | 0.977/0.978 | 0.922/0.927 | 0.927/0.926 |

Gao H, Jia H, Dong J, et al. Integrated in silico formulation design of self-emulsifying drug delivery systems[J].
Acta Pharmaceutica Sinica B, 2021, 11(11): 3585-3594 (JCRC Q1, IF: 14.90).
Liposome is a spherical vesicle that consists of a bilayer of lipid and internal
hydrophilic cavity with particle sizes ranging from 30 nm to several micrometers.
As its great biocompatibility, biodegradability and low toxicity, liposome
is regarded as a promising drug delivery system, not only for drug molecule delivery,
but for genes and bio-macromolecules. The composition of lipids, particle size,
surface charge and lamellarity would decide the structure and morphology of liposome
particles. The various preparation methods have been applied in liposome products.
With the development of liposome technology, a number of liposome-based drug formulations
have been available for market and for clinic trails. In current research, size, PDI,
zeta potential and encapsulation of liposome are regarded as key parameters for formulation
evaluation. Herein, we develop a prediction platform for
these key parameters of liposome.
Table 1: the statistics of the models.
Properties | Qcv2/Rt2 | RMSEcv/RMSEt | MAEcv/MAEt |
---|---|---|---|
Zeta: | 0.571/0.644 | 17.528/12.559 | 11.086/8.299 |
PDI: | 0.408/0.604 | 0.081/0.055 | 0.059/0.042 |
Size: | 0.707/0.823 | 145.271/116.002 | 75.263/67.198 |
Encap: | 0.764/0.724 | 16.185/17.926 | 10.853/11.333 |
Drug solubility is a basic property for the formulation design, especially in different
solvents. Here, we collected a current largest dataset containing more than 4000 data points.
Then, we employed lightGBM and RandomForest algorithms to build a robust model. This tool enables users
to predict drug solubility in 27 kinds of solvent.
Table 1: the performance of the model.
Items | Values |
---|---|
Qcv2 | 0.922 |
RMSEcv | 0.380 |
MAEcv | 0.180 |
Rt2 | 0.910 |
RMSEt | 0.395 |
MAEt | 0.177 |
Figure 1: the statistics of the solvents.


Ye Z, Defang Ouyang. Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms[J].
Journal of Cheminformatics, 98(2021).