Abstract: The COVID-19 pandemic has been going on for two years now. During this time, vaccines and approved drugs for SARS-CoV-2 have been developed, but the design of more selective and effective antiviral drugs remains an urgent problem. One of the most promising targets for anticoronavirus drug design is the main protease (Mpro) [1]. The amount of available structural data allows efficient use of the ensemble docking method, which uses the results of docking into several structures of the target to rank ligands for virtual screening.

The aim of our work was to develop a virtual screening method for SARS-CoV-2 Mpro inhibitors based on ensemble docking and machine learning. The machine learning model is used to improve the ranking of the poses generated during docking.

The dataset was a library of 6897 compounds with experimentally determined percentage of SARS-CoV-2 Mpro inhibition at a concentration of 20 µM. Compounds with an inhibition percentage of more than 50% were labeled active. 70% of the compounds were assigned to the training set, and the remaining 30% were split in half into test and validation sets. The ratio of active to inactive compounds was 0.034 in each of the training, test and validation sets [2].

In January 2022, 415 SARS-CoV-2 Mpro structures were available in the PDB. The ensemble of Mpro (3CLpro) structures was composed of mature, non-oxidized, fully resolved structures with the highest all-atom pairwise RMSD between the conformations of the active site residues. Six structures were selected to establish the final ensemble [3].

The training set consisted of the structures of the best protein-ligand complexes obtained by docking compounds with known activity into the structures of the ensemble using DOCK 6.9. The vector description of each complex was built from interaction fingerprints characterizing the type of contact between the ligand atoms and the nearest protein atoms.
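The selection of the best complex for each compound across the ensemble can be illustrated with a minimal sketch. It is not the authors' pipeline: the compound and structure labels and the score values are hypothetical, and it assumes an energy-like docking score where a lower (more negative) value means a better pose.

```python
from typing import Dict, Tuple


def best_complex_per_compound(
    scores: Dict[str, Dict[str, float]]
) -> Dict[str, Tuple[str, float]]:
    """For each compound, keep the ensemble structure with the best score.

    Assumption: lower (more negative) scores are better, as for
    energy-like docking scores.
    """
    best = {}
    for compound, per_structure in scores.items():
        # Pick the structure whose docked pose has the lowest score.
        structure = min(per_structure, key=per_structure.get)
        best[compound] = (structure, per_structure[structure])
    return best


# Hypothetical best-pose scores of two compounds in three ensemble structures.
example_scores = {
    "cmpd_1": {"struct_A": -41.2, "struct_B": -47.9, "struct_C": -39.0},
    "cmpd_2": {"struct_A": -52.3, "struct_B": -50.1, "struct_C": -55.6},
}

selected = best_complex_per_compound(example_scores)
for compound, (structure, score) in selected.items():
    print(compound, structure, score)
```

In this sketch each compound contributes exactly one complex to the training set, taken from whichever ensemble structure scored it best.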
Fingerprints were calculated using Flare 5.0.0 (Cresset, UK).

Various machine learning models (random forest, gradient boosting, support vector machine, deep learning) were trained on the vector descriptions of ligand-protein complexes and the percent inhibition of Mpro by the ligand to classify molecules as active or inactive. Judged by ROC curves and AUC, the random forest model performed best, with an AUC of 0.79 on the validation set.

This study was supported by the Non-commercial Foundation for the Advancement of Science and Education INTELLECT and State Research Funding No. FNZG-2022-0002.

We thank the Cresset team for the Flare academic license.
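For readers unfamiliar with the AUC metric used to compare the models, the following is a minimal sketch of how it can be computed from activity labels and model scores. The inhibition values and model scores are hypothetical; only the 50% activity threshold comes from the text. The AUC is computed as the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, with ties counted as one half.

```python
from typing import List, Sequence


def label_active(inhibition_percent: Sequence[float], threshold: float = 50.0) -> List[int]:
    """Label a compound active (1) if its inhibition exceeds the threshold."""
    return [1 if value > threshold else 0 for value in inhibition_percent]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of actives and inactives.

    Quadratic in the number of compounds, which is fine for a sketch.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Hypothetical percent inhibition at 20 µM and hypothetical model scores.
inhibition = [85.0, 12.0, 63.0, 50.0, 4.0]
model_scores = [0.9, 0.8, 0.4, 0.3, 0.1]

labels = label_active(inhibition)
auc = roc_auc(labels, model_scores)
print(labels, round(auc, 3))
```

With strongly imbalanced data such as the active-to-inactive ratio of 0.034 reported here, AUC is a convenient summary because it does not depend on a particular classification threshold.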