The success of Convolutional Neural Networks (CNNs) in general image classification and computer vision tasks is now undeniable. Furthermore, over the past decade we have witnessed a steady increase in the number of publications on new ways of using CNNs to classify objects in the natural sciences and engineering (structural biology, drug design, chemistry, etc.). Considering the promising results of several related articles (“Chemception” by Garrett B. Goh et al. and “Molecular graph convolutions: moving beyond fingerprints” by Steven Kearnes et al.), we decided to study the use of connectivity-based representations of molecules, with no explicitly defined features, as input for a CNN. We evaluated this novel approach for quantitative structure-property relationship (QSPR) and quantitative structure-activity relationship (QSAR) predictions. A number of “classical” representations and their combinations (adjacency matrices, connectivity tables, etc.) were tested, as well as our own specifically designed complex representations. The open-source Python machine learning library Scikit-learn (http://scikit-learn.org/stable/) was used for data preprocessing. Several models with convolutional layers connected to a feed-forward neural network (FFNN) were built and tuned (varying the number of hidden units per layer, activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/), a deep learning library, with TensorFlow (www.tensorflow.org) as the backend. Different train/test/validation splitting approaches were tested, including our own take on k-fold cross-validation probability averaging (also known as a k-model ensemble), which produced high-quality end-point predictions and outperformed classic split methods.
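As an illustration of what a connectivity-based input with no explicit features might look like, the following minimal sketch builds a zero-padded adjacency matrix from a bond list. The abstract does not specify the encoding, so the binary entries, the heavy-atom-only numbering, the padded size, and the function name `adjacency_matrix` are all assumptions made for this example.

```python
def adjacency_matrix(n_atoms, bonds, size):
    """Return a size x size zero-padded binary adjacency matrix (nested lists).

    n_atoms: number of atoms in the molecule (assumed: heavy atoms only)
    bonds:   iterable of (i, j) atom-index pairs, 0-based
    size:    fixed side length so that all molecules share one input shape
    """
    if n_atoms > size:
        raise ValueError("molecule has more atoms than the padded size")
    matrix = [[0] * size for _ in range(size)]
    for i, j in bonds:
        if not (0 <= i < n_atoms and 0 <= j < n_atoms):
            raise ValueError("bond refers to an atom index outside the molecule")
        # Bonds are undirected, so the matrix is symmetric.
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Ethanol heavy atoms, numbered C0-C1-O2, padded to a 4 x 4 "image".
ethanol = adjacency_matrix(3, [(0, 1), (1, 2)], size=4)
for row in ethanol:
    print(row)
```

Padding every molecule to the same side length gives the fixed input shape that a convolutional layer expects; the padded rows and columns stay zero.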
Finally, we showed that CNN deep learning models trained on connectivity-based representations can be used for QSPR and QSAR predictions with good predictive performance.
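The k-fold probability averaging (k-model ensemble) idea can be sketched roughly as follows: train one model per fold on the remaining k-1 folds, then average the k models' predicted probabilities on external data. This is only a sketch under stated assumptions, not the authors' implementation: `fit_threshold_model` is a hypothetical toy stand-in for a trained CNN, the folds are contiguous and unshuffled, and the held-out fold is simply left unused here (in practice it would presumably serve for validation).

```python
from statistics import mean


def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_threshold_model(xs, ys):
    """Hypothetical toy stand-in for a CNN: 1-D classifier returning P(class 1)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    midpoint = (mean(pos) + mean(neg)) / 2.0

    def predict_proba(x):
        # Linear ramp around the class midpoint, clipped to [0, 1].
        return min(1.0, max(0.0, 0.5 + (x - midpoint)))

    return predict_proba


def kfold_ensemble_predict(xs, ys, x_test, k, fit):
    """Train k models (each omitting one fold) and average their probabilities."""
    per_model = []
    for held_out in kfold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        per_model.append([model(x) for x in x_test])
    # Average column-wise: one mean probability per test sample.
    return [mean(column) for column in zip(*per_model)]


xs = [0.1, 0.9, 0.2, 0.8, 0.0, 1.0]
ys = [0, 1, 0, 1, 0, 1]
probs = kfold_ensemble_predict(xs, ys, [0.7, 0.2], k=3, fit=fit_threshold_model)
print(probs)
```

Averaging probabilities rather than hard class labels lets every one of the k models contribute its confidence to the final prediction.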