GenSMILES: An enhanced validity conscious representation for inverse design of molecules

Arun Singh Bhadwal, Kamal Kumar, Neeraj Kumar
Knowledge-Based Systems, Vol. 268 (2023), 110429, ISSN 0950-7051
Published: March 2023

Deep neural networks have become increasingly important in recent years for creating molecules with desirable properties. In general, SMILES strings are used to train deep neural network based models, and the trained model is then used to generate the desired molecules. Unfortunately, because of syntactic and semantic flaws in the representation, models trained on SMILES strings generate a substantial number of invalid molecules. The SMILES representation does not handle rings, branches, and bonds between atoms efficiently, and this lack of robustness results in an abundance of invalid strings. To overcome this limitation, this paper proposes a SMILES-like representation called GenSMILES. GenSMILES tackles syntactic and semantic issues by relying on derivation rules to apply constraints, which leads a generative model to produce more valid SMILES strings. GenSMILES corrects the syntactic issues by replacing the paired notation that SMILES uses for branches and rings with a single symbol each, ) and ^ respectively. Mismatched bonds on atoms are the main cause of semantic errors. GenSMILES addresses such issues by employing derivation rules during string conversion from GenSMILES to SMILES. Every SMILES string can be represented by an equivalent GenSMILES string. When used for designing drug molecules, GenSMILES increases the validity of generated molecules compared with SMILES and DeepSMILES on two popular architectures, namely the Recurrent Neural Network and the Variational Autoencoder. The main benefit of GenSMILES is that it can be applied directly to generative algorithms without adapting the model environment. GenSMILES is beneficial not only to deep learning based generative approaches but also to other approaches that use a SMILES-like string representation. On most of the datasets, GenSMILES is effective in improving validity (above 90%) and diversity score (15).
GenSMILES yields more diversity in the properties of generated molecules and allows exploration of a larger portion of undiscovered chemical space than SMILES and DeepSMILES. GenSMILES guarantees that the generative model does not need to remember any long-range dependencies, which is the principal contribution of this work.
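
The syntactic problem the abstract describes comes from the paired notation in SMILES: every "(" needs a later ")" and every ring-closure label must appear twice, so a generator has to remember open pairs across many tokens. The following Python sketch illustrates that dependency on plain SMILES strings only. It is not the GenSMILES conversion and does not implement the paper's derivation rules; the function name paired_syntax_errors is hypothetical, and the check is deliberately partial (it ignores valence, aromaticity, and bond symbols).

```python
def paired_syntax_errors(smiles: str) -> list[str]:
    """Report unmatched branch parentheses and ring-closure labels.

    Illustrative assumption: only the paired constructs of SMILES are
    checked. Bracket atoms such as [13CH4] are skipped so that digits
    inside them are not mistaken for ring-closure labels.
    """
    errors = []
    depth = 0            # number of currently open branches
    open_rings = set()   # ring labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            close = smiles.find("]", i)
            if close == -1:
                errors.append("unclosed bracket atom")
                break
            i = close + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12
            open_rings ^= {smiles[i + 1:i + 3]}
            i += 2
        elif ch in "0123456789":
            # First occurrence opens the ring, second occurrence closes it
            open_rings ^= {ch}
        i += 1
    if depth > 0:
        errors.append(f"{depth} unclosed '('")
    for label in sorted(open_rings):
        errors.append(f"ring label {label} never closed")
    return errors


if __name__ == "__main__":
    for s in ["c1ccccc1", "CC(C)C", "CC(C", "c1ccccc"]:
        print(s, paired_syntax_errors(s))
```

The last two strings in the usage loop differ from valid SMILES by a single missing closing token, which is the kind of error a single-symbol notation for branches and rings is meant to rule out.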