Monday 2018-06-18 14.00 – 15.00
Seminar room T5, T-building
Machine Learning for Enzyme Promiscuity
The discovery of a growing number of catalytically promiscuous enzymes, which can catalyze multiple reactions, has called into question the traditional view of enzymes as highly specific proteins. Promiscuity has significant implications for the theory of enzyme evolution, and it suggests that this inherent feature can serve as a seed for engineering new functions in biotechnology and synthetic biology as well as in drug design. Understanding protein promiscuity is therefore becoming increasingly important, as it provides new insights into the evolutionary process that has led to such vast functional diversity. Although numerous efforts have been devoted to identifying the determinants of promiscuity, the question of what distinguishes specialized enzymes from promiscuous ones remains unanswered to date.
In this thesis, we take an in silico approach and seek a predictive model that can accurately classify unseen proteins as catalytically promiscuous or non-promiscuous. To this end, we exploit different representations and properties of proteins and adopt computational approaches suited to each. The role of protein sequences as indicators of promiscuity is investigated by means of the BLAST algorithm as well as string kernels. Additionally, to examine the interplay between proteins’ three-dimensional structures and their promiscuous behavior, we employ a novel method that models the topological details of proteins as graphs. Graph kernel functions are then applied to measure the similarity between the 3D structures of proteins. Classification is performed with a support vector machine (SVM), a kernel-based method. The results indicate that protein sequences have limited bearing on promiscuity. In contrast, proteins’ 3D structures can reliably predict whether a protein has promiscuous activities, with an accuracy of 96%. Our best results are achieved using the Weisfeiler-Lehman subtree graph kernel together with the secondary structure information of proteins.
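To give a feel for how a Weisfeiler-Lehman subtree kernel compares labelled graphs, here is a minimal sketch in plain Python. It is not the implementation used in the thesis. The function names (`wl_features`, `wl_kernel`, `gram_matrix`), the toy graphs, and the node labels `H`, `E`, and `C` (standing in for helix, strand, and coil) are all assumptions made for illustration, and the thesis may construct its protein graphs quite differently.

```python
from collections import Counter
from math import sqrt


def wl_features(adjacency, labels, iterations=2):
    """Count Weisfeiler-Lehman subtree labels for one labelled graph.

    adjacency: dict mapping every node to a list of its neighbours.
    labels:    dict mapping every node to its initial string label.
    """
    current = dict(labels)
    features = Counter(current.values())  # iteration 0: the original labels
    for _ in range(iterations):
        refined = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            neighbour_labels = tuple(sorted(current[n] for n in neighbours))
            refined[node] = (current[node], neighbour_labels)
        current = refined
        features.update(current.values())
    return features


def wl_kernel(features_a, features_b):
    """Dot product of two label-count vectors (unnormalised kernel value)."""
    return sum(count * features_b[label] for label, count in features_a.items())


def normalised_wl_kernel(features_a, features_b):
    """Cosine-normalised kernel value, so that k(G, G) == 1."""
    denom = sqrt(wl_kernel(features_a, features_a) * wl_kernel(features_b, features_b))
    return wl_kernel(features_a, features_b) / denom


def gram_matrix(graphs, iterations=2):
    """Pairwise kernel matrix for a list of (adjacency, labels) pairs."""
    feats = [wl_features(adj, lab, iterations) for adj, lab in graphs]
    return [[wl_kernel(a, b) for b in feats] for a in feats]


# Hypothetical toy graphs: three secondary-structure elements in a chain.
chain = {0: [1], 1: [0, 2], 2: [1]}
graph_a = (chain, {0: "H", 1: "E", 2: "H"})
graph_b = (chain, {0: "H", 1: "E", 2: "C"})

for row in gram_matrix([graph_a, graph_b], iterations=1):
    print(row)
```

The sketch keeps refined labels as nested tuples instead of compressing them to integers, which avoids sharing a label dictionary across graphs but grows quickly with the number of iterations. A Gram matrix of this kind is the sort of input accepted by SVM implementations that support precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.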