The dynamics of RNA molecules are closely related to their functions. Flexibility, one of the most fundamental dynamic properties of RNA, has been widely used to study folding, structural stability, and ligand binding. Experimental methods for measuring RNA flexibility are often time-consuming and labor-intensive, so a fast and accurate theoretical method for predicting it is urgently needed. To this end, we propose RNAfwe, a machine learning method that predicts RNA flexibility by using word embedding to extract features from RNA sequences. We compare RNAfwe with RNAflex, a comparable sequence-based method. Compared with RNAflex (One-Hot), which uses one-hot encoding, RNAfwe achieves higher Pearson correlation coefficients (PCC) on both the training and test sets, 0.5017 and 0.4704 respectively, indicating that word embedding extracts features more relevant to flexibility from RNA sequences than one-hot encoding does. Compared with RNAflex (PSSM), which uses evolutionary information, RNAfwe performs slightly worse, but RNAflex (PSSM) requires a sufficient number of known homologous sequences. This work contributes to the study of RNA dynamic properties and supports the wider use of word embedding in bioinformatics research.
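The contrast between one-hot encoding and word-style tokenization, and the PCC used for evaluation, can be illustrated with a minimal sketch. The abstract does not state how RNAfwe builds its "words" or which embedding model it trains, so the overlapping k-mer tokenizer, the value `k=3`, and the function names `kmer_tokens`, `one_hot`, and `pearson` are all assumptions made for illustration only.

```python
import math


def kmer_tokens(seq, k=3):
    """Split an RNA sequence into overlapping k-mers ("words").

    Assumption: k-mer tokenization with k=3 is a common choice for
    sequence word embedding; the source does not specify RNAfwe's scheme.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(seq, alphabet="ACGU"):
    """One-hot encode each nucleotide; unknown symbols become all zeros."""
    return [[1 if base == a else 0 for a in alphabet] for base in seq.upper()]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    print(kmer_tokens("GGCAU"))
    print(one_hot("AU"))
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In a word-embedding pipeline, tokens such as these would be mapped to dense vectors learned from a sequence corpus, whereas one-hot encoding assigns each nucleotide a fixed sparse vector that carries no context. The PCC would then be computed between predicted and reference per-nucleotide flexibility values.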
Allostery is an important mechanism for regulating protein function and is essential to many biological processes. Allosteric regulators have higher specificity and fewer toxic side effects than orthosteric ones, which gives allosteric drug design advantages over orthosteric drug design. Discovering allosteric sites is a prerequisite for allosteric drug design, yet most experimentally known allosteric sites were found by chance, so effective theoretical methods for predicting protein allosteric sites are urgently needed. Here we present AllosEC, an ensemble machine learning method for predicting protein allosteric pockets. In addition to the physicochemical properties of a pocket, it considers the pocket's secondary structure information, depth index (DPX), and protrusion index (CX). To overcome the extreme imbalance between positive and negative samples, we use undersampling to balance the training dataset. On an independent test set, AllosEC outperforms other existing methods on multiple evaluation metrics, with SEN, SPE, PRE, and MCC of 0.708, 0.915, 0.405, and 0.486, respectively. This work thus provides AllosEC, a well-performing method for predicting protein allosteric sites.
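The undersampling step and the four reported metrics can be sketched as follows. This is a minimal illustration, not the AllosEC implementation: the abstract does not say which undersampling variant is used, so random undersampling of the majority class is assumed here, and the function names `undersample` and `binary_metrics` are hypothetical. SEN, SPE, PRE, and MCC are taken to be the standard sensitivity, specificity, precision, and Matthews correlation coefficient.

```python
import math
import random


def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal.

    Assumption: plain random undersampling; the source only says that
    "an undersampling method" balances the training set.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def binary_metrics(y_true, y_pred):
    """Compute SEN, SPE, PRE, and MCC from binary labels (1 = allosteric)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "PRE": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    pockets = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
    labels = [1, 0, 0, 0, 0, 1, 0, 0]
    print(undersample(pockets, labels))
    print(binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]))
```

Because negative pockets vastly outnumber allosteric ones, precision can stay modest even when sensitivity and specificity are high, which is consistent with the reported PRE of 0.405 alongside SPE of 0.915. MCC summarizes all four confusion-matrix cells and is therefore less misleading on imbalanced test sets.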