Extraction of Entity Interactions Based on Multiple Feature Fusion Linear Kernel SVM Approach
Wei Xing1,2, Hu Dehua2*, Yi Minhan2, Chang Xuelian3, Yang Xiaodi3, Zhu Wenjie1
1 (School of Basic Courses, Bengbu Medical College, Bengbu 233003, Anhui, China) 2 (Institute of Information Security and Big Data, Central South University, Changsha 410083, China) 3 (School of Basic Medicine, Bengbu Medical College, Bengbu 233003, Anhui, China)
Abstract: Improving the performance of interaction mining algorithms can help uncover innovative ideas in the biomedical literature. We proposed a novel feature-based linear kernel support vector machine (SVM) approach to extract and investigate the interactions among diabetes mellitus, genes, and drugs. We described the five types of features used (entity, entity pair, dependency graph, parse tree, and noun phrase-constrained coordination), including two novel features: the word pair feature and the noun phrase-constrained coordination feature. We obtained 173 interactions between 13 kinds of diabetes mellitus and 23 genes, 79 interactions between 13 kinds of diabetes mellitus and 26 drugs, 159 interactions between 18 genes and 17 genes, and 619 interactions among 8 kinds of diabetes mellitus, 23 genes, and 26 drugs. In addition, 27 new entity interactions were predicted. We then constructed the disease-gene, gene-drug, and disease-gene-drug interaction networks. The experimental results showed that the proposed method was comparable with the algorithms used in CoPub (0.710), PubGene (0.609), FBK-irst (0.547, 0.800), and WBI (0.510, 0.759): the highest accuracy increased by about 5% (0.847 vs 0.800) and the lowest by over 20% (0.742 vs 0.510), which offers perspectives for applications of biomedical big data.
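The reported gains can be checked with a short calculation. The sketch below is illustrative only: it takes the four scores quoted in the abstract and computes each gain both as an absolute difference in percentage points and as a relative increase, since the abstract does not say which reading is intended. The names `pairs`, `gains`, and `results` are assumptions introduced for this example.

```python
# Scores quoted in the abstract: (proposed method, baseline it is compared with).
pairs = {
    "highest": (0.847, 0.800),
    "lowest": (0.742, 0.510),
}


def gains(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (ours - baseline) * 100
    relative = (ours - baseline) / baseline * 100
    return round(absolute, 1), round(relative, 1)


# Compute both readings of "increased by" for each comparison.
results = {name: gains(*scores) for name, scores in pairs.items()}

for name, (abs_points, rel_percent) in results.items():
    print(f"{name}: +{abs_points} points absolute, +{rel_percent}% relative")
```

Read as absolute differences, the gains are roughly 5 and 23 percentage points, which matches the wording "about 5%" and "over 20%"; read as relative increases, they are roughly 6% and 45%.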
Wei Xing, Hu Dehua, Yi Minhan, Chang Xuelian, Yang Xiaodi, Zhu Wenjie. Extraction of Entity Interactions Based on Multiple Feature Fusion Linear Kernel SVM Approach[J]. Chinese Journal of Biomedical Engineering, 2018, 37(4): 451-460.