Research on Heterogeneity and Compatibility of Biomedical Field Metadata Supported by Ontology

Zhang Lulu, Yang Sheng, Shi Furen, Pan Hongjie, Wang Zhigang, Yang Xiaolin*
(Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China)

Abstract Using ontologies to support the representation of data elements is an important means of improving machine understanding of metadata. In this paper, we evaluated the semantic heterogeneity of data elements in caDSR and assessed whether pairs of related data elements can be integrated. First, 60 pairs of common data elements (CDEs) were selected from caDSR, covering demography, lifestyle, medical history, and laboratory measurements. Next, the essential components of each data element were extracted according to the ISO/IEC 11179 standard, and the similarity of these components within every pair of data elements was calculated with the support of NCIT. Finally, the compatibility between related data elements was predicted using a support vector machine (SVM) based on the semantic similarity between corresponding CDE components; the overall accuracy was above 80%. The results showed considerable heterogeneity in the definition of metadata in the caDSR database, especially in the conceptual domain of data elements. Nevertheless, with the help of machine learning, our method could still automatically judge data compatibility from the definitions of existing data elements. The method established in this study has some value for optimizing the data element construction process and enriching data standardization tools.

Received: 14 March 2019
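
The similarity step outlined in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure was used, so the Wu-Palmer measure below is a swapped-in assumption, not the authors' method. The hierarchy is a hypothetical toy tree rather than real NCIT content, and the three component names (`object_class`, `property`, `value_domain`) are illustrative labels loosely following ISO/IEC 11179. The sketch only shows how a pair of CDEs might be turned into a component-wise similarity vector.

```python
from typing import Dict, List, Optional

# Hypothetical toy is-a hierarchy (child -> parent). NOT real NCIT content.
# Assumption: a single-parent tree; real ontologies may allow several parents.
PARENT: Dict[str, Optional[str]] = {
    "Thing": None,
    "Person": "Thing",
    "Patient": "Person",
    "Participant": "Person",
    "Property": "Thing",
    "Age": "Property",
    "Weight": "Property",
    "ValueDomain": "Thing",
    "Years": "ValueDomain",
    "Months": "ValueDomain",
}

# Illustrative component names (assumption, loosely after ISO/IEC 11179).
COMPONENTS = ("object_class", "property", "value_domain")


def path_to_root(concept: str) -> List[str]:
    """Return the chain [concept, parent, ..., root]."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(concept: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first node on a's path that also subsumes b is the
    # lowest common subsumer (LCS) in a tree.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def similarity_vector(cde_a: Dict[str, str], cde_b: Dict[str, str]) -> List[float]:
    """One similarity score per component, in COMPONENTS order."""
    return [wu_palmer(cde_a[c], cde_b[c]) for c in COMPONENTS]


# Two hypothetical data elements annotated with toy concepts.
cde_1 = {"object_class": "Patient", "property": "Age", "value_domain": "Years"}
cde_2 = {"object_class": "Participant", "property": "Age", "value_domain": "Months"}

print(similarity_vector(cde_1, cde_2))
```

A vector of this kind, one per CDE pair, is the sort of input that could then be passed to a classifier such as an SVM (for example `sklearn.svm.SVC`) together with compatible/incompatible labels. That training step is left out here because the abstract gives no details on features, kernel, or parameters.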