A Review on the Patient Similarity Analysis Based on Electronic Medical Records
Jia Zheng1, Zong Ruijie2, Duan Huilong1, Li Haomin3,4*
1 College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou 310027, China
2 Qingdao Blood Center, Qingdao 266071, Shandong, China
3 The Children’s Hospital, Zhejiang University School of Medicine, Hangzhou 310052, China
4 The Institute of Translational Medicine, Zhejiang University, Hangzhou 310029, China
Abstract: Predicting a patient's future health status has important social and scientific value. At the same time, the big data accumulating in the health care domain provides a new basis for deriving predictive models or establishing predictive methods through medical big data analysis. Patient similarity analysis offers a general-purpose framework for computer-assisted clinical decision support: it assesses the distance between patients in order to mine predictive knowledge from the large volume of clinical practice data generated by routine patients, and it has thereby paved a way toward personalized medicine. To date, the method has been preliminarily validated in many medical domains, such as cancer, endocrine diseases, and heart diseases, and it has become a very important direction in the clinical translation of artificial intelligence technology. This paper reviews the theoretical basis and research progress of patient similarity analysis by introducing the common structure of the patient similarity calculation framework and the key technologies of its processes, such as data preprocessing, dimension reduction, measuring the distance between different concepts, and generating similarity. The existing problems and challenges faced by patient similarity analysis are also discussed.
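The processes named in the abstract (data preprocessing, dimension reduction, distance measurement, and similarity generation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the reviewed work: the toy feature columns, mean imputation, PCA as the reduction step, Euclidean distance, and the mapping 1 / (1 + d) from distance to similarity are only one of many possible choices.

```python
import numpy as np

def impute_and_standardize(X):
    """Fill missing values (NaN) with column means, then z-score each column."""
    X = np.asarray(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - X.mean(axis=0)) / std

def reduce_dimensions(X, n_components):
    """Project centered data onto its top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def similarity_matrix(Z):
    """Map pairwise Euclidean distances d to similarities 1 / (1 + d)."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return 1.0 / (1.0 + dist)

def most_similar(sim, index, k):
    """Return the indices of the k patients most similar to patient `index`."""
    order = np.argsort(-sim[index])
    return [int(j) for j in order if j != index][:k]

# Toy cohort (hypothetical): rows are patients, columns are age,
# systolic blood pressure, and fasting glucose; NaN marks a missing value.
patients = np.array([
    [54.0, 130.0, 5.6],
    [55.0, 128.0, np.nan],
    [23.0, 110.0, 4.9],
    [70.0, 160.0, 9.1],
])
Z = reduce_dimensions(impute_and_standardize(patients), n_components=2)
S = similarity_matrix(Z)
neighbors = most_similar(S, index=0, k=2)
```

In a prediction setting, the outcomes of the patients in `neighbors` would then be aggregated (for example by majority vote) to estimate the outcome of the index patient; that aggregation step is left out here.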
Jia Zheng, Zong Ruijie, Duan Huilong, Li Haomin. A Review on the Patient Similarity Analysis Based on Electronic Medical Records [J]. Chinese Journal of Biomedical Engineering, 2018, 37(3): 353-366.