Influence of Protein Databases in Proteomic Identification
Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
Abstract: Database searching is a common strategy for identifying proteins in current proteomic studies. With this strategy, searching against a highly comprehensive database may yield more protein identifications, but carries the risk of relying on incorrect database annotations. In contrast, a more accurate database may lose some correct protein identifications because the corresponding sequences are missing from the less complete database. Achieving both completeness and accuracy in protein identification is therefore an important problem. Taking human proteomics as an example, this study compared the database searching results of three commonly used protein databases (the IPI, UniProt and Swiss-Prot databases) on three proteomic datasets obtained from different biological samples and mass spectrometers. In general, although the databases performed differently on the various proteomic datasets, the differences among them were not significant. For each database, no more than 5% of its total peptide identifications were missed by the other two databases, while the differences in protein identifications ranged from 1% to 5%. This result indicates that all three databases are highly complete and cover most of the commonly identified proteins in human samples. Therefore, we recommend the Swiss-Prot database, which is manually curated and continuously updated, for routine human proteomic analysis. In addition, if the aim of a study is to identify or quantify special sequences that are not included in the Swiss-Prot database, such as protein isoforms or mutations, researchers can add the target protein sequences to the Swiss-Prot database, or use a more complete database instead.
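The per-database comparison described above (the share of one database's identifications that the other two databases do not report) can be illustrated with a minimal Python sketch. The function name `unique_fraction` and the peptide strings are hypothetical and chosen only for illustration; they are not data or code from this study, and the sketch assumes identifications are compared as exact sequence matches.

```python
def unique_fraction(identifications):
    """For each database, return the fraction of its identifications
    that none of the other databases reported."""
    result = {}
    for name, ids in identifications.items():
        # Pool everything identified with the other databases.
        others = set().union(
            *(v for k, v in identifications.items() if k != name)
        )
        unique = ids - others
        result[name] = len(unique) / len(ids) if ids else 0.0
    return result


# Made-up peptide identifications, for illustration only.
example = {
    "IPI": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"},
    "UniProt": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDEE"},
    "Swiss-Prot": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
}

fractions = unique_fraction(example)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%} of identifications not found by the others")
```

The same function could be applied to sets of protein accessions instead of peptide sequences, provided the accessions from the different databases are first mapped to a common identifier.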