Title: Developing a bioinformatics framework for proteogenomics
Abstract:In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of g...In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics a new field has emerged to help improve genome annotation, termed proteogenomics, which uses peptide mass spectrometry data, enabling the discovery of novel protein coding genes, as well as the refinement and validation of known and putative protein-coding genes.
The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence.
This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries.
This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied within four case studies: a prokaryote bacterium; a monocotyledonous wheat plant; a dicotyledonous grape plant; and human. The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (nitrogen-fixing bacteria); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements for the studies conducted in this thesis and proteogenomics as a whole have been identified and discussed.Read More
Publication Year: 2015
Publication Date: 2015-01-01
Language: en
Type: dissertation
Access and Citation
AI Researcher Chatbot
Get quick answers to your questions about the article from our AI researcher chatbot