Title: The GreenCut2 Resource, a Phylogenomically Derived Inventory of Proteins Specific to the Plant Lineage
Abstract: The plastid is a defining structure of photosynthetic eukaryotes and houses many plant-specific processes, including the light reactions, carbon fixation, pigment synthesis, and other primary metabolic processes. Identifying proteins associated with catalytic, structural, and regulatory functions that are unique to plastid-containing organisms is necessary to fully define the scope of plant biochemistry. Here, we performed phylogenomics on 20 genomes to compile a new inventory of 597 nucleus-encoded proteins conserved in plants and green algae but not in non-photosynthetic organisms. 286 of these proteins are of known function, whereas 311 are not characterized. This inventory was validated as applicable and relevant to diverse photosynthetic eukaryotes using an additional eight genomes from distantly related plants (including Micromonas, Selaginella, and soybean). Manual curation of the known proteins in the inventory established its importance to plastid biochemistry. To predict functions for the 52% of proteins of unknown function, we used sequence motifs, subcellular localization, co-expression analysis, and RNA abundance data. We demonstrate that 18% of the proteins in the inventory have functions outside the plastid and/or beyond green tissues. Although 32% of proteins in the inventory have homologs in all cyanobacteria, unexpectedly, 30% are eukaryote-specific. Finally, 8% of the proteins of unknown function share no similarity to any characterized protein and are plant lineage-specific. We present this annotated inventory of 597 proteins as a resource for functional analyses of plant-specific biochemistry.

Introduction

The plastid is an organelle in plants and algae that evolved from a photosynthetic cyanobacterium after it was engulfed by an ancestral eukaryotic cell over 1.5 billion years ago (1. Knoll A.H. Science. 1992; 256: 622-627, 2. Yoon H.S. Hackett J.D. Ciniglia C. Pinto G. Bhattacharya D. Mol. Biol. Evol. 2004; 21: 809-818). How the endosymbiont became integral to host cell functions and evolved into a plastid is still under debate (3. Gross J. Bhattacharya D. Nat. Rev. Genet.
2009; 10: 495-505), but functions localized to the present day plastid depend on both plastid- and nucleus-encoded proteins. The latter are synthesized in the cytoplasm and imported into the organelle by a specific multiprotein complex composed of the translocon of the outer and inner chloroplast envelope membrane (TOC and TIC, respectively) proteins (4. Jarvis P. New Phytol. 2008; 179: 257-285, 5. Li H.M. Chiu C.C. Annu. Rev. Plant Biol. 2010; 61: 157-180). [The abbreviations used are: TOC, translocon of the outer chloroplast envelope membrane; TIC, translocon of the inner chloroplast envelope membrane; str., strain; K, known; KI, known with inferred function; U, unknown; UP, unknown with predicted function; Rubisco, ribulose-bisphosphate carboxylase/oxygenase; RNA-seq, RNA sequencing; RPKM, reads per kilobase of mappable sequence per million reads; LHC, light-harvesting chlorophyll-binding protein.] Over 2000 proteins are estimated to be located in the plastid with the vast majority (>90%) encoded by genes in the nucleus (6. Abdallah F. Salamini F. Leister D. Trends Plant Sci. 2000; 5: 141-142, 7. Emanuelsson O. Nielsen H. Brunak S. von Heijne G. J. Mol. Biol. 2000; 300: 1005-1016, 8. Small I. Peeters N. Legeai F. Lurin C. Proteomics. 2004; 4: 1581-1590, 9. van Wijk K.J. Plant Physiol. Biochem. 2004; 42: 963-977). Many of the nucleus-encoded proteins that function within plastids are conserved among photosynthetic organisms. These conserved proteins function in processes such as the capture and utilization of excitation energy, carbohydrate metabolism, and the synthesis of key cellular metabolites (such as lipids, isoprenoids, pigments, and amino acids).
Interestingly, however, many plastid-localized proteins have not yet been assigned a specific biochemical function.

The increasing availability of sequence information from diverse organisms has allowed the application of comparative genomics, or phylogenomics, to discover proteins specific to bacteria (10. Tatusov R.L. Koonin E.V. Lipman D.J. Science. 1997; 278: 631-637, 11. Raymond J. Zhaxybayeva O. Gogarten J.P. Gerdes S.Y. Blankenship R.E. Science. 2002; 298: 1616-1620, 12. Comas I. Moya A. González-Candelas F. BMC Evol. Biol. 2007; 7: S7), cyanobacteria (13. Mulkidjanian A.Y. Koonin E.V. Makarova K.S. Mekhedov S.L. Sorokin A. Wolf Y.I. Dufresne A. Partensky F. Burd H. Kaznadzey D. Haselkorn R. Galperin M.Y. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 13126-13131, 14. Kettler G.C. Martiny A.C. Huang K. Zucker J. Coleman M.L. Rodrigue S. Chen F. Lapidus A. Ferriera S. Johnson J. Steglich C. Church G.M. Richardson P. Chisholm S.W. PLoS Genet. 2007; 3: e231, 15. Gupta R.S. Mathews D.W. BMC Evol. Biol. 2010; 10: 24), fungi (16. Dujon B. Sherman D. Fischer G. Durrens P. Casaregola S. Lafontaine I. De Montigny J. Marck C. Neuvéglise C. Talla E. Goffard N. Frangeul L. Aigle M. Anthouard V. Babour A. Barbe V. Barnay S. Blanchin S. Beckerich J.M. Beyne E. Bleykasten C. Boisramé A. Boyer J. Cattolico L. Confanioleri F. De Daruvar A. Despons L. Fabre E. Fairhead C. Ferry-Dumazet H. Groppi A. Hantraye F. Hennequin C. Jauniaux N. Joyet P. Kachouri R. Kerrest A. Koszul R. Lemaire M. Lesur I. Ma L. Muller H. Nicaud J.M. Nikolski M. Oztas S. Ozier-Kalogeropoulos O. Pellenz S. Potier S. Richard G.F. Straub M.L. Suleau A. Swennen D. Tekaia F. Wésolowski-Louvel M. Westhof E. Wirth B. Zeniou-Meyer M. Zivanovic I. Bolotin-Fukuhara M. Thierry A.
Bouchier C. Caudron B. Scarpelli C. Gaillardin C. Weissenbach J. Wincker P. Souciet J.L. Nature. 2004; 430: 35-44, 17. Souciet J.L. Dujon B. Gaillardin C. Johnston M. Baret P.V. Cliften P. Sherman D.J. Weissenbach J. Westhof E. Wincker P. Jubin C. Poulain J. Barbe V. Ségurens B. Artiguenave F. Anthouard V. Vacherie B. Val M.E. Fulton R.S. Minx P. Wilson R. Durrens P. Jean G. Marck C. Martin T. Nikolski M. Rolland T. Seret M.L. Casarégola S. Despons L. Fairhead C. Fischer G. Lafontaine I. Leh V. Lemaire M. de Montigny J. Neuvéglise C. Thierry A. Blanc-Lenfle I. Bleykasten C. Diffels J. Fritsch E. Frangeul L. Goëffon A. Jauniaux N. Kachouri-Lafond R. Payen C. Potier S. Pribylova L. Ozanne C. Richard G.F. Sacerdot C. Straub M.L. Talla E. Genome Res. 2009; 19: 1696-1709), metazoa (18. Babenko V.N. Krylov D.M. Nucleic Acids Res. 2004; 32: 5029-5035), archaea (19. Makarova K.S. Sorokin A.V. Novichkov P.S. Wolf Y.I. Koonin E.V. Biol. Direct. 2007; 2: 33), and plastids (20. Martin W. Rujan T. Richly E. Hansen A. Cornelsen S. Lins T. Leister D. Stoebe B. Hasegawa M. Penny D. Proc. Natl. Acad. Sci. U.S.A. 2002; 99: 12246-12251). Additionally, computational attempts have been made to recognize protein families that are conserved in select plant genomes (21. Conte M.G. Gaillard S. Lanau N. Rouard M. Périn C. Nucleic Acids Res. 2008; 36: D991-D998). However, the inventory of proteins exclusive to plants was only first explored in 2007 because the number of plant genomes available before then had been limited.

A previous phylogenomics analysis of green plants attempted to identify plant proteins associated with the plastid (22. Merchant S.S. Prochnik S.E. Vallon O. Harris E.H. Karpowicz S.J. Witman G.B. Terry A. Salamov A.
Fritz-Laylin L.K. Maréchal-Drouard L. Marshall W.F. Qu L.H. Nelson D.R. Sanderfoot A.A. Spalding M.H. Kapitonov V.V. Ren Q. Ferris P. Lindquist E. Shapiro H. Lucas S.M. Grimwood J. Schmutz J. Cardol P. Cerutti H. Chanfreau G. Chen C.L. Cognat V. Croft M.T. Dent R. Dutcher S. Fernández E. Fukuzawa H. González-Ballester D. González-Halphen D. Hallmann A. Hanikenne M. Hippler M. Inwood W. Jabbari K. Kalanon M. Kuras R. Lefebvre P.A. Lemaire S.D. Lobanov A.V. Lohr M. Manuell A. Meier I. Mets L. Mittag M. Mittelmeier T. Moroney J.V. Moseley J. Napoli C. Nedelcu A.M. Niyogi K. Novoselov S.V. Paulsen I.T. Pazour G. Purton S. Ral J.P. Riaño-Pachón D.M. Riekhof W. Rymarquis L. Schroda M. Stern D. Umen J. Willows R. Wilson N. Zimmer S.L. Allmer J. Balk J. Bisova K. Chen C.J. Elias M. Gendler K. Hauser C. Lamb M.R. Ledford H. Long J.C. Minagawa J. Page M.D. Pan J. Pootakham W. Roje S. Rose A. Stahlberg E. Terauchi A.M. Yang P. Ball S. Bowler C. Dieckmann C.L. Gladyshev V.N. Green P. Jorgensen R. Mayfield S. Mueller-Roeber B. Rajamani S. Sayre R.T. Brokstein P. Dubchak I. Goodstein D. Hornick L. Huang Y.W. Jhaveri J. Luo Y. Martínez D. Ngau W.C. Otillar B. Poliakov A. Porter A. Szajkowski L. Werner G. Zhou K. Grigoriev I.V. Rokhsar D.S. Grossman A.R. Science. 2007; 318: 245-250). In that study, orthologs (and recent paralogs) of proteins encoded by the Chlamydomonas reinhardtii genome were identified in the predicted proteomes of the angiosperm Arabidopsis thaliana (23. Swarbreck D. Wilks C. Lamesch P. Berardini T.Z. Garcia-Hernandez M. Foerster H. Li D. Meyer T. Muller R. Ploetz L. Radenbaugh A. Singh S. Swing V. Tissier C. Zhang P. Huala E. Nucleic Acids Res. 2008; 36: D1009-D1014), the moss Physcomitrella patens (24. Rensing S.A. Lang D. Zimmer A.D. Terry A. Salamov A. Shapiro H. Nishiyama T. Perroud P.F. Lindquist E.A. Kamisugi Y. Tanahashi T. Sakakibara K. Fujita T. Oishi K. Shin-I T.
Kuroki Y. Toyoda A. Suzuki Y. Hashimoto S. Yamaguchi K. Sugano S. Kohara Y. Fujiyama A. Anterola A. Aoki S. Ashton N. Barbazuk W.B. Barker E. Bennetzen J.L. Blankenship R. Cho S.H. Dutcher S.K. Estelle M. Fawcett J.A. Gundlach H. Hanada K. Heyl A. Hicks K.A. Hughes J. Lohr M. Mayer K. Melkozernov A. Murata T. Nelson D.R. Pils B. Prigge M. Reiss B. Renner T. Rombauts S. Rushton P.J. Sanderfoot A. Schween G. Shiu S.H. Stueber K. Theodoulou F.L. Tu H. Van de Peer Y. Verrier P.J. Waters E. Wood A. Yang L. Cove D. Cuming A.C. Hasebe M. Lucas S. Mishler B.D. Reski R. Grigoriev I.V. Quatrano R.S. Boore J.L. Science. 2008; 319: 64-69), and the marine, picoplanktonic algae Ostreococcus tauri (25. Derelle E. Ferraz C. Rombauts S. Rouzé P. Worden A.Z. Robbens S. Partensky F. Degroeve S. Echeynié S. Cooke R. Saeys Y. Wuyts J. Jabbari K. Bowler C. Panaud O. Piégu B. Ball S.G. Ral J.P. Bouget F.Y. Piganeau G. De Baets B. Picard A. Delseny M. Demaille J. Van de Peer Y. Moreau H. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 11647-11652) and Ostreococcus lucimarinus (26. Palenik B. Grimwood J. Aerts A. Rouzé P. Salamov A. Putnam N. Dupont C. Jorgensen R. Derelle E. Rombauts S. Zhou K. Otillar R. Merchant S.S. Podell S. Gaasterland T. Napoli C. Gendler K. Manuell A. Tai V. Vallon O. Piganeau G. Jancek S. Heijde M. Jabbari K. Bowler C. Lohr M. Robbens S. Werner G. Dubchak I. Pazour G.J. Ren Q. Paulsen I. Delwiche C. Schmutz J. Rokhsar D. Van de Peer Y. Moreau H. Grigoriev I.V. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 7705-7710) but not in non-photosynthetic organisms. An inventory of 349 conserved proteins was generated and designated the “GreenCut” because it represented all of the protein families contained in a slice through the green lineage of the phylogenetic tree.
However, the GreenCut was restricted in scope because of the limited number of genomes queried. In addition, the inclusion of two Ostreococcus species, which have reduced/specialized genomes and proteomes, constrained the output from the analysis.

Several additional plant genomes have been sequenced in the last 3 years, including those of the poplar tree Populus trichocarpa (27. Tuskan G.A. Difazio S. Jansson S. Bohlmann J. Grigoriev I. Hellsten U. Putnam N. Ralph S. Rombauts S. Salamov A. Schein J. Sterck L. Aerts A. Bhalerao R.R. Bhalerao R.P. Blaudez D. Boerjan W. Brun A. Brunner A. Busov V. Campbell M. Carlson J. Chalot M. Chapman J. Chen G.L. Cooper D. Coutinho P.M. Couturier J. Covert S. Cronk Q. Cunningham R. Davis J. Degroeve S. Déjardin A. Depamphilis C. Detter J. Dirks B. Dubchak I. Duplessis S. Ehlting J. Ellis B. Gendler K. Goodstein D. Gribskov M. Grimwood J. Groover A. Gunter L. Hamberger B. Heinze B. Helariutta Y. Henrissat B. Holligan D. Holt R. Huang W. Islam-Faridi N. Jones S. Jones-Rhoades M. Jorgensen R. Joshi C. Kangasjärvi J. Karlsson J. Kelleher C. Kirkpatrick R. Kirst M. Kohler A. Kalluri U. Larimer F. Leebens-Mack J. Leplé J.C. Locascio P. Lou Y. Lucas S. Martin F. Montanini B. Napoli C. Nelson D.R. Nelson C. Nieminen K. Nilsson O. Pereda V. Peter G. Philippe R. Pilate G. Poliakov A. Razumovskaya J. Richardson P. Rinaldi C. Ritland K. Rouzé P. Ryaboy D. Schmutz J. Schrader J. Segerman B. Shin H. Siddiqui A. Sterky F. Terry A. Tsai C.J. Uberbacher E. Unneberg P. Vahala J. Wall K. Wessler S. Yang G. Yin T. Douglas C. Marra M. Sandberg G. Van de Peer Y. Rokhsar D. Science. 2006; 313: 1596-1604), the legume Glycine max (28. Schmutz J. Cannon S.B. Schlueter J. Ma J. Mitros T. Nelson W. Hyten D.L. Song Q. Thelen J.J. Cheng J. Xu D. Hellsten U. May G.D. Yu Y. Sakurai T. Umezawa T. Bhattacharyya M.K. Sandhu D. Valliyodan B. Lindquist E. Peto M. Grant D. Shu S. Goodstein D. Barry K. Futrell-Griggs M.
Abernathy B. Du J. Tian Z. Zhu L. Gill N. Joshi T. Libault M. Sethuraman A. Zhang X.C. Shinozaki K. Nguyen H.T. Wing R.A. Cregan P. Specht J. Grimwood J. Rokhsar D. Stacey G. Shoemaker R.C. Jackson S.A. Nature. 2010; 463: 178-183), the spike moss Selaginella moellendorffii, and the green algae Ostreococcus sp. RCC809, Volvox carteri (29. Prochnik S.E. Umen J. Nedelcu A.M. Hallmann A. Miller S.M. Nishii I. Ferris P. Kuo A. Mitros T. Fritz-Laylin L.K. Hellsten U. Chapman J. Simakov O. Rensing S.A. Terry A. Pangilinan J. Kapitonov V. Jurka J. Salamov A. Shapiro H. Schmutz J. Grimwood J. Lindquist E. Lucas S. Grigoriev I.V. Schmitt R. Kirk D. Rokhsar D.S. Science. 2010; 329: 223-226), and Chlorella variabilis NC64A (30. Blanc G. Duncan G. Agarkova I. Borodovsky M. Gurnon J. Kuo A. Lindquist E. Lucas S. Pangilinan J. Polle J. Salamov A. Terry A. Yamada T. Dunigan D.D. Grigoriev I.V. Claverie J.M. Van Etten J.L. Plant Cell. 2010; 22: 2943-2955). In addition, the annotation of other plant genomes, such as Oryza sativa (31. Goff S.A. Ricke D. Lan T.H. Presting G. Wang R. Dunn M. Glazebrook J. Sessions A. Oeller P. Varma H. Hadley D. Hutchison D. Martin C. Katagiri F. Lange B.M. Moughamer T. Xia Y. Budworth P. Zhong J. Miguel T. Paszkowski U. Zhang S. Colbert M. Sun W.L. Chen L. Cooper B. Park S. Wood T.C. Mao L. Quail P. Wing R. Dean R. Yu Y. Zharkikh A. Shen R. Sahasrabudhe S. Thomas A. Cannings R. Gutin A. Pruss D. Reid J. Tavtigian S. Mitchell J. Eldredge G. Scholl T. Miller R.M. Bhatnagar S. Adey N. Rubano T. Tusneem N. Robinson R. Feldhaus J. Macalma T. Oliphant A. Briggs S. Science. 2002; 296: 92-100, 32. Ouyang S. Zhu W. Hamilton J. Lin H. Campbell M. Childs K. Thibaud-Nissen F. Malek R.L. Lee Y. Zheng L. Orvis J. Haas B. Wortman J. Buell C.R. Nucleic Acids Res.
2007; 35: D883-D887), has been updated. This new sequence information allows for the recognition of a plant lineage-specific inventory that represents a greater diversity of all green plants.

With the availability of this new genomic information, our goal was to generate an inventory of proteins unique to plastid-containing organisms. This inventory would contain fruitful targets for experimental studies of plant processes. Therefore, we performed a phylogenomics study to derive a set of proteins that is restricted to diverse organisms of the green lineage. We compared proteins encoded by eight plant genomes, but not by nine non-photosynthetic organisms, with proteins of three other photosynthetic eukaryotes (a red alga and two diatoms) to establish a comprehensive set of green lineage proteins, which we designated the “GreenCut2.” We verified the completeness and representative character of the protein inventory by comparing it with proteins encoded by the genomes of eight additional photosynthetic eukaryotes. We annotated the GreenCut2 inventory by performing a meta-analysis of gene, mRNA, and protein data to generate new hypotheses concerning the activity of proteins of unknown function in the GreenCut2 and the roles of these proteins in plastid biology. This analysis suggested potential functions/activities for some of these proteins based on the presence of specific protein domains or motifs, subcellular location, and pattern of expression of the genes that encode them, thus identifying promising targets for future experimental work. Furthermore, the analysis suggests that there is a subset of proteins that is not directly associated with photosynthetic function or plastid biochemistry but that is still specific to the green lineage. Given their conservation, these proteins are likely to be critical for plant-specific processes and activities beyond photosynthesis.

RESULTS AND DISCUSSION

Generation of Inventory

C.
reinhardtii proteins (FM 3.1 set of gene models) were used to identify orthologs in the land plants A. thaliana, P. patens, O. sativa, and P. trichocarpa and in O. tauri, O. lucimarinus, and Ostreococcus sp. RCC809. Those proteins with orthologs in all land plants and at least one of the Ostreococcus species were retained (see below). This set of proteins was compared with proteins in a group of non-photosynthetic organisms (see “Experimental Procedures” for the list of these organisms), and those proteins that had orthologs in any of the non-photosynthetic organisms were removed from consideration (supplemental File 1, Fig. S1).

The Ostreococcus species were considered useful in the phylogenomics analysis because they provide data from divergent species within the chlorophyte lineage (Fig. 1). They are cosmopolitan, marine algae found throughout the world's oceans. However, their genomes are small, and each species is adapted to its environmental niche (50. Jancek S. Gourbière S. Moreau H. Piganeau G. Mol. Biol. Evol. 2008; 25: 2293-2300). Therefore, they may have lost some biochemical functions that are present in other plants. To minimize the impact of specialization, we sampled three Ostreococcus genomes, which provide a broad base of prasinophyte gene representation. We required that an ortholog be encoded by a gene model in one or more of the Ostreococcus genomes. In effect, we attempted to sample an Ostreococcus “pan-genome” that represents protein-encoding genes that are present anywhere in the genus. As a result, 597 proteins were captured in the inventory. Had we required that a protein be encoded by all three Ostreococcus genomes, we would have lost 126 proteins, each of which is presumably dispensable in a particular marine niche occupied by a specialized Ostreococcus species.

This set of 597 proteins is designated the GreenCut2 (supplemental File 2, Table S2).
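
The retention rule described in this section can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: it assumes ortholog detection has already been done and is summarized as a hypothetical table mapping each C. reinhardtii protein to the set of genomes that encode an ortholog. The protein identifiers, the non-photosynthetic genome labels, and the function names are placeholders introduced only for illustration. The optional strict mode mirrors the alternative requirement (an ortholog in all three Ostreococcus genomes) that the text notes would have removed additional proteins.

```python
"""Toy sketch of the presence/absence filter described above (illustrative only)."""

# Genomes named in the text.
LAND_PLANTS = frozenset({"A. thaliana", "P. patens", "O. sativa", "P. trichocarpa"})
OSTREOCOCCUS = frozenset({"O. tauri", "O. lucimarinus", "Ostreococcus sp. RCC809"})

# Placeholder labels: the actual non-photosynthetic organisms are not listed in this section.
NON_PHOTOSYNTHETIC = frozenset({"nonphoto_1", "nonphoto_2"})


def passes_filter(ortholog_genomes, non_photosynthetic, require_all_ostreococcus=False):
    """Return True if a protein meets the three retention criteria."""
    genomes = set(ortholog_genomes)
    # Criterion 1: an ortholog in every land plant.
    if not LAND_PLANTS <= genomes:
        return False
    # Criterion 2: an ortholog in at least one Ostreococcus genome
    # (or in all three when the strict mode is requested).
    hits = OSTREOCOCCUS & genomes
    if require_all_ostreococcus:
        if hits != OSTREOCOCCUS:
            return False
    elif not hits:
        return False
    # Criterion 3: no ortholog in any non-photosynthetic organism.
    return genomes.isdisjoint(non_photosynthetic)


def build_inventory(ortholog_table, non_photosynthetic, require_all_ostreococcus=False):
    """Return the sorted identifiers of proteins that pass the filter."""
    return sorted(
        protein
        for protein, genomes in ortholog_table.items()
        if passes_filter(genomes, non_photosynthetic, require_all_ostreococcus)
    )


# Hypothetical ortholog table with made-up protein identifiers.
TOY_ORTHOLOGS = {
    "prot_A": LAND_PLANTS | {"O. tauri"},                    # one Ostreococcus hit
    "prot_B": LAND_PLANTS | OSTREOCOCCUS,                    # all Ostreococcus hits
    "prot_C": LAND_PLANTS | {"O. tauri", "nonphoto_1"},      # non-photosynthetic hit
    "prot_D": (LAND_PLANTS - {"P. patens"}) | OSTREOCOCCUS,  # missing one land plant
    "prot_E": LAND_PLANTS,                                   # no Ostreococcus hit
}

pan = build_inventory(TOY_ORTHOLOGS, NON_PHOTOSYNTHETIC)
strict = build_inventory(TOY_ORTHOLOGS, NON_PHOTOSYNTHETIC, require_all_ostreococcus=True)
print(pan)     # ['prot_A', 'prot_B']
print(strict)  # ['prot_B']
```

In this toy table, the "pan-genome" rule keeps a protein found in only one Ostreococcus genome, whereas the strict rule drops it; this is the trade-off discussed above.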
As a consequence of whole genome duplications in Arabidopsis, the 597 Chlamydomonas GreenCut2 proteins capture 710 Arabidopsis co-orthologs. GreenCut2 proteins were assigned to the general categories K, KI, U, and UP (see “Experimental Procedures”).

Subgroups of GreenCut2

To investigate whether GreenCut2 proteins are conserved in photosynthetic organisms that are not affiliated with the green lineage, we identified GreenCut2 orthologs encoded by the genomes of the red alga C. merolae (51. Matsuzaki M. Misumi O. Shin-I T. Maruyama S. Takahara M. Miyagishima S.Y. Mori T. Nishida K. Yagisawa F. Nishida K. Yoshida Y. Nishimura Y. Nakao S. Kobayashi T. Momoyama Y. Higashiyama T. Minoda A. Sano M. Nomoto H. Oishi K. Hayashi H. Ohta F. Nishizaka S. Haga S. Miura S. Morishita T. Kabeya Y. Terasawa K. Suzuki Y. Ishii Y. Asakawa S. Takano H. Ohta N. Kuroiwa H. Tanaka K. Shimizu N. Sugano S. Sato N. Nozaki H. Ogasawara N. Kohara Y. Kuroiwa T. Nature. 2004; 428: 653-657) and the diatoms T. pseudonana and P. tricornutum (52. Armbrust E.V. Berges J.A. Bowler C. Green B.R. Martinez D. Putnam N.H. Zhou S. Allen A.E. Apt K.E. Bechner M. Brzezinski M.A. Chaal B.K. Chiovitti A. Davis A.K. Demarest M.S. Detter J.C. Glavina T. Goodstein D. Hadi M.Z. Hellsten U. Hildebrand M. Jenkins B.D. Jurka J. Kapitonov V.V. Kröger N. Lau W.W. Lane T.W. Larimer F.W. Lippmeier J.C. Lucas S. Medina M. Montsant A. Obornik M. Parker M.S. Palenik B. Pazour G.J. Richardson P.M. Rynearson T.A. Saito M.A. Schwartz D.C. Thamatrakoln K. Valentin K. Vardi A. Wilkerson F.P. Rokhsar D.S. Science. 2004; 306: 79-86, 53. Bowler C. Allen A.E. Badger J.H. Grimwood J. Jabbari K. Kuo A. Maheswari U. Martens C. Maumus F. Otillar R.P. Rayko E. Salamov A. Vandepoele K. Beszteri B. Gruber A. Heijde M. Katinka M. Mock T. Valentin K. Verret F. Berges J.A. Brownlee C. Cadoret J.P. Chiovitti A. Choi C.J. Coesel S. De Martino A. Detter J.C. Durkin C.
Falciatore A. Fournet J. Haruta M. Huysman M.J. Jenkins B.D. Jiroutova K. Jorgensen R.E. Joubert Y. Kaplan A. Kröger N. Kroth P.G. La Roche J. Lindquist E. Lommer M. Martin-Jézéquel V. Lopez P.J. Lucas S. Mangogna M. McGinnis K. Medlin L.K. Montsant A. Oudot-Le Secq M.P. Napoli C. Obornik M. Parker M.S. Petit J.L. Porcel B.M. Poulsen N. Robison M. Rychlewski L. Rynearson T.A. Schmutz J. Shapiro H. Siaut M. Stanley M. Sussman M.R. Taylor A.R. Vardi A. von Dassow P. Vyverman W. Willis A. Wyrwicz L.S. Rokhsar D.S. Weissenbach J. Armbrust E.V. Green B.R. Van de Peer Y. Grigoriev I.V. Nature. 2008; 456: 239-244). C. merolae is a member of the plant kingdom whose ancestor diverged from the green plant lineage (Fig. 1). Diatoms, in contrast, are heterokonts. They acquired their plastid through a secondary endosymbiosis (54. Archibald J.M. Keeling P.J. Trends Genet. 2002; 18: 577-584, 55. Gould S.B. Waller R.F. McFadden G.I. Annu. Rev. Plant Biol. 2008; 59: 491-517), a process in which an endosymbiont-containing eukaryote is engulfed by another free-living eukaryote.

Among the 597 GreenCut2 proteins, 124 are found in the genomes of green plants, C. merolae, and at least one diatom (supplemental File 1, Fig. S2). This set of 124 proteins has been designated the “PlastidCut2.” The genes for PlastidCut2 proteins are conserved in the nuclear genomes of the diverse plastid-containing, photosynthetic eukaryotes investigated (within and outside the plant lineage). Therefore, the name PlastidCut2 is independent of the eukaryotes' evolutionary history.
These proteins are likely to be critically important for plastid metabolism, including photosynthesis, which is suggested by an enrichment of functions associated with photosynthesis among the K category proteins and the greater fraction of PlastidCut2 proteins found in cyanobacteria (see below). Surprisingly, despite their high degree of conservation, the functions of 52% (64 of 124) of PlastidCut2 proteins are not known (Table 1).

TABLE 1
Proteins of known and unknown function

Proteins      PlastidCut  DiatomCut-PlastidCut  PlantCut-PlastidCut  ViridiCut  Total  Percent
GreenCut2     124         96                    65                   312        597
  K           60          45                    32                   149        286    48
  U           64          51                    33                   163        311    52
GreenCut v1   90          60                    27                   172        349
  K           29          18                    9                    79         135    39
  U           61          42                    18                   93         214    61

The subset of GreenCut2 proteins found in the genome of at least one diatom is labeled “DiatomCut2” (supplemental File 1, Fig. S2). The proteins of this subgroup include the 124 proteins of the PlastidCut2 plus a set of 96 proteins that are not apparently conserved/encoded by the C. merolae genome. Similarly, the set of proteins found in green plants and C. merolae is labeled “PlantCut2,” which includes PlastidCut2 proteins plus 65 additional proteins not apparently conserved/encoded by either of the diatom genomes analyzed in this study (supplemental File 1, Fig. S2). Green plants contain 312 proteins designated the “ViridiCut2.” These proteins are not encoded by the genome of C. merolae, P. tricornutum, or T. pseudonana (supplemental File 1, Fig. S2). The ViridiCut2 is likely enriched in green lineage-specific functions, such as mechanisms of chlorophyll a/b protein regulation.

Validation of GreenCut2

For practical reasons, we used only a subset of genomes representing a divergent collection of reference organisms to generate the GreenCut2. To validate our choice of organisms, we tested the predicted proteomes of recently sequenced plants, algae, and diatoms.

Land Plants

To assess the conservation of GreenCut2 proteins in land plants, the genomes of G. max (soybean), S.
bicolor (cereal grass), and S. moellendorffii (spike moss), which occupy phylogenetically distinct positions in the green plant tree of life relative to the plants used for generation of the GreenCut2 (Fig. 1), were searched for orthologs of GreenCut2 proteins. The analysis demonstrated that the genomes of G. max, S. bicolor, and S. moellendorffii may not encode one, one, and three GreenCut2 proteins, respectively (supplemental File 3, Table S3). The genes encoding these proteins may lie in genomic regions missing from the current genome assemblies, or the genes may have been selectively lost. Overall, the presence of genes encoding almost all (99%, or 592 of 597) of the GreenCut2 proteins in three additional plant genomes (a legume, a grass, and a spike moss), which are divergent from other green plants used in the construction of the GreenCut2, provides further evidence that the inventory of proteins in the GreenCut2 is especially relevant to and representative of all land plants of the green lineage and that the number of false positives is likely to be very low.

Algae

We queried the predicted proteomes of the chlorophyte lineage algae V. carteri, C. variabilis NC64A, Coccomyxa sp. C-169, and M. pusilla (56. Worden A.Z. Lee J.H. Mock T. Rouzé P. Simmons M.P. Aerts A.L. Allen A.E. Cuvelier M.L. Derelle E. Everett M.V. Foulon E. Grimwood J. Gundlach H. Henrissat B. Napoli C. McDonald S.M. Parker M.S. Rombauts S. Salamov A. Von Dassow P. Badger J.H. Coutinho P.M. Demir E. Dubchak I. Gentemann C. Eikrem W. Gready J.E. John U. Lanier W. Lindquist E.A. Lucas S. Mayer K.F. Moreau H. Not F. Otillar R. Panaud O. Pangilinan J. Paulsen I. Piegu B. Poliakov A. Robbens S. Schmutz J. Toulza E. Wyss T. Zelensky A. Zhou K. Armbrust E.V. Bhattacharya D. Goodenough U.W. Van de Peer Y. Grigoriev I.V. Science. 2009; 324: 268-272) (Fig. 1) for orthologs to the Chlamydomonas protein set. The V.
carteri genome encodes 100% of the GreenCut2 proteins, the trebouxiophyte algae C. variabilis and Coccomyxa encode 96 and 89%, respectively, and M. pusilla encodes 89%. The GreenCut2 proteins that were not identified in these algae (supplemental File 3, Table S3) may be encoded by genes located in regions missing from the genome assembly, may be present on unsequenced chloroplast genomes, or may have been lost during genome reduction.

We note that, of the 597 GreenCut2 proteins in Chlamydomonas, 105 are missing in at least one of the other green algae (V. carteri, C. variabilis, Coccomyxa, Ostreococcus spp., and M. pusilla). With a few exceptions in the trebouxiophyte lineage (supplemental File 3, Table S3), there does not appear to be a consistent pattern of GreenCut2 prot