Tuesday, March 31, 2020

New Curated Cyanobacterial Databases in BioCyc

New Curated Databases
 The BioCyc.org microbial genome web portal contains three new, highly curated cyanobacterial Pathway/Genome Databases, in addition to one previously existing curated cyanobacterial database (for Synechococcus elongatus PCC 7942). Each database integrates a variety of information including the genome, metabolic pathways, operons, protein features, Gene Ontology terms, and orthologs. Each of these databases received extensive literature-based curation to correct annotation errors and to integrate information about experimentally determined pathways and gene functions.

"I have used BioCyc for research and teaching for over a decade. It is unquestionably one of the most useful resources for microbial metabolism and I ensure that all our microbiology students become familiar with its capabilities." 
                                                               - Prof. Louis Sherman, Purdue University

Curated cyanobacterial databases:

These four databases will be freely accessible for the next few months so that the community can explore them; after that they will be available by subscription.
Curation of the new databases involved manual removal or correction of wrong or irrelevant information and entry of extensive new information from the literature. More information about the manual curation process is provided in the Annotation Updates section below.
In addition, the newly curated gene functions for each of the preceding organisms were propagated via ortholog relationships among those four databases, and to the following databases. For example, the PRO_0550 gene of Prochlorococcus marinus marinus CCMP1375 was newly curated as a hexameric carboxysome shell protein CsoS1A. Since orthologs of that gene are present in several of the other genomes, we propagated the gene and protein names from CCMP1375 to all of the other databases in this list in which an ortholog was present.
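
The propagation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BioCyc implementation: the `Gene` record, the `propagate_curated_names` helper, the organism keys, every locus tag other than PRO_0550, and the gene symbol `csoS1A` are hypothetical, and the rule that already-curated orthologs are left untouched is a design choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical minimal gene record; real PGDB entries hold far more."""
    locus_tag: str
    name: str = ""
    product: str = ""
    curated: bool = False


def propagate_curated_names(source, orthologs, databases):
    """Copy the curated gene and product names of `source` to its orthologs.

    `orthologs` is a list of (organism, locus_tag) pairs and `databases`
    maps organism -> {locus_tag: Gene}. Orthologs that are missing from a
    database or already curated are skipped (an assumption of this sketch).
    Returns the list of (organism, locus_tag) pairs that were updated.
    """
    updated = []
    if not source.curated:
        return updated
    for organism, locus_tag in orthologs:
        gene = databases.get(organism, {}).get(locus_tag)
        if gene is None or gene.curated:
            continue
        gene.name = source.name
        gene.product = source.product
        updated.append((organism, locus_tag))
    return updated


# Illustrative data: only PRO_0550 and the CsoS1A product come from the text.
source = Gene("PRO_0550", name="csoS1A",
              product="carboxysome shell protein CsoS1A", curated=True)
databases = {
    "organism_B": {"B_0001": Gene("B_0001", product="BMC domain-containing protein")},
    "organism_C": {"C_0002": Gene("C_0002", name="csoS1", product="already curated",
                                  curated=True)},
}
orthologs = [("organism_B", "B_0001"), ("organism_C", "C_0002"),
             ("organism_D", "D_0003")]

updated = propagate_curated_names(source, orthologs, databases)
print(updated)
```

Only the uncurated ortholog in `organism_B` receives the new names here; the curated entry in `organism_C` and the absent `organism_D` entry are skipped.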

BioCyc contains a total of 200+ cyanobacterial databases. 

BioCyc Bioinformatics Tools
 BioCyc is unique in integrating a rich collection of data content with extensive bioinformatics tools.

Annotation Updates
 Curation of the new PGDBs involved manual removal of incorrect or irrelevant chemical compounds, reactions, and pathways; assignment of gene functions for genes with incorrect or missing annotation; construction of protein complexes; import of pathways that were not predicted due to missing or incorrect annotation; assignment of transport reactions to transporters; addition of pointers from pathways to other pathways upstream or downstream; and curation of relevant literature, including novel commentary for cyanobacteria-specific proteins and pathways.

During curation we found many annotation errors. For example, in 1996 Brahamsha et al. showed that Synechococcus sp. WH 8102 produces a very large protein (10,791 amino acids), which they named SwmB, that is required for swimming motility [PMID 8692845, 17158680]. However, in the recent RefSeq annotation the swmB gene was replaced by a series of short pseudogenes (the error has since been corrected after we notified RefSeq).

Annotation pipelines appear to have difficulty keeping up with the literature. For example, in 2012 a new carboxysomal protein, CsoS1D, was discovered in Prochlorococcus [PMID 22155772]. Yet, eight years later the respective genes are annotated simply as BMC domain-containing proteins. In addition, genes encoding carboxysomal proteins are named differently in α-cyanobacteria (such as the marine Synechococcus and Prochlorococcus strains) and β-cyanobacteria (such as the freshwater Synechococcus elongatus PCC 7942). Yet, many of these genes in the α-strains are annotated with the incorrect ccm format (e.g. ccmK) instead of the correct csoS format (see UniProt P0A328).

Another problem we often encountered was inconsistent naming. For example, the typical marine picocyanobacterial NADPH-quinone oxidoreductase is composed of 15 subunits, encoded by the genes ndhA-ndhO. We found that the different subunits are typically named using alternative terms such as “chain 5”, “subunit 1”, “subunit O”, etc. In the RefSeq annotation of Prochlorococcus marinus marinus CCMP1375 (SS120) the genes ndhL-ndhO are not named at all. Instead, the ndbB gene, which encodes demethylphylloquinone reductase [PMID 26023160], was named ndh.

Yet another common problem is annotation of different proteins with the same name. Prochlorococcus marinus pastoris CCMP1986 (MED4) contains 22 genes encoding High Light Inducible Proteins (HLIPs). Even though the literature has clearly named each of these genes (hli1-hli22) [PMID 12399037] and multiple papers describe the different properties of these proteins, in the RefSeq annotation none of these genes is given a gene name, and their products are all named “high light inducible protein” without the identifying number.

Another example of this problem is the FoF1 ATP synthase, where the gene encoding the b’ subunit is never named properly. In UniProt it was named atpG, the same as the gene encoding the γ subunit (for example, see UniProt Q7VA60 and Q7VA64, respectively). In RefSeq it was named atpF, the same as the gene encoding the b subunit. After we contacted UniProt, they agreed to reannotate the b’ subunits as atpF2 to avoid confusion.
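
Duplicate labels of the kind described above (one product name shared by all hli genes, one gene symbol used for two different ATP synthase subunits) are easy to flag mechanically before manual review. The following is a minimal sketch, not a description of the curation tooling: the `find_shared_labels` helper, the record layout, and all locus tags are hypothetical, and the comparison is an exact string match after trimming whitespace.

```python
from collections import defaultdict


def find_shared_labels(genes, field):
    """Group locus tags by the value of `field` ("name" or "product").

    Returns only the labels that are carried by more than one gene.
    Empty labels are ignored; matching is exact after stripping whitespace.
    """
    groups = defaultdict(list)
    for gene in genes:
        label = gene.get(field, "").strip()
        if label:
            groups[label].append(gene["locus_tag"])
    return {label: tags for label, tags in groups.items() if len(tags) > 1}


# Illustrative records with made-up locus tags.
genes = [
    {"locus_tag": "X_0001", "name": "", "product": "high light inducible protein"},
    {"locus_tag": "X_0002", "name": "", "product": "high light inducible protein"},
    {"locus_tag": "X_0003", "name": "atpG", "product": "ATP synthase subunit b'"},
    {"locus_tag": "X_0004", "name": "atpG", "product": "ATP synthase gamma chain"},
    {"locus_tag": "X_0005", "name": "ndhA", "product": "NADPH-quinone oxidoreductase subunit 1"},
]

shared_products = find_shared_labels(genes, "product")
shared_names = find_shared_labels(genes, "name")
print(shared_products)
print(shared_names)
```

A report like this only points a curator at suspicious groups; deciding which name is correct still requires the literature.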