New Curated Databases
The BioCyc.org microbial genome web
portal contains three new, highly curated cyanobacterial Pathway/Genome
Databases, in addition to one previously existing curated cyanobacterial
database (for Synechococcus elongatus PCC 7942). Each database integrates a variety of information including
the genome, metabolic pathways, operons, protein features, Gene Ontology terms,
and orthologs. Each of these databases received extensive literature-based
curation to correct annotation errors and to integrate information about
experimentally determined pathways and gene functions.
"I have used BioCyc for research and teaching for over
a decade. It is unquestionably one of the most useful resources for microbial
metabolism and I ensure that all our microbiology students become familiar with
its capabilities."
- Prof. Louis Sherman, Purdue University
Curated cyanobacterial databases:
These four databases will be free to
access for the next few months so that the community can explore them; after
that they will be available by subscription.
Curation of the new databases
involved manual removal or correction of wrong or irrelevant information and
entry of extensive new information from the literature. More information about
the manual curation process is provided in the Annotation Updates section below.
In addition, the newly curated gene
functions for each of the preceding organisms were propagated via ortholog
relationships among those four databases, and to the following databases. For
example, the PRO_0550 gene of Prochlorococcus marinus marinus CCMP1375 was newly curated as a hexameric carboxysome shell
protein CsoS1A. Since
orthologs of that gene are present in several of the other genomes, we
propagated the gene and protein names from CCMP1375 to all of the other
databases in this list in which an ortholog was present.
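The propagation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline BioCyc actually uses: the `Gene` class, the `propagate` function, the `HYPO_` locus tags, and the rule that already-curated orthologs are left untouched are all hypothetical. The gene symbol csoS1A is inferred here from the protein name CsoS1A.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    locus_tag: str
    name: Optional[str] = None
    product: Optional[str] = None
    curated: bool = False  # True if a curator assigned the name


def propagate(source: Gene, orthologs: list[Gene]) -> list[str]:
    """Copy a curated name and product from source to each ortholog.

    Assumption: orthologs that are themselves curated are skipped.
    Returns the locus tags of the genes that were updated.
    """
    if not source.curated:
        return []
    updated = []
    for gene in orthologs:
        if gene.curated:
            continue
        gene.name = source.name
        gene.product = source.product
        updated.append(gene.locus_tag)
    return updated


# PRO_0550 comes from the text; the gene symbol is an assumed value.
source = Gene("PRO_0550", "csoS1A",
              "hexameric carboxysome shell protein CsoS1A", curated=True)
# Hypothetical orthologs in two other databases.
orthologs = [
    Gene("HYPO_0001", product="BMC domain-containing protein"),
    Gene("HYPO_0002", "csoS1", "carboxysome shell protein", curated=True),
]
updated = propagate(source, orthologs)
```

Skipping curated orthologs is one possible design choice; it keeps an automated transfer from overwriting a name that a curator assigned by hand.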
BioCyc
contains a total of 200+ cyanobacterial databases.
BioCyc Bioinformatics Tools
BioCyc is unique in integrating a
rich collection of data content with extensive bioinformatics tools.
Annotation Updates
Curation of the new PGDBs involved
manual removal of incorrect or irrelevant chemical compounds, reactions, and
pathways, assignment of gene functions for genes with incorrect or missing
annotation, construction of protein complexes, import of pathways that were not
predicted due to missing/incorrect annotation, assignment of transport
reactions to transporters, addition of pointers from pathways to other pathways
upstream or downstream, and curation of relevant literature, including novel
commentary for cyanobacterial-specific proteins and pathways.
During the curation process we
found many annotation errors. For example, in 1996 Brahamsha et
al. showed that Synechococcus sp. WH 8102 produces a very large protein (10,791
amino acids), which they named SwmB, that is required for swimming motility [PMID 8692845,
17158680]. However, in the recent RefSeq annotation the swmB gene was
replaced by a series of short pseudogenes (the error has since been corrected
after we notified RefSeq).
Annotation pipelines appear to have
difficulty keeping up with the literature. For example, in 2012 a new
carboxysomal protein, CsoS1D, was discovered in Prochlorococcus [PMID
22155772]. Yet, 8 years later the respective genes are annotated simply as BMC
domain-containing proteins. In addition, genes encoding carboxysomal proteins
are named differently in
α-cyanobacteria (such as the marine Synechococcus
and Prochlorococcus strains) and
β-cyanobacteria
(such as the freshwater Synechococcus elongatus PCC 7942).
Yet many of these genes in the α-strains are annotated with the incorrect ccm
format (e.g. ccmK) instead of the correct csoS format (see
UniProt P0A328).
Another problem we encountered often
was inconsistency in naming. For example, the typical marine picocyanobacterial
NADPH-quinone oxidoreductase is composed of 15 subunits, encoded by the genes ndhA-ndhO.
We found that the different subunits are typically named using alternative
terms such as “chain 5”, “subunit 1”, “subunit O”, etc. In the RefSeq annotation
of Prochlorococcus marinus marinus CCMP1375 (SS120) the
genes ndhL-ndhO are not named at all. Instead, the ndbB
gene, which encodes demethylphylloquinone reductase [PMID 26023160], was named ndh.
Yet another common problem is
annotation of different proteins with the same name. Prochlorococcus marinus
pastoris CCMP1986 (MED4) contains 22 genes encoding High Light Inducible
Proteins (HLIPs). Even though the literature has clearly named each of these
genes (hli1-hli22) [PMID 12399037] and multiple papers
describe the different properties of these proteins, the RefSeq annotation
gives none of these genes a gene name, and their products are all
named “high light inducible protein” without the identifying number.
Another example of this problem is
the FoF1 ATP synthase, where the gene encoding the b’ subunit is never named
properly. In UniProt it was named atpG, the same as
the gene encoding the
γ subunit (for example see UniProt Q7VA60 and
Q7VA64, respectively). In RefSeq it was named atpF, the same
as the gene encoding the b subunit. After we contacted UniProt, they agreed to
reannotate the b’ subunits as atpF2 to avoid confusion.
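Cases like the two above, where several genes share one name, can be flagged automatically before manual review. The following is a minimal sketch, assuming the annotation is available as (locus tag, name) pairs; the function name `shared_names` and the `HYPO_` locus tags are hypothetical, and the sample records are illustrative rather than taken from RefSeq.

```python
from collections import defaultdict


def shared_names(annotations):
    """Return names assigned to more than one gene.

    annotations: iterable of (locus_tag, name) pairs.
    Names are compared case-insensitively after trimming whitespace.
    """
    by_name = defaultdict(list)
    for locus_tag, name in annotations:
        by_name[name.strip().lower()].append(locus_tag)
    return {name: tags for name, tags in by_name.items() if len(tags) > 1}


# Illustrative records with hypothetical locus tags.
annotations = [
    ("HYPO_0101", "high light inducible protein"),
    ("HYPO_0102", "High light inducible protein"),
    ("HYPO_0103", "ATP synthase subunit b"),
]
dupes = shared_names(annotations)
```

A check like this only reports candidates. Deciding which name each gene should carry still depends on the literature, as in the hli and atpF examples.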