New Curated Databases
The BioCyc.org microbial genome web
portal contains three new, highly curated cyanobacterial Pathway/Genome
Databases, in addition to one previously existing curated cyanobacterial
database (for Synechococcus elongatus PCC 7942). Each database integrates a variety of information including
the genome, metabolic pathways, operons, protein features, Gene Ontology terms,
and orthologs. Each of these databases received extensive literature-based
curation to correct annotation errors and to integrate information about
experimentally determined pathways and gene functions.
"I have used BioCyc for research and teaching for over
a decade. It is unquestionably one of the most useful resources for microbial
metabolism and I ensure that all our microbiology students become familiar with
its capabilities."
- Prof. Louis Sherman, Purdue University
Curated cyanobacterial databases:
- Synechococcus sp. WH 8102 (new)
- Prochlorococcus marinus CCMP1375 (new)
- Prochlorococcus marinus subsp. pastoris (strain CCMP1986) (new)
- Synechococcus elongatus PCC 7942 (existing)
These four databases will be free to access for the next few months for the community to explore them and will then be available by subscription.
Curation of the new databases
involved manual removal or correction of wrong or irrelevant information and
entry of extensive new information from the literature. More information about
the manual curation process is provided in the Annotation Updates section below.
In addition, the newly curated gene
functions for each of the preceding organisms were propagated via ortholog
relationships among those four databases, and to the following databases. For
example, the PRO_0550 gene of Prochlorococcus marinus marinus CCMP1375 was newly curated as a hexameric carboxysome shell
protein CsoS1A. Since
orthologs of that gene are present in several of the other genomes, we
propagated the gene and protein names from CCMP1375 to all of the other
databases in this list in which an ortholog was present.
- Synechococcus sp. CC9311
- Synechococcus sp. CC9902
- Synechococcus sp. CC9605
- Synechococcus sp. PCC7002
- Synechococcus sp. RS9916
- Synechocystis sp. PCC 6803
- Prochlorococcus marinus str. MIT 9313
- Prochlorococcus marinus str. MIT 9301
- Prochlorococcus marinus str. NATL2A
- Prochlorococcus marinus str. MIT 9211
- Nostoc punctiforme strain PCC 73102
- Thermosynechococcus elongatus strain BP-1
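The ortholog-based propagation described above can be pictured with a minimal sketch. This is an illustration only, not the actual BioCyc procedure: the `Gene` class, the `propagate_annotation` function, and the `HYPOTHETICAL_...` locus tags are invented for the example, the lowercase gene name `csoS1A` is assumed from the protein name, and the rule that already curated orthologs are left untouched is our assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    """Hypothetical, simplified gene record (not a BioCyc data structure)."""
    locus_tag: str
    name: Optional[str] = None
    product: Optional[str] = None
    curated: bool = False


def propagate_annotation(source: Gene, orthologs: List[Gene]) -> List[str]:
    """Copy the curated gene and protein names from `source` to its orthologs.

    Returns the locus tags of the genes that were updated.
    Assumption: orthologs that are already curated are never overwritten.
    """
    if not source.curated:
        return []
    updated = []
    for gene in orthologs:
        if gene.curated:
            continue  # assumed rule: keep existing manual curation
        gene.name = source.name
        gene.product = source.product
        updated.append(gene.locus_tag)
    return updated


# PRO_0550 is the curated gene from the text; the gene name is assumed.
source = Gene(
    "PRO_0550",
    name="csoS1A",
    product="hexameric carboxysome shell protein CsoS1A",
    curated=True,
)
# Hypothetical orthologs in two other databases.
targets = [
    Gene("HYPOTHETICAL_0001", product="BMC domain-containing protein"),
    Gene("HYPOTHETICAL_0002", name="csoS1", product="carboxysome shell protein",
         curated=True),
]
updated = propagate_annotation(source, targets)
print(updated)
```

In this sketch only the uncurated ortholog receives the new names, so a propagated annotation can never replace one that a curator has already reviewed.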
BioCyc contains a total of 200+ cyanobacterial databases.
BioCyc Bioinformatics Tools
BioCyc is unique in integrating a rich collection of data content with extensive bioinformatics tools.
- Extensive search tools
- Pathway diagrams, zoomable organism-specific metabolic network diagrams
- Paint transcriptomics and metabolomics data onto pathway diagrams and metabolic network diagrams
- Innovative Omics Dashboard tool
- Genome browser
- Comparative genome browser that aligns genomes at orthologous genes
- BLAST search, sequence pattern search
- Search for optimal metabolic routes connecting two metabolites
- Regulatory sites, regulatory network diagram
Learn More
Annotation Updates
Curation of the new PGDBs involved
manual removal of incorrect or irrelevant chemical compounds, reactions, and
pathways, assignment of gene functions for genes with incorrect or missing
annotation, construction of protein complexes, import of pathways that were not
predicted due to missing/incorrect annotation, assignment of transport
reactions to transporters, addition of pointers from pathways to other pathways
upstream or downstream, and curation of relevant literature, including novel
commentary for cyanobacterial-specific proteins and pathways.
During the curation process we found many annotation errors. For example, in
1996 Brahamsha et al. showed that Synechococcus sp. WH 8102 produces a very
large protein (10,791 amino acids), which they named SwmB, that is required for
swimming motility [PMID 8692845, 17158680]. However, in the recent RefSeq
annotation the swmB gene was replaced by a series of short pseudogenes (the
error has since been corrected after we notified RefSeq).
Annotation pipelines appear to have difficulty keeping up with the literature. For example, in 2012 a new carboxysomal protein, CsoS1D, was discovered in Prochlorococcus [PMID 22155772]. Yet, 8 years later the respective genes are annotated simply as BMC domain-containing proteins. In addition, genes encoding carboxysomal proteins are named differently in
α-cyanobacteria (such as the marine Synechococcus and Prochlorococcus strains) and
β-cyanobacteria
(such as the freshwater Synechococcus elongatus PCC 7942).
Yet, many of these genes in the α-strains are annotated with the incorrect ccm
format (e.g. ccmK) instead of the correct csoS format (see
UniProt P0A328).
Another problem we encountered often
was inconsistency in naming. For example, the typical marine picocyanobacterial
NADPH-quinone oxidoreductase is composed of 15 subunits, encoded by the genes ndhA-ndhO.
We found that the different subunits are typically named using alternative
terms such as “chain 5”, “subunit 1”, “subunit O”, etc. In the RefSeq annotation
of Prochlorococcus marinus marinus CCMP1375 (SS120) the
genes ndhL-ndhO are not named at all. Instead, the ndbB
gene, which encodes demethylphylloquinone reductase [PMID 26023160], was named ndh.
Yet another common problem is
annotation of different proteins with the same name. Prochlorococcus marinus
pastoris CCMP1986 (MED4) contains 22 genes encoding High Light Inducible
Proteins (HLIPs). Even though the literature has clearly named each of these
genes (hli1-hli22) [PMID 12399037] and multiple papers
describe the different properties of these proteins, in the RefSeq annotation
none of these genes is given a gene name, and their products are all
named “high light inducible protein” without the identifying number.
Another example of this problem is
the FoF1 ATP synthase, where the gene encoding the b’ subunit is never named
properly. In UniProt it was named atpG, the same as the gene encoding the
γ subunit (for example see UniProt Q7VA60 and Q7VA64, respectively). In RefSeq it was named atpF, the same as the gene encoding the b subunit (after we contacted UniProt, they agreed to reannotate the b’ subunits as atpF2 to avoid confusion).
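Cases like the HLIPs, where distinct genes share one product name, are easy to flag automatically before a curator resolves them. The following is a minimal sketch of such a check, not a tool we used. The `shared_product_names` function, the tuple layout, and the `HYPOTHETICAL_...` locus tags are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record layout: (locus_tag, gene_name_or_None, product_name)
GeneRecord = Tuple[str, Optional[str], str]


def shared_product_names(genes: Iterable[GeneRecord]) -> Dict[str, List[str]]:
    """Return {product name: [locus tags]} for product names used by >1 gene.

    Product names are compared case-insensitively after stripping whitespace.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for locus_tag, _gene_name, product in genes:
        groups[product.strip().lower()].append(locus_tag)
    return {name: tags for name, tags in groups.items() if len(tags) > 1}


# Hypothetical annotation records resembling the HLIP case.
genes = [
    ("HYPOTHETICAL_0101", None, "high light inducible protein"),
    ("HYPOTHETICAL_0102", None, "High light inducible protein"),
    ("HYPOTHETICAL_0103", "atpF", "ATP synthase subunit b"),
]
duplicates = shared_product_names(genes)
print(duplicates)
```

A check like this only points to candidates. Deciding which literature name (for example hli1 versus hli2) belongs to which gene still requires manual curation.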
- For an example of a curated protein page see the hexameric carboxysome shell protein CsoS1A from Prochlorococcus marinus marinus CCMP1375.
- For an example of a curated pathway see aerobic respiration (NDH-1 to cytochrome c oxidase via cytochrome c6) from Synechococcus sp. WH 8102.
- For an example of a multi-genome alignment of the chromosomal regions around a selected gene (in this case cbbS) see here.