Friday, June 29, 2018

Generating a SmartTable of Orthologous Genes Across Multiple BioCyc Genomes


This post shows how to take a list of gene names and/or identifiers from one organism and retrieve the orthologous genes in a second organism.  For example, this procedure could be used to find the EcoCyc orthologs of a set of genes of uncertain function in another organism, potentially providing insights into their functions.

The BioCyc project generates a database of orthologous genes between many of the organism Pathway/Genome Databases (PGDBs) we maintain.  These orthologs are stored in a central MySQL database that is separate from the individual BioCyc PGDBs.  This post describes a way to retrieve orthologs for a list of genes using a SmartTable.  We don’t recommend generating an ortholog list for a whole genome because of performance problems that, as of June 2018, we are working to resolve.

Since we will be using a PGDB in addition to EcoCyc, you will need a BioCyc subscription (not just a free account) to follow this demonstration.  Here’s a screenshot of the final SmartTable, which you can access directly here. We will begin with a file of 88 genes from E. coli B str. REL606 and determine their orthologs in EcoCyc.  The columns in the resulting SmartTable are as follows:  the gene names from the input file, their ‘ECB’ accession IDs, the names of the orthologous genes from EcoCyc, and their ‘b-number’ accession IDs from EcoCyc.  We also show how to add a column containing gene product names.






Here is the step-by-step procedure.

1.     Go to the bottom of this post and copy and paste the list of gene identifiers into a text editor such as TextEdit or Atom.  Save the file as 88EcoliGenes.txt.
2.     Go to BioCyc.org, log in, and use ‘change organism database’ to change your organism to Escherichia coli B str. REL606.
3.     Open the ‘Smart Tables’ menu and choose the ‘My Smart Tables’ command.  It actually doesn’t matter which of the commands in the menu you choose.
4.     You’ll find the operations menu in the upper right corner of the SmartTables page.  Under the ‘New’ command, you’ll find ‘Smart Table from Uploaded File’.  Choose that command. 
5.     In the resulting pop-up window, click the ‘Choose File’ button and select the file you saved in step 1.  Once you have located and selected the file, review the options below the ‘Choose File’ button in the upload window.  Since this is a file of gene identifiers, you should keep the ‘Try to make objects of type’ box checked, along with the radio button next to ‘Gene’.  Also leave the other two check boxes checked.  Click the ‘Upload’ button and a new SmartTable with a single column will appear.  If you see a warning message, ignore it and continue.
6.     Add accession numbers for the genes by locating the ‘ADD PROPERTY COLUMN’ dropdown menu and choosing ‘Accession-1’ from the list.  You’ll see the ECB identifier for each of the genes on the first page.
7.     Now that the table has some basic identifiers for the E. coli B str. REL606 genes, you can proceed to adding orthologs from EcoCyc.  Find the ‘ADD TRANSFORM COLUMN’, and choose the second transform ‘Compare – map to other species PGDB’.  That will bring up a pop-up list of species, which will be rather long.  Scroll down through the list until you find ‘Escherichia coli K-12 substr. MG1655’.  Once you have selected the right organism, click the ‘Go’ button.
8.     Now a column containing ortholog gene names appears.  Each entry is linked to the gene page in its organism’s PGDB.  Try it by clicking on thrL, and notice the organism in the upper left of the gene page.  Clicking on the browser back button will take you back to the table and restore the current PGDB to EcoCyc.
9.     To add a column containing accession numbers of each orthologous gene, select the ortholog column by clicking anywhere in the column header except where the title is displayed.   When the column is selected the header turns darker (and the ‘Gene Name’ column lightens as it is no longer the selected column).  Then, as in step 6,  choose ‘Accession-1’ from the dropdown below ‘Add Property Column.’  A different list of accession numbers appears.  Notice that both lists of accession numbers are simply strings that don’t link to any genes.  A column of product names for the orthologs can be added in a similar fashion.
10.  Next, suppose you are interested only in those B strain genes that have orthologs in the K-12 strain. To remove genes lacking orthologs, select the third column (‘Map to Escherichia…’).  Now look over to the operations menu on the right.  Select the ‘Filter’ command. A pop-up window titled ‘Column text-value filter’ will appear.
11.  This dialog has a series of options chained together in a ‘sentence.’  In the second drop-down, change ‘a copy of …’ to ‘this SmartTable’.  In the third option, choose ‘contain an object’.  Click the ‘Go’ button.  You should be left with a SmartTable containing 84 rows, the B / K-12 strain ortholog pairs, as shown in the screen capture at the top of this post.
12.  Finally, you can save the results to a file by selecting the Export command from the Operations menu.  Choose ‘to Spreadsheet file’.  Since all of the genes have names in column one, you can export common names rather than frame IDs; common names are more readable in a spreadsheet.
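
If you would rather filter the exported file outside of BioCyc, the filter in steps 10 and 11 can be approximated with a few lines of Python.  This is only a minimal sketch: it assumes the export is a tab-delimited text file with a header row, and the function name, column names, and sample values below are hypothetical placeholders, not actual BioCyc output.

```python
import csv
import io


def rows_with_orthologs(table_text, ortholog_column):
    """Return the rows whose ortholog column is non-empty."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader if (row.get(ortholog_column) or "").strip()]


# Hypothetical three-row export; header names and values are made up
# for illustration and do not come from a real SmartTable.
sample_export = (
    "Gene Name\tAccession-1\tOrtholog\n"
    "geneA\tTAG_0001\torthA\n"
    "geneB\tTAG_0002\t\n"
    "geneC\tTAG_0003\torthC\n"
)

kept = rows_with_orthologs(sample_export, "Ortholog")
print([row["Gene Name"] for row in kept])  # ['geneA', 'geneC']
```

To use it on a real export, read the file's contents into a string and pass the header of the ortholog column as it appears in your file.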



As always, I welcome your questions and comments below.

Here is the example data.  Copy the area between the horizontal lines and paste it into a text editor, such as TextEdit or Atom, and save the file as 88EcoliGenes.txt.


thrL
thrA
thrB
thrC
yaaX
yaaA
yaaJ
talB
mog
satP
yaaI
dnaK
dnaJ
insL-1
mokC
hokC
nhaA
nhaR
ECB_00020
ECB_00021
ECB_00022
insA-1
insB-1
ECB_00025
ECB_00026
rpsT
yaaY
ribF
ileS
lspA
fkpB
ispH
rihC
dapB
carA
carB
caiF
caiE
caiD
caiC
caiB
caiA
caiT
fixA
fixB
fixC
fixX
yaaU
kefF
kefC
folA
apaH
apaG
rsmA
pdxA
surA
lptD
djlA
rluA
hepA
polB
araD
araA
araB
araC
yabI
thiQ
thiP
tbpA
sgrR
setA
leuD
leuC
leuB
leuA
leuL
leuO
ilvI
ilvH
cra
mraZ
rsmH
ftsL
ftsI
murE
murF
mraY
murD
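
Before uploading, it can be worth checking that the saved file contains what you expect.  The following is a minimal sketch in Python and is not part of BioCyc; the function name is hypothetical, and it simply assumes one gene name or identifier per line, with locus-tag identifiers from this genome starting with ‘ECB_’.

```python
def summarize_gene_list(text):
    """Count the entries in a gene list and separate names from ECB_ locus tags."""
    entries = [line.strip() for line in text.splitlines() if line.strip()]
    locus_tags = [e for e in entries if e.startswith("ECB_")]
    names = [e for e in entries if not e.startswith("ECB_")]
    # Entries that appear more than once would produce duplicate rows.
    duplicates = sorted({e for e in entries if entries.count(e) > 1})
    return {
        "total": len(entries),
        "names": names,
        "locus_tags": locus_tags,
        "duplicates": duplicates,
    }


# A few entries taken from the list above, with a stray blank line.
sample = "thrL\nthrA\n\nECB_00020\ninsA-1\n"
summary = summarize_gene_list(sample)
print(summary["total"], summary["locus_tags"])  # 4 ['ECB_00020']
```

Run on the contents of the file you saved in step 1, the total should be 88.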





Thursday, March 22, 2018

Listing Gene Identifiers and Accession Numbers from a BioCyc Genome


This post describes a way to gather the identifiers associated with a gene, which are stored under several different object properties in BioCyc (in some cases referred to as slots).  These identifiers are useful for verifying the identity of gene references between EcoCyc and other gene databases and catalogs.  Use of these identifiers is more reliable than depending on the gene names.

BioCyc PGDBs store identifiers for genes in several places.  These identifiers include the PGDB’s own BioCyc identifier, unification links to other databases (stored as PGDB database links), and locus tags from the GenBank entry for the genome.  These additional identifiers are stored as properties (slots) of the gene frame called Accession-1 and Accession-2.  Different PGDBs may assign different sets of identifiers to these slots, but using these slots provides a consistent way to access them.  In this post, I’ll discuss how to use SmartTables to build a list of identifiers associated with a set of genes.  I’ll use EcoCyc as an example PGDB, both because it uses both accession slots, and because the demonstration doesn’t require an active subscription.  Since it uses SmartTables, you will need a free BioCyc account to follow this demonstration.  Here’s a screenshot of the final table.  The final, full table is also linked here.

 Screen capture of SmartTable with gene identifiers


















This is the step-by-step procedure.

1.     Go to EcoCyc.org and log in.  If you are already logged in to BioCyc, change your organism to E. coli K-12 substr. MG1655.

2.     Now open the ‘Smart Tables’ menu and choose the ‘Special Smart Tables’ command.

3.     This will take you to a page with a list of special SmartTables corresponding to many types of entities a BioCyc user may find useful, such as all compounds, genes, or enzymes in E. coli.  Click on “All genes of E. coli...”, which will be the second row in the list.

4.     Since you are logged in, this will create an editable copy of the special SmartTable, which lists all the genes in E. coli, including each gene’s name and Accession-1 property, as well as its left and right boundaries in the genome and its product.  However, as I mentioned, there are additional properties with alternative identifiers.

5.     Above the table, there are three drop-down boxes.  The middle one is labeled ‘Add Property Column’.  Additional gene identifiers are available in two columns: Object ID and Accession-2.  Add an Object ID column by clicking on the drop down list and selecting the column by name.  The column will appear at the far right.  This is EcoCyc’s own internal identifier.  Repeat the process for Accession-2.  The Accession-1 and Accession-2 identifiers for E. coli are locus tags from two different naming systems.  As I mentioned, the particular identifiers used will be different in different PGDBs.

6.     You can also add identifiers from one or more external databases in the same way.  Use the Add Property Column dropdown and choose ‘Database Links’.  Now a window with a list of external databases appears.  You can select one, or select several while holding the appropriate key (control, or command on Macs).  In this example, I selected the EchoBASE database because it has alternate identifiers for many, though not all, of the genes in EcoCyc.  Click the ‘Go’ button to add the column(s).

7.     Once you have the columns, you can use the Operations Menu in the right sidebar to export the SmartTable to a file.  However, if you’re not interested in saving the coordinates or gene product columns, you can delete those columns by selecting them (click in the colored space immediately above the column name), then choosing the ‘delete column’ command, which appears in both the delete and column submenus in the Operations Menu.
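
Once the table is exported, one way to put the identifiers to work is to build a lookup from any identifier to the gene name(s) it belongs to.  The sketch below is a minimal illustration in Python under stated assumptions: the export is taken to be tab-delimited with a header row, the function name is hypothetical, and the identifier values are placeholders rather than real EcoCyc data.

```python
import csv
import io


def build_identifier_index(table_text, name_column, id_columns):
    """Map every non-empty identifier to the set of gene names that carry it."""
    index = {}
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        for column in id_columns:
            identifier = (row.get(column) or "").strip()
            if identifier:
                index.setdefault(identifier, set()).add(row[name_column])
    return index


# Placeholder rows; real exports will have different values and may have
# different header text.
sample_export = (
    "Gene Name\tAccession-1\tObject ID\tAccession-2\n"
    "geneA\tacc1-A\tobj-A\tacc2-A\n"
    "geneB\tacc1-B\tobj-B\t\n"
)

index = build_identifier_index(
    sample_export, "Gene Name", ["Accession-1", "Object ID", "Accession-2"]
)
print(index["obj-B"])  # {'geneB'}
```

An identifier that maps to more than one gene name is worth a closer look, since it suggests the identifier is not unique in the exported table.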

I hope you have found this discussion useful and I welcome questions and comments.


Monday, January 15, 2018

Local Downloading of BioCyc PGDBs for Pathway Tools Users


BioCyc offers thousands of PGDBs, currently 10,980 to be exact, and while subscribers can access all of these via the BioCyc web site, most desktop Pathway Tools users will want to work with only a small subset of them.


Installing the desktop version of Pathway Tools provides a number of advantages:
  • Create and edit PGDBs locally
  • Compare your PGDB(s) with similar organisms in BioCyc
  • Query PGDB data from Python, Perl, Java, or Lisp
  • Get faster performance
  • Develop metabolic models


Working with a PGDB in desktop Pathway Tools requires having a local copy. Downloading all of the nearly 11,000 PGDBs would take a lot of disk space and download time, so we provide two ways to get only the PGDBs you need.

The first download method is to download an appropriate bundle of PGDBs when you download Pathway Tools. Without a BioCyc subscription (if you only have an academic download license), you will only be able to download the bundle that includes MetaCyc and EcoCyc. With a BioCyc subscription, there are seven available bundles, although a few bundle options are not available for all operating systems. Also, the bundles for Windows will sometimes contain fewer PGDBs due to space constraints.



Bundle: EcoCyc and MetaCyc
Contents: EcoCyc, MetaCyc

Bundle: EcoCyc and MetaCyc +BsubCyc+YeastCyc
Contents: EcoCyc, MetaCyc, BsubCyc (Bacillus subtilis subtilis 168), Saccharomyces cerevisiae (YeastCyc)

Bundle: EcoCyc and MetaCyc +Mammals
Contents: EcoCyc, MetaCyc, HumanCyc, Mus musculus (MouseCyc)

Bundle: EcoCyc and MetaCyc +Tier2
Contents: EcoCyc, MetaCyc, BsubCyc, HumanCyc, YeastCyc, Bacillus anthracis Ames (AnthraCyc), Arabidopsis thaliana, Agrobacterium fabrum C58, Caulobacter crescentus CB15, Helicobacter pylori 26695, Leishmania major Friedlin, Mycobacterium tuberculosis H37Rv, M. tuberculosis CDC1551, Peptoclostridium difficile 630, Plasmodium falciparum 3D7, Shigella flexneri 2a str. 2457T (ShigellaCyc), Synechococcus elongatus PC 7942, Vibrio cholerae O1 biovar El Tor strain N16961

Bundle: EcoCyc and MetaCyc + Bacilli
Contents: EcoCyc, MetaCyc, AnthraCyc, BsubCyc, Bacillus amyloliquefaciens DSM 7, B. anthracis Sterne, B. atrophaeus 1942, B. cereus ATCC 14579, B. licheniformis DSM 13, B. megaterium QM B1551, B. pseudofirmus OF4, B. pumilus SAFR-032, B. subtilis spizizenii W23

Bundle: EcoCyc and MetaCyc + E. coli
Contents: EcoCyc, MetaCyc, ShigellaCyc, E. coli O157:H7 strain EDL933, E. coli CFT073, E. coli K-12 substr. W3110, E. coli UTI89, E. coli O157:H7 str. Sakai, E. coli B str. REL606, E. coli ATCC 8739, E. coli 536, E. coli BL21(DE3), E. coli HS, Salmonella enterica enterica LT2; SGSC 1412; ATCC 700720, Shigella flexneri 301, S. flexneri 8401, S. boydii Sb227, S. sonnei Ss046

Bundle: EcoCyc and MetaCyc + Mycobacteria
Contents: EcoCyc, MetaCyc, Amycolicicoccus subflavus DQS3-9A1, Mycobacterium avium 104, M. avium paratuberculosis K-10, M. bovis BCG Pasteur 1173P2, M. bovis Tokyo 172, M. gilvum PYR-GCK, M. leprae Br4923, M. marinum M, M. parascrofulaceum ATCC BAA-614, M. sp. MCS MCS, M. sp. JLS JLS, M. sp. KMS, M. gilvum Spyr1, M. sinense JDM601, M. tuberculosis H37Rv, M. tuberculosis CDC1551, M. tuberculosis F11, M. tuberculosis KZN 1435, M. ulcerans Agy99, M. vanbaalenii PYR-1





The second download method doesn't limit you to what's available in these pre-built bundles.  SRI maintains an online repository of all SRI-created PGDBs as well as dozens of PGDBs submitted by other researchers.  We call this repository the registry.  If you have a BioCyc subscription, you can use the built-in registry import command to download additional PGDBs from within Pathway Tools (Files->Import->Registry).  The importer provides a search capability: enter a substring and it will return a list of registry-resident PGDBs whose taxon names match the search string.  Select the taxa you want and go.  Note that after the loading is complete, you will want to close the importer window, which may be hidden by the main Pathway Tools window.  For more information about the registry, see chapter 6 of the Pathway Tools User's Guide.

We hope you have found this review of ways to get PGDBs helpful.  For most Pathway Tools users, the Registry is the most flexible and frequently the quickest way to set up your collection of PGDBs.

Monday, February 27, 2017

Subscribe to Update Notifications

BioCyc.org has a new capability to inform you of newly curated discoveries from the experimental literature in your areas of scientific interest.  These "update notifications" will be sent to you in a single email in conjunction with each of the three yearly BioCyc releases.

You can define your areas of interest in several ways:
  • By entering one or more specific genes or pathways of interest.
  • By entering a pathway class of interest, e.g., if you specify the MetaCyc Detoxification pathway class, you will receive updates about new or revised detoxification pathways from all domains of life that are curated in MetaCyc. 
  • By specifying a Gene Ontology term such as Cell Killing, you will receive updates when new genes are annotated to that biological process, or when the curation of existing genes under that term is updated.
Updates are triggered whenever a gene or pathway receives new literature citations.  Each update-notification request must be associated with a single Tier 1 or Tier 2 BioCyc database (organism) -- you cannot request notifications from all databases.

To enter new update-notification requests, log in to your BioCyc account and go to the update-notification page.

Wednesday, July 13, 2016

PythonCyc: Using the Pathway Tools Python API

Pathway Tools is implemented using the Common Lisp (CL) programming language, but the PythonCyc package creates a bridge between Python and CL. That is, the PythonCyc package allows you to interact with Pathway Tools using the Python language. With PythonCyc you can write Python programs to execute Pathway Tools metabolic models, as well as to extract and modify data stored in Pathway/Genome Databases (PGDBs). It is also possible to call from Python many functions defined in Pathway Tools that manipulate genes, pathways, reactions, proteins, and more.

Wednesday, June 8, 2016

Bulk Updates to Your PGDB

One question that we frequently receive is about how to apply bulk updates to a PGDB. This kind of situation can come about for several reasons:
  • When a group maintains and curates organism data on an ongoing basis using their own software or database environment, and then wants to update a PGDB with all their changes in a single batch operation.
  • When a revised annotation for an organism is made available, and a user wishes to update their PGDB with the new data without losing any existing curation.
  • When a user has some systematic change that they want to apply to a large number of objects, such as a change to the locus id format, the addition of a new set of synonyms, or adding links to a new external database.
  • When a user wants to import a large dataset obtained via a high-throughput experiment or computational prediction, such as for protein cellular location or transcription factor binding sites.
Because these are all common scenarios, it seems worthwhile to provide an overview of the various ways that Pathway Tools supports bulk updating of PGDBs.  Note that none of the features discussed here are particularly new, and all have been supported by Pathway Tools for several years.  All User Guide section numbers referenced below are for version 20.0.

It should first be noted that Pathway Tools comes with a full suite of editing and curation tools, so if you have only a handful of changes to make, you should use those to make the edits interactively. The techniques described in this article would normally only be used if you have so many updates that it would be tedious to make the edits manually. 

Wednesday, April 13, 2016

BioCyc to Adopt Subscription Model

BioCyc seeks the support of the scientific community as we begin a new chapter in the development of this bioinformatics resource.

We plan to upgrade the curation level and quality of many BioCyc databases to provide scientists with higher quality information resources for many important microbes, and for Homo sapiens. Such an effort requires large financial resources that -- despite numerous attempts over numerous years -- have not been forthcoming from government funding agencies. Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide scientists with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools. Our work on EcoCyc has demonstrated the way forward. EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications. EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model. EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms. BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that curation occurs irregularly. Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information. The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information. EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape. Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace. That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases. Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992. Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish. For the past twenty years, Peter Karp has been advocating that the government fund similar efforts for other important microbes, such as pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development. He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1]. Funding for BioCyc was cut by 27% at our last renewal, even though we now manage five times as many genomes as we did five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995. Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen. Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year. No curated database exists for the important gram-positive model organism Bacillus subtilis. How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one. We are now turning to a similar model to support the curation and operation of BioCyc. We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution. We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis -- more frequently than they consult most journals. We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world. We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model. Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users. Phoenix Bioinformatics will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.
We plan to go slow with this transition to give our users time to adapt. We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now. Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals. One subscription will grant access to all of BioCyc. To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.
Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them. We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.

Please spread the word to your colleagues -- the more groups who subscribe, the better quality resource we can build for the scientific community.