Friday, June 29, 2018

Generating a SmartTable of Orthologous Genes Across Multiple BioCyc Genomes

This post shows how, given a list of gene names and/or identifiers from one organism, to retrieve the orthologous genes in a second organism.    For example, this procedure could be used to find the orthologs in EcoCyc of a set of genes of uncertain function in another organism, potentially providing insights about their functions. 

The BioCyc project generates a database of orthologous genes between many of the organism Pathway/Genome Databases (PGDBs) we maintain.  These orthologs are stored in a central MySQL database that is separate from the individual BioCyc PGDBs.  This post describes a way to retrieve orthologs for a list of genes using a SmartTable.  We don’t recommend generating an ortholog list for a whole genome because of performance problems that, as of June 2018, we are working to resolve.

Since we will be using a PGDB in addition to EcoCyc, you will need a BioCyc subscription (not just a free account) to follow this demonstration.  Here’s a screenshot of the final SmartTable, which you can access directly here.  We will begin with a file of 88 genes from E. coli B str. REL606 and determine their orthologs in EcoCyc.  The columns in the resulting SmartTable are as follows: the gene names from the input file, their ‘ECB’ accession IDs, the names of the orthologous genes from EcoCyc, and the ‘b-number’ accession IDs from EcoCyc.  We also show how to add a column containing gene product names.

Here is the step-by-step procedure.

1.     Go to the bottom of this post, copy the list of gene identifiers, and paste it into a text editor such as TextEdit or Atom.  Save the file as 88EcoliGenes.txt.
2.     Go to BioCyc and log in, then use ‘change organism database’ to change your organism to Escherichia coli B str. REL606.
3.     Open the ‘Smart Tables’ menu and choose the ‘My Smart Tables’ command.  It actually doesn’t matter which of the commands in the menu you choose.
4.     You’ll find the operations menu in the upper right corner of the SmartTables page.  Under the ‘New’ command, you’ll find ‘Smart Table from Uploaded File’.  Choose that command. 
5.     In the resulting pop-up window, click the ‘Choose File’ button and select the file you saved in step 1.  Once you have selected the file, review the options below the ‘Choose File’ button in the upload window.  Since this is a file of gene identifiers, you should keep the ‘Try to make objects of type’ box checked, along with the radio button next to ‘Gene’.  Also leave the other two check boxes checked.  Click the ‘Upload’ button and a new SmartTable with a single column will appear.  If you see a warning message, ignore it and continue.
6.     Add accession numbers for the genes: locate the ‘ADD PROPERTY COLUMN’ dropdown menu and choose ‘Accession-1’ from the list.  You’ll see the ECB identifier for each of the genes on the first page.
7.     Now that the table has some basic identifiers for the E. coli B str. REL606 genes, you can proceed to adding orthologs from EcoCyc.  Find the ‘ADD TRANSFORM COLUMN’ dropdown and choose the second transform, ‘Compare – map to other species PGDB’.  That will bring up a pop-up list of species, which will be rather long.  Scroll down through the list until you find ‘Escherichia coli K-12 substr. MG1655’.  Once you have selected the right organism, click the ‘Go’ button.
8.     Now a column containing ortholog gene names appears.  Each entry is linked to the gene page in its organism’s PGDB.  Try it by clicking on thrL, and notice the organism in the upper left of the gene page.  Clicking on the browser back button will take you back to the table and restore the current PGDB to EcoCyc.
9.     To add a column containing accession numbers of each orthologous gene, select the ortholog column by clicking anywhere in the column header except where the title is displayed.   When the column is selected the header turns darker (and the ‘Gene Name’ column lightens as it is no longer the selected column).  Then, as in step 6,  choose ‘Accession-1’ from the dropdown below ‘Add Property Column.’  A different list of accession numbers appears.  Notice that both lists of accession numbers are simply strings that don’t link to any genes.  A column of product names for the orthologs can be added in a similar fashion.
10.  Next, suppose you were interested only in those B strain genes having orthologs in the K-12 strain.  To remove genes lacking orthologs, select the third column (‘Map to Escherichia…’).  Now look over to the operations menu on the right and select the ‘Filter’ command.  A pop-up window titled ‘Column text-value filter’ will appear.
11.  This dialog has a series of options chained together in a ‘sentence.’  In the second drop-down, change ‘a copy of …’ to ‘this SmartTable’.  In the third option, choose ‘contain an object’.  Click the ‘Go’ button.  You should be left with a SmartTable containing 84 rows, the B / K-12 strain ortholog pairs, as shown in the screen capture at the top of this post.
12.  Finally, you can save the results to a file by selecting the Export command from the Operations menu.  Choose ‘to Spreadsheet file’.  Since all of the genes in column one have names, you can export common names, which are more readable in a spreadsheet than frame IDs.
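
If you would rather do the filtering from steps 10 and 11 on the exported file instead of in the browser, here is a minimal Python sketch of the same idea.  It assumes the export is tab-delimited with a header row and that the ortholog column is blank for genes without an ortholog.  The column names and sample rows are made up for illustration, so check them against your own export.

```python
import csv
import io


def rows_with_orthologs(handle, ortholog_column):
    """Return the rows whose ortholog column is non-empty."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if (row.get(ortholog_column) or "").strip()]


# Hypothetical sample standing in for an exported SmartTable.
# Column names and values are illustrative, not real export output.
SAMPLE_EXPORT = (
    "Gene Name\tAccession-1\tOrtholog\tOrtholog Accession-1\n"
    "geneA\tACC-0001\tgeneA\tb-0001\n"
    "geneB\tACC-0002\t\t\n"
    "geneC\tACC-0003\tgeneC\tb-0003\n"
)

kept = rows_with_orthologs(io.StringIO(SAMPLE_EXPORT), "Ortholog")
for row in kept:
    print(row["Gene Name"], row["Accession-1"], row["Ortholog"], sep="\t")
```

To run it on a real export, pass an open file handle in place of the io.StringIO sample and use the ortholog column title as it appears in your file.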

As always, I welcome your questions and comments below.

Here is the example data.  Copy the area between the horizontal lines, paste it into a text editor such as TextEdit or Atom, and save the file as 88EcoliGenes.txt.
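
Before uploading, it can help to confirm that the pasted file holds what you expect.  The sketch below assumes one identifier per line, and drops blank lines, stray whitespace, and repeats.  The pasted text in the example is a made-up stand-in, not the actual data set.

```python
def read_gene_identifiers(text):
    """Return unique, non-blank identifiers in their original order."""
    seen = set()
    identifiers = []
    for line in text.splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            identifiers.append(name)
    return identifiers


# Hypothetical pasted content with a blank line, trailing space, and a repeat.
pasted = "thrL\nthrA\n\nthrB \nthrA\n"
genes = read_gene_identifiers(pasted)
print(len(genes), "identifiers:", genes)
```

For the file used in this post, read the saved file's contents into the function and check that the count matches the 88 genes you expect.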


Thursday, March 22, 2018

Listing Gene Identifiers and Accession Numbers from a BioCyc Genome

This post describes a way to gather the identifiers associated with a gene, which are stored under several different object properties in BioCyc (in some cases referred to as slots).  These identifiers are useful for verifying the identity of gene references between EcoCyc and other gene databases and catalogs.  Use of these identifiers is more reliable than depending on the gene names.

BioCyc PGDBs store identifiers for genes in several places.  These identifiers include the PGDB’s own BioCyc identifier, unification links to other databases (stored as PGDB database links), and locus tags from the GenBank entry for the genome.  These additional identifiers are stored as properties (slots) of the gene frame called Accession-1 and Accession-2.  Different PGDBs may assign different sets of identifiers to these slots, but using these slots provides a consistent way to access them.  In this post, I’ll discuss how to use SmartTables to build a list of identifiers associated with a set of genes.  I’ll use EcoCyc as an example PGDB, both because it uses both accession slots, and because the demonstration doesn’t require an active subscription.  Since it uses SmartTables, you will need a free BioCyc account to follow this demonstration.  Here’s a screenshot of the final table.  The final, full table is also linked here.

 Screen capture of SmartTable with gene identifiers

This is the step-by-step procedure.

1.     Go to BioCyc and log in.  If you are already logged in to BioCyc, change your organism to E. coli K-12 substr. MG1655.

2.     Now open the ‘Smart Tables’ menu and choose the ‘Special Smart Tables’ command.

3.     This will take you to a page with a list of special SmartTables corresponding to many types of entities a BioCyc user may find useful, such as all compounds, genes, or enzymes in E. coli.  Click on ‘All genes of E. coli...’, which will be the second row in the list.

4.     Since you are logged in, this will create an editable copy of the special SmartTable, which lists all the genes in E. coli, including the gene’s name and the Accession-1 property, as well as the left and right boundaries in the genome and the gene’s product.  However, as I mentioned, there are additional properties with alternative identifiers.

5.     Above the table, there are three drop-down boxes.  The middle one is labeled ‘Add Property Column’.  Additional gene identifiers are available in two columns: Object ID and Accession-2.  Add an Object ID column by clicking on the drop down list and selecting the column by name.  The column will appear at the far right.  This is EcoCyc’s own internal identifier.  Repeat the process for Accession-2.  The Accession-1 and Accession-2 identifiers for E. coli are locus tags from two different naming systems.  As I mentioned, the particular identifiers used will be different in different PGDBs.

6.     You can also add identifiers from one or more external databases in the same way.  Use the Add Property Column dropdown and choose ‘Database Links’.  Now a window with a list of external databases appears.  You can select one, or several while holding the appropriate key (Control, or Command on Macs).  In this example, I selected the EchoBASE database because it has alternate identifiers for many, though not all, of the genes in EcoCyc.  Click the ‘Go’ button to add the column(s).

7.     Once you have the columns, you can use the Operations Menu in the right sidebar to export the SmartTable to a file.  However, if you’re not interested in saving the coordinates or gene product columns, you can delete those columns by selecting them (click in the colored space immediately above the column name), then choosing the ‘delete column’ command, which appears in both the delete and column submenus in the Operations Menu.
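
Once the table is exported, it can work as a cross-reference outside BioCyc.  Here is a minimal Python sketch that maps every identifier in the table back to its gene name, so a locus tag from another catalog can be checked against EcoCyc.  It assumes a tab-delimited export with a header row.  The column titles follow the ones used in this post but may differ in your file, and the sample rows are invented.

```python
import csv
import io

# Column titles assumed to match the SmartTable columns added above.
ID_COLUMNS = ["Object ID", "Accession-1", "Accession-2"]


def build_identifier_index(handle, name_column="Gene Name", id_columns=ID_COLUMNS):
    """Map each non-empty identifier in the table to its gene name."""
    index = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in id_columns:
            identifier = (row.get(column) or "").strip()
            if identifier:
                index[identifier] = row[name_column]
    return index


# Hypothetical export; identifiers are placeholders, not real EcoCyc values.
SAMPLE = (
    "Gene Name\tAccession-1\tObject ID\tAccession-2\n"
    "geneA\tACC1-0001\tOBJ-0001\tACC2-0001\n"
    "geneB\tACC1-0002\tOBJ-0002\t\n"
)

index = build_identifier_index(io.StringIO(SAMPLE))
print(index["ACC2-0001"])
print(index.get("ACC2-0002"))
```

Genes with an empty slot, like the second sample row, simply contribute fewer keys, which matches the note above that not every gene has every identifier.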

I hope you have found this discussion useful and I welcome questions and comments.

Monday, January 15, 2018

Local Downloading of BioCyc PGDBs for Pathway Tools Users

BioCyc offers thousands of PGDBs, currently 10,980 to be exact, and while subscribers can access all of these via the BioCyc web site, most desktop Pathway Tools users will want to work with a limited subset.

Installing the desktop version of Pathway Tools provides a number of advantages:
  • Ability to create and edit PGDBs locally
  • Compare your PGDB(s) with similar organisms in BioCyc
  • Ability to query PGDB data from Python, Perl, Java, Lisp
  • Faster speed
  • Develop metabolic models 

Working with a PGDB in desktop Pathway Tools requires having a local copy.  Downloading all of those nearly 11,000 PGDBs would take a lot of disk space and download time, so we provide two ways to get only the PGDBs you need.

The first download method is to download an appropriate bundle of PGDBs when you download Pathway Tools.  Without a subscription (if you only have an academic download license), you will only be able to download the bundle that includes MetaCyc and EcoCyc.  With a BioCyc subscription, there are seven available bundles, although a few bundle options are not available for all operating systems.  Also, the bundles for Windows will sometimes contain fewer PGDBs due to space constraints.

EcoCyc and MetaCyc
EcoCyc, MetaCyc

EcoCyc and MetaCyc + BsubCyc + YeastCyc
EcoCyc, MetaCyc, BsubCyc (Bacillus subtilis subtilis 168), Saccharomyces cerevisiae (YeastCyc)

EcoCyc and MetaCyc + Mammals
EcoCyc, MetaCyc, HumanCyc, Mus musculus (MouseCyc)

EcoCyc and MetaCyc + Tier2
EcoCyc, MetaCyc, BsubCyc, HumanCyc, YeastCyc, Bacillus anthracis Ames (AnthraCyc), Arabidopsis thaliana, Agrobacterium fabrum C58, Caulobacter crescentus CB15, Helicobacter pylori 26695, Leishmania major Friedlin, Mycobacterium tuberculosis H37Rv, M. tuberculosis CDC1551, Peptoclostridium difficile 630, Plasmodium falciparum 3D7, Shigella flexneri 2a str. 2457T (ShigellaCyc), Synechococcus elongatus PC 7942, Vibrio cholerae O1 biovar El Tor strain N16961

EcoCyc and MetaCyc + Bacilli
EcoCyc, MetaCyc, AnthraCyc, BsubCyc, Bacillus amyloliquefaciens DSM 7, B. anthracis Sterne, B. atrophaeus 1942, B. cereus ATCC 14579, B. licheniformis DSM 13, B. megaterium QM B1551, B. pseudofirmus OF4, B. pumilus SAFR-032, B. subtilis spizizenii W23

EcoCyc and MetaCyc + E. coli
EcoCyc, MetaCyc, ShigellaCyc, E. coli O157:H7 strain EDL933, E. coli CFT073, E. coli K-12 substr. W3110, E. coli UTI89, E. coli O157:H7 str. Sakai, E. coli B str. REL606, E. coli ATCC 8739, E. coli 536, E. coli BL21(DE3), E. coli HS, Salmonella enterica enterica LT2; SGSC 1412; ATCC 700720, Shigella flexneri 301, S. flexneri 8401, S. boydii Sb227, S. sonnei Ss046

EcoCyc and MetaCyc + Mycobacteria     
EcoCyc, MetaCyc, Amycolicicoccus subflavus DQS3-9A1, M. avium 104, M. avium paratuberculosis K-10, M. bovis BCG Pasteur 1173P2, M. bovis Tokyo 172, M. gilvum PYR-GCK, M. leprae Br4923, M. marinum M, M. parascrofulaceum ATCC BAA-614, M. sp. MCS MCS, M. sp. JLS JLS, M. sp. KMS, M. gilvum Spyr1, M. sinense JDM601, Mycobacterium tuberculosis H37Rv, M. tuberculosis CDC1551, M. tuberculosis F11, M. tuberculosis KZN 1435, M. ulcerans Agy99, M. vanbaalenii PYR-1
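
If you are deciding between bundles, one way to think about it is to pick the smallest bundle that contains every PGDB you need.  The Python sketch below illustrates that with only the first three bundles from the list above transcribed as sets.  The function name and the data layout are my own choices for this example, not anything built into Pathway Tools.

```python
# Only the first three bundles from the list above, transcribed by hand.
BUNDLES = {
    "EcoCyc and MetaCyc": {"EcoCyc", "MetaCyc"},
    "EcoCyc and MetaCyc + BsubCyc + YeastCyc": {
        "EcoCyc", "MetaCyc", "BsubCyc", "YeastCyc",
    },
    "EcoCyc and MetaCyc + Mammals": {
        "EcoCyc", "MetaCyc", "HumanCyc", "MouseCyc",
    },
}


def smallest_covering_bundle(wanted, bundles=BUNDLES):
    """Return the name of the smallest bundle holding every wanted PGDB, or None."""
    needed = set(wanted)
    candidates = [name for name, contents in bundles.items() if needed <= contents]
    if not candidates:
        return None
    return min(candidates, key=lambda name: len(bundles[name]))


print(smallest_covering_bundle(["EcoCyc"]))
print(smallest_covering_bundle(["HumanCyc", "MetaCyc"]))
```

When no single bundle covers what you need, the function returns None, which is the situation where downloading individual PGDBs makes more sense than picking a bundle.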

The second download method doesn't limit you to what's available in these pre-built bundles.  SRI maintains an online repository of all SRI-created PGDBs as well as dozens of PGDBs submitted by other researchers.  We call this repository the registry.  If you have a BioCyc subscription, you can use the built-in registry import command to download additional PGDBs from within Pathway Tools (Files->Import->Registry).  The importer provides a search capability: enter a substring and it will return a list of registry-resident PGDBs whose taxon names match the search string.  Select the taxa you want and go.  Note that after the loading is complete, you will want to close the importer window, which may be hidden by the main Pathway Tools window.  For more information about the registry, see chapter 6 of the Pathway Tools User's Guide.
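
To give a feel for how a substring search over taxon names behaves, here is a small Python sketch.  It is only an illustration and not the importer's actual code.  Ignoring case is an assumption on my part, and the taxon list is a handful of names borrowed from the bundle lists above.

```python
def search_taxa(taxon_names, query):
    """Return the names containing the query, ignoring case (an assumption)."""
    needle = query.casefold()
    return [name for name in taxon_names if needle in name.casefold()]


# A few taxon names taken from the bundle lists in this post.
TAXA = [
    "E. coli B str. REL606",
    "Helicobacter pylori 26695",
    "Mycobacterium tuberculosis H37Rv",
    "Leishmania major Friedlin",
    "Plasmodium falciparum 3D7",
    "Bacillus amyloliquefaciens DSM 7",
]

print(search_taxa(TAXA, "myco"))
print(search_taxa(TAXA, "bac"))
```

Notice that a short query like "bac" matches anywhere in the name, so it picks up Helicobacter and Mycobacterium as well as Bacillus.  A longer, more specific substring keeps the result list manageable.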

We hope you have found this review of ways to get PGDBs helpful.  For most Pathway Tools users, the Registry is the most flexible and frequently the quickest way to set up your collection of PGDBs.