Thursday, March 22, 2018

Listing Gene Identifiers and Accession Numbers from a BioCyc Genome


This post describes a way to gather the identifiers associated with a gene, which are stored under several different object properties in BioCyc (in some cases referred to as slots).  These identifiers are useful for verifying the identity of gene references between EcoCyc and other gene databases and catalogs.  Use of these identifiers is more reliable than depending on the gene names.

BioCyc PGDBs store identifiers for genes in several places.  These identifiers include the PGDB’s own BioCyc identifier, unification links to other databases (stored as PGDB database links), and locus tags from the GenBank entry for the genome.  The locus tags are stored in two properties (slots) of the gene frame, called Accession-1 and Accession-2.  Different PGDBs may assign different sets of identifiers to these slots, but the slots themselves provide a consistent way to access them.

In this post, I’ll discuss how to use SmartTables to build a list of the identifiers associated with a set of genes.  I’ll use EcoCyc as the example PGDB, both because it uses both accession slots and because the demonstration doesn’t require an active subscription.  Since the procedure uses SmartTables, you will need a free BioCyc account to follow along.  Here’s a screenshot of the final table.  The final, full table is also linked here.

[Screen capture of SmartTable with gene identifiers]

This is the step-by-step procedure.

1.     Go to EcoCyc.org and log in.  If you are already logged in to BioCyc, change your organism to E. coli K-12 substr. MG1655.

2.     Now open the ‘SmartTables’ menu and choose the ‘Special SmartTables’ command.

3.     This will take you to a page listing special SmartTables for many types of entities a BioCyc user may find useful, such as all compounds, genes, or enzymes in E. coli.  Click on “All genes of E. coli...”, which will be the second row in the list.

4.     Since you are logged in, this will create an editable copy of the special SmartTable, which lists all the genes in E. coli, including each gene’s name and Accession-1 property, as well as its left and right boundaries in the genome and its product.  However, as I mentioned, there are additional properties with alternative identifiers.

5.     Above the table, there are three drop-down boxes.  The middle one is labeled ‘Add Property Column’.  Additional gene identifiers are available in two columns: Object ID and Accession-2.  Add an Object ID column by clicking on the drop-down list and selecting the column by name.  The column will appear at the far right.  This is EcoCyc’s own internal identifier.  Repeat the process for Accession-2.  The Accession-1 and Accession-2 identifiers for E. coli are locus tags from two different naming systems.  As I mentioned, the particular identifiers stored in these slots will differ between PGDBs.

6.     You can also add identifiers from one or more external databases in the same way.  Use the ‘Add Property Column’ drop-down and choose ‘Database Links’.  A window with a list of external databases appears.  You can select one database, or several while holding the appropriate key (Control, or Command on Macs).  In this example, I selected the EchoBASE database because it has alternate identifiers for many, though not all, of the genes in EcoCyc.  Click the ‘Go’ button to add the column(s).

7.     Once you have the columns, you can use the Operations menu in the right sidebar to export the SmartTable to a file.  However, if you’re not interested in saving the coordinates or gene product columns, you can delete them first: select each column (click in the colored space immediately above the column name), then choose the ‘delete column’ command, which appears in both the delete and column submenus of the Operations menu.
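
Once the table is exported, the identifiers can be used for lookups outside BioCyc.  The following is only a minimal sketch in Python, under some assumptions: that the export is a tab-delimited text file with a header row, and that the column headers match the names used here (‘Gene Name’, ‘Object ID’, ‘Accession-1’, ‘Accession-2’, ‘EchoBASE’).  Check the header row of your own export, since the actual labels may differ.  The function name build_id_index and the sample rows are hypothetical placeholders, not real EcoCyc records.

```python
import csv
import io

# Assumed column headers; compare them with the header row of your own export.
ID_COLUMNS = ["Object ID", "Accession-1", "Accession-2", "EchoBASE"]


def build_id_index(tsv_text, name_column="Gene Name", id_columns=ID_COLUMNS):
    """Map every identifier found in the ID columns to its gene name."""
    index = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for column in id_columns:
            value = (row.get(column) or "").strip()
            if value:  # some cells are empty, e.g. genes with no external link
                index[value] = row[name_column]
    return index


# Placeholder rows for illustration only; these are not real EcoCyc records.
SAMPLE_EXPORT = (
    "Gene Name\tAccession-1\tObject ID\tAccession-2\tEchoBASE\n"
    "geneA\tACC1-0001\tOBJ-0001\tACC2-0001\tXREF-0001\n"
    "geneB\tACC1-0002\tOBJ-0002\tACC2-0002\t\n"
)

index = build_id_index(SAMPLE_EXPORT)
print(index["ACC2-0002"])  # look up a gene by any one of its identifiers
```

To use this with a real export, read the file’s contents into a string and pass it to build_id_index in place of SAMPLE_EXPORT.  Indexing on every identifier column at once means a gene reference from another database can be resolved without relying on the gene name.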

I hope you have found this discussion useful and I welcome questions and comments.