This post shows how to retrieve the orthologous genes in a second organism,
given a list of gene names and/or identifiers from a first organism.
For example, you could use this procedure to
find the orthologs in EcoCyc of a set of genes of uncertain function in another
organism, potentially providing insights into their functions.
The BioCyc project generates a database of orthologous genes
between many of the organism Pathway/Genome Databases (PGDBs) we maintain. These orthologs are stored in a central MySQL
database that is separate from the individual BioCyc PGDBs. This post describes a way to retrieve
orthologs for a list of genes using a SmartTable. We don’t recommend generating an ortholog list
for a whole genome because of performance problems that, as of June 2018, we
are working to resolve.
Since we will be using a PGDB in addition to EcoCyc, you will need a BioCyc subscription (not
just a free account) to follow this demonstration. Here’s a screenshot of the final SmartTable,
which you can access directly here. We will begin with a file of 88 genes from E. coli B str. REL606 and determine their orthologs in EcoCyc. The columns in the resulting SmartTable are
as follows: the gene names from the input file, their ‘ECB’ accession IDs,
the names of the orthologous genes from EcoCyc, and the ‘b-number’ accession
IDs from EcoCyc. We also show how to add a column containing gene product names.
Here is the step-by-step
procedure.
1. Go to the bottom of this post and copy and paste
the list of gene identifiers into a text editor such as TextEdit or Atom. Save the file as 88EcoliGenes.txt.
2. Go to BioCyc.org, log in, and use ‘change organism database’
to switch the current organism to Escherichia coli B str. REL606.
3. Open the ‘Smart Tables’ menu and choose the ‘My
Smart Tables’ command. It doesn’t actually matter which of the commands
in the menu you choose.
4. You’ll find the operations menu in the upper
right corner of the SmartTables page.
Under the ‘New’ command, you’ll find ‘Smart Table from Uploaded File’. Choose that command.
5. In the resulting pop-up window, click the ‘Choose
File’ button and select the file you saved in step 1. Once you have located and selected the file,
review the options below the ‘Choose File’ button in the upload window.
Since this is a file of gene identifiers, keep the ‘Try to
make objects of type’ box checked, along with the radio button next to ‘Gene’. Also leave the other two check boxes checked.
Click the ‘Upload’ button and a new
SmartTable with a single column will appear.
If you see a warning message, ignore it and continue.
6. Add accession numbers for the genes by locating
the ‘ADD PROPERTY COLUMN’ dropdown menu and choosing ‘Accession-1’ from the
list. You’ll see an ECB identifier
for each of the genes on the first page.
7. Now that the table has some basic identifiers
for the E. coli B str. REL606 genes, you can proceed to adding orthologs from EcoCyc. Find ‘ADD TRANSFORM COLUMN’ and choose
the second transform, ‘Compare – map to other species PGDB’. That will bring up a pop-up list of species,
which will be rather long. Scroll down
through the list until you find ‘Escherichia coli K-12 substr. MG1655’. Once you have selected the right organism,
click the ‘Go’ button.
8. Now a column containing ortholog gene names
appears. Each entry is linked to the
gene page in its organism’s PGDB. Try it
by clicking on thrL, and notice the organism in the upper left of the gene
page. Clicking the browser’s back
button will take you back to the table and restore the current PGDB to EcoCyc.
9. To add a column containing the accession number of each orthologous gene,
select the ortholog column by clicking anywhere in the column header
except where the title is displayed.
When the column is selected, its header turns darker (and the ‘Gene Name’ column
lightens, as it is no longer the selected column). Then,
as in step 6, choose ‘Accession-1’ from
the dropdown below ‘Add Property Column.’
A different list of accession numbers appears. Notice that both lists of accession numbers
are simply strings that don’t link to any genes. A column of product names for the orthologs
can be added in a similar fashion.
10. Next,
suppose you are interested only in those B strain genes that have orthologs in the
K-12 strain. To remove genes lacking orthologs, select the third column (‘Map to Escherichia…’). Now look over to the operations menu on the
right. Select the ‘Filter’ command. A pop-up window titled ‘Column text-value
filter’ will appear.
11. This
dialog has a series of options chained together in a ‘sentence.’ In the second drop-down, change ‘a copy of …’
to ‘this SmartTable’. In the third
drop-down, choose ‘contain an object’.
Click the ‘Go’ button. You should
be left with a SmartTable containing 84 rows, one for each B / K-12 strain ortholog
pair, as shown in the screen capture at the top of this post.
12. Finally,
you can save the results to a file by selecting the Export command from the
Operations menu. Choose ‘to Spreadsheet
file’. Since all of the genes have names in column one, you can save them as
common names, which are more readable in a spreadsheet than frame IDs.
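The filter from steps 10 and 11 can also be reproduced offline on an exported table. The sketch below is only an illustration, not part of the BioCyc workflow. It assumes the export is a tab-delimited text file with a header row; the column names, the function name `rows_with_orthologs`, and the sample rows are all hypothetical placeholders (not real BioCyc output), so adjust them to match your own file.

```python
import csv
import io

# Hypothetical stand-in for an exported SmartTable: tab-delimited, one header row.
# Every value here is a placeholder, not real BioCyc output.
SAMPLE_EXPORT = (
    "Gene Name\tAccession-1\tOrtholog\tOrtholog Accession-1\n"
    "geneA\tID_A\torthoA\tORTHO_ID_A\n"
    "geneB\tID_B\t\t\n"
    "geneC\tID_C\torthoC\tORTHO_ID_C\n"
)


def rows_with_orthologs(handle, ortholog_column):
    """Return the rows whose ortholog column is non-empty."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if (row.get(ortholog_column) or "").strip()]


# geneB has an empty ortholog cell, so it is dropped.
kept = rows_with_orthologs(io.StringIO(SAMPLE_EXPORT), "Ortholog")
for row in kept:
    print(row["Gene Name"], row["Ortholog"])
```

To run this against a real export, replace the `io.StringIO(...)` argument with an open file handle and pass the actual header text of your ortholog column.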
As always, I welcome your
questions and comments below.
Here is the example data.
Copy the list of identifiers below and paste it into a text editor,
such as TextEdit or Atom, and save the file as 88EcoliGenes.txt.
thrL
thrA
thrB
thrC
yaaX
yaaA
yaaJ
talB
mog
satP
yaaI
dnaK
dnaJ
insL-1
mokC
hokC
nhaA
nhaR
ECB_00020
ECB_00021
ECB_00022
insA-1
insB-1
ECB_00025
ECB_00026
rpsT
yaaY
ribF
ileS
lspA
fkpB
ispH
rihC
dapB
carA
carB
caiF
caiE
caiD
caiC
caiB
caiA
caiT
fixA
fixB
fixC
fixX
yaaU
kefF
kefC
folA
apaH
apaG
rsmA
pdxA
surA
lptD
djlA
rluA
hepA
polB
araD
araA
araB
araC
yabI
thiQ
thiP
tbpA
sgrR
setA
leuD
leuC
leuB
leuA
leuL
leuO
ilvI
ilvH
cra
mraZ
rsmH
ftsL
ftsI
murE
murF
mraY
murD
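Before uploading, it can help to sanity-check the saved list. The following is a minimal sketch, not part of the BioCyc workflow: it assumes one identifier per line and treats anything starting with `ECB_` as an accession ID rather than a gene name. The function name `summarize_identifiers` is made up for illustration.

```python
from collections import Counter


def summarize_identifiers(text):
    """Split a one-per-line identifier list into names, ECB accession IDs, and duplicates."""
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    counts = Counter(ids)
    return {
        "total": len(ids),
        "accession_ids": [i for i in ids if i.startswith("ECB_")],
        "gene_names": [i for i in ids if not i.startswith("ECB_")],
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }


# A short excerpt of the list above. To check the whole file, pass in
# open("88EcoliGenes.txt").read() instead and look for a total of 88.
excerpt = "nhaA\nnhaR\nECB_00020\nECB_00021\nECB_00022\ninsA-1\n"
summary = summarize_identifiers(excerpt)
print(summary["total"], summary["accession_ids"], summary["duplicates"])
```

A non-empty `duplicates` list or an unexpected total usually means something went wrong while copying the list into the editor.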