Tuesday, April 30, 2013

Metagenomics, PathoLogic and Pathway Abundance

Pathway abundance is a new parameter computed by PathoLogic, the tool that generates PGDBs in Pathway Tools. It is available starting with version 17.0 (March 2013) of Pathway Tools. Pathway abundances are computed from gene abundances supplied in metagenomics datasets, and are useful for comparing the metabolic profiles of different microbial communities.

Gene abundances are specified in the annotated genome file. Only the PathoLogic file format supports gene abundances; the GenBank format does not. See Section “The PathoLogic File Format” in the Pathway Tools User Guide for more information about the PathoLogic format and how to specify the abundance attribute for a gene.

PathoLogic does no preprocessing of the gene abundances. That is, all gene abundances are taken as specified, without any filtering such as outlier removal. We assume that any preprocessing of gene abundances has been done before the annotated genome file is generated. In particular, if some gene abundances are too low to count toward the pathway abundances, they should be omitted from the file.
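Because filtering is left to the user, a preprocessing step along the following lines could be run before the annotated genome file is written. This is only a minimal sketch: the function name, the dictionary layout and the threshold value are illustrative assumptions, not part of Pathway Tools.

```python
from typing import Dict


def drop_low_abundance(gene_abundance: Dict[str, float],
                       min_abundance: float) -> Dict[str, float]:
    """Return only the genes whose abundance is at least min_abundance.

    Genes below the (hypothetical) cutoff are left out entirely, so that
    no abundance would be written for them in the annotated genome file.
    """
    return {gene: value
            for gene, value in gene_abundance.items()
            if value >= min_abundance}


# Illustrative gene names and values only.
raw = {"gene-1": 0.5, "gene-2": 10.0, "gene-3": 3.0}
filtered = drop_low_abundance(raw, min_abundance=1.0)
```

The appropriate cutoff, if any, depends on the dataset and on how the abundances were measured.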

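The abundance of a pathway is computed from the abundances of the genes involved in the pathway. More precisely, let R be the set of reactions in pathway P for which gene abundances are specified, |R| the size of R, G_r the set of genes with specified abundances that catalyze reaction r, and a_g the given abundance of gene g. The abundance of pathway P is

$$\mathrm{abundance}(P) = \frac{1}{|R|} \sum_{r \in R} \sum_{g \in G_r} a_g$$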

That is, the abundance of a pathway is the sum of the abundances of the genes catalyzing the reactions of the pathway, divided by the number of reactions of the pathway for which gene abundances are given. Note that this formula takes into account all known isozymes catalyzing a reaction, and that spontaneous reactions do not take part in the computation.
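The computation described above can be sketched in a few lines of Python. This is an illustration of the stated rule, not PathoLogic's actual code: the function name and input layout are assumptions, and the sketch assumes that a gene catalyzing several reactions of the pathway is counted once per reaction, which the description does not spell out.

```python
from typing import Dict, List, Optional


def pathway_abundance(reaction_genes: Dict[str, List[str]],
                      gene_abundance: Dict[str, float]) -> Optional[float]:
    """Sum of gene abundances divided by the number of reactions with data.

    reaction_genes maps each reaction of the pathway to the genes
    (including isozymes) that catalyze it; spontaneous reactions have
    an empty gene list. Returns None if no reaction has abundance data.
    """
    total = 0.0
    reactions_with_data = 0
    for genes in reaction_genes.values():
        known = [gene_abundance[g] for g in genes if g in gene_abundance]
        if not known:
            # Spontaneous reaction, or no abundance given for its genes:
            # it contributes neither to the sum nor to the divisor.
            continue
        total += sum(known)
        reactions_with_data += 1
    if reactions_with_data == 0:
        return None
    return total / reactions_with_data


# Illustrative pathway: two enzyme-catalyzed reactions, one spontaneous.
pathway = {"RXN-A": ["g1", "g2"], "RXN-B": ["g3"], "RXN-C": []}
abundances = {"g1": 4.0, "g2": 2.0, "g3": 6.0}
result = pathway_abundance(pathway, abundances)  # (4 + 2 + 6) / 2 = 6.0
```

Here the two isozymes g1 and g2 both contribute to the sum, while the spontaneous reaction RXN-C is ignored in both the numerator and the divisor.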

Once PathoLogic has inferred the pathways from the annotated genome file, the computed pathway abundances can be found in the file pathways-report.txt under the subdirectory report of your PGDB. This report file lists all pathways that were inferred to be present in the PGDB, alongside various computed parameters (e.g., confidence factor), including the computed abundances.

A Sequence Data File for Associating Enzymes with MetaCyc Reactions

Metabolic reconstruction typically proceeds in two steps: (1) analyze the proteome of a sequenced organism to infer the set of reactions catalyzed by the organism (the organism's reactome), and (2) infer the metabolic pathways present in the organism from the reactome.

Step (1) usually involves computing associations between protein sequences and the reactions in a pathway database such as MetaCyc. Each such association represents the inference that a given protein catalyzes a given MetaCyc reaction.

Such associations can be inferred using a variety of sequence-analysis methods. To aid researchers in associating sequences with MetaCyc reactions, each release of MetaCyc includes a file that associates MetaCyc reaction IDs with the UniProt identifiers of enzymes known to catalyze those reactions. Note that not all MetaCyc reactions have EC numbers (because not all enzyme-catalyzed reactions have yet been assigned EC numbers), so EC numbers alone are not a comprehensive mechanism for associating sequences with reactions. The file is called uniprot-seq-ids.dat and is included in the MetaCyc data file distribution.

The file contains a Lisp list of the form:

((RXN-1  EC#  ID1 ... IDn)
 (RXN-2  EC#  ID1 ... IDn)
 ...)

where RXN-1 is the unique ID of a MetaCyc reaction, EC# is the EC number of that reaction (or NIL if the reaction has not been assigned an EC number), and ID1 ... IDn are the UniProt IDs of proteins that catalyze RXN-1.
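As an illustration, a list of this shape can be read without a Lisp interpreter. The sketch below is a hedged example rather than an official parser: the function name is hypothetical, the sample identifiers are made up, and it assumes the file holds one outer list of flat entries, with each EC number written as a single token (possibly in double quotes).

```python
import re
from typing import Dict, List, Optional, Tuple

# Tokens: parentheses, double-quoted strings, or bare symbols.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse_reaction_entries(
        text: str) -> Dict[str, Tuple[Optional[str], List[str]]]:
    """Map each reaction ID to (EC number or None, list of UniProt IDs)."""
    entries: Dict[str, Tuple[Optional[str], List[str]]] = {}
    depth = 0
    current: List[str] = []
    for tok in TOKEN.findall(text):
        if tok == "(":
            depth += 1
            if depth == 2:          # start of one (RXN EC# ID1 ... IDn) entry
                current = []
        elif tok == ")":
            if depth == 2 and len(current) >= 2:
                rxn, ec, *ids = current
                entries[rxn] = (None if ec.upper() == "NIL" else ec, ids)
            depth -= 1
        elif depth == 2:            # atoms nested deeper are ignored
            current.append(tok.strip('"'))
    return entries


# Made-up sample in the documented shape, not real file content.
sample = "((RXN-1 1.1.1.1 P12345 Q67890)\n (RXN-2 NIL P11111))"
parsed = parse_reaction_entries(sample)
```

With the real file, the text would be read from uniprot-seq-ids.dat; any comment lines or other syntax not covered by the assumptions above would need extra handling.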

Note that when we prepare this file we intentionally filter out those UniProt proteins that have high sequence similarity to other UniProt proteins already listed for a given reaction, to bound both the size of the file and the cost of the downstream sequence comparisons.