Tuesday, April 30, 2013

A Sequence Data File for Associating Enzymes with MetaCyc Reactions

Metabolic reconstruction typically proceeds in two steps: (1) analyze the proteome of a sequenced organism to infer the set of reactions catalyzed by the organism (the organism's reactome), and (2) infer the metabolic pathways present in the organism from the reactome.

Step (1) usually involves computing associations between protein sequences and the reactions in a pathway database such as MetaCyc. Each association represents the inference that a given protein catalyzes a given MetaCyc reaction.

These associations can be inferred using a variety of sequence-analysis methods. To aid researchers in associating sequences with MetaCyc reactions, each release of MetaCyc includes a file that associates MetaCyc reaction IDs with the UniProt identifiers of enzymes known to catalyze those reactions. Note that not all MetaCyc reactions have EC numbers (because not all enzyme-catalyzed reactions have yet been assigned EC numbers), so EC numbers alone are not a comprehensive mechanism for associating sequences with reactions. The file is called uniprot-seq-ids.dat and is included in the MetaCyc data file distribution.

The file contains a Lisp list of the form:

((RXN-1  EC#  ID1 ... IDn)
 (RXN-2  EC#  ID1 ... IDn)
 ...)

where RXN-1 is the unique ID of a MetaCyc reaction, EC# is the EC number of that reaction (or NIL if the reaction has not been assigned an EC number), and ID1 through IDn are the UniProt IDs of proteins that catalyze RXN-1.
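
For readers who do not work in Lisp, the list format described above can be read with a few lines of Python. What follows is only a minimal sketch, not an official parser. It assumes that every atom is either a bare token or a double-quoted string, that no entry contains a nested list, and that the file has no comments. The function name parse_reaction_list and the sample identifiers are made up for illustration and are not taken from the real file.

```python
import re

# Assumption: an atom is either a double-quoted string or a run of
# characters containing no whitespace, parentheses, or quotes.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse_reaction_list(text):
    """Return {reaction_id: (ec_number_or_None, [uniprot_ids])}.

    Hypothetical helper: expects one outer list whose elements are
    flat lists of the form (RXN EC# ID1 ... IDn).
    """
    associations = {}
    depth = 0
    entry = None
    for tok in TOKEN.findall(text):
        if tok == "(":
            depth += 1
            if depth == 2:
                entry = []          # start of a new reaction entry
            elif depth > 2:
                raise ValueError("unexpected nested list")
        elif tok == ")":
            if depth == 2:
                if len(entry) < 2:
                    raise ValueError("entry lacks reaction ID or EC field")
                rxn, ec, *ids = entry
                # NIL marks a reaction with no assigned EC number.
                associations[rxn] = (None if ec.upper() == "NIL" else ec, ids)
                entry = None
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        else:
            if depth != 2:
                raise ValueError("atom outside a reaction entry")
            entry.append(tok.strip('"'))
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return associations


# Placeholder data in the documented shape (not real file contents).
sample = "((RXN-1 1.1.1.1 P00001 P00002)\n (RXN-2 NIL Q00003))"
reactions = parse_reaction_list(sample)
```

In practice you would pass the contents of uniprot-seq-ids.dat as the text argument. If the real file uses Lisp syntax beyond what is assumed here, the sketch will raise a ValueError or need adjusting.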

Note that when we prepare this file we intentionally filter out those UniProt proteins that have high sequence similarity to other UniProt proteins already listed for a given reaction, to bound both the size of the file and the cost of the downstream sequence comparisons.
