BioCyc and Pathway Tools Blog: How Does Metabolic Pathway Prediction Work?

Like many problems in bioinformatics, accurate prediction of metabolic pathways depends on a tight coordination between an algorithm and a database. The pathway-prediction algorithm used in the Pathway Tools software that powers BioCyc.org is called PathoLogic; the database is MetaCyc. Here we provide an overview of the latest version of PathoLogic pathway prediction; [1] describes an older version of the algorithm.

The input to PathoLogic is an annotated genome, meaning gene locations and gene functions have been predicted. The genome can be supplied in the form of a GenBank (.gbk or .gbff) or GFF3 file.

The essence of the PathoLogic algorithm is to recognize known pathways from MetaCyc in the genome being analyzed. PathoLogic performs two steps: reactome inference and pathway inference.

Reactome Inference

Reactome inference determines which metabolic reactions are catalyzed by enzymes in the annotated genome. PathoLogic examines each protein in the annotated genome and tries to infer what reaction (if any) the protein catalyzes. It examines the name of the protein (stored in the /product field of GenBank files), along with any EC number and Gene Ontology terms associated with the protein. If, for example, the software encounters the name “pyruvate kinase”, it will look in the MetaCyc database for a match. In MetaCyc this name is associated with the EC number 2.7.1.40 and with the following reaction:

phosphoenolpyruvate + ADP + H⁺ → pyruvate + ATP

Thus, when PathoLogic encounters either the name pyruvate kinase or the EC number 2.7.1.40 in the annotated genome, it will associate the gene product with this reaction.
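The lookup described above can be pictured with a small sketch. This is not PathoLogic's code or MetaCyc's schema: the dictionaries, the reaction identifier RXN-PYK, and the function name infer_reaction are hypothetical stand-ins, and checking the EC number before the name is an assumption made only for illustration.

```python
from typing import Optional

# Hypothetical, tiny stand-in for the MetaCyc lookups (not the real schema).
REACTION_BY_NAME = {"pyruvate kinase": "RXN-PYK"}
REACTION_BY_EC = {"2.7.1.40": "RXN-PYK"}
REACTION_EQUATION = {
    "RXN-PYK": "phosphoenolpyruvate + ADP + H+ -> pyruvate + ATP",
}


def infer_reaction(product: Optional[str] = None,
                   ec_number: Optional[str] = None) -> Optional[str]:
    """Return a reaction id for a protein, or None if nothing matches.

    Assumption: the EC number is tried first, then the product name.
    """
    if ec_number and ec_number in REACTION_BY_EC:
        return REACTION_BY_EC[ec_number]
    if product:
        # Names are compared case-insensitively after trimming whitespace.
        return REACTION_BY_NAME.get(product.strip().lower())
    return None


if __name__ == "__main__":
    rxn = infer_reaction(product="Pyruvate kinase")
    print(rxn, REACTION_EQUATION.get(rxn))
```

Either piece of evidence alone is enough to reach the same reaction in this sketch, which mirrors the "name or EC number" behavior described above.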

One tricky aspect of reaction inference is that genome annotation pipelines often assign protein function names containing extra terms that obscure the enzymatic function.

Examples of such product names include:

pyruvate kinase, PykF

putative pyruvate kinase

pyruvate kinase, hypothetical

pyruvate kinase, cytoplasmic

tryptophan synthase, beta subunit

L-tryptophan isonitrile synthase 1

Thus, the enzyme name matcher component of PathoLogic uses a battery of regular expressions (textual patterns) to strip off this additional text in search of the core enzymatic function in a given product name. The name matcher may query many variations of an enzyme name until it finds one that is known in MetaCyc. When PathoLogic does not recognize an enzyme name, it tries to look up the gene name (symbol) associated with the enzyme, if one is provided (several restrictions are used to reduce incorrect matches based on gene names).
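As a rough illustration of this strip-and-retry idea, the sketch below generates progressively simplified variants of a product name and stops at the first one found in a set of known names. The patterns, the KNOWN_ENZYME_NAMES set, and the function names are all hypothetical; PathoLogic's actual regular expressions are not shown in this post and are certainly more extensive.

```python
import re
from typing import Iterator, Optional

# Hypothetical stand-in for the enzyme names known to MetaCyc.
KNOWN_ENZYME_NAMES = {"pyruvate kinase", "tryptophan synthase"}

# Illustrative patterns only; each removes one kind of extra text.
STRIP_PATTERNS = [
    re.compile(r"^(putative|probable|predicted)\s+", re.IGNORECASE),
    re.compile(r",\s*(hypothetical|cytoplasmic)$", re.IGNORECASE),
    re.compile(r",\s*(alpha|beta|gamma)\s+subunit$", re.IGNORECASE),
    re.compile(r",\s*[A-Z][a-z]{2}[A-Z]$"),  # trailing symbol such as ", PykF"
]


def name_variants(product: str) -> Iterator[str]:
    """Yield the original name, then each successively stripped variant."""
    name = product.strip()
    yield name
    changed = True
    while changed:
        changed = False
        for pattern in STRIP_PATTERNS:
            stripped = pattern.sub("", name).strip()
            if stripped != name:
                name = stripped
                changed = True
                yield name


def match_enzyme_name(product: str) -> Optional[str]:
    """Return the first variant that is a known enzyme name, else None."""
    for variant in name_variants(product):
        key = variant.lower()
        if key in KNOWN_ENZYME_NAMES:
            return key
    return None


if __name__ == "__main__":
    for p in ["pyruvate kinase, PykF", "putative pyruvate kinase",
              "tryptophan synthase, beta subunit",
              "L-tryptophan isonitrile synthase 1"]:
        print(p, "=>", match_enzyme_name(p))
```

Note that the last name is left unmatched in this sketch rather than being reduced to "tryptophan synthase": stripping is anchored to specific prefixes and suffixes so that a different enzyme is not mistaken for a known one.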

Pathway Inference

Once the reactions of the organism have been inferred, PathoLogic considers every pathway in MetaCyc, and computes a score that indicates the likelihood that the pathway is present. The pathway score is computed from the sum of the scores of the reactions in the pathway, divided by the number of reactions in the pathway (excluding spontaneous reactions). A reaction score is computed by summing these factors:

Is an enzyme catalyzing the reaction present in the organism?
How unique is the reaction to this pathway? Is it found only in this pathway, or in other pathways as well? The less unique a reaction, the lower its score.
Some reactions in a pathway are designated as key reactions, meaning the reaction distinguishes the pathway from other similar pathways; the presence of an enzyme that catalyzes a key reaction boosts the score of that reaction.
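The scoring scheme can be sketched as follows. The post gives the overall shape (sum of reaction scores divided by the number of non-spontaneous reactions) but not the weights, so every number here is an assumption: 1.0 for enzyme presence, 1/n for uniqueness where n is the number of pathways containing the reaction, and 1.0 for a key reaction. Scoring a reaction as 0 when no enzyme is found, and leaving spontaneous reactions out of the sum as well as the count, are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reaction:
    has_enzyme: bool           # is a catalyzing enzyme present in the organism?
    n_pathways: int            # how many pathways contain this reaction
    is_key: bool = False       # designated key reaction for this pathway
    spontaneous: bool = False  # spontaneous reactions are not counted


def reaction_score(rxn: Reaction) -> float:
    """Sum the three factors, using illustrative (assumed) weights."""
    if not rxn.has_enzyme:
        return 0.0                   # assumption: no enzyme, no evidence
    score = 1.0                      # enzyme present
    score += 1.0 / rxn.n_pathways    # less unique => smaller contribution
    if rxn.is_key:
        score += 1.0                 # key-reaction boost
    return score


def pathway_score(reactions: List[Reaction]) -> float:
    """Average reaction score over the non-spontaneous reactions."""
    counted = [r for r in reactions if not r.spontaneous]
    if not counted:
        return 0.0
    return sum(reaction_score(r) for r in counted) / len(counted)


if __name__ == "__main__":
    pathway = [
        Reaction(has_enzyme=True, n_pathways=1, is_key=True),
        Reaction(has_enzyme=True, n_pathways=2),
        Reaction(has_enzyme=False, n_pathways=1),
        Reaction(has_enzyme=False, n_pathways=1, spontaneous=True),
    ]
    print(pathway_score(pathway))
```

With these assumed weights, the three counted reactions score 3.0, 1.5, and 0.0, so the pathway score is 1.5.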

A rule-based expert system makes the final determination of whether the pathway is inferred as present by considering the following factors:

The pathway score.
The presence of designated key non-reactions for the pathway (reactions whose presence inhibits inference of the pathway).
Was some other variant of this pathway assigned a superior score?
Is the pathway outside its taxonomic range and lacking at least one reaction?
Special logic is provided for distinguishing which of the many variants of glycolysis and of the TCA cycle should be inferred.
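One way to picture how such rules might combine is the sketch below. The post lists the factors but not how the expert system weighs them, so treating each factor as a hard veto, the SCORE_THRESHOLD value, and all field names are assumptions for illustration. The special handling of glycolysis and TCA cycle variants is omitted.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.5  # hypothetical cutoff, not PathoLogic's actual value


@dataclass
class PathwayEvidence:
    score: float                    # the pathway score
    key_non_reaction_present: bool  # any key non-reaction found in the organism
    best_variant_score: float       # best score among other variants (0.0 if none)
    in_taxonomic_range: bool        # organism falls within the pathway's range
    missing_reactions: int          # reactions with no enzyme found


def is_pathway_present(ev: PathwayEvidence) -> bool:
    """Apply the listed factors as simple veto rules (an assumption)."""
    if ev.key_non_reaction_present:
        return False  # a key non-reaction inhibits inference
    if not ev.in_taxonomic_range and ev.missing_reactions >= 1:
        return False  # outside range and incomplete
    if ev.best_variant_score > ev.score:
        return False  # another variant scored higher
    return ev.score >= SCORE_THRESHOLD


if __name__ == "__main__":
    ev = PathwayEvidence(score=1.5, key_non_reaction_present=False,
                         best_variant_score=1.0, in_taxonomic_range=True,
                         missing_reactions=1)
    print(is_pathway_present(ev))
```

In this sketch a pathway outside its taxonomic range can still be inferred, but only when every one of its reactions is accounted for.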

MetaCyc

We have designed the MetaCyc database to assist the PathoLogic algorithm in a number of respects:

The more pathways MetaCyc contains, the more comprehensive PathoLogic’s inferences will be.
The boundaries (extents) of MetaCyc pathways are designed to correspond to evolutionary units. When pathway extents are overly large, pathway inference is likely to infer the presence of too large a pathway, even if some of its parts are not present.
MetaCyc specifies the taxonomic range of many pathways.
MetaCyc specifies key reactions and key non-reactions for many pathways.
MetaCyc specifies which pathways are variants (alternatives) of one another.

[1] Karp, P.D., Latendresse, M., and Caspi, R., "The Pathway Tools pathway prediction algorithm," Standards in Genomic Sciences, 5:424-9, 2011.

Monday, October 19, 2020
