Monday, October 19, 2020

How Does Metabolic Pathway Prediction Work?

Like many problems in bioinformatics, accurate prediction of metabolic pathways depends on a tight coordination between an algorithm and a database. The pathway-prediction algorithm used in the Pathway Tools software that powers is called PathoLogic; the database is MetaCyc. Here we provide an overview of the latest version of PathoLogic pathway prediction; [1] describes an older version of the algorithm.

The input to PathoLogic is an annotated genome, meaning gene locations and gene functions have been predicted. The genome can be supplied in the form of a GenBank (.gbk or .gbff) or GFF3 file. 

The essence of the PathoLogic algorithm is to recognize known pathways from MetaCyc in the genome being analyzed. PathoLogic performs two steps: reactome inference and pathway inference. 

Reactome Inference 

Reactome inference infers the metabolic reactions catalyzed by enzymes in the annotated genome. PathoLogic examines each protein in the annotated genome and tries to infer what reaction (if any) the protein catalyzes. It examines the name of the protein (stored in the /product field of GenBank files), and the EC number and/or Gene Ontology terms associated with the protein, if any. If, for example, the software encounters the name “pyruvate kinase”, it will look in the MetaCyc database for a match. In MetaCyc this name is associated with the EC number and with the following reaction:

Thus, when PathoLogic encounters either the name “pyruvate kinase” or the EC number in the annotated genome, it will associate the gene product with this reaction.
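
The lookup just described can be pictured as two small tables keyed by enzyme name and by EC number. The following is a minimal sketch, not PathoLogic's actual implementation: the tables, the reaction identifier, and the function name are hypothetical stand-ins for MetaCyc data, Gene Ontology terms are left out, and trying the EC number before the name is an assumption. EC 2.7.1.40 is the Enzyme Commission number for pyruvate kinase.

```python
# Hypothetical stand-ins for MetaCyc lookup tables (not real MetaCyc data).
NAME_TO_REACTION = {"pyruvate kinase": "RXN-PYRUVATE-KINASE"}
EC_TO_REACTION = {"2.7.1.40": "RXN-PYRUVATE-KINASE"}  # EC number of pyruvate kinase


def infer_reaction(product_name=None, ec_number=None):
    """Return a reaction identifier for a protein, or None if nothing matches."""
    # Assumption: try the EC number first, then fall back to the product name.
    if ec_number and ec_number in EC_TO_REACTION:
        return EC_TO_REACTION[ec_number]
    if product_name:
        return NAME_TO_REACTION.get(product_name.strip().lower())
    return None
```

Either piece of evidence alone is enough here: infer_reaction(product_name="Pyruvate kinase") and infer_reaction(ec_number="2.7.1.40") both return the same hypothetical reaction identifier.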

One tricky aspect of reaction inference is that genome annotation pipelines often assign protein function names that include additional terms that can obscure the enzymatic function. 
Examples of such product names include: 
pyruvate kinase, PykF 
putative pyruvate kinase 
pyruvate kinase, hypothetical 
pyruvate kinase, cytoplasmic 
tryptophan synthase, beta subunit 
L-tryptophan isonitrile synthase 1 

Thus, the enzyme name matcher component of PathoLogic uses a battery of regular expressions (textual patterns) to strip off this additional text in search of the core enzymatic function in a given product name. The name matcher may query many different variations of an enzyme name until it finds one that is known in MetaCyc. When PathoLogic does not recognize an enzyme name, it tries to look up the gene name (symbol) associated with the enzyme, if one is provided; several restrictions are applied to reduce incorrect matches based on gene names.
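
To make the stripping idea concrete, here is a small sketch that handles the example product names listed above. It is only an illustration under stated assumptions: the four patterns, the set of known names, and the function names are hypothetical, and the real name matcher uses a much larger battery of patterns.

```python
import re

# Hypothetical set of enzyme names known to the database.
KNOWN_NAMES = {"pyruvate kinase", "tryptophan synthase"}

# Illustrative patterns only, applied in order.
STRIP_PATTERNS = [
    re.compile(r"^(putative|probable|predicted)\s+", re.IGNORECASE),  # leading qualifiers
    re.compile(r",\s*(hypothetical|cytoplasmic)$", re.IGNORECASE),    # trailing qualifiers
    re.compile(r",\s*(alpha|beta|gamma)\s+subunit$", re.IGNORECASE),  # subunit designations
    re.compile(r",\s*[A-Z][a-z]{2}[A-Z]$"),                           # trailing gene symbol, e.g. PykF
]


def candidate_names(product_name):
    """Yield the name itself, then each progressively stripped variant."""
    name = product_name.strip()
    yield name
    for pattern in STRIP_PATTERNS:
        stripped = pattern.sub("", name).strip()
        if stripped != name:
            name = stripped
            yield name


def match_enzyme_name(product_name):
    """Return the first variant found in KNOWN_NAMES, or None."""
    for variant in candidate_names(product_name):
        if variant.lower() in KNOWN_NAMES:
            return variant.lower()
    return None
```

With these patterns, "pyruvate kinase, PykF" and "putative pyruvate kinase" both reduce to "pyruvate kinase", while "L-tryptophan isonitrile synthase 1" matches nothing, so it is not mistaken for tryptophan synthase.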

Pathway Inference 

Once the reactions of the organism have been inferred, PathoLogic considers every pathway in MetaCyc, and computes a score that indicates the likelihood that the pathway is present. The pathway score is computed from the sum of the scores of the reactions in the pathway, divided by the number of reactions in the pathway (excluding spontaneous reactions). A reaction score is computed by summing these factors:
  • Is an enzyme catalyzing the reaction present in the organism?
  • How unique is the reaction to this pathway? Is it found only in this pathway, or in other pathways as well? The less unique a reaction, the lower its score. 
  • Some reactions in a pathway are designated as key reactions, meaning the reaction distinguishes the pathway from other similar pathways; the presence of an enzyme that catalyzes a key reaction boosts the score of that reaction.
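
The scoring scheme above can be sketched in a few lines. The weights, the 1/n form of the uniqueness term, and all names below are assumptions made for illustration, since the actual values are not given here; the sketch also assumes that a reaction with no enzyme contributes nothing.

```python
from dataclasses import dataclass

# Hypothetical weights; the actual values are not stated in the post.
ENZYME_PRESENT = 1.0
KEY_REACTION_BONUS = 0.5


@dataclass
class Reaction:
    has_enzyme: bool           # is a catalyzing enzyme present in the organism?
    n_pathways: int = 1        # number of pathways that contain this reaction
    is_key: bool = False       # designated key reaction for the pathway
    spontaneous: bool = False  # spontaneous reactions are not counted


def reaction_score(rxn):
    """Sum the three factors: enzyme presence, uniqueness, key-reaction boost."""
    if not rxn.has_enzyme:
        return 0.0  # assumption: no enzyme means no evidence for the reaction
    score = ENZYME_PRESENT
    score += 1.0 / rxn.n_pathways  # the less unique the reaction, the smaller this term
    if rxn.is_key:
        score += KEY_REACTION_BONUS
    return score


def pathway_score(reactions):
    """Sum of reaction scores divided by the number of non-spontaneous reactions."""
    counted = [r for r in reactions if not r.spontaneous]
    if not counted:
        return 0.0
    return sum(reaction_score(r) for r in counted) / len(counted)
```

For a pathway with one key reaction unique to it, one reaction shared with another pathway, one reaction lacking an enzyme, and one spontaneous reaction, the sketch sums 2.5 + 1.5 + 0 and divides by 3.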
 A rule-based expert system makes the final determination of whether the pathway is inferred as present by considering the following factors:
  • The pathway score.
  • The presence of designated key non-reactions for the pathway (reactions whose presence inhibits inference of the pathway).
  • Was some other variant of this pathway assigned a superior score? 
  • Is the pathway outside its taxonomic range and lacking at least one reaction? 
  • Special logic is provided for distinguishing which of the many variants of glycolysis and of the TCA cycle should be inferred.
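
One way to picture how these factors combine is as a series of veto rules applied before a score cutoff. This is a rough sketch only: treating each factor as a hard veto, the order of the checks, the threshold value, and the parameter names are all assumptions, and the special handling of glycolysis and TCA cycle variants is omitted.

```python
def pathway_is_present(score, *, threshold=0.5, has_key_non_reaction=False,
                       better_variant_exists=False, in_taxonomic_range=True,
                       missing_reactions=0):
    """Decide whether a pathway is inferred, using simplified veto rules."""
    if has_key_non_reaction:
        return False  # a key non-reaction inhibits inference of the pathway
    if better_variant_exists:
        return False  # another variant of this pathway scored higher
    if not in_taxonomic_range and missing_reactions >= 1:
        return False  # outside its taxonomic range and incomplete
    return score >= threshold  # hypothetical cutoff on the pathway score
```

Under these assumptions, a complete pathway outside its taxonomic range can still be inferred, whereas the same pathway with a single missing reaction is rejected.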


We have designed the MetaCyc database to assist the PathoLogic algorithm in a number of respects: 
  • The more pathways MetaCyc contains, the more comprehensive PathoLogic’s inferences will be.
  • The boundaries (extents) of MetaCyc pathways are designed to correspond to evolutionary units. When pathway extents are overly large, pathway inference is likely to infer the presence of too large a pathway, even if some of its parts are not present. 
  • MetaCyc specifies the taxonomic range of many pathways. 
  • MetaCyc specifies key reactions and key non-reactions for many pathways. 
  • MetaCyc specifies which pathways are variants (alternatives) of one another.
[1] Karp, P.D., Latendresse, M., and Caspi, R., "The Pathway Tools pathway prediction algorithm," Standards in Genomic Sciences, 5:424-429, 2011.


  1. Can we also get a pathway from multiple genes belonging to different microorganisms? Thanks

    1. Each database is intended to represent a single organism, and pathway prediction is only within a single database, so in that respect, the answer to your question is no. However, depending on what you are trying to do, there are ways to get at that information.

      It is technically possible to create a multi-organism database that combines the genomes from multiple organisms. Pathway prediction would then be done over the entire set of genes in all the included genomes. The downside is that PathoLogic is not really set up to recognize that different replicons are associated with different organisms, so the genes would not automatically be marked with the organism to which they belong (that information could be added, but there is currently no way in the GUI to do that, so you'd need help from someone who understood the underlying representation).

      If you have multiple databases, each for a single organism, you can search for metabolic paths (not predefined pathways) that encompass multiple databases. For an example, go to (using Firefox -- this functionality doesn't work in other browsers), select Metabolism -> Metabolic Route Search, and then check the box for routes across multiple organisms. If you have Pathway Tools installed locally, you would have to run it in web mode in order to access this functionality.

      Finally, the software does have a limited capability to identify "distributed pathways," currently defined as pathways that are not inferred in either of two databases, but would be inferred if the genome complements of those two databases were combined. To do this using the desktop software, you would create a community overview diagram containing a small number of organisms of interest (Overviews -> Build Community Overview) and then select Overviews -> Highlight -> Pathways -> Distributed Pathways.
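
      The definition of a distributed pathway amounts to a set difference, which can be sketched as follows. This is not how the software implements it: the predict argument is a hypothetical callable that maps a set of genes to the set of pathways inferred from it, and the toy predictor and its gene and pathway names are invented for the example.

```python
def distributed_pathways(genes_a, genes_b, predict):
    """Return pathways inferred only when both gene complements are combined."""
    genes_a, genes_b = set(genes_a), set(genes_b)
    combined = predict(genes_a | genes_b)
    return combined - predict(genes_a) - predict(genes_b)


# Toy predictor (hypothetical): a pathway is inferred when all its genes are present.
REQUIRED_GENES = {"P1": {"g1", "g2"}, "P2": {"g1"}}


def toy_predict(genes):
    return {p for p, needed in REQUIRED_GENES.items() if needed <= genes}
```

      In the toy data, pathway P1 needs genes g1 and g2, so it is distributed across one organism carrying only g1 and another carrying only g2, while P2 is already inferred in the first organism alone.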