The input to PathoLogic is an annotated genome, meaning gene locations and gene functions have been predicted. The genome can be supplied in the form of a GenBank (.gbk or .gbff) or GFF3 file.
The essence of the PathoLogic algorithm is to recognize known pathways from MetaCyc in the genome being analyzed. PathoLogic performs two steps: reactome inference and pathway inference.
Reactome inference infers the metabolic reactions catalyzed by enzymes in the annotated genome. PathoLogic examines each protein in the annotated genome and tries to infer what reaction (if any) the protein catalyzes. It examines the name of the protein (stored in the /product field of GenBank files), along with any EC number and/or Gene Ontology terms associated with the protein. If, for example, the software encounters the name “pyruvate kinase”, it will look in the MetaCyc database for a match. In MetaCyc this name is associated with the EC number 2.7.1.40 and with the following reaction:
phosphoenolpyruvate + ADP + H+ → pyruvate + ATP
Thus, when PathoLogic encounters either the name pyruvate kinase or the EC number 2.7.1.40 in the annotated genome, it will associate the gene product with this reaction.
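The lookup described above can be pictured as two small indexes, one keyed by enzyme name and one by EC number. The following is only a minimal sketch: the identifier `RXN-PYK`, the dictionary names, and the choice to try the EC number before the name are assumptions made for illustration, not a description of how PathoLogic or MetaCyc is implemented.

```python
from typing import Optional

# Hypothetical miniature index. "RXN-PYK" is a made-up identifier standing in
# for the pyruvate kinase reaction; a real database holds thousands of entries.
NAME_TO_REACTION = {"pyruvate kinase": "RXN-PYK"}
EC_TO_REACTION = {"2.7.1.40": "RXN-PYK"}


def infer_reaction(product: Optional[str], ec_number: Optional[str]) -> Optional[str]:
    """Return a reaction identifier for one protein, or None if nothing matches.

    Assumption for this sketch: the EC number is tried first, then the name.
    """
    if ec_number and ec_number in EC_TO_REACTION:
        return EC_TO_REACTION[ec_number]
    if product:
        return NAME_TO_REACTION.get(product.strip().lower())
    return None
```

Either piece of evidence alone is enough to reach the same reaction, which mirrors the behavior described for pyruvate kinase.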
One tricky aspect of reactome inference is that genome annotation pipelines often assign protein function names containing additional terms that can obscure the enzymatic function.
Examples of such product names include:
- pyruvate kinase, PykF
- putative pyruvate kinase
- pyruvate kinase, hypothetical
- pyruvate kinase, cytoplasmic
- tryptophan synthase, beta subunit
- L-tryptophan isonitrile synthase 1
Thus, the enzyme name matcher component of PathoLogic uses a battery of regular expressions (textual patterns) to strip off this additional text in search of the core enzymatic function in a given product name. The name matcher may query many variations of an enzyme name before it finds one that is known in MetaCyc. When PathoLogic does not recognize an enzyme name, it tries to look up the gene name (symbol) associated with the enzyme, if one is provided; several restrictions are applied to reduce incorrect matches based on gene names.
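To make the stripping idea concrete, here is a small sketch that generates progressively simplified variants of a product name and checks each against a set of known enzyme names. The patterns, the `KNOWN_NAMES` set, and the function names are all hypothetical; PathoLogic's actual regular expressions are not given in this text, and the gene-name fallback is left out.

```python
import re
from typing import Iterator, Optional

# Stand-in for the enzyme names known to the reference database (assumption).
KNOWN_NAMES = {"pyruvate kinase", "tryptophan synthase"}

# Hypothetical patterns, chosen only to cover the example names above.
STRIP_PATTERNS = [
    re.compile(r"^(putative|probable|predicted)\s+", re.IGNORECASE),
    re.compile(r",\s*(hypothetical|cytoplasmic)$", re.IGNORECASE),
    re.compile(r",\s*(alpha|beta|gamma)\s+(subunit|chain)$", re.IGNORECASE),
    re.compile(r",\s*[A-Z][a-z]{2}[A-Z]$"),  # trailing gene symbol such as ", PykF"
    re.compile(r"\s+\d+$"),                  # trailing number such as " 1"
]


def name_variants(product: str) -> Iterator[str]:
    """Yield the original name, then each variant produced by stripping text."""
    current = product.strip()
    yield current
    changed = True
    while changed:  # repeat until no pattern removes anything further
        changed = False
        for pattern in STRIP_PATTERNS:
            stripped = pattern.sub("", current).strip()
            if stripped and stripped != current:
                current = stripped
                changed = True
                yield current


def match_enzyme_name(product: str) -> Optional[str]:
    """Return the first variant found in KNOWN_NAMES, or None."""
    for variant in name_variants(product):
        if variant.lower() in KNOWN_NAMES:
            return variant.lower()
    return None
```

With these patterns, "tryptophan synthase, beta subunit" reduces to a known name, while "L-tryptophan isonitrile synthase 1" reduces only to "L-tryptophan isonitrile synthase" and stays unmatched, because the sketch compares whole names rather than substrings.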
Once the reactions of the organism have been inferred, PathoLogic considers every pathway in MetaCyc, and computes a score that indicates the likelihood that the pathway is present. The pathway score is computed from the sum of the scores of the reactions in the pathway, divided by the number of reactions in the pathway (excluding spontaneous reactions). A reaction score is computed by summing these factors:
- Is an enzyme catalyzing the reaction present in the organism?
- How unique is the reaction to this pathway? Is it found only in this pathway, or in other pathways as well? The less unique a reaction, the lower its score.
- Some reactions in a pathway are designated as key reactions, meaning the reaction distinguishes the pathway from other similar pathways; the presence of an enzyme that catalyzes a key reaction boosts the score of that reaction.
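PathoLogic then decides whether to infer the pathway as present based on the following considerations:
- The pathway score.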
- The presence of designated key non-reactions for the pathway (reactions whose presence inhibits inference of the pathway).
- Was some other variant of this pathway assigned a superior score?
- Is the pathway outside its taxonomic range and lacking at least one reaction?
- Special logic is provided for distinguishing which of the many variants of glycolysis and of the TCA cycle should be inferred.
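The reaction and pathway scores described above can be sketched as simple arithmetic. The text does not give the actual weights, so the values below (1.0 for enzyme presence, 1/n for uniqueness, 0.5 for a key reaction) are invented for illustration. The sketch also assumes that a reaction without an enzyme scores zero and that spontaneous reactions are left out of both the sum and the count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reaction:
    has_enzyme: bool           # an enzyme catalyzing the reaction was found
    pathway_count: int         # number of pathways containing this reaction (>= 1)
    is_key: bool = False       # designated key reaction for the pathway
    spontaneous: bool = False  # spontaneous reactions are excluded from scoring


def reaction_score(rxn: Reaction) -> float:
    """Sum the factors for one reaction (all weights are hypothetical)."""
    if not rxn.has_enzyme:
        return 0.0  # assumption: no enzyme means no evidence for the reaction
    score = 1.0                        # enzyme present
    score += 1.0 / rxn.pathway_count   # less unique reactions contribute less
    if rxn.is_key:
        score += 0.5                   # boost for a key reaction
    return score


def pathway_score(reactions: List[Reaction]) -> float:
    """Sum of reaction scores divided by the number of non-spontaneous reactions."""
    scored = [r for r in reactions if not r.spontaneous]
    if not scored:
        return 0.0
    return sum(reaction_score(r) for r in scored) / len(scored)
```

A reaction shared among many pathways adds little beyond the presence term, so pathways supported only by widely shared reactions end up with lower scores than pathways supported by unique or key reactions.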
We have designed the MetaCyc database to assist the PathoLogic algorithm in a number of respects:
- The more pathways MetaCyc contains, the more comprehensive PathoLogic’s inferences will be.
- The boundaries (extents) of MetaCyc pathways are designed to correspond to evolutionary units. When pathway extents are overly large, pathway inference is likely to infer the presence of too large a pathway, even if some of its parts are not present.
- MetaCyc specifies the taxonomic range of many pathways.
- MetaCyc specifies key reactions and key non-reactions for many pathways.
- MetaCyc specifies which pathways are variants (alternatives) of one another.
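One way to picture how this metadata supports inference is a pathway record that carries the taxonomic range, key non-reactions, and variant links, plus a filter that consults them. This is a sketch under stated assumptions: the field names, the `min_score` threshold, and the rule that a better-scoring variant or a key non-reaction rejects the pathway outright are illustrative choices, not the actual PathoLogic rules or the MetaCyc schema.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class PathwayRecord:
    """Hypothetical, simplified view of the pathway metadata listed above."""
    pathway_id: str
    reactions: FrozenSet[str]
    key_non_reactions: FrozenSet[str] = frozenset()
    taxonomic_range: FrozenSet[str] = frozenset()  # empty means unrestricted
    variants: FrozenSet[str] = frozenset()         # IDs of alternative pathways


def keep_pathway(
    pw: PathwayRecord,
    score: float,
    organism_taxa: Set[str],
    found_reactions: Set[str],
    variant_scores: Dict[str, float],
    min_score: float = 0.5,  # hypothetical threshold
) -> bool:
    """Decide whether to infer the pathway, using the metadata in the record."""
    if score < min_score:
        return False
    if pw.key_non_reactions & found_reactions:
        return False  # a key non-reaction is present
    if any(variant_scores.get(v, 0.0) > score for v in pw.variants):
        return False  # another variant scored higher
    out_of_range = bool(pw.taxonomic_range) and not (pw.taxonomic_range & organism_taxa)
    if out_of_range and not pw.reactions <= found_reactions:
        return False  # outside taxonomic range and missing a reaction
    return True
```

In this sketch a pathway outside its taxonomic range is still kept when every one of its reactions was found, matching the "lacking at least one reaction" condition in the list of considerations.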