Wednesday, July 13, 2016

PythonCyc: Using the Pathway Tools Python API

Pathway Tools is implemented using the Common Lisp (CL) programming language, but the PythonCyc package creates a bridge between Python and CL. That is, the PythonCyc package allows you to interact with Pathway Tools using the Python language. With PythonCyc you can write Python programs to execute Pathway Tools metabolic models, as well as to extract and modify data stored in Pathway/Genome Databases (PGDBs). You can also call from Python many functions defined in Pathway Tools that manipulate genes, pathways, reactions, proteins, and more.

Wednesday, June 8, 2016

Bulk Updates to Your PGDB

One question that we frequently receive is about how to apply bulk updates to a PGDB. This kind of situation can come about for several reasons:
  • When a group maintains and curates organism data on an ongoing basis using their own software or database environment, and then wants to update a PGDB with all their changes in a single batch operation.
  • When a revised annotation for an organism is made available, and a user wishes to update their PGDB with the new data without losing any existing curation.
  • When a user has some systematic change that they want to apply to a large number of objects, such as a change to the locus id format, the addition of a new set of synonyms, or the addition of links to a new external database.
  • When a user wants to import a large dataset obtained via a high-throughput experiment or computational prediction, such as for protein cellular location or transcription factor binding sites.
Because these are all common scenarios, it seems worthwhile to provide an overview of the various ways that Pathway Tools supports bulk updating of PGDBs.  Note that none of the features discussed here are particularly new, and all have been supported by Pathway Tools for several years.  All User Guide section numbers referenced below are for version 20.0.

It should first be noted that Pathway Tools comes with a full suite of editing and curation tools, so if you have only a handful of changes to make, you should use those to make the edits interactively. The techniques described in this article would normally only be used if you have so many updates that it would be tedious to make the edits manually. 

Wednesday, April 13, 2016

BioCyc to Adopt Subscription Model

BioCyc seeks the support of the scientific community as we begin a new chapter in the development of this bioinformatics resource.

We plan to upgrade the curation level and quality of many BioCyc databases to provide scientists with higher quality information resources for many important microbes, and for Homo sapiens. Such an effort requires large financial resources that -- despite numerous attempts over numerous years -- have not been forthcoming from government funding agencies. Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide scientists with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools. Our work on EcoCyc has demonstrated the way forward. EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications. EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model. EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms. BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that curation occurs irregularly. Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information. The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information. EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape. Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace. That information can be extracted only by professional curators who understand both the biology and the methods for encoding that biology in structured databases. Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992. Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish. For the past twenty years, Peter Karp has been advocating that the government fund similar efforts for other important microbes, such as pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development. He has developed the Pathway Tools software as a platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1]. Funding for BioCyc was cut by 27% at our last renewal, even though we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995. Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen. Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year. No curated database exists for the important gram-positive model organism Bacillus subtilis. How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one. We are now turning to a similar model to support the curation and operation of BioCyc. We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution. We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis -- more frequently than they consult most journals. We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world. We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model. Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users. Phoenix Bioinformatics will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.
We plan to go slow with this transition to give our users time to adapt. We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now. Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals. One subscription will grant access to all of BioCyc. To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.
Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them. We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.

Please spread the word to your colleagues -- the more groups who subscribe, the better quality resource we can build for the scientific community.

Thursday, March 10, 2016

Introducing Pathway Collages...

Figure 1
Pathway Tools has long been recognized for the quality of our automatically generated individual metabolic pathway diagrams, which are intuitive to biologists, can be shown at varying levels of detail, and can be customized in various ways, including with the overlay of omics data. When a more global view is called for, our cellular overview diagram depicts the entire metabolic network for an organism, with capabilities for selective highlighting and overlay of omics data. However, to understand some biochemical situations, viewing a single pathway is insufficient, whereas viewing the entire metabolic network results in information overload. Pathway Collages, new in Pathway Tools version 19.5, are an attempt to bridge this gap, allowing users to create high-quality, customized, user-manipulable diagrams containing collections of user-specified pathways.

Pathway Collages can be explored and edited via the Pathway Collage Viewer web browser application. This application, implemented using the Cytoscape.js open-source JavaScript graph visualization library, supports panning, zooming, and all the editing and customization operations described in this post and the documentation embedded within the Pathway Collage Viewer itself. Feel free to experiment yourself with the example pathway collage online at http://biocyc.org/cytoscape-js/ovsubset.html?graph=example1&showHelp=T, or create your own following the instructions below.

Figure 2
Three example Pathway Collage figures are illustrated here. Figure 1 depicts a Pathway Collage consisting of four E. coli pathways overlaid with gene expression data. This diagram has already been manually adjusted by repositioning the pathways relative to each other and tweaking node font sizes and shapes. Metabolites that are shared between pathways are indicated by drawing connecting lines between them. 

Figure 2 shows a collage consisting of two E. coli pathways overlaid with predicted reaction flux data. In this diagram, rather than drawing connecting lines, compounds that are shared between the two pathways are merged, showing glycolysis flowing seamlessly into fermentation.
Figure 3

Figure 3 depicts a collage containing a larger number of pathways at a lower zoom level, so metabolite, enzyme and gene names are automatically suppressed (the font size of the pathway labels has been increased so those labels remain visible). In addition to manually repositioning pathways, merging some common nodes, and changing the default colors, some metabolites of interest have been highlighted in purple.

Now that you've seen what you can do with a Pathway Collage, how can you create one for yourself? Pathway Collages can be created from either the BioCyc website (or other Pathway Tools-based website) or from desktop Pathway Tools. There are five basic steps.
  1. Specify the set of pathways to be included. The simplest and most reliable way to specify a set of pathways is to generate a SmartTable containing the desired pathways, and then export the SmartTable to a Pathway Collage. This works both for the desktop and web versions of Pathway Tools, and enables you to keep your list of pathways around in case you ever want to edit it or regenerate your collage. There are other ways to specify a set of pathways, such as by interactively clicking on them in the cellular overview diagram (desktop only), from an omics dataset (web only), or by creating a seed collage from a single pathway and then interactively adding more (web only). We may add additional options to specify pathways in the future. Consult the documentation for more details.
  2. Export to Pathway Collage Viewer. Pathway Tools will compute automatic layouts of the individual pathways within the collage, then position those diagrams next to one another horizontally, and send that initial layout of the collage to the Pathway Collage Viewer application in your web browser.
  3. Interactively refine and customize the collage. This can involve repositioning items, showing connections, adding, deleting or merging elements, editing labels, highlighting elements of interest, and/or customizing node and edge styles. By default, only the metabolites along the main backbone of a pathway are included in the diagrams, but side metabolites can be added interactively. Additional pathways involving a metabolite of interest can also be added interactively.
  4. Import omics data to be visualized on the collage (optional). Omics data can be added either before or after the collage is generated. The collage can display omics data associated with either genes, metabolites, or reactions. When multi-timepoint gene expression data is displayed, the display of enzyme names is suppressed.
  5. Save or export the collage. At any time, a pathway collage can be saved as a JSON-format graph file on your computer; that file can later be loaded back into the collage viewer (not all browsers support this operation; we recommend using Chrome or Firefox). A pathway collage can also be exported to a PNG-format image file for use in presentations or publications. The image will be generated with a resolution comparable to that of the display at the time the image is created (up to some maximum); therefore, the highest-quality images are obtained if the collage is displayed at a high zoom level when exporting.
For more information on Pathway Collages, see the Pathway Tools Website User Guide or the help documentation within the Pathway Collage Viewer itself.

Monday, November 16, 2015

Everything you always wanted to know about the Enzyme Commission Part II


In this blog we will discuss a few more aspects of the Enzyme Commission and its classification work that were not covered in the previous blog.

Scope of Enzyme Classification
The classification system used by the EC aims to cover enzymes that fall under one of the following six broad categories:

Class 1: Oxidoreductases
Class 2: Transferases
Class 3: Hydrolases
Class 4: Lyases
Class 5: Isomerases
Class 6: Ligases

As you can see, transporters are not covered by the EC list unless they also catalyze an additional reaction that falls under one of these categories (e.g. the phosphoenolpyruvate-dependent phosphotransferase transporters known as PTS). While peptidases fit under class 3, the Enzyme Commission has limited the classification of peptidases in recent years due to the difficulty in drafting reactions that accurately describe the peptidase specificity.


Principles of Classification
Each top class contains several subclasses. For example, Class 4 contains the subclasses 4.1 carbon-carbon lyases, 4.2 carbon-oxygen lyases, 4.3 carbon-nitrogen lyases, etc. The subclasses, in turn, contain sub-subclasses, e.g. 4.1.1, carboxy-lyases. The sub-subclass in which an enzyme resides defines the first three fields in the enzyme’s EC number. The fourth and last field is simply a serial number within that sub-subclass.
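The four-field structure described above can be illustrated with a small parser. This is only a sketch: the names `ECNumber` and `parse_ec_number` are hypothetical and not part of any EC or Pathway Tools software, and the class table is limited to the six classes listed in this post.

```python
from dataclasses import dataclass

# Top-level EC classes exactly as listed in this post.
# (Illustrative table; extend it if the EC list gains further classes.)
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


@dataclass(frozen=True)
class ECNumber:
    """Hypothetical container for the four fields of an EC number."""
    ec_class: int      # first field: top-level class
    subclass: int      # second field: subclass within the class
    sub_subclass: int  # third field: sub-subclass within the subclass
    serial: int        # fourth field: serial number within the sub-subclass

    @property
    def class_name(self) -> str:
        return EC_CLASSES[self.ec_class]

    @property
    def sub_subclass_id(self) -> str:
        # The first three fields together identify the sub-subclass.
        return f"{self.ec_class}.{self.subclass}.{self.sub_subclass}"


def parse_ec_number(text: str) -> ECNumber:
    """Parse strings such as 'EC 4.1.1.1' or '1.6.99.1'."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    fields = cleaned.split(".")
    if len(fields) != 4 or not all(f.isascii() and f.isdigit() for f in fields):
        raise ValueError(f"expected four numeric fields, got {text!r}")
    ec_class, subclass, sub_subclass, serial = (int(f) for f in fields)
    if ec_class not in EC_CLASSES:
        raise ValueError(f"unknown top-level class {ec_class} in {text!r}")
    return ECNumber(ec_class, subclass, sub_subclass, serial)


if __name__ == "__main__":
    ec = parse_ec_number("EC 4.1.1.1")
    print(ec.class_name, ec.sub_subclass_id, ec.serial)
```

The parser accepts only fully numeric EC numbers; anything else raises `ValueError`.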

The subclasses and sub-subclasses sometimes contain the numbers 98 and 99. In general, when both of those numbers exist under the same parent class, 98 is reserved for well-characterized enzymes that do not fit the other subclasses, while 99 indicates some uncertainty about the enzyme (for example, when the identity of an electron acceptor is not known).

The principles of classification are too complex to describe here. They are described in detail at http://enzyme-database.org/rules.php.

Most of the enzymes fit well in one of the existing sub-subclasses. However, some enzymes catalyze complex reactions that do not fit any particular class. In other cases an enzyme might fit in more than one class. In these cases the commission members need to discuss the issue and decide, and occasionally a new sub-subclass is defined.


What Is The Process of Classifying An Enzyme?
Members of the Enzyme Commission create new entries using an online system called DraftEnz, which was developed by A. McDonald. The members define the exact sub-subclass to which the enzyme belongs, and at this point the entry receives a temporary internal serial number (e.g. 3.1.3.d). The new entry is reviewed by the other members of the commission, who may suggest modifications to any part of the entry. When a member is satisfied with the entry, he or she may vote for it, and when an entry has received at least two non-author votes, it is ready to move to the next stage, which is internal review.
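The voting rule in the paragraph above can be written as a short check. This is a sketch only: the function name and its arguments are hypothetical and do not describe how DraftEnz is actually implemented, and counting each member at most once is an assumption.

```python
def ready_for_internal_review(author, voters, required_votes=2):
    """Return True once enough non-author members have voted for an entry.

    author: the member who drafted the entry (hypothetical identifier).
    voters: the members who have voted for the entry so far.
    required_votes: the threshold of two non-author votes stated in the post.
    """
    # Assumption: each member counts once, and the author's own vote is ignored.
    non_author_voters = {voter for voter in voters if voter != author}
    return len(non_author_voters) >= required_votes


if __name__ == "__main__":
    print(ready_for_internal_review("member_a", ["member_b", "member_c"]))
    print(ready_for_internal_review("member_a", ["member_a", "member_b"]))
```

In the first call two non-author members have voted, so the entry qualifies; in the second only one of the two votes comes from a non-author.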

When a sufficient number of new entries have received the necessary votes, a batch of new entries is moved to internal review, at which time they can be viewed at a dedicated web page, and receive their final serial numbers. All the members of the commission are requested to review them. The internal review process ensures that all members get to review all entries, and problems that were not caught earlier are likely to be spotted.

After one month at internal review, the entries are moved to public review. At this stage the entries are visible to the public at the ExplorEnz website by clicking on the tab “New/Amended Enzymes”. The entries are kept at this stage for another month to allow sufficient time for the community to provide feedback. Once the entries clear this stage, they are moved to ExplorEnz and become official.


Some Statistics
In addition to creating new entries, the commission often revises older entries to reflect newer information that has been generated after the entries were created. Existing entries can be revised, deleted, and sometimes transferred to a different EC number. Entries are transferred if new information shows that the reaction catalyzed by the enzyme is different than what was previously thought, requiring the classification under a different sub-subclass, or if new information shows that the enzyme is identical to an enzyme that is classified under a different EC number.

Currently there are 5638 entries in the EC list of enzymes. This number does not include 664 entries that have been transferred and 303 entries that have been deleted.

Since 2010 the commission has created or modified 2221 entries. This is an impressive number for a small group of volunteers, but it is probably a drop in the bucket considering the vast number of well-characterized enzymes that have not been classified yet.


What You Can Do to Help
If you would like to help, it is straightforward to create a new EC entry! You do not even have to suggest the sub-subclass (although you can if you would like). Take a look at a few of the EC entries to get familiar with the format. Then, go to http://enzyme-database.org/forms.php and fill out the form for a new submission. Just make sure you read the information at the beginning of the form, which explains what the requirements are.

Wednesday, November 4, 2015

Everything you always wanted to know about the Enzyme Commission


If you have used BioCyc, you probably noticed that many reactions have EC numbers printed next to them. EC numbers are everywhere – in the primary literature, in annotated genomes, in databases, in online encyclopedias. Where do they come from and what exactly do they mean?

A Bit of History
In the early days the naming of enzymes was not systematic. As a result, many different enzymes were given the same name and, on the other hand, several different names were assigned to the same enzyme. Many of the names were not particularly helpful; for example, the enzyme now known as EC 1.6.99.1, NADPH dehydrogenase, was originally named “old yellow enzyme”.
To sort out the mess, Dixon and Webb introduced a classification system in their 1958 book “Enzymes”, which was based on the reaction catalyzed by the enzyme. Although it was rather limited, it provided the foundation for the current classification system. At about the same time, the International Union of Biochemistry decided to form an official international commission on enzymes to develop a better classification and naming system. The first full report of the commission was published in 1965, using a six-category system that is still used today. Although this is not the place to describe classification principles, in general each enzyme receives a unique four-component identification number that not only identifies it, but also provides insight into the enzymatic activity of the enzyme. Each EC entry provides additional information such as lists of names and synonyms, references, and often commentary. Full details about the principles of the classification system can be found at http://enzyme-database.org/rules.php and https://en.wikipedia.org/wiki/Enzyme_Commission_number.

The Present
Fast forward 50 years, and the Enzyme Commission (EC) is still going strong. The importance and usefulness of the EC numbers have only increased with time. With the explosion in sequencing volume, having an accurate genome annotation has become critical, and EC numbers provide a well-defined, unambiguous method for annotation of enzyme function. Software packages such as Pathway Tools make the most out of this information, assigning the appropriate reactions to the annotated genes based on their EC numbers when building metabolic networks for newly-sequenced genomes. The content of the enzyme list, which used to be published in books, is made available through two online databases that are updated several times a year. A searchable MySQL version of the database, including downloadable data in multiple formats, is available at the ExplorEnz database at http://enzyme-database.org. More than 5600 enzymes are currently classified, and hundreds are added each year.

Who is the EC?
The Enzyme Commission is now part of the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature (JCBN). It consists of a small number of experts who volunteer their time to the project. Active members (listed alphabetically) include K. Axelsen (Switzerland), R. Cammack (UK), R. Caspi (USA), M. Kotera (Japan), A. McDonald (Ireland), G.P. Moss (UK), D. Schomburg (Germany), I. Schomburg (Germany), and K.F. Tipton (Ireland). The commission members are using an online curation system that was developed by A. McDonald, called ExplorEnz. Members of the committee continue to classify new enzymes, modify existing entries as new information becomes available, and extend or modify the classification rules to accommodate new challenges.
If you would like to request a new EC entry for an enzyme that hasn’t been classified yet, or submit an error or update report about an existing entry, submission forms are available at http://enzyme-database.org/forms.php. Since MetaCyc curator R. Caspi is a member of the EC, you are also welcome to send EC-related questions or comments to biocyc-support@AI.SRI.COM.

Additional Information
  1.  Dixon, M. and Webb, E.C. (1958), Enzymes. Longmans Green, London, pp. 183–227.
  2.  Tipton, K. and Boyce, S. (2000) History of the enzyme nomenclature system. Bioinformatics, 16, 34-40.
  3.  McDonald, A.G., Boyce, S. and Tipton, K.F. (2009) ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res, 37, D593-597.
  4.  McDonald, A.G. and Tipton, K.F. (2014) Fifty-five years of enzyme classification: advances and difficulties. Febs J, 281, 583-592.

Friday, April 24, 2015

A New Curated BioCyc Database for Clostridium difficile


Peptoclostridium (Clostridium) difficile (commonly nicknamed “Cdiff”) is a spore-forming bacterium that causes serious healthcare-associated infections. In the United States alone, it is estimated that Cdiff infections were responsible for more than 29,000 deaths in 2011 [1]. Antibiotic resistance and recurrent infections are common problems in treating Cdiff infections.

The BioCyc collection currently contains twelve Clostridium/Peptoclostridium difficile databases; all of them can be easily accessed from a new home page, http://cdifficile.biocyc.org/. We chose the database for a strain commonly used in the laboratory, Peptoclostridium difficile 630, for a pilot project to update the genome annotation and to add literature curation.

Wednesday, April 15, 2015

Querying Databases by Organism Properties

The latest release (version 19.0) of BioCyc includes PGDBs for 5500 different organisms, and we expect that number to grow with every future release. With such numbers, unless you already have a specific species and strain in mind, it becomes impractical to browse through the complete list of organisms. We already allow users of the BioCyc website to select organisms specifically by name or taxonomic class. We describe here extensions to that selection process that enable users to search for organisms based on a larger set of properties of the organism, such as when and where the sample was collected and what kind of environment it lives in.

Friday, March 6, 2015

Procedure for Creating Metabolic Models from Sequenced Genomes




In the past, construction of quantitative metabolic flux models has been an extremely time-consuming process, requiring 12-18 months to create a bacterial model.  One of our main goals in designing the MetaFlux module for creating metabolic models within Pathway Tools has been to speed up this process by automating as many of its steps as possible, and by providing software power tools for debugging metabolic models (a viewpoint that was put forward by our colleague Jeremy Zucker).  We can now create metabolic models using MetaFlux in approximately 1 month.
 This blog surveys our recommended procedure for creating metabolic models from sequenced genomes using Pathway Tools.  

Thursday, February 26, 2015

Metabolic Modeling to Predict Organism Phenotypes


Here we explore one of the major applications of steady-state metabolic modeling: the prediction of organism growth rates under varying perturbations.  The two most common perturbations studied with metabolic models are variations in the nutrients available to the organism (e.g., changes in carbon source, nitrogen source, and oxygen availability), and the presence of gene knockouts.  These two perturbations can be combined since the effects of gene knockouts can be modeled under different nutrient mixes.