Wednesday, April 15, 2015

Querying Databases by Organism Properties

The latest release (version 19.0) of BioCyc includes PGDBs for 5500 different organisms, and we expect that number to grow with every future release. With such numbers, unless you already have a specific species and strain in mind, it becomes impractical to browse through the complete list of organisms. We already allow users of the BioCyc website to select organisms specifically by name or taxonomic class. We describe here extensions to that selection process that enable users to search for organisms based on a larger set of properties of the organism, such as when and where the sample was collected and what kind of environment it lives in.

A standard for encoding this kind of "metadata" about an organism sample was proposed by the Genomic Standards Consortium (GSC) in a 2008 paper in Nature Biotechnology. The resulting MIGS (Minimal Information about a Genome Sequence) and related standards cover a vast assortment of organism properties, environmental conditions, and sample collection and treatment details. Organizations are starting to include this data in their sequencing pipelines, such that some of this data now appears in submitted GenBank files and in online databases such as GOLD (the Genomes OnLine Database) and NCBI BioSample.

Rather than attempt to encode the entire MIGS standard in Pathway Tools, we identified a handful of organism and sample properties that we considered to be of general interest, and for which significant quantities of data are available. We updated our GenBank input file parsers to automatically extract this data, if present, when creating a new PGDB, and we extended our editing tools to allow a PGDB curator to supply or edit this data manually. In addition, we imported data into many of our existing PGDBs where we could find it from a variety of online sources, primarily the NCBI BioProject and BioSample databases and PATRIC. About 60% (3251) of the PGDBs currently in BioCyc now have data for at least one of these properties. The attributes we chose to represent are the following, with the numbers of PGDBs that have data for each attribute included in parentheses after the attribute name (more detailed descriptions available in the BioCyc Website User Guide):
  • Environment (2618): This property encompasses terms that describe the environmental features, habitats and materials where the sample was taken. This can include biome-level terms, such as desert, deciduous woodland, coral reef; geographic features such as harbor, cliff, lake; environmental material such as air, soil, water; and/or host environment, such as blood, skin, gut. This attribute combines the MIGS concepts biome, feature, material, body_habitat, body_site and body_product.
  • Geographic Location (1616): The geographical origin of the sample, defined by country or sea name, and/or specific region name.
  • Latitude (155): The latitude of the geographical origin of the sample.
  • Longitude (155): The longitude of the geographical origin of the sample.
  • Depth/Altitude (28): The depth or altitude in meters at which the sample was collected.
  • Collection Date (965): The date the sample was collected.
  • Relationship to Oxygen (573): Whether the organism is an aerobe or anaerobe, and of what type.
  • Trophic Level (101): The position of the organism in a food chain.
  • Temperature Range (317): A qualitative description of what kind of temperature range the organism grows best in, e.g. mesophile, psychrophile, thermophile, hyperthermophile.
  • Biotic Relationship (286): Whether the organism is free-living or in a host, and if the latter, what type of relationship is observed.
  • Pathogenicity (15): The general class of organisms to which the organism is pathogenic.
  • Host (1358): The host from which the sample was isolated.
  • Human Microbiome Body Site (1371): Specifically for samples collected as part of the Human Microbiome Project (or other human-host samples for which data is available), the general body site from which the sample was collected, e.g. oral, blood, gastrointestinal tract, etc.
  • Health/Disease State (377): The health or disease state of the specific host at the time of collection.
  • Ploidy (3): The ploidy level of the genome, e.g. haploid, diploid, triploid, allopolyploid.
Currently, organism properties can be queried only when Pathway Tools is running in web server mode (such as for the BioCyc website). Whether you are selecting a single PGDB to visit, or selecting multiple organisms for comparative operations or cross-organism search, the By Organism Properties tab lets you build complex queries that combine one or more of the above properties. Each query produces a table of results from which one or more organisms can be selected. Two example queries and their results are illustrated in the screen snapshots below.
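
To make the idea of combining properties in one query concrete, here is a minimal Python sketch. It is not how Pathway Tools implements the By Organism Properties tab; the record type, the helper functions, and the four placeholder organisms are all hypothetical and exist only for illustration. Only the property names and example values come from the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OrganismRecord:
    """Hypothetical stand-in for one organism's entry in a property index."""
    name: str
    properties: Dict[str, str] = field(default_factory=dict)


Predicate = Callable[[OrganismRecord], bool]


def has_value(prop: str, value: str) -> Predicate:
    """Match records whose property equals the value, ignoring case."""
    def check(rec: OrganismRecord) -> bool:
        actual = rec.properties.get(prop)
        return actual is not None and actual.lower() == value.lower()
    return check


def all_of(*preds: Predicate) -> Predicate:
    """Combine predicates with AND."""
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Combine predicates with OR."""
    return lambda rec: any(p(rec) for p in preds)


def run_query(index: List[OrganismRecord], pred: Predicate) -> List[str]:
    """Return the names of all records that satisfy the predicate."""
    return [rec.name for rec in index if pred(rec)]


# Placeholder records (not real BioCyc data).
index = [
    OrganismRecord("Organism A", {"Temperature Range": "thermophile",
                                  "Relationship to Oxygen": "anaerobe"}),
    OrganismRecord("Organism B", {"Temperature Range": "mesophile",
                                  "Relationship to Oxygen": "aerobe"}),
    OrganismRecord("Organism C", {"Temperature Range": "hyperthermophile",
                                  "Relationship to Oxygen": "anaerobe"}),
    OrganismRecord("Organism D", {"Host": "Homo sapiens"}),
]

# Anaerobes that are either thermophiles or hyperthermophiles.
query = all_of(
    has_value("Relationship to Oxygen", "anaerobe"),
    any_of(has_value("Temperature Range", "thermophile"),
           has_value("Temperature Range", "hyperthermophile")),
)
matches = run_query(index, query)
print(matches)
```

Records that lack a property, such as the placeholder Organism D, simply fail any condition on that property. That matters in practice, because many PGDBs have data for only some of the attributes.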

Using an Organism Properties query to select a single database for browsing.

Using an Organism Properties query to select multiple databases for a cross-organism search or comparison operation.

Implementation Details

The organism and sample property data for a PGDB is stored directly within that PGDB. However, to query the data efficiently across many PGDBs, we need an index file for this data. This index is known as the PGDB-Metadata-KB. If you are running Pathway Tools locally and create a new PGDB using PathoLogic, an entry for your PGDB is automatically added to the PGDB-Metadata-KB (regardless of whether your input file contains any of the organism properties we collect). In general, this process is transparent to the user.

However, we have seen cases in which users ran multiple instances of Batch PathoLogic at the same time and encountered concurrency problems because multiple processes attempted to update the PGDB-Metadata-KB at once. If you encounter this problem, you can turn off the indexing altogether by supplying the -disable-metadata-saving command line argument when invoking Pathway Tools. If you do this, and later make your PGDBs available by running a Pathway Tools web server, you will not be able to query by organism properties unless you manually build the index. To build the index, you must start Pathway Tools in Lisp mode (i.e. using the -lisp command line argument) and enter the following Lisp command:


Once this is complete, you can exit Pathway Tools, and the indexed data should be available the next time you start it. Alternatively, you can remove the By Organism Properties tab altogether by supplying the -disable-metadata-tab command line argument when starting your web server.
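
For scripted setups, the three command line arguments mentioned in this section can be collected in a small helper. The Python sketch below only assembles an argument list and does not launch anything. The executable name `pathway-tools` and the `build_command` helper are assumptions made for illustration, so adjust them to your installation. Only the three flags come from the text above.

```python
from typing import List

# Assumed executable name; substitute the path used by your installation.
EXECUTABLE = "pathway-tools"


def build_command(lisp: bool = False,
                  disable_metadata_saving: bool = False,
                  disable_metadata_tab: bool = False,
                  executable: str = EXECUTABLE) -> List[str]:
    """Assemble an argument list from the flags described in this post."""
    args = [executable]
    if lisp:
        # Start in Lisp mode, e.g. to build the index manually.
        args.append("-lisp")
    if disable_metadata_saving:
        # Skip PGDB-Metadata-KB updates, e.g. for parallel batch runs.
        args.append("-disable-metadata-saving")
    if disable_metadata_tab:
        # Hide the By Organism Properties tab on a web server.
        args.append("-disable-metadata-tab")
    return args


batch_cmd = build_command(disable_metadata_saving=True)
index_cmd = build_command(lisp=True)
print(" ".join(batch_cmd))
print(" ".join(index_cmd))
```

Any other arguments your setup needs, such as those that select batch or web server mode, are not covered here and would have to be added to the list yourself.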
