Wednesday, June 8, 2016

Bulk Updates to Your PGDB

One question that we frequently receive is about how to apply bulk updates to a PGDB. This kind of situation can come about for several reasons:
  • When a group maintains and curates organism data on an ongoing basis using their own software or database environment, and then wants to update a PGDB with all their changes in a single batch operation.
  • When a revised annotation for an organism is made available, and a user wishes to update their PGDB with the new data without losing any existing curation.
  • When a user has some systematic change that they want to apply to a large number of objects, such as a change to the locus id format, the addition of a new set of synonyms, or the addition of links to a new external database.
  • When a user wants to import a large dataset obtained via a high-throughput experiment or computational prediction, such as for protein cellular location or transcription factor binding sites.
Because these are all common scenarios, it seems worthwhile to provide an overview of the various ways that Pathway Tools supports bulk updating of PGDBs.  Note that none of the features discussed here are particularly new, and all have been supported by Pathway Tools for several years.  All User Guide section numbers referenced below are for version 20.0.

Pathway Tools comes with a full suite of editing and curation tools, so if you have only a handful of changes to make, you should use those tools to make the edits interactively. The techniques described in this article are normally worthwhile only when you have so many updates that it would be tedious to make the edits manually.



Bulk Imports Using Specialized Tools


For certain kinds of bulk data imports, we have developed import tools specific to that data type.  These include the following:

  • Links to an external database: Use the command File -> Import -> DB Links from File.  For more information and file format, see User Guide section 9.5.10.4, Bulk Import of Links from a File.
  • Phenotype microarray data: Use the command File -> Import -> Phenotype Microarray Data from Spreadsheet or OPM.  For more information, see User Guide section 3.10.1, Importing Phenotype Microarray Data.
  • Gene essentiality data: If the growth medium object does not yet exist, first create it manually using the command File -> Create -> Growth Medium.  Then, from the growth medium page, right-click on the growth medium name and select the command Edit -> Import Knockout Data from File.
  • Transcription factor binding sites: Currently the only way to upload this data is by using the Lisp API.  However, detailed instructions for doing this have been provided in an earlier blog post.

Import or Update from Spreadsheet


For updates that all affect the same type of object, and that only change individual string or numeric data values, not relationships between objects, it is relatively simple to upload the changes or additions from a spreadsheet file.  See User Guide section 5.6, Frame Import/Export, for a description of this feature.  In general, rather than trying to create the spreadsheet from scratch, it is recommended that you first export the objects (frames) and any relevant attributes (slots) from the PGDB to a spreadsheet file using the command File -> Export -> Selected Frames to File, and then manipulate the spreadsheet data as needed before attempting to import it back into the PGDB using the command File -> Import -> Frames from File.  This will ensure that the file format is correct, and the correct frame identifiers are in place.

For example, suppose you wish to update a large number of gene names, and you have a mapping from locus ids (which PathoLogic stores in the ACCESSION-1 slot of each gene frame) to revised gene names. You would export the genes with their ACCESSION-1 and COMMON-NAME slots. The resulting text file, which can be imported into a spreadsheet program, will have a column for frame identifiers, columns for the slots you selected, and possibly some other columns that you should not touch.  You can then substitute your updated names in the spreadsheet's COMMON-NAME column based on the locus id values in the ACCESSION-1 column, without changing any of the other fields. It is fine to reorder rows or to delete rows that have no changes. Finally, save the spreadsheet as a text file and reimport it.  On import, you can specify whether changed values should replace the existing values in the database or be added to them. For the COMMON-NAME example, you would replace the existing values, but if you were using this feature to add new synonyms or GO terms, you might want to keep both old and new values.
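
If the mapping is large, the substitution step can be scripted rather than done by hand in a spreadsheet program. The following Python sketch is one possible way to do it, not a description of how Pathway Tools itself works. It assumes that the exported file is tab-delimited and that its first line is a header row containing the slot names ACCESSION-1 and COMMON-NAME; check your actual export before relying on this, because the real layout may differ. The function name update_common_names and the sample frame ids are hypothetical.

```python
def update_common_names(export_text, new_names):
    """Return a copy of the export with COMMON-NAME replaced for mapped loci.

    ASSUMPTION: export_text is tab-delimited with a single header row that
    names the ACCESSION-1 and COMMON-NAME columns. All other columns are
    passed through unchanged.
    new_names maps locus id -> revised gene name.
    """
    lines = export_text.splitlines()
    header = lines[0].split("\t")
    locus_col = header.index("ACCESSION-1")
    name_col = header.index("COMMON-NAME")

    out = [lines[0]]  # keep the header row exactly as exported
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) <= max(locus_col, name_col):
            continue  # skip blank or short lines
        locus = fields[locus_col]
        if locus in new_names:
            fields[name_col] = new_names[locus]
            out.append("\t".join(fields))
        # Rows with no revised name are dropped; deleting unchanged rows
        # is fine for the import.
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical export contents and mapping, for illustration only.
    sample = (
        "FRAME\tACCESSION-1\tCOMMON-NAME\n"
        "G-1\tb0001\toldA\n"
        "G-2\tb0002\toldB\n"
    )
    print(update_common_names(sample, {"b0002": "newB"}), end="")
```

Only the COMMON-NAME field of matching rows is rewritten, so the frame identifiers and any extra columns stay exactly as they were exported.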

Although the spreadsheet import can also be used to create new frames or update relationships between frames, in most cases we do not recommend using it in this way, as you are unlikely to be able to recreate all the necessary relationships between frames without an intimate knowledge of the Pathway Tools schema, and it is easy to introduce errors in this way.

Update for Revised Annotation


[Figure: A panel summarizing the changes in a revised annotation file.]

If you have a set of updates that includes more than one type of change to genes or proteins, or involves creating new genes or updating functional assignments, it is best to treat this as an update to the annotation. Open PathoLogic (Tools -> PathoLogic), and use the command Build -> Update Build for New Annotation.  If you actually have an updated annotation file in either GenBank or PathoLogic format, you can supply it here.  If not, you will have to generate one that contains your updates (it should be relatively straightforward to write a program that outputs data from a relational database to a PathoLogic format file). See User Guide section 7.8, Update PGDB Genome Annotation, for more information.  This tool will compare the contents of the supplied file with the existing data in the PGDB to see what has changed, and bring up a dialog that summarizes the changes.  You will be able to accept or reject each class of changes, either en masse or on an individual basis.
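
As a rough illustration of the "program that outputs data from a relational database" mentioned above, here is a minimal Python sketch using SQLite. The table name genes and its columns are hypothetical stand-ins for whatever your own database uses. The record layout (tab-separated attribute and value pairs such as ID, NAME, STARTBASE, ENDBASE, FUNCTION and PRODUCT-TYPE, with each record terminated by a line containing //) reflects our understanding of the PathoLogic format and should be treated as an assumption; consult the User Guide for the authoritative attribute list before using the output.

```python
import sqlite3


def genes_to_pathologic(conn):
    """Render rows of a hypothetical `genes` table as PathoLogic-style records.

    ASSUMPTION: the attribute names and the '//' record terminator follow
    the PathoLogic file format; verify against the User Guide.
    """
    query = (
        "SELECT locus_id, name, start_base, end_base, function, product_type "
        "FROM genes ORDER BY start_base"
    )
    chunks = []
    for locus_id, name, start, end, function, product_type in conn.execute(query):
        record = [
            f"ID\t{locus_id}",
            f"NAME\t{name}",
            f"STARTBASE\t{start}",
            f"ENDBASE\t{end}",
            f"FUNCTION\t{function}",
            f"PRODUCT-TYPE\t{product_type}",
            "//",
        ]
        chunks.append("\n".join(record) + "\n")
    return "".join(chunks)


if __name__ == "__main__":
    # Build a tiny in-memory database with made-up data for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (locus_id TEXT, name TEXT, start_base INTEGER, "
        "end_base INTEGER, function TEXT, product_type TEXT)"
    )
    conn.execute(
        "INSERT INTO genes VALUES ('LOC_0001', 'abcA', 100, 400, "
        "'hypothetical transporter', 'P')"
    )
    print(genes_to_pathologic(conn), end="")
```

The resulting text can be written to a file and supplied to Build -> Update Build for New Annotation, which will then report how it differs from the existing PGDB.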

The Pathway Tools APIs


If you know how to write simple computer programs, another way to perform bulk updates is via one of the Pathway Tools APIs.  APIs are available for Lisp, Perl, Java and Python.  If you do not already have a language preference, we recommend learning the Lisp API: Pathway Tools itself is written in Lisp, so it is easier to interact with the software and troubleshoot problems using the Lisp interface. More information and documentation with some basic examples are available at http://bioinformatics.ai.sri.com/ptools/ptools-resources.shtml.  Almost any kind of update can be implemented using the APIs, but be careful when writing a program that makes anything more than the simplest kinds of changes, as bad data can trigger errors in the Pathway Tools software.  Be sure to keep a backup copy of your PGDB in case you need to revert.
