Thursday, April 28, 2011

Talkin' 'Bout My Regulation

The Regulation Summary Diagram for the bglG gene in EcoCyc


You may have seen our new regulation summary diagrams on our EcoCyc gene pages. Or played with the regulatory overview. And you may have wished that you could get those visualizations for your own PGDBs. Unfortunately, there is as of yet no equivalent of PathoLogic for regulation -- no tool that will infer regulatory relationships, transcription factor sites, etc. from the genome annotation. Much of the regulatory data in EcoCyc was painstakingly curated from the primary literature (either by us or by the folks at RegulonDB) and entered piece by piece using our curation tools. That makes it an extremely valuable resource, but makes it difficult to replicate in other PGDBs without expending an equivalent amount of effort.

The Regulatory Overview Diagram for EcoCyc

Fortunately, there are other, faster ways to generate regulation data. High-throughput experiments and computational prediction programs can identify regulatory relationships and/or transcription factor binding sites en masse, and a number of groups have generated such data for their own organisms. The question that remains is how to bulk-load that data into a PGDB.



Currently the only way to bulk-load regulation data is via one of the programmer APIs. Pathway Tools has APIs in three languages: Lisp (the native language of Pathway Tools), Perl and Java. PerlCyc and JavaCyc, the Perl and Java APIs, were created and are maintained elsewhere, and must be obtained separately from Pathway Tools. They also lag behind the Lisp API in terms of functionality, although they are easy to extend. While you are of course free to write your own program using any of the APIs, the Lisp API already includes a program to load transcription factor regulatory relationships into a PGDB from a file.

Several different groups have approached us asking for help in uploading their transcription factor data. Unfortunately, they all seem to have slightly different file formats and types of data they wish to include. Some include specific binding site locations, others just an indication of which transcription factor regulates which gene. Some include the mode of regulation (activating or inhibiting), whereas in other cases (such as for sequence-based predictions of binding-site locations) the mode is unknown. Thus, we have written a short Lisp program that is flexible enough to handle these differences. It is not, however, as flexible as I'd like it to be. Ideally, I'd like to extend it to allow you to enter types of regulation other than regulation of transcription initiation by transcription factors. And I'd like to be able to enter a different citation and/or evidence code (or perhaps more than one) for each individual interaction. And of course, eventually, it should become part of the GUI, which would make it much easier for our users to access. However, none of that has happened yet, and we have users right now who want access to this kind of functionality, so in the meantime here are some instructions for how to upload transcription factor regulation into Pathway Tools version 15.0.

Create a tab-delimited text file. Each line of the file should contain at the very least a transcription factor and a regulated gene. It can optionally contain binding-site left and right coordinates, and the mode of regulation (+ or -). A single citation and evidence code can be supplied for the entire dataset. Regulation is assumed to be regulation of transcription initiation by transcription factor binding. The columns should contain the following information:
  • Transcription Factor: The gene for the transcription factor (TF). If the transcription factor is a complex or binds a ligand or undergoes some post-translational modification, you may instead prefer to supply the actual form of the protein that is the active transcription factor. In this case, you can specify the protein frame id instead, but it must already be recognized as a TF (meaning at least one transcriptional regulation interaction for it must have been hand-curated).
  • Regulated Gene: A gene name or identifier. This will be ignored if it is not the first gene in its transcription unit (transcription factors are assumed to regulate all the genes in a transcription unit together).
  • Binding Site Left and Right Coordinates (optional): absolute coordinates on the genome. Since only center positions and length are actually stored in the database, it doesn't matter if left and right are switched.
  • Mode (optional): + (activation) or - (inhibition)
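
Before loading, it can help to sanity-check the file outside Pathway Tools. The following is a minimal Python sketch, not part of Pathway Tools: the function names `check_rows` and `site_center_and_length` are made up for illustration, the default column layout simply follows the order of the list above, and the length calculation assumes the coordinates are inclusive (an assumption, not something the loader documents).

```python
def check_rows(text, tf_col=0, gene_col=1, left_col=2, right_col=3, mode_col=4):
    """Return (line_number, problem) pairs for rows that look malformed.

    Column indices are 0-based; pass None for a column the file lacks.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        cells = line.split("\t")

        def cell(col):
            # Missing or absent columns are treated as empty strings.
            if col is None or col >= len(cells):
                return ""
            return cells[col].strip()

        # A transcription factor and a regulated gene are the minimum.
        if not cell(tf_col) or not cell(gene_col):
            problems.append((lineno, "missing transcription factor or gene"))
            continue
        # Coordinates are optional, but must be whole numbers if present.
        for col in (left_col, right_col):
            value = cell(col)
            if value and not (value.isascii() and value.isdigit()):
                problems.append((lineno, "non-numeric coordinate"))
                break
        # Mode is optional, but must be + or - if present.
        if cell(mode_col) not in ("", "+", "-"):
            problems.append((lineno, "mode must be + or -"))
    return problems


def site_center_and_length(left, right):
    """Center and length of a site; the result is the same if left and
    right are swapped. Assumes inclusive coordinates (hypothetical)."""
    lo, hi = sorted((left, right))
    return (lo + hi) / 2, hi - lo + 1


# Placeholder rows for illustration only (not real genes or coordinates).
sample = (
    "tfA\tgeneA\t100\t119\t+\n"
    "tfB\tgeneB\t\t\t\n"
    "tfC\t\t10\t20\t-\n"
    "tfD\tgeneD\tabc\t30\t?\n"
)
for lineno, problem in check_rows(sample):
    print(f"line {lineno}: {problem}")
```

The second helper only illustrates why swapped left and right coordinates are harmless: both orderings yield the same center and length.
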
Once you have created this file, start up pathway-tools with the -lisp option to access the Lisp API. The GUI will not come up -- instead, you will get a Lisp prompt that looks something like this:

EC(1):

You can bring up the GUI by typing (pt) at the prompt and hitting Enter. Bring up the GUI and select your organism, then exit the GUI -- this will not exit the program, but will just return you to the Lisp prompt.

The simplest way to load your transcription-factor regulation is to type the following at the prompt:

    (load-predicted-bsites-from-file "/my-path/my-file")

substituting in your own path and filename, of course. This assumes that the file contains five columns, in this order: the TF, the binding-site left coordinate, the binding-site right coordinate, the target gene and the mode. It also assumes that the regulation was predicted computationally (as you can perhaps tell from the function name, the program was originally written to load predicted binding sites -- though it was later extended, the name was left unchanged), so it assigns the evidence code EV-COMP-AINF to all the interactions it creates (see our evidence code ontology for more information about evidence codes) and does not assign any literature citation. You can customize the function call, however, based on the information you actually have. For example, if your data comes from a high-throughput experiment that doesn't give you binding sites, you might have a file with three columns (the TF, the target gene and the mode), and you might want to assign the evidence code EV-EXP-IEP-GENE-EXPRESSION-ANALYSIS and a link to a publication that describes the experiment. In that case, you'd load the data like this:

    (load-predicted-bsites-from-file "/my-path/my-file"
                :tf-column 0
                :gene-column 1
                :mode-column 2
                :bsite-left-column nil
                :bsite-right-column nil
                :ev-code 'EV-EXP-IEP-GENE-EXPRESSION-ANALYSIS
                :cit "123456")

substituting in your path and filename, and the PubMed identifier of your publication in place of "123456". Remember to count the column numbers starting at 0, not 1, and include the value NIL for any data items that are missing from your input file.
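
If you load files with several different layouts, the zero-based counting and the NIL placeholders are easy to get wrong by hand. Below is a small Python sketch that writes out the Lisp call for a given column layout. It is only a text generator: the helper name `build_load_call` and the short column names ("tf", "gene", "mode", "left", "right") are invented for this example, and the only Pathway Tools names it uses are the function and keywords shown above. It assumes that spelling out every column keyword explicitly is acceptable to the loader.

```python
# Maps the short names used in this sketch to the loader's keyword arguments.
KEYWORDS = {
    "tf": "tf-column",
    "gene": "gene-column",
    "mode": "mode-column",
    "left": "bsite-left-column",
    "right": "bsite-right-column",
}


def build_load_call(path, layout, ev_code=None, citation=None):
    """Return the text of a load-predicted-bsites-from-file call.

    layout lists the columns in file order, e.g. ["tf", "gene", "mode"].
    Columns not in the layout are emitted as nil; indices start at 0.
    """
    lines = [f'(load-predicted-bsites-from-file "{path}"']
    for name, keyword in KEYWORDS.items():
        index = layout.index(name) if name in layout else None
        lines.append(f"    :{keyword} {'nil' if index is None else index}")
    if ev_code is not None:
        lines.append(f"    :ev-code '{ev_code}")
    if citation is not None:
        lines.append(f'    :cit "{citation}"')
    return "\n".join(lines) + ")"


call = build_load_call(
    "/my-path/my-file",
    ["tf", "gene", "mode"],
    ev_code="EV-EXP-IEP-GENE-EXPRESSION-ANALYSIS",
    citation="123456",
)
print(call)
```

The printed text can then be pasted at the Lisp prompt.
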

As the program runs, it will print out a listing of any lines that could not be processed. Once it is done, you can bring up the GUI with (pt) again and examine the data to make sure it looks OK before either saving or reverting your database.
