April 28, 2014

Restriction digestion of eukaryotic genomes in R

There are multiple desktop tools (Bioedit, Emboss, various basic bioinformatic tools) and browser based tools (NEBcutter, biotools, In silico restriction digestion and various other tools) available for performing restriction digestion of smaller prokaryotic genomes or smaller eukaryotic chromosomes individually. However, no desktop tool or browser based tool is available for free use to perform restriction digestion on the whole genome of eukaryotes with larger genomes. Tools like Emboss handles this sort of task but in a primitive way that is not helpful for the downstream analysis of the results right away.

I have been working on methylation analysis using RRBS method. This method is based on the restriction digestion pattern of the enzyme MspI on the whole genome. I wanted to perform in silico digestion of MspI on mouse genome(mm10) to virtually see the pattern of digestion. After scanning the web finally I narrowed down on a bioconductor package "Biostrings" that helped me achieve this task. Here, I give the code to perform this task. Since the package I used is an R package, it also helped me perform a variety of downstream analysis pretty fast.

This method is based on the ability of the "Biostrings" package to recognize the MspI restriction site (CCGG) on the mouse genome (BSgenome.Mmusculus.UCSC.mm10 bioconductor package loaded into R). Following tasks are peformed by the script below:
  • Load the needed bioconductor and R packages
  • Identify the MspI restriction sites (genomic co-ordinates) per chromosome in the genome.
  • Extract the start and end co-ordinates of the dna fragments resulting from the genomic digestion (using gaps)
  • Create a dataframe of the genomic co-ordinates of the digested fragments fro each chromosome for easier downstream analysis
  • Plot the frequency of the length of digested fragments using ggplot2
  • A reproducible document is also available at R pubs

Update on 19th August 2014:
I came to know that some people tried to replicate this code and noticed some errors on their side. So, I would say that some issues may crop up when the version of the libraries change. Users may note that the above code is working with the following session information in R.
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8      
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C             
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] scales_0.2.4                          plyr_1.8.1                          
 [3] ggplot2_1.0.0                         BSgenome.Mmusculus.UCSC.mm10_1.3.1000
 [5] BSgenome_1.32.0                       GenomicRanges_1.16.3                
 [7] GenomeInfoDb_1.0.2                    Biostrings_2.32.1                   
 [9] XVector_0.4.0                         IRanges_1.22.9                      
[11] BiocGenerics_0.10.0                 

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2   
 [6] labeling_0.2     MASS_7.3-33      munsell_0.4.2    proto_0.3-10     Rcpp_0.11.2    
[11] reshape2_1.4     Rsamtools_1.16.1 stats4_3.1.1     stringr_0.6.2    tools_3.1.1    
[16] zlibbioc_1.10.0

April 8, 2014

ENCODE Transcription factor tracks

I was trying to overlap the differentially methylated CpGs at the transcription factor binding sites for my data. But I am puzzled to see that the ENCODE consortium which has spent huge amount of money on performing experiments and publishing papers, did not pay much attention to giving details of the tracks and their naming structure as a quick reference that can be easily found. After browsing a lot on the net, I got the clue from this webpage about the details of the transcription factor chip experiments.

In this post I summarize the details about how to download all the tracks and understanding the naming of the tracks.

Transcription factor related ChIP-seq tracks are available for individual download from this link. However, for those wanting to perform analysis from all the tracks, it is painful to download each track and merge the tracks ensuring the identity of each track. Fortunately, I found that Sartor lab has created a bed file merging all the ENCODE TF tracks. This file also ensured the identity of each row of the track by labelling its source. This could be easily converted into 'GRanges' object either by custom functions in base R or using import function from rtracklayer package from Bioconductor.

Another issue with these tracks is their naming structure. While the consortium ensured that every track name includes all the necessary information, its structure was not documented anywhere (that could be easily obtained). After few hours of browsing and collecting the information, I understood the structure of the TF track name as follows:

  • List of tracks and the related metadata for the tracks is available for download from this webpage (click the files.txt link)
  • The downloaded file is not uniformly tabulated. So, I had to fiddle with it to make it look uniform
  • Further I extracted only the details that matter to understand the file/track name. You may download this file from here.

Here I will explain the name of one track as an example:
Track name = wgEncodeAwgTfbsSydhK562CjunIfna6hUniPk
Every track name includes the following elements:
  1. Text common to all: wgEncodeAwgTfbs
  2. Laboratories involved in generating the track: Sydh (Labs from Stanford,yale, USC, Harvard)
    • Other possible values appearing in the track names are (Haib, UChicago, Uta, UW)
  3. Cell line used=K562
    • There are over 92 types of cell lines used. So, this part is highly variable. Details of the cell types used is available from this page.
  4. Antibody used = c-Jun
    • There are over 190 antibodies used (=TFs probed). This is also a vairable part of the track name. In some cases, they have provided the catalog number of the antibody purchased.
  5. Treatment=ifna6h (Means cells are treated with IFNA for 6 hrs)
    • This part gives details of the cell treatment. When cells are not subjected to any treatement, there is no mention of this part. Overall, there are 29 variables at this place. When the treatment is for 36h, it is taken as standard and not mentioned in the name of the track.
  6. Algorithm used=UniPk (Uniform peak calling). Common for every track!
I hope, this information is useful for others doing a similar analysis.

Update (26th June 2014)

Here are few more links that give additional information about the ENCODE transcription factor ChIP experiments.
  1.  List of antibodies registered at Data coordination center (DCC), ENCODE is available at  http://genome.ucsc.edu/ENCODE/antibodies.html. This link provides the list of antibodies and their targets probed in ENCODE.
  2. For  those interested in the comprehensive list of transcription factors across various genomes this link may be useful. However these are predicted transcription factors and not human curated. http://www.bioguo.org/AnimalTFDB/index.php
  3. For those who want know about the type of experiments performed in ENCODE, this page may be helpful (http://www.encodeproject.org/ENCODE/dataMatrix/encodeDataSummaryHuman.html). This page also lists the number of experiments performed. Also lists the various ChIP-seq experiments.
  4. Finally FAQs are available at http://www.encodeproject.org/ENCODE/FAQ/ 
  5. Human curated list of transcription factors is available at http://www.tfcat.ca/ . Reference article for the same is at http://genomebiology.com/2009/10/3/R29