October 15, 2013

Reduced Representation Bisulfie Sequencing Data Analysis

I am writing this post to help researchers trying to analyse their own RRBS data. This is non-technical explanation. Just following the steps may help on any ubuntu/linux system. Requirements on the computer:
  1. Bowtie (Manual and Source available from here)
  2. Human genome in fasta format (Download hg19.2bit from here)
  3. Cutadapt (Manual and installation instructions are detailed here)
  4. Trim galore (Perl script is available here)
  5. Bismark (Excellent resource is available here)
  6. Methylkit (useful R package for downstream analysis on this google code page)
  7. Install all the above scripts/programs in the directories in your PATH or export it.
#! /usr/bin/perl
#This programme is intended to run all the preliminary rrbs steps starting from Trimming, Mapping to Lambda, Mapping to human genome, methylation extraction, .bed file making and 5x, 10x coverage filtering.
#This program expects all the files to be present in one folder and they end with '.fastq'. So make sure that no other '.fastq' file exists in the folder.
#Collects the list of files to be analysed
@files = `ls | grep ".fastq\$"`;
$num_files = scalar@files;
print "Files to be analysed for the RRBS are the following $num_files files:\n##########\n@files############\n";
#performs the RRBS analysis for each file
foreach $trim_file(@files) {
chomp;
#Quality and Adapter trimming is performed
trim_galore ($trim_file);
@file_words = split(/\./, $trim_file);
$root = @file_words[0]; #extracts the file stub to be used
$trimmed_file = "$root"."_trimmed.fq"; #print "$trimmed_file\n";
#Performs the removal of endfilled two nucleotides in a full length sequence (50bp single end read)
rm_endfilled_c ($trimmed_file, $root);
$final_file = "$root"."_trim.fastq";
$lambda = "$root"."_lam";
$hg19 = "$root"."_hg19";
$remove = "$final_file"."_C_to_T.fastq";
#performs mapping with bismark first with lambda genome and next with hg19 (previously prepared by bismark)
bismark_lam ($final_file, $lambda);
bismark_hg19 ($final_file, $hg19);
system ("rm $remove");#removes the temporary CT converted file
$path = `pwd`; #print "$path\n";
chomp $path;
$dir = "$path"."/"."$hg19"; #print "$path\n";#changes directory into the bismark result
chdir ($dir)|| die "fuck the $dir\n";
methylation_extractor_bedmake ($root);
chdir ($path) || die "Can not move back to $path\n";
}
open (REP, ">bismark_report.txt");
print REP "======================= Extracting the report for lambda mapping ======================\n";
system("mkdir bismark_reports");
chdir ("bismark_reports");
system ("cp ../*lam/*Bismark_mapping_report.txt .");
@files = `ls`;
extract_bismark_mapping_report(@files);
print REP "======================= Extracting the report for hg19 mapping ======================\n";
system ("cp ../*hg19/*Bismark_mapping_report.txt .");
@files = `ls`;
extract_bismark_mapping_report(@files);
chdir ($path);
system ("rm -r bismark_reports");
merge_strand_coverage (@files, $dir);
convert_combinedBED_to_methylKitformat(@files);
sub convert_combinedBED_to_methylKitformat {
foreach $root (@files) {
$root =~ s/_trim.fastq_Bismark_mapping_report.txt//;
$file = $root."_CpG_combindcov_5x.bed"; #print "$file\n";
print ("Converting $root bed file into Methylkit format \n");
$output = $root."_CpG_combindcov_5x.txt";
open (OU, ">$output");
print OU ("chrBase\tchr\tbase\tstrand\tcoverage\tfreqC\tfreqT\n");
convert ($file);
close OU;
}
}
sub convert {
open (FI, $file);
while (<FI>) {
@line = split (/\t/, $_);
@cov = split (/[(:+)]/, $line[3]); #print "@cov\n";
$per_T = 100 - $line[4];
print OU ("$line[0].$line[2]\t$line[0]\t$line[2]\tF\t$cov[1]\t$line[4]\t$per_T\n");
}
}
sub merge_strand_coverage {
foreach $root (@files) {
$root =~ s/_trim.fastq_Bismark_mapping_report.txt\n//;
chdir ($dir);
$otfile=$root."_OT_cpg_1x.bed";
$obfile=$root."_OB_cpg_1x.bed";
print "Merging the coverage information from both strands\n";
%othash = create_hash ($otfile);
%obhash = create_hash ($obfile);
$outfile = $root."_CpG_combindcov_5x.bed";
open (OUT, ">tmp");
foreach $key (keys %obhash) {
if (exists $othash{$key}) { #checks if the corresponding CpG info exists for the plus strand
($ot_totalcov, $ot_met,$ot_nonmet,$ot_cpgpos) = getcpginfo ($othash{$key}); #extracts the coverage info from + strand
($ob_totalcov, $ob_met,$ob_nonmet,$ob_cpgpos) = getcpginfo ($obhash{$key}); #extracts the coverage info from - strand
# print "$ot_totalcov $ot_met $ot_nonmet $ot_cpgpos $ob_totalcov $ob_met $ob_nonmet $ob_cpgpos\n";
$new_met = $ot_met + $ob_met; #Calculates the new coverage info
$new_nonmet = $ot_nonmet + $ob_nonmet;
$new_totalcov = $ot_totalcov + $ob_totalcov;
# print "$new_met\t$new_nonmet\t$new_totalcov\n";
$meth_percentage = $new_met*100/$new_totalcov;
if ($meth_percentage <= 20) { $rgb = "124,252,0";} # gives the color lawn green
elsif ($meth_percentage <= 40 ) { $rgb = "105,105,105";} #color Sienna
elsif ($meth_percentage <= 60 ) { $rgb = "0,0,255";} #color Blue
elsif ($meth_percentage <= 80 ) { $rgb = "0,0,0";} #color Black
elsif ($meth_percentage > 80) { $rgb = "255,0,0";} #color Red
@newline = split (/\t/, $othash{$key});
$chr = $newline[0];
$pos1 = $newline[1];
$pos2 = $newline[2];
$newcpg = "CpG-".$pos2."(".$new_totalcov.":".$new_met."+".$new_nonmet.")";
if ($new_totalcov >=5) {
print OUT"$chr\t$pos1\t$pos2\t$newcpg\t$meth_percentage\t+\t0\t0\t$rgb\n"; #prints the merged coverage info
}
}
elsif (!exists $othash{$key}) {#If the corresponding coverage info doesnt exists for the +strand
@newline = split (/\t/, $obhash{$key});
$chr = $newline[0];
$pos1 = $newline[1]-1;#converts the co-ordinates into + strand co-ordinates
$pos2 = $newline[2]-1;
($ob_totalcov, $ob_met,$ob_nonmet,$ob_cpgpos) = getcpginfo ($obhash{$key});
$newcpg = "CpG-".$pos2."(".$ob_totalcov.":".$ob_met."+".$ob_nonmet.")"; #modifies the - strand cpg info into +strand info
$rgb = $newline[8];
if ($ob_totalcov >=5) {
print OUT"$chr\t$pos1\t$pos2\t$newcpg\t$newline[4]\t+\t0\t0\t$rgb";#prints the -strand info as a +strand info
}
}
}
foreach $key (keys %othash) {
if(!exists $obhash{$key}) {
($ot_totalcov, $ot_met,$ot_nonmet,$ot_cpgpos) = getcpginfo ($othash{$key});
if($ot_totalcov>=5) {
print OUT $othash{$key};
}
}
}
close OUT;
system ("sort -k1,1 -k2,2n tmp > $outfile; rm tmp");
}
}
sub create_hash { #subroutine for creating hash
$file = shift @_;
open (IN, $file);
%hash = ();
while (<IN>) {
$line = $_;
$cpgpos = getcpginfo($line);
if ($file =~ /OT/) {
$key = $chr."_"."$cpgpos ";
}
elsif ($file =~ /OB/) {
$new_cpgpos = $cpgpos-1;
$key = $chr."_"."$new_cpgpos ";
}
$hash{$key} = $line;
}
return %hash;
}
sub getcpginfo { #subroutine for getting cpginfo
@line = split (/\t/, shift @_);
$chr = $line[0];
$line[3] =~ /CpG-(\d+)\((\d+):(\d+)\+(\d+)\)/;
$cpgpos = $1;
$totalcov = $2;
$met = $3;
$nonmet = $4;
return ($totalcov,$met,$nonmet,$cpgpos);
}
sub extract_bismark_mapping_report{
print REP "Sample\tReads_analysed\tMapped_reads\tMap_eff\tCs_Anal\tMethyl Cs\tBisulf_convsn_eff(%)\n";
foreach $file (@files) {
open (BIS, "$file") || print "File $file can not be opened. $!\n";
@bismark = <BIS>;
$root = $file;
$root =~ s/_trim.fastq_Bismark_mapping_report.txt\n//g;
$total_seqs = analysed_seqs (@bismark);
$seqs = get_mapped_seqs(@bismark);
$map_eff = get_map_eff (@bismark);
$total_cs = total_c_analysed (@bismark);
($total_methC, $bis_eff) = c_to_convrsn_eff (@bismark);
print REP "$root\t$total_seqs\t$seqs\t$map_eff\t$total_cs\t$total_methC\t$bis_eff\n";
system ("rm $file");
}
}
sub analysed_seqs {
my @temp;
@temp = split (/\t/, $bismark[6]);
chomp $temp[1];
return $temp[1];
}
sub get_mapped_seqs {
my @temp;
@temp = split (/\t/, $bismark[7]);
chomp $temp[1];
return $temp[1];
}
sub get_map_eff {
my @temp;
@temp = split (/\t/, $bismark[8]);
chomp $temp[1];
return $temp[1];
}
sub total_c_analysed {
my @temp;
@temp = split (/\t/, $bismark[23]);
chomp $temp[1];
return $temp[1];
}
sub c_to_convrsn_eff {
my @temp26; my @temp27; my @temp28; my $total; my $eff;
@temp26 = split (/\t/, $bismark[25]);
@temp27 = split (/\t/, $bismark[26]);
@temp28 = split (/\t/, $bismark[27]);
chomp $temp26[1]; chomp $temp27[1]; chomp $temp28[1];
$total = $temp26[1]+$temp27[1]+$temp28[1];
$eff = ($total_cs-$total)*100/$total_cs;
return ($total, $eff);
}
sub methylation_extractor_bedmake {
$sam = "$root"."_trim.fastq_bismark.sam";
system ("methylation_extractor -s --merge_non_CpG $sam");
$cpg_OT_context = "CpG_OT_".$sam.".txt"; print "$cpg_OT_context\n";
$cpg_OB_context = "CpG_OB_".$sam.".txt"; print "$cpg_OB_context\n";
$otbed = "$root"."_OT_cpg_1x.bed"; print "$otbed\n";
$obbed = "$root"."_OB_cpg_1x.bed"; print "$obbed\n";
system ("genome_methylation_bismark2bedGraph_v4.pl --s --counts $cpg_OT_context > $otbed");
system ("genome_methylation_bismark2bedGraph_v4.pl --s --counts $cpg_OB_context > $obbed");
system ("rm *CTOT* *CTOB*");
}
sub bismark_lam {
print "================================\nMapping the Data to lambda genome\n============================\n";
system ("bismark_modifiedv0.7.5 -n 2 -l 51 /home/kalyankpy/dname/genome_lambda/ $final_file -o $lambda/");
}
sub bismark_hg19 {
print "================================\nMapping the Data to hg19 genome\n==============================\n";
system ("bismark_modifiedv0.7.5 -n 2 -l 51 /home/kalyankpy/dname/genome_hg19/ $final_file -o $hg19/");
}
sub trim_galore {
system ("trim_galore_mod --rrbs --length 30 --fastqc $trim_file");
}
#Removes the endfilled two nucleotides in a full length 51 bp read
sub rm_endfilled_c {
open (IN, " paste - - - - < $trimmed_file|");
print "Trimming the last two terminal nucleotides in a full length read from $trimmed_file\n";
$endfil_removed_file = "$root"."_trim.fastq";
open (OUT, ">$endfil_removed_file");
print ("Created $endfil_removed_file\n");
while (<IN>) {
@elem = split (/\t/, $_);
$h = $elem[0];
$s = $elem[1];
$y = $elem[2];
$q = $elem[3];
if (length($s) == 51 && $s =~ /CTG$|TTG$/){ #checking for the full length read ending with the pattern. Modify the number if the sequencing is of different length
$s = substr($s,0,49);
$q = substr($q,0,49);
print OUT ("$h\n$s\n$y\n$q\n");
}
else {
print OUT ("$h\n$s\n$y\n$q");
}
}
close IN;
close OUT;
}
view raw rrbs_wrapper hosted with ❤ by GitHub

May 4, 2013

Survival tips for biologists to navigate on linux

As a biologist I noticed that there are few resources that teach programming or linux commands from the biologists view. Here, I try to summarize the basic linux commands needed to navigate the linux folder structure, getting help and other basic tasks. This information given below will be just enough for survival. For detailed learning you may refer to other sources available on internet.

Basic linux tasks and related command for survival (commands are writtien in brackets):
1. Getting help on any command (man)
## get help on any command in linux
> man command
2. Creating directories (mkdir)
## create a directory
> mkdir name_of_new_directory
# for help on creating a directory
> man mkdir
3. Changing directories (cd)
## change directory
> cd /location/of/the/directory
# change to home directory
> cd
# change to one higher level directory
> cd ..
# change to second higher level directory
> cd ../..
# for help of changing directories
> man cd
4. Removing file/s or directory (rm)
## rm a file or directory
> rm name_of_file(s)_to_remove
# remove a directory
> rm -r name_of_folder(s)
# remove all files inside a directory without asking confirmation
> rm -rf name_of_folder(s)
# for help on removing files/folders
> man rm
5. Copying file/s or directory (cp)
## copy file(s) or directory(s)
> cp /old/location/of/file(s) /new/location/of/the/same/file(s)
# copy a directory to a new location
> cp -r /old/location/of/directory /new/location/of/the/same/directory
# copy the files elsewhere into the present directory
> cp /location/of/file .
# for help on copying
> man cp
6. Move the location of file/s or directory (mv)
## move a file or directory
mv /old/location/of/file /new/location/of/the/same/file
# mv a directory to a new location
> mv -r /old/location/of/directory /new/location/of/the/same/directory
# mv the files elsewhere into the present directory
> mv /location/of/file .
# for help on moving
> man mv
7. Rename a file or list of file/s (rename)
## rename a file or files
> rename old_file_name new_file_name
# rename a list of files with some common name to some other common name
# for example there are files "abc12, abc23, abc34" and we want to name them "xyz12, xyz23, xyz34"
> rename 's/abc/xyz/g' abc*
# for help on renaming
> man rename

March 20, 2013

Scope of Reduced Representation Bisulfite Sequencing Data

RRBS method explores the methylation status across the genome but at specific locations defined by the MspI recognition sites. These sites are mostly located within CpG Islands. So, how to visulaize the scope of the RRBS data - regions from where we can expect the methylation status in the human genome.

I show the localization of the CpG Island on the human genome in the following graphic. These are the most possible locations for the RRBS data sampling

This is a Karyogram view of the CpG Island on human chromosomes. Chromosomes were plotted relative to their size. Each CpG island is denoted by a single dot at the relative position on the chromosome. CpG Island close to each other are perceived as a connecting line to the human eye (hundred of dot placed closely). This picture also gives us an Idea that certain parts of the chromosomes either lack CpG islands (chr1, chr13, chr14, chr15). This could be due to incomplete mapping of human genome on these chromosomes.

I made this graphic using ggbio and ggplot2.

Methylation Status of CpG Islands Across Human Genome

Researchers are aware that a majority of the CpG islands are unmethylated. How to represent this fact in a graphic? 

The above picture created with ggplot2 explains us how the CpG methylation across the CpG Islands is distributed for each chromosome. Each CpG island is shown as a single dot(.). Methylation on the CpG island is identified by the color gradient. This explains that most of the CpG Islands are unmethylated (overlapped dots are seen as a line on the left side of the picture). Sparsely methylated CpG Islands can be identified as blue dots on the right side. Click the picture for a larger view.