Genotype Calling

Project Description
 Genotypes are obtained using microarray technology.
 Build a simple genotype caller based on the ratios of probe intensities.
 Identify clusters in the genotype plots and use our own algorithms to genotype the data. How does this compare to other methods?
Background on Genotype Calling
 SNP Genotyping: [http://en.wikipedia.org/wiki/SNP_genotyping]
 G. Yang et al. High-throughput Microarray-based Genotyping. In IEEE CSB, 2004.
About the Authors
Jackson Pang
1st-year MSEE with a background in computer architecture, high-performance computing, embedded network programming, and detection and estimation.
Digvijay Singh
1st-year MSEE with a background in computer architecture, signal processing, networks, operating systems, compilers, and embedded computing.
Goal for end of Quarter
 Our goal for this quarter is to build a simple genotype caller based on the raw probe data.
Weekly Schedule
 Week 4: Deal with probe-level data and begin formulating the problem.
 Week 5: Look into candidate algorithms and analyze probe-level data more deeply.
 Week 6: Begin implementing and evaluating the genotype calling algorithms.
 Week 7: Finalize implementation of the algorithms.
 Week 8: Evaluate and test the algorithms with multiple data sets and previous results.
 Week 9: Present results and report.
WEEK 9 Weekly Update (May 28, 2009)
Weekly Progress
 Presented results and report.
 The presentation contains all details, including relevant background on genotype calling.
 The presentation is available here in PDF and PPT formats.
 R files are here [http://cs124project2009.wikidot.com/localfiles/gtcalling/rfiles_pang_singh.tgz]
Next Week Plan
 The project is officially finished!
Grade for the Week
 Grade: A
Problems that Came Up
 Well, the usual presentation-making problems cropped up.
Problems Solved This Week
 Finished our presentation and reporting for the project. We are done!
WEEK 8 Weekly Update (May 21, 2009)
Weekly Progress
 Analyzed the performance of both the sector-sweep and K-means clustering techniques against the Affymetrix 100K Xba calling data available from the HapMap project's website.
Figures
 We show the comparison of our algorithms' genotype calls to the HapMap genotyped data.
 For both of our algorithms, the calls were consistent with greater than 95% accuracy. K-means did slightly better than sector-sweep in the comparison.
K-means
Sector-sweep
Next Week Plan
 Finalize presentation for the project.
Grade for the Week
 Grade: A
Problems that Came Up
 We had made some small indexing errors when comparing our algorithms' results with the HapMap genotyped data. This led to a near-40% consistency that really shocked us!
Problems Solved This Week
 Compared our results to all 50K genotyped SNPs from the HapMap website. The comparison shows that our algorithms have done well.
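The consistency check described above can be sketched in R. This is only an illustration, not our actual comparison script: the matrices, SNP IDs, and individuals below are made up, and in practice the calls would be loaded from kmeansCalling.txt and the downloaded HapMap genotype files, aligned by SNP ID (rows) and individual (columns).

```r
# Toy concordance check between our calls and reference calls.
# These matrices are illustrative; real data would be loaded with
# read.table() and aligned by SNP ID and individual first.
ours <- matrix(c("A", "H", "B", "A", "B", "H"), nrow = 2,
               dimnames = list(c("SNP1", "SNP2"), c("ind1", "ind2", "ind3")))
ref  <- matrix(c("A", "H", "B", "A", "H", "H"), nrow = 2,
               dimnames = list(c("SNP1", "SNP2"), c("ind1", "ind2", "ind3")))

# Fraction of matching calls across all SNP/individual pairs
concordance <- mean(ours == ref)
concordance  # 5 of the 6 calls agree in this toy example
```

An indexing slip (the problem we hit this week) shows up immediately in such a check as an implausibly low concordance.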
References
 GENOTYPED HapMap DATA: [http://www.hapmap.org/].
WEEK 7 Weekly Update (May 14, 2009)
Weekly Progress
 Implemented a calling method based on the K-means clustering technique.
 Ran the K-means algorithm over all of the affy100k SNPs and obtained calling data for all 270 individuals.
Figures
 After using our K-means genotype caller, the clusters have been marked as BB (red), AB (black) and AA (green) for a sample SNP.
R Code
# performs clustering based on the input data sets
senseA <- read.table("senseA.txt", header = T)
senseB <- read.table("senseB.txt", header = T)
#rowNames <- read.delim("rowOrder.names", quote = "\"", sep = "/")
snp_vec <- matrix(data = NA, nrow = ncol(senseA), ncol = 2)
callings <- matrix(data = NA, nrow = nrow(senseA), ncol = ncol(senseA))
colnames(callings) <- colnames(senseA)
rownames(callings) <- rownames(senseA)
call_map <- matrix(data = NA, nrow = 1, ncol = 3)
ratios <- matrix(data = NA, nrow = 1, ncol = 3)
# this loop runs once for every individual (270)
for (i in 1:nrow(senseA)) {
#for (i in 1:10) {
  #############################################
  # Perform K-means
  snp_vec[, 1] <- t(senseA[i, ])
  snp_vec[, 2] <- t(senseB[i, ])
  colnames(snp_vec) <- c("Sense A", "Sense B")
  cl <- kmeans(snp_vec,
               centers = 3,
               iter.max = 50,
               nstart = 10,
               algorithm = "Hartigan-Wong")
  ##############################
  # Classifier implementation
  ##############################
  senseAmin <- min(senseA[i, ])
  senseBmin <- min(senseB[i, ])
  min <- 1
  max <- 1
  mid <- 1
  # origin re-centering first, then classify
  for (j in 1:3) {
    if (senseAmin < cl$centers[j, 1])
      adiff <- (cl$centers[j, 1] - senseAmin)
    else
      adiff <- (cl$centers[j, 1])
    if (senseBmin < cl$centers[j, 2])
      bdiff <- (cl$centers[j, 2] - senseBmin)
    else
      bdiff <- (cl$centers[j, 2])
    ratios[j] <- (bdiff / adiff)
    # classify
    if (ratios[j] < ratios[min]) {
      min <- j
    }
    if (ratios[j] > ratios[max]) {
      max <- j
    }
  }
  if ((min == 1 && max == 3) || (min == 3 && max == 1)) {
    mid <- 2
  } else if ((min == 1 && max == 2) || (min == 2 && max == 1)) {
    mid <- 3
  } else if ((min == 2 && max == 3) || (min == 3 && max == 2)) {
    mid <- 1
  }
  call_map[min] <- "A"
  call_map[max] <- "B"
  call_map[mid] <- "H"
  # store callings in table
  for (k in 1:ncol(senseA)) {
    callings[i, k] <- call_map[cl$cluster[k]]
  }
  #pdf(paste(rownames(senseA[i, ]), "pdf", sep = "."))
  #plot(snp_vec,
  #     col = cl$cluster,
  #     xlab = "Sense A",
  #     ylab = "Sense B",
  #     main = paste("3-Center K-means Clustering Plot for", rownames(senseA[i, ])))
  #points(cl$centers, col = 2:1, pch = "o", cex = 2, lwd = 100)
  #text(cl$centers[1, 1], cl$centers[1, 2], paste(call_map[1]), pos = 1, cex = 2, col = "blue")
  #text(cl$centers[2, 1], cl$centers[2, 2], paste(call_map[2]), pos = 1, cex = 2, col = "blue")
  #text(cl$centers[3, 1], cl$centers[3, 2], paste(call_map[3]), pos = 1, cex = 2, col = "blue")
  #dev.off()
  #############################################
}
write.table(callings, file = "kmeansCalling.txt", append = F,
            row.names = T, col.names = T)
Next Week Plan
 Analyze the performance of both the sector-sweep and K-means clustering techniques against the Affymetrix 100K Xba calling data available from HapMap.
Grade for the Week
 Grade: A
Problems that Came Up
 This week's problems were mainly implementation-related. One problem was that a K-means cluster center can land right on an axis of the chart, so there is a chance that using the B/A ratio to classify a SNP returns an undefined value, since the denominator is zero.
Problems Solved This Week
 Added bound-checking code to ensure that the normalization only takes place when the centers are not on top of an axis.
References
 Hartigan, J. A. (1975). Clustering Algorithms.
WEEK 6 Weekly Update (May 7, 2009)
Weekly Progress
 Implemented a sector-sweep genotype calling algorithm.
 Tested the algorithm on data obtained in the previous week.
Slope-based Genotype Calling Algorithm
 We present the pseudocode for our sector-sweep genotype caller.
 The basic idea is to "call" or "mark" a point based on its slope.
 All points close to a 45-degree slope are called AB; the others are called AA or BB.
Sector_Sweep_Genotype
start
  # Parameters of the slope-based search
  increment = 0.1      # The value by which the slope is changed when searching for a threshold
  density_drop = 0.75  # Maximum drop in number of points that can be tolerated within a contiguous cluster
  inv_slope = 1        # Start slope at 45 degrees
  upper_threshold = 1  # The slope threshold separating the AB and BB clusters
  max_points = 0       # Maximum number of points lying within 'increment' of any slope so far
  current_points = 0   # Number of points lying within 'increment' of the current slope
  terminate = FALSE
  while (terminate is FALSE)
  do
    # Count the number of points within 'increment' of the current slope
    for each data point (x, y)
      if (1/inv_slope <= y/x <= 1/(inv_slope - increment))
        current_points = current_points + 1
    # See if this is the maximum number of points at any slope
    if (current_points > max_points)
      max_points = current_points
    # If this slope has fewer than a 'density_drop' fraction of the maximum, the cluster is no longer contiguous
    if (current_points < density_drop * max_points)
      upper_threshold = 1/inv_slope
      terminate = TRUE
    else
      inv_slope = inv_slope - increment
    # For degenerate cases (the sweep has run past vertical)
    if (1/inv_slope <= 0)
      terminate = TRUE
      upper_threshold = 1
    current_points = 0
  end while

  slope = 1            # Start slope at 45 degrees
  lower_threshold = 1  # The slope threshold separating the AB and AA clusters
  max_points = 0       # Maximum number of points lying within 'increment' of any slope so far
  current_points = 0   # Number of points lying within 'increment' of the current slope
  terminate = FALSE
  while (terminate is FALSE)
  do
    # Count the number of points within 'increment' of the current slope
    for each data point (x, y)
      if (slope >= y/x >= slope - increment)
        current_points = current_points + 1
    # See if this is the maximum number of points at any slope
    if (current_points > max_points)
      max_points = current_points
    # If this slope has fewer than a 'density_drop' fraction of the maximum, the cluster is no longer contiguous
    if (current_points < density_drop * max_points)
      lower_threshold = slope
      terminate = TRUE
    else
      slope = slope - increment
    # For degenerate cases (the sweep has run past horizontal)
    if (slope <= 0)
      terminate = TRUE
      lower_threshold = 1
    current_points = 0
  end while

  # Finally, genotype-call the data based on the thresholds obtained
  for each data point (x, y)
    if (upper_threshold >= y/x >= lower_threshold)
      mark (x, y) as AB
    if (y/x < lower_threshold)
      mark (x, y) as AA
    if (y/x > upper_threshold)
      mark (x, y) as BB
end
 The algorithm is simple to implement, as it only uses slope information computed from the probe intensity (sense) values for alleles A and B.
 On the flip side, it is not very robust and may not work well for complicated cases where the clusters aren't as distinct.
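The final classification step of the pseudocode above can be sketched compactly in R. Note that the thresholds and intensity values below are illustrative stand-ins, not the output of the actual density sweep:

```r
# Sketch of the final sector-sweep step: call each point by its B/A ratio.
# lower/upper thresholds would normally come from the sweep; these are made up.
lower_threshold <- 0.5
upper_threshold <- 2.0

senseA <- c(10, 1, 5)  # toy allele A probe intensities
senseB <- c(1, 10, 5)  # toy allele B probe intensities

ratio <- senseB / senseA
calls <- ifelse(ratio < lower_threshold, "AA",
                ifelse(ratio > upper_threshold, "BB", "AB"))
calls  # points well below 45 degrees -> AA, well above -> BB, near -> AB
```

The vectorized ifelse() makes the classification a one-liner once the two thresholds are known; all of the algorithmic work is in the sweep that finds them.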
Sample Raw Data
 An example of the data obtained is shown in the figure below.
 Each point is the probe intensity value of an individual for some SNP.
 This data is ungenotyped, and the three clusters representing BB, AB and AA are clearly visible.
Data after using the Sector-Sweep Genotype Calling Algorithm
 After using our slope-based genotype caller, the clusters have been marked as BB (green), AB (red) and AA (blue).
Next Week Plan
 Implement a more complicated and robust genotype calling algorithm.
Grade for the Week
 Grade: A
 We implemented a simple genotype calling algorithm as planned in previous weeks, so we are on track with our plan.
Problems that Came Up
 No big problems came up, as we already had the data in a usable format and it was easily imported and processed in R.
 Algorithm implementation and testing were not too difficult once the data was imported into R.
Problems Solved This Week
 Used actual Affy100k data for genotype calling.
 Presence of distinct clusters in called genotypes validated the data we had extracted in previous weeks.
References
 Y. Lin et al. Smarter Clustering Methods for SNP Genotype Calling. In Bioinformatics, 2009.
WEEK 5 Weekly Update (April 30, 2009)
Weekly Progress
 Used the oligo package for Bioconductor to convert the raw affy100k data to usable text data. The text files contain intensity (sense) values of the SNPs for each individual. This data can be used directly in MATLAB, Octave, etc. for algorithm development.
 Also, we have found a reference containing all relevant information on genotype calling and sample algorithms that we will try on the data.
R Code
 After loading Oligo and Bioconductor.
R> library(oligo)
R> library(hapmap100kxba)
R> pathCelFiles <- system.file("celFiles", package = "hapmap100kxba")
R> fullFilenames <- list.celfiles(path = pathCelFiles, full.names = TRUE)
R> temporaryDir <- tempdir()
R> rawData <- read.celfiles(fullFilenames)
R> preProcessedData <- snprma(rawData)
Normalizing to Hapmap.
Summarizing.
R> A <- senseThetaA(preProcessedData)
R> write.table(A, "A.dat")
R> B <- senseThetaB(preProcessedData)
R> write.table(B, "B.dat")
 This writes the probe intensity (sense) values (for A and B) into the files 'A.dat' and 'B.dat', giving us a usable data file format that can be read straight into MATLAB or R.
Sample Data
 An example of a few lines of data from the file 'A.dat' is shown below. Note that the data is for three individuals for each of the SNPs (in rows).
"SNP_A-1500014" 11.0019846094543 11.1613403058736 11.4167426000789
"SNP_A-1500923" 10.5649038218336 10.7857168045851 11.7683224554234
"SNP_A-1501755" 11.4143531913055 11.9222583189184 11.4783974273146
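To load such a table back into R for algorithm development, read.table() recovers the SNP-by-individual matrix. The sketch below embeds two illustrative lines in the A.dat format so it is self-contained; with the real file, the call would simply be read.table("A.dat").

```r
# Two lines in the same format as 'A.dat', embedded inline so the example
# runs without the file being present.
sample_lines <- c(
  '"SNP_A-1500014" 11.0019846094543 11.1613403058736 11.4167426000789',
  '"SNP_A-1500923" 10.5649038218336 10.7857168045851 11.7683224554234'
)
# row.names = 1 turns the quoted SNP IDs into row names, leaving a
# numeric SNPs-by-individuals matrix.
A <- as.matrix(read.table(text = paste(sample_lines, collapse = "\n"),
                          row.names = 1))
dim(A)  # 2 SNPs (rows) by 3 individuals (columns)
```

The resulting matrix pairs directly with its 'B.dat' counterpart: element [s, i] of each matrix gives the allele A and allele B intensities of individual i at SNP s.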
Processing Multiple Data Files
 One problem with the above approach is that it only yields data for three individuals at a time. To get data for more individuals, we need to process the data set's files a few at a time using the method mentioned above.
 We have developed a BASH script to solve this problem and collect the data from multiple files.
 The basic idea is that the script loops over the data files in pairs, repeatedly "echoing" commands to build an R script for each pair.
#!/bin/bash
src_dir=$1
dst_dir=$2
#################################
# Make the destination directory if it doesn't exist
#################################
test -e $dst_dir || mkdir -p $dst_dir
#test -e ./tmp || mkdir -p ./tmp
#################################
# Grab a list of files from the source directory
#################################
ls $src_dir > celfiles.del
#################################
# Insert them in an array so we can fetch them later
#################################
exec < celfiles.del
m=0
while read line
do
  m=`expr $m + 1`
  cfiles[$m]=$line
done
#################################
# Generate the R script, then execute it in R
#################################
i=1
j=2
while test $i -lt $m
do
  #################################
  # create the names of the 2 files to be operated on in ./tmp
  #################################
  #touch ./tmp/${cfiles[$i]}
  #touch ./tmp/${cfiles[$j]}
  #################################
  # Build the R script
  #################################
  echo "library(oligo)" > rscript.r
  echo "fullFilenames <- c(\"test1\", \"test2\")" >> rscript.r
  echo "fullFilenames[1] <- \"$src_dir${cfiles[$i]}\"" >> rscript.r
  echo "fullFilenames[2] <- \"$src_dir${cfiles[$i+1]}\"" >> rscript.r
  #echo "fullFilenames <- list.celfiles(path = ./tmp, full.names=TRUE)" >> rscript.r
  echo "rawData <- read.celfiles(fullFilenames)" >> rscript.r
  echo "preProcessedData <- snprma(rawData)" >> rscript.r
  echo "a <- senseThetaA(preProcessedData)" >> rscript.r
  echo "write.table(a, \"$dst_dir/${cfiles[$i]}.${cfiles[$i+1]}.senseA.txt\")" >> rscript.r
  echo "b <- senseThetaB(preProcessedData)" >> rscript.r
  echo "write.table(b, \"$dst_dir/${cfiles[$i]}.${cfiles[$i+1]}.senseB.txt\")" >> rscript.r
  #################################
  # Finally execute R
  #################################
  R --no-save -q < rscript.r
  #rm -f ./tmp/*
  i=`expr $i + 2`
done
############################
# cleanup
#rm -rf tmp
rm -f celfiles.del
Next Week Plan
 Begin implementing and evaluating the genotype calling algorithms.
Grade for the Week
 Grade: A
 We were finally able to extract usable data from the (complicated) raw data sets.
 We have also selected some sample algorithms for genotype calling that we will begin implementing and evaluating.
Problems that Came Up
 The biggest problem we had to tackle was dealing with the large data sets and making sense of the data.
Problems Solved This Week
 Converted raw data to usable format. Selected algorithms for implementation.
WEEK 4 Weekly Update (April 23, 2009)
Weekly Progress
 Installed R and the Bioconductor package, downloaded the affy100k raw data sets, and performed visualization.
Figures
R Code
 After loading Bioconductor and Oligo.
R> library(oligo)
R> library(hapmap100kxba)
R> pathCelFiles <- system.file("celFiles", package = "hapmap100kxba")
R> fullFilenames <- list.celfiles(path = pathCelFiles,
full.names = TRUE)
R> temporaryDir <- tempdir()
R> rawData <- read.celfiles(fullFilenames)
> preProcessedData <- snprma(rawData)
Normalizing to Hapmap.
Summarizing.
> theA <- getA(preProcessedData)
> theM <- getM(preProcessedData)
> dim(theA)
[1] 58960 3 2
> library(geneplotter)
Loading required package: lattice
Loading required package: annotate
Loading required package: xtable
KernSmooth 2.22 installed
Copyright M. P. Wand 1997
Attaching package: 'geneplotter'
The following object(s) are masked from package:lattice :
panel.smoothScatter
> smoothScatter(theA[, 1, 1], theM[, 1, 1], main = "MA plot (Antisense)",
xlab = "Average Intensity", ylab = "Log-ratio (A/B)")
Next Week Plan
 Look into data more deeply and identify candidate algorithms.
Grade for the Week
 Grade: A
 We performed well for a starter week and processed a lot of data to get this project moving quickly.
Problems that Came Up
 We had a hard time importing the CEL data into R. First we tried the "affy" package and ran into a CDF import problem. Then we tried the "aroma.affymetrix" package; again, we had problems importing the CEL data. Finally, we tried the "oligo" package and were able to import the CEL data.
Problems Solved This Week
 We are now able to access the probe intensity data and plot the normalized values as an MA plot. We'd like to thank Jimmie Ye for his help.
References
 R: [http://www.r-project.org/]
 BIOCONDUCTOR: [http://www.bioconductor.org/]
 OLIGO: [http://www.bioconductor.org/packages/2.3/bioc/manuals/oligo/man/oligo.pdf]
page revision: 86, last edited: 12 Jun 2009 00:55