Genotype Calling

Project Description

  • Genotypes are obtained using a microarray technology.
  • Make a simple genotype caller based on ratios of the probes.
  • Identify clusters in the genotype plots and use our own algorithms to genotype the data. How do these compare to existing methods?
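To make the ratio idea concrete, here is a minimal sketch in Python (the thresholds 0.5 and 2.0 are illustrative placeholders, not values derived from our data):

```python
def call_genotype(sense_a, sense_b, low=0.5, high=2.0):
    """Toy ratio-based caller: classify one sample from its A and B probe
    intensities. The thresholds are illustrative, not fitted to real data."""
    ratio = sense_b / sense_a
    if ratio < low:
        return "AA"   # A allele dominates
    elif ratio > high:
        return "BB"   # B allele dominates
    return "AB"       # heterozygous: the two intensities are comparable

print(call_genotype(10.0, 2.0))   # AA
print(call_genotype(5.0, 5.5))    # AB
print(call_genotype(1.5, 9.0))    # BB
```

A real caller must fit the thresholds to the observed clusters, which is exactly what the algorithms developed below do.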

Background on Genotype Calling

About the Authors

Jackson Pang

1st year MSEE with background in computer architecture, high-performance computing, embedded network programming, detection and estimation.

Digvijay Singh

1st year MSEE with background in computer architecture, signal processing, networks, operating systems, compilers and embedded computing.

Goal for end of Quarter

  • Our goal for this quarter is to make a simple genotype caller based on raw data of the probes.

Weekly Schedule

  • Week 4- Deal with probe level data and begin formulating the problem.
  • Week 5- Look into candidate algorithms and analyze probe level data more deeply.
  • Week 6- Begin implementing and evaluating the genotype calling algorithms.
  • Week 7- Finalize Implementation of the algorithms.
  • Week 8- Evaluate and Test algorithms with multiple data sets and previous results.
  • Week 9- Present Results and Report.

WEEK 9 Weekly Update (May 28, 2009)

Weekly Progress

Next Week Plan

  • The project is officially finished!

Grade for the Week

  • Grade: A

Problems that Came Up

  • The usual presentation-making problems cropped up.

Problems Solved This Week

  • Finished our presentation and reporting for the project. We are done!

WEEK 8 Weekly Update (May 21, 2009)

Weekly Progress

  • Analyzed the performance of both the sector-sweep and K-means clustering techniques against the Affymetrix 100k Xba calling data available from the HapMap project's website.

Figures

  • We show the comparison of our algorithms' genotype calls to the HapMap genotyped data.
  • For both algorithms, the calls agreed with the HapMap calls more than 95% of the time; K-means did slightly better than sector-sweep in the comparison.
K-Means: bar_k.bmp, hist_k.bmp
Sector-sweep: bar_slope.bmp, hist_slope.bmp
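The consistency numbers above come from an entry-by-entry comparison of call matrices. A minimal Python sketch of that kind of concordance check, run on hypothetical toy calls rather than the actual HapMap data:

```python
def concordance(ours, reference):
    """Fraction of positions where our call matches the reference,
    skipping positions where the reference has no call (None)."""
    pairs = [(a, b) for a, b in zip(ours, reference) if b is not None]
    return sum(a == b for a, b in pairs) / len(pairs)

ours = ["AA", "AB", "BB", "AA", "AB"]
ref  = ["AA", "AB", "BB", "AB", None]   # toy data, not real HapMap calls
print(concordance(ours, ref))  # 0.75 -- 3 of the 4 comparable calls agree
```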

Next Week Plan

  • Finalize presentation for the project.

Grade for the Week

  • Grade: A

Problems that Came Up

  • We had made some small indexing errors in the comparison of our algorithms' results with the HapMap genotyped data. This initially produced a consistency of only about 40%, which really shocked us!

Problems Solved This Week

  • Compared our results to all 50k genotyped SNPs from the HapMap website. The comparison shows that our algorithms perform well.

WEEK 7 Weekly Update (May 14, 2009)

Weekly Progress

  • Implemented a calling method based on K-means clustering technique.
  • Ran the K-means algorithm over all of the affy100k SNPs and obtained the calling data for all 270 individuals.

Figures

  • After using our K-means genotype caller, the clusters have been marked as BB (red), AB (black) and AA (green) for a sample SNP.
clu1.bmp

R Code

# performs clustering based on the input data sets

senseA <- read.table ("senseA.txt", header = T)
senseB <- read.table ("senseB.txt", header = T)
#rowNames <- read.delim ("rowOrder.names", quote = "\"", sep = "/")

snp_vec <- matrix (data = NA, nrow = ncol(senseA), ncol = 2)

callings <- matrix (data = NA, nrow = nrow (senseA), ncol = ncol(senseA))
colnames(callings) <- colnames(senseA)
rownames(callings) <- rownames(senseA)

call_map <- matrix (data = NA, nrow = 1, ncol = 3)

ratios <- matrix (data = NA, nrow = 1, ncol = 3)

# this loop runs once for every SNP; each iteration clusters all individuals for that SNP
for (i in 1:nrow(senseA)){
#for (i in 1:10){

    #############################################
    # Perform K-means
    snp_vec[, 1] <- t(senseA[i, ])
    snp_vec[, 2] <- t(senseB[i, ])

    colnames(snp_vec) <- c("Sense A", "Sense B")

    cl <- kmeans(snp_vec,
                 centers = 3,
                 iter.max = 50,
                 nstart = 10,
                 algorithm = "Hartigan-Wong")

    ##############################
    # Classifier implementation
    ##############################
    senseAmin <- min(senseA[i, ])
    senseBmin <- min(senseB[i, ])

    # indices of the clusters with the smallest, largest and middle B/A ratio
    # (imin/imax/imid avoid shadowing the base functions min() and max())
    imin <- 1
    imax <- 1
    imid <- 1

    # origin re-centering first, then classify
    for (j in 1:3) {
        if (senseAmin < cl$centers[j, 1])
            adiff <- cl$centers[j, 1] - senseAmin
        else
            adiff <- cl$centers[j, 1]

        if (senseBmin < cl$centers[j, 2])
            bdiff <- cl$centers[j, 2] - senseBmin
        else
            bdiff <- cl$centers[j, 2]

        ratios[j] <- bdiff / adiff

        # track the clusters with the extreme ratios
        if (ratios[j] < ratios[imin]) {
            imin <- j
        }

        if (ratios[j] > ratios[imax]) {
            imax <- j
        }
    }

    # the remaining index belongs to the middle (heterozygous) cluster
    if ((imin == 1 && imax == 3) || (imin == 3 && imax == 1)) {
        imid <- 2
    } else if ((imin == 1 && imax == 2) || (imin == 2 && imax == 1)) {
        imid <- 3
    } else if ((imin == 2 && imax == 3) || (imin == 3 && imax == 2)) {
        imid <- 1
    }

    call_map[imin] <- "A"
    call_map[imax] <- "B"
    call_map[imid] <- "H"

    # store callings in table
    for (k in 1:ncol(senseA)) {
        callings[i, k] <- call_map[cl$cluster[k]]
    }

    #pdf (paste(rownames(senseA[i, ]),"pdf", sep = "."))
    #plot (snp_vec, 
    #        col = cl$cluster,
    #        xlab = "Sense A",
    #        ylab = "Sense B",
    #        main = paste("3-Center K-means Clustering Plot for", rownames(senseA[i, ]))
    #        )
    #points (cl$centers, col = 2:1, pch = "o", cex = 2, lwd = 100)
    #text (cl$centers[1,1], cl$centers[1,2], paste(call_map[1]), pos = 1, cex = 2, col = "blue")
    #text (cl$centers[2,1], cl$centers[2,2], paste(call_map[2]), pos = 1, cex = 2, col = "blue")
    #text (cl$centers[3,1], cl$centers[3,2], paste(call_map[3]), pos = 1, cex = 2, col = "blue")

    #dev.off()
    #############################################

}

write.table(callings, file = "kmeansCalling.txt", append = F,
            row.names = T, col.names = T)

Next Week Plan

  • Analyze the performance of both the sector-sweep and K-means clustering techniques with respect to the Affymetrix 100k Xba calling data available from HapMap.

Grade for the Week

  • Grade: A

Problems that Came Up

  • This week's problems were mainly implementation-related. One problem was that the center of a k-means cluster can lie exactly on an axis of the chart, so the B/A ratio used to classify a SNP can come out undefined because the denominator is zero.
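One simple way to guard against a center sitting on the axis is to bound the denominator away from zero before forming the ratio. A Python sketch of the general idea (the floor value eps is an arbitrary illustrative choice, not what our R code uses):

```python
def safe_ratio(b, a, eps=1e-9):
    """B/A ratio with the denominator bounded away from zero, so a
    cluster center lying on the B axis (a == 0) yields a large but
    finite value instead of division by zero."""
    return b / max(a, eps)

print(safe_ratio(4.0, 2.0))   # 2.0
print(safe_ratio(4.0, 0.0))   # large but finite, instead of an error
```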

Problems Solved This Week

  • Added bound-checking code to ensure that the normalization only takes place when a center does not sit directly on an axis.

References

  • Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.

WEEK 6 Weekly Update (May 7, 2009)

Weekly Progress

  • Implemented a sector-sweep genotype calling algorithm.
  • Tested the algorithm on data obtained in the previous week.

Slope-based Genotype Calling Algorithm

  • We present the pseudo-code for our sector-sweep genotype caller.
  • The basic idea is to "call" or "mark" a point based on its slope.
  • Points close to a 45-degree slope are called AB; the rest are called AA or BB.
Sector_Sweep_Genotype
start
    # Parameters of slope-based search
    increment = 0.1            # The value by which the slope is changed when searching for threshold
    density_drop = 0.75        # Maximum drop in number of points that can be tolerated as a contiguous cluster

    inv_slope = 1                # Start slope at 45 degrees
    upper_threshold = 1        # The slope threshold separating AB and BB clusters
    max_points = 0            # Maximum number of points lying within 'increment' of any slope
    current_points = 0        # Number of points lying within 'increment' of the current slope

    terminate = FALSE
    while(terminate is FALSE)
    do
        # Count number of points within 'increment' of current slope
        for each data point (x,y)
            if(1/inv_slope <= y/x <= 1/(inv_slope - increment))
                current_points = current_points + 1

        # See if this is the maximum number of points at any slope
        if(current_points > max_points)
            max_points = current_points

        # If this slope holds fewer than a 'density_drop' fraction of the maximum, the cluster is no longer contiguous
        if(current_points < density_drop*max_points)
            upper_threshold = 1/inv_slope
            terminate = TRUE
        else
            inv_slope = inv_slope - increment

        # Guard against the degenerate case where the sweep reaches zero slope
        if(1/inv_slope <=0 )
            terminate = TRUE
            upper_threshold = 1

        current_points = 0
    end while

    slope = 1                    # Start slope at 45 degrees
    lower_threshold = 1        # The slope threshold separating AB and AA clusters
    max_points = 0            # Maximum number of points lying within 'increment' of any slope
    current_points = 0        # Number of points lying within 'increment' of the current slope

    terminate = FALSE
    while(terminate is FALSE)
    do
        # Count number of points within 'increment' of current slope
        for each data point (x,y)
            if(slope >= y/x >= slope - increment)
                current_points = current_points + 1

        # See if this is the maximum number of points at any slope
        if(current_points > max_points)
            max_points = current_points

        # If this slope holds fewer than a 'density_drop' fraction of the maximum, the cluster is no longer contiguous
        if(current_points < density_drop*max_points)
            lower_threshold = slope
            terminate = TRUE
        else
            slope = slope - increment

        # Guard against the degenerate case where the sweep reaches zero slope
        if(slope <=0 )
            terminate = TRUE
            lower_threshold = 1

        current_points = 0
    end while

    # Finally, genotype call the data, based on the thresholds obtained
    for each data point (x,y)
        if(upper_threshold >= y/x >= lower_threshold)
                mark (x,y) as AB
        if(y/x < lower_threshold)
                mark (x,y) as AA
        if(y/x > upper_threshold)
                mark (x,y) as BB
end
  • The algorithm is simple to implement as it only uses slope data based on the probe intensity (sense) values for Allele A and B.
  • On the flip side, it is not very robust and may not work well for complicated cases where the clusters aren't as clear.
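For reference, the sweep above can also be written in executable form. The following Python sketch follows the pseudo-code's parameters (increment, density_drop) and runs on synthetic points, so it illustrates the technique rather than reproducing our R implementation:

```python
def sweep_threshold(ratios, increment=0.1, density_drop=0.75):
    """Sweep a slope band downward from 45 degrees (slope 1) and stop at
    the first band whose point count falls below density_drop times the
    densest band seen so far; that slope separates two clusters."""
    slope, max_points = 1.0, 0
    while slope > 0:
        current = sum(1 for r in ratios if slope - increment <= r <= slope)
        max_points = max(max_points, current)
        if current < density_drop * max_points:
            return slope
        slope -= increment
    return 1.0  # degenerate case: no density drop found before slope 0

def sector_sweep_call(points):
    """Call each (x, y) point AA, AB or BB from its slope y/x."""
    ratios = [y / x for x, y in points]
    lower = sweep_threshold(ratios)                            # AB/AA boundary
    upper = 1.0 / sweep_threshold([x / y for x, y in points])  # AB/BB boundary
    return ["AA" if r < lower else "BB" if r > upper else "AB"
            for r in ratios]

# three synthetic clusters: AA (shallow), AB (near 45 degrees), BB (steep)
pts = [(10, 2), (9, 2), (10, 1.8),
       (5, 5), (6, 5.5), (5, 4.8),
       (2, 10), (2, 9), (1.8, 10)]
print(sector_sweep_call(pts))
# ['AA', 'AA', 'AA', 'AB', 'AB', 'AB', 'BB', 'BB', 'BB']
```

The upper threshold is found by applying the same downward sweep to the inverted ratios x/y, which mirrors the pseudo-code's inv_slope loop.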

Sample Raw Data

  • An example of the data obtained is shown in the figure below.
  • Each point is the probe intensity value of an individual for some SNP.
  • This data is un-genotyped and the three clusters representing BB, AB and AA can be clearly seen.
clu0.bmp

Data after using Sector-Sweep Genotype Calling Algorithm

  • After using our slope-based genotype caller, the clusters have been marked as BB (green), AB (red) and AA (blue).
clu.bmp

Next Week Plan

  • Implement a more complicated and robust genotype calling algorithm.

Grade for the Week

  • Grade: A
  • We implemented a simple genotype calling algorithm as was planned in previous weeks. So, we are on track with our plan.

Problems that Came Up

  • No big problems came up, as we already had the data in a usable format and it was easily imported and processed in R.
  • Algorithm implementation and testing was not too difficult once the data was imported into R.

Problems Solved This Week

  • Used actual Affy100k data for genotype calling.
  • Presence of distinct clusters in called genotypes validated the data we had extracted in previous weeks.

References

  • Y. Lin et al. Smarter Clustering Methods for SNP Genotype Calling. Bioinformatics, 2009.

WEEK 5 Weekly Update (April 30, 2009)

Weekly Progress

  • Used the oligo library from Bioconductor to convert raw affy100k data to usable text data. The text files contain intensity (sense) values of the SNPs for each individual. This data can be used directly in MATLAB, Octave, etc. for algorithm development.
  • Also, we have found a reference containing all relevant information on genotype calling and sample algorithms that we will try on the data.

R Code

  • After loading Oligo and Bioconductor.
R> library(oligo)
R> library(hapmap100kxba)
R> pathCelFiles <- system.file("celFiles", package = "hapmap100kxba")
R> fullFilenames <- list.celfiles(path = pathCelFiles, full.names = TRUE)
R> temporaryDir <- tempdir()
R> rawData <- read.celfiles(fullFilenames)
R> preProcessedData <- snprma(rawData)
Normalizing to Hapmap.
Summarizing.
R> A <- senseThetaA(preProcessedData)
R> write.table(A, "A.dat")
R> B <- senseThetaB(preProcessedData)
R> write.table(B, "B.dat")
  • This writes the probe intensity (sense) values for A and B into the files 'A.dat' and 'B.dat', giving us a usable data format that can be read straight into MATLAB or R.

Sample Data

  • An example of a few lines of data from file 'A.dat' is shown. Note that the data is for three individuals for each of the SNPs (in rows).
"SNP_A-1500014" 11.0019846094543 11.1613403058736 11.4167426000789
"SNP_A-1500923" 10.5649038218336 10.7857168045851 11.7683224554234
"SNP_A-1501755" 11.4143531913055 11.9222583189184 11.4783974273146
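This format is also easy to consume outside R. A small Python sketch (no external dependencies; the sample rows are copied from above) that parses such rows into a dictionary:

```python
def parse_sense_file(lines):
    """Parse 'A.dat'-style rows: a quoted SNP id followed by one
    intensity value per individual."""
    data = {}
    for line in lines:
        fields = line.split()
        snp_id = fields[0].strip('"')
        data[snp_id] = [float(v) for v in fields[1:]]
    return data

sample = [
    '"SNP_A-1500014" 11.0019846094543 11.1613403058736 11.4167426000789',
    '"SNP_A-1500923" 10.5649038218336 10.7857168045851 11.7683224554234',
]
table = parse_sense_file(sample)
print(len(table["SNP_A-1500014"]))  # 3 individuals for this SNP
```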

Processing Multiple Data Files

  • One problem with the above data is that it is only for three individuals. To get data for more individuals we need to process each of the data set's files one at a time using the method mentioned above.
  • We have developed a BASH script to solve this problem and get the data from multiple files.
  • The basic idea is that the script loops to process data files one at a time by repeatedly "echoing" the commands to form an R script for multiple data files.
#!/bin/bash
src_dir=$1
dst_dir=$2 

#################################
# Make the destination directory if it doesn't exist
#################################
test -e "$dst_dir" || mkdir -p "$dst_dir"
#test -e ./tmp || mkdir -p ./tmp 

#################################
# Grab a list of files from the source directory
#################################
ls $src_dir > celfiles.del

#################################
# Insert them in an array so we can fetch later
#################################
exec<celfiles.del
m=0
while read line
do 
  m=`expr $m + 1`
  cfiles[$m]=$line
done

#################################
# Generate R script then execute in R
#################################
i=1
while test $i -lt $m
do 

  ################################
  # create the names of 2 files to be operated into ./tmp 
  #################################
  #touch ./tmp/${cfiles[$i]}
  #touch ./tmp/${cfiles[$j]}

  #################################
  # Build the R script
  #################################

  echo "library(oligo)" > rscript.r
  echo "fullFilenames <- c(\"test1\", \"test2\")" >> rscript.r
  echo "fullFilenames[1] <- \"$src_dir${cfiles[$i]}\"" >> rscript.r
  echo "fullFilenames[2] <- \"$src_dir${cfiles[$i+1]}\"" >> rscript.r
  #echo "fullFilenames <- list.celfiles(path = ./tmp, full.names=TRUE)">> rscript.r
  echo "rawData <- read.celfiles(fullFilenames)" >> rscript.r
  echo "preProcessedData <- snprma (rawData)" >> rscript.r
  echo "a <- senseThetaA(preProcessedData)" >> rscript.r
  echo "write.table(a, \"$dst_dir/${cfiles[$i]}.${cfiles[$i+1]}.senseA.txt\")" >> rscript.r
  echo "b <- senseThetaB(preProcessedData)" >> rscript.r
  echo "write.table(b, \"$dst_dir/${cfiles[$i]}.${cfiles[$i+1]}.senseB.txt\")" >> rscript.r

  #################################
  # Finally execute R
  #################################
  R --no-save -q < rscript.r

  #rm -f ./tmp/*

  i=`expr $i + 2`
done

############################
# cleanup
#rm -rf tmp
rm -f celfiles.del

Next Week Plan

  • Begin implementing and evaluating the genotype calling algorithms.

Grade for the Week

  • Grade: A
  • We were able to finally get the data from the (complicated) raw data sets.
  • We have also selected some sample algorithms for genotype calling that we will begin implementing and evaluating.

Problems that Came Up

  • The biggest problem we had to tackle was to deal with the large data sets and to make sense of the data.

Problems Solved This Week

  • Converted raw data to usable format. Selected algorithms for implementation.

WEEK 4 Weekly Update (April 23, 2009)

Weekly Progress

  • Installed R, Bioconductor package, downloaded affy100k raw data sets and performed visualization.

Figures

snp_rma.png

R Code

  • After loading Bioconductor and Oligo.
R> library(oligo)
R> library(hapmap100kxba)
R> pathCelFiles <- system.file("celFiles", package = "hapmap100kxba")
R> fullFilenames <- list.celfiles(path = pathCelFiles,
     full.names = TRUE)
R> temporaryDir <- tempdir()
R> rawData <- read.celfiles(fullFilenames)
R> preProcessedData <- snprma(rawData)
Normalizing to Hapmap.
Summarizing.
R> theA <- getA(preProcessedData)
R> theM <- getM(preProcessedData)
R> dim(theA)
[1] 58960     3     2
R> library(geneplotter)
Loading required package: lattice
Loading required package: annotate
Loading required package: xtable
KernSmooth 2.22 installed
Copyright M. P. Wand 1997

Attaching package: 'geneplotter'

        The following object(s) are masked from package:lattice :

         panel.smoothScatter

R> smoothScatter(theA[, 1, 1], theM[, 1, 1], main = "MA-plot (Antisense)", xlab = "Average Intensity", ylab = "Log-ratio (A/B)")
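For reference, the MA-plot coordinates are simple transforms of the two log2 intensities: M is the log2 ratio and A is the average log2 intensity. A tiny Python sketch of the transform (illustrative values; snprma's outputs are already on a log2 scale, so here we start from raw intensities):

```python
import math

def ma_values(intensity_a, intensity_b):
    """MA-plot coordinates for one probe: M is the log2 ratio of the two
    intensities, A is their average log2 intensity."""
    la, lb = math.log2(intensity_a), math.log2(intensity_b)
    return la - lb, (la + lb) / 2.0

m, a = ma_values(2048.0, 1024.0)
print(m, a)  # 1.0 10.5 -- a 2-fold A/B difference at average log2 level 10.5
```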

Next Week Plan

  • Look into data more deeply and identify candidate algorithms.

Grade for the Week

  • Grade: A
  • We performed well for a starter week and processed a lot of data to get this project moving quickly.

Problems that Came Up

  • We had a hard time importing the CEL data into R. First we tried the "affy" package and ran into a CDF import problem. Then we tried the "aroma.affymetrix" package and again had problems importing the CEL data. Finally, we tried the "oligo" package and were able to import the CEL data.

Problems Solved This Week

  • We are now able to access the probe intensity data and plot the normalized values in an MA-plot format. We'd like to thank Jimmie Ye for his help.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License