My name is Matt Schuerman. I am a graduate student in Computer Science. I went to UCSD for my undergraduate education and received degrees in Physics and "The Mathematics of Computer Science." After that I went into industry for a few years. Last year I decided I was tired of getting paid to work; instead I wanted to pay others to make me work! Hence graduate school. I work in the MAGIX lab and plan to do my thesis in computer graphics.
We are doing Project 28: Genotype Calling. For now we are attempting the "Medium" level of the project, which is identifying clusters in the genotype plots and classifying each individual's genotype based on those clusters. A variety of clustering algorithms already exist for this task. These algorithms are successful, but are often limited by the computational power required to run them. We plan to implement our own clustering algorithm and compare the results to these existing algorithms.
Goal for the Quarter
Hopefully we will be able to do roughly as well as existing algorithms in terms of classifying individuals. However, we would like to do this with an algorithm that has fewer tuned parameters and a better run time.
So far we have downloaded R and BioConductor and installed them on our computer. We are working on reading CEL files from the HapMap site into R. Once we can read the files we plan on converting them into a more manageable intermediate format and testing some of BioConductor's built-in clustering algorithms. We would like to finish these tasks within a week. So far the only task has been making this wiki page, which we've done, so I give myself an A.
Reading individual CEL files from the HapMap has proven to be very difficult. However, we have managed to extract the probe intensities from a set of sample data. Before this can be done several packages must be installed. The R code to do this is shown below:
R> source("http://bioconductor.org/biocLite.R")
R> biocLite("oligoClasses")
R> biocLite("oligo")
R> biocLite("hapmap100kxba")
R> biocLite("hapmap100khind")
R> biocLite("pd.mapping50k.xba240")
R> biocLite("pd.mapping50k.hind240")
Once these packages have been installed, oligo can be used to read and normalize the probe intensities for sample data from the HapMap. An example of this for the XBA data is shown below (some users also recommend a line defining the temp directory, which is not included here):
R> library("oligo")
R> library("hapmap100kxba")
R> pathCelFiles <- system.file("celFiles", package = "hapmap100kxba")
R> fullFilenames <- list.celfiles(path = pathCelFiles, full.names = TRUE)
R> rawData <- read.celfiles(fullFilenames)
R> preProcessedData <- snprma(rawData)
The "preProcessedData" variable now contains probe intensities for 3 individuals over thousands of SNPs. For each SNP in each individual there are four probe values listed: sense A, sense B, anti-sense A, and anti-sense B. "Sense" and "anti-sense" represent complementary probe values. DNA is a double helix which is split down the middle during probing, so each location in the DNA really has two values (one for each strand) that can be measured. Sense and anti-sense are probe values at the same location: one measuring each strand of the original double helix. For genotype calling it does not matter which is chosen, so long as the selection is consistent throughout the project. An example of how to get the "sense" probe intensities is shown below:
R> a <- senseThetaA(preProcessedData)
R> write.table(a, "senseA.txt")
R> b <- senseThetaB(preProcessedData)
R> write.table(b, "senseB.txt")
The values are now contained in two text files (senseB.txt and senseA.txt) which can be read in by a variety of software packages. An example of the first few lines of one of these files is shown below:
"1" "2" "3"
"SNP_A-1507972" 11.0019846094543 11.1613403058736 11.4167426000789
"SNP_A-1510136" 10.5649038218336 10.7857168045851 11.7683224554234
"SNP_A-1511055" 11.4143531913055 11.9222583189184 11.4783974273146
"SNP_A-1518245" 12.0376655921879 11.1235324915263 10.8383048693874
"SNP_A-1641749" 9.5493310289614 9.6780805229471 9.8717939149977
"SNP_A-1641751" 10.5943019604841 10.8377228412975 10.6608865213013
"SNP_A-1641753" 11.8663082543653 10.2940949443682 12.0310136433338
"SNP_A-1641755" 12.4720811725513 12.1530857155924 12.0507630444764
"SNP_A-1641757" 13.0702843315926 12.8825500145301 11.253626622964
"SNP_A-1641759" 10.5060109037526 10.7737128587571 10.5801789732667
"SNP_A-1641761" 11.2450865557147 12.4856750249178 12.651674552252
"SNP_A-1641763" 10.5010269325118 10.7892433061207 11.1429061874913
"SNP_A-1641765" 12.9816945102881 13.6337445902109 12.8695699825954
"SNP_A-1641767" 12.8300427364320 13.9930501743677 13.3944448392191
"SNP_A-1641769" 13.2983822584815 13.2546897804780 12.1054922413162
"SNP_A-1641771" 9.07831452175107 9.07831452175107 9.07831452175107
"SNP_A-1641773" 10.8604162717045 10.6533118757745 10.8832466293822
"SNP_A-1641775" 10.1994853511519 11.7569313862574 11.9359831190193
However, these samples appear to contain values for only 3 individuals, which is a problem. An arbitrary number of CEL files can be read by Oligo, but doing so takes a large amount of memory. To read in any set of CEL files, copy them all to a specific directory ("G:\cs224\testData" on my machine) and use the following commands in R:
R> library("oligo")
R> fullFilenames <- list.celfiles(path = "G:/cs224/testData", full.names = TRUE)
R> rawData <- read.celfiles(fullFilenames)
R> preProcessedData <- snprma(rawData)
This will process all the CEL files in G:/cs224/testData, provided your computer has the memory. To deal with memory constraints, batch processing with many small sets of files might be required. However, the professor has agreed to try using some of the lab resources to do this computation. We will wait for the results of that effort before we begin doing this processing ourselves. The goal of being able to read CEL files is progressing slowly, but should be done within the next week.
The goal for next week is to begin implementing some initial clustering ideas. Mathematica will be used for these initial implementations.
This week's progress was slower than desired. Part of that was studying for the midterm. Another part was the learning curve of the required BioConductor packages. Hopefully next week will be better. I give myself a B+.
Nick in the professor's lab was able to run all the XBA 100k CEL files through the R commands listed last week. The resulting text files total ~500 MB compressed, and we are told producing them required ~15 GB of RAM during processing.
Below is an example scatter plot of probe intensities. We read in both the sense and anti-sense probe values and average them for each individual. This is a scatter-plot of those averaged probe intensities for SNP 1759055:
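This averaging step can be sketched as follows (a Python stand-in for our Mathematica code; the parsing assumes the default format of the write.table files produced earlier):

```python
# Average the sense and anti-sense intensity tables per individual.
# File names are those written by write.table above; the parsing of the
# quoted, space-separated format is an assumption.

def read_rtable(path):
    """Parse a write.table output file into {snp_id: [values...]}."""
    with open(path) as f:
        next(f)  # skip the header row of individual IDs
        table = {}
        for line in f:
            parts = line.replace('"', '').split()
            table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def average_tables(t1, t2):
    """Average matching entries of two probe-intensity tables."""
    return {snp: [(a + b) / 2 for a, b in zip(t1[snp], t2[snp])]
            for snp in t1 if snp in t2}

# For each SNP and individual, the plotted point is
# (averaged A intensity, averaged B intensity).
```

Each SNP then yields one (A, B) point per individual, which is what the scatter plots show.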
The first thing to do is find a way to make an initial guess about the centers of the clusters. We did this by applying a grid to the graph and counting the number of points that fall in each grid cell. We then convolve a 5x5 Gaussian kernel with the image resulting from that gridding, look for peaks in the smoothed image, and use those peaks as initial cluster centers. Below is a plot of such a grid (it is inverted because of a gridding artifact):
The result is that 4 peaks are found by the gridding procedure. Below is a figure with the locations of those peaks indicated by large black dots:
Upon initial inspection these peaks seem to be in roughly the right locations. However, there is one peak too many; this will be taken care of later in the algorithm. The peaks are now assigned initial covariance matrices. Each matrix has one axis of normal length and one longer primary axis, with the longer axis pointing along the line from the origin to the peak. This is because, looking at the data, we have noticed that clusters can extend quite a bit radially but are often more compact cross-radially.
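The initialization described above can be sketched as follows (a Python/numpy stand-in for our Mathematica code; the grid size, kernel weights, and axis lengths are illustrative assumptions):

```python
import numpy as np

def find_peaks(points, bins=30):
    """Grid the points, smooth with a 5x5 Gaussian kernel, and return
    the local maxima of the smoothed grid as initial cluster centers."""
    pts = np.asarray(points, dtype=float)
    hist, xe, ye = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins)

    # 5x5 Gaussian kernel (outer product of binomial weights).
    g = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
    kern = np.outer(g, g) / 256.0
    padded = np.pad(hist, 2)
    smooth = np.zeros_like(hist)
    for i in range(hist.shape[0]):
        for j in range(hist.shape[1]):
            smooth[i, j] = np.sum(padded[i:i + 5, j:j + 5] * kern)

    # A peak is a cell strictly greater than its 8 neighbours.
    border = np.pad(smooth, 1, constant_values=-1.0)
    peaks = []
    for i in range(smooth.shape[0]):
        for j in range(smooth.shape[1]):
            window = border[i:i + 3, j:j + 3]
            if smooth[i, j] > 0 and smooth[i, j] == window.max() \
                    and (window == smooth[i, j]).sum() == 1:
                # Map the cell back to data coordinates (bin centers).
                peaks.append(((xe[i] + xe[i + 1]) / 2,
                              (ye[j] + ye[j + 1]) / 2))
    return peaks

def initial_covariance(center, short_axis=0.05, long_axis=0.2):
    """Initial covariance whose long principal axis points along the
    line from the origin through the peak (the radial direction)."""
    c = np.asarray(center, dtype=float)
    u = c / np.linalg.norm(c)        # radial unit vector
    v = np.array([-u[1], u[0]])      # cross-radial unit vector
    return long_axis**2 * np.outer(u, u) + short_axis**2 * np.outer(v, v)
```

Each peak found by `find_peaks` is given a covariance elongated radially by `initial_covariance`, matching the shape we observe in the data.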
Now that initial peaks have been assigned locations and covariance matrices we are in position to enter the main loop of the algorithm:
- Assign each point to the cluster it is closest to in state space
- Once all points have been assigned re-compute the location and covariance matrix of each cluster
- Check the distances in state space between the updated clusters. If two clusters are close in state space, then merge those clusters
This loop continues until the clusters stop changing. At that point we have a complete genotype call for a particular SNP. Distances in state space are measured using Mahalanobis distances. The formula for the Mahalanobis distance is:

(1) $D(P_1, P_2) = \sqrt{(P_1 - P_2)^T C^{-1} (P_1 - P_2)}$

where $C^{-1}$ is the inverse of the covariance matrix of the cluster (or of the sum of the two covariance matrices when comparing two clusters) and the $P$'s are vectors of the x,y locations.
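The main loop can be sketched as follows (a Python/numpy stand-in for our Mathematica code; the merge threshold and the merge-by-dropping step are illustrative simplifications):

```python
import numpy as np

def mahalanobis(p1, p2, cov):
    """Mahalanobis distance between two points under covariance cov."""
    d = np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def cluster(points, centers, covs, merge_dist=2.0, max_iter=50):
    """Iterate: assign points, re-fit clusters, merge close clusters."""
    pts = np.asarray(points, dtype=float)
    centers = [np.array(c, dtype=float) for c in centers]
    covs = [np.array(c, dtype=float) for c in covs]
    labels = None
    for _ in range(max_iter):
        # 1. Assign each point to the closest cluster in state space.
        new = np.array([min(range(len(centers)),
                            key=lambda k: mahalanobis(p, centers[k], covs[k]))
                        for p in pts])
        if labels is not None and np.array_equal(new, labels):
            break  # clusters stopped changing
        labels = new
        # 2. Re-compute each cluster's location and covariance matrix.
        for k in range(len(centers)):
            members = pts[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
            if len(members) > 2:
                covs[k] = np.cov(members.T) + 1e-6 * np.eye(2)
        # 3. Merge clusters that are close under the summed covariances
        #    (here one is simply dropped; its points re-assign next pass).
        merged = True
        while merged and len(centers) > 1:
            merged = False
            for i in range(len(centers)):
                for j in range(i + 1, len(centers)):
                    if mahalanobis(centers[i], centers[j],
                                   covs[i] + covs[j]) < merge_dist:
                        centers.pop(j)
                        covs.pop(j)
                        merged = True
                        break
                if merged:
                    break
    return centers, labels
```

The loop exits as soon as an assignment pass produces no changes, which is exactly the stopping condition described above.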
At the beginning of the above mentioned loop the clustering looks like this:
The different colors indicate the clusters the data points are in. The data points were assigned using the initial cluster locations and covariance matrices. After one iteration the clustering looks like this:
The extra cluster has been merged with another. For this example no points have moved between clusters, but in other examples that is common. Also notice that the mean locations of all the clusters have shifted. After the next iteration the clustering looks the same:
Since there were no changes in clustering, the algorithm can now stop. It is important to note that it is possible to end up with more than 3 clusters. If this happens we merge the two closest clusters and then re-calculate all the clusters' parameters. This continues until only 3 clusters remain.
Once the data points have been partitioned into clusters we now need to assign calls to them. Take the ratio of the coordinates of the center of each cluster (y/x):
- If y/x ~ 1.3, then that cluster is called as BB
- If y/x ~ 1.0, then that cluster is called AB
- If y/x ~ 0.7, then that cluster is called AA
If more than one cluster is present then some clusters can be called based on their center ratio relative to other clusters which have already been called. Clearly if there are 3 clusters all we have to do is sort them based on ratio and use that for calling.
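This calling rule can be sketched as follows (a Python stand-in for our Mathematica code; the nearest-nominal-ratio fallback for fewer than 3 clusters is an illustrative assumption):

```python
# Map cluster centers to genotype calls by the y/x ratio of each center,
# using the nominal ratios from the rules above.
def call_clusters(centers):
    """Return {cluster index: genotype call} for a list of (x, y) centers."""
    order = sorted(range(len(centers)),
                   key=lambda k: centers[k][1] / centers[k][0])
    if len(centers) == 3:
        # With 3 clusters, sort by ratio: lowest -> AA, middle -> AB,
        # highest -> BB.
        names = ["AA", "AB", "BB"]
        return {order[i]: names[i] for i in range(3)}
    # Otherwise call each cluster by the nearest nominal ratio
    # (an assumed tie-breaking rule for illustration).
    nominal = {"AA": 0.7, "AB": 1.0, "BB": 1.3}
    return {k: min(nominal,
                   key=lambda g: abs(centers[k][1] / centers[k][0]
                                     - nominal[g]))
            for k in range(len(centers))}
```

With all 3 clusters present, sorting by ratio suffices, exactly as described above.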
We have not been able to get at the truth data yet, but upon visual inspection this clustering appears correct. Our plan for next week is to start comparing our results to truth from the HapMap. Once again the professor has agreed to help find a way to convert Affy SNP labels to standard SNP labels. One of his graduate students has already done this and he will send us what he found. We also need to search through the data to find the hardest cases we can to test our algorithm with. So far things have been going well. We accomplished our goals from last week, so I give myself an A for this week.
Finding the HapMap's official genotype calls and putting those calls into readable format has proven to be a cumbersome task. We feel that the HapMap project would benefit from establishing more user-friendly data structures, as much of our time working on the project thus far has been devoted just to reading the files available online. We have not found an effective way to automate the process. In the future this might limit our ability to thoroughly test our algorithm. The genotype calls are available on the HapMap's FTP site at the following address:
This is what we believe to be the latest full set of non-redundant genotype calls. They are organized by chromosome and population, as indicated by the filenames. Once decompressed the files have the following format (this is from file genotypes_chr1_CHB.txt):
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code NA18524 NA18526 NA18529 …
rs2949420 A/T Chr1 45257 + ncbi_b34 sanger urn:lsid:illumina.hapmap.org:Protocol:Golden_Gate_1.0.0:1 urn:lsid:sanger.hapmap.org:Assay:4499502:1 urn:LSID:dcc.hapmap.org:Panel:Han_Chinese:1 QC+ TT TT TT …
rs4965412 A/G Chr1 60165 - ncbi_b34 sanger urn:lsid:illumina.hapmap.org:Protocol:Golden_Gate_1.0.0:1 urn:lsid:sanger.hapmap.org:Assay:4581185:1 urn:LSID:dcc.hapmap.org:Panel:Han_Chinese:1 QC+ GG GG GG …
rs4030303 C/T Chr1 72434 - ncbi_b34 sanger urn:lsid:illumina.hapmap.org:Protocol:Golden_Gate_1.0.0:1 urn:lsid:sanger.hapmap.org:Assay:4319326:1 urn:LSID:dcc.hapmap.org:Panel:Han_Chinese:1 QC+ CC CC CC …
rs940550 C/T Chr1 78032 + ncbi_b34 sanger urn:lsid:illumina.hapmap.org:Protocol:Golden_Gate_1.0.0:1 urn:lsid:sanger.hapmap.org:Assay:1362797:1 urn:LSID:dcc.hapmap.org:Panel:Han_Chinese:1 QC+ TT TT TT …
The trailing "…" indicate the end of a line. We truncated the rest of the individual IDs and calls on each line for the sake of clarity. For our purposes we are only interested in the rs# and genotype calls for each individual. Once we have an rs# we can use grep to find all instances of that rs# across all chromosomes.
However, all the probe intensities from the CEL files use Affy SNP IDs instead of rs#, so we must find a way to convert between the two. Jimmie and Nick in the professor's lab showed us how to do this with the following R code:
R> library(oligo)
R> raw <- read.celfiles('CEU_NA11830_XBA.CEL')
R> annot <- annotation(raw)
R> conn <- db(get(annot))
R> sql <- 'SELECT man_fsetid,dbsnp_rs_id FROM featureSet WHERE man_fsetid LIKE "SNP%"'
R> m <- dbGetQuery(conn, sql)
R> write.table(m, 'rsIdMap.txt', row.names = F, col.names = F)
This code will write the Affy SNP IDs and rs#s of all the SNPs in the CEU_NA11830_XBA.CEL file. Other CEL files can be used if one is interested in particular SNPs. The top few lines of the output file are shown below:
So we can now get the HapMap's genotype calls for any particular SNP. The steps are as follows:
- Find the Affy SNP Id of the genotype clustered
- Use the lookup table generated by the above R code to convert that Affy SNP Id into an rs#
- Use grep (or zgrep if the files are still compressed) to find all instances of that rs# across all chromosome files
- Choose the correct chromosome and compare the calls to those of the clustering algorithm
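The lookup steps above can be sketched in Python (a stand-in for the grep workflow; the file names follow the R output above, but the paths and exact line formats are assumptions):

```python
import glob

def load_rs_map(path="rsIdMap.txt"):
    """Read the Affy SNP Id -> rs# table written by the R code above."""
    with open(path) as f:
        return dict(line.replace('"', '').split()
                    for line in f if line.strip())

def find_truth_calls(rs_id, pattern="genotypes_chr*_CEU.txt"):
    """Return every genotype line starting with rs_id, keyed by file.

    This mimics grep across the per-chromosome files; picking the
    correct chromosome from the hits remains a manual step.
    """
    hits = {}
    for fname in glob.glob(pattern):
        with open(fname) as f:
            for line in f:
                if line.startswith(rs_id + " "):
                    hits[fname] = line.split()
    return hits
```

The returned fields can then be compared against the clustering algorithm's calls for the chosen chromosome.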
Choosing the "correct" chromosome is a judgment call and part of the reason this process has not been automated. Basically we just look to see whether the calls mostly agree with those from our clustering algorithm to find what we think is the correct chromosome. Once we have the calls we read them into Mathematica and compare them with our algorithm's results. Below are the results for SNP_A-1759055 for the CEU population. The black circles indicate incorrect calls (ignore the one at (10,10)).
For this SNP the CEU population genotype calls are 100% accurate. Below are the calls for the JPT and CHB populations combined:
For this SNP the JPT+CHB population genotype calls are 98.75% accurate (1 incorrect call). Below are the calls for the YRI population:
For this SNP the YRI population genotype calls are 98.82% accurate (1 incorrect call).
Not much data has been processed at this point. However, we have found a few interesting results:
- We have found that the clustering performs much better when done on individual populations instead of on all populations at once (all populations were clustered at once in the examples from week 6). This is because different populations can have different clusters which overlap. For example, we have seen that the AA cluster of one population might sit in the same location as the AB cluster of another population, which results in many mis-called data points. Separating the populations is the only way to prevent this from happening. The Chinese and Japanese populations are the exception: they are similar and can be pooled together to give a number of data points per SNP on par with the other populations.
- NNs, which indicate bad probe values, can appear in the middle of what seem to be reasonable intensity values. This was surprising, but the professor says this is explainable by experimental errors. Such values will be excluded from our statistics.
- The algorithm seems to work equally well across all populations.
This week we also re-organized the code to make it more compact and usable. In the coming week we plan to continue testing. When we have more results we will post them along with the code. We could also present next week, which would delay any such releases. This week has been slower than expected due to the troubles with reading the HapMap genotype calls, so I give myself an A-.
Another group doing the same project sent us a file containing genotype calls for the XBA Affy 100k chip on the CEU population. These calls were obtained from the Affy website, so they can safely be used as truth data. The data format of these calls was much more amenable to automated processing. For this reason we will be presenting results of the CEU population. But based on the test cases done by hand across all 3 populations we have no reason to believe the accuracy of the results will be significantly different between populations.
We have processed 1,111 SNPs (the first 1,111 SNPs in the truth file) and found that our algorithm correctly calls genotypes 97.05% of the time. This is close to the performance of the best known genotyping algorithms. The main reason this performance is not closer to 99% is a few cases where only one cluster exists and it is mis-called. The result is a case where 0% of the calls are correct, which heavily affects the average call success statistics.
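The effect of a mis-called single-cluster SNP on the average is easy to see in a sketch (the numbers below are illustrative, not our actual project data):

```python
# Per-SNP call accuracy and the averaged statistic we report.
def call_accuracy(calls, truth):
    """Fraction of individuals whose call matches the truth call."""
    matches = sum(c == t for c, t in zip(calls, truth))
    return matches / len(truth)

def mean_accuracy(per_snp):
    """Average of per-SNP accuracies; one all-wrong SNP weighs heavily."""
    return sum(per_snp) / len(per_snp)
```

A single SNP whose only cluster is mis-called contributes a 0.0 to the list, dragging the mean down far more than a few scattered mis-called individuals would.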
More testing needs to be done. The majority of the time required to process these SNPs was consumed by reformatting files, which we will find a better way to do. Next week we will post more results and a version of the code in PDF format. Not as much was done this week as we hoped, but part of that is due to preparing for the presentation on Friday. I give myself an A-.
At this point it seems fitting to go over the main error cases we have observed. Outside of the occasional mis-clustered data point, there are 3 categories of errors:
- Exactly one cluster exists (which is correct), but that cluster is mis-called
- Two clusters are merged when they should not be
- Two clusters are not merged when they should be
The first category is the worst: one instance of this type of error results in a zero being included in the average call success statistics. The best way to deal with this is probably to take an average of the cluster location ratios over all the data in the HapMap, and to use those averages to recalibrate the ratios we use in our code.
Below is an example of the second category of error.
The second green cluster (the one with the black outlines around the points) should have been merged with the red cluster instead of the green one. This probably occurred because the covariance matrix of the red cluster is tilted in the direction of a line through the two red points instead of along the line from the origin to the center of the cluster. This could be dealt with by placing a lower bound on how many points a cluster must have before its covariance is computed (below that bound a default covariance would be assigned).
The final category of error is shown below:
We do not currently have a strategy for dealing with this type of error. To the eye it looks like the incorrect red cluster should really be its own cluster instead of being part of the green one. Some type of outside information would have to be taken into account to fix this type of case.
Below is a link to a PDF version of the Mathematica code we used. At the end of the PDF are 10 example calls from the CEU population.
A copy of the code in its original notebook format can also be obtained from the author. The run-time complexity of this algorithm is $O(n^2)$. However, in practice the algorithm runs in closer to linear time. The only place where the clustering function could require $O(n^2)$ operations instead of $O(n)$ is in the repeated re-assignment of data points. We have never seen this re-assignment take more than a few iterations. However, we cannot prove that this will always be the case, so we use the more conservative complexity estimate. Regardless, this is much better than the run-time complexities of all of the published algorithms we are aware of. For this week I give myself a B+.
Below is a link to the presentation we gave in class. Some of the numbers are old. The algorithm has gotten better since then, as shown in the rest of this page.
Although not every SNP for every population has been analyzed, we have processed enough data to be reasonably confident of our results. 97.05% is not all that we had hoped for, but it is within the ballpark of some of the best known algorithms. Given more time we are confident we could improve our algorithm's performance further. However, we were able to achieve these results with a novel and very general algorithm with almost no tuned parameters, which was an important part of our initial goal. Our algorithm is also much faster than those in the literature. On the whole we think we completed the objective we set for ourselves at the beginning of the quarter.