DSpace Repository

Statistical Models for SNP Detection

dc.contributor.advisor Ahn, Hongshik en_US
dc.contributor.author Cai, Shengnan en_US
dc.contributor.other Department of Applied Mathematics and Statistics en_US
dc.date.accessioned 2012-05-15T18:02:22Z
dc.date.accessioned 2015-04-24T14:45:13Z
dc.date.available 2012-05-15T18:02:22Z
dc.date.available 2015-04-24T14:45:13Z
dc.date.issued 2010-12-01 en_US
dc.identifier Cai_grad.sunysb_0771E_10371.pdf en_US
dc.identifier.uri http://hdl.handle.net/1951/55377 en_US
dc.identifier.uri http://hdl.handle.net/11401/70951 en_US
dc.description.abstract Variations in human DNA sequences are strongly associated with many diseases. The Single Nucleotide Polymorphism (SNP) is the most common type of DNA variation. Our research aims to detect SNPs from data generated by Polymerase Chain Reaction (PCR) and next-generation sequencing methods. In the first part of the study, we had a relatively small data set with fewer known SNPs as the training data, and we developed a classification model based on cross validation. From this first part of the research, we gained knowledge of the properties of the data. In the next phase, we obtained a much larger data set with a much larger group of known SNPs. From these data, we developed eight measures for every genetic position. Using these eight measures as predictor variables, we applied several classification methods, such as Random Forest (RF), Support Vector Machines (SVM), Single Decision Tree (ST) and Logistic Regression (LR), and then used cross validation to evaluate them. By comparing predictive accuracy, sensitivity and specificity, we identified the best-performing model for the data. To compare the performance of these models when the number of observations for each genetic position (cover depth) is small, we randomly drew subsets from the whole data set and applied the classification models to them. Variable selection was also applied in our study. The results show that SVM using the selected variables has a significantly higher average accuracy than the other methods in general, but RF using the selected variables performs best when the cover depth is as small as 20. en_US
dc.description.sponsorship This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree. en_US
dc.format Monograph en_US
dc.format.medium Electronic Resource en_US
dc.language.iso en_US en_US
dc.publisher The Graduate School, Stony Brook University: Stony Brook, NY. en_US
dc.subject.lcsh Statistics en_US
dc.subject.other Classification models, Cross validation, Next generation sequencing, SNP detection, Variable selection en_US
dc.title Statistical Models for SNP Detection en_US
dc.type Dissertation en_US
dc.mimetype Application/PDF en_US
dc.contributor.committeemember Hongshik Ahn en_US
dc.contributor.committeemember Nancy Mendell en_US
dc.contributor.committeemember Stephen Finch en_US
dc.contributor.committeemember Sangjin Hong en_US
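
The abstract describes evaluating classifiers by cross validation and comparing accuracy, sensitivity and specificity. The following is a minimal sketch of that evaluation loop only, not the dissertation's method: the eight measures and the RF, SVM, ST and LR models are not reproduced here. The function names, the toy data, and the single-feature threshold rule `fit_threshold` are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels, 1 = SNP."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(fit, X, y, k=5, seed=0):
    """Pool held-out predictions over k folds and score them once.

    `fit(X_train, y_train)` must return a function mapping one row to 0 or 1.
    """
    y_true, y_pred = [], []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_true.append(y[i])
            y_pred.append(predict(X[i]))
    return confusion_metrics(y_true, y_pred)


def fit_threshold(X_train, y_train):
    """Hypothetical stand-in classifier: cut halfway between the class means
    of the first feature. Assumes both classes occur in the training rows."""
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else 0


# Toy data (invented for illustration): one measure per genetic position.
X = [[0.1], [0.2], [0.15], [0.9], [0.8], [0.95], [0.05], [0.85], [0.12], [0.88]]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
accuracy, sensitivity, specificity = cross_validate(fit_threshold, X, y, k=5)
```

Any classifier that follows the same `fit` convention could be passed to `cross_validate`, which is one way to compare several methods on identical folds.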
