Genome-wide association studies (GWAS) obtain a genome-wide set of genetic variants in different individuals, through multi-center, large-sample clinical trials, to detect and verify genetic variants associated with phenotypes. Current GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and phenotypes such as major human diseases. However, for the majority of complex phenotypes, common variants analyzed one SNP at a time explain less than 10% of phenotypic variation, a gap known as “missing heritability”. A significant amount of phenotypic variation can be explained by common variants if genome-wide SNPs are analyzed jointly. Thus, haplotype association is much more powerful than single-SNP association for unveiling the etiology of complex phenotypes. But as the number of SNPs increases, the number of haplotypes increases dramatically, and the population frequency of each haplotype becomes very low. Such high-dimensional, massive, sparse data pose great challenges for statistical analysis. We studied the structure of haplotypes and developed a novel haplotype association method in order to find more causal variants effectively.
A novel haplotype association method is presented and its power demonstrated. Relying on a two-layer hidden Markov model for linkage disequilibrium (LD), the method first infers ancestral haplotypes and their loadings at each marker for each individual. The loadings are then used to quantify local haplotype sharing between individuals at each marker. A Bayesian regression model links the local haplotype sharing to phenotypes to test for association. Compared with existing haplotype association methods, our method integrates out phase uncertainty, avoids arbitrariness in specifying haplotypes, and requires the same number of tests as single-SNP analysis. In addition, we reduced the time complexity from putatively quadratic to linear; consequently, our method is applicable to large data sets.
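As a rough illustration of how per-marker loadings might be turned into pairwise sharing scores, the Python sketch below assumes that each individual's loadings at one marker form a vector over K ancestral haplotypes and that sharing between two individuals is the inner product of their vectors. The inner-product definition, the function names, and the low-rank shortcut are all assumptions made for illustration; the text above does not give the actual formula or say how the linear running time is achieved.

```python
from typing import List, Sequence


def local_sharing(loadings_i: Sequence[float], loadings_j: Sequence[float]) -> float:
    """Assumed sharing score at one marker: inner product of two loading vectors."""
    if len(loadings_i) != len(loadings_j):
        raise ValueError("loading vectors must have the same length")
    return sum(a * b for a, b in zip(loadings_i, loadings_j))


def sharing_matrix(loadings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Explicit n-by-n sharing matrix S = L L^T; time grows as n^2 * K."""
    return [[local_sharing(row_i, row_j) for row_j in loadings] for row_i in loadings]


def sharing_times_vector(loadings: Sequence[Sequence[float]],
                         v: Sequence[float]) -> List[float]:
    """Compute S v as L (L^T v) without forming S; time grows as n * K."""
    if len(loadings) != len(v):
        raise ValueError("need one vector entry per individual")
    if not loadings:
        return []
    k = len(loadings[0])
    # Step 1: project v onto the K ancestral haplotypes (L^T v).
    projected = [sum(row[c] * x for row, x in zip(loadings, v)) for c in range(k)]
    # Step 2: map the K-vector back to the n individuals (L times the projection).
    return [sum(a * b for a, b in zip(row, projected)) for row in loadings]
```

Under this assumed form, any computation that only needs products with the sharing matrix can skip the quadratic step of building it. This is one common way a quadratic cost in the number of individuals becomes linear, though it may differ from the device used in this work.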
We implemented the method in software, applied it to data from the Wellcome Trust Case Control Consortium, and discovered eight novel associations between seven gene regions and five disease phenotypes. Among these, GRIK4, which encodes a protein belonging to the glutamate-gated ionic channel family, is strongly associated with both coronary artery disease and rheumatoid arthritis.
Building on this work, we introduced Bayesian matrix regression to extend the haplotype association method from single-phenotype analysis to joint multi-phenotype analysis, and developed a second version of the software. We applied it to a set of immune response data for a trivalent vaccine and discovered two trans-acting response-eQTLs. The first is between a SNP in IFNAR2, a gene that encodes an interferon alpha binding protein, and a probe in OR2AG1, a member of the olfactory receptor family. The second is between a SNP in CALCR, which encodes a calcitonin receptor that maintains calcium homeostasis, and a probe in IFI27, which encodes interferon alpha-inducible protein 27.
Meanwhile, we applied Bayesian inference to disease screening and developed a novel method for analyzing non-invasive prenatal testing (NIPT, also called non-invasive prenatal screening, NIPS) data to detect fetal trisomies such as Down syndrome. A power comparison demonstrated that our Bayesian method performs markedly better than the current Z-test method. We analyzed 3405 NIPS samples and spotted at least 9 (out of 51) possible Z-test false positives. Compared with the Z-test method, the Bayesian method emphasizes the fetal DNA fraction in NIPS to improve screening accuracy, permits even lower sequencing coverage, and provides positive and negative predictive values, which are of clinical importance. Based on the clinical trials, the corresponding commercial software is currently in trial use.
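To make the quantities in this comparison concrete, the Python sketch below shows a conventional Z-score on a chromosome's read fraction, the approximate shift in that fraction expected under fetal trisomy for a given fetal DNA fraction, and positive and negative predictive values obtained from Bayes' rule. It is a generic textbook-style illustration, not the Bayesian method developed here. The function names are hypothetical, and the inputs used to exercise them would be made-up numbers rather than results from this work.

```python
from typing import Tuple


def z_score(observed_fraction: float, reference_mean: float, reference_sd: float) -> float:
    """Z-score of a sample's chromosome read fraction against euploid references."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (observed_fraction - reference_mean) / reference_sd


def expected_trisomy_fraction(euploid_fraction: float, fetal_fraction: float) -> float:
    """Approximate read fraction when the fetus carries three copies.

    The fetal share of reads for that chromosome rises by half, so the overall
    fraction rises by a factor of about (1 + fetal_fraction / 2).
    """
    if not 0.0 <= fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must lie in [0, 1]")
    return euploid_fraction * (1.0 + fetal_fraction / 2.0)


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) from Bayes' rule for a binary screening test."""
    for name, p in (("sensitivity", sensitivity), ("specificity", specificity),
                    ("prevalence", prevalence)):
        if not 0.0 < p < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv
```

Because the expected shift scales with the fetal fraction, a sample with a low fetal fraction produces a small Z-score even when the fetus is affected. This is one reason a method that models the fetal fraction explicitly can behave differently from a fixed Z-score cutoff.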