Population-statistical method of studying genetics: the essence and significance
In recent years, a very wide variety of statistical methodologies have been put forward at various levels of complexity for analyzing genotype data and identifying genetic variations that may be responsible for increasing susceptibility to diseases. What is the population-statistical method for studying genetics? What is its essence and significance in the study of heredity?
Type of computational biology
Statistical genetics is a scientific field concerned with developing statistical methods for drawing inferences from genetic data. The term is most often used in the context of human genetics. Research in this field typically involves developing theory or methodology to support research in one of three interrelated areas:
- population genetics - the study of evolutionary processes affecting genetic variability between organisms.
- genetic epidemiology - the study of the influence of genes on disease.
- quantitative genetics - the study of the influence of genes on "normal" phenotypes.
Statistical geneticists tend to work closely with geneticists, molecular biologists, clinicians, and bioinformaticians. Statistical genetics is a type of computational biology.
Subject of study
Population genetics studies the genetic structure of populations and their gene pool. It also covers the interaction of the factors that determine both the constancy of that genetic structure and the changes in it. What is a population? It is a collection of individuals of the same species that interbreed freely, occupy a specific territory, and share a common gene pool that is passed from generation to generation.
The population-statistical method of genetics is used to study hereditary diseases and the distribution of normal and pathological genes, genotypes and phenotypes in populations of different localities, countries and cities. What makes it distinctive? The essence of the population-statistical method is that it studies the patterns by which hereditary diseases spread in populations that differ in structure. It also examines the possibility of predicting their recurrence in future generations.
Population-statistical method and its value
Statistical genetic analysis of quantitative traits in large pedigrees is a formidable computational task because the non-independence among relatives must be taken into account. With growing awareness that rare sequence variants may be important in human quantitative variation, heritability and association studies involving large pedigrees will become more frequent, because related individuals are more likely to carry multiple copies of a rare variant.
It is therefore important to have statistical genetic test procedures that use all available information to extract evidence of genetic association. Optimal testing for association with a phenotype involves the exact calculation of the likelihood ratio statistic, which requires repeated inversion of potentially large matrices. In the context of whole-genome sequence association, this calculation may be computationally prohibitive.
Statistical methods of genetic analysis
As laboratory technologies advance, the population-statistical method and genetic epidemiological approaches to complex diseases are changing rapidly to cope with the sheer volume of genetic data. These technologies have made it possible to generate increasingly comprehensive genetic data, up to whole-genome sequence data.
Problems of multiple testing and the appearance of rare genetic variants exposed the limitations of traditional statistical methods, which led to the development of rare-variant analysis methods. Current research focuses not only on the analysis of individual genetic variants but also on the joint analysis of several variants, especially using network methods.
The rapid development of genetics
Research in genetics has developed rapidly, from studies of individual genomic regions to large-scale genome-wide research. And although genetic association studies have been conducted for many years, even for the simplest analyses there is little consensus about the most appropriate statistical procedures.
Statistical genetics lies at the convergence of genetics and quantitative analysis. Over the past few years it has undergone a dramatic paradigm shift, from a predominantly theoretical subject with little access to empirical data to a rigorous, data-driven discipline in which large repositories of genetic data allow researchers to generate and test new scientific hypotheses.
With the advent of relatively cost-effective high-throughput genotyping technology, it is now possible to explore the etiology of complex diseases, the biological processes through which DNA is inherited, and the evolutionary histories of human populations. From a medical point of view, a promising application of the population-statistical method to the role of heredity lies in the design and analysis of pharmacogenetic studies, that is, studies in which genetic variability is correlated with the response to drugs.
This may ultimately lead to the development of a “personalized medicine” approach in health care. Of course, for each of these areas of research, specialized methods of inference and computation are required. This review of population-statistical methods in genetics is limited to association mapping: a powerful methodology that is thought to help in understanding the genetic basis of human diseases and other phenotypes of interest.
Rather than attempting to survey every association mapping method, the exposition is narrowed to data analysis approaches for case-control studies and for studies in which only affected individuals are available. The purpose of this article is to invite the reader on a non-technical tour of a number of selected population-statistical genetics methods currently used for gene mapping.
The main example of the population-statistical method is the Hardy-Weinberg law. It rests on a regularity established in 1908 by G. H. Hardy, a mathematician from England, and by the physician Wilhelm Weinberg from Germany, for an ideal population. The law was therefore named after both of them. For a population to be ideal, the following conditions are necessary:
- Organisms must interbreed freely (random mating).
- There is no selection and no mutation.
- Migration into and out of the population is absent or negligible.
- Dominant homozygotes, heterozygotes and recessive homozygotes are equally viable and fertile, so that genotypes are passed on without distortion.
This equilibrium can be upset by a number of factors, including consanguineous marriages, mutation, selection, migration, and more. The Hardy-Weinberg law is regarded as the basis for analyzing genetic changes in natural and artificially created populations of plants, animals and humans.
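As an illustration of how the law is usually applied, the sketch below takes observed genotype counts at a locus with two alleles, estimates the allele frequency p (with q = 1 - p), and compares the observed counts with the expected proportions p², 2pq and q². These proportions are the standard textbook statement of the law rather than something spelled out above, and the function names and example counts are purely illustrative.

```python
def hardy_weinberg_expected(n_AA, n_Aa, n_aa):
    """Return the frequency p of allele A and the genotype counts
    expected under Hardy-Weinberg equilibrium (AA, Aa, aa)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # each AA carries two A alleles, each Aa one
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return p, expected


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-square statistic measuring departure from equilibrium.
    With two alleles the statistic has one degree of freedom."""
    observed = (n_AA, n_Aa, n_aa)
    _, expected = hardy_weinberg_expected(n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical sample: 30 AA, 40 Aa and 30 aa individuals.
    p, expected = hardy_weinberg_expected(30, 40, 30)
    print("p =", p, "expected counts =", expected)
    print("chi-square =", hwe_chi_square(30, 40, 30))
```

For these hypothetical counts the expected numbers are 25, 50 and 25, so the sample shows a deficit of heterozygotes, the kind of departure that consanguineous marriages can produce.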
Principles of association
A distinctive feature of the case-control design is that subjects are sampled retrospectively from a given population according to their disease status. The genetic make-up of individuals in the two groups, cases and controls, is compared in the hope that differences in certain narrow regions of the genome can serve as a causal explanation of disease status. Among the different types of genetic markers, single nucleotide polymorphisms (SNPs) play a central role in the mapping of complex diseases. Across the entire human genome there are at least 10 million SNPs with a frequency above 1%, which are thought to account for about 90% of human genetic variation.
The fundamental concept in association mapping is linkage disequilibrium between a genetic marker and the locus that influences the trait under study. It captures the deviation from probabilistic independence among alleles or genetic markers. For example, linkage disequilibrium between two alleles, say $A$ and $B$, can be quantified by the difference between $p_{AB}$, the probability of observing the haplotype $AB$ (i.e. the linear arrangement of the two alleles on the same chromosome, inherited as a unit), and the product $p_A p_B$, where $p_A$ and $p_B$ are the probabilities of observing alleles $A$ and $B$ respectively. However, in most cases haplotypes are not directly observable, and their frequencies must be estimated statistically from the genotype data.
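The difference described here is commonly written D = p_AB - p_A p_B. The sketch below computes it together with two widely used normalized versions, D' and r², which are not defined in the text and are included only as an assumption about which coefficients a reader is likely to meet. The function name and the example frequencies are hypothetical.

```python
def linkage_disequilibrium(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci.

    p_AB : frequency of the haplotype carrying alleles A and B
    p_A  : frequency of allele A at the first locus
    p_B  : frequency of allele B at the second locus
    """
    d = p_AB - p_A * p_B

    # D' divides D by the largest magnitude it could reach
    # given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two alleles.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies: A and B always travel together.
    print(linkage_disequilibrium(0.5, 0.5, 0.5))
    # Hypothetical frequencies: A and B are independent.
    print(linkage_disequilibrium(0.25, 0.5, 0.5))
```

When the alleles are independent, p_AB equals p_A p_B and all three coefficients are zero.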
Inference methods based on variants of the expectation-maximization (EM) algorithm, an iterative technique for obtaining maximum likelihood estimates in missing-data models, are a popular choice for estimating sample haplotype frequencies. The accuracy of the EM algorithm for estimating haplotype frequencies under various simulation schemes, as a function of allele frequencies and many other factors, has been documented. Recent developments exploit the observation that, over short regions, haplotypes in a population tend to cluster into groups, and that this clustering tends to vary along the chromosome.
The resulting patterns of genetic variation can be described well by hidden Markov models, and the fitted model can be used to infer haplotype phase as well as missing genotype data. Alternatively, a composite measure of genotypic disequilibrium can be calculated directly from two-locus genotype data; under random mating it coincides with the allelic measure described above. A number of other common coefficients and their properties have been studied both analytically and through simulation.
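To make the idea concrete, here is a minimal sketch of the expectation-maximization approach for the simplest case of two loci with two alleles each. Only individuals heterozygous at both loci are ambiguous, since they may carry either the haplotype pair AB/ab or Ab/aB. The layout of the input table, the function name and the example counts are assumptions made for illustration, and the sketch assumes random mating.

```python
def em_haplotype_frequencies(g, n_iter=100, tol=1e-10):
    """Estimate frequencies of haplotypes AB, Ab, aB, ab by EM.

    g[i][j] is the number of individuals carrying i copies of allele A
    at the first locus and j copies of allele B at the second (i, j in 0..2).
    """
    n_hap = 2 * sum(sum(row) for row in g)   # two haplotypes per person
    # Haplotype counts that follow directly from unambiguous genotypes.
    base = {
        "AB": 2 * g[2][2] + g[2][1] + g[1][2],
        "Ab": 2 * g[2][0] + g[2][1] + g[1][0],
        "aB": 2 * g[0][2] + g[1][2] + g[0][1],
        "ab": 2 * g[0][0] + g[1][0] + g[0][1],
    }
    n_dh = g[1][1]                           # double heterozygotes
    freq = {h: 0.25 for h in base}           # starting values

    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AB/ab.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-count haplotypes using the expected split.
        new = {
            "AB": (base["AB"] + w * n_dh) / n_hap,
            "ab": (base["ab"] + w * n_dh) / n_hap,
            "Ab": (base["Ab"] + (1 - w) * n_dh) / n_hap,
            "aB": (base["aB"] + (1 - w) * n_dh) / n_hap,
        }
        change = max(abs(new[h] - freq[h]) for h in freq)
        freq = new
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    # Hypothetical sample: 10 AABB, 10 aabb and 10 double heterozygotes.
    counts = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
    print(em_haplotype_frequencies(counts))
```

In this hypothetical sample the unambiguous individuals carry only AB and ab, so the algorithm assigns the double heterozygotes to the AB/ab configuration with increasing confidence at each iteration.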
The twin method in the study of the genome
The areas of application of the population-statistical and twin methods include the study of how traits are inherited in pairs of twins. Proposed as early as 1875 by Francis Galton, the twin method was initially used to assess the roles of heredity and environment in the development of human mental characteristics. It is now widely used to study the heredity and variability of normal and pathological traits. It can be used to establish whether a particular trait is hereditary, to determine the penetrance of an allele, and to assess the external factors acting on the organism.
The essence of the twin method:
- The same trait is compared across different groups of twins, taking into account the similarity or difference of their genotypes.
- Monozygotic twins are genetically identical. Comparing them under different conditions of postembryonic development makes it possible to detect the traits that were shaped by the external environment.
Genome-wide studies within the population-statistical method of studying human genetics allow a more comprehensive search for genetic risk factors. In the near future these studies will become less expensive and therefore more accessible. From a statistical and computational point of view, genome-wide studies pose non-trivial problems associated, among other things, with the very large number of markers that must be included in the analysis relative to the usually much smaller sample sizes.
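One simple way to carry out such a comparison for a yes/no trait is to compute the concordance rate in each group of twins, that is, the share of pairs in which both twins show the trait among pairs in which at least one does. The sketch below assumes this pairwise definition; the function name and the twin data are hypothetical.

```python
def pairwise_concordance(pairs):
    """Share of twin pairs in which both twins have the trait, among
    pairs in which at least one twin has it.

    pairs is a list of (twin1_has_trait, twin2_has_trait) booleans.
    """
    concordant = sum(1 for a, b in pairs if a and b)
    discordant = sum(1 for a, b in pairs if a != b)
    informative = concordant + discordant
    return concordant / informative if informative else float("nan")


if __name__ == "__main__":
    # Hypothetical data for monozygotic (MZ) and dizygotic (DZ) pairs.
    mz_pairs = [(True, True)] * 8 + [(True, False)] * 2
    dz_pairs = [(True, True)] * 2 + [(True, False)] * 8
    print("MZ concordance:", pairwise_concordance(mz_pairs))
    print("DZ concordance:", pairwise_concordance(dz_pairs))
```

A clearly higher concordance in monozygotic than in dizygotic pairs, as in this made-up example, points to a hereditary contribution to the trait, while similar rates in the two groups point to the environment.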
The development of new analytical methods
A question that generates much discussion and fuels the development of new analytical methods is whether complex diseases are caused by a single common variant or by many variants with small effects. The common disease/common variant hypothesis holds that the genetic risk of common diseases is often due to disease-causing alleles found at relatively high frequencies. So far, evidence in its favor has been limited.
It is plausible that common diseases are controlled by more complex genetic mechanisms characterized by the joint action of several genes, each with only a small marginal effect, possibly because natural selection removes genes that have larger effects. In this case, groups of markers should be tested together for association, which can be done in two main ways: by grouping markers into multilocus genotypes, so that the basic unit of statistical analysis is still the individual, or by grouping them into haplotypes, which effectively doubles the sample size.
General methods for haplotypes
Instead of examining each marker separately, it is possible to jointly test specific combinations of allelic variants at a series of tightly linked markers on the same chromosome, that is, haplotypes. Because they combine information from several adjacent markers, haplotypes preserve the joint structure of the markers and more directly reflect the true polymorphisms.
The easiest way to check for an association between haplotypes and disease status is to treat each haplotype as a separate category, perhaps pooling all rare haplotypes into one additional class. This is usually done in two stages: first the haplotype frequencies are estimated, then a standard test of association is computed, for example a likelihood ratio statistic. To cope with the inflation of the test statistic caused by haplotype estimation, the distribution of the test under the null hypothesis can be obtained by randomly permuting disease status and then re-estimating the haplotype frequencies.
Although this approach makes it possible to assess the overall association between haplotypes and the disease, it does not allow conclusions about the effects of specific haplotypes or haplotype features. To address this, a number of tests for haplotype-specific effects model the probability of disease, with disease status treated as the outcome and the haplotypes entering a regression model as covariates. Subjects with ambiguous haplotypes are accommodated by computing the expected value of the covariates given the subject's genotypes, using the estimated haplotype frequencies.
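The permutation idea can be sketched as follows. To keep the example short, it assumes that each subject's pair of haplotypes is already known, so the re-estimation of haplotype frequencies inside each permutation is omitted, and it uses a Pearson chi-square statistic in place of the likelihood ratio. Both simplifications, the function names and the example data are assumptions made for illustration.

```python
import random
from collections import Counter


def chi_square_stat(haplotype_pairs, status):
    """Pearson chi-square for the haplotype-by-status table, counting
    both haplotypes of every subject under that subject's status."""
    haps, labels = [], []
    for (h1, h2), s in zip(haplotype_pairs, status):
        haps += [h1, h2]
        labels += [s, s]
    n = len(haps)
    row, col = Counter(haps), Counter(labels)
    cell = Counter(zip(haps, labels))
    stat = 0.0
    for h in row:
        for s in col:
            expected = row[h] * col[s] / n
            stat += (cell[(h, s)] - expected) ** 2 / expected
    return stat


def permutation_p_value(haplotype_pairs, status, n_perm=1000, seed=0):
    """Shuffle disease status across subjects to build the distribution
    of the statistic under no association; return (statistic, p-value)."""
    rng = random.Random(seed)
    observed = chi_square_stat(haplotype_pairs, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square_stat(haplotype_pairs, shuffled) >= observed:
            hits += 1
    # Adding one to both counts keeps the estimate away from zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical data: haplotype H1 is more common among cases.
    pairs = [("H1", "H1")] * 8 + [("H1", "H2")] * 2 + [("H2", "H2")] * 10
    status = ["case"] * 10 + ["control"] * 10
    print(permutation_p_value(pairs, status))
```

Shuffling is done at the level of subjects rather than chromosomes, so the two haplotypes of one person always keep the same disease status.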
Population-statistical method for studying human genetics
In human populations formed by relatively recent mixing of several ancestral groups, such as African Americans, linkage disequilibrium extends over greater distances than in other, less heterogeneous populations. For diseases whose prevalence differs between two or more ancestral populations, this long-range linkage disequilibrium can be used to search for genetic variants responsible for the ethnic difference in disease risk.
The main point is that, in admixed populations, markers linked to a locus responsible for the ethnic difference in disease risk will show a greater than expected proportion of ancestry from the high-risk population. Gene mapping can be performed by searching for narrow genomic regions that show an excess of ancestry from one of the constituent ancestral populations, a methodology called admixture mapping.
The population of origin at each locus must be estimated statistically for every subject from the typed markers. The generally accepted probabilistic model for the stochastic variation in ancestry assumes that chromosomes can be represented as blocks of common ancestry, with breakpoints between adjacent blocks occurring as a Poisson process and transitions between adjacent ancestral blocks governed by a Markov chain. Several inference methods have been built on this model to estimate the ancestry of the chromosomes of affected individuals and to detect over-represented ancestral populations.
Simulation studies and analytical calculations show that admixture mapping has several advantages over established approaches to population-based mapping: for example, far fewer markers are required to search the entire genome, and the method is less susceptible to allelic heterogeneity.
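The block model described above can be illustrated by simulating a single chromosome. In this sketch the distance to the next breakpoint is drawn from an exponential distribution, which is what a Poisson process of breakpoints implies, and the ancestry of the next block is drawn from a Markov chain. The function name, the breakpoint rate, the population labels and the transition probabilities are hypothetical values chosen only for the example.

```python
import random


def simulate_ancestry_blocks(length, rate, start_probs, transition, seed=0):
    """Simulate ancestry blocks along one chromosome.

    length      : chromosome length (for example in Morgans)
    rate        : expected number of breakpoints per unit length
    start_probs : dict mapping ancestry label to its starting probability
    transition  : dict of dicts, transition[a][b] = P(next block is b | current is a)

    Returns a list of (start, end, ancestry) tuples covering the chromosome.
    """
    rng = random.Random(seed)
    states = list(start_probs)
    current = rng.choices(states, weights=[start_probs[s] for s in states])[0]
    position = 0.0
    blocks = []
    while position < length:
        # Gaps between events of a Poisson process are exponential.
        end = min(position + rng.expovariate(rate), length)
        blocks.append((position, end, current))
        position = end
        # Markov chain step: the next ancestry depends only on the current one.
        row = transition[current]
        current = rng.choices(states, weights=[row[s] for s in states])[0]
    return blocks


if __name__ == "__main__":
    # Hypothetical two-population example.
    start = {"pop1": 0.8, "pop2": 0.2}
    trans = {"pop1": {"pop1": 0.8, "pop2": 0.2},
             "pop2": {"pop1": 0.8, "pop2": 0.2}}
    for block in simulate_ancestry_blocks(1.5, 6.0, start, trans, seed=42):
        print(block)
```

Inference methods work in the opposite direction: the blocks are hidden, and the ancestry at each position has to be estimated from the marker genotypes that are actually observed.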