This vignette explains the assumptions made during both individual and family simulations, and how we model the liability distributions based on the work by Hujoel et al (2020). We detail both the liability threshold model used for simulation and the idea behind liability threshold conditional on family history (LT-FH).

Simulation

geneference enables simulation of genetic data for either an individual or including its family. Each simulation will have \(n\) (rows) individuals or families with \(m\) (columns) independent Single-Nucleotide Polymorphisms (SNPs), where \(q\) of them are causal for the phenotype. Each column represents the number of minor alleles for the given position in the genome of an individual. We note that this ranges from 0 to 2, as an individual could have the minor allele on none, one, or both of the chromosomes in a pair. Only the simulated genotypes of the target individual will be saved in genotypes.ped, but the liabilities and case/control status of each of the individual’s family members will be saved in phenotypes.txt. The method of determining the liability and phenotype status will be explained in the next section.
In the simulation with no family history, each SNP is sampled from a binomial distribution \(SNP_j \sim binom(2,\,MAF_j)\), where the probability of having the minor allele (i.e. the minor allele frequence, MAF) is sampled from a uniform distribution \(MAF_j \sim unif(0.01,\,0.49)\). MAFs are saved in MAFs.txt. Furthermore, the effect size of each causal SNP is sampled from a normal distribution \(\beta_j \sim N(0,\,\frac{h^2}{q})\), where \(h^2\) is the heritability of trait, and \(q\) is the number of causal SNPs. The non-causal SNPs will have an effect size of 0. The effect sizes are saved in beta.txt.
When we expand the simulation to include family members, the parents’ genotypes are simulated in the aforementioned manner, but to get the target’s genotypes as well as the siblings’ genotypes, we combine the parents’ SNPs through Punnett squares.

Liability threshold model

The liability threshold model assigns a phenotype status, case or control, to each person based on their full liability \(\epsilon\), which can be divided into two parts, namely the genetic liability, \(\epsilon_{g}\), and the environmental liability, \(\epsilon_{e}\). As the SNPs are simulated independently, no linkage disequilibrium will be taken into consideration. The genetic liability is assumed to have a linear relationship with the number of minor alleles, meaning that the effect size of a given SNP is constant. The genetic liability is therefore calculated by multiplying the SNPs by their effect size and summing them \(\epsilon_g = X_i \beta\), where \(X_i\) is the i’th person’s SNPs and \(\beta\) is the effect sizes. It is assumed that \(\epsilon_g \sim N(0,\,h^2)\), where \(h^2\) is the heritability. The liability from the environment is sampled and is assumed to follow a normal distribution, \(\epsilon_e \sim N(0,\,1-h^2)\), which makes the full liability \(\epsilon_g+\epsilon_e=\epsilon \sim N(0,\,1)\). The liability theshold model assigns a case to an individual if \(\epsilon\geq T\) and control otherwise. Here, \(T\) is the threshold of disease liability, and is chosen such that \(P(\epsilon\geq T) = k\), where \(k\) is the prevalence of trait, e.g \(5\%\).

LT-FH model

The LT-FH method relies on the liability threshold model, but expands it to be able to take case/control status and/or family history into consideration. Specifically, this is done by conditioning the genetic liability on phenotype status of the family when calculating posterior mean genetic liabilities for each target person. The advantage of doing this is that we incorporate prior knowledge about the family history to make a accurate estimate of the true genetic liability. In addition, it leads to a non-binary representation of an otherwise binary phenotype status. This is important as, it allows us to distinguish between having a liability just above and having a liability far greater than the threshold \(T\).

To model a family and their liabilities, it is assumed that they follow a multivariate normal distribution, with, for two family members, the \(cov(\epsilon_1, \epsilon_2) = K_{12}h^2\), where \(K_{12}\) is the coefficient of the relationship, for example \(1/2\) for a parent-offspring pair. Therefore, we have \[ \begin{pmatrix} \epsilon_{o,e} \\ \epsilon_{o,g} \\ \epsilon_{p1} \\ \epsilon_{p2} \\ \epsilon_{s} \end{pmatrix} \sim MVN_5 \begin{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},\, \begin{pmatrix} 1-h^2 & 0 & 0 & 0 & 0 \\ 0 & h^2 & 0.5h^2 & 0.5h^2 & 0.5h^2 \\ 0 & 0.5h^2 & 1 & 0 & 0.5h^2 \\ 0 & 0.5h^2 & 0 & 1 & 0.5h^2 \\ 0 & 0.5h^2 & 0.5h^2 & 0.5h^2 & 1 \end{pmatrix} \end{pmatrix}, \] where \(\epsilon_{o,e}\) and \(\epsilon_{o,g}\) is the environmental and genetic liability of the target, \(\epsilon_{p1}\) and \(\epsilon_{p2}\) are the full liabilities of the parents, and \(\epsilon_s\) is the full liability of the sibling. In this distribution, we only have one sibling, but it can be expanded to an arbitrary number of siblings.

To be able to include the case/control status for the family, each target individual is assigned a configuration. The configuration is created based on the case/control status of the target, the parents and siblings, such that each individual with the same family history is in the same configuration. We use the configuration to create the posterior mean genetic liability \(E(\epsilon_{o,g}|z_o,z_{p1},z_{p2},\overrightarrow{z_s})\), where \((z_o)\) is the case-control status of the target individual, \((z_{p1}, z_{p2})\) is case-control of the parents, and \((\overrightarrow{z_s})\) is the case-control status of the siblings.
The case/control status of the family members informs us about what range the full liabilities lie in, since they have to be greater than \(T\) if case and less than \(T\) if control. The multivariate normal distribution above, does not include the target’s own full liability, which will be affected by the targets own case/control status. We know from the assumptions in the liability threshold model that the targets full liability \(\epsilon_{o,f} = \epsilon_{o,g}+\epsilon_{o,e}\). Including this changes the distribution, such that we instead have \[ \begin{pmatrix} \epsilon_{o,g} \\ \epsilon_{o,f} \\ \epsilon_{p1} \\ \epsilon_{p2} \\ \epsilon_{s} \end{pmatrix} \sim MVN_5\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} h^2 & h^2 & 0.5h^2 & 0.5h^2 & 0.5h^2 \\ h^2 & 1 & 0.5h^2 & 0.5h^2 & 0.5h^2 \\ 0.5h^2 & 0.5h^2 & 1 & 0 & 0.5h^2\\ 0.5h^2 & 0.5h^2 & 0 & 1 & 0.5h^2\\ 0.5h^2 & 0.5h^2 & 0.5h^2 & 0.5h^2 & 1 \end{pmatrix}\right). \] This is again with only one sibling, but can be expanded to an arbitrary number of siblings. Based on this distribution the marginal distribution of a liability conditioned on the other liabilities can be found as shown in the example below. The helper function covmatrix() can be used to create theoretical covariance matrices with a specified number of siblings and value of \(h^2\). Furthermore, calculate_cov() can be used to estimate the covariance matrix of simulated data. The covariance matrices generated by these functions follow the same order as described in this vignette.

Example

For a family with no siblings we have:

\[ \epsilon_{o,g}|\epsilon_{o,f},\epsilon_{p1},\epsilon_{p2} \sim N\left(\frac{(h^2-\frac{1}{2}h^4)\cdot \epsilon_{o,f} + (\frac{1}{2}h^2 - \frac{1}{2}h^4)\cdot (\epsilon_{p1}+\epsilon_{p2})}{1-\frac{1}{2}h^4}, \,\frac{h^2 - \frac{3}{2}h^4 + \frac{1}{2}h^6}{1-\frac{1}{2}h^4}\right). \] By including the knowledge about the configuration, we can sample the genetic liability from a normal distribution when the full liabilities have values that correspond to having the correct case/control status. For a more detailed explanation of this procedure, see vignette("gibbs").

Implementing configurations

The configuration that each individual is assigned to, is based on the phenotype status of the individual itself and its family. The function assign_ltfh_phenotype() appends the following columns to phenotypes.txt:

  • conf: Configuration of a family
  • conf_class: Equivalence class of a configuration
  • ltfh_pheno: Estimated posterior mean genetic liability for the given equivalence class

The column conf represents the configuration where each number is the phenotype status of a family member. The numbers are written in the following order: own, parent 1, parent 2, sibling 1, …, sibling n. Each number is “1” if control, and “2” if case. E.g., a configuration of “1212”, means that the individual itself is control, while parent 1 is case, parent 2 is control and sibling 1 is case. This is just with one sibling, but can be expanded to an arbitrary number of siblings.
Different configurations can be equivalent from a theoretical standpoint, since it does not matter which of the parents is a case, or which of the siblings is a case. We therefore use the column conf_class to group configurations. This column consists of four numbers separated by “x”. The first is the individual’s own case-control status, the second is the sum of the parents’ case-control status, the third is the number of siblings and the fourth is the sum of the siblings case-control status. For instance, the configurations “12121” and “11212” will both be assigned the configuration class “1x3x2x3”.
The column ltfh_pheno contains posterior mean genetic liabilities of an individual. Individuals with the same configuration class are assigned the same posterior mean genetic liability.