Simulation with random family sizes — sim_random

Simulate genetic data, including genotypes, phenotype status and liabilities, for individuals with family history, where each individual is given a random number of siblings.

sim_random_family(n, m, q, hsq, k, sib_fert, dist = 0, path = "")

Arguments

n	number of genotypes (individuals).
m	number of SNPs per genotype.
q	number of causal SNPs, i.e. SNPs that affect the chances of having the phenotype.
hsq	heritability parameter, i.e. h squared.
k	prevalence of phenotype.
sib_fert	either a distribution vector or a fertility rate. See the Details section.
dist	if `sib_fert` is a distribution vector, then `dist` is used to specify how many siblings the probabilities in `sib_fert` correspond to.
path	directory where the files will be stored. If nothing is specified, `sim_random_family` writes its files in the current working directory.

Value

A list where the first entry is the number of individuals, i.e. the n parameter supplied to sim_varied_family(), and the second entry is the number of siblings those individuals have, i.e. the dist parameter supplied to sim_varied_family().
The list is returned once the simulation has finished writing the following five files to the directory given by the path parameter in the function call:

Three text files:
- beta.txt - a file of m rows with one column. The i'th row is the true effect of the i'th SNP.
- MAFs.txt - a file of m rows with one column. The i'th row is the true Minor Allele Frequency of the i'th SNP.
- phenotypes.txt - a file of n rows; the number of columns depends on the number of siblings. The file contains the phenotype status and liability of each individual as well as information on the liabilities and phenotype status of their parents and siblings.

Two PLINK files:
- genotypes.map - a file created such that PLINK will work with the genotype data.
- genotypes.ped - the simulated genotypes in a PLINK-readable format. Note: The function only saves genotype data for the target individual.

Details

Parents' genotypes are simulated and used for creating the genotypes of the individuals and their siblings. For the methodology behind the simulation, see vignette("liability-distribution").
The number of siblings is generated by either a multinomial distribution or a Poisson distribution. The function then calls sim_varied_family() with the appropriate n and dist parameters.
The choice of distribution depends on the input given to sib_fert. If a single number is given, the function uses the Poisson distribution to randomly select how many siblings the different individuals have. In this case, dist should not be specified.
If a vector is given, then the function uses sib_fert as probabilities of a multinomial distribution to select how many individuals have the number of siblings specified in each entry of dist.
E.g., sib_fert = c(1/6, 1/6, 2/3) and dist = c(0, 2, 3) means that the probability of having 0 siblings is 1/6, the probability of having 2 siblings is 1/6 and the probability of having 3 siblings is 2/3.
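The two ways of choosing sibling counts can be sketched in base R as follows. This is only an illustration under assumed values, not the package's internal code: the variable names (counts_multinom, fert, sib_counts, dist_pois, counts_pois) are hypothetical, and using the fertility rate directly as the Poisson mean is an assumption.

```r
set.seed(1)
n <- 1000  # number of individuals

# Case 1: sib_fert is a probability vector and dist holds the sibling counts.
sib_fert <- c(1/6, 1/6, 2/3)
dist     <- c(0, 2, 3)
# One multinomial draw splits the n individuals across the entries of dist.
counts_multinom <- as.vector(rmultinom(1, size = n, prob = sib_fert))
names(counts_multinom) <- dist

# Case 2: sib_fert is a single number.
# ASSUMPTION: the fertility rate is used directly as the Poisson mean.
fert <- 1.5
sib_counts <- rpois(n, lambda = fert)
# Tabulate into the same "how many individuals have k siblings" form.
tab         <- table(sib_counts)
dist_pois   <- as.integer(names(tab))
counts_pois <- as.vector(tab)
```

In both cases the result is a set of sibling counts together with the number of individuals assigned to each, which matches the n and dist inputs that sim_varied_family() is described as receiving.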
Since individuals have a different number of siblings, some entries in phenotypes.txt will be missing, denoted by -9.
E.g., an individual with 1 sibling in a dataset where the maximum number of siblings is 3 would have -9 in all columns relating to sibling 2 and sibling 3.
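The padding described above can be illustrated with a small base R sketch. The column names and the example status value are hypothetical and are not taken from the actual layout of phenotypes.txt; only the use of -9 as the missing-value code comes from this page.

```r
max_sibs   <- 3      # largest number of siblings in the dataset
n_sibs     <- 1      # this individual has one sibling
sib_status <- c(1)   # hypothetical phenotype status of that sibling

# Pad the sibling columns out to max_sibs with the missing-value code -9.
padded <- c(sib_status, rep(-9, max_sibs - n_sibs))
names(padded) <- paste0("sib", seq_len(max_sibs), "_status")  # hypothetical names

# For downstream analysis, -9 can be recoded to NA.
padded_na <- replace(padded, padded == -9, NA)
```

Recoding to NA before computing summaries avoids treating -9 as a real status or liability value.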
sim_random_family makes use of parallel computation in order to decrease the running time. As one CPU core is left unused, the user should be able to do other work while the simulation is running.
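A common way to leave one core free, shown here as a sketch with the base parallel package, is to detect the number of cores and subtract one. Whether sim_random_family computes its worker count exactly like this is an assumption; the variable names are hypothetical.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to one worker.
n_cores <- detectCores()
workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

# workers could then be passed to, e.g., makeCluster(workers).
```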

Warning

Simulating large datasets takes time and generates large files. For details on time complexity and required disk space, see vignette("sim-benchmarks").
The largest file generated is genotypes.ped. See convert_geno_file() to convert it to another file format, thereby reducing its size significantly.