Simulate genetic data, including genotypes, phenotype status and liabilities, for individuals with family history, where each individual is given a random number of siblings.
sim_random_family(n, m, q, hsq, k, sib_fert, dist = 0, path = "")
n | number of genotypes (individuals). |
---|---|
m | number of SNPs per genotype. |
q | number of causal SNPs, i.e. SNPs that affect the chances of having the phenotype. |
hsq | heritability parameter, i.e. h squared. |
k | prevalence of phenotype. |
sib_fert | either the distribution vector or a fertility rate. See details section. |
dist | if |
path | directory where the files will be stored. If nothing is specified, |
A list where the first entry is the number of individuals, i.e. the n parameter supplied to sim_varied_family(), and the second entry is the number of siblings those individuals have, i.e. the dist parameter supplied to sim_varied_family().
The list is returned once the simulation has finished writing the following five files to the directory given by the path parameter:
Three text files:
beta.txt - a file of m rows with one column. The i'th row is the true effect of the i'th SNP.
MAFs.txt - a file of m rows with one column. The i'th row is the true minor allele frequency (MAF) of the i'th SNP.
phenotypes.txt - a file of n rows; the number of columns depends on the number of siblings. The file contains the phenotype status and liability of each individual as well as the liabilities and phenotype status of their parents and siblings.
genotypes.map - a file created such that PLINK will work with the genotype data.
genotypes.ped - the simulated genotypes in a PLINK-readable format. Note: The function only saves genotype data for the target individual.
Parents' genotypes are simulated and used for creating the genotypes of the individuals and their siblings. For the methodology behind the simulation, see vignette("liability-distribution").
The number of siblings is generated by either a multinomial distribution or a Poisson distribution. The function then calls sim_varied_family() with the appropriate n and dist parameters.
The choice of distribution depends on the input given to sib_fert. If a single number is given, the function uses the Poisson distribution to randomly select how many siblings each individual has. In this case, dist should not be specified.
If a vector is given, the function uses sib_fert as the probabilities of a multinomial distribution to select how many individuals have the number of siblings specified in each entry of dist.
E.g., sib_fert = c(1/6, 1/6, 2/3) and dist = c(0, 2, 3) means that the probability of having 0 siblings is 1/6, the probability of having 2 siblings is 1/6 and the probability of having 3 siblings is 2/3.
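The two sampling modes can be sketched in base R. This is an illustration only, not the package's actual implementation: all variable names are hypothetical, and the Poisson branch assumes that a single-number sib_fert is used directly as the Poisson mean.

```r
set.seed(1)
n <- 1000  # number of individuals

# Vector case: sib_fert holds multinomial probabilities, one per entry of dist.
sib_fert <- c(1/6, 1/6, 2/3)
dist <- c(0, 2, 3)
# One multinomial draw splits the n individuals across the sibling counts in dist.
n_per_group <- as.vector(rmultinom(1, size = n, prob = sib_fert))
multinom_draw <- list(n = n_per_group, dist = dist)

# Single-number case (ASSUMPTION: the number is the Poisson mean).
fert_rate <- 1.5
sib_counts <- rpois(n, lambda = fert_rate)
# Tabulate how many individuals received each observed sibling count.
tab <- table(sib_counts)
pois_draw <- list(n = as.vector(tab), dist = as.integer(names(tab)))
```

Both results have the same shape as the documented return value: a vector of group sizes paired with the sibling count of each group.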
Since individuals have different numbers of siblings, some entries in phenotypes.txt will be missing, denoted by -9.
E.g., an individual with 1 sibling in a dataset where the maximum number of siblings is 3 would have -9 in all columns relating to sibling 2 and sibling 3.
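When working with the output in R, the -9 codes usually need to be recoded as NA before any analysis. A minimal sketch on a toy data frame follows; the column names are hypothetical and do not reflect the actual layout of phenotypes.txt.

```r
# Toy stand-in for phenotypes.txt; the column names are hypothetical.
pheno <- data.frame(
  pheno      = c(1, 0, 0),
  sib1_pheno = c(0, 1, -9),
  sib2_pheno = c(-9, 0, -9)
)

# Recode the -9 sentinel as NA so it is not mistaken for a real value.
pheno[pheno == -9] <- NA

# Number of siblings per individual, counted from the non-missing sibling columns.
n_sibs <- rowSums(!is.na(pheno[, c("sib1_pheno", "sib2_pheno")]))
```

Recoding first keeps summaries such as means or counts from silently treating -9 as an observed phenotype or liability.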
sim_random_family makes use of parallel computation in order to decrease the running time. As one CPU core is left unused, the user should be able to do other work while the simulation is running.
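A worker count that leaves one core free might be computed as in the sketch below. It uses the base parallel package and is an assumption about the general approach, not a copy of what sim_random_family does internally; n_workers is a hypothetical name.

```r
library(parallel)

n_cores <- detectCores()            # may return NA on some platforms
if (is.na(n_cores)) n_cores <- 1L   # conservative fallback
# Leave one core unused, but never go below a single worker.
n_workers <- max(1L, n_cores - 1L)
```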
Simulating large datasets takes time and generates large files. For details on time complexity and required disk space, see vignette("sim-benchmarks").
The largest file generated is genotypes.ped. See convert_geno_file() to convert it to another file format, thereby reducing its size significantly.