Simulate genetic data, including genotypes, phenotype status and liabilities, for individuals and their families, where each individual has a specified number of siblings.
sim_varied_family(n, m, q, hsq, k, dist, path = "")
Argument | Description |
---|---|
n | number of genotypes (individuals), given as a vector of same length as |
m | number of SNPs per genotype. |
q | number of causal SNPs, i.e. SNPs that affect the chances of having the phenotype. |
hsq | squared heritability parameter. |
k | prevalence of phenotype. |
dist | the distribution of siblings. Given as a vector with the same length as |
path | directory where the files will be stored. If nothing is specified, |
Does not return a value, but writes the following five files to the directory given by the path parameter in the function call:
Three text files:
beta.txt - a file of m rows with one column. The i'th row is the true effect of the i'th SNP.
MAFs.txt - a file of m rows with one column. The i'th row is the true Minor Allele Frequency of the i'th SNP.
phenotypes.txt - a file of n rows; the number of columns depends on the number of siblings. The file contains the phenotype status and liability of each individual, as well as the liabilities and phenotype status of their parents and siblings.
genotypes.map - a file created such that PLINK will work with the genotype data.
genotypes.ped - the simulated genotypes in a PLINK-readable format. Note: The function only saves genotype data for the target individual.
Parents' genotypes are simulated and used for creating the genotypes of the individuals and their siblings. For the methodology behind the simulation, see vignette("liability-distribution").
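To make the idea of deriving a child's genotype from simulated parents concrete, here is a minimal base R sketch for a single family. It is not the package's code: the uniform MAF range, the normal effect sizes with variance hsq / q, the standardised genotypes, and the threshold qnorm(1 - k) are all assumptions of a generic liability threshold model, and the vignette remains the authoritative description.

```r
set.seed(1)

# Illustrative parameter values (assumptions, not package defaults)
m   <- 1000   # number of SNPs
q   <- 100    # number of causal SNPs
hsq <- 0.5    # heritability parameter
k   <- 0.05   # prevalence

# Assumed: minor allele frequencies drawn uniformly
maf <- runif(m, 0.01, 0.49)

# Parent genotypes coded as 0/1/2 copies of the minor allele
father <- rbinom(m, 2, maf)
mother <- rbinom(m, 2, maf)

# Each parent passes on one allele per SNP; a parent with g copies
# transmits the minor allele with probability g / 2
transmit <- function(parent) rbinom(length(parent), 1, parent / 2)
child <- transmit(father) + transmit(mother)

# Assumed effect sizes: q causal SNPs with N(0, hsq / q) effects, the rest zero
beta <- numeric(m)
causal <- sample(m, q)
beta[causal] <- rnorm(q, 0, sqrt(hsq / q))

# Genetic liability from standardised genotypes, plus environmental noise
z <- (child - 2 * maf) / sqrt(2 * maf * (1 - maf))
liability <- sum(z * beta) + rnorm(1, 0, sqrt(1 - hsq))

# Case status under the threshold model: liability above the upper k quantile
is_case <- liability > qnorm(1 - k)
```

A sibling would be generated by calling transmit() on the same two parents again, which is why siblings share about half of their alleles on average.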
Note: Each entry in dist denotes a number of siblings. Each entry in n then denotes how many individuals have the corresponding number of siblings. E.g., n = c(100, 200, 300, 400) and dist = c(0, 2, 3, 5) would give a total of 100 + 200 + 300 + 400 = 1000 individuals, where 100 individuals have 0 siblings, 200 have 2 siblings, and so on.
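The pairing of n and dist can be checked in base R before starting a long simulation. The sketch below uses the values from the note; the values chosen for m, q, hsq and k in the guarded call are purely illustrative, and the call only runs if sim_varied_family is available in the session.

```r
n    <- c(100, 200, 300, 400)  # individuals per group
dist <- c(0, 2, 3, 5)          # siblings per individual in each group

# n and dist are matched entry by entry, so they must have equal length
stopifnot(length(n) == length(dist))

# One row per group: how many individuals have how many siblings
design <- data.frame(siblings = dist, individuals = n)

total_individuals <- sum(n)     # 1000
max_siblings      <- max(dist)  # 5

# Guarded call with illustrative argument values (assumptions);
# path is left at its documented default
if (exists("sim_varied_family", mode = "function")) {
  sim_varied_family(n = n, m = 1000, q = 100, hsq = 0.5, k = 0.05, dist = dist)
}
```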
Since individuals have different numbers of siblings, some entries in phenotypes.txt will be missing, denoted by -9. E.g., an individual with 1 sibling in a dataset where the maximum number of siblings is 3 would have -9 in all columns relating to sibling 2 and sibling 3.
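When working with the output in R, it is usually convenient to turn the -9 code into NA. The sketch below uses a small hand-made data frame rather than the real file, because the column names shown here (status, sib1_status, and so on) are hypothetical and the actual layout of phenotypes.txt should be checked first.

```r
# Toy stand-in for phenotypes.txt; column names are hypothetical
pheno <- data.frame(
  status      = c(1, 0),
  sib1_status = c(0, 1),
  sib2_status = c(-9, 0),   # first individual has only 1 sibling
  sib3_status = c(-9, 1)
)

# Recode the missing-value marker -9 as NA
pheno[pheno == -9] <- NA

# Recover each individual's number of siblings from the non-missing columns
sib_cols <- c("sib1_status", "sib2_status", "sib3_status")
n_sibs <- rowSums(!is.na(pheno[, sib_cols]))
```

After recoding, functions such as mean(..., na.rm = TRUE) skip the padded sibling columns instead of treating -9 as data.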
sim_varied_family makes use of parallel computation in order to decrease the running time. As one CPU core is left unused, the user should be able to do other work while the simulation is running.
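The pattern of leaving one core free can be sketched with the parallel package that ships with R. How sim_varied_family actually sets up its workers is not shown in this page, so the chunking below is only an assumed illustration of the idea.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
cores <- detectCores()
if (is.na(cores)) cores <- 1L

# Use all cores but one, and never fewer than one worker
n_workers <- max(1L, cores - 1L)

# Assumed illustration: split 1000 individuals into one chunk per worker
ids <- seq_len(1000)
chunks <- split(ids, rep_len(seq_len(n_workers), length(ids)))
```

Each chunk could then be handed to a worker, for example with parLapply() on a cluster created by makeCluster(n_workers).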
Simulating large datasets takes time and generates large files. For details on time complexity and required disk space, see vignette("sim-benchmarks").
The largest file generated is genotypes.ped. See convert_geno_file() to convert it to another file format, thereby reducing its size significantly.
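For a rough feel of why genotypes.ped dominates disk usage, the following back-of-the-envelope estimate may help. It assumes the usual PLINK text layout, where each row holds a few leading columns followed by two allele codes per SNP, and it assumes each allele code is a single character followed by a separator. The helper ped_size_bytes and its lead_bytes default are hypothetical; the benchmark vignette has the measured numbers.

```r
# Rough size estimate for a PLINK .ped text file (assumptions in the lead-in):
# 2 alleles per SNP, each 1 character + 1 separator = 4 bytes per SNP per row
ped_size_bytes <- function(n_total, m, lead_bytes = 20) {
  n_total * (lead_bytes + 4 * m)
}

# Example: 1000 individuals and 1000 SNPs, reported in megabytes
size_mb <- ped_size_bytes(n_total = 1000, m = 1000) / 1e6
```

Under these assumptions the file grows with the product of individuals and SNPs, so scaling both by a factor of 10 makes the file roughly 100 times larger.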