Aside: excluding related samples
Note. This section is optional. We won't actually use it in the association test for this tutorial.
In most GWAS studies, the samples are supposed to be collected without any close relationships - so finding closely related samples could be a sign of sampling biases. Closely related samples might also share phenotypes for non-genetic reasons (they might be more likely to share environmental exposures than unrelated people, for example if they live in the same families). For both these reasons it is normal to try to remove close relationships from the data before running a GWAS.
Note
As we are using data simulated from a small region of the genome in this practical, it is not possible to calculate realistic relatedness between individuals in these data. However, as an example we'll show how this can be done below.
(You can compute relatedness in the samples from the case-control GWAS practical if you like. See the PCA practical for more on this.)
For our purposes we have provided code below for you to try this out, but we won't actually use this filter when carrying out the regression analyses in this tutorial.
Finding samples to exclude
plink does not directly allow us to filter samples by relatedness. Instead, we will write a small script to make a list of samples to exclude - and then use that file to filter them out in plink.
For most of this course we've used R to do our scripting - so let's try something else here and use python instead! Open a new file, called (say) exclude_high_ibd.py, and let's add some code.
First, add a function to load the missingness file into a python dict structure (which is a 'hash map', i.e. a mapping from keys to values):
def load_missingness( filename ):
	"""Load the .imiss file into a python dictionary (as a key to value mapping)"""
	result = {}
	with open(filename, "r") as missfile:
		# Skip the header line
		missfile.readline()
		for line in missfile:
			fid, iid, pheno_miss, num_miss, num_geno, fmiss = line.strip().split()
			# Python dicts can have keys that are tuples!
			# Here we use the pair (fid,iid) as the key
			result[(fid,iid)] = float(fmiss)
	return result
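If you want to sanity-check this function, one option is to try it in an interactive python session. Here's a quick sketch, assuming you have already generated output/AMR_genotypes.imiss (the missingness file we pass to the script below):

missingness = load_missingness( "output/AMR_genotypes.imiss" )
print( len( missingness ) )                  # the number of samples loaded
print( list( missingness.items() )[:3] )     # a few ((fid, iid), F_MISS) entries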
Next, add a function to read the .genome file and find high-IBD sample pairs:
def load_high_ibd_pairs( filename, cutoff = 0.185 ):
	"""Load the .genome file and return pairs of IDs with PI_HAT bigger than the cutoff"""
	result = set()
	with open(filename, "r") as ibdfile:
		# Skip the header line
		ibdfile.readline()
		for line in ibdfile:
			fid1, iid1, fid2, iid2, rt, ex, z0, z1, z2, pihat, phe, dst, ppc, ratio = line.strip().split()
			if float(pihat) > float(cutoff):
				result.add( ((fid1,iid1), (fid2,iid2)) )
	return result
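Similarly, you could have a look at what this returns - it's a set of pairs, where each element of a pair is itself a (fid, iid) tuple. A quick sketch, assuming you have already generated output/AMR_genotypes.genome (the IBD file we pass to the script below):

high_ibd = load_high_ibd_pairs( "output/AMR_genotypes.genome", cutoff = 0.185 )
print( len( high_ibd ) )             # how many high-IBD pairs were found
for sample1, sample2 in high_ibd:
	print( sample1, sample2 )        # each is a (fid, iid) tuple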
Finally, let's use these functions to exclude, for each high-IBD pair, whichever sample has the greater missingness:
import sys

# Read the filenames from the command line arguments
missingness_file = sys.argv[1]
ibd_file = sys.argv[2]
output_file = sys.argv[3]

print( "++ Loading missingness data from %s..." % missingness_file )
missingness = load_missingness( missingness_file )

print( "++ Loading high-IBD pairs from %s..." % ibd_file )
high_ibd = load_high_ibd_pairs( ibd_file, cutoff = 0.185 )

print( "++ Computing exclusions..." )
ids_to_exclude = set()
for pair in high_ibd:
	# Exclude the sample with greater missingness
	if missingness[ pair[0] ] > missingness[ pair[1] ]:
		ids_to_exclude.add( pair[0] )
	else:
		ids_to_exclude.add( pair[1] )

print( "++ Writing exclusions to %s..." % output_file )
with open( output_file, "w" ) as outfile:
	for fid, iid in ids_to_exclude:
		outfile.write( fid + "\t" + iid + "\n" )

print( "++ Thank you for using exclude_high_ibd.py" )
Put all the above code, with the functions, into a file called exclude_high_ibd.py now.
Warning
In python you have to be really careful about indentation.
Unlike R, javascript, C++, Rust, or Ruby, which all use {} braces - and unlike julia, which uses begin and end - python uses indentation to define blocks of code. If something is not working, it may well be that a line of code has the wrong indentation relative to those around it.
In particular, you must make sure you indent the above code consistently - using either tabs or spaces, but not both. Mixing the two will lead to all sorts of problems! (The snippets above use tabs.)
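For example, python 3 will usually refuse to run a block whose lines mix tabs and spaces. This little snippet (just an illustration - don't add it to your script) shows the kind of error you would see:

def broken_example():
	x = 1          # this line is indented with a tab
        return x       # this line is indented with spaces - python 3 stops with "TabError: inconsistent use of tabs and spaces in indentation"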
If you need a full version of this file, see here.
Removing the samples from the dataset
Now let's use that to generate a list of samples to exclude:
python exclude_high_ibd.py output/AMR_genotypes.imiss output/AMR_genotypes.genome output/AMR_genotypes.relatedness.remove
Note on python versions
On a Mac, you may find you need to use python3 to run python, instead of python.
You should now have an output file called output/AMR_genotypes.relatedness.remove containing the IDs of samples to remove.
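If you open this file you should see just two tab-separated columns - the family ID and the individual ID of each sample to exclude, which is the format plink's --remove option expects. For example (these IDs are made up, just to illustrate the layout):

FAM001	SAMPLE001
FAM002	SAMPLE007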
Now that we've generated a list of samples to remove, we can actually remove them like this:
./plink \
--bfile Genotype_data/AMR_genotypes.filtered \
--remove output/AMR_genotypes.relatedness.remove \
--make-bed \
--out Genotype_data/AMR_genotypes.filtered.ibd
Note on improving the algorithm
If you look at the above code, you'll see it is pretty simple. For every pair of samples with high IBD, it picks whichever of the two has the higher missingness, and writes that sample to the exclusion file.
It's actually easy to do better than this in many cases. A better algorithm finds the sample involved in the most high-IBD relationships and removes it, then repeats this process with the remaining samples until there aren't any high-IBD pairs left. This typically excludes fewer samples while still leaving no high-IBD pairs.
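A sketch of that greedy approach in python might look like this (the function name is just for illustration - it isn't part of the script above, and we don't use it in this tutorial):

def greedy_exclusions( high_ibd_pairs ):
	"""Repeatedly exclude the sample involved in the most high-IBD pairs,
	until no high-IBD pairs remain."""
	remaining = set( high_ibd_pairs )
	excluded = set()
	while remaining:
		# Count how many high-IBD pairs each sample is involved in
		counts = {}
		for sample1, sample2 in remaining:
			counts[sample1] = counts.get( sample1, 0 ) + 1
			counts[sample2] = counts.get( sample2, 0 ) + 1
		# Exclude the sample with the most relationships...
		worst = max( counts, key = counts.get )
		excluded.add( worst )
		# ...and drop all pairs involving it before repeating
		remaining = set( pair for pair in remaining if worst not in pair )
	return excluded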
This is of course important because the sample size is a key factor in determining the power of a GWAS! Remember, all else being equal, the regression standard error scales like $1/\sqrt{N}$, where $N$ is the sample size of the study.
We really don't want to lose samples, unless we have to!
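To put some numbers on that: under this scaling, excluding 10% of samples inflates the standard error by a factor of about $1/\sqrt{0.9} \approx 1.05$, while excluding half of them inflates it by $\sqrt{2} \approx 1.41$.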