'Oxford' file formats
.gen file
The GEN file format is designed to hold either directly-typed or imputed genotype data. To handle this it stores probabilities of each genotype for each sample, rather than a hard-called genotype.
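As a rough illustration of how one probability triple relates to a hard call, the sketch below computes the expected count of the B allele and makes a thresholded call. The function names and the 0.9 threshold are assumptions chosen for this example; they are not part of the format.

```python
def expected_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles implied by one probability triple."""
    # AA contributes 0 copies of B, AB contributes 1, BB contributes 2.
    return p_ab + 2.0 * p_bb


def hard_call(p_aa, p_ab, p_bb, threshold=0.9):
    """Return 'AA', 'AB' or 'BB' if the most likely genotype reaches
    the threshold, otherwise None (no confident call).

    The default threshold of 0.9 is an illustrative assumption.
    """
    labels = ("AA", "AB", "BB")
    probs = (p_aa, p_ab, p_bb)
    best = max(range(3), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None
```

A directly typed genotype would typically appear as a triple such as `1 0 0`, while an imputed one may spread its probability across the three genotypes, in which case `hard_call` can return `None`.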
The file consists of one line per SNP, with the following columns:
column | description |
---|---|
chromosome | The chromosome identifier. Note: this column is optional; not all GEN files have it. |
SNPID | An ID for the SNP |
rsid | A second ID for the SNP, typically its rsID |
position | Base-pair position of the SNP |
Allele A | The first allele in the data |
Allele B | The second allele in the data |
The remaining columns come in sets of three, giving the probabilities of genotypes AA, AB, and BB for each sample.
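The column layout above can be sketched as a small parser. This is a minimal sketch, assuming whitespace-separated fields; `parse_gen_line` is a hypothetical helper, not part of any tool. Because the probabilities come in triples, the total field count shows whether the optional chromosome column is present: 5 + 3N fields without it, 6 + 3N with it.

```python
def parse_gen_line(line):
    """Parse one GEN line into a dict (illustrative sketch).

    Assumes whitespace-separated fields. The optional chromosome
    column is detected from the field count: 5 + 3N fields means no
    chromosome column, 6 + 3N means it is present.
    """
    fields = line.split()
    if len(fields) >= 5 and (len(fields) - 5) % 3 == 0:
        chromosome, lead, rest = None, fields[:5], fields[5:]
    elif len(fields) >= 6 and (len(fields) - 6) % 3 == 0:
        chromosome, lead, rest = fields[0], fields[1:6], fields[6:]
    else:
        raise ValueError("field count does not fit the GEN layout")

    snp_id, rsid, position, allele_a, allele_b = lead
    values = [float(x) for x in rest]
    # Group the flat list into (P(AA), P(AB), P(BB)) per sample.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "chromosome": chromosome,
        "snp_id": snp_id,
        "rsid": rsid,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probabilities": triples,
    }
```

For example, `parse_gen_line("01 SNP1 rs1 1000 A G 1 0 0 0 0.5 0.5")` would describe two samples on chromosome `01`; the SNP identifiers and values in that line are made up for illustration.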
Note: see also the BGEN format, which is widely used for imputed genotype data.
.sample file
The sample file has three parts:
- a header line giving the names of the columns in the file
- a line giving the type of variable stored in each column: '0' (identifier), 'D' (discrete variable), 'B' (binary phenotype), or 'C' or 'P' (continuous variable)
- one line per individual giving the values for that individual
Here's an example of a .sample file:
ID pheno sex cov1 cov2
0 B D D C
sample_1 1 male british 2.5
sample_2 0 female british 12.2
sample_3 0 female french 12.2
...
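The three-part structure can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated fields; `parse_sample_file` is a hypothetical helper, and treating `NA` as a missing value is an assumption not stated above.

```python
def parse_sample_file(text):
    """Parse .sample file content into (names, types, rows).

    Sketch only: assumes whitespace-separated fields and treats the
    string 'NA' as a missing value (an assumption).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()   # part 1: column names
    types = lines[1].split()   # part 2: column types
    if len(names) != len(types):
        raise ValueError("header and type lines differ in length")

    rows = []
    for ln in lines[2:]:       # part 3: one line per individual
        values = ln.split()
        if len(values) != len(names):
            raise ValueError("row does not match the header: " + ln)
        row = {}
        for name, kind, value in zip(names, types, values):
            if value == "NA":
                row[name] = None
            elif kind in ("C", "P"):
                row[name] = float(value)   # continuous variables
            else:
                row[name] = value          # '0', 'D', 'B' kept as text
        rows.append(row)
    return names, types, rows


example = """ID pheno sex cov1 cov2
0 B D D C
sample_1 1 male british 2.5
sample_2 0 female british 12.2
"""
names, types, rows = parse_sample_file(example)
```

Keeping the type line alongside the names lets downstream code decide how to treat each column, for instance encoding 'D' columns as categories.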
.info file
This file has a single header line followed by one line per SNP. It always contains the following columns (header tags shown in parentheses):
- SNP identifier from -g file (snp_id)
- rsID (rs_id)
- base pair position (position)
- expected frequency of allele coded '1' in the -o file (exp_freq_a1)
- measure of the observed statistical information associated with the allele frequency estimate (info)
- average certainty of best-guess genotypes (certainty)
- internal "type" assigned to SNP (type)
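A file with these columns could be loaded and filtered along the following lines. This is a minimal sketch, assuming whitespace-separated fields and the header tags listed above; `parse_info` is a hypothetical helper, and any cutoff applied to `info` is the user's own choice rather than part of the format.

```python
def parse_info(text):
    """Parse .info file content into a list of dicts keyed by header tag.

    Sketch only: assumes whitespace-separated fields. The position
    column is converted to int and the three numeric summary columns
    to float; everything else is kept as text.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    float_columns = {"exp_freq_a1", "info", "certainty"}

    records = []
    for ln in lines[1:]:
        values = ln.split()
        if len(values) != len(header):
            raise ValueError("row does not match the header: " + ln)
        record = {}
        for tag, value in zip(header, values):
            if tag == "position":
                record[tag] = int(value)
            elif tag in float_columns:
                record[tag] = float(value)
            else:
                record[tag] = value
        records.append(record)
    return records
```

A typical use would be something like `[r for r in parse_info(text) if r["info"] >= cutoff]`, where `cutoff` is whatever threshold the analysis calls for.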