Getting setup

Loading libraries

We'll begin by loading a series of R libraries containing useful functions for data visualisation. If you have not yet installed these libraries, you can do so by running the following code:

install.packages("tidyverse")
install.packages("matrixStats")
install.packages("reshape2")
install.packages("pheatmap")
install.packages("ggpubr")
install.packages("ggrepel")
install.packages("hexbin")

Let's now load the libraries.

library(tidyverse)
library(matrixStats)
library(reshape2)
library(pheatmap)
library(ggpubr)
library(ggrepel)
library(hexbin)

Loading data

We will now load the GTEx gene expression table, as well as a table containing annotations for each of the genes measured in the study. The data can be found in this folder - download both files now.

Note

We recommend working in a new folder for this - for example by running this in your terminal:

mkdir rna-seq_practical
cd rna-seq_practical
curl -O https://www.chg.ox.ac.uk/~gav/projects/chg-training-resources/data/data_visualisation/visualising_rna-seq_data/GTEx-v8-RNAseq_mean-TPM_QCed.tsv
curl -O https://www.chg.ox.ac.uk/~gav/projects/chg-training-resources/data/data_visualisation/visualising_rna-seq_data/GTEx_gene-annotations.tsv

This might take a minute or so to download.

Now make sure your R session is working in the same directory using setwd() or the Session -> Set Working Directory menu option in RStudio.

The data files are in tab-separated value (TSV) format, which means they are composed of columns, separated by tabular spaces. You can read this type of file into R by using the read.table() function, as specified below:

library( tidyverse )
gtex <- read_tsv( "GTEx-v8-RNAseq_mean-TPM_QCed.tsv" )
gene_annotations <- read_tsv( "GTEx_gene-annotations.tsv" )

Checking the expression data

To check this worked, let's have a closer look at the main expression table by previewing the first 10 rows and columns. Note how each row corresponds to a gene and each column corresponds to a tissue. Entries in the table contain the expression levels of each gene in each tissue in TPM units.

print(gtex)

You should see something like this:

# A tibble: 14,240 × 55
   gene_name Adipose_Subcutaneous Adipose_Visceral_Omentum Adrenal_Gland Artery_Aorta Artery_Coronary Artery_Tibial Bladder Brain_Amygdala
   <chr>                    <dbl>                    <dbl>         <dbl>        <dbl>           <dbl>         <dbl>   <dbl>          <dbl>
 1 LINC01128                7.44                     7.80           4.03        8.85            8.72         11.1      7.62          2.98 
 2 LINC00115                4.49                     3.42           2.75        6.23            5.94          7.82     7.57          1.45 
 3 SAMD11                   0.476                    2.93           3.78        5.72            3.10          7.93    19.1           0.533
 4 NOC2L                   52.9                     48.2           37.2        55.5            51.9          62.6     71.9          18.3  
 5 KLHL17                   9.68                     8.97           7.89       13.9            13.4          10.7     17.6           3.05 
 6 PLEKHN1                  0.560                    0.648          1.93        0.744           0.868         0.718    4.22          0.107
 7 HES4                    21.8                     24.2            2.91       46.6            64.6          98.4     33.1          16.7  
 8 ISG15                   36.6                     35.0           13.5        50.8            43.7          50.4     33.0          43.6  
 9 AGRN                    18.5                     21.5            4.78       25.3            25.8          22.7     20.0          18.8  
10 C1orf159                 9.30                     6.11           5.08        7.60            8.15          7.57    10.3           5.51 
# ℹ 14,230 more rows
# ℹ 46 more variables: Brain_Anterior_cingulate_cortex_BA24 <dbl>, Brain_Caudate_basal_ganglia <dbl>, Brain_Cerebellar_Hemisphere <dbl>,
#   Brain_Cerebellum <dbl>, Brain_Cortex <dbl>, Brain_Frontal_Cortex_BA9 <dbl>, Brain_Hippocampus <dbl>, Brain_Hypothalamus <dbl>,
#   Brain_Nucleus_accumbens_basal_ganglia <dbl>, Brain_Putamen_basal_ganglia <dbl>, Brain_Spinal_cord_cervical_c1 <dbl>,
#   Brain_Substantia_nigra <dbl>, Breast_Mammary_Tissue <dbl>, Cells_Cultured_fibroblasts <dbl>, Cells_EBV_transformed_lymphocytes <dbl>,
#   Cervix_Ectocervix <dbl>, Cervix_Endocervix <dbl>, Colon_Sigmoid <dbl>, Colon_Transverse <dbl>, Esophagus_Gastroesophageal_Junction <dbl>,
#   Esophagus_Mucosa <dbl>, Esophagus_Muscularis <dbl>, Fallopian_Tube <dbl>, Heart_Atrial_Appendage <dbl>, Heart_Left_Ventricle <dbl>, …
# ℹ Use `print(n = ...)` to see more rows

Note

You've explored tibbles before. How many rows and columns does the data have?

The data is essentially a big table of expression values. To work with it more easily, let's convert to a matrix:

gene_names = gtex$gene_name
gtex = as.matrix( gtex[,2:ncol(gtex)] )
row.names( gtex ) = gene_names

And print it again:

gtex[1:10,1:10]

          Adipose_Subcutaneous Adipose_Visceral_Omentum Adrenal_Gland Artery_Aorta Artery_Coronary Artery_Tibial  Bladder Brain_Amygdala
LINC01128             7.441180                 7.797220       4.03224     8.845420        8.720360     11.097400  7.62418       2.977490
LINC00115             4.490250                 3.424820       2.75144     6.225130        5.935010      7.817070  7.57255       1.453660
SAMD11                0.476230                 2.931190       3.78499     5.717060        3.104580      7.931910 19.05830       0.532897
NOC2L                52.861400                48.187700      37.17370    55.491600       51.901500     62.564000 71.86850      18.344900
KLHL17                9.678200                 8.967750       7.89155    13.917400       13.392400     10.740400 17.60890       3.048490
PLEKHN1               0.559729                 0.648233       1.92579     0.744398        0.868276      0.717782  4.22114       0.107346
HES4                 21.789300                24.187600       2.91384    46.628900       64.610400     98.388500 33.05170      16.737000
ISG15                36.599900                35.009300      13.45140    50.760100       43.741600     50.375500 32.98460      43.609600
AGRN                 18.538800                21.475700       4.78231    25.299100       25.774100     22.740600 20.00260      18.847400
C1orf159              9.295820                 6.112750       5.07871     7.598480        8.150340      7.573130 10.28930       5.512330
          Brain_Anterior_cingulate_cortex_BA24 Brain_Caudate_basal_ganglia
LINC01128                             4.992650                    3.862590
LINC00115                             1.423120                    2.043400
SAMD11                                0.466223                    2.217410
NOC2L                                27.560400                   25.096000
KLHL17                                5.058480                    4.410430
PLEKHN1                               0.100712                    0.121139
HES4                                 20.636600                   31.458800
ISG15                                25.535200                   27.367600
AGRN                                 22.643900                   19.735000
C1orf159                              6.878150                    6.623100

Checking the gene annotations

Let's also look at the first rows in the gene annotations table. You'll notice that this table contains information pertaining each gene, such as its name, unique ID, genomic location (chromosome, start position, and end position), and its gene family.

print(gene_annotations)

# A tibble: 14,240 × 7
   gene_name gene_id         gene_symbol chr     start     end biotype       
   <chr>     <chr>           <chr>       <chr>   <dbl>   <dbl> <chr>         
 1 LINC01128 ENSG00000228794 LINC01128   1      825138  859446 lncRNA        
 2 LINC00115 ENSG00000225880 LINC00115   1      826206  827522 lncRNA        
 3 SAMD11    ENSG00000187634 SAMD11      1      923923  944575 protein_coding
 4 NOC2L     ENSG00000188976 NOC2L       1      944203  959309 protein_coding
 5 KLHL17    ENSG00000187961 KLHL17      1      960584  965719 protein_coding
 6 PLEKHN1   ENSG00000187583 PLEKHN1     1      966482  975865 protein_coding
 7 HES4      ENSG00000188290 HES4        1      998962 1000172 protein_coding
 8 ISG15     ENSG00000187608 ISG15       1     1001138 1014540 protein_coding
 9 AGRN      ENSG00000188157 AGRN        1     1020120 1056118 protein_coding
10 C1orf159  ENSG00000131591 C1orf159    1     1081818 1116361 protein_coding
# ℹ 14,230 more rows
# ℹ Use `print(n = ...)` to see more rows

You are now ready to start exploring this data set.

Getting setup

Loading libraries​

Loading data​

Note

Checking the expression data​

Note

Checking the gene annotations​

Loading libraries

Loading data

Checking the expression data

Checking the gene annotations