Setting up Conda
One of the easiest ways to set up your environment that works across platforms is to use conda. Conda creates 'virtual environments' that don't break the rest of your system, and uses a comprehensive package manager. It has a dedicated bioconda channel that makes it easy to install software for biomedical research.
Installing conda
The recommended way is to install miniforge, which is a minimal environment that lets you install the packages you want.
Warning
Due to licensing changes, we now recommend using miniforge, not miniconda.
To get miniforge, visit the [releases page](https://conda-forge.org/miniforge/#latest-release) and choose the appropriate version:
- If you are on Mac OS X you almost certainly want the arm64 version. (The exception is if you are on an older Mac with an Intel processor, in which case choose the x86_64 version.)
- If you are on Linux, download the Linux x86_64 version.
- If you are on Windows, download the Linux x86_64 version anyway. This is because we will install it into the Windows Subsystem for Linux (WSL).
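If you are unsure which case applies to you, the terminal can report it. The following is a minimal sketch that applies the guidance above to the output of uname; the function name pick_installer is made up for this illustration and is not part of conda or miniforge.

```shell
# Suggest which installer to download, following the guidance above.
# Arguments default to this machine's values; pass them explicitly to try other cases.
# (pick_installer is a hypothetical helper name used only in this tutorial.)
pick_installer() {
    os="${1:-$(uname -s)}"      # operating system, e.g. Linux or Darwin (Mac OS X)
    arch="${2:-$(uname -m)}"    # processor type, e.g. x86_64 or arm64
    case "$os" in
        Darwin)
            if [ "$arch" = "arm64" ]; then
                echo "macOS arm64"
            else
                echo "macOS x86_64"
            fi
            ;;
        Linux)
            echo "Linux $arch"
            ;;
        *)
            echo "unknown"
            ;;
    esac
}

pick_installer
```

Inside the Windows Subsystem for Linux, uname reports Linux, which is why Windows users pick the Linux installer.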
The installer is a bash (.sh) file, which we'll have to run in the terminal. It will have been downloaded into your 'Downloads' folder, so let's change directory there and make sure it has downloaded:
- on Mac OS X:
$ cd Downloads
- on Windows:
$ cd /mnt/c/Users/<username>/Downloads
- on Linux: probably
$ cd Downloads
You can check what's there by running ls Mini* - you should see a file of the form Miniforge3-<version>-<platform>.sh.
Note
Because this is an installer downloaded from the internet, you should check it's the real thing before installing it. Run sha256sum <miniforge filename> (Linux or Ubuntu for Windows) or shasum -a 256 <miniforge filename> (Mac OS X) and compare the output to the SHA256 hash published for that file. If it's different, don't install!
See this page for more information.
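Comparing two long hashes by eye is error-prone, so you may prefer to let the shell do it. This is a sketch only: check_sha256 is a hypothetical helper name, and the expected hash is something you copy from the download page yourself.

```shell
# Compare a file's SHA256 hash against an expected value.
# Uses sha256sum where available (Linux / Ubuntu for Windows),
# otherwise falls back to shasum -a 256 (Mac OS X).
# (check_sha256 is a hypothetical helper name used only in this tutorial.)
check_sha256() {
    file="$1"
    expected="$2"
    if command -v sha256sum >/dev/null 2>&1; then
        actual="$(sha256sum "$file" | awk '{print $1}')"
    else
        actual="$(shasum -a 256 "$file" | awk '{print $1}')"
    fi
    if [ "$actual" = "$expected" ]; then
        echo "OK: hashes match"
    else
        echo "MISMATCH: do not install"
        return 1
    fi
}
```

You would call it as check_sha256 Miniforge3-<version>-<platform>.sh <expected hash>, filling in the real filename and the published hash.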
To install, make sure you have used cd to move into the downloads folder as above, and then run the installer:
$ bash ./Miniforge3-<version>-<platform>.sh
(You should fill in the right filename, or use the <tab> key to auto-fill.)
You will be asked to accept the license and choose an install location. If in doubt, the defaults install to a folder called miniforge3 in your home directory, which is fine. Say 'yes' when asked if you want the installer to initialise conda.
Activating and deactivating conda
If you read the blurb the installer outputs, you'll see it says it is going to activate the conda environment by default on startup. You'll know whether conda is activated because the current conda 'environment' will show up in your command prompt. The default one is called 'base', so your prompt will look like this:
(base) <username>@<computer>:~$
If it doesn't look like this, open a new terminal window, or try activating it by typing conda activate now.
You can then also deactivate the environment (going back to normal) with the conda deactivate command:
$ conda deactivate
And you can reactivate it with - you guessed it!
$ conda activate
Using conda to install software
Conda makes installing stuff easy. The first thing it's good to have is a better (faster) version of conda itself, called mamba:
$ conda install mamba
Type 'y' and press <enter> to install.
(You might find that mamba is already installed, in which case you don't need to do anything here.)
Creating a new environment
Before installing anything else, let's use conda to create a new 'environment' to work in. Let's call it gms:
$ conda create --name gms
And then activate it with:
$ conda activate gms
Note
This is the downside of using conda: you have to remember what environment you're in at any one time.
The upside is that you get a flexible way to install bits of software that we'll need without affecting the rest of your system.
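If you lose track of which environment you are in, you can ask the shell instead of reading the prompt. When an environment is activated, conda sets the CONDA_DEFAULT_ENV variable to its name. The sketch below just reports that variable; show_conda_env is a hypothetical name chosen for this example.

```shell
# Report which conda environment (if any) is active in this terminal.
# conda sets CONDA_DEFAULT_ENV on activation and unsets it when no
# environment is active.
# (show_conda_env is a hypothetical helper name used only in this tutorial.)
show_conda_env() {
    if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        echo "active conda environment: $CONDA_DEFAULT_ENV"
    else
        echo "no conda environment is active"
    fi
}

show_conda_env
```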
Let's try using conda / mamba to install samtools, which is a workhorse tool for handling next-generation sequencing data, into our gms environment. While you can download the source code and compile it yourself, conda makes this easy. You'll want a fairly recent version, so let's get version 1.15 or later, which is available from the bioconda channel:
$ mamba install -c bioconda 'samtools>=1.15'
or
$ conda install -c bioconda 'samtools>=1.15'
You may have noticed we added -c bioconda in the above command. This is because the most up-to-date versions of samtools live in the bioconda channel (rather than in conda-forge). If you look at the output you'll see that this is getting htslib and samtools from bioconda, but also libdeflate from conda-forge (and possibly other packages). Go ahead and install. Running samtools now gives you some output:
$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.16.1 (using htslib 1.16)
Usage: samtools <command> [options]
...
Congratulations! You've just used conda to install some software into the gms environment.
Remember
Remember that the newly-installed software is only present in the gms environment.
If your prompt doesn't say (gms) at the start, the environment isn't activated and you won't be able to run the software. Try it now:
$ conda deactivate
$ samtools
command not found: samtools
To use this version of samtools, you must remember to activate the environment first:
$ conda activate gms
$ samtools
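A quick way to tell whether a program is currently reachable, without actually running it, is the shell's built-in command -v. The sketch below wraps it in a small function; check_tool is a hypothetical name used only for this illustration.

```shell
# Report whether a program can be found from the current terminal.
# `command -v` prints the location of a program, or fails if there is none.
# (check_tool is a hypothetical helper name used only in this tutorial.)
check_tool() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found (is the right environment activated?)"
    fi
}

check_tool samtools
```

Try it both before and after conda activate gms to see the difference.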
Question
Another useful program is bcftools. Can you install that into your gms environment as well?
Adding bioconda
For biomedical work you will want to use both bioconda and conda-forge a great deal. To avoid version issues it's therefore best to go ahead and set these channels up permanently. The bioconda page explains how to do this, namely by running:
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
This bit of configuration says: "search conda-forge first, then bioconda, for packages". This will help conda find up-to-date versions of the software we need.
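It may seem odd that conda-forge is searched first when it was added second. The reason is that --add puts each new channel at the front of the list. The following toy model (plain shell, not conda itself; add_channel is a made-up name) mimics that behaviour for the two commands above. It starts from an empty list, so it ignores any channels your installation already had configured.

```shell
# Toy model of how `conda config --add channels` builds the channel list:
# each new channel goes at the FRONT, so the last one added is searched first.
# (add_channel is a hypothetical helper; this does not touch conda's real config.)
channels=""
add_channel() {
    channels="$1${channels:+ $channels}"
}

add_channel bioconda
add_channel conda-forge
echo "$channels"   # prints: conda-forge bioconda
```

To see the real list on your machine, you can run conda config --show channels.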
How conda works - what even is an 'environment'?
UNIX figures out how to find programs and other things using so-called 'environment variables'. You can see them all using the env command:
$ env
When conda manages 'environments', all it is really doing is changing environment variables to point to its own copies of files.
For example, the HOME environment variable points at your home folder:
$ echo ${HOME}
/users/<username> (or similar)
Let's go there now and see what's there:
$ cd ${HOME}
$ ls
If you've followed the above, you should see that conda has created a directory called miniforge3 in there where it puts the things it installs. For example, the base environment's executable programs go in miniforge3/bin:
$ ls miniforge3/bin
If you look there you will see (among many other things) the mamba executable - because we just installed it.
Note
You won't see the samtools command, because we installed it into the gms environment. Instead, that has been placed in a folder specific to that environment:
$ ls miniforge3/envs/gms/bin/
If you install anything else into that environment, that's where it will go.
To make the environment work, when you activate conda it sets relevant environment variables to point to this gms folder.
In particular, when you conda activate gms, conda adds this bin directory to your PATH environment variable. PATH is what the terminal uses to know where to look for programs. You can see what's happened by printing out the PATH variable like this:
$ echo ${PATH}
You should see that the first entry is something like /users/<username>/miniforge3/envs/gms/bin (followed by some other paths). So if you type samtools, the first place the terminal looks for samtools is in that folder.
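PATH is a single long line of folders separated by colons, which can be hard to read. As a small sketch, the commands below pull out just the first entry and then print every entry on its own line; first_path_entry is a hypothetical helper name, not a standard command.

```shell
# Print only the first folder in a colon-separated list such as PATH.
# "${1%%:*}" removes everything from the first colon onwards.
# (first_path_entry is a hypothetical helper name used only in this tutorial.)
first_path_entry() {
    printf '%s\n' "${1%%:*}"
}

first_path_entry "$PATH"

# Print every PATH entry on its own line, in the order the terminal searches them.
echo "$PATH" | tr ':' '\n'
```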
If you deactivate the conda environment, PATH changes to remove that folder and samtools will no longer work:
$ conda deactivate
$ samtools
Command 'samtools' not found...
However, samtools is still there on your filesystem - as it happens, you can still run it by specifying its full path (here relative to your home folder, which we changed into above):
$ ./miniforge3/envs/gms/bin/samtools
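You can reproduce this whole mechanism by hand, without conda, to convince yourself there is nothing more to it. The sketch below is an illustration under stated assumptions, not what conda literally runs: it creates a throwaway folder containing a tiny program called hello_env (a made-up name), then 'activates' and 'deactivates' it by editing PATH.

```shell
# Make a throwaway folder containing a tiny program called hello_env
# (a hypothetical name used only for this demonstration).
demo_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from the demo folder"\n' > "$demo_dir/hello_env"
chmod +x "$demo_dir/hello_env"

# Before: the terminal cannot find it.
command -v hello_env || echo "hello_env not found"

# "Activate": put the folder at the front of PATH,
# much as conda does with envs/<name>/bin.
old_path="$PATH"
PATH="$demo_dir:$PATH"
command -v hello_env      # now prints the program's location inside the demo folder
hello_env                 # and the program runs by name

# "Deactivate": restore PATH. The file still exists but is no longer found.
PATH="$old_path"
command -v hello_env || echo "hello_env not found"

# It can still be run by giving its full path.
"$demo_dir/hello_env"
```

When you have finished experimenting, you can delete the folder with rm -r "$demo_dir".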
Summary
In other words, conda isn't doing anything magical here: it's just managing your environment variables for you. In particular, it is installing programs into a specific folder and making sure the PATH variable points at the right folder. This is basically how 'environments' work: they are systems of environment variables that tell the UNIX shell where to look for things.