Bioinformatics applications

From RCSWiki
Revision as of 21:05, 11 June 2021

Strategies to write efficient bioinformatics workflows for the ARC high performance computing (HPC) cluster

One of the challenges of working with large genomics data sets is their long runtimes. Using the computing resources on the Advanced Research Computing (ARC) cluster effectively and efficiently can shorten these runtimes.

The ARC cluster offers a variety of hardware to suit the requirements of different workflows. As a first step, review the compute resources on the ARC cluster (https://rcs.ucalgary.ca/ARC_Cluster_Guide). This makes it easier to choose the appropriate partition for a given workflow. For example, choose the gpu-v100 partition for GPU accelerators or the bigmem partition for a memory-intensive workflow. Each partition has multiple compute nodes with similar hardware specifications. The illustration below shows that the gpu-v100 partition has 13 compute nodes, the cpu2019 partition has 40 compute nodes, the cpu2013 partition has 14 compute nodes, and so on.


Figure: The ARC cluster is a diverse collection of hardware.
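
The partition layout can also be inspected from a login node. The following is a minimal sketch, assuming the standard SLURM `sinfo` command is on the path; the partition names are taken from the text above, and the format string prints the partition name, node count, cores per node, and memory per node in MB. Off the cluster, the script only prints a notice.

```shell
#!/bin/bash
# Partitions to inspect (names taken from the text; adjust as needed)
partitions="gpu-v100,bigmem,cpu2019"

if command -v sinfo >/dev/null 2>&1; then
    # %P = partition, %D = number of nodes, %c = CPUs per node, %m = memory per node (MB)
    sinfo --partition="$partitions" --format="%P %D %c %m"
else
    echo "sinfo not found: run this on a cluster login node"
fi
```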

This section will review the SLURM job submission scripts for the bioinformatics applications below.

  1. FastQC - A high throughput sequence QC analysis tool
  2. Burrows-Wheeler Aligner (BWA)
  3. Samtools


FastQC

FastQC assesses the quality of your sequencing data. It is available as a module on the ARC cluster. To run the installed version of fastqc, load the biobuilds/2017.11 module as below:

[tannistha.nandi@arc ~]$ module load biobuilds/2017.11
    Loading biobuilds/2017.11
    Loading requirement: java/1.8.0 biobuilds/conda
[tannistha.nandi@arc ~]$ fastqc --version
    FastQC v0.11.5
[tannistha.nandi@arc ~]$ fastqc --help
    FastQC - A high throughput sequence QC analysis tool
    SYNOPSIS
    fastqc seqfile1 seqfile2 .. seqfileN
    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1..seqfileN

In this example, we will work with the following 35GB of genome sequencing data in fastq file format.

[tannistha.nandi@arc ~]$ ls -l 
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi 35GB Jan 24. 2018 SN4570284.fq.gz

We will submit the fastqc job to a backfill partition called apophis-bf, where each compute node has 40 cores and 185GB of RAM. The job script below requests 1 core (--cpus-per-task) for 1 hour, along with 300 MB of RAM on the apophis-bf partition. The RAM estimate of 300MB is based on the fastqc user guide, which states that each thread is allocated 250MB of memory. On the ARC cluster, CPUs and cores refer to the same thing. Once the job scheduler SLURM allocates these resources on a compute node, the job first loads the biobuilds/2017.11 module, which sets the path to fastqc, and then runs fastqc on the data file.

#!/bin/bash
#<------------------------Request for Resources----------------------->
#SBATCH --job-name=fastqc-S
#SBATCH --mem=300M
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --partition=apophis-bf
#<------------------------Set environment variables------------------->
module load biobuilds/2017.11
#<------------------------Run fastqc---------------------------------->
# fastqc does not create the output directory, so make sure it exists
mkdir -p output
fastqc -o output --noextract -f fastq SN4570284.fq.gz

Save the above script in a file called ‘fastqc.sh’ and submit it to SLURM using the sbatch command as below.

[tannistha.nandi@arc ~]$ sbatch fastqc.sh
Submitted batch job 9658868

The runtime of job id 9658868 is ~46 minutes on a single core with a memory utilization of ~286 MB of RAM.

Job efficiency

  Job name          fastqc-S
  ----------------  --------
  # files           1
  File size         35GB
  # cores           1
  CPU efficiency    99.28%
  Run time          46 mins
  Memory utilized   286MB
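
The CPU efficiency figure is the CPU time the job actually used divided by the wall time multiplied by the number of allocated cores. The sketch below shows that arithmetic; the helper names `to_seconds` and `cpu_efficiency` are made up for this illustration, and the two time values are hypothetical inputs chosen to be close to the run above, not values read from the scheduler.

```shell
#!/bin/bash
# Convert an HH:MM:SS string to seconds.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# CPU efficiency (%) = CPU time used / (elapsed time x allocated cores)
cpu_efficiency() {
    local cpu_s elapsed_s cores=$3
    cpu_s=$(to_seconds "$1")
    elapsed_s=$(to_seconds "$2")
    awk -v c="$cpu_s" -v e="$elapsed_s" -v n="$cores" \
        'BEGIN { printf "%.2f\n", 100 * c / (e * n) }'
}

# Hypothetical inputs: 45m53s of CPU time over 46m13s of wall time on 1 core
efficiency=$(cpu_efficiency 00:45:53 00:46:13 1)
echo "CPU efficiency: ${efficiency}%"
```

An efficiency close to 100% on one core suggests the job was compute-bound rather than waiting on I/O, which is what makes it a reasonable candidate for the multi-threaded approach described next.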

Strategy to speed up the fastqc runtime

We will use the fastqc runtime flag --threads (-t for short), which specifies the number of files that can be processed simultaneously. Generally, a single thread runs on 1 core, two threads run on 2 cores, and so on. Because the flag parallelizes across files rather than within a file, a single 35GB input file cannot benefit from extra threads. However, fastqc can accept multiple input files, so one strategy is to split the 35GB data file into smaller chunks. Here we have used a python-based tool called ‘fastqsplitter 1.2.0’ (https://pypi.org/project/fastqsplitter/) to split the fastq file into 10 chunks.

[tannistha.nandi@arc ~]$ salloc --mem=6G -t 01:00:00 -c 10 -N 1 -n 1 -p apophis-bf
[tannistha.nandi@cn0513 ~]$ fastqsplitter -t 10 -i SN4570284.fq.gz \
    -o SN4570284_S0.fq.gz -o SN4570284_S1.fq.gz -o SN4570284_S2.fq.gz -o SN4570284_S3.fq.gz \
    -o SN4570284_S4.fq.gz -o SN4570284_S5.fq.gz -o SN4570284_S6.fq.gz -o SN4570284_S7.fq.gz \
    -o SN4570284_S8.fq.gz -o SN4570284_S9.fq.gz
[tannistha.nandi@cn0513 ~]$ ls -l 
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S0.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S1.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S2.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S3.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S4.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S5.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S6.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S7.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S8.fq.gz
-rw-rw-r-- 1 tannistha.nandi tannistha.nandi  3GB May 24 23:15 SN4570284_S9.fq.gz
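
Typing ten output names by hand is error-prone, so the repeated -o arguments can be generated in a loop instead. The following is a minimal sketch that only prints the resulting command (a dry run) rather than running it; the variable names `sample`, `n_chunks`, `out_args` and `cmd` are chosen for this illustration.

```shell
#!/bin/bash
sample="SN4570284"
n_chunks=10

# Build the repeated "-o <chunk file>" arguments in an array
out_args=()
for (( i = 0; i < n_chunks; i++ )); do
    out_args+=( -o "${sample}_S${i}.fq.gz" )
done

# Assemble the full command and print it instead of running it (dry run)
cmd=( fastqsplitter -t 10 -i "${sample}.fq.gz" "${out_args[@]}" )
echo "${cmd[*]}"
```

To run the command for real inside the interactive allocation, replace the final echo with "${cmd[@]}".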

The core and memory allocation need to be adjusted to run fastqc in multi-threaded mode. The compute resource request is based on the following factors:

  1. Number of input files
  2. Number of cores
  3. Memory (RAM)
  4. Input file size

There are 10 fastq files of ~3GB each, so a total of 10 cores can be requested, one fastq file per core. Accordingly, we need to allocate memory for all 10 cores. A single-core fastqc job on a 3GB file used ~187 MB of RAM (based on a separate job called 'fastqc-S1'). Scaling this to 10 cores gives 10 x 187 MB = 1870 MB, which we rounded up to a request of 2000MB of RAM. The multi-threaded job script is saved to a file called ‘fastqc-multicore.sh’.
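
The memory estimate above can be written as a small calculation. This is a minimal sketch, assuming memory grows linearly with the number of files processed at once; the function name `mem_request_mb` and the 500 MB rounding step are choices made for this illustration, not a cluster rule.

```shell
#!/bin/bash
# Round (per-file memory x number of threads) up to the next multiple of a step.
mem_request_mb() {
    local per_file_mb=$1 threads=$2 step_mb=$3
    local raw=$(( per_file_mb * threads ))
    # Integer ceiling division, then scale back up to a multiple of step_mb
    echo $(( (raw + step_mb - 1) / step_mb * step_mb ))
}

# 187 MB per 3GB file (from the single-core test job), 10 threads, 500 MB step
mem_mb=$(mem_request_mb 187 10 500)
echo "#SBATCH --mem=${mem_mb}M"
```

With these inputs the raw estimate is 1870 MB, and rounding up gives the 2000 MB used in the job script. The actual usage should still be checked after the job finishes, since memory does not always scale linearly.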

#!/bin/bash
#<------------------------Request for Resources----------------------->
#SBATCH --job-name=fastqc-M
#SBATCH --mem=2000M
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=01:00:00
#SBATCH --partition=apophis-bf
#<------------------------Set environment variables------------------->
module load biobuilds/2017.11
#<------------------------Run fastqc---------------------------------->
# fastqc does not create the output directory, so make sure it exists
mkdir -p output_dir
# One thread per input file: 10 threads for the 10 chunks
fastqc --threads 10 -o output_dir --noextract -f fastq \
    SN4570284_S0.fq.gz SN4570284_S1.fq.gz SN4570284_S2.fq.gz SN4570284_S3.fq.gz SN4570284_S4.fq.gz \
    SN4570284_S5.fq.gz SN4570284_S6.fq.gz SN4570284_S7.fq.gz SN4570284_S8.fq.gz SN4570284_S9.fq.gz

Save the above script as 'fastqc-multicore.sh' and submit it to SLURM using the sbatch command as below:

[tannistha.nandi@arc ~]$ sbatch fastqc-multicore.sh
Submitted batch job 9658870

The runtime of the multi-threaded fastqc job (job id 9658870) on the same amount of input data (35GB) is ~5 minutes. Thus, the run time of a fastqc analysis on the same input size can be reduced significantly by using its multi-threaded functionality.

Performance evaluation

Multi-threading is one of the recommended approaches to reduce the runtime for a given input size. In this example, it reduced the run time from ~46 minutes to ~5 minutes, an improvement of roughly 10X over the single-threaded job. Note also that the multi-threaded job used about 4X more memory. The RAM usage trend depends on the input file size and on how the application handles memory allocation.

Figure: FastQC performance (Fastqc performance.png)
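
The speedup can be quantified from the two approximate run times reported above. The sketch below computes the speedup and the parallel efficiency (speedup divided by the number of cores); the function name `speedup_report` is made up for this illustration, and the inputs are the rounded values of ~46 and ~5 minutes, so the result is only approximate.

```shell
#!/bin/bash
# speedup = single-threaded runtime / multi-threaded runtime
# parallel efficiency (%) = 100 x speedup / number of cores
speedup_report() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn
        printf "speedup=%.1fX efficiency=%.0f%%\n", s, 100 * s / n
    }'
}

# ~46 minutes on 1 core versus ~5 minutes on 10 cores
report=$(speedup_report 46 5 10)
echo "$report"
```

A parallel efficiency close to 100% means the extra cores were well used. Values well below that suggest requesting fewer cores, which also tends to shorten the time a job waits in the queue.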


BWA (Burrows-Wheeler Aligner)

BWA is a popular software package for mapping sequencing reads to a reference genome. It is available as a module on the ARC cluster. To run the installed version of bwa, load the biobuilds/2017.11 module as below:

[tannistha.nandi@arc ~]$ module load biobuilds/2017.11
 Loading biobuilds/2017.11
 Loading requirement: java/1.8.0 biobuilds/conda
[tannistha.nandi@arc ~]$ bwa 
 Program: bwa (alignment via Burrows-Wheeler transformation)
 Version: 0.7.17-r1188
 Contact: Heng Li <lh3@sanger.ac.uk>
 Usage:   bwa <command> [options]