Running Alphafold3 on Arc

Alphafold3 has been compiled into an Apptainer image for use on Arc.

Alphafold3 has two separate steps: a pipeline step that requires CPUs only and reads much of the large public_databases directory, and an inference step that requires a GPU. The pipeline step takes much longer than the inference step, so it is wasteful to occupy a GPU node for its duration. To run Alphafold3 efficiently, we therefore run it two separate times, in different modes, on compute nodes that have the appropriate resources.
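At a high level the workflow looks like the sketch below, assuming you are working from a directory that contains the example job scripts described in the rest of this page (the script names are the ones provided in /global/software/alphafold3):

  # Stage 1: data pipeline on a CPU node (slow; reads the public databases)
  sbatch runpipeline.slurm

  # After the pipeline job finishes, collect its JSON outputs into
  # inference_inputs/ (see "The Inference stage" below), then run
  # stage 2 on a GPU node:
  sbatch runinference.slurm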

Prerequisites

  • For licensing reasons, every client who would like to use alphafold3 must register for and download the model parameters. See https://forms.gle/svvpY4u2jsHEwWYS6 to register. The model parameters are relatively small and can be stored anywhere on Arc that you have access to, and they can be used by multiple jobs at the same time (you do not need a separate copy for every job).
  • The alphafold3 container is stored in /global/software/alphafold3. There are example job scripts in that directory that are referenced below; a sketch of a typical working-directory setup follows this list.
  • Due to the nature of the pipeline stage, we need to split each alphafold run into separate pipeline and inference stages. If we don't, a valuable GPU node is tied up for a long time, preventing others from using it. In our testing with an input of 120 proteins, the pipeline stage took a day while the inference stage took only an hour.
  • The public databases that Alphafold uses have been pre-downloaded and are stored in a location for anyone on Arc to use. This location, /bulk/public/alphafold3/public_databases, is reflected in the example job scripts.
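A minimal sketch of setting up a working directory, assuming you have already registered and downloaded the model parameters; the directory name alphafold3_run is a placeholder, and the script names are the ones provided in /global/software/alphafold3:

  # Hypothetical working directory; use any location you have write access to.
  mkdir -p ~/alphafold3_run/models ~/alphafold3_run/pipelineoutputs ~/alphafold3_run/inference_inputs
  cd ~/alphafold3_run

  # Copy the example job scripts so you can adjust paths and resource requests.
  cp /global/software/alphafold3/runpipeline.slurm /global/software/alphafold3/runinference.slurm .

  # Place the model parameters you downloaded after registering into ./models
  # (the name of the downloaded parameter file depends on what you were sent).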


The Pipeline stage

  • This stage is very time consuming. Much of the time is spent waiting to load a very large amount of data from the filesystem. The example job script in /global/software/alphafold3 copies /bulk/public/alphafold3/public_databases/mmcif_files to /dev/shm, a filesystem that exists entirely in the compute node's memory (which makes it very fast). This staging of data takes 20-30 minutes, so you'll want to run a fairly large number of proteins in one job to amortize the cost of copying the mmcif_files to the compute node.
  • Download the model parameters and put them in ./models (downloading your own copy is a legal/license requirement).
  • Generate input data. Make sure you've split your input sequences into 5 roughly equally sized files (by number of input sequences). For the examples we'll use filenames starting with xa and ending with .json, so xa*.json; see the input sketch after this list. The example job script runs 5 copies of alphafold at the same time on the compute node, because alphafold3's pipeline stage can only make effective use of about 4 CPUs for a single job. This way the node runs multiple 4-CPU jobs at the same time.
  • Create ./pipelineoutputs for the resulting files
  • Run the pipeline stage on CPUs by submitting runpipeline.slurm. Required job resources:
    • --mem=500GB # We need to request enough memory for the public_databases/mmcif_files plus whatever the job needs.
    • -c 20 # 20 cpus
    • -N 1 # 1 node
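For reference, each fold job in the alphafold3 input format is a JSON object roughly like the sketch below (see the upstream AlphaFold3 input documentation for the full format). The job name, chain ID, and sequence here are placeholders, and how many jobs you put in each xa*.json file depends on how you split your inputs and on how runpipeline.slurm invokes the container, so check the example script.

  {
    "name": "example_job",
    "modelSeeds": [1],
    "sequences": [
      {
        "protein": {
          "id": "A",
          "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF"
        }
      }
    ],
    "dialect": "alphafold3",
    "version": 1
  }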

The Inference stage

  • This is the part of alphafold that requires a GPU.
  • Compared to the pipeline stage, the inference is fairly quick.
  • Copy the runinference.slurm file to a directory you have write access to.
  • Take the outputs from the pipeline stage (which were put in ./pipelineoutputs in the example) and move or copy them into a directory called something like inference_inputs. Note that the pipeline stage makes a subdirectory for each input, but the inference stage expects the files to all be directly in inference_inputs, so you'll have to copy them all. You can use find to help:
  cd pipelineoutputs
  cp $(find . -type f -name '*.json') ../inference_inputs
  • Submit runinference.slurm to the scheduler. Required job resources (a sketch of the corresponding #SBATCH lines follows this list):
    • -p gpu-h100
    • --gres=gpu:1 # Alphafold can only use 1 gpu
    • --mem=120G # This might need to be increased depending on the inputs.
    • --ntasks-per-node=4
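If you need to adjust these requests, the corresponding #SBATCH lines at the top of a Slurm script would look roughly like the sketch below. The time limit is a placeholder, and the actual inference commands live in the provided runinference.slurm, so prefer editing that script rather than writing one from scratch.

  #!/bin/bash
  #SBATCH -p gpu-h100             # GPU partition
  #SBATCH --gres=gpu:1            # Alphafold can only use 1 GPU
  #SBATCH --mem=120G              # increase if your inputs need more memory
  #SBATCH --ntasks-per-node=4
  #SBATCH --time=24:00:00         # placeholder; adjust to your inputs
  # ... inference commands as in the provided runinference.slurm ...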

References

  • https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md
  • https://docs.alliancecan.ca/wiki/AlphaFold3