Running AlphaFold3 on Arc
AlphaFold3 has been packaged as an Apptainer image for use on Arc.
AlphaFold3 runs in two separate stages. The first, the data pipeline, requires CPUs only and reads much of the large public_databases directory. The second, inference, requires a GPU. The pipeline stage takes much longer than the inference stage, so it is wasteful to occupy a GPU node for its duration. To run AlphaFold3 efficiently, you therefore run it twice in different modes, each time on a compute node with the appropriate resources.
Prerequisites
For licensing reasons, every user who wants to run AlphaFold3 must register for and download the model parameters; see https://forms.gle/svvpY4u2jsHEwWYS6 to register. The model parameters are relatively small and can be stored anywhere on Arc that you have access to. A single copy can be used by multiple jobs at the same time; you do not need a separate copy for every job.
The Pipeline stage
- Download the model parameters and put them in ./models
- Generate your input data. Split your input sequences into 5 roughly equally sized files (by number of input sequences). The examples here use filenames starting with xa and ending in .json, so xa*.json
- Create a ./pipelineoutputs directory for the resulting files
- Run the pipeline stage on CPUs by submitting runpipeline.slurm
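The runpipeline.slurm script is site-specific, but a minimal sketch might look like the following. This is an illustration, not the actual Arc script: the image path, entry-point location, database path, and resource requests are placeholders you must adjust, and the flag names follow the upstream google-deepmind/alphafold3 run_alphafold.py. It uses a job array to process the 5 input files in parallel on CPU-only nodes.

```shell
#!/bin/bash
#SBATCH --job-name=af3-pipeline
#SBATCH --cpus-per-task=8        # CPU-only stage: no GPU requested
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --array=0-4              # one array task per input file

# Collect the split input files (xa*.json) into an array and pick
# the one that corresponds to this array task.
INPUTS=(xa*.json)

apptainer exec /path/to/alphafold3.sif \
    python /app/alphafold/run_alphafold.py \
    --json_path="${INPUTS[$SLURM_ARRAY_TASK_ID]}" \
    --model_dir=./models \
    --db_dir=/path/to/public_databases \
    --output_dir=./pipelineoutputs \
    --run_inference=false        # data pipeline only; inference runs later on a GPU node
```

The key idea is --run_inference=false: the job does only the CPU-bound MSA/template search, so it can be scheduled on ordinary compute nodes without tying up a GPU.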
The Inference stage
- Copy the runinference.slurm file to a directory you have write access to.
- Take the outputs from the pipeline stage (which went into ./pipelineoutputs in the example above) and move or copy the files into a directory called something like inference_inputs. Note that the pipeline stage creates a subdirectory for each input file, but the inference stage expects all of the files to be directly inside inference_inputs, so you will have to copy them all out. You can use find to help:
cd pipelineoutputs
cp $(find . -type f -name '*.json') ../inference_inputs
- Submit runinference.slurm to the scheduler.
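For reference, a minimal sketch of what runinference.slurm might contain is shown below. As with the pipeline sketch, this is an assumption-laden illustration rather than the actual Arc script: paths and resource requests are placeholders, and the flag names follow the upstream google-deepmind/alphafold3 run_alphafold.py.

```shell
#!/bin/bash
#SBATCH --job-name=af3-inference
#SBATCH --gres=gpu:1             # inference stage requires a GPU
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --time=04:00:00

# --nv exposes the host GPU inside the container.
apptainer exec --nv /path/to/alphafold3.sif \
    python /app/alphafold/run_alphafold.py \
    --input_dir=./inference_inputs \
    --model_dir=./models \
    --output_dir=./inference_outputs \
    --run_data_pipeline=false    # pipeline already ran; do GPU inference only
```

Here --run_data_pipeline=false is the mirror image of the pipeline stage: the job skips the slow CPU search and runs only the GPU inference over the precomputed files in inference_inputs.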