CESM

General

  • National Center for Atmospheric Research (NCAR): https://ncar.ucar.edu/

The Community Earth System Model is a fully-coupled global climate model developed in collaboration with colleagues in the research community. CESM provides state-of-the-art computer simulations of Earth's past, present, and future climate states.


CESM2 is built on the CIME framework. The majority of the CESM2 User’s Guide is contained in the CIME documentation.


The Common Infrastructure for Modeling the Earth (CIME - pronounced “SEAM”) provides a Case Control System for configuring, compiling and executing Earth system models, data and stub model components, a driver and associated tools and libraries.

CESM on ARC

Currently, two versions of CESM are installed on ARC, but only one of them works and is supported. The supported version is 2.1.3.

CESM is installed and set up to be used via environment modules, using the module command.

$ module avail cesm
------------------- /global/software/Modules/4.6.0/modulefiles -------------------
cesm/2.1.1  cesm/2.1.3 

To activate it, please load its module:

$ module load cesm/2.1.3

Loading cesm/2.1.3
  Loading requirement: gcc/9.4.0 cmake/3.17.3 git/2.25.0 svn/1.10.6 openmpi/4.1.1-gnu lib/openblas/0.3.13-gnu

This installation of CESM comes with its own dedicated installs of Python and Perl. To verify that the software has been properly activated, you can check the locations of some of the commands provided by the install:

$ which python
alias python='python3'
	/global/software/cesm/python/3.10.4/bin/python3

$ which perl
/global/software/cesm/perl/5.34.1/bin/perl

$ which create_newcase
/global/software/cesm/cesm-2.1.3/cime/scripts/create_newcase

If you have any other software modules loaded on ARC, they may interfere with CESM. Please avoid loading too many modules at the same time.
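
If in doubt, one way to start from a clean environment is to unload all modules first and then load only CESM. This is a minimal sketch using the standard module purge command; before purging, make sure you do not depend on any site-default modules:

$ module purge
$ module load cesm/2.1.3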

There is a shared data directory for CESM data sets, pointed to by the DIN_LOC_ROOT environment variable. Sharing this storage directory should reduce the amount of data that needs to be downloaded, as well as save storage space in users' home directories.
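
As a quick check, the variable can be printed after loading the module, or queried from inside an existing case directory with the standard CIME xmlquery tool. This is a sketch that assumes the module exports DIN_LOC_ROOT into the shell environment:

$ echo $DIN_LOC_ROOT
$ ./xmlquery DIN_LOC_ROOT        # from inside a case directory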

Using CESM on ARC

Machines and Queues

The ARC cluster uses the SLURM scheduling system to manage and control jobs. SLURM assumes that there is one main queue for jobs that need to be executed, and that the cluster consists of several partitions. The partitions are collections of compute nodes that are grouped based on some common property. On ARC, most partitions are grouped based on hardware similarity, scheduling limits, and ownership. CESM has its own model of a compute cluster, which is based on multiple queues and machine types.

In practice, CESM on ARC is set up to use the arc40, arc48, and arc52 machine types, whose compute nodes have 40, 48, and 52 CPU cores per node, respectively. However, each machine type can be used in several SLURM partitions: these partitions contain machines of the same kind, but their run time limits differ. CESM treats these SLURM partitions as queues. To create a new case with CESM, therefore, both the machine type and the target queue have to be indicated.


Queues of the arc40 machine types:

Queue         #Nodes         #CPUs    Max#nodes    MaxRuntime           Comment
name           total         /node        /user         hours
----------------------------------------------------------------------------
cpu2019          40             40            6           168
cpu2019-bf05     87             40           20             5           default 

Queues of the arc48 machine types:

Queue         #Nodes         #CPUs    Max#nodes    MaxRuntime           Comment
name           total         /node        /user         hours
----------------------------------------------------------------------------
cpu2021          34             48           12           168           default
cpu2021-bf24      7             48            4            24  

Queues of the arc52 machine types:

Queue         #Nodes         #CPUs    Max#nodes    MaxRuntime           Comment
name           total         /node        /user         hours
----------------------------------------------------------------------------
cpu2022          52             52           10           168           default
cpu2022-bf24     16             52            4            24  

Please note that this information may change, as the cluster constantly evolves: new hardware is added and old hardware is removed.
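
For example, to run on the arc48 nodes in the long cpu2021 queue rather than the defaults, both the machine and the queue can be given explicitly at case creation. This is a sketch only: the case name is arbitrary, and the compset and resolution are the same placeholders used in the example further below.

$ create_newcase --case testX_arc48 --compset X --res f19_g16 --machine arc48 --queue cpu2021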

Creating and running a case

The default machine type is arc40, and this type's default queue is cpu2019-bf05. These are the machine type and queue that will be used if they are not specified at the create_newcase step.
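
For a case that has already been created, the batch settings can be inspected and changed from the case directory with the standard CIME xmlquery and xmlchange tools. This is a sketch, assuming the usual CIME batch variables JOB_QUEUE and JOB_WALLCLOCK_TIME (the values shown are illustrative):

$ ./xmlquery JOB_QUEUE --subgroup case.run
$ ./xmlchange JOB_QUEUE=cpu2019 --subgroup case.run
$ ./xmlchange JOB_WALLCLOCK_TIME=24:00:00 --subgroup case.run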

Please note that the CIME scripts directory is activated by the module, so there is no need to use the full path in the command name; you can therefore invoke create_newcase from your work directory and reference the cases by their relative path. Again, there is no need to use the full absolute path to the case directories.


The routine is to

  • Create a new case using the create_newcase command on the login node.
  • Set up and build the executable code for the case on the login node.
  • Submit the case to the SLURM scheduler using the ./case.submit command on the login node.
  • Monitor the status of the jobs using the squeue SLURM command, as well as the CESM output in the CaseStatus file.

Here is the pattern:

$ cd  ~/cases/
$ create_newcase --case casename --compset ... --res ... --machine arc40 --queue cpu2019

$ cd casename
$ ./case.setup

$ ./preview_run

$ ./case.build

$ ./case.submit

and an example:

$ create_newcase --case testX3 --compset X --res f19_g16 --machine arc40
Compset longname is 2000_XATM_XLND_XICE_XOCN_XROF_XGLC_XWAV
Compset specification file is /global/software/cesm/cesm-2.1.3/cime/src/drivers/mct/cime_config/config_compsets.xml
Compset forcing is 1972-2004
....
....
  Creating Case directory /work/dmitri.rozmanov/tests/cesm/testX3

$ cd testX3
$ ./case.setup
....

$ ./preview_run
CASE INFO:
  nodes: 1
  total tasks: 40
  tasks per node: 40   
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      module command is /global/software/Modules/3.2.10/bin/modulecmd python load cesm/dev
      Setting Environment OMP_NUM_THREADS=1

    SUBMIT CMD:
      sbatch --time 05:00:00 --partition cpu2019-bf05 .case.run --resubmit

    MPIRUN (job=case.run):
      mpiexec  -n 40  /home/drozmano/cesm/scratch/testX3/bld/cesm.exe  >> cesm.log.$LID 2>&1

  FOR JOB: case.st_archive
    ENV:
      module command is /global/software/Modules/3.2.10/bin/modulecmd python load cesm/dev
      Setting Environment OMP_NUM_THREADS=1

    SUBMIT CMD:
      sbatch --time 0:20:00 --partition cpu2019-bf05  --dependency=afterok:0 case.st_archive --resubmit

$ ./case.build
......
Building cesm with output to /home/drozmano/cesm/scratch/testX3/bld/cesm.bldlog.220805-155036 
Time spent not building: 0.852010 sec
Time spent building: 69.274885 sec
MODEL BUILD HAS FINISHED SUCCESSFULLY

$ ./case.submit
.....

Once the case is submitted, one can check the jobs running under the user's account:

$ squeue-long -u drozmano
JOBID     USER      STATE    PARTITION   TIME_LIMIT  TIME  NODES  TASKS  CPUS  MIN_MEMORY  TRES_PER_N  REASON
15397675  drozmano  PENDING  cpu2019-bf       20:00  0:00      1      1     1  1G          N/A         Dependency
15397674  drozmano  RUNNING  cpu2019      168:00:00  0:04      1     40    40  1G          N/A         None

Here, CESM submitted two jobs: one that actually runs the simulation (15397674), and another that will resubmit the computation if it is not finished within the allocated time. The second job depends on the successful end of the computational job and will only run after the first job finishes successfully.
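
How long each segment runs and how many times it gets resubmitted are controlled through the case XML settings. A minimal sketch using the standard CIME variables STOP_OPTION, STOP_N, and RESUBMIT, run from the case directory before submission (the values are illustrative):

$ ./xmlchange STOP_OPTION=nmonths,STOP_N=12     # each job simulates 12 months
$ ./xmlchange RESUBMIT=4                        # then resubmit 4 more times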

While the simulation is running, the CaseStatus file can be checked to see the simulation status:

$ cat CaseStatus
......
......
2022-08-05 15:51:46: case.build success 
 ---------------------------------------------------
2022-08-05 15:53:10: case.submit starting 
 ---------------------------------------------------
2022-08-05 15:53:12: case.submit success case.run:15397674, case.st_archive:15397675
 ---------------------------------------------------
2022-08-05 15:53:19: case.run starting 
 ---------------------------------------------------
2022-08-05 15:53:24: model execution starting 
 ---------------------------------------------------
2022-08-05 15:54:17: model execution success 
 ---------------------------------------------------
2022-08-05 15:54:17: case.run success 
 ---------------------------------------------------
2022-08-05 15:54:19: st_archive starting 
 ---------------------------------------------------
2022-08-05 15:54:33: st_archive success 
 ---------------------------------------------------

Success.
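
When short-term archiving is enabled, the st_archive job moves the model output from the run directory into the short-term archive. The locations of both can be queried from the case directory (assuming the usual CIME variables RUNDIR and DOUT_S_ROOT):

$ ./xmlquery RUNDIR
$ ./xmlquery DOUT_S_ROOT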

This example is based on the official CIME manual:

https://esmci.github.io/cime/versions/master/html/users_guide/introduction-and-overview.html#quick-start

Links

ARC Software pages