Managing software on ARC


Overview

In addition to basic software distributed with most Linux systems, additional application packages and libraries have been installed for use on ARC under /global/software. Also see the module avail command below for a list of some of the installed software packages. Write to support@hpc.ucalgary.ca if you need additional software installed.

Environment modules

To facilitate the use of some of the software on the ARC cluster, you can load a corresponding environment module file, which may add an installation directory to the PATH variable used to locate executable files, or help the software find the libraries it depends on. An overview of modules on WestGrid is largely applicable to ARC.

To list the software for which an environment module file has been created, use the command:

$ module avail

Then, to set up your environment to use a particular package, use the module load command. For example, to load a module for Python, use:

$ module load python/anaconda-3.6-5.1.0

If you need to undo the changes made by loading the module, you can use the module unload command:

$ module unload python/anaconda-3.6-5.1.0

To see currently loaded modules, type:

$ module list

Unlike some clusters, there are no modules loaded by default. So, for example, to use the Intel compilers or Open MPI for parallel programming, you must load an appropriate module.
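
For example, this is how you could look up and load an Open MPI module (the module name below is only an illustration; use one of the names that module avail actually reports on ARC):

$ module avail openmpi
$ module load openmpi/4.1.1-gnu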

Getting software centrally installed

It is possible to ask ARC analysts to install a specific software package on ARC centrally, so that it is available to every user. However, the software has to meet some conditions to qualify for central installation:

  • It has to be of some general interest. If you are the only person who is ever going to use it, it is better to have it installed for yourself only.
  • The software has to be somewhat stable, not under active development. The reason is simple: once a version of the software is installed it cannot be removed, because somebody may be using it. Thus, any update, upgrade or new version has to be installed in parallel with the existing installation. If you expect to need the software updated in less than a year, it is better not to have it centrally installed. If you install it in your home directory, you will own the installation and will be able to update it at any time at your convenience.

Installing Software in User's Home Directory

Background

If you are a user on the ARC cluster, you can install software yourself into your own home directory. You should be able to follow the software-specific installation instructions found on the software distribution site or inside the source / distribution archive. However, most manuals and guides assume that you have admin privileges on the system you are installing on. This is not the case on ARC, and it is the main source of difficulties with installations. You have to adjust the instructions to reflect the fact that you are installing into your home directory and not into the standard system locations, which require admin privileges to write to.

ARC is a multi-user system, and on such a system users cannot do anything that affects other users. Using the sudo command, for example, implies that you want to do something that affects other users; this is why you cannot use it on ARC.


Using a package manager (apt, yum, etc.) requires changing common system directories, which would affect all users on the system. Moreover, a package manager would install software onto the login node only. The login node is not supposed to run your computations, the compute nodes are, and a package manager run on the login node cannot install software onto the compute nodes, which are different computers.


Thus, software that you want to use on ARC must be installed onto a shared file system that is accessible by all the nodes in the cluster. When we (analysts) install software centrally, we install it into the shared /global/software directory; if a user wants to install a software package and manage it on their own, it has to be installed into the user's home directory, that is /home/$USER, or just $HOME. In such cases package managers cannot be used: the software often has to be compiled on ARC, and the desired installation location has to be specified during the compilation process. If there is a dependency, a library or another software package that has to be present on the system, then the dependency has to be installed the same way prior to the compilation.
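
As an illustration, a typical installation from source for a package that uses the common GNU Autotools build system might look like the following. The package name mytool and its version are hypothetical, and your package's build instructions may differ; the key point is that the desired installation location is passed via the --prefix option:

$ cd $HOME/software/src/mytool
$ tar -xzf mytool-1.2.3.tar.gz
$ cd mytool-1.2.3
$ ./configure --prefix=$HOME/software/mytool-1.2.3
$ make
$ make install

After make install completes, the bin, lib and other directories of the package appear under $HOME/software/mytool-1.2.3 rather than under a system location. The recommended directory layout is described in the Planning section below.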

Planning

Before installing a software package you should think about the directory structure for your future installs. There are many ways to do that, but here is a simple and proven way to organize software in your home directory:


  • Your software will be stored in the software subdirectory of your home directory; you can refer to it as $HOME/software on the command line.
  • Each specific version of a software package is installed into its own sub-directory named name-version.
For example, if you want to install version 3.6.2 of GNU R, it can go into the $HOME/software/r-3.6.2 sub-directory.
  • The distribution files and archives are downloaded to the $HOME/software/src/software_name sub-directory.
For the R example above, that would be the $HOME/software/src/r/ directory.


This is how you can set up this directory structure:

$ cd
$ mkdir $HOME/software
$ mkdir $HOME/software/src

You can check if you have it:

$ ls -l
....
drwxr-xr-x 4 username username  4096 Jun  8  2021 software
....

Getting Software

To install a software package, you first have to obtain a distribution source for it. Depending on the software and/or your choice, you may get one of the following:


  • An archive containing pre-compiled binary files of the program (usually .zip or .tar.gz files).
You can unpack the archive and place the files in a directory of your choice.


  • An archive containing the source code of the program (usually .tar.gz or .tar.bz2 files).
The sources need to be compiled before you can use the program, and the resulting binary files typically have to be installed after the compilation (see the download example after this list).


  • A binary installer, a program that installs a copy of pre-compiled software for you (it can be a .sh file, or have no extension at all).
You have to run it to initiate the installation process.
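
For instance, continuing the GNU R example from the Planning section, downloading and unpacking a source archive into the suggested layout might look like this (the download URL is only an illustration; check the project's web site, CRAN in this case, for the correct link):

$ mkdir -p $HOME/software/src/r
$ cd $HOME/software/src/r
$ wget https://cran.r-project.org/src/base/R-3/R-3.6.2.tar.gz
$ tar -xzf R-3.6.2.tar.gz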

Building software

Libraries

The .so libraries are shared libraries, similar to .dll files in the Windows world. During the build process the compiled code is linked against those libraries by pointing to the functions inside them, without actually copying the code from the libraries into your newly built executable binary. Thus, when the program runs, those libraries need to be loaded into memory so that the functions used in your code can be executed. However, to be loaded, a library first needs to be found.


Many libraries installed in the system are properly registered with it, so they can be found automatically. The libraries you compile yourself are not registered, as only system admins can do the registration. You have to let the system know where to find the libraries you need.


There is a special environment variable, LD_LIBRARY_PATH, that can be used for this purpose. It contains a list of additional directories where the system should look when it needs to find a library. It does not contain one path, but several, separated by the : character. So, you have to augment it rather than replace its value.


Let us imagine that you install a library into $HOME/software. This will create several directories inside the software directory, such as bin, lib, share and include. There may be more of them, and not all of them may be present. The libraries will be in the lib directory.

This is how to do it. On your command line, issue this command:

$ export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH

This will add the $HOME/software/lib directory to the list of directories to search for required libraries.


You have to do this every time you log in to ARC and want to use the software. More importantly, if you are going to run the software in a job, executing the command on the login node is not enough: you have to add this line to the job script, so that it is executed on the compute node allocated for the job, before the main program runs.
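
For example, a job script can set the variable right before the main program is started. The sketch below assumes a Slurm job script and a hypothetical program my_prog installed under $HOME/software; adjust the resource requests and the program name to your case:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# Make the self-installed libraries visible on the compute node.
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH

# Run the main program (hypothetical name).
$HOME/software/bin/my_prog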

Activating the software

Also, you may want to add the bin directories of the libraries, as well as of the main software, to the PATH variable, like this:

$ export PATH=$HOME/software/bin:$PATH

Again, it has to be done every time you have a fresh session.

You may want to put these commands into a script in the $HOME/software/ directory, so you can run them all at once. For example, if you put all the initialization work into the $HOME/software/init-my-code.sh script, then you can activate your software with:

$ source $HOME/software/init-my-code.sh

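
As a sketch, such an init script can simply collect the export commands shown above (the lib and bin locations are assumptions; use the directories you actually installed into):

# Contents of $HOME/software/init-my-code.sh
export PATH=$HOME/software/bin:$PATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH

Sourcing this file, either interactively or from a job script, then activates your software in one step.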

Installing software via Conda

Many software packages today are available via Conda repositories. Conda is a general-purpose software package manager and can be used to conveniently install software packages that offer this option.

  • First, you have to install your own Conda following this article: Conda on ARC.
  • Then, you can follow the software's installation manual on how to install it using the Conda option.
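
For example, installing a package into its own Conda environment typically looks like this. The environment and package names, myenv and numpy, are only examples; follow your software's manual for the actual package name and channel:

$ conda create -n myenv
$ conda activate myenv
$ conda install numpy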

Containers

Background

A container allows you to pack an application and all of its dependencies into a single package. This makes your application portable, shareable, and reproducible.


Containers foster portability and reproducibility because they package all of an application's dependencies, including its own tiny operating system. This means your application won't break when you port it to a new environment: your app brings its environment with it.

Containers vs VMs

Containers and Virtual Machines (VMs) are both types of virtualization, but they differ in how they do it.

VMs:

  • Full virtualization, including the core of the system (the OS kernel).
  • Can run many different operating systems (OS), independent of the host OS.
  • Starting a VM is similar to booting a computer, that is, a relatively long process.

Containers:

  • Partial virtualization: the OS kernel is shared with the host OS.
  • Cannot run a different OS kernel than the host; in practice, containers only work on Linux.
  • Starting a container is much quicker.

Apptainer / Singularity

  • On-line tutorial: https://singularity-tutorial.github.io/
  • How to convert a Docker container to an Apptainer container.
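
A typical workflow is to pull a container image once and then run programs from inside it. A minimal sketch with Apptainer, assuming Apptainer is available on the node and using the public ubuntu image from Docker Hub only as a stand-in for the image you actually need:

$ apptainer pull docker://ubuntu:22.04
$ apptainer exec ubuntu_22.04.sif cat /etc/os-release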