How to work with a large number of small files

From RCSWiki

Background

Singularity Overlays

Persistent overlay directories allow you to overlay a writable file system on an immutable read-only container for the illusion of read-write access. You can run a container and make changes, and these changes are kept separately from the base container image.

Mounting Zip archives with fuse-zip

Note that mounting an archive is a system-level operation and must be completed properly to keep the compute nodes on ARC healthy. If you mount an archive on ARC, you must unmount it afterwards.

If you mount an archive in your job script, you have to unmount it in the same job script and wait until the unmount operation is actually complete.

A possible complication is that the unmount operation takes time and the command performs it in the background; that is, it returns control to the script before the unmount has completed. Once the job script finishes, SLURM kills all the processes that belong to the job's user, including the unmount process, leaving the mount point occupied.

# Get a brief help.
$ fuse-zip --help
....

# Mount the archive as a file system.
$ fuse-zip -r archive.zip mountpoint

# Use the data inside
$ ls mountpoint/
$ ... 

# Unmount the archive.
$ fusermount -u mountpoint
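
The requirement to wait for the unmount inside the job script could be sketched roughly as follows. This is only an illustration under stated assumptions, not an official ARC recipe: wait_for_unmount is a hypothetical helper name, the 300 second default timeout is arbitrary, and the sketch assumes that the directory no longer being reported as a mount point by the util-linux mountpoint command is a sufficient sign that the unmount has finished.

```shell
#!/bin/bash
# Hypothetical helper (not part of fuse-zip or SLURM): block until the
# given directory is no longer a mount point, or until a timeout expires.
wait_for_unmount() {
    local dir="$1"
    local timeout="${2:-300}"   # seconds to wait; the default is an assumption
    local waited=0

    # "mountpoint -q" exits with status 0 while the directory is still mounted.
    while mountpoint -q "$dir"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: $dir is still mounted after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Intended use at the end of a job script, with the same placeholder
# names (archive.zip, mountpoint) as in the commands above:
#
#   fuse-zip -r archive.zip mountpoint
#   ls mountpoint/
#   fusermount -u mountpoint
#   wait_for_unmount mountpoint || exit 1
```

With a helper like this as the last step, the job script ends only after the mount point has been released, or it reports an error if the timeout is reached.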

/dev/shm

From Python

Working with files inside a TAR archive

  • tarfile module:
    https://www.askpython.com/python-modules/tarfile-module
    https://stackoverflow.com/questions/27220376/python-read-file-within-tar-archive

Using HDF5 data file format

To be continued....

Links

How-Tos