How to work with large number of small files

From RCSWiki
Revision as of 17:53, 18 November 2022

Background

Singularity Overlays

Persistent overlay directories allow you to overlay a writable file system on an immutable read-only container for the illusion of read-write access. You can run a container and make changes, and these changes are kept separately from the base container image.

  • https://docs.sylabs.io/guides/3.5/user-guide/persistent_overlays.html

Mounting Zip archives with fuse-zip

Please note that mounting an archive is a system-level operation, and it must be completed properly to keep the compute nodes on ARC healthy. If you mount an archive on ARC, you must unmount it afterwards.

If you mount an archive in your job script, you have to unmount it in the same job script and wait until the unmount operation is actually complete.

A possible complication is that the unmount operation takes time and the command runs it in the background; that is, it returns control to the script before the unmount has completed. Once the job script finishes, SLURM kills all processes that belong to the job's user, including the unmount process, which leaves the mount point occupied.

# Get a brief help.
$ fuse-zip --help
....

# Mount the archive as a file system.
$ fuse-zip -r archive.zip mountpoint

# Use the data inside
$ ls mountpoint/
$ ... 

# Unmount the archive.
$ fusermount -u mountpoint
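
One way to wait for the unmount before the job script exits is to poll the mount point. The following is a minimal sketch, not an ARC-provided tool: the helper names wait_until_unmounted and unmount_and_wait, the timeout, and the polling interval are all hypothetical. It assumes that the directory remains a mount point until the unmount has fully completed, which may not hold on every system.

```python
import os
import subprocess
import time


def wait_until_unmounted(mountpoint, timeout=60.0, poll=0.5,
                         is_mounted=os.path.ismount):
    """Poll until `mountpoint` is no longer a mount point.

    Returns True once the mount point is released, or False if
    `timeout` seconds pass first. `is_mounted` is injectable so the
    loop can be exercised without a real FUSE mount.
    """
    deadline = time.monotonic() + timeout
    while is_mounted(mountpoint):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def unmount_and_wait(mountpoint, timeout=60.0):
    """Run `fusermount -u` and block until the mount point is released."""
    # Assumption: fusermount may return before the unmount has finished,
    # so the mount point is checked explicitly afterwards.
    subprocess.run(["fusermount", "-u", mountpoint], check=True)
    return wait_until_unmounted(mountpoint, timeout=timeout)
```

A job script could call unmount_and_wait("mountpoint") as its last step and treat a False result as an error, so that the job does not end while the mount point is still occupied.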

/dev/shm

From Python

Working with files inside a TAR archive

  • tarfile module:
https://www.askpython.com/python-modules/tarfile-module
https://stackoverflow.com/questions/27220376/python-read-file-within-tar-archive
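
As an illustration of the approach described in the links above, the sketch below reads a single file directly from a TAR archive with the standard tarfile module, without unpacking the archive to disk. The function names list_files and read_member are hypothetical helpers, and the archive path and member name are whatever your data set uses.

```python
import tarfile


def list_files(archive_path):
    """Return the names of all regular files stored in the archive."""
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(archive_path, "r:*") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def read_member(archive_path, member_name):
    """Return the contents of one archived file as bytes."""
    with tarfile.open(archive_path, "r:*") as tar:
        # extractfile() raises KeyError for an unknown name and
        # returns None for members that are not regular files.
        fileobj = tar.extractfile(member_name)
        if fileobj is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        with fileobj:
            return fileobj.read()
```

Keeping many small files in one archive and reading them this way means the file system sees one large file instead of thousands of small ones. Note that each call reopens the archive, so for many reads it may be better to keep one tarfile object open and reuse it.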

Links

How-Tos