Caching#

Caching is used to speed up parallel computations and handle memory management. When data is cached, patches are read in from disk on demand by the worker processes when measuring correlations. While this may repeatedly load the same patch from different workers, it is still much faster than sending the data to the worker directly, which can take a considerable amount of time and memory.

For these reasons it is beneficial to choose a cache location on a fast device, such as an SSD, RAID device or even a RAM file system, if the catalogs are small enough (e.g. /dev/shm on many UNIX systems).

Note

The cache directory is not created automatically and an OSError will be raised if it does not exist.

Using command line tools#

For the command line tools, the cache path can be configured using yaw_cli init --cache-path or the cachepath value in the data section of the YAML configuration. The default location is to use the project directory itself.

Caching is disabled by default and must be enabled per catalog by setting cache: true in the YAML configuration for a catalog or suppling the command line flag --*-cache, where * is either of ref, unk, or rand.

Using the python API#

When working from within python, caching can be enabled by passing the a path to the cache_directory argument of catalog constructors NewCatalog.from_file and NewCatalog.from_dataframe.

Note

Currently using a cache directory has the side effect that the data is not held in memory after the cache data has been written. The catalog instance will be in the unloaded state, until the load() method is called.

Catalogs can also be restored from a cache directory using the NewCatalog.from_cache method, since the cache directory is persistent if not deleted.

Warning

Caching is currently not fully implemented for the treecorr backend.