Concepts#
The two most important design concepts of yet_another_wizz are caching and spatial patches. Caching the input catalog data is important to improve the code performance and memory management, patches are important to compute uncertainties and covariances.
Choosing appropriate patches can have significant performance impacts. The best choice for the number of patches depends on the data samples. The optimum performance is achieved with a small number of patches that can be optimally distributed on the available CPUs. When measuring correlations, each CPU needs to load two patches simultaneously into memory (together with some overhead from building a binary search tree).
Note
The number of patches should be large enough to get a robust covariance estimate from spatial resampling.
Example of spatial patches generated by yet_another_wizz for a data set with stellar masks.#
Caching#
Caching is used to speed up parallel computations and reduce the memory usage
of the code. Every input data set is automatically cached when creating a new
Catalog object, which are used for the correlation function
measurements.
Therefore, every catalog instance is tied to a cache directory on disk, where
the data is stored in Patch es and read back on demand.
Therefore it is beneficial to choose a cache location on a fast device, such as
an SSD, RAID system, or file systems optimised for parallel processing.
Tip
On UNIX-like systems, using a RAM disk maybe be an fast option if the system has sufficient spare memory.
When created, an input catalog is split into patches, which represent smaller chunks of the input data. The cache directory of a catalog typically as the following structure:
[cache_directory]/
├╴ patch_ids.bin # list of patch IDs
├╴ patch_0/ # first patch with ID=0
│ ├╴ data.bin # chunk of the catalog data
│ └╴ meta.yml # meta data describing the chunk
├╴ patch_1/
│ └╴ ...
├╴ patch_2/
│ ...
└╴ patch_N/
Patches#
Spatial patches serve two purposes in yet_another_wizz:
Dividing the an input catalog in spatial regions that can be used for jackknife resampling to estimate uncertainties when measuring correlations.
Dividing the an input catalog in smaller chunks that can be loaded on-demand to reduce the memory footprint of the application.
Both should be taken into account when designing the patch layout of the input catalogs.
Caution
It is crucial to use exactly the same patches for all input catalogs when
measuring correlations with autocorrelate() or
crosscorrelate(). This ensures that the patches correctly
identified and properly aligned when running the pair counting.
A patch is defined by a center point coordinate and a radius, the minimum
enclosing circle given the center point. There are three ways to define a set
of patches for a catalog (see e.g. Catalog.from_file):
Providing a list of center coordinates for each patch:
Data points are assigned to the nearest patch center, similar to a Voronoi diagram. This method is useful to ensure that multiple catalogs share the same patch centers, as is required for correlation measurements.
Providing patch indices alongside the input data:
The patch indices are used to assign data points to patches. It is assumed that the set of patch indices is contiguous and counting from 0.
Automatically generating the patches:
A sparse sample of the input catalog is read and treecorr is used to compute a spcified number of patch centers. Afterwards, use the patch centers and assign objects to the nearest center as above.
Note
Patch centers created with treecorr are not deterministic and will differ every time they are constructed.