yaw.Catalog

class yaw.Catalog(cache_directory: Path | str, *, max_workers: int | None = None)

Bases: Mapping[int, Patch]

A container for catalog data.

Catalogs are the core data structure for managing point data catalogs. Besides right ascension and declination coordinates, catalogs may have additional per-object weights and redshifts.

Catalogs are divided into spatial Patches, each of which caches a portion of the data on disk to minimise the memory footprint when dealing with large data sets. This allows the data to be processed patch by patch, loading data from disk only when it is needed. Additionally, the patches are used to estimate uncertainties using jackknife resampling.
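As a rough illustration of the jackknife idea mentioned above, the following sketch computes a delete-one jackknife error from one value per patch. It is plain Python and does not use yaw; the function name `jackknife_std` and the sample numbers are hypothetical, and the estimator that yaw applies to its correlation measurements may differ in detail.

```python
import math


def jackknife_std(patch_values):
    """Delete-one jackknife standard error of the mean of per-patch values."""
    n = len(patch_values)
    if n < 2:
        raise ValueError("need at least two patches")
    total = sum(patch_values)
    # leave-one-out estimates: recompute the mean with each patch removed
    loo = [(total - value) / (n - 1) for value in patch_values]
    loo_mean = sum(loo) / n
    variance = (n - 1) / n * sum((x - loo_mean) ** 2 for x in loo)
    return math.sqrt(variance)


values = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-patch measurements
error = jackknife_std(values)
```

With more patches there are more leave-one-out samples, which is one reason the number of patches matters for the uncertainty estimate.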

Note

The number of patches should be sufficiently large to support the redshift binning used for correlation measurements. The number of patches is also a trade-off between runtime and memory footprint during correlation measurements.

The cached data is organised in a single directory, with one sub-directory for each spatial Patch:

[cache_directory]/
  ├╴ patch_ids.bin  # list of patch IDs for this catalog
  ├╴ patch_0/
  │    └╴ ...  # patch data
  ├╴ patch_1/
  │  ...
  └╴ patch_N/
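The layout above can be inspected with standard-library tools. The sketch below builds a mock directory with the same naming scheme and lists the patch sub-directories in numeric order. The helper `list_patch_dirs` is hypothetical and only looks at directory names; it does not read `patch_ids.bin`, whose binary format is not described here. To actually load a cache, use `Catalog(cache_directory)`.

```python
import tempfile
from pathlib import Path


def list_patch_dirs(cache_directory):
    """Return patch sub-directories sorted by the integer ID in their name."""
    root = Path(cache_directory)
    dirs = [p for p in root.glob("patch_*") if p.is_dir()]
    return sorted(dirs, key=lambda p: int(p.name.split("_", 1)[1]))


# build a mock cache layout in a temporary directory (no real catalog data)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "patch_ids.bin").touch()
    for patch_id in (0, 1, 10, 2):
        (root / f"patch_{patch_id}").mkdir()
    names = [p.name for p in list_patch_dirs(root)]
```

Sorting by the parsed integer rather than by name keeps `patch_10` after `patch_2`.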

Caution

Empty patches are currently not supported and the catalog creation will fail if a patch without any data is encountered (e.g. if the input catalog is too sparse or inhomogeneous).

Parameters:

cache_directory – The cache directory to use for this catalog; it must exist and contain a valid catalog cache.

Keyword Arguments:

max_workers – Limit the number of parallel workers for this operation (defaults to all available; only applies when using multiprocessing).

Methods

__init__(cache_directory, *[, max_workers])

build_trees([binning, closed, leafsize, ...])

Build binary search trees for each patch.

from_dataframe(cache_directory, dataframe, ...)

Create a new catalog instance from a pandas.DataFrame.

from_file(cache_directory, path, *, ra_name, ...)

Create a new catalog instance from a data file.

from_random(cache_directory, generator, ...)

Create a new catalog instance from uniform random data points.

get(key[, default])

Return the Patch for the given ID if it exists, else default.

get_centers()

Get the center coordinates of the patches.

get_num_records()

Get the number of records in each patch.

get_radii()

Get the radii of the patches.

get_sum_weights()

Get the sum of weights of the patches.

items()

A set-like object providing a view of (key, value) pairs.

keys()

A set-like object providing a view of all patch IDs.

values()

A set-like object providing a view of all Patches.

Attributes

cache_directory

has_redshifts

Whether redshifts are available.

has_weights

Whether weights are available.

num_patches

The number of patches of this catalog.

build_trees(binning: NDArray | None = None, *, closed: Closed | str = Closed.right, leafsize: int = 16, force: bool = False, progress: bool = False, max_workers: int | None = None) -> None

Build binary search trees for each patch.

The trees are cached in the patches’ cache directory and can be retrieved through yaw.trees.BinnedTrees(patch).

Parameters:

binning – Optional array with redshift bin edges to apply to the data before building trees.

Keyword Arguments:
  • closed – Indicating which side of the bin edges is a closed interval, see Closed for valid options.

  • leafsize – Leaf size to use when building the trees.

  • force – Whether to overwrite any existing, cached trees.

  • progress – Show a progress bar on the terminal (disabled by default).

  • max_workers – Limit the number of parallel workers for this operation (defaults to all available; only applies when using multiprocessing).
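To make the effect of closed concrete, here is a minimal sketch of assigning a redshift to a bin for right-closed and left-closed intervals. It is plain Python and independent of yaw; the helper `bin_index` is hypothetical, the value "left" is assumed to be the counterpart of the documented default, and how yaw handles objects outside the binning is not specified here.

```python
from bisect import bisect_left, bisect_right


def bin_index(edges, z, closed="right"):
    """Return the index of the bin containing z, or None if z falls outside."""
    if closed == "right":  # bins are (lo, hi]
        idx = bisect_left(edges, z) - 1
    elif closed == "left":  # bins are [lo, hi)
        idx = bisect_right(edges, z) - 1
    else:
        raise ValueError(f"invalid closed={closed!r}")
    return idx if 0 <= idx < len(edges) - 1 else None


edges = [0.1, 0.4, 0.7, 1.0]  # hypothetical redshift bin edges
on_edge_right = bin_index(edges, 0.4, closed="right")
on_edge_left = bin_index(edges, 0.4, closed="left")
```

The choice only matters for values that fall exactly on a bin edge: such a value belongs to the lower bin when the right side is closed and to the upper bin when the left side is closed.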

classmethod from_dataframe(cache_directory: Path | str, dataframe: DataFrame, *, ra_name: str, dec_name: str, weight_name: str | None = None, redshift_name: str | None = None, patch_centers: AngularCoordinates | Catalog | None = None, patch_name: str | None = None, patch_num: int | None = None, kappa_name: str | None = None, degrees: bool = True, overwrite: bool = False, progress: bool = False, max_workers: int | None = None, chunksize: int | None = None, probe_size: int = -1, **reader_kwargs) -> Self

Create a new catalog instance from a pandas.DataFrame.

Assign objects from the input data frame to spatial patches, write the patches to a cache on disk, and compute the patch metadata.
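One way to picture the assignment step is a nearest-center rule on the sphere. The sketch below is plain Python and makes that assumption explicit: the documentation does not state the rule yaw uses, and the helpers `to_unit_vector` and `assign_patches` as well as the coordinates are hypothetical.

```python
import math


def to_unit_vector(ra_deg, dec_deg):
    """Convert right ascension and declination in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def assign_patches(points, centers):
    """Assign each (ra, dec) point to the index of the nearest center."""
    center_vecs = [to_unit_vector(*center) for center in centers]
    patch_ids = []
    for point in points:
        vec = to_unit_vector(*point)
        # the largest dot product corresponds to the smallest angular separation
        dots = [sum(a * b for a, b in zip(vec, cv)) for cv in center_vecs]
        patch_ids.append(max(range(len(dots)), key=dots.__getitem__))
    return patch_ids


centers = [(10.0, 0.0), (50.0, 0.0)]  # assumed patch centers in degrees
points = [(12.0, 1.0), (48.0, -2.0), (29.0, 0.0)]
patch_ids = assign_patches(points, centers)
```

Comparing dot products of unit vectors avoids evaluating an inverse trigonometric function for every pair of point and center.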

Note

One of the optional patch creation arguments (patch_centers, patch_name, or patch_num) must be provided.

Parameters:
  • cache_directory – The cache directory to use for this catalog. Created automatically or overwritten if requested.

  • dataframe – The input data frame. May also be an object that supports mapping from string (column name) to data (numpy array-like).

Keyword Arguments:
  • ra_name – Column name in the data frame for right ascension.

  • dec_name – Column name in the data frame for declination.

  • weight_name – Optional column name in the data frame for weights.

  • redshift_name – Optional column name in the data frame for redshifts.

  • patch_centers – A list of patch centers to use when creating the patches. Can be either AngularCoordinates or another Catalog as reference.

  • patch_name – Optional column name in the data frame for a column with integer patch indices. Indices must be contiguous and start from 0. Ignored if patch_centers is given.

  • patch_num – The number of patches to create. Patch centers are computed automatically from a sparse sample of the input data using treecorr, which requires an additional scan of the input file to read a sparse sampling of the object coordinates. Ignored if patch_centers or patch_name is given.

  • kappa_name – Optional column name in the data frame for kappa (or other scalar field).

  • degrees – Whether the input coordinates are given in degrees (default).

  • overwrite – Whether to overwrite an existing catalog at the given cache location. If the directory is not a valid catalog, a FileExistsError is raised.

  • progress – Show a progress bar on the terminal (disabled by default).

  • max_workers – Limit the number of parallel workers for this operation (defaults to all available; only applies when using multiprocessing).

  • chunksize – The maximum number of records to load into memory at once when processing the input file in chunks.

  • probe_size – The approximate number of records to read when generating patch centers (patch_num).

Returns:

A new catalog instance.

Raises:

FileExistsError – If the cache directory already exists and overwrite is not set, or if it is not a valid catalog when providing overwrite=True.

classmethod from_file(cache_directory: Path | str, path: Path | str, *, ra_name: str, dec_name: str, weight_name: str | None = None, redshift_name: str | None = None, patch_centers: AngularCoordinates | Catalog | None = None, patch_name: str | None = None, patch_num: int | None = None, kappa_name: str | None = None, degrees: bool = True, overwrite: bool = False, progress: bool = False, max_workers: int | None = None, chunksize: int | None = None, probe_size: int = -1, **reader_kwargs) -> Self

Create a new catalog instance from a data file.

Process the input file in chunks, assign objects to spatial patches, write the patches to a cache on disk, and compute the patch metadata. Supported file formats are FITS, Parquet, and HDF5.
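The chunked processing described above can be pictured with a small generic sketch: records are consumed in blocks of at most chunksize, so only one block is held in memory while per-patch results accumulate. This is plain Python and not yaw's reader code; `iter_chunks`, `count_per_patch`, and the record layout are hypothetical.

```python
from itertools import islice


def iter_chunks(records, chunksize):
    """Yield lists of at most chunksize records from any iterable."""
    iterator = iter(records)
    while chunk := list(islice(iterator, chunksize)):
        yield chunk


def count_per_patch(records, chunksize):
    """Accumulate per-patch record counts while holding one chunk in memory."""
    counts = {}
    for chunk in iter_chunks(records, chunksize):
        for patch_id, _ra, _dec in chunk:
            counts[patch_id] = counts.get(patch_id, 0) + 1
    return counts


# hypothetical records: (patch_id, ra, dec)
records = [(0, 10.0, 1.0), (1, 50.0, 2.0), (0, 11.0, 0.5), (1, 49.0, -1.0), (0, 9.5, 0.0)]
chunks = list(iter_chunks(records, 2))
counts = count_per_patch(records, 2)
```

A smaller chunksize lowers peak memory use at the cost of more iterations over the input.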

Note

One of the optional patch creation arguments (patch_centers, patch_name, or patch_num) must be provided.

Parameters:
  • cache_directory – The cache directory to use for this catalog. Created automatically or overwritten if requested.

  • path – The path to the input data file.

Keyword Arguments:
  • ra_name – Column or path name in the file for right ascension.

  • dec_name – Column or path name in the file for declination.

  • weight_name – Optional column or path name in the file for weights.

  • redshift_name – Optional column or path name in the file for redshifts.

  • patch_centers – A list of patch centers to use when creating the patches. Can be either AngularCoordinates or another Catalog as reference.

  • patch_name – Optional column or path name for a column with integer patch indices. Indices must be contiguous and start from 0. Ignored if patch_centers is given.

  • patch_num – The number of patches to create. Patch centers are computed automatically from a sparse sample of the input data using treecorr, which requires an additional scan of the input file to read a sparse sampling of the object coordinates. Ignored if patch_centers or patch_name is given.

  • kappa_name – Optional column or path name in the file for kappa (or other scalar field).

  • degrees – Whether the input coordinates are given in degrees (default).

  • overwrite – Whether to overwrite an existing catalog at the given cache location. If the directory is not a valid catalog, a FileExistsError is raised.

  • progress – Show a progress bar on the terminal (disabled by default).

  • max_workers – Limit the number of parallel workers for this operation (defaults to all available; only applies when using multiprocessing).

  • chunksize – The maximum number of records to load into memory at once when processing the input file in chunks.

  • probe_size – The approximate number of records to read when generating patch centers (patch_num).

Returns:

A new catalog instance.

Raises:

FileExistsError – If the cache directory already exists and overwrite is not set, or if it is not a valid catalog when providing overwrite=True.

Additional reader keyword arguments are passed on to the file reader class constructor.

classmethod from_random(cache_directory: Path | str, generator: RandomsBase, num_randoms: int, *, patch_centers: AngularCoordinates | Catalog | None = None, patch_num: int | None = None, overwrite: bool = False, progress: bool = False, max_workers: int | None = None, chunksize: int | None = None, probe_size: int = -1) -> Self

Create a new catalog instance from uniform random data points.

Generate a catalog from uniform random data points in chunks, assign objects to spatial patches, write the patches to a cache on disk, and compute the patch metadata.

The generator object must be created separately by the user.
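For intuition about uniform randoms on the sky, the sketch below draws points that are uniform per unit solid angle inside a rectangular footprint and produces them in chunks. It uses only the standard library and is not a RandomsBase implementation; the function names, the footprint, and the seed are hypothetical, and yaw's own generators may work differently.

```python
import math
import random


def uniform_randoms(rng, num, ra_min, ra_max, dec_min, dec_max):
    """Draw num (ra, dec) points in degrees, uniform in area within a box."""
    sin_min = math.sin(math.radians(dec_min))
    sin_max = math.sin(math.radians(dec_max))
    points = []
    for _ in range(num):
        ra = rng.uniform(ra_min, ra_max)
        # sampling uniformly in sin(dec) gives uniform density per solid angle
        dec = math.degrees(math.asin(rng.uniform(sin_min, sin_max)))
        points.append((ra, dec))
    return points


def generate_in_chunks(rng, num_randoms, chunksize, **box):
    """Yield chunks of random points until num_randoms have been produced."""
    remaining = num_randoms
    while remaining > 0:
        size = min(chunksize, remaining)
        yield uniform_randoms(rng, size, **box)
        remaining -= size


rng = random.Random(12345)
box = dict(ra_min=0.0, ra_max=30.0, dec_min=-10.0, dec_max=10.0)
chunks = list(generate_in_chunks(rng, 2500, 1000, **box))
```

Sampling declination uniformly in degrees instead would over-populate regions near the poles, which is why the sine transform is used.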

Note

One of the optional patch creation arguments (patch_centers or patch_num) must be provided (patch_name is not supported).

Parameters:
  • cache_directory – The cache directory to use for this catalog. Created automatically or overwritten if requested.

  • generator – A random generator (RandomsBase) instance from which samples are drawn.

  • num_randoms – The number of randoms to generate.

Keyword Arguments:
  • patch_centers – A list of patch centers to use when creating the patches. Can be either AngularCoordinates or another Catalog as reference.

  • patch_num – The number of patches to create. Patch centers are computed automatically from an initial sample of random points using treecorr. Ignored if patch_centers is given.

  • overwrite – Whether to overwrite an existing catalog at the given cache location. If the directory is not a valid catalog, a FileExistsError is raised.

  • progress – Show a progress bar on the terminal (disabled by default).

  • max_workers – Limit the number of parallel workers for this operation (defaults to all available; only applies when using multiprocessing).

  • chunksize – The maximum number of records to generate and write at once.

  • probe_size – The number of initial random samples to draw when generating patch centers (patch_num).

Returns:

A new catalog instance.

Raises:

FileExistsError – If the cache directory already exists and overwrite is not set, or if it is not a valid catalog when providing overwrite=True.

get(key, default=None)

Return the Patch for the given ID if it exists, else default.

get_centers() -> AngularCoordinates

Get the center coordinates of the patches.

get_num_records() -> tuple[int, ...]

Get the number of records in each patch.

get_radii() -> AngularDistances

Get the radii of the patches.

get_sum_weights() -> tuple[float, ...]

Get the sum of weights of the patches.
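The per-patch accessors can be combined into a quick summary. The helper below is a hypothetical sketch that relies only on `num_patches`, `get_num_records()`, and `get_sum_weights()`, and it assumes both tuples list the patches in the same order. A small stub class stands in for a real catalog so that the snippet runs without any cached data.

```python
def patch_summary(catalog):
    """Summarise per-patch counts and weights using the documented accessors."""
    counts = catalog.get_num_records()
    weights = catalog.get_sum_weights()
    return {
        "num_patches": catalog.num_patches,
        "total_records": sum(counts),
        "total_weight": sum(weights),
        # counts are non-zero because empty patches are not supported
        "mean_weight_per_patch": [w / n for w, n in zip(weights, counts)],
    }


class StubCatalog:
    """Stand-in with the same accessor names, used only to exercise the helper."""

    num_patches = 3

    def get_num_records(self):
        return (4, 2, 5)

    def get_sum_weights(self):
        return (4.0, 3.0, 2.5)


summary = patch_summary(StubCatalog())
```

Passing an actual Catalog instance instead of the stub should work the same way, provided the ordering assumption holds.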

property has_redshifts: bool

Whether redshifts are available.

property has_weights: bool

Whether weights are available.

items()

A set-like object providing a view of (key, value) pairs.

keys()

A set-like object providing a view of all patch IDs.

property num_patches: int

The number of patches of this catalog.

values()

A set-like object providing a view of all Patches.
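Because Catalog derives from Mapping[int, Patch], methods such as get(), keys(), values(), and items() follow the standard mapping protocol. The sketch below shows how that protocol behaves for a minimal read-only mapping. The class `PatchMapping` is hypothetical, strings stand in for Patch objects, and the sorted iteration order is a choice made for this example rather than documented Catalog behaviour.

```python
from collections.abc import Mapping


class PatchMapping(Mapping):
    """Minimal read-only mapping from integer patch ID to a patch object."""

    def __init__(self, patches):
        self._patches = dict(patches)

    def __getitem__(self, patch_id):
        return self._patches[patch_id]

    def __iter__(self):
        return iter(sorted(self._patches))

    def __len__(self):
        return len(self._patches)


mapping = PatchMapping({1: "patch_1", 0: "patch_0"})
ids = list(mapping.keys())
missing = mapping.get(5)  # falls back to the default (None) for unknown IDs
pairs = list(mapping.items())
```

Only `__getitem__`, `__iter__`, and `__len__` need to be defined; the remaining methods are supplied by `collections.abc.Mapping`.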