yaw.catalogs.NewCatalog#

class yaw.catalogs.NewCatalog(backend: str = 'scipy')[source]#

Bases: object

Factory class for data catalogues implemented by the backends.

A catalogue provides all the functionality needed to compute pair counts for correlation measurements by implementing an interface to the object positions, spatial patches for error estimation, and data management if the data is cached on disk. Aside from direct data access, the most important methods are correlate() (pair counting) and true_redshifts() (redshift histogram, if redshifts are provided).

A new catalogue can be created using an instance of this factory class. The sole argument is the name of the backend for which catalogue instances should be produced. For example

>>> yaw.NewCatalog("scipy")
NewCatalog<scipy>()

is the default factory, which produces catalogues for the scipy backend through its constructor methods.

A key concept is caching, which can be used to reduce memory usage or even speed up the computation for some backends. A cache directory is a directory in which temporary data is stored in different formats (depending on the backend), such that parts of the data (typically individual spatial patches) can be read back into memory on demand.

Warning

The scipy backend does not preserve the order of the input data, but instead groups objects by their spatial patch.
The treecorr backend does not currently support restoration from cache.
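
The reordering mentioned in the warning can be pictured with a small sketch. It is not the backend's actual implementation: the helper name `group_by_patch` and the ascending patch order are assumptions made for illustration only. Keeping the original row index next to each object is one way to map results back to the input order.

```python
from collections import defaultdict


def group_by_patch(patch_ids):
    """Return the original row indices, reordered so that rows sharing a
    patch index are contiguous (hypothetical helper, not part of yaw)."""
    groups = defaultdict(list)
    for row, patch_id in enumerate(patch_ids):
        groups[patch_id].append(row)
    # Assumption: patches are visited in ascending index order.
    return [row for patch_id in sorted(groups) for row in groups[patch_id]]


patch_ids = [2, 0, 1, 0, 2]
order = group_by_patch(patch_ids)
print(order)  # [1, 3, 2, 0, 4]
```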

Create a new catalogue factory.

Parameters: backend (str) – Name of the backend for which catalogue instances should be produced. For available options see backend.

Methods

`__init__`([backend])	Create a new catalogue factory.
`from_cache`(cache_directory[, progress])	Restore the catalogue from its cache directory.
`from_dataframe`(data, ra_name, dec_name, *[, ...])	Build a catalogue from in-memory data.
`from_file`(filepath, patches, ra, dec, *[, ...])	Build catalogue from data file.

from_cache(cache_directory: str, progress: bool = False) → BaseCatalog[source]#

Restore the catalogue from its cache directory.

Parameters:

cache_directory (str) – Path to the cache directory.
progress (bool, optional) – Display a progress bar while restoring patches.

Returns:

BaseCatalog
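
A guarded call could look like the sketch below. It relies only on the constructor and from_cache() signatures documented on this page; the wrapper `restore_catalog` and its two checks are hypothetical, and yaw is imported lazily inside the function so that the checks themselves run without the package installed.

```python
import os


def restore_catalog(cache_directory, backend="scipy", progress=False):
    """Restore a catalogue from its cache directory (hypothetical wrapper)."""
    # The treecorr backend does not currently support restoration from cache.
    if backend == "treecorr":
        raise NotImplementedError("treecorr catalogues cannot be restored from cache")
    if not os.path.isdir(cache_directory):
        raise FileNotFoundError(f"cache directory not found: {cache_directory}")

    import yaw  # imported lazily; requires the yaw package to be installed

    factory = yaw.NewCatalog(backend)
    return factory.from_cache(cache_directory, progress=progress)
```

Calling `restore_catalog("path/to/cache", progress=True)` would then return the restored catalogue, provided the directory was written earlier by the same backend.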

from_dataframe(data: DataFrame, ra_name: str, dec_name: str, *, patch_name: str | None = None, patch_centers: BaseCatalog | Coordinate | None = None, n_patches: int | None = None, redshift_name: str | None = None, weight_name: str | None = None, cache_directory: str | None = None, progress: bool = False) → BaseCatalog[source]#

Build a catalogue from in-memory data.

Specify the names of the required and optional columns in a pandas.DataFrame. Additional parameters control the creation of spatial patches used for error estimates. Patches can be assigned based on a column in the data frame (patch_name), constructed from a set of existing patch centers (patch_centers), or generated with k-means clustering (n_patches).

Parameters:

data (pandas.DataFrame) – Holds the catalog data.
ra_name (str) – Name of the column with right ascension data in degrees.
dec_name (str) – Name of the column with declination data in degrees.

Keyword Arguments:

patch_name (str, optional) – Name of the column that specifies the patch index, i.e. assigning each object to a spatial patch. Index starts counting from 0 (see Spatial patches).
patch_centers (BaseCatalog, Coordinate, optional) – Assign objects to existing patch centers based on their coordinates. Must be either a different catalog instance or a vector of coordinates.
n_patches (int, optional) – Assign objects to a given number of patches, generated using k-means clustering.
redshift_name (str, optional) – Name of the column with point-redshift estimates.
weight_name (str, optional) – Name of the column with object weights.
cache_directory (str, optional) – Path to the directory used to cache patch data; the directory must exist (see Caching). If provided, patch data is automatically unloaded from memory.
progress (bool, optional) – Display a progress bar while creating patches.

Note

One of patch_name, patch_centers, or n_patches is required.

Caching may significantly speed up parallel computations (e.g. correlate()). Accessing data attributes will trigger loading cached data as long as the catalog remains in the unloaded state (see load() and unload()).

The underlying patch data can be accessed by indexing or iterating over the catalogue instance.
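
The requirement on the three patch options can be checked before calling the constructor. The helper below is a hypothetical sketch, not part of yaw; it only reports which options were supplied and makes no assumption about which one takes precedence if several are given.

```python
def provided_patch_options(patch_name=None, patch_centers=None, n_patches=None):
    """Return the names of the patch options that were given and raise if
    none was (hypothetical helper, not part of yaw)."""
    options = {
        "patch_name": patch_name,
        "patch_centers": patch_centers,
        "n_patches": n_patches,
    }
    provided = [name for name, value in options.items() if value is not None]
    if not provided:
        raise ValueError("one of patch_name, patch_centers, or n_patches is required")
    return provided


print(provided_patch_options(n_patches=8))  # ['n_patches']
```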

Note

TODO: Provide an example.
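
One possible example, sketched under assumptions: the column names "ra", "dec" and "z", the mock data generator, and the wrapper `build_catalog` are all hypothetical. The call itself uses only the from_dataframe() parameters documented above, and pandas and yaw are imported lazily so the mock data can be generated without either package.

```python
import math
import random


def make_mock_columns(n_objects, seed=0):
    """Draw points uniformly on the sphere plus mock redshifts (illustrative data only)."""
    rng = random.Random(seed)
    ra = [rng.uniform(0.0, 360.0) for _ in range(n_objects)]
    # Sampling uniformly in sin(dec) gives a uniform surface density on the sphere.
    dec = [math.degrees(math.asin(rng.uniform(-1.0, 1.0))) for _ in range(n_objects)]
    z = [rng.uniform(0.1, 1.0) for _ in range(n_objects)]
    return {"ra": ra, "dec": dec, "z": z}


def build_catalog(columns, n_patches=8, backend="scipy"):
    """Build a catalogue with k-means patches (requires pandas and yaw)."""
    import pandas as pd
    import yaw

    data = pd.DataFrame(columns)
    factory = yaw.NewCatalog(backend)
    return factory.from_dataframe(
        data,
        ra_name="ra",        # right ascension in degrees
        dec_name="dec",      # declination in degrees
        n_patches=n_patches,  # patches generated with k-means clustering
        redshift_name="z",
    )


columns = make_mock_columns(1000)
```

With both packages installed, `build_catalog(columns)` would return a catalogue whose patches are created from the mock positions.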

from_file(filepath, patches, ra, dec, *[, ...]) → BaseCatalog

Build catalogue from data file.

Loads the input file and constructs the catalogue using the specified column names.

Parameters:

filepath (str) – Path to the input data file.
patches (str, int, BaseCatalog, Coordinate) – Specifies the construction of patches. If str, patch indices are read from the file column with this name. If int, this number of patches is generated. Otherwise, objects are assigned to existing patch centers given as a catalog instance or a coordinate vector.
ra (str) – Name of the column with right ascension data in degrees.
dec (str) – Name of the column with declination data in degrees.

Keyword Arguments:

redshift (str, optional) – Name of the column with point-redshift estimates.
weight (str, optional) – Name of the column with object weights.
sparse (int, optional) – Load every N-th row of the input data.
cache_directory (str, optional) – Path to the directory used to cache patch data; the directory must exist (see Caching). If provided, patch data is automatically unloaded from memory.
file_ext (str, optional) – Hint for the input file type, if an uncommon file extension is used.
progress (bool, optional) – Display a progress bar while creating patches.

Returns:

BaseCatalog
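
How the type of the patches argument selects the patch construction can be summarised in a short sketch. The function `describe_patches_argument` is hypothetical and only mirrors the parameter description above; it is not how the library itself dispatches.

```python
def describe_patches_argument(patches):
    """Describe how a given 'patches' value would be interpreted
    (hypothetical helper, not part of yaw)."""
    if isinstance(patches, str):
        return f"read patch indices from column '{patches}'"
    if isinstance(patches, int):
        return f"generate {patches} patches"
    # Anything else is expected to be a catalog instance or coordinate vector.
    return "assign objects to existing patch centers"


print(describe_patches_argument("patch"))  # read patch indices from column 'patch'
print(describe_patches_argument(32))       # generate 32 patches
```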

Note

Currently, the following file extensions are recognised automatically:

FITS: .fits, .cat
CSV: .csv
HDF5: .hdf5, .h5
Parquet: .pqt, .parquet
Feather: .feather

Otherwise provide the appropriate extension (including the dot) in the file_ext argument.
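
The lookup described in this note can be sketched as follows. The table is copied from the list above, while the function `guess_file_format`, its exact-match comparison, and its error message are assumptions for illustration rather than the library's own code.

```python
import os

# Extensions listed in the note above, mapped to their file format.
KNOWN_EXTENSIONS = {
    ".fits": "FITS",
    ".cat": "FITS",
    ".csv": "CSV",
    ".hdf5": "HDF5",
    ".h5": "HDF5",
    ".pqt": "Parquet",
    ".parquet": "Parquet",
    ".feather": "Feather",
}


def guess_file_format(filepath, file_ext=None):
    """Resolve the file format from file_ext if given, else from the path
    (hypothetical helper, not part of yaw)."""
    ext = file_ext if file_ext is not None else os.path.splitext(filepath)[1]
    if ext not in KNOWN_EXTENSIONS:
        raise ValueError(f"unrecognised file extension: {ext!r}")
    return KNOWN_EXTENSIONS[ext]


print(guess_file_format("data/sample.fits"))           # FITS
print(guess_file_format("sample.dat", file_ext=".h5"))  # HDF5
```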