yaw.catalogs.NewCatalog
- class yaw.catalogs.NewCatalog(backend: str = 'scipy')
Bases: object
Factory class for data catalogues implemented by the backends.
A catalogue provides all the functionality to compute pair counts for correlation measurements by implementing an interface to the object positions, spatial patches for error estimation, and data management if the data is cached on disk. Aside from accessing the data directly, the most important methods are correlate() (pair counting) and true_redshifts() (redshift histogram, if redshifts are provided).
A new catalogue can be created using an instance of this factory class. The sole argument is the name of the backend for which catalogue instances should be produced. For example
>>> yaw.NewCatalog("scipy")
NewCatalog<scipy>()
is the default factory, which produces catalogues for the scipy backend through its constructor methods.
A key concept is caching, which can be used to reduce memory usage or even speed up the computation for some backends. A cache directory is a directory in which temporary data is stored in different formats (depending on the backend), such that parts of the data (typically individual spatial patches) can be read back into memory on demand.
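The caching idea can be illustrated with a minimal, backend-independent sketch. It is not how yaw stores its data: the JSON file format, the `PatchCache` class, and its method names are all hypothetical, chosen only to show patches being written to an existing directory and read back one at a time.

```python
import json
import tempfile
from pathlib import Path


class PatchCache:
    """Hypothetical helper: one JSON file per spatial patch (not yaw's format)."""

    def __init__(self, cache_directory):
        self.path = Path(cache_directory)
        if not self.path.is_dir():
            raise FileNotFoundError(f"cache directory must exist: {self.path}")

    def _patch_file(self, patch_id):
        return self.path / f"patch_{patch_id}.json"

    def store(self, patch_id, objects):
        # Write the patch to disk so it no longer has to be held in memory.
        with open(self._patch_file(patch_id), "w") as f:
            json.dump(objects, f)

    def load(self, patch_id):
        # Read a single patch back into memory on demand.
        with open(self._patch_file(patch_id)) as f:
            return json.load(f)

    def patch_ids(self):
        # Recover the patch indices from the file names in the directory.
        return sorted(int(p.stem.split("_")[1]) for p in self.path.glob("patch_*.json"))


with tempfile.TemporaryDirectory() as tmp:
    cache = PatchCache(tmp)
    cache.store(0, [{"ra": 10.0, "dec": -5.0}])
    cache.store(1, [{"ra": 200.0, "dec": 30.0}, {"ra": 201.0, "dec": 31.0}])
    ids = cache.patch_ids()
    sizes = {pid: len(cache.load(pid)) for pid in ids}
    print(ids, sizes)
```

Only the patch that is currently needed has to be in memory, which is the property the paragraph above describes.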
Warning
The scipy backend does not preserve the order of the input data, but instead groups objects by their spatial patch.
The treecorr backend currently does not support restoration from cache.
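The effect described in the first warning can be pictured with plain Python. This is a sketch of grouping by patch index in general, not the scipy backend's actual code, and the sample values are made up.

```python
from collections import defaultdict

# Input order of objects as (object_id, patch_index) pairs; values are made up.
objects = [("a", 1), ("b", 0), ("c", 1), ("d", 0)]

# Group the objects by their patch index.
by_patch = defaultdict(list)
for object_id, patch in objects:
    by_patch[patch].append(object_id)

# Concatenating the patches yields a different order than the input.
input_order = [object_id for object_id, _ in objects]
grouped_order = [object_id for patch in sorted(by_patch) for object_id in by_patch[patch]]
print(input_order)    # ['a', 'b', 'c', 'd']
print(grouped_order)  # ['b', 'd', 'a', 'c']
```

Per-object arrays computed outside the catalogue may therefore not line up with the catalogue's rows by position.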
Create a new catalogue factory.
- Parameters:
backend (str) – Name of the backend for which the catalog instances should be produced. For available options see backend.
Methods
__init__([backend]) – Create a new catalogue factory.
from_cache(cache_directory[, progress]) – Restore the catalogue from its cache directory.
from_dataframe(data, ra_name, dec_name, *[, ...]) – Build a catalogue from in-memory data.
from_file(filepath, patches, ra, dec, *[, ...]) – Build a catalogue from a data file.
- from_cache(cache_directory: str, progress: bool = False) -> BaseCatalog
Restore the catalogue from its cache directory.
- Parameters:
cache_directory (str) – Path to the cache directory.
progress (bool, optional) – Display a progress bar while restoring patches.
- Returns:
BaseCatalog
- from_dataframe(data: DataFrame, ra_name: str, dec_name: str, *, patch_name: str | None = None, patch_centers: BaseCatalog | Coordinate | None = None, n_patches: int | None = None, redshift_name: str | None = None, weight_name: str | None = None, cache_directory: str | None = None, progress: bool = False) -> BaseCatalog
Build a catalogue from in-memory data.
Specify the names of the required and optional columns in a pandas.DataFrame. Additional parameters control the creation of the spatial patches used for error estimates. Patches can be assigned based on a column in the data frame (patch_name), constructed from a set of existing patch centers (patch_centers), or generated with k-means clustering (n_patches).
- Parameters:
data (pandas.DataFrame) – Holds the catalog data.
ra_name (str) – Name of the column with right ascension data in degrees.
dec_name (str) – Name of the column with declination data in degrees.
- Keyword Arguments:
patch_name (str, optional) – Name of the column that specifies the patch index, i.e. assigns each object to a spatial patch. The index starts counting from 0 (see Spatial patches).
patch_centers (BaseCatalog, Coordinate, optional) – Assign objects to existing patch centers based on their coordinates. Must be either a different catalog instance or a vector of coordinates.
n_patches (int, optional) – Assign objects to a given number of patches, generated using k-means clustering.
redshift_name (str, optional) – Name of the column with point-redshift estimates.
weight_name (str, optional) – Name of the column with object weights.
cache_directory (str, optional) – Path to the directory used to cache patch data, which must exist (see Caching). If provided, patch data is automatically unloaded from memory.
progress (bool, optional) – Display a progress bar while creating patches.
Note
One of patch_name, patch_centers, or n_patches is required.
Caching may significantly speed up parallel computations (e.g. correlate()). Accessing data attributes will trigger loading of the cached data as long as the catalog remains in the unloaded state (see load() and unload()).
The underlying patch data can be accessed by indexing and iterating over the Catalog instance.
Note
TODO: Provide an example.
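As a conceptual illustration of the patch_centers option, the sketch below assigns each object to the nearest of a set of existing centers by angular separation. It assumes nearest-center assignment on the sphere; the function names are hypothetical, the coordinates are made up, and this is not yaw's implementation.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians; inputs in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))


def assign_patches(objects, centers):
    """Return, for each (ra, dec) object, the index of the closest center."""
    return [
        min(range(len(centers)), key=lambda i: angular_separation(ra, dec, *centers[i]))
        for ra, dec in objects
    ]


# Hypothetical patch centers as (ra, dec) in degrees; patch indices start at 0.
centers = [(10.0, 0.0), (200.0, 45.0)]
# Hypothetical objects, including one just across the RA = 0/360 wrap.
objects = [(12.0, 1.0), (198.0, 44.0), (359.0, -2.0)]
patch_ids = assign_patches(objects, centers)
print(patch_ids)  # [0, 1, 0]
```

Using the great-circle separation rather than a flat difference in right ascension keeps the object at RA = 359 degrees with the center at RA = 10 degrees.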
- from_file(filepath: str, patches: str | int | BaseCatalog | Coordinate, ra: str, dec: str, *, redshift: str | None = None, weight: str | None = None, sparse: int | None = None, cache_directory: str | None = None, file_ext: str | None = None, progress: bool = False, **kwargs) -> BaseCatalog
Build a catalogue from a data file.
Loads the input file and constructs the catalogue using the specified column names.
- Parameters:
filepath (str) – Path to the input data file.
patches (str, int, BaseCatalog, Coordinate) – Specifies the construction of patches. If str, patch indices are read from the column of that name in the file. If int, generates this number of patches. Otherwise, assigns objects based on existing patch centers from a catalog instance or a coordinate vector.
ra (str) – Name of the column with right ascension data in degrees.
dec (str) – Name of the column with declination data in degrees.
- Keyword Arguments:
redshift (str, optional) – Name of the column with point-redshift estimates.
weight (str, optional) – Name of the column with object weights.
sparse (int, optional) – Load every N-th row of the input data.
cache_directory (str, optional) – Path to the directory used to cache patch data, which must exist (see Caching). If provided, patch data is automatically unloaded from memory.
file_ext (str, optional) – Hint for the input file type, if an uncommon file extension is used.
progress (bool, optional) – Display a progress bar while creating patches.
- Returns:
BaseCatalog
Note
Currently, the following file extensions are recognised automatically:
FITS: .fits, .cat
CSV: .csv
HDF5: .hdf5, .h5
Parquet: .pqt, .parquet
Feather: .feather
Otherwise provide the appropriate extension (including the dot) in the file_ext argument.
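The lookup described in this note can be sketched as a simple mapping. The table copies the extensions listed in the note; the function name `guess_file_type`, the case-insensitive matching, and the error handling are assumptions made for illustration and are not taken from yaw.

```python
from pathlib import Path

# Extensions listed in the note, mapped to their file type.
KNOWN_EXTENSIONS = {
    ".fits": "FITS", ".cat": "FITS",
    ".csv": "CSV",
    ".hdf5": "HDF5", ".h5": "HDF5",
    ".pqt": "Parquet", ".parquet": "Parquet",
    ".feather": "Feather",
}


def guess_file_type(filepath, file_ext=None):
    """Hypothetical helper: pick the file type from file_ext or the path suffix."""
    # An explicit file_ext (including the dot) takes precedence over the path.
    ext = file_ext if file_ext is not None else Path(filepath).suffix
    try:
        # Assumption: extensions are matched case-insensitively.
        return KNOWN_EXTENSIONS[ext.lower()]
    except KeyError:
        raise ValueError(f"unrecognised extension {ext!r}, pass file_ext") from None


print(guess_file_type("survey/data.h5"))                     # HDF5
print(guess_file_type("survey/data.tbl", file_ext=".fits"))  # FITS
```

The second call shows the case the note addresses: a file with an uncommon extension is read as FITS because the hint is supplied explicitly.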