Batch processing#
One of the key features of yaw_cli is the ability to track all input
parameters and input files and to record all processing steps applied to the
data in a project. This information is stored in the setup.yaml file. This
serves two purposes:
Reproduce the outputs from a single configuration file as long as the inputs are unchanged.
Run yaw_cli in a batch-processing mode from a single configuration file instead of running multiple subcommands (init, cross, zcc, etc.) manually.
yaw_cli run#
This batch-processing feature is implemented in the special subcommand
yaw_cli run. The command requires only two arguments, the name of the
output (project) directory, and the path to the setup.yaml file provided
with the --setup argument.
Optional arguments control the number of threads to use for parallel computing, the verbosity level of the command line output, and the location of a custom cache directory. More details can be obtained from the built-in help:
$ yaw_cli run --help
usage: yaw_cli run [-h] [-v] [--threads <int>] [--progress] [-d] -s <file>
                   [--config-from <file>] [--cache-path <path>]
                   <path>

Read a task list and configuration from a setup file (e.g. as generated by
'init'). Apply the tasks to the specified data samples.

positional arguments:
  <path>                project directory, must not exist

options:
  -h, --help            show this help message and exit
  -v, --verbose         show additional information in terminal, repeat to
                        show debug messages
  --threads <int>       number of threads to use (default: from configuration)
  --progress            show a progress bar if the backend supports it

setup configuration:
  select a setup file to run with optional modifcations

  -d, --dump            dump an empty setup file with default values to the
                        terminal
  -s <file>, --setup <file>
                        setup YAML file with configuration, input files and
                        task list
  --config-from <file>  load the 'configuration' section from this setup file
  --cache-path <path>   replace the 'data.cachepath' value in the setup file
Configuration file layout#
The setup.yaml configuration is a YAML file with three main sections,
named configuration, data and tasks. A configuration file with default
values, placeholders for file paths, and a list of all available tasks can be
generated as follows:
$ yaw_cli run --dump
# yet_another_wizz setup configuration

# NOTE: (opt) in commment indicates entries that may
#       be omitted.

# This section configures the correlation measurements
# and redshift binning of the clustering redshift
# estimates.
configuration:
    backend:                # (opt) backend specific parameters
        thread_num: null        # (opt) default number of threads to use
        crosspatch: true        # (opt) whether to count pairs across patch
                                # boundaries (scipy backend only)
        rbin_slop: 0.01         # (opt) TreeCorr 'rbin_slop' parameter
    binning:                # specify the redshift binning for the clustering
                            # redshifts
        zbins: null             # list of custom redshift bin edges, if method
                                # is set to 'manual'
        method: linear          # (opt) redshift binning method, 'logspace'
                                # means equal size in log(1+z) (comoving,
                                # linear, logspace)
        zmin: null              # lower redshift limit
        zmax: null              # upper redshift limit
        zbin_num: 30            # (opt) number of redshift bins
    scales:                 # specify the correlation measurement scales
        rmin: null              # (list of) lower scale limit in kpc
                                # (pyhsical)
        rmax: null              # (list of) upper scale limit in kpc
                                # (pyhsical)
        rweight: null           # (opt) weight galaxy pairs by their
                                # separation to power 'rweight'
        rbin_num: 50            # (opt) number of bins in log r used (i.e.
                                # resolution) to compute distance weights
    cosmology: Planck15     # (opt) cosmological model used for distance
                            # calculations (WMAP1, WMAP3, WMAP5, WMAP7, WMAP9,
                            # Planck13, Planck15, Planck18)

# This section defines the input data products and
# their meta data. These can be FITS, PARQUET, CSV or
# FEATHER files.
data:
    backend: scipy          # (opt) name of the data catalog backend (scipy,
                            # treecorr)
    cachepath: null         # (opt) cache directory path, e.g. on fast storage
                            # device (recommended for 'backend=scipy', default
                            # is within project directory)
    n_patches: null         # (opt) number of automatic spatial patches to use
                            # for input catalogs below, provide only if no
                            # 'data/rand.patches' provided
    reference:              # (opt) reference data sample with know redshifts
        data:                   # data catalog file and column names
            filepath: ...           # input file path
            ra: ra                  # right ascension in degrees
            dec: dec                # declination in degrees
            redshift: z             # redshift of objects (required)
            patches: patch          # (opt) integer index for patch
                                    # assignment, couting from 0...N-1
            weight: weight          # (opt) object weight
            cache: false            # (opt) whether to cache the file in the
                                    # cache directory
        rand: null              # random catalog for data sample, omit or
                                # repeat arguments from 'data' above
    unknown:                # (opt) unknown data sample for which clustering
                            # redshifts are estimated, typically in
                            # tomographic redshift bins, see below
        data:                   # data catalog file and column names
            filepath:               # either a single file path (no
                                    # tomographic bins) or a mapping of
                                    # integer bin index to file path (as shown
                                    # below)
                1: ...                  # bin 1
                2: ...                  # bin 2
            ra: ra                  # right ascension in degrees
            dec: dec                # declination in degrees
            redshift: z             # (opt) redshift of objects, if provided,
                                    # enables computing the autocorrelation of
                                    # the unknown sample
            patches: patch          # (opt) integer index for patch
                                    # assignment, couting from 0...N-1
            weight: weight          # (opt) object weight
            cache: false            # (opt) whether to cache the file in the
                                    # cache directory
        rand: null              # random catalog for data sample, omit or
                                # repeat arguments from 'data' above
                                # ('filepath' format must must match 'data'
                                # above)

# The section below is entirely optional and used to
# specify tasks to execute when using the 'yaw_cli
# run' command. The list is generated and updated
# automatically when running 'yaw_cli' subcommands.
# Tasks can be provided as single list entry, e.g.
#   - cross
#   - zcc
# to get a basic cluster redshift estimate or with the
# optional parameters listed below (all values
# optional, defaults listed).
tasks:
  - cross:                  # compute the crosscorrelation
      rr: false                 # compute random-random pair counts if both
                                # randoms are available
  - auto_ref:               # compute the reference sample autocorrelation for
                            # bias mitigation
      rr: true                  # do not compute random-random pair counts
  - auto_unk:               # compute the unknown sample autocorrelation for
                            # bias mitigation
      rr: true                  # do not compute random-random pair counts
  - ztrue                   # compute true redshift distributions for unknown
                            # data (requires point estimate)
  - drop_cache              # delete temporary data in cache directory, has no
                            # arguments
  - zcc:                    # compute clustering redshift estimates for the
                            # unknown data, task can be added repeatedly if
                            # different a 'tag' is used
      tag: fid                  # unique identifier for different
                                # configurations
      bias_ref: true            # whether to mitigate the reference sample
                                # bias using its autocorrelation function (if
                                # available)
      bias_unk: true            # whether to mitigate the unknown sample bias
                                # using its autocorrelation functions (if
                                # available)
      est_cross: null           # correlation estimator for crosscorrelations
                                # (PH, DP, HM, LS)
      est_auto: null            # correlation estimator for autocorrelations
                                # (PH, DP, HM, LS)
      method: jackknife         # resampling method for covariance estimates
                                # (jackknife, bootstrap)
      crosspatch: true          # whether to include cross-patch pair counts
                                # when resampling
      n_boot: 500               # number of bootstrap samples
      global_norm: false        # normalise pair counts globally instead of
                                # patch-wise
      seed: 12345               # random seed for bootstrap sample generation
  - plot                    # generate automatic check plots
Note
All parameters with a leading (opt) in their comment are optional and
can be omitted from the configuration file. The same applies to all items
listed in tasks.
Configuration#
This section maps one-to-one to a yaw.config.Configuration instance and
specifies the backend-related parameters, the correlation
measurement scales, and the redshift binning. The parameter descriptions in the
box above are mostly self-explanatory, but there is one peculiarity:
Note
The configuration of the redshift bins has two mutually exclusive parameter groups. The binning must be specified as either of:
binning.zbins, i.e. providing a list of bin edges, or
binning.zmin, binning.zmax, (binning.zbin_num, binning.method), i.e. providing parameters used to generate a binning automatically.
If both are provided, binning.zbins is ignored.
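As a rough sketch, the two alternatives might look like this in the configuration section. All numbers are illustrative placeholders rather than recommended values, and the scales entries that a complete file also needs are left out. Whether method has to be set explicitly when zbins is given is not stated here, so it is omitted in the first variant.

```yaml
# Variant 1: explicit bin edges (illustrative values only)
configuration:
    binning:
        zbins: [0.1, 0.3, 0.5, 0.8, 1.2]
---
# Variant 2: automatically generated binning (illustrative values only)
configuration:
    binning:
        zmin: 0.1
        zmax: 1.2
        zbin_num: 30      # (opt) default value from the template above
        method: linear    # (opt) default value from the template above
```

The --- separator only splits the two variants for display; an actual setup file would contain just one of them.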
Data#
This section specifies the input data files, split into two subsections, reference and unknown. Both subsections are optional, e.g. if no unknown sample is needed for the tasks to perform, the unknown subsection can be omitted.
Each of them contains two subsections called data and rand, which specify the data catalog and, optionally, the random catalog. While the data subsection is always required, the rand subsection can be omitted.
Note
Computing a crosscorrelation requires at least one of the two possible random samples (the rand subsection of reference or of unknown).
In each section, only the filepath, ra, and dec parameters are
required; the reference section additionally requires redshifts through the
redshift parameter. In the unknown section, filepath may also specify multiple
input files (e.g. different tomographic bins), but these must all have the
same column names. Instead of providing a single file path, provide a mapping of
subset / bin index to file path, e.g.
filepath:
    1: path/to/sample1
    5: path/to/sample5
instead of
filepath: path/to/sample
Note
Spatial patches, which are used for error and covariance estimation, must be
defined consistently for all input samples. Either use the n_patches
parameter to generate them automatically, or provide a column in the input
files with an integer patch index using the patches parameter, e.g.:
data:
    filepath: ...
    patches: name_of_patch_column
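Putting the pieces of this section together, a data section with a reference sample, its randoms, and two tomographic bins of unknown data could be sketched as follows. The file paths are hypothetical, the value of n_patches is purely illustrative, and the column names simply reuse the defaults from the dumped template.

```yaml
data:
    n_patches: 32                   # illustrative; omit if a 'patches' column is given instead
    reference:
        data:
            filepath: path/to/reference.fits          # hypothetical path
            ra: ra
            dec: dec
            redshift: z             # required for the reference sample
        rand:                       # repeats the arguments from 'data'
            filepath: path/to/reference_randoms.fits  # hypothetical path
            ra: ra
            dec: dec
            redshift: z
    unknown:
        data:
            filepath:               # mapping of bin index to file path
                1: path/to/unknown_bin1.fits          # hypothetical path
                2: path/to/unknown_bin2.fits          # hypothetical path
            ra: ra
            dec: dec
```

Only the reference randoms are given here, which covers the requirement of at least one random sample for the crosscorrelation.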
Tasks#
This section lists all tasks to be applied to the input data. The default
setup.yaml will contain all possible tasks with a listing of all parameter
default values. The setup.yaml in a project directory always contains a
correctly ordered list of tasks (see above), without any duplicates (i.e.
replacing existing entries with the most recent calls).
Every task and all task parameters are optional and can be omitted. For example,
tasks:
  - cross:
      rr: false
  - zcc
and
tasks:
  - cross
  - zcc
are equivalent, since rr: false is the default value. Note that the task
zcc can be repeated arbitrarily many times, as long as the tag names differ.
If the tag name is identical, only the last version is kept. For example,
tasks:
  - cross
  - auto_ref
  - zcc:
      tag: no_bias_mitigation
      bias_ref: false
  - zcc:
      tag: fid
will generate two redshift estimates, one called no_bias_mitigation, which
does not use the reference sample autocorrelation to mitigate galaxy bias and
one called fid, where the bias is mitigated. (Since fid is the default
tag, it is also possible to omit the last line entirely.)
Advanced usage#
The --config-from argument for yaw_cli run allows rerunning a previous
analysis setup (same input files and list of tasks) with the
configuration section taken from a different input file. This is particularly useful
if one only wishes to change the measurement scales or the redshift binning.
For example
yaw_cli run version2 -s version1/setup.yaml --config-from new_config.yaml
creates a new project directory called version2. The task list and input
files are taken from the setup file of an existing project called version1,
but the configuration section is read from new_config.yaml (ignoring
any other file contents).
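A minimal sketch of what new_config.yaml might contain is shown below. It assumes that a file holding only the configuration section is sufficient, which is suggested by the description above but not confirmed, and all numerical values are illustrative.

```yaml
# new_config.yaml (hypothetical contents, values are illustrative only)
configuration:
    scales:
        rmin: 100           # lower scale limit in kpc (physical)
        rmax: 1000          # upper scale limit in kpc (physical)
    binning:
        zmin: 0.1
        zmax: 1.2
        zbin_num: 20        # changed from the default of 30
```

Optional entries such as backend and cosmology are left out, so their defaults from the template would presumably apply.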