Batch processing#

One of the key features of yaw_cli is the ability to track all input parameters and input files and to record all processing steps applied to the data in a project. This information is stored in the setup.yaml file. This serves two purposes:

  • Reproduce the outputs from a single configuration file as long as the inputs are unchanged.

  • Run yaw_cli in a batch-processing mode from a single configuration file instead of running multiple subcommands (init, cross, zcc, etc.) manually.

yaw_cli run#

This batch-processing feature is implemented in the special subcommand yaw_cli run. The command requires only two arguments: the path of the output (project) directory, which must not exist yet, and the path to the setup.yaml file, provided with the --setup argument.

Optional arguments control the number of threads to use for parallel computing, the verbosity level of the command-line output, and the location of a custom cache directory. More details can be obtained from the built-in help:

$ yaw_cli run --help
usage: yaw_cli run [-h] [-v] [--threads <int>] [--progress] [-d] -s <file>
                   [--config-from <file>] [--cache-path <path>]
                   <path>

Read a task list and configuration from a setup file (e.g. as generated by
'init'). Apply the tasks to the specified data samples.

positional arguments:
  <path>                project directory, must not exist

options:
  -h, --help            show this help message and exit
  -v, --verbose         show additional information in terminal, repeat to
                        show debug messages
  --threads <int>       number of threads to use (default: from configuration)
  --progress            show a progress bar if the backend supports it

setup configuration:
  select a setup file to run with optional modifcations

  -d, --dump            dump an empty setup file with default values to the
                        terminal
  -s <file>, --setup <file>
                        setup YAML file with configuration, input files and
                        task list
  --config-from <file>  load the 'configuration' section from this setup file
  --cache-path <path>   replace the 'data.cachepath' value in the setup file
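
As a minimal sketch of driving the command from a script, the snippet below assembles an argument list from the options shown in the help text. The helper name build_run_command and the example paths are hypothetical; only the flags themselves (--setup, --threads, --cache-path, -v) are taken from the help output above.

```python
import shlex


def build_run_command(project_dir, setup_file, threads=None, cache_path=None, verbosity=0):
    """Assemble an argument list for `yaw_cli run` (hypothetical helper)."""
    cmd = ["yaw_cli", "run", str(project_dir), "--setup", str(setup_file)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if cache_path is not None:
        cmd += ["--cache-path", str(cache_path)]
    # -v can be repeated, e.g. twice to show debug messages
    cmd += ["-v"] * verbosity
    return cmd


cmd = build_run_command("my_project", "setup.yaml", threads=4, verbosity=1)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to execute the batch run, assuming yaw_cli is installed and on the PATH.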

Configuration file layout#

The setup.yaml configuration is a YAML file with three main sections named configuration, data, and tasks. A configuration file with default values, placeholders for file paths, and a list of all available tasks can be generated as follows:

$ yaw_cli run --dump
# yet_another_wizz setup configuration

# NOTE: (opt) in commment indicates entries that may
# be omitted.

# This section configures the correlation measurements
# and redshift binning of the clustering redshift
# estimates.
configuration:
    backend:                # (opt) backend specific parameters
        thread_num: null        # (opt) default number of threads to use
        crosspatch: true        # (opt) whether to count pairs across patch
                                # boundaries (scipy backend only)
        rbin_slop: 0.01         # (opt) TreeCorr 'rbin_slop' parameter
    binning:                # specify the redshift binning for the clustering
                            # redshifts
        zbins: null             # list of custom redshift bin edges, if method
                                # is set to 'manual'
        method: linear          # (opt) redshift binning method, 'logspace'
                                # means equal size in log(1+z) (comoving,
                                # linear, logspace)
        zmin: null              # lower redshift limit
        zmax: null              # upper redshift limit
        zbin_num: 30            # (opt) number of redshift bins
    scales:                 # specify the correlation measurement scales
        rmin: null              # (list of) lower scale limit in kpc
                                # (pyhsical)
        rmax: null              # (list of) upper scale limit in kpc
                                # (pyhsical)
        rweight: null           # (opt) weight galaxy pairs by their
                                # separation to power 'rweight'
        rbin_num: 50            # (opt) number of bins in log r used (i.e.
                                # resolution) to compute distance weights
    cosmology: Planck15     # (opt) cosmological model used for distance
                            # calculations (WMAP1, WMAP3, WMAP5, WMAP7, WMAP9,
                            # Planck13, Planck15, Planck18)

# This section defines the input data products and
# their meta data. These can be FITS, PARQUET, CSV or
# FEATHER files.
data:
    backend: scipy          # (opt) name of the data catalog backend (scipy,
                            # treecorr)
    cachepath: null         # (opt) cache directory path, e.g. on fast storage
                            # device (recommended for 'backend=scipy', default
                            # is within project directory)
    n_patches: null         # (opt) number of automatic spatial patches to use
                            # for input catalogs below, provide only if no
                            # 'data/rand.patches' provided
    reference:              # (opt) reference data sample with know redshifts
        data:                   # data catalog file and column names
            filepath: ...           # input file path
            ra: ra                  # right ascension in degrees
            dec: dec                # declination in degrees
            redshift: z             # redshift of objects (required)
            patches: patch          # (opt) integer index for patch
                                    # assignment, couting from 0...N-1
            weight: weight          # (opt) object weight
            cache: false            # (opt) whether to cache the file in the
                                    # cache directory
        rand: null              # random catalog for data sample, omit or
                                # repeat arguments from 'data' above
    unknown:                # (opt) unknown data sample for which clustering
                            # redshifts are estimated, typically in
                            # tomographic redshift bins, see below
        data:                   # data catalog file and column names
            filepath:               # either a single file path (no
                                    # tomographic bins) or a mapping of
                                    # integer bin index to file path (as shown
                                    # below)
                1: ...                  # bin 1
                2: ...                  # bin 2
            ra: ra                  # right ascension in degrees
            dec: dec                # declination in degrees
            redshift: z             # (opt) redshift of objects, if provided,
                                    # enables computing the autocorrelation of
                                    # the unknown sample
            patches: patch          # (opt) integer index for patch
                                    # assignment, couting from 0...N-1
            weight: weight          # (opt) object weight
            cache: false            # (opt) whether to cache the file in the
                                    # cache directory
        rand: null              # random catalog for data sample, omit or
                                # repeat arguments from 'data' above
                                # ('filepath' format must must match 'data'
                                # above)

# The section below is entirely optional and used to
# specify tasks to execute when using the 'yaw_cli
# run' command. The list is generated and updated
# automatically when running 'yaw_cli' subcommands.
# Tasks can be provided as single list entry, e.g.
#   - cross
#   - zcc
# to get a basic cluster redshift estimate or with the
# optional parameters listed below (all values
# optional, defaults listed).
tasks:
  - cross:                  # compute the crosscorrelation
        rr: false               # compute random-random pair counts if both
                                # randoms are available
  - auto_ref:               # compute the reference sample autocorrelation for
                            # bias mitigation
        rr: true                # do not compute random-random pair counts
  - auto_unk:               # compute the unknown sample autocorrelation for
                            # bias mitigation
        rr: true                # do not compute random-random pair counts
  - ztrue                   # compute true redshift distributions for unknown
                            # data (requires point estimate)
  - drop_cache              # delete temporary data in cache directory, has no
                            # arguments
  - zcc:                    # compute clustering redshift estimates for the
                            # unknown data, task can be added repeatedly if
                            # different a 'tag' is used
        tag: fid                # unique identifier for different
                                # configurations
        bias_ref: true          # whether to mitigate the reference sample
                                # bias using its autocorrelation function (if
                                # available)
        bias_unk: true          # whether to mitigate the unknown sample bias
                                # using its autocorrelation functions (if
                                # available)
        est_cross: null         # correlation estimator for crosscorrelations
                                # (PH, DP, HM, LS)
        est_auto: null          # correlation estimator for autocorrelations
                                # (PH, DP, HM, LS)
        method: jackknife       # resampling method for covariance estimates
                                # (jackknife, bootstrap)
        crosspatch: true        # whether to include cross-patch pair counts
                                # when resampling
        n_boot: 500             # number of bootstrap samples
        global_norm: false      # normalise pair counts globally instead of
                                # patch-wise
        seed: 12345             # random seed for bootstrap sample generation
  - plot                    # generate automatic check plots

Note

All parameters with a leading (opt) in their comment are optional and can be omitted from the configuration file. The same applies to all items listed in tasks.

Configuration#

This section maps one-to-one to a yaw.config.Configuration instance and specifies the parameters of the correlation backend, the correlation measurement scales, and the redshift binning. The parameter descriptions in the listing above are mostly self-explanatory, but there is one peculiarity:

Note

The configuration of the redshift bins has two mutually exclusive parameter groups. The binning must be specified as either of:

  • binning.zbins, i.e. providing a list of bin edges, or

  • binning.zmin, binning.zmax, (binning.zbin_num, binning.method), i.e. providing parameters used to generate a binning automatically.

If both are provided, binning.zbins is ignored.
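
The rule in the note above can be illustrated with a small check on a parsed binning section. This is only a sketch: the function binning_mode and its return labels are hypothetical and not part of yaw_cli. It assumes the section has already been loaded into a Python dict.

```python
def binning_mode(binning):
    """Report which parameter group of a 'binning' mapping takes effect (hypothetical helper)."""
    has_limits = binning.get("zmin") is not None and binning.get("zmax") is not None
    has_edges = binning.get("zbins") is not None
    if has_limits:
        # Following the note above: if both groups are given, zbins is ignored.
        return "generated"
    if has_edges:
        return "manual"
    raise ValueError("provide either 'zbins' or both 'zmin' and 'zmax'")


generated = binning_mode({"zmin": 0.1, "zmax": 1.0, "zbin_num": 30, "method": "linear"})
manual = binning_mode({"zbins": [0.1, 0.4, 0.7, 1.0]})
print(generated, manual)
```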

Data#

This section specifies the input data files, split into two subsections, reference and unknown. Either subsection is optional, e.g. if no unknown sample is needed for the tasks to perform, the unknown subsection can be omitted.

The reference and unknown subsections each contain two entries called data and rand, which specify the data catalog and, optionally, the random catalog. While the data entry is always required, rand can be omitted.

Note

Computing a crosscorrelation requires at least one of the two possible random samples, i.e. a rand entry for either the reference or the unknown sample.

In each section, only the filepath, ra, and dec parameters are required. The reference section additionally requires redshifts through the redshift parameter. In the unknown section, filepath may also specify several input files (e.g. different tomographic bins), but these must all have the same column names. Instead of providing a single file path, provide a mapping of subset / bin index to file path, e.g.

filepath:
    1: path/to/sample1
    5: path/to/sample5

instead of

filepath: path/to/sample
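
To handle both forms uniformly in a script, the filepath entry can be normalised to a mapping once it has been parsed. The following is a minimal sketch; the helpers normalize_filepath and same_bins are hypothetical, and using None as the key for the single-file case is an arbitrary choice made here for illustration.

```python
def normalize_filepath(filepath):
    """Map a 'filepath' entry to {bin index: path}; None marks the single-file case."""
    if isinstance(filepath, dict):
        return {int(idx): str(path) for idx, path in filepath.items()}
    return {None: str(filepath)}


def same_bins(data_filepath, rand_filepath):
    """Check that the data and rand 'filepath' entries use the same format and bin indices."""
    return set(normalize_filepath(data_filepath)) == set(normalize_filepath(rand_filepath))


tomographic = normalize_filepath({1: "path/to/sample1", 5: "path/to/sample5"})
single = normalize_filepath("path/to/sample")
print(tomographic, single)
```

The second helper reflects the remark in the default setup file that the filepath format of rand must match the one used for data.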

Note

Spatial patches, which are used for error and covariance estimation, must be defined consistently for all input samples. Either use the n_patches parameter to generate them automatically, or provide a column with an integer patch index in the input files using the patches parameter, e.g.:

data:
    filepath: ...
    patches: name_of_patch_column

Tasks#

This section lists all tasks to be applied to the input data. The default setup.yaml contains all possible tasks together with the default values of all their parameters. The setup.yaml in a project directory always contains a correctly ordered list of tasks (see above) without any duplicates (i.e. existing entries are replaced by the most recent calls).

Every task and all task parameters are optional and can be omitted. For example,

tasks:
    - cross:
          rr: false
    - zcc

and

tasks:
    - cross
    - zcc

are equivalent, since rr: false is the default value. Note that the task zcc can be repeated arbitrarily many times, as long as the tag names differ. If the tag name is identical, only the last version is kept. For example,

tasks:
    - cross
    - auto_ref
    - zcc:
          tag: no_bias_mitigation
          bias_ref: false
    - zcc:
          tag: fid

will generate two redshift estimates: one called no_bias_mitigation, which does not use the reference sample autocorrelation to mitigate galaxy bias, and one called fid, where the bias is mitigated. (Since fid is the default tag, it is also possible to omit the last line entirely.)
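
The rule that repeated zcc entries are distinguished by their tag, with later entries replacing earlier ones, can be sketched as follows. The function deduplicate_tasks is a hypothetical illustration operating on an already parsed task list, not the implementation used by yaw_cli, and it does not attempt to reproduce the task ordering that yaw_cli applies.

```python
def deduplicate_tasks(tasks):
    """Keep one entry per task; 'zcc' entries are distinguished by their tag."""
    kept = {}
    for entry in tasks:
        # A task is either a bare name ("cross") or a one-key mapping ({"zcc": {...}}).
        if isinstance(entry, dict):
            (name, params), = entry.items()
            params = dict(params or {})
        else:
            name, params = entry, {}
        # "fid" is the default tag of the zcc task.
        key = (name, params.get("tag", "fid")) if name == "zcc" else (name, None)
        # A later entry with the same key overwrites the earlier one.
        kept[key] = (name, params)
    return list(kept.values())


tasks = [
    "cross",
    {"zcc": {"tag": "no_bias_mitigation", "bias_ref": False}},
    {"zcc": None},
    {"zcc": {"tag": "no_bias_mitigation"}},
]
unique = deduplicate_tasks(tasks)
print(unique)
```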

Advanced usage#

The --config-from argument of yaw_cli run allows rerunning a previous analysis setup (same input files and list of tasks) while using the configuration section from a different input file. This is particularly useful if one only wishes to change, for example, the measurement scales or the redshift binning.

For example

$ yaw_cli run version2 -s version1/setup.yaml --config-from new_config.yaml

creates a new project directory called version2. The task list and input files are taken from the setup file of an existing project called version1, but the configuration section is read from new_config.yaml (any other contents of that file are ignored).
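
Conceptually, the effect of --config-from on the setup can be sketched on parsed dictionaries as shown below. The function apply_config_from and the example values are hypothetical and only illustrate the described behaviour; they are not the code path used by yaw_cli.

```python
import copy


def apply_config_from(setup, config_source):
    """Return a copy of `setup` whose 'configuration' section comes from `config_source`."""
    merged = copy.deepcopy(setup)
    # Only the 'configuration' section is taken over; all other contents
    # of `config_source` are ignored.
    merged["configuration"] = copy.deepcopy(config_source["configuration"])
    return merged


version1 = {
    "configuration": {"scales": {"rmin": 100, "rmax": 1000}},
    "data": {"reference": {"data": {"filepath": "..."}}},
    "tasks": ["cross", "zcc"],
}
new_config = {
    "configuration": {"scales": {"rmin": 500, "rmax": 1500}},
    "tasks": ["plot"],
}
version2 = apply_config_from(version1, new_config)
print(version2)
```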