Starting a new project

The command line tool is designed to operate on separate projects. A project uses a fixed set of parameters (e.g. redshift binning and correlation measurement scales) to compute clustering redshifts with a single reference sample and one or many unknown data sets that spatially overlap.

Note

Using multiple reference samples as input for a project is currently not supported. However, it is possible to merge the outputs from different projects.

New projects are created with the yaw_cli init [path] subcommand, where path specifies a directory (which must not yet exist) in which all data products are stored and managed. This command sets the majority of the parameters for the correlation measurements, including the measurement scales and the redshift binning, as well as optional parameters such as the cosmological model for distance calculations and the automatic generation of spatial patches. A list of all command line arguments can be obtained by typing

$ yaw_cli init --help
usage: yaw_cli init [-h] [-v] [-s <file>] [--backend {scipy,treecorr}]
                    [--cache-path <path>] [--n-patches <int>]
                    [--cosmology {WMAP1,WMAP3,WMAP5,WMAP7,WMAP9,Planck13,Planck15,Planck18}]
                    --ref-path <file> --ref-ra <str> --ref-dec <str> --ref-z
                    <str> [--ref-w <str>] [--ref-patch <str>] [--ref-cache]
                    [--rand-path <file>] [--rand-ra <str>] [--rand-dec <str>]
                    [--rand-z <str>] [--rand-w <str>] [--rand-patch <str>]
                    [--rand-cache] --rmin [<float> ...] --rmax [<float> ...]
                    [--rweight <float>] [--rbin-num <int>]
                    [--zbins [<float> ...]]
                    [--method {comoving,linear,logspace}] [--zmin <float>]
                    [--zmax <float>] [--zbin-num <int>] [--thread-num <int>]
                    [--no-crosspatch] [--rbin-slop <float>]
                    <path>

Initialise and create a project directory with a configuration. Specify the
reference sample data and optionally randoms.

positional arguments:
  <path>                project directory, must not exist

options:
  -h, --help            show this help message and exit
  -v, --verbose         show additional information in terminal, repeat to
                        show debug messages
  -s <file>, --setup <file>
                        optionl setup YAML file (e.g. from 'yaw_cli run -d')
                        with base configuration that is overwritten by
                        arguments below

additional arguments:
  --backend {scipy,treecorr}
                        backend used for pair counting (default: scipy)
  --cache-path <path>   non-standard location for the cache directory (e.g. on
                        faster storage, default: [project directory]/cache)
  --n-patches <int>     split all input data into this number of spatial
                        patches for covariance estimation (default: patch
                        index for catalogs)
  --cosmology {WMAP1,WMAP3,WMAP5,WMAP7,WMAP9,Planck13,Planck15,Planck18}
                        cosmological model used for distance calculations (see
                        astropy.cosmology, default: Planck15)

reference (data):
  specify the reference (data) input file

  --ref-path <file>     input file path
  --ref-ra <str>        column name of right ascension
  --ref-dec <str>       column name of declination
  --ref-z <str>         column name of redshift
  --ref-w <str>         column name of object weight
  --ref-patch <str>     column name of patch assignment index
  --ref-cache           cache the data in the project's cache directory

reference (random):
  specify the reference (random) input file (optional)

  --rand-path <file>    input file path
  --rand-ra <str>       column name of right ascension
  --rand-dec <str>      column name of declination
  --rand-z <str>        column name of redshift
  --rand-w <str>        column name of object weight
  --rand-patch <str>    column name of patch assignment index
  --rand-cache          cache the data in the project's cache directory

measurement scales:
  sets the physical scales for the correlation measurements

  --rmin [<float> ...]  (list of) lower scale limit in kpc (pyhsical)
  --rmax [<float> ...]  (list of) upper scale limit in kpc (pyhsical)
  --rweight <float>     weight galaxy pairs by their separation to power
                        'rweight' (default: no weighting applied)
  --rbin-num <int>      number of bins in log r used (i.e. resolution) to
                        compute distance weights (default: 50)

redshift binning:
  sets the redshift binning for the clustering redshifts

  --zbins [<float> ...]
                        list of custom redshift bin edges, if method is set to
                        'manual'
  --method {comoving,linear,logspace}
                        redshift binning method, 'logspace' means equal size
                        in log(1+z) (default: linear)
  --zmin <float>        lower redshift limit (default: None)
  --zmax <float>        upper redshift limit (default: None)
  --zbin-num <int>      number of redshift bins (default: 30)

backend specific:
  parameters that are specific to pair counting backends

  --thread-num <int>    default number of threads to use (default: all)
  --no-crosspatch       whether to count pairs across patch boundaries (scipy
                        backend only)
  --rbin-slop <float>   TreeCorr 'rbin_slop' parameter (default: 0.01),
                        without 'rweight' this just a single radial bin,
                        otherwise 'rbin_num'

Note

The configuration of the redshift bins has two mutually exclusive parameter groups. The binning must be specified as either of:

  • --zbins, i.e. providing a list of bin edges, or

  • --zmin, --zmax, (--zbin-num, --method), i.e. providing parameters used to generate a binning automatically.

If both are provided, --zbins is ignored.
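The two ways of specifying the binning can be sketched as follows. The project names, file name, column names, and all numerical values are hypothetical, and the interplay between --zbins and --method is not verified here. By default the snippet is a dry run that only prints the commands; set YAW=yaw_cli in the environment to execute them.

```shell
# Dry run by default: prints each command instead of executing it.
YAW="${YAW:-echo yaw_cli}"

# Arguments shared by both variants (hypothetical file and column names).
COMMON="--ref-path reference.fits --ref-ra ra --ref-dec dec --ref-z z --rmin 100 --rmax 1000 --n-patches 64"

# Variant 1: custom bin edges, here three bins between z=0.1 and z=1.0.
$YAW init project_zbins $COMMON --zbins 0.1 0.4 0.7 1.0

# Variant 2: generated binning, three bins of equal size in log(1+z).
$YAW init project_auto $COMMON --zmin 0.1 --zmax 1.0 --zbin-num 3 --method logspace
```

The project directory is given first so that it cannot be mistaken for a value of one of the list-valued options such as --zbins.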

The reference sample

Since the reference sample used for a project is static, it is already specified at this stage by providing an input path (--ref-path) and the required column names for right ascension (--ref-ra), declination (--ref-dec, in degrees), and per-object redshifts (--ref-z). Weights (--ref-w) are optional.

Similarly, a random sample for the reference sample can be provided using the corresponding --rand-* arguments. Note that the reference randoms also require per-object redshifts. If no reference randoms are provided, randoms for the unknown sample are required (see yaw_cli cross).
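As an illustration, the following sketch initialises a project with a weighted reference sample and matching randoms. The project directory, file names, column names (ra, dec, z, weight), scale limits, redshift limits, and number of patches are all hypothetical placeholders. By default the snippet is a dry run that only prints the command; set YAW=yaw_cli in the environment to execute it.

```shell
# Dry run by default: prints the command instead of executing it.
YAW="${YAW:-echo yaw_cli}"

# Reference data and randoms both need RA, Dec, and redshift columns;
# the weight column is optional. Scales are given in kpc.
$YAW init my_project \
    --ref-path reference.fits \
    --ref-ra ra --ref-dec dec --ref-z z --ref-w weight \
    --rand-path reference_randoms.fits \
    --rand-ra ra --rand-dec dec --rand-z z --rand-w weight \
    --rmin 100 --rmax 1000 \
    --zmin 0.1 --zmax 1.0 \
    --n-patches 64
```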

Spatial patches and caching

It is important to specify consistent spatial patches for a project, since these are used to compute uncertainty estimates and covariances. There are two options:

  1. Generate the patches automatically using a k-means clustering algorithm (--n-patches). The code ensures that all data and random catalogues use the same patch centers.

  2. Provide manual patch assignments from a column with integer patch indices (--ref-patch and --rand-patch). The code will only check that the patches align roughly, so the user must ensure that they are consistent for all input samples.

Warning

For performance reasons it is highly recommended to cache all input data sets using the flags --ref-cache and --rand-cache. For more details refer to Caching.
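The two patch options, combined with the recommended caching flags, might look as follows. This is a sketch only: the project names, file names, column names (including the "patch" column), and the number of patches are hypothetical. By default the snippet is a dry run that only prints the commands; set YAW=yaw_cli in the environment to execute them.

```shell
# Dry run by default: prints each command instead of executing it.
YAW="${YAW:-echo yaw_cli}"

# Option 1: generate 64 patches automatically, cache data and randoms.
$YAW init project_auto_patches \
    --ref-path reference.fits --ref-ra ra --ref-dec dec --ref-z z --ref-cache \
    --rand-path randoms.fits --rand-ra ra --rand-dec dec --rand-z z --rand-cache \
    --rmin 100 --rmax 1000 --zmin 0.1 --zmax 1.0 \
    --n-patches 64

# Option 2: read precomputed integer patch indices from a "patch" column.
$YAW init project_manual_patches \
    --ref-path reference.fits --ref-ra ra --ref-dec dec --ref-z z \
    --ref-patch patch --ref-cache \
    --rand-path randoms.fits --rand-ra ra --rand-dec dec --rand-z z \
    --rand-patch patch --rand-cache \
    --rmin 100 --rmax 1000 --zmin 0.1 --zmax 1.0
```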

Outputs

The init subcommand creates an empty project directory, in which all data products are stored. The configuration is stored in the newly created setup.yaml YAML file, together with a declaration of input files and processing steps applied (see next page). Logs for debugging are stored in setup.log, the patch center coordinates are stored in patch_centers.dat. Finally, the redshift distribution of the reference sample is computed and stored as true/nz_reference.*.
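A quick way to confirm that initialisation produced the files listed above is sketched below. The helper name check_project and the default directory my_project are hypothetical, and the sketch assumes that all listed paths are relative to the project directory.

```shell
# Report which of the expected init outputs exist in a project directory.
# Returns non-zero if any of them is missing.
check_project() {
    project_dir="$1"
    n_missing=0
    for f in setup.yaml setup.log patch_centers.dat; do
        if [ -f "$project_dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            n_missing=$((n_missing + 1))
        fi
    done
    # The reference redshift distribution may come with any file extension.
    if ls "$project_dir"/true/nz_reference.* >/dev/null 2>&1; then
        echo "found:   true/nz_reference.*"
    else
        echo "missing: true/nz_reference.*"
        n_missing=$((n_missing + 1))
    fi
    [ "$n_missing" -eq 0 ]
}

check_project "${1:-my_project}" || echo "project looks incomplete"
```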