Sample attributes reference
...........................

Relationships between samples
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

``id``
    String, required, must be unique for each sample. Used as the filename prefix for output files. Autogenerated if not specified.

``condition``
    String, optional. Labels the basal tier in the experimental design hierarchy.

``biorep``
    String, optional. Labels the secondary tier in the experimental design hierarchy.

``techrep``
    String, optional. Labels the tertiary tier in the experimental design hierarchy.

``aggregateProfileName``
    String. Optional. Not typically specified by the user; built automatically by Hich based on the ``aggregate`` Hich-config block. Labels the aggregate profile applied to the sample.

.. note::
    ``id`` is typically built algorithmically by concatenating ``condition``, ``biorep``, and ``techrep``, as well as the :ref:`aggregateProfileName `, but can be manually specified. For example, the following sample file (``id`` unspecified) and aggregate profile block would result in the ``id`` values ``c1_b1_t1_profile1``, ``c1_b1_t1_profile2``, ``c1_b2_t1_profile1``, and ``c1_b2_t1_profile2``.

    .. code:: c

        techrep  biorep  condition
        t1       b1      c1
        t1       b2      c1

    .. code:: c

        aggregate:
            profile1:
                mergeTechrepToBiorep: true
                dedupMaxMismatch: 3
                techrepDedupMethod: "sum"
            profile2:
                mergeTechrepToBiorep: true
                dedupMaxMismatch: 0
                techrepDedupMethod: "max"

These attributes are used by :ref:`aggregate profiles ` and :ref:`sample selection strategies ` to control how samples are downsampled, merged, and deduplicated, as well as how features are called.

Resource files
,,,,,,,,,,,,,,

``assembly``
    String. Required. Genome assembly label. May be used to download the genome reference if ``genomeReference`` is unspecified. Example: "hg38".

``genomeReference``
    String. Required for building fragment index, chromsizes, and aligner index. Can be downloaded by Hich automatically for common genomes if ``assembly`` is included. Path or URL to genome reference fasta file.
    If ``genomeReference`` is unspecified but ``assembly`` is one of the supported options, Hich downloads the genome reference from the ENCODE project or NCBI. If multiple samples use the downloaded reference, it is downloaded only once and shared by all the samples that need it.

    Supported options for automatically downloading the genome reference: ``hg38``, ``homo_sapiens``, or ``GRCh38`` (GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz), ``mm10`` (mm10_no_alt_analysis_set_ENCODE.fasta.gz), ``dm6`` (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz), ``galGal5`` or ``bgalGal5`` (GCA_027408225.1_bGalGal5.pri_genomic.fna.gz), ``danRer11`` (GCF_000002035.6_GRCz11_genomic.fna.gz), or ``ce10`` (GCF_000002985.6_WBcel235_genomic.fna.gz).

``indexDir``
    String. Required for alignment, optional otherwise. Path to the directory containing the aligner index.

``indexPrefix``
    String. Required for alignment, optional otherwise. Prefix of all aligner index files (all aligner index files must share a common prefix). Example: if the aligner index files are ``hg38.*``, then ``indexPrefix`` should be ``hg38``.

``chromsizes``
    String. Required to build fragment index, parse sam/bam to pairs or ingest pairs, or build contact matrix. Path to a tab-delimited headerless file with contig names in the first column and the length of each contig in base pairs in the second column. Automatically created based on ``genomeReference`` if unspecified and shared among samples with a common reference that all left ``genomeReference`` unspecified.

``restrictionEnzymes``
    String. Required to build fragment index, but fragment index is optional. Space-delimited list of restriction enzyme names used in the restriction digest for the sample. Any combination of enzymes in the REBASE database as accessed via `biopython's restriction enzymes module `_ can be used (e.g.
    ``DpnII DdeI``), as well as ``Arima``, ``Phase Proximo 2021+ Plant`` (or ``Phase Plant``), ``Phase Proximo 2021+ Animal`` (or ``Phase Animal``), ``Phase Proximo 2021+ Microbiome`` (or ``Phase Microbiome``), ``Phase Proximo 2021+ Human`` (or ``Phase Human``) or ``Phase Proximo 2021+ Fungal`` (or ``Phase Fungal``).

``fragmentIndex``
    String. Optional. Path to BED file containing start and end positions of restriction fragments for the digest used for the sample. If the ``restrictionEnzymes`` option is specified but ``fragmentIndex`` is not, then Hich will create a ``fragmentIndex`` file based on the ``restrictionEnzymes`` and ``genomeReference`` and share it among samples with the same reference and enzymes.

Aligning reads
,,,,,,,,,,,,,,

Hich toolkit: `bwa mem `_, `bwa-mem2 `_, `bsbolt `_

See also: :ref:`Resource files` under ``assembly``, ``genomeReference``, ``indexDir``, ``indexPrefix``.

``fastq1``
    String. Required for alignment. Path to R1 or single-end read fastq file (may be gzipped).

``fastq2``
    String. Required for alignment if samples are paired-end and non-interleaved. Path to R2 fastq file (may be gzipped). Leave blank or unspecified if using single-end reads.

``aligner``
    String. Required for alignment. Aligner to use for aligning the sample. Options: ``bwa`` (slower, lower memory footprint), ``bwa-mem2`` (fast, higher memory footprint), ``bsbolt`` (methyl Hi-C).

``bwaFlags``
    List of strings. Required for alignment. CLI options passed to aligner (note that all aligners including BSBolt are based on ``bwa mem``). Typically, use ``-SP5M``. Do not use the ``bwa`` option ``-t`` or the BSBolt options ``-OT``, ``-O``, ``-DB``, ``-F1``, or ``-F2`` as these are hardcoded by Hich based on other sample attributes.

    Example: ``bwaFlags: ["-S", "-P", "-5", "-M"]``

Filters
,,,,,,,

``minMapq``
    Integer. Optional. Reads below this MAPQ cutoff will be discarded. Note that different aligners approximate MAPQ differently.
    The approach used by ``bwa`` is what's relevant for Hich.

Hi-C contacts ingested from .pairs or parsed from .sam/.bam
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Hich toolkit: `samtools `_, `pairtools `_

``sambam``
    String. Optional. Specified by user for .sam/.bam files to ingest as inputs into Hich rather than being built via alignment from .fastq data. Path to .sam/.bam file containing aligned reads which will be parsed using ``pairtools parse2`` to obtain a `4DN .pairs `_ file.

``pairs``
    String. Optional. Specified by user for `4DN .pairs `_ files (may be gzipped) to ingest as inputs into Hich rather than being built via alignment from .fastq data.

``pairtoolsParse2Params``
    List of flags passed to `pairtools parse2 `_. Uses ``minMapq``, if specified for the sample, as the default value for the ``pairtools parse2 --min-mapq`` option, but this can be overridden by passing ``--min-mapq`` in ``pairtoolsParse2Params``. Hardcoded options that should not be provided here: ``--flip``, ``--assembly``, ``--chroms-path``.

.. note::
    Hich will inspect .sam/.bam files to determine if they are sorted, and sort them automatically by name (required for inputs to ``pairtools parse``) only if necessary. It will then sort the output by position.

.. note::
    ``pairtools parse2`` has a ``--drop-readid`` parameter, which can drastically shrink the disk space required for the .pairs file. This is useful, but for single cell data (see below), it was challenging to engineer a way to drop this column when it's necessary to extract the ``cellID`` column value from the ``readID`` column of the .sam/.bam file used as input to parsing the .pairs file. For this reason, the ``--drop-readid`` parameter is not actually passed to ``pairtools parse2``. Instead, ``--placeholder readID .`` is passed to ``hich reshape``, which accomplishes the same result while permitting ``cellID`` to be extracted from the ``readID`` column if necessary.
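
The default-and-override behaviour of ``--min-mapq`` described for ``pairtoolsParse2Params`` above can be sketched in Python. This is a minimal illustration under stated assumptions, not Hich's actual implementation: the function name ``build_parse2_params`` and the ``HARDCODED`` tuple are hypothetical, and only the three hardcoded options listed above are checked.

.. code:: python

    # Hypothetical sketch: combine a sample's minMapq with user-supplied
    # pairtoolsParse2Params. Names here are illustrative, not part of Hich.
    HARDCODED = ("--flip", "--assembly", "--chroms-path")


    def build_parse2_params(min_mapq, user_params=None):
        """Return the flag list for parse2, applying minMapq as a default."""
        params = list(user_params or [])
        # Accept both "--flag value" and "--flag=value" spellings.
        names = [p.split("=", 1)[0] for p in params]

        # Options that the text says are hardcoded must not be supplied.
        clashes = [n for n in names if n in HARDCODED]
        if clashes:
            raise ValueError("hardcoded options supplied: " + ", ".join(clashes))

        # minMapq is only a default: an explicit --min-mapq takes precedence.
        if min_mapq is not None and "--min-mapq" not in names:
            params = ["--min-mapq", str(min_mapq)] + params
        return params


    print(build_parse2_params(30, ["--drop-seq"]))
    print(build_parse2_params(30, ["--min-mapq", "10"]))

In this sketch, a sample with ``minMapq`` set and no explicit ``--min-mapq`` gets the default prepended, while an explicit ``--min-mapq`` in ``pairtoolsParse2Params`` is passed through unchanged.
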

Optional single-cell attributes
_______________________________

.. note::
    These attributes can be ignored for bulk data.

For single cell-aware fragment filtering and deduplication, and to maintain the cell ID for future analysis, Hich must put a unique identifier for the cell attributed to each contact into a column labeled ``cellID`` in the .pairs file. Hich can extract this identifier automatically from the read ID or from a .sam/.bam tag with the Hich CLI command ``hich reshape``, using the sample attributes in this section.

``cellBarcodeField``
    Required if parsing cell ID from a .sam/.bam file. Should be either ``readID`` or the name of a .sam/.bam tag. This field will be parsed for each read in the .sam/.bam file in order to extract the value of the ``cellID``. The patterns used to accomplish this extraction are specified below.

``cellBarcodeRegexPattern``
    Optional. Should be a Python regex compatible with ``re`` (regexes can be tested at `regex101.com `_). Along with ``cellBarcodeGroup``, the regex will be applied to parse the field specified in ``cellBarcodeField`` and the match will be put into the ``cellID`` field of the .pairs file. Overrides ``cellBarcodeParsePattern`` if both are specified.

``cellBarcodeGroup``
    Optional. An integer specifying which match group from the regex specified by ``cellBarcodeRegexPattern`` should be used as the value of ``cellID``. 0 uses the entire match. Defaults to 0 if ``cellBarcodeField`` and ``cellBarcodeRegexPattern`` are specified and ``cellBarcodeGroup`` is not.

``cellBarcodeParsePattern``
    Optional. An alternative and potentially simpler way to parse ``cellBarcodeField`` by using Python's `parse `_ library syntax. From the pattern specified, the ``{cellID}`` named part will be extracted and put into the ``cellID`` column in the .pairs file. Example: ``{}:{cellID}`` will extract the part after a colon (:) and put it into the ``cellID`` column.

``globalDefaultReshapeToCellID``
    Optional.
    Must be specified in the params file or nextflow.config. If ``cellBarcodeField`` is specified for a sample but neither ``cellBarcodeRegexPattern`` nor ``cellBarcodeParsePattern`` is specified, then ``globalDefaultReshapeToCellID`` is used to determine how the ``cellID`` column will be parsed. Ignored if ``cellBarcodeRegexPattern`` or ``cellBarcodeParsePattern`` is given for the sample.

``globalDefaultReshapeToCellID.option``
    Optional. Either ``--regex`` or ``--parse``, which determines whether ``globalDefaultReshapeToCellID.pattern`` (below) will be parsed using Python's ``re`` library or its ``parse`` library (see above options for details).

``globalDefaultReshapeToCellID.pattern``
    Optional. Interpreted as either a Python ``re`` regex or a Python ``parse`` pattern depending on the value of ``globalDefaultReshapeToCellID.option``.

``globalDefaultReshapeToCellID.group``
    Optional. The match group to use for the regex. Ignored if unspecified, and should be left unspecified if using ``parse``.

``reshapeParams``
    Optional additional params passed to ``hich reshape``.

Filtering Hi-C contacts
,,,,,,,,,,,,,,,,,,,,,,,

Hich toolkit: `pairtools `_

See also: :ref:`Resource files` under ``restrictionEnzymes``, ``fragmentIndex``

``selectFilters``
    A multi-attribute of filters to apply to Hi-C contacts in .pairs files.

``selectFilters.keepPairTypes``
    List of strings. Optional. Pairtools `pair types `_ to keep. Keeping ``UU``, ``UR``, and ``RU`` is recommended.

``selectFilters.keepTrans``
    Boolean. Optional. If false, discards contacts whose ends map to different chromosomes/contigs. If unspecified, these contacts will be kept.

``selectFilters.keepCis``
    Boolean. Optional. If false, discards contacts whose ends map to the same chromosome/contig. If unspecified, these contacts will be kept.

``selectFilters.minDistFR``
    Integer. Optional. If specified, then contacts with the orientation FR are discarded if the distance between ``pos1`` and ``pos2`` is below this value.

``selectFilters.minDistRF``
    Integer. Optional.
    If specified, then for reads with the orientation RF, discards if they are below this distance between ``pos1`` and ``pos2``.

``selectFilters.minDistFF``
    Integer. Optional. If specified, then for reads with the orientation FF, discards if they are below this distance between ``pos1`` and ``pos2``.

``selectFilters.minDistRR``
    Integer. Optional. If specified, then for reads with the orientation RR, discards if they are below this distance between ``pos1`` and ``pos2``.

.. note::
    Two technical artifacts that routinely appear in Hi-C experiments enriched in short-range contacts are undigested chromatin and self-ligated strands. These will appear in the multiQC reports generated by Hich as a strong enrichment in the FR and RF orientations below a certain distance threshold. By pausing the Hich run after parsing to pairs and inspecting this report, the ``minDist`` values can be chosen appropriately according to the QC data. Data with no strand bias should have very close to 25% of each orientation.

``selectFilters.discardSingleFrag``
    Boolean. Optional. Discard contacts where both ends map to the same restriction fragment as these likely originate from undigested chromatin. Requires that samples have been tagged with this information, which Hich will do automatically if ``fragmentIndex`` is specified.

``pairtoolsSelectParams``
    List of strings. Optional. Additional parameters to pass to ``pairtools select``. The following options are hardcoded in Hich and should not be specified here: ``--output-rest``, ``--output``, ``--nproc-in``, ``--nproc-out``.

Downsampling, merging, and deduplicating samples
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

``aggregate``
    A single Hich-config block, typically declared in the YAML params file. Each sub-entry is the name of an **aggregate profile**. The aggregate profile defines how samples are to be downsampled, merged and deduplicated.
    For each aggregate profile, samples with no aggregate profile are cloned and the clone is tagged as belonging to that profile. The aggregate profile name is incorporated into the cloned sample's ``id`` and therefore becomes part of the filename.

Options within an aggregate profile in the ``aggregate`` block
______________________________________________________________

**Deduplication**

``techrepDedup``
    If true, techrep-level samples in the aggregate profile will be deduplicated **after** merging them into a biorep-level sample.

``techrepDedupMethod``
    Controls the ``--method`` parameter for `pairtools dedup `_ when calling duplicates on technical replicate samples.

``biorepDedup``
    If true, biorep-level samples in the aggregate profile will be deduplicated **after** merging them into a condition-level sample.

``biorepDedupMethod``
    Controls the ``--method`` parameter for `pairtools dedup `_ when calling duplicates on biological replicate samples.

``conditionDedup``
    If true, condition-level samples in the aggregate profile will be deduplicated **after** merging them into a merged-condition-level sample.

``conditionDedupMethod``
    Controls the ``--method`` parameter for `pairtools dedup `_ when calling duplicates on condition samples.

``dedupMaxMismatch``
    The maximum mismatch, in base pairs, permitted between contacts for them to be deemed duplicates. This value is interpreted in light of the ``techrepDedupMethod``, ``biorepDedupMethod``, or ``conditionDedupMethod``.

``dedupSingleCell``
    If true, then in addition to the difference in position between contacts being small enough, the contacts' ``cellID`` column values must also match in order for one of the contacts to be discarded as a duplicate. See the section on :ref:`optional single-cell attributes ` for options controlling how to parse ``cellID`` from the readID or sam/bam tags.

**Downsampling**

``techrepDownsamplePairs``
    If true, techrep-level samples will be downsampled in a manner controlled by the following parameters.

``techrepCisStrata``
    For downsampling, defines a partition over distance strata for contacts mapping to the same chromosome, which is used to homogenize the number of contacts within each stratum across the techrep-level samples being downsampled together.

``techrepReadConjuncts``
    For downsampling, selects which fields will be used to partition techrep-level contacts for downsampling.

``techrepDownsampleToMeanDistribution``
    If true, then during downsampling, the mean fraction of contacts in each block in the partition will be used as the target distribution for each of the techrep-level samples being downsampled together.

``techrepToSize``
    Controls the number of contacts each techrep-level sample in the aggregate profile will be downsampled to. If a float from 0-1, downsamples to approximately that fraction of the original size. If an integer greater than 1, downsamples to that number of contacts.

``biorepDownsamplePairs``
    If true, biorep-level samples will be downsampled in a manner controlled by the following parameters.

``biorepCisStrata``
    For downsampling, defines a partition over distance strata for contacts mapping to the same chromosome, which is used to homogenize the number of contacts within each stratum across the biorep-level samples being downsampled together.

``biorepReadConjuncts``
    For downsampling, selects which fields will be used to partition biorep-level contacts for downsampling.

``biorepDownsampleToMeanDistribution``
    If true, then during downsampling, the mean fraction of contacts in each block in the partition will be used as the target distribution for each of the biorep-level samples being downsampled together.

``biorepToSize``
    Controls the number of contacts each biorep-level sample in the aggregate profile will be downsampled to. If a float from 0-1, downsamples to approximately that fraction of the original size. If an integer greater than 1, downsamples to that number of contacts.

``conditionDownsamplePairs``
    If true, condition-level samples will be downsampled in a manner controlled by the following parameters.

``conditionCisStrata``
    For downsampling, defines a partition over distance strata for contacts mapping to the same chromosome, which is used to homogenize the number of contacts within each stratum across the condition-level samples being downsampled together.

``conditionReadConjuncts``
    For downsampling, selects which fields will be used to partition condition-level contacts for downsampling.

``conditionDownsampleToMeanDistribution``
    If true, then during downsampling, the mean fraction of contacts in each block in the partition will be used as the target distribution for each of the condition-level samples being downsampled together.

``conditionToSize``
    Controls the number of contacts each condition-level sample in the aggregate profile will be downsampled to. If a float from 0-1, downsamples to approximately that fraction of the original size. If an integer greater than 1, downsamples to that number of contacts.

**Merging**

``mergeTechrepToBiorep``
    If true, techreps with the same condition, biorep, and aggregate profile will be merged to create a biorep-level sample. The techrep-level samples will be retained for further processing as well. If specified, techrep-level downsampling will occur prior to the merge, while deduplication occurs after the merge.

``mergeBiorepToCondition``
    If true, bioreps with the same condition and aggregate profile will be merged to create a condition-level sample. The biorep-level samples will be retained for further processing as well. If specified, biorep-level downsampling will occur prior to the merge, while deduplication occurs after the merge.

``mergeCondition``
    If true, condition-level samples with the same aggregate profile will be merged to create a new sample. The condition-level samples will be retained for further processing as well.
    If specified, condition-level downsampling will occur prior to the merge, while deduplication occurs after the merge.

Creating contact matrices
,,,,,,,,,,,,,,,,,,,,,,,,,

``matrix``
    A code block defining which contact matrix formats will be produced.

``matrix.makeMcoolFileFormat``
    Part of the ``matrix`` code block. If true, then an `mcool-format `_ multi-resolution cooler file will be created. Creating this file format is necessary for calling insulation and compartment scores with Hich due to its dependence on the ``cooler`` library.

``matrix.makeHicFileFormat``
    Part of the ``matrix`` code block. If true, then a `hic-format `_ Hi-C file will be created. This is not necessary for Hich, but an advantage of the .hic format over the .mcool format is that it allows retrieving expected and o/e values.

``matrix.resolutions``
    Part of the ``matrix`` code block. A list of resolutions to produce.

``juicerToolsPreParams``
    Not part of the ``matrix`` code block. Additional parameters passed to the `juicer tools pre `_ command.

``coolerCloadParams``
    Not part of the ``matrix`` code block. Additional parameters passed to `cooler cload pairs `_, which is used to generate the highest-resolution `cool-format `_ file that serves as the input used to create the mcool-format file.

``coolerZoomifyParams``
    Not part of the ``matrix`` code block. Additional parameters passed to `cooler zoomify `_ to create the mcool-format file from the cool-format file.

Generating multiQC reports on Hi-C contacts
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

``general.qcAfter``
    List of steps after which a step-specific multiQC report will be generated using multiQC's ``pairtools`` module.

Sample selection strategies
,,,,,,,,,,,,,,,,,,,,,,,,,,,

Sample selection strategies are used to specify which samples feature calling is performed on.

``sampleSelectionStrategies``
    Hich-config block. Keys are names of sample selection strategies.
    Each strategy is a hashmap in which keys are sample attribute names and values are the acceptable values of those attributes; a sample is selected only if its attributes take acceptable values.

Calling HiCRep SCC scores
,,,,,,,,,,,,,,,,,,,,,,,,,

HiCRep SCC scores will be called on all pairs of samples in the sample selection strategy, using all combinations of ``resolutions``, ``h``, ``dBPMax``, and ``bDownSample`` specified in each parameterization. The result is output as a single TSV file associating the input sample pairs, chromosome, and parameterization with the resulting SCC score.

``hicrep``
    Hich-config block. Keys are names of hicrep parameterizations. Values are the parameter names and values to be used for that call to hicrep, which can include ``resolutions``, ``chroms``, ``exclude``, ``chromFilter``, ``h``, ``dBPMax``, ``bDownSample``, and ``sampleSelectionStrategy``.

Calling compartment scores
,,,,,,,,,,,,,,,,,,,,,,,,,,

Compartment scores (bounded by [-1, 1], with positive values being more gene dense than negative values) will be generated for each parameterization on the samples matching its sample selection strategy.

``compartments``
    Hich-config block. Keys are names of compartment-calling parameterizations. Values are the parameter names and values to be used for that parameterization of compartment calling, which can include ``resolution``, ``hichCompartmentsParams``, and ``sampleSelectionStrategy``.

Calling insulation scores
,,,,,,,,,,,,,,,,,,,,,,,,,

Insulation scores will be generated for each parameterization on the samples matching its sample selection strategy.

``insulation``
    Hich-config block. Keys are names of insulation score-calling parameterizations. Values are the parameter names and values to be used for that parameterization of insulation score calling, which can include ``sampleSelectionStrategy``.

Calling loops
,,,,,,,,,,,,,

Mustache loop calls will be generated for each parameterization on the samples matching its sample selection strategy.

``loops``
    Hich-config block.
    Keys are names of Mustache loop-calling parameterizations. Values are the parameter names and values to be used for that parameterization of loop calling, which can include ``sampleSelectionStrategy``.

Calling differential loop enrichments (diffloops)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Mustache diffloop calls will be generated for each parameterization on all pairs of samples matching its sample selection strategy.

``differentialLoops``
    Hich-config block. Keys are names of Mustache diffloop-calling parameterizations. Values are the parameter names and values to be used for that parameterization of diffloop calling, which can include ``sampleSelectionStrategy``.

Recent outputs
,,,,,,,,,,,,,,

``latest``

``latestSambam``

``latestPairs``

``latestMatrix``

Hich sample attributes built automatically (not typically manually specified by user)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

``isSingleCell``