Hich Reference

Sample file

Normally, the sample file is called “samples.tsv” (tab-delimited). Basic sample attributes are usually specified here. Default sample attributes are customizable and can also be set for individual sample ids in nextflow.config, which is useful for specifying defaults for the biorep and condition samples produced via merge.
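As a rough illustration of this layout, the following Python sketch reads a tab-delimited sample file into one dictionary per sample. The rows and file names are hypothetical; only the column names are taken from the sample attributes documented in this reference, and this is not how Hich itself reads the file.

```python
import csv
import io

# Hypothetical sample file contents; a real file would live on disk as samples.tsv.
SAMPLES_TSV = (
    "condition\tbiorep\ttechrep\tassembly\tfastq1\tfastq2\n"
    "KO\t1\t1\thg38\tko_1_1_R1.fq.gz\tko_1_1_R2.fq.gz\n"
    "KO\t1\t2\thg38\tko_1_2_R1.fq.gz\tko_1_2_R2.fq.gz\n"
    "NT\t1\t1\thg38\tnt_1_1_R1.fq.gz\tnt_1_1_R2.fq.gz\n"
)


def read_samples(handle, sep="\t"):
    """Return one dict per row, dropping empty cells so defaults can apply later."""
    reader = csv.DictReader(handle, delimiter=sep)
    return [{key: value for key, value in row.items() if value} for row in reader]


samples = read_samples(io.StringIO(SAMPLES_TSV))
```

Dropping empty cells keeps unspecified attributes out of the dictionary, so a later step can fill them from the defaults.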

Example 1. Because reference, chromsizes, index_dir, index_prefix, and fragfile files are unspecified and the assembly values are supported, Hich will download the reference and produce these needed files automatically.

[Table: tables/reference_samplefile1.tsv (tab-delimited, one header row)]

Example 2. Here, needed reference files are given (possibly from a permanent lab repository), so they will be used rather than produced by Hich. Because there’s just one sample, there is no need to specify a biorep or techrep parameter.

Example 3. Here Hich is ingesting files in several formats, autodetecting the datatype.

Example 4. An experiment using a variety of enzymes for reference digestion and fragment tagging, as well as one sample not tagged or filtered (MNase).

nextflow.config

The nextflow.config file is one way to configure Nextflow, including by setting Hich-specific sample attributes. All sample attributes are described in this section.

Scopes

Hich uses specialized config scopes, each specified with a name followed by curly braces, to group related sample attributes and general Hich workflow parameters. Here is an example with a subset of the real Hich default nextflow.config file and an extra scope used to specify parameters for a merge.

params {
    general {
        // The general scope holds params
        // relevant to general Hich workflow
        // control, not sample attributes.

        sampleFile {
            // Path to sample file
            // and column separator.
            filename = "samples.tsv"
            sep = "\t"
        }
    }

    defaults {
        // The defaults scope gives default
        // sample attributes applied to all
        // samples if an explicit value is not
        // given in samples.tsv or in a scope
        // specific to the sample's id.

        // Default techrep and biorep labels
        // applied to any samples where they
        // are not specified in samples.tsv
        techrep = 1
        biorep = 1

        // Minimum mapq threshold to keep reads
        minMapq = 30
    }

    ko {
        // Apply these parameters to samples
        // with the ids "KO" and "NT"
        ids = ["KO", "NT"]
        hicrep {
            exclude = ["chrM"]
        }
    }
}

general

last_step

Specifies the last processing step that should be executed when nextflow run hich.nf is invoked (as a stub, humid or full run). QC for that step will also be completed. Useful for test runs, debugging, and making processing decisions based on QC results. Commented out by default.

params {
    general {
        //last_step = "align"
    }
}

sampleFile

The filename and column separator for the sample file. The filename param can contain a path relative to the Nextflow projectDir.

params {
    general {
        sampleFile {
            filename = "samples.tsv"
            sep = "\t"
        }
    }
}

publish

Specifies the Nextflow publishDir mode and output directory for the results of various Hich processes.

params {
    general {
        publish {
            // Nextflow publishDir param for all processes
            // https://www.nextflow.io/docs/latest/process.html#publishdir
            mode = "copy"

            // Where to publish results of Hich processes
            genome = "resources/.hich"
            chromsizes = "resources/.hich"
            bwa_mem2_index = "resources/.hich/bwa-mem2/index"
            bwa_mem_index = "resources/.hich/bwa-mem/index"
            digest = "resources/.hich"

            bam = "results/bam"
            parse = "results/pairs/parse"
            dedup = "results/pairs/dedup"
            mcool = "results/matrix/mcool"
            hic = "results/matrix/hic"
            pairStats = "results/pairStats"
            qc = "results/qc"
        }
    }
}

qcAfter

params {
    general {
        // After these steps, generate read-level pairs
        // stats files and generate a combined MultiQC report
        // for all samples at each processing stage
        qcAfter = ["Parse",
                    "IngestPairs",
                    "OptionalFragtag",
                    "TechrepsToBioreps",
                    "Deduplicate",
                    "BiorepsToConditions",
                    "Select"]
    }
}

humid

params {
    general {
        // Number of reads to downsample to
        // when doing a humid run
        humid {
            n_reads = 100000
        }
    }
}

defaults

All sample attributes specified under this scope will be applied to any samples for which a value is not given in the sample file or one of the custom scopes.

custom scopes

Custom scopes work just like the defaults scope, except that they have a special ids list specifying the set of ids to which they should be applied. Custom scopes override the values in the sample file.
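The precedence described here (defaults, then the sample file, then custom scopes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Hich's implementation: the function name and the flat dictionaries are hypothetical, and nested scopes such as hicrep would need a deeper merge than this shallow update.

```python
def resolve_attributes(sample_row, defaults, custom_scopes):
    """Merge attribute sources from lowest to highest priority."""
    resolved = dict(defaults)       # lowest priority: the defaults scope
    resolved.update(sample_row)     # the sample file overrides defaults
    for scope in custom_scopes:
        if sample_row.get("id") in scope["ids"]:
            # A matching custom scope overrides the sample file.
            overrides = {k: v for k, v in scope.items() if k != "ids"}
            resolved.update(overrides)
    return resolved


defaults = {"techrep": 1, "biorep": 1, "minMapq": 30}
ko_scope = {"ids": ["KO", "NT"], "minMapq": 10}   # hypothetical override

ko = resolve_attributes({"id": "KO", "minMapq": 20}, defaults, [ko_scope])
wt = resolve_attributes({"id": "WT", "minMapq": 20}, defaults, [ko_scope])
```

Here the "KO" sample ends up with the custom scope's minMapq, while "WT" keeps the value from its sample file row.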

Sample attributes

In Hich, a sample is a single unit of data, such as a technical replicate, biological replicate, or experimental condition. Each sample has a number of sample attributes. These can be specified via columns in the sample file, or applied to a subset of sample ids via the nextflow.config file (or anywhere Nextflow can be configured, including directly at the command line).

Basic

condition

Required (no default)

A label for the condition. Biological replicates with the same condition label will be merged into a condition sample.

biorep

Required (default = 1)

A label for the biological replicate. Technical replicates with the same condition and biorep labels will be merged into a biorep sample. Note that Hich does not increment the default value, so it is essential to explicitly specify a biorep label if a value other than 1 is desired.

techrep

Required (default = 1)

A label for the technical replicate. Note that Hich does not increment the default value, so it is essential to explicitly specify a techrep label if a value other than 1 is desired.
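To make the merge hierarchy concrete, here is a small Python sketch that groups technical replicates into biological replicates and biological replicates into conditions using the condition, biorep, and techrep labels. The function name and sample rows are hypothetical, and the fallback to 1 mirrors the documented defaults.

```python
from collections import defaultdict


def plan_merges(samples):
    """Group techreps into bioreps, and bioreps into conditions."""
    bioreps = defaultdict(list)
    for sample in samples:
        key = (sample["condition"], sample.get("biorep", 1))
        bioreps[key].append(sample.get("techrep", 1))
    conditions = defaultdict(list)
    for condition, biorep in bioreps:
        conditions[condition].append(biorep)
    return dict(bioreps), dict(conditions)


samples = [
    {"condition": "KO", "biorep": 1, "techrep": 1},
    {"condition": "KO", "biorep": 1, "techrep": 2},
    {"condition": "KO", "biorep": 2, "techrep": 1},
    {"condition": "NT"},  # biorep and techrep fall back to the default of 1
]
bioreps, conditions = plan_merges(samples)
```

Because the default is never incremented, two rows that both omit techrep would land in the same group with the same label, which is why explicit labels matter.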

assembly

Required (no default)

The name of the genome assembly for the sample, such as hg38.

fastq1 and fastq2

Optional, but one of fastq1 and fastq2, sambam, or pairs must be specified for each sample as these are the data files Hich ingests.

The fastq1 and fastq2 attributes are two separate columns in the sample file, each specifying the path to one of two paired-end .fastq-format files which can be gzipped.

sambam

Optional, but one of fastq1 and fastq2, sambam, or pairs must be specified for each sample as these are the data files Hich ingests.

Specifies a .sam or .bam format file to ingest. Hich automatically sorts it by read name before parsing it to .pairs format, because name-sorted input is required for correct parsing.

pairs

Optional, but one of fastq1 and fastq2, sambam, or pairs must be specified for each sample as these are the data files Hich ingests.

Specifies a 4DN .pairs format file to ingest.

reference

Required (no default, but can be downloaded automatically)

The reference genome file for the sample. Hich will automatically download a genome reference if not provided for the following assemblies:

hg38, homo_sapiens, GRCh38
mm10
dm6
galGal5
bGalGal5
danRer11

chromsizes

Required (no default, but can be built automatically)

The chromsizes file for the reference genome, a two-column list of contig names and contig sizes in bp. Built automatically from the reference genome if not specified for a given sample.
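The two-column chromsizes layout can be illustrated with a short Python sketch that derives contig names and sizes from FASTA text. This is only a sketch of the file format; the tool Hich actually uses to build the file is not stated here, and the toy sequence is hypothetical.

```python
import io


def chromsizes_from_fasta(handle):
    """Map each contig name to its total sequence length in bp."""
    sizes = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word of the header line.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


toy_fasta = io.StringIO(">chr1 toy contig\nACGTACGT\nACGT\n>chrM\nACG\n")
sizes = chromsizes_from_fasta(toy_fasta)
chromsizes_lines = [f"{name}\t{size}" for name, size in sizes.items()]
```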

minMapq

Not required
Default: 30

Minimum mapping quality (MAPQ) an aligned read must have to be kept.

datatype

Required, but typically autodetected.

Options:

fastq (default): autodetected if “fastq1” and “fastq2” are specified but “sambam” and “pairs” are not.
sambam: autodetected if “sambam” is specified but “fastq1”, “fastq2”, and “pairs” are not.
pairs: autodetected if “pairs” is specified but “fastq1”, “fastq2”, and “sambam” are not.

The format for input read data.

Note: Hich can read data compressed in gzip format, but gzip compression does not need to be explicitly specified.
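The autodetection rules listed above can be written as a small Python function. This is a sketch under stated assumptions: the function name is hypothetical, and raising an error for ambiguous or missing inputs is an assumption, since the behavior in those cases is not described here.

```python
def detect_datatype(sample):
    """Infer the datatype from which input columns are filled in."""
    if sample.get("datatype"):
        return sample["datatype"]   # an explicit value wins
    both_fastq = bool(sample.get("fastq1")) and bool(sample.get("fastq2"))
    any_fastq = bool(sample.get("fastq1")) or bool(sample.get("fastq2"))
    sambam = bool(sample.get("sambam"))
    pairs = bool(sample.get("pairs"))
    if both_fastq and not sambam and not pairs:
        return "fastq"
    if sambam and not any_fastq and not pairs:
        return "sambam"
    if pairs and not any_fastq and not sambam:
        return "pairs"
    # Assumption: ambiguous or empty input is treated as an error.
    raise ValueError("cannot autodetect datatype for sample")


fastq_type = detect_datatype({"fastq1": "r1.fq.gz", "fastq2": "r2.fq.gz"})
sambam_type = detect_datatype({"sambam": "aligned.bam"})
pairs_type = detect_datatype({"pairs": "contacts.pairs.gz"})
```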

id

Required (defaults to {condition}_{biorep}_{techrep})

A unique id label for the sample.
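The default id pattern, and the way un-incremented replicate defaults can produce colliding ids, can be sketched in Python. The helper names are hypothetical and this is an illustration only, not Hich's own code.

```python
from collections import Counter


def sample_id(sample):
    """Use the explicit id, else build {condition}_{biorep}_{techrep}."""
    if sample.get("id"):
        return sample["id"]
    return "{}_{}_{}".format(
        sample["condition"],
        sample.get("biorep", 1),    # documented default
        sample.get("techrep", 1),   # documented default
    )


def duplicate_ids(samples):
    """Return ids that occur more than once, which would not be unique."""
    counts = Counter(sample_id(sample) for sample in samples)
    return sorted(name for name, n in counts.items() if n > 1)


explicit = sample_id({"condition": "KO", "biorep": 2, "techrep": 1})
collisions = duplicate_ids([{"condition": "KO"}, {"condition": "KO"}])
```

Two rows for the same condition that both omit biorep and techrep resolve to the same id, so such rows need explicit labels or explicit ids.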

Alignment

aligner

Required if datatype == fastq

Available options:

bwa
bwa-mem2 default

While bwa-mem2 is 1.3-3x faster, indexing genomes with bwa-mem2 requires a 60-80 GB memory footprint, whereas indexing with bwa can be done in less than 32 GB.

index_dir

Not required

Directory where the aligner-specific reference genome index files are stored. Each file should start with the same index_prefix. If not specified, Hich will attempt to index the reference genome and will output the result to resources/.hich under a subdirectory for the specific aligner.

index_prefix

Not required

Prefix shared by all needed aligner-specific reference genome index files in the index_dir directory. If not specified, Hich will attempt to index the reference genome and will output the result to resources/.hich under a subdirectory for the specific aligner.

alignerThreads

Default: 10

Maximum number of threads to use for alignment. It is highly recommended to set this to the maximum number of available cores. Note that only one alignment process is spawned at a time. This is because every aligner Hich uses (BWA MEM and BWA MEM2) is internally parallelized, so there is no substantial performance gain from running multiple alignment processes in parallel, while the substantial memory footprint would be duplicated for each aligner instance being run.

bwaFlags

Default: -SP5M

Flags to use for the aligner bwa mem or bwa-mem2 mem. The default -SP5M is recommended by 4DN for aligning paired-end Hi-C reads with bwa mem or bwa-mem2 mem. See the bwa manual reference page for additional options.
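To show how the alignment attributes might fit together, the Python sketch below assembles a bwa mem or bwa-mem2 mem argument list from aligner, index_dir, index_prefix, alignerThreads, and bwaFlags. The exact command line Hich runs is an assumption here; the function name, file names, and the "hg38" prefix are hypothetical. Only the general form "mem [options] index reads1 reads2" and the -t thread option come from the bwa command-line interface.

```python
import shlex


def bwa_command(aligner, index_dir, index_prefix, fastq1, fastq2,
                threads=10, flags="-SP5M"):
    """Build an argument list for bwa mem or bwa-mem2 mem."""
    if aligner not in ("bwa", "bwa-mem2"):
        raise ValueError(f"unsupported aligner: {aligner}")
    # The index is addressed as <index_dir>/<index_prefix>.
    index = f"{index_dir.rstrip('/')}/{index_prefix}"
    return [aligner, "mem", *shlex.split(flags), "-t", str(threads),
            index, fastq1, fastq2]


command = bwa_command(
    "bwa-mem2", "resources/.hich/bwa-mem2/index", "hg38",
    "sample_R1.fq.gz", "sample_R2.fq.gz",
)
```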

Pairs processing

enzymes

Default: none

If restriction enzymes were used to digest the sample, they can be listed here. Hich allows specifying “Arima” for the Arima Hi-C+ kit enzymes. Any enzyme or combination of enzymes in Biopython’s Bio.Restriction module can be used. Multiple enzymes should be separated by commas (,). If specified, a “fragment index” (a digest of the reference genome using the enzymes, in .bed format) will be produced automatically, used to tag the ends of each read with the restriction fragment it maps to, and then used to filter out any reads where both ends map to the same restriction fragment. If not specified, none of these steps occur. See fragfile for how to use an already-created fragment index.
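The idea of a fragment index can be illustrated with a pure-Python toy digest. This is a sketch only: Hich is described as relying on Biopython for real enzymes, whereas this example hardcodes a single GATC site cut immediately before the G (the pattern of DpnII and MboI) and uses a hypothetical 16 bp sequence.

```python
def digest(chrom, sequence, site="GATC", cut_offset=0):
    """Return BED-like (chrom, start, end) fragments, 0-based half-open."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    # Cuts at the very ends of the sequence would create empty fragments.
    inner = [cut for cut in cuts if 0 < cut < len(sequence)]
    bounds = [0] + inner + [len(sequence)]
    return [(chrom, start, end) for start, end in zip(bounds, bounds[1:])]


fragments = digest("chrToy", "AAGATCTTTTGATCCC")
bed_lines = [f"{chrom}\t{start}\t{end}" for chrom, start, end in fragments]
```

Each line of the resulting .bed-style output is one restriction fragment, and together the fragments tile the whole contig.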

fragfile

Default: none

An already-created fragment index in .bed format, to be used for tagging contacts with the fragment from which each end originated if the enzymes parameter is specified for the sample.

deduplicate

Options:

true default
false

Whether to remove technical duplicates (i.e. PCR or optical duplicates). Deduplication is applied to biological replicates, either after forming them from non-deduplicated technical replicates or after ingesting them directly into Hich. Technical replicates themselves are deduplicated only after they have been used to form the merged biological replicates.

pairsFormat

chrom1

Required
Default: 2

The column in the .pairs file where the first chromosome is labeled for each read.

pos1

Required
Default: 3

The column in the .pairs file where the first base pair position is labeled for each read.

chrom2

Required
Default: 4

The column in the .pairs file where the second chromosome is labeled for each read.

pos2

Required
Default: 5

The column in the .pairs file where the second base pair position is labeled for each read.
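The column settings can be illustrated with a Python sketch that pulls the two chromosome and position fields out of one .pairs line. It assumes the column numbers are 1-based, which is consistent with the defaults matching the standard 4DN layout (readID, chrom1, pos1, chrom2, pos2, strand1, strand2). The function name and example line are hypothetical.

```python
def parse_pairs_line(line, chrom1=2, pos1=3, chrom2=4, pos2=5):
    """Extract contact coordinates using 1-based column numbers."""
    fields = line.rstrip("\n").split("\t")
    return (
        fields[chrom1 - 1],
        int(fields[pos1 - 1]),
        fields[chrom2 - 1],
        int(fields[pos2 - 1]),
    )


line = "read1\tchr1\t1000\tchr2\t5000\t+\t-\n"
contact = parse_pairs_line(line)
```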

parseParams

Default:

--flip
--drop-readid
--drop-seq
--drop-sam

Extra parameters to use for parsing .sam/.bam alignments into .pairs format.

Note: The drop-* parameters are among the most impactful for making Hich fast and giving it a low disk footprint. Removing them is not recommended unless you know what you are doing, although additional parameters can be added.

pairtoolsDedupParams

Extra parameters to use during the deduplication step.

pairtoolsSelectParams

Extra parameters to use during the selection step.

selectFilters

Read-level filters to use during the selection step.

keepPairTypes

Default: UU, UR, RU

U denotes a uniquely aligned read, whereas R denotes a read that was “rescued”: one side of the pair maps to locus 1, while the other side maps partly to a slightly different position on locus 1 and partly to locus 2. This is the classic “split ligation junction” pattern, which represents an observed, rather than inferred, ligation junction.

keepTrans

Options:

true default
false

Whether to keep interchromosomal (“trans”) contacts. Leave this as true when forming .mcool files with the default trans-only option, which normalizes contact matrices based exclusively on trans contacts; in some cases this is thought to yield more biologically representative results.

keepCis

Options:

true default
false

Whether to keep intrachromosomal (“cis”) contacts.
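The pair type and cis/trans filters can be sketched as a single Python predicate. This is an illustration of the filter logic only, with hypothetical names and dictionary-shaped pairs; it does not show how Hich passes these filters to the selection step.

```python
def keep_pair(pair, keep_pair_types=("UU", "UR", "RU"),
              keep_trans=True, keep_cis=True):
    """Apply the pair type filter, then the cis/trans filter."""
    if pair["pair_type"] not in keep_pair_types:
        return False
    is_cis = pair["chrom1"] == pair["chrom2"]
    return keep_cis if is_cis else keep_trans


cis_pair = {"chrom1": "chr1", "chrom2": "chr1", "pair_type": "UU"}
trans_pair = {"chrom1": "chr1", "chrom2": "chr2", "pair_type": "UR"}
multi_pair = {"chrom1": "chr1", "chrom2": "chr1", "pair_type": "MM"}
```

With the defaults, both the cis and the trans example are kept and the "MM" pair is dropped; setting keep_trans=False would additionally drop the trans example.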

minDistFR

Default: 1000

Minimum insert size (in bp) to keep FR (+- or inward) strands. In Hi-C, the set of short-range FR strands can be highly enriched in undigested chromatin, which shows up in Hich’s MultiQC report as a percentage of FR orientations substantially higher than the expected 25%. These can be filtered out using this option.

minDistRF

Default: 1000

Minimum insert size (in bp) to keep RF (-+ or outward) strands. In Hi-C, the set of short-range RF strands can be highly enriched in self-circles (digested fragments that self-ligated end to end), which shows up in Hich’s MultiQC report as a percentage of RF orientations substantially higher than the expected 25%. These can be filtered out using this option.

minDistFF

Default: 0

Minimum insert size (in bp) to keep FF (++) strands.

minDistRR

Default: 0

Minimum insert size (in bp) to keep RR (--) strands.
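The per-orientation distance thresholds can be sketched in Python as follows. Two details are assumptions rather than documented behavior: the comparison is inclusive (a pair at exactly the threshold is kept), and trans pairs are exempt because they have no insert size. The names are hypothetical.

```python
# Documented defaults: FR and RF need at least 1000 bp, FF and RR have no minimum.
MIN_DIST = {"+-": 1000, "-+": 1000, "++": 0, "--": 0}


def passes_min_dist(pair, min_dist=MIN_DIST):
    """Check a pair against the minimum distance for its strand orientation."""
    if pair["chrom1"] != pair["chrom2"]:
        return True  # assumption: distance filters apply to cis pairs only
    orientation = pair["strand1"] + pair["strand2"]
    distance = abs(pair["pos2"] - pair["pos1"])
    return distance >= min_dist[orientation]


short_fr = {"chrom1": "chr1", "pos1": 100, "strand1": "+",
            "chrom2": "chr1", "pos2": 400, "strand2": "-"}
short_ff = dict(short_fr, strand2="+")
long_rf = {"chrom1": "chr1", "pos1": 100, "strand1": "-",
           "chrom2": "chr1", "pos2": 5100, "strand2": "+"}
```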

chroms

If specified, each read alignment must be to a chromosome in this set.

discardSingleFrag

Options:

true default
false

If true, read pairs whose ends have been mapped to restriction fragments will be discarded when both ends map to the same restriction fragment.
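A minimal Python sketch of the same-fragment test uses binary search over sorted fragment start positions. The fragment starts are hypothetical, both ends are assumed to lie on the same chromosome, and this is not Hich's implementation.

```python
from bisect import bisect_right


def fragment_index(frag_starts, pos):
    """Index of the fragment containing pos, given sorted 0-based starts."""
    return bisect_right(frag_starts, pos) - 1


def same_fragment(frag_starts, pos1, pos2):
    """True when both ends fall in the same restriction fragment."""
    return fragment_index(frag_starts, pos1) == fragment_index(frag_starts, pos2)


frag_starts = [0, 2, 10]   # hypothetical fragments: [0, 2), [2, 10), [10, ...)
discard = same_fragment(frag_starts, 3, 9)    # both ends in fragment 1
keep = not same_fragment(frag_starts, 3, 12)  # ends in fragments 1 and 2
```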

Matrix processing

juicerToolsPreParams

Arguments supplied to juicer tools’ pre command when forming a Hi-C contact matrix.

coolerCloadParams

Arguments supplied to the cooler cload command for forming .cool format precursors to the .mcool contact matrix.

coolerZoomifyParams

Default:

--balance
--balance-args 'max-iters 2000 --trans-only'

Arguments supplied to the cooler zoomify command for coarsening high-res .cool matrices into multi-resolution .mcool contact matrices. The chosen defaults will generate multi-res contact matrices containing both the raw contacts and balancing weights produced using the trans contacts only.

matrix

makeMcoolFileFormat

Options:

true default
false

Whether to produce .mcool-format contact matrices (the Open2C multi-resolution format). Currently required for feature calling and QC.

makeHicFileFormat

Options:

true default
false default

Whether to produce .hic-format contact matrices (compatible with the Juicer tool ecosystem including the Juicebox browser).

resolutions

Default:

1000
2000
5000
10000
20000
50000
100000
200000
500000
1000000

Reference chromosome coordinates will be partitioned into these uniform block sizes (in bp) and contact ends mapped to those blocks to generate contact matrices. Lower numbers represent higher-resolution matrices.
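The binning step can be illustrated with a short Python sketch that counts contacts per pair of bins at two resolutions. It is a toy example on a single hypothetical chromosome with made-up positions; real matrices are built by the matrix tools named in this section.

```python
from collections import Counter


def bin_contacts(contacts, resolution):
    """Count contacts per (bin1, bin2) for uniform bins of `resolution` bp."""
    counts = Counter()
    for pos1, pos2 in contacts:
        counts[(pos1 // resolution, pos2 // resolution)] += 1
    return counts


contacts = [(1500, 20500), (1900, 20100), (1500, 95000)]
fine = bin_contacts(contacts, 1000)      # higher resolution, smaller bins
coarse = bin_contacts(contacts, 10000)   # lower resolution, larger bins
```

The same three contacts fall into the same number of distinct bin pairs here, but the bin indices shrink tenfold at the coarser resolution because each bin covers ten times as many base pairs.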

Quality control

hicrep

call_on

Options:

is_techrep default
is_biorep default
is_condition default

The sample types on which to compute Hicrep SCC scores: technical replicates, biological replicates, and conditions. Results for all comparisons are written to a single .tsv file whose columns give the pair of samples, chromosome, resolution, and Hicrep parameters that were used, along with the SCC score.

resolutions

Default:

10000
100000
1000000

Which resolutions to use for calling Hicrep SCC scores.

chroms

Which chromosomes to use for calling Hicrep SCC scores. If not specified, all chromosomes shared by both matrices at the given resolution will be used.

exclude

Which chromosomes to exclude when calling Hicrep SCC scores.

chromFilter

A Python conditional expression that determines whether to use a chromosome for Hicrep as a function of its name (referenced via the chrom variable) and size (the size variable). It will be evaluated using Python’s eval function.
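The following Python sketch shows how an expression over chrom and size can be evaluated with eval to select chromosomes. How Hich combines chroms, exclude, and chromFilter is an assumption here, as are the function name and the "chrUn_toy" contig. Because eval runs arbitrary code, the expression should only come from a trusted config.

```python
def select_chroms(chromsizes, chrom_filter="True", chroms=None, exclude=()):
    """Keep chromosomes that pass the chroms, exclude, and filter checks."""
    selected = []
    for chrom, size in chromsizes.items():
        if chroms is not None and chrom not in chroms:
            continue
        if chrom in exclude:
            continue
        # The expression sees only the names `chrom` and `size`.
        if eval(chrom_filter, {"__builtins__": {}}, {"chrom": chrom, "size": size}):
            selected.append(chrom)
    return selected


chromsizes = {"chr1": 248956422, "chrM": 16569, "chrUn_toy": 5000}
large_only = select_chroms(chromsizes, chrom_filter="size >= 1000000")
no_mito = select_chroms(chromsizes, exclude=["chrM"])
named = select_chroms(chromsizes, chrom_filter="chrom.startswith('chrUn')")
```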

h

Values of Hicrep’s h parameter to use.

dBPMax

Values of Hicrep’s dBPMax parameter to use.

bDownSample

Values of Hicrep’s bDownSample parameter to use.

Feature calling

compartments

resolution

Default: 5000

The resolution at which compartments should be called.

cooltools_eigs_cis_params

Defaults:

--bigwig

Additional parameters that should be passed to cooltools eigs-cis. The default specifies that a .bigwig-format file should be generated as well as the .bedgraph format.

insulation

resolution

Default: 5000

The resolution at which insulation should be called.

cooltoolsInsulationParams

Defaults:

--bigwig

Additional parameters that should be passed to cooltools insulation. The default specifies that a .bigwig-format file should be generated as well as the .bedgraph format.

loops

Hich uses Mustache for loop and differential loop calling. This software was chosen mainly for its theoretical advantages. Based on scale space theory, it applies an artifact-free filter to a matrix to remove fine details, then detects blobs which are called as loops. It thereby takes local information into account in loop calling. Differential loops for a pair of matrices are loops that are present or enriched in one matrix and not present or depleted in the other. An added practical benefit is that Mustache is fast enough that it can run on a CPU, whereas many other loop callers require a GPU.

call_on

Options:

is_techrep default
is_biorep default
is_condition default

The sample types on which to call loops: technical replicates, biological replicates, and conditions.

use_format

Options:

mcool default
hic

Mustache can use both .mcool and .hic matrix formats as input. Loops will only be called on samples where the appropriate matrix type is output. If both are generated, the choice of format should not affect the outcome.

mustache_params

Default:

--resolution 5000
--pThreshold .1
--sparsityThreshold .88

Parameters passed to mustache_diffloops, which will output both individual matrix loop calls and a pair of diffloops calls for each matrix.