Sample File Setup
========================

In Hich, a **sample** is a set of key-value attributes that influence how Hich will process a specific input data file. Hich assembles the description of a sample from user inputs, such as a :ref:`sample file<The sample file>`, :doc:`params file <params_file>`, and :doc:`command line arguments <cli_params>`. Command-line parameters control execution and can specify samples for simpler workflows. The sample file specifies samples for more complex workflows. The params file allows defining sample attributes applying to all samples, as well as describing how to aggregate and call features over samples.

Hich might use the following sample file:

.. csv-table::
    :file: ../tables/sample_file_demo/sample_file_demo.tsv
    :delim: tab
    :header-rows: 1

Hich combines this sample file with parameters specified in a params file such as ``params/onerep_bulk.yml`` to create a hashmap holding every attribute needed to process the sample correctly. You do not need to understand Hich's implementation in detail to use the software, but it may help to see part of the hashmap that results from combining the sample file above with the params file.

.. code-block:: groovy

    [id: "Condition1_1_1", condition: "Condition1", biorep: "1", techrep: "1", fastq1: "data/c1_r1.fq.gz", fastq2: "data/c2_r2.fq.gz", assembly: "hg38", aligner: "bwa", restrictionEnzymes: "DdeI,DpnII", bwaFlags: "-SP5M", minMapq: 30, ...]

A sample can represent a technical replicate, biological replicate, or a condition. Input samples are typically defined in a sample file, although they can also be specified at the command line for convenience in simple workflows. When samples are subjected to an aggregation protocol (including downsampling, merging, or deduplication), a new sample based on them is automatically created by Hich. Hich can be configured in the :doc:`params file <params_file>` to retain the original samples or drop them and carry on with just those created by aggregation. See that section for details.

The sample file
-----------------------

The sample file is a columnar, typically tab-delimited (.tsv) file. The frequently used sample file columns fall into the groups of related columns described below. Not all of them are required, for several reasons. Some, like ``biorep``, ``techrep``, and ``aligner``, are often set by default using the :doc:`params file <params_file>`. The reference files can often be auto-generated by Hich based on the processing columns. Some data inputs do not require certain reference files, such as when the data input is a pairs file or contact matrix, or when the data was generated by an assay that uses a nuclease rather than restriction enzymes, like :doc:`micro-C <digests>`.

In the sample file, attribute names are specified as case-sensitive column headers. Blank column headers are not allowed.
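
As a rough illustration of these rules (not Hich's own parser), the following Python sketch reads a tab-delimited sample file into one dictionary per row and rejects blank column headers. The function name ``read_sample_file``, the choice to drop blank cells, and the inline demo data are all hypothetical.

.. code-block:: python

    import csv
    import io


    def read_sample_file(handle, delimiter="\t"):
        """Return one dict per row, keyed by the case-sensitive column headers."""
        reader = csv.reader(handle, delimiter=delimiter)
        headers = next(reader)
        if any(not header.strip() for header in headers):
            raise ValueError("blank column headers are not allowed")
        samples = []
        for row in reader:
            if not row:
                continue  # skip completely empty lines
            # Assumption: a blank cell means the attribute is not set for this sample.
            samples.append({h: v for h, v in zip(headers, row) if v != ""})
        return samples


    # Hypothetical two-row sample file, held in memory for the demo.
    demo = (
        "condition\tbiorep\ttechrep\tfastq1\tfastq2\tbam\n"
        "C1\t1\t1\tdata/a_1.fq.gz\tdata/a_2.fq.gz\t\n"
        "C1\t1\t2\t\t\tdata/b.bam\n"
    )
    samples = read_sample_file(io.StringIO(demo))
    print(sorted(samples[1]))  # ['bam', 'biorep', 'condition', 'techrep']

Because headers are compared as-is, ``Condition`` and ``condition`` would be treated as two different attributes in this sketch.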

ID columns
..........................................

Column names: ``condition``, ``biorep``, ``techrep``, and optionally ``id``

All samples must have a unique ``id`` attribute. Samples typically also have ``condition``, ``biorep``, and ``techrep`` attributes. The included params files set both ``biorep`` and ``techrep`` to "1" if they are missing. No default is applied to ``condition``. Whether samples that lack these columns will be processed correctly depends on the aggregation profile(s) specified, if any. It is recommended to give ``biorep`` and ``techrep`` a value even if the experiment does not include any replicates. Special characters should be handled correctly in most cases, but to be safe, choose names that avoid characters with special meaning in bash, such as "/" and "*".

If an ``id`` is not explicitly specified for a sample, Hich will assign it the ``id`` *[condition]_[biorep]_[techrep]*. For example, a sample with ``condition`` "C1", ``biorep`` "1" and ``techrep`` "1" would be auto-assigned the ``id`` "C1_1_1". Alternatively, the user can put an ``id`` column in the sample file and give each sample whatever ID they choose, as long as all samples have distinct IDs. The sample ``id`` is used as the filename prefix for most of Hich's outputs, but the input files may have arbitrary names and be located anywhere on the filesystem. Aggregate profiles, described in the :doc:`params file<params_file>` documentation, also add the aggregate profile name as a suffix, like *[condition]_[biorep]_[techrep]_[aggregation_profile]*.
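
The default ID rule can be sketched in a few lines of Python. This is an illustration of the naming convention described above, not Hich's implementation; ``assign_id`` and ``check_unique_ids`` are hypothetical helper names.

.. code-block:: python

    def assign_id(sample):
        """Return the explicit id if given, else condition_biorep_techrep."""
        if sample.get("id"):
            return sample["id"]
        return "_".join(str(sample[key]) for key in ("condition", "biorep", "techrep"))


    def check_unique_ids(samples):
        """Return all sample ids, raising if any two samples share one."""
        ids = [assign_id(sample) for sample in samples]
        duplicates = sorted({i for i in ids if ids.count(i) > 1})
        if duplicates:
            raise ValueError(f"duplicate sample ids: {duplicates}")
        return ids


    samples = [
        {"condition": "C1", "biorep": "1", "techrep": "1"},
        {"condition": "C1", "biorep": "1", "techrep": "2", "id": "my_custom_id"},
    ]
    print(check_unique_ids(samples))  # ['C1_1_1', 'my_custom_id']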

For samples with a common ``aggregateProfile`` name (discussed in more detail in the :doc:`params file<params_file>` documentation) where the ``mergeTechrepsToBioreps`` flag is set to true, techrep-level samples with matching ``biorep`` and ``condition`` will be merged to a biorep-level sample. Similarly, if ``mergeBiorepsToConditions`` is true, samples with matching ``condition`` will be merged to a condition-level sample. Below is an example of 8 input technical replicate samples. If an ``aggregateProfile`` sets both ``mergeTechrepsToBioreps`` and ``mergeBiorepsToConditions`` to true, then 4 biological replicate samples and 2 condition samples will be created and processed in parallel along with the 8 input technical replicates.

.. list-table::
   :widths: 50 50
   :header-rows: 0

   * - .. csv-table::
         :file: ../tables/sample_file_demo/merge_samples_demo.tsv
         :delim: tab
         :header-rows: 1

     - .. image:: ../tables/sample_file_demo/merge_samples.png
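
The grouping behind these counts can be sketched in Python. This only illustrates which samples share a merge key; it is not how Hich performs the merge. It assumes a hypothetical layout of two conditions, each with two bioreps of two techreps, which may differ from the table above.

.. code-block:: python

    from collections import defaultdict
    from itertools import product

    # Hypothetical 2 x 2 x 2 design giving 8 technical replicate samples.
    techreps = [
        {"condition": c, "biorep": b, "techrep": t}
        for c, b, t in product(["C1", "C2"], ["1", "2"], ["1", "2"])
    ]


    def group_by(samples, keys):
        """Group samples that share the same values for the given keys."""
        groups = defaultdict(list)
        for sample in samples:
            groups[tuple(sample[key] for key in keys)].append(sample)
        return dict(groups)


    # Techreps merge when condition and biorep match; bioreps merge by condition.
    bioreps = group_by(techreps, ("condition", "biorep"))
    conditions = group_by(techreps, ("condition",))
    print(len(techreps), len(bioreps), len(conditions))  # 8 4 2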

Data columns
..........................................

Column names: ``fastq1``, ``fastq2``, ``bam``, ``pairs``, ``mcool``, ``hic``

Hich can ingest intermediate data formats as well as fastq reads. It is safest to submit fastq and pairs files either as plaintext or gzipped with the suffix ``.gz`` or ``.gzip``, with `pbgzip <https://github.com/nh13/pbgzip>`_ preferred for .pairs format. The input data files should be listed under the appropriate column name. If different samples start at different stages (e.g. some start with fastq files, others with bam files), simply include the ``fastq1``, ``fastq2``, and ``bam`` columns and put the path to each data file in the appropriate column for that sample, leaving the unneeded column values blank.
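
To make the "fill in only the relevant column" convention concrete, here is a small Python sketch that reports which data columns are populated for each sample. The helper name ``data_inputs`` and the example paths are hypothetical, and Hich's own logic for choosing a starting stage is not shown here.

.. code-block:: python

    DATA_COLUMNS = ("fastq1", "fastq2", "bam", "pairs", "mcool", "hic")


    def data_inputs(sample):
        """Return the non-blank data columns of a sample as a dict."""
        return {col: sample[col] for col in DATA_COLUMNS if sample.get(col)}


    # Two samples starting at different stages; unneeded columns are left blank.
    samples = [
        {"id": "C1_1_1", "fastq1": "data/a_1.fq.gz", "fastq2": "data/a_2.fq.gz", "bam": ""},
        {"id": "C1_1_2", "fastq1": "", "fastq2": "", "bam": "data/b.bam"},
    ]
    for sample in samples:
        print(sample["id"], sorted(data_inputs(sample)))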

The ``pairs`` column is for files in `4DN .pairs format <https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md>`_. Note that Hich relies heavily on `pairtools <https://pairtools.readthedocs.io/en/latest/cli_tools.html>`_ for certain processing steps, which uses the column names ``chrom1`` and ``chrom2`` as opposed to ``chr1`` and ``chr2`` designated in the .pairs spec. Hich has not been tested for compatibility with using ``chr1`` and ``chr2``, so using ``chrom1`` and ``chrom2`` is recommended.
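
One way to check which naming convention a .pairs file uses is to read the ``#columns:`` line of its header. The Python sketch below does this for plaintext or gzipped files; ``pairs_columns`` and ``open_pairs`` are hypothetical helper names, and the header shown is a made-up example.

.. code-block:: python

    import gzip
    import io


    def pairs_columns(handle):
        """Return the names on the '#columns:' header line, or None if absent."""
        for line in handle:
            if not line.startswith("#"):
                break  # header lines come first; stop at the first data line
            if line.startswith("#columns:"):
                return line[len("#columns:"):].split()
        return None


    def open_pairs(path):
        """Open a plaintext or gzip-compressed .pairs file in text mode."""
        if path.endswith((".gz", ".gzip")):
            return gzip.open(path, "rt")
        return open(path, "rt")


    # Hypothetical header in the pairtools naming style, held in memory.
    demo = io.StringIO(
        "## pairs format v1.0\n"
        "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n"
        "read1\tchr1\t100\tchr2\t200\t+\t-\n"
    )
    columns = pairs_columns(demo)
    print({"chrom1", "chrom2"} <= set(columns))  # True

With a real file, the same check would be ``pairs_columns(open_pairs(path))``.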

The ``hic`` column is for files in the Aiden Lab's `.hic format <https://github.com/aidenlab/hic-format>`_. The ``mcool`` column is for files in Open2C `.mcool format <https://cooler.readthedocs.io/en/latest/schema.html#multi-resolution>`_.

This example builds on the previous one, demonstrating how data at the fastq level might be combined with data that has been pre-aligned to bam format.

.. csv-table::
    :file: ../tables/sample_file_demo/mixed_data_columns.tsv
    :delim: tab
    :header-rows: 1

Processing columns
..........................................

Column names: ``assembly``, ``aligner``, ``restrictionEnzymes``

The ``assembly`` column is the *name* of the genome reference used, like "hg38", while a URI to a file on the local machine or on the web where the genome reference is stored goes in the ``genomeReference`` column, described in the next section. See the :doc:`page on building reference files <../workflow_steps/reference_files>` for information on how Hich can automatically and non-redundantly retrieve and build needed reference files, including the genome reference, chromsizes file, aligner index, and fragment index.

The ``aligner`` column can take on one of three values: ``bwa``, ``bwa-mem2``, or ``bsbolt``. The `BWA aligner <https://bio-bwa.sourceforge.net/>`_ is slower but requires less memory to index and run. The `BWA-MEM2 aligner <https://github.com/bwa-mem2/bwa-mem2>`_ produces near-identical outputs much more quickly, but requires more memory to index and run. The `BSBolt aligner <https://bsbolt.readthedocs.io/en/latest/>`_ is a BWA-based aligner and methylation caller for reads from a deaminating assay such as bisulfite conversion.

The ``restrictionEnzymes`` column describes the restriction enzymes (if any) used in the Hi-C digest. Hich recognizes `REBASE <https://rebase.neb.com/rebase/rebase.html>`_ enzyme names. For multi-enzyme digests, give a comma-delimited list of enzyme names such as ``DdeI,DpnII``.

This example shows two experimental conditions, one nuclease-based digest (no entry in ``restrictionEnzymes``) aligned to hg38 with the bwa aligner, the other a bisulfite-converted double restriction digest aligned using bsbolt to mm10.

.. csv-table::
    :file: ../tables/sample_file_demo/processing_columns.tsv
    :delim: tab
    :header-rows: 1

Reference file columns
..........................................

Column names: ``genomeReference``, ``chromsizes``, ``alignerIndexDir``, ``alignerIndexPrefix``, ``fragmentIndex``

These columns specify URIs (local paths or web URLs) to reference files that are necessary for certain sample processing steps. If a sample's input data is already processed beyond the step at which a reference file is needed, that file can be left out for that sample. These columns can be left out entirely if no sample needs them. See the :doc:`page on building reference files <../workflow_steps/reference_files>` for information on how Hich can automatically and non-redundantly retrieve and build needed reference files, including the genome reference, chromsizes file, aligner index, and fragment index.

The ``genomeReference`` column is a URI to a `FASTA <https://www.ncbi.nlm.nih.gov/genbank/fastaformat/>`_ file with reference contigs, which Hich can download during the run based on the assembly name for some assemblies. The ``chromsizes`` file is a two-column headerless file with chromosome names and sizes, which can be built by Hich from the genome reference. The ``alignerIndexDir`` and ``alignerIndexPrefix`` columns respectively give the directory in which the aligner's index files for that genome reference are stored and the filename prefix for the files in the index. For example, if aligner indexes are at ``hich/bwa/bwa/index/hg38.*``, then ``alignerIndexDir`` should be ``hich/bwa/bwa/index`` and ``alignerIndexPrefix`` should be ``hg38``. The ``fragmentIndex`` column points to a three-column headerless BED file (zero-indexed, left-closed, right-open) containing the chromosome name, start position, and end position of each restriction fragment for the digest. Hich can produce this file automatically and uses it to discard reads mapping to the same restriction fragment, since these likely result from undigested chromatin.
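
To show what the two-column chromsizes file contains, the following Python sketch derives contig names and lengths from a FASTA file. It is a minimal illustration under the assumption of a plaintext FASTA, not the method Hich uses internally; ``chromsizes_from_fasta`` is a hypothetical name and the sequences are toy data.

.. code-block:: python

    import io


    def chromsizes_from_fasta(handle):
        """Map each contig name to its sequence length, in file order."""
        sizes = {}
        name = None
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The contig name is the first word after '>'.
                name = line[1:].split()[0]
                sizes[name] = 0
            elif name is not None:
                sizes[name] += len(line)
        return sizes


    # Toy FASTA with two contigs; the first sequence spans two lines.
    fasta = io.StringIO(">chr1 demo contig\nACGT\nAC\n>chr2\nGGG\n")
    sizes = chromsizes_from_fasta(fasta)
    for name, size in sizes.items():
        print(f"{name}\t{size}")  # tab-separated, no header line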

If these columns are unspecified, Hich will attempt to create the files and will hard-link them in ``resources/hich`` under subdirectories with sensible names.

.. toctree::
    :hidden: