Pipeline Usage
This sections covers basic usage of dynast.
Building the STAR index with ref
Internally, dynast uses the STAR RNA-seq aligner to align reads to the genome [Dobin2013]. Therefore, we must construct a STAR index to use for alignment. The dynast ref
command is a wrapper around the STAR’s --runMode genomeGenerate
command, while also providing useful default parameters to limit memory usage, similar to Cell Ranger. Existing STAR indices can be used interchangeably with ones generated through dynast. A genome FASTA and gene annotation GTF are required to build the STAR index.
usage: dynast ref [-h] [--tmp TMP] [--keep-tmp] [--verbose] [-t THREADS] -i INDEX [-m MEMORY] fasta gtf
Build a STAR index from a reference
positional arguments:
fasta Genomic FASTA file
gtf Reference GTF file
optional arguments:
-h, --help Show this help message and exit
--tmp TMP Override default temporary directory
--keep-tmp Do not delete the tmp directory
--verbose Print debugging information
-t THREADS Number of threads to use (default: 8)
-m MEMORY Maximum memory used, in GB (default: 16)
required arguments:
-i INDEX Path to the directory where the STAR index will be generated
Aligning FASTQs with align
The dynast align
command is a wrapper around STARsolo [Dobin2013]. Dynast automatically formats the arguments to STARsolo to ensure the resulting alignment BAM contains information necessary for downstream processing.
Additionally, align
sets a more lenient alignment score cutoff by setting --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
because the reads are expected to have experimentally-induced conversions. The STARsolo defaults for both are 0.66
. The --STAR-overrides
argument can be used to pass arguments directly to STAR.
dynast align
outputs a different set of BAM tags in the alignment BAM depending on the type of sequencing technology specified. These are described in the following subsections.
usage: dynast align [-h] [--tmp TMP] [--keep-tmp] [--verbose] [-t THREADS] -i INDEX [-o OUT] -x TECHNOLOGY
[--strand {forward,reverse,unstranded}] [-w WHITELIST] [--overwrite]
[--STAR-overrides ARGUMENTS]
fastqs [fastqs ...]
Align FASTQs
positional arguments:
fastqs FASTQ files. If `-x smartseq`, this is a single manifest CSV file where the first
column contains cell IDs and the next two columns contain paths to FASTQs (the third
column may contain a dash `-` for single-end reads).
optional arguments:
-h, --help Show this help message and exit
--tmp TMP Override default temporary directory
--keep-tmp Do not delete the tmp directory
--verbose Print debugging information
-t THREADS Number of threads to use (default: 8)
--strand {forward,reverse,unstranded}
Read strandedness. (default: `forward`)
-w WHITELIST Path to file of whitelisted barcodes to correct to. If not provided, all barcodes are
used.
--overwrite Overwrite existing alignment files
--STAR-overrides ARGUMENTS
Arguments to pass directly to STAR.
required arguments:
-i INDEX Path to the directory where the STAR index is located
-o OUT Path to output directory (default: current directory)
-x TECHNOLOGY Single-cell technology used. `dynast --list` to view all supported technologies
UMI-based technologies
For UMI-based technologies (such as Drop-seq, 10X Chromium, scNT-seq), the following BAM tags are written to the alignment BAM.
MD
HI
,AS
for alignment index and scoreCR
,CB
for raw and corrected barcodesUR
,UB
for raw and corrected UMIs
Plate-based technologies
For plate-based technologies (such as Smart-Seq), the following BAM tags are written to the alignment BAM. See
MD
HI
,AS
for alignment index and scoreRG
indicating the sample name
Calling consensus sequences with consensus
dynast consensus
parses the alignment BAM to generate consensus sequences for each sequenced mRNA molecule (see Consensus procedure).
usage: dynast consensus [-h] [--tmp TMP] [--keep-tmp] [--verbose] [-t THREADS] -g GTF [-o OUT] [--umi-tag TAG]
[--barcode-tag TAG] [--gene-tag TAG] [--strand {forward,reverse,unstranded}]
[--quality QUALITY] [--barcodes TXT] [--add-RS-RI]
bam
Generate consensus sequences
positional arguments:
bam Alignment BAM file that contains the appropriate UMI and barcode tags, specifiable with
`--umi-tag`, and `--barcode-tag`.
optional arguments:
-h, --help Show this help message and exit
--tmp TMP Override default temporary directory
--keep-tmp Do not delete the tmp directory
--verbose Print debugging information
-t THREADS Number of threads to use (default: 8)
-o OUT Path to output directory (default: current directory)
--umi-tag TAG BAM tag to use as unique molecular identifiers (UMI). If not provided, all reads are assumed
to be unique. (default: None)
--barcode-tag TAG BAM tag to use as cell barcodes. If not provided, all reads are assumed to be from a single
cell. (default: None)
--gene-tag TAG BAM tag to use as gene assignments (default: GX)
--strand {forward,reverse,unstranded}
Read strandedness. (default: `forward`)
--quality QUALITY Base quality threshold. When generating a consensus nucleotide at a certain position, the base
with smallest error probability below this quality threshold is chosen. If no base meets this
criteria, the reference base is chosen. (default: 27)
--barcodes TXT Textfile containing filtered cell barcodes. Only these barcodes will be processed.
--add-RS-RI Add custom RS and RI tags to the output BAM, each of which contain a semi-colon delimited list
of read names (RS) and alignment indices (RI) of the reads and alignments from which the
consensus is derived. This option is useful for debugging.
required arguments:
-g GTF Path to GTF file used to generate the STAR index
The resulting BAM will contain a collection of consensus alignments and a subset of original alignments (for those alignments for which a consensus could not be determined). The latter are identical to those in the original BAM, while the names of the former will be seemingly random sequences of letters and numbers (in reality, these are SHA256 checksums of the grouped read names). They will also contain the following modified BAM tags
AS
is now the sum of the alignment scores of the readsHI
, the alignment index, is always 1
and the follwing additional BAM tags.
RN
indicating how many reads were used to generate the consensusRS
,RI
each containing a semicolon-delimited list of read names and their corresponding alignment indices (HI
tag in the original BAM) that were used to generate the consensus (only added if--add-RS-RI
is provided)
Quantifying counts with count
dynast count
parses the alignment BAM and quantifies the four RNA species (unlabeled unspliced, unlabeled spliced, labeled unspliced, labeled spliced) and outputs the results as a ready-to-use AnnData H5AD
file. In order to properly quantify the above four species, the alignment BAM must contain specific BAM tags, depending on what sequencing technology was used. If dynast align
was used to generate the alignment BAM, dynast automatically configures the appropriate BAM tags to be written.
usage: dynast count [-h] [--tmp TMP] [--keep-tmp] [--verbose] [-t THREADS] -g GTF --conversion CONVERSION [-o OUT]
[--umi-tag TAG] [--barcode-tag TAG] [--gene-tag TAG] [--strand {forward,reverse,unstranded}]
[--quality QUALITY] [--snp-threshold THRESHOLD] [--snp-min-coverage THRESHOLD] [--snp-csv CSV]
[--barcodes TXT] [--no-splicing | --exon-overlap {lenient,strict}] [--control]
[--dedup-mode {auto,conversion,exon}] [--overwrite]
bam
Quantify unlabeled and labeled RNA
positional arguments:
bam Alignment BAM file that contains the appropriate UMI and barcode tags, specifiable with
`--umi-tag`, and `--barcode-tag`.
optional arguments:
-h, --help Show this help message and exit
--tmp TMP Override default temporary directory
--keep-tmp Do not delete the tmp directory
--verbose Print debugging information
-t THREADS Number of threads to use (default: 8)
-o OUT Path to output directory (default: current directory)
--umi-tag TAG BAM tag to use as unique molecular identifiers (UMI). If not provided, all reads are assumed
to be unique. (default: None)
--barcode-tag TAG BAM tag to use as cell barcodes. If not provided, all reads are assumed to be from a single
cell. (default: None)
--gene-tag TAG BAM tag to use as gene assignments (default: GX)
--strand {forward,reverse,unstranded}
Read strandedness. (default: `forward`)
--quality QUALITY Base quality threshold. Only bases with PHRED quality greater than this value will be
considered when counting conversions. (default: 27)
--snp-threshold THRESHOLD
Conversions with (# conversions) / (# reads) greater than this threshold will be considered a
SNP and ignored. (default: no SNP detection)
--snp-min-coverage THRESHOLD
For a conversion to be considered as a SNP, there must be at least this many reads mapping to
that region. (default: 1)
--snp-csv CSV CSV file of two columns: contig (i.e. chromosome) and genome position of known SNPs
--barcodes TXT Textfile containing filtered cell barcodes. Only these barcodes will be processed.
--no-splicing, --transcriptome-only
Do not assign reads a splicing status (spliced, unspliced, ambiguous) and ignore reads that
are not assigned to the transcriptome.
--exon-overlap {lenient,strict}
Algorithm to use to detect spliced reads (that overlap exons). May be `strict`, which assigns
reads as spliced if it only overlaps exons, or `lenient`, which assigns reads as spliced if it
does not overlap with any introns of at least one transcript. (default: lenient)
--control Indicate this is a control sample, which is used to detect SNPs.
--dedup-mode {auto,conversion,exon}
Deduplication mode for UMI-based technologies (required `--umi-tag`). Available choices are:
`auto`, `conversion`, `exon`. When `conversion` is used, reads that have at least one of the
provided conversions is prioritized. When `exon` is used, exonic reads are prioritized. By
default (`auto`), the BAM is inspected to select the appropriate mode.
--overwrite Overwrite existing files.
required arguments:
-g GTF Path to GTF file used to generate the STAR index
--conversion CONVERSION
The type of conversion(s) introduced at a single timepoint. Multiple conversions can be
specified with a comma-delimited list. For example, T>C and A>G is TC,AG. This option can be
specified multiple times (i.e. dual labeling), for each labeling timepoint.
Basic arguments
The --barcode-tag
and --umi-tag
arguments are used to specify what BAM tags should be used to differentiate cells (barcode) and RNA molecules (UMI). If the former is not specified, all BAM alignments are assumed to be from a single cell, and if the latter is not specified, all aligned reads are assumed to be unique (i.e. no read deduplication is performed). If align
was used to generate the alignment BAM, then --barcode-tag CB --umi-tag UB
is recommended for UMI-based technologies (see UMI-based technologies), and --barcode-tag RG
is recommended for Plate-based technologies (see Plate-based technologies).
The --strand
argument can be used to specify the read strand of the sequencing technology. Usually, the default (forward
) is appropriate, but this argument may be of use for other technologies.
The --conversion
argument is used to specify the type of conversion that is experimentally introduced as a two-character string. For instance, a T>C conversion is represented as TC
, which is the default. Multiple conversions can be specified as a comma-delimited list, and --conversion
may be specified multiple times to indicate multiple-indexing experiments. For example, for an experiment that introduced T>C mutations at timepoint 1 and A>G and C>G mutations at timepoint 2, the appropriate options would be --conversion TC --conversion AG,CG
.
Detecting and filtering SNPs
dynast count
has the ability to detect single-nucleotide polymorphisms (SNPs) by calculating the fraction of reads with a mutation at a certain genomic position. --snp-threshold
can be used to specify the proportion threshold greater than which a SNP will be called at that position. All conversions/mutations at the genomic positions with SNPs detected in this manner will be filtered out from further processing. In addition, a CSV file containing known SNP positions can be provided with the --snp-csv
argument. This argument accepts a CSV file containing two columns: contig (i.e. chromosome) and genomic position of known SNPs.
Read deduplication modes
The --dedup-mode
option is used to select how duplicate reads should be deduplicated for UMI-based technologies (i.e. --umi-tag
is provided). Two different modes are supported: conversion
and exon
. The former prioritizes reads that have at least one conversions provided by --conversion
. The latter prioritizes exonic reads. See quant for a more technical description of how deduplication is performed. Additionally, see Consensus procedure to get an idea of why selecting the correct option may be important.
By default, the --dedup-mode
is set to auto
, which sets the deduplication mode to exon
if the input BAM is detected to be a consensus-called BAM (a BAM generated with dynast consensus
). Otherwise, it is set to conversion
. This option has no effect for non-UMI technologies.
Estimating counts with estimate
The fraction of labeled RNA is estimated with the dynast estimate
command. Whereas dynast count
produces naive UMI count matrices, dynast estimate
statistically models labeling dynamics to estimate the true fraction of labeled RNA (and then in turn uses this fraction to split the total UMI counts into unlabeled and labeled RNA). See Statistical estimation of a technical overview of this process. In this section, we will simply be describing the command-line usage of this command.
usage: dynast estimate [-h] [--tmp TMP] [--keep-tmp] [--verbose] [-t THREADS]
[--reads {total,transcriptome,spliced,unspliced}] [-o OUT] [--groups CSV]
[--ignore-groups-for-pi] [--genes TXT] [--cell-threshold COUNT] [--cell-gene-threshold COUNT]
[--downsample NUM] [--downsample-mode MODE] [--control] [--p-e P_E]
count_dirs [count_dirs ...]
Estimate fraction of labeled RNA
positional arguments:
count_dirs Path to directory that contains `dynast count` output. When multiple are provided, the
barcodes in each of the count directories are suffixed with `-i` where i is a 0-indexed
integer.
optional arguments:
-h, --help Show this help message and exit
--tmp TMP Override default temporary directory
--keep-tmp Do not delete the tmp directory
--verbose Print debugging information
-t THREADS Number of threads to use (default: 8)
--reads {total,transcriptome,spliced,unspliced}
Read groups to perform estimation on. This option can be used multiple times to estimate
multiple groups. (default: all possible reads groups)
-o OUT Path to output directory (default: current directory)
--groups CSV CSV containing cell (barcode) groups, where the first column is the barcode and the second is
the group name the cell belongs to. Cells will be combined per group for estimation of
parameters specified by `--groups-for`.
--ignore-groups-for-pi
Ignore cell groupings when estimating the fraction of labeled RNA. This option only has an
effect when `--groups` is also specified.
--genes TXT Textfile containing list of genes to use. All other genes will be treated as if they do not
exist.
--cell-threshold COUNT
A cell must have at least this many reads for correction. (default: 1000)
--cell-gene-threshold COUNT
A cell-gene pair must have at least this many reads for correction. (default: 16)
--downsample NUM Downsample the number of reads (UMIs). If a decimal between 0 and 1 is given, then the number
is interpreted as the proportion of remaining reads. If an integer is given, the number is
interpreted as the absolute number of remaining reads.
--downsample-mode MODE
Downsampling mode. Can be one of: `uniform`, `cell`, `group`. If `uniform`, all reads (UMIs)
are downsampled uniformly at random. If `cell`, only cells that have more reads than the
argument to `--downsample` are downsampled to exactly that number. If `group`, identical to
`cell` but per group specified by `--groups`.
--control Indicate this is a control sample, only the background mutation rate will be estimated.
--p-e P_E Textfile containing a single number, indicating the estimated background mutation rate
Estimation thresholds
The --cell-threshold
and --cell-gene-threshold
arguments control the minimum number of reads that a cell and cell-gene combination must have for accurate estimation. By default, these are 1000
and 16
respectively. Any cells with reads less than the former are excluded from estimation, and the same goes for any genes within a cell that has less reads than the latter. If --groups
is also provided, then these thresholds apply to each cell group instead of each cell individually. Internally, --cell-threshold
is used to filter cells before estimating the average conversion rate in labeled RNA (see Induced rate estimation (p_c)), and --cell-gene-threshold
is used to filter cell-gene combinations before estimating the fraction of new RNA (see Bayesian inference (\pi_g)).
Estimation on a subset of RNA species
The --reads
argument controls which RNA species to run the estimation procedure on. By default, all possible RNA species, minus ambiguous
reads, are used. This argument can take on the following values: total
, transcriptome
, spliced
, unspliced
(see Read groups). The value of this argument specifies which group of unlabeled/labeled RNA counts will be estimated. For instance, --reads spliced
will run statistical estimation on unlabeled/labeled spliced reads. This option may be provided multiple times to run estimation on multiple groups. The procedure involves estimating the conversion rate of unlabeled and labeled RNA, and modeling the fraction of new RNA as a binomial mixture model (see Statistical estimation).
Grouping cells
Sometimes, grouping read counts across cells may provide better estimation results, especially in the case of droplet-based methods, which result in fewer reads per cell and gene compared to plate-based methods. The --groups
argument can be used to provide a CSV of two columns: the first containing the cell barcodes and the second containing group names that each cell belongs to. Estimation is then performed on a per-group basis by combining the read counts across all cells in each group. This strategy may be applied across different samples, simply by specifying multiple input directories. In this case, the number of group CSVs specified with --groups
must match the number of input directories. For example, when providing two input directories ./input1
and ./input2
, with the intention of grouping cells across these two samples, two group CSVs, groups1.csv
and groups2.csv
must be provided where the former are groups for barcodes in the first sample, and the latter are groups for barcodes in the second sample. The group names may be shared across samples. The output AnnData will still contain reads per cell.
Cell groupings provided this way may be ignored for estimation of the fraction of labeled RNA (see Bayesian inference (\pi_g)) by providing the --ignore-groups-for-pi
flag. This flag may be used only in conjunction with --groups
, and when it is provided, estimation of the fraction of labeled RNA is performed per cell, while estimation of background and induced mutation rates are still done per group.
Downsampling
Downsampling UMIs uniformly, per cell, or per cell group may be useful to significantly reduce runtime while troubleshooting pipeline parameters (or just to quickly get some preliminary results). Dynast can perform downsampling when the --downsample
argument is used. The value of this argument may either be an integer indicating the number of UMIs to retain or a proportion between 0 and 1 indicating the proportion of UMIs to retain. Additionally, the downsampling mode may be specified with the --downsample-mode
argument, which takes one of the following three parameters: uniform
, cell
, group
. uniform
is the default that downsamples UMIs uniformly at random. When cell
is provided, the value of --downsample
may only be an integer specifying the threshold to downsample cells to. Only cells with UMI counts greater than this value will be downsampled to exactly this value. group
works the same way, but for cell groups and may be used only in conjunction with --groups
.
Control samples
Control samples may be used to find common SNPs and directly estimate the conversion rate of unlabeled RNA (see Background estimation (p_e)). Normally, the latter is estimating using the reads directly. However, it is possible to use a control sample (prepared in absence of the experimental introduction of conversions) to calculate this value directly. In addition, SNPs can be called in the control sample, and these called SNPs can be used when running the test sample(s) (see Detecting and filtering SNPs for SNP arguments). Note that SNP calling is done with dynast count
.
A typical workflow for a control sample is the following.
dynast count --control --snp-threshold 0.5 [...] -o control_count --conversion TC -g GTF.gtf CONTROL.bam
dynast estimate --control -o control_estimate control_count
Where [...]
indicates the usual options that would be used for dynast count
if this were not control samples. See Basic arguments for these options.
The dynast count
command detects SNPs from the control sample and outputs them to the file snps.csv
in the output directory control_count
. The dynast estimate
calculates the background conversion rate of unlabeled RNA to the file p_e.csv
in the output directory control_estimate
. These files can then be used as input when running the test sample.
dynast count --snp-csv control_count/snps.csv -o test_count [...] INPUT.bam
dynast estimate --p-e control_estimate/p_e.csv -o test_estimate test_count
The above set of commands runs quantification and estimation on the test sample using the SNPs detected from the control sample (control_count/snps.csv
) and the background conversion rate estimated from the control sample (control_estimate/p_e.csv
).