`dynast.preprocessing`

Submodules

Package Contents

Functions

`aggregate_counts`(df_counts, aggregates_path, conversions=frozenset([('TC', )]))	Aggregate conversion counts for each pair of bases.
`calculate_mutation_rates`(df_counts, rates_path, group_by=None)	Calculate mutation rate for each pair of bases.
`merge_aggregates`(*dfs)	Merge multiple aggregate dataframes into one.
`read_aggregates`(aggregates_path)	Read aggregates CSV as a pandas dataframe.
`read_rates`(rates_path)	Read mutation rates CSV as a pandas dataframe.
`check_bam_contains_duplicate`(bam_path, n_reads=100000, n_threads=8)
`check_bam_contains_secondary`(bam_path, n_reads=100000, n_threads=8)
`check_bam_contains_unmapped`(bam_path)
`get_tags_from_bam`(bam_path, n_reads=100000, n_threads=8)	Utility function to retrieve all read tags present in a BAM.
`parse_all_reads`(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)	Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
`read_alignments`(alignments_path, *args, **kwargs)	Read alignments CSV as a pandas DataFrame.
`read_conversions`(conversions_path, *args, **kwargs)	Read conversions CSV as a pandas DataFrame.
`select_alignments`(df_alignments)	Select alignments among duplicates.
`sort_and_index_bam`(bam_path, out_path, n_threads=8, temp_dir=None)	Sort and index BAM.
`call_consensus`(bam_path, out_path, gene_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, quality=27, add_RS_RI=False, temp_dir=None, n_threads=8)
`complement_counts`(df_counts, gene_infos)	Complement the counts in the counts dataframe according to gene strand.
`count_conversions`(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)	Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode.
`deduplicate_counts`(df_counts, conversions=None, use_conversions=True)	Deduplicate counts based on barcode, UMI, and gene.
`read_counts`(counts_path, *args, **kwargs)	Read counts CSV as a pandas dataframe.
`split_counts_by_velocity`(df_counts)	Split the given counts dataframe by the velocity column.
`calculate_coverage`(bam_path, conversions, coverage_path, alignments=None, umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, velocity=True)	Calculate coverage of each genomic position per barcode.
`read_coverage`(coverage_path)	Read coverage CSV as a dictionary.
`detect_snps`(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)	Detect SNPs.
`read_snp_csv`(snp_csv)	Read a user-provided SNPs CSV.
`read_snps`(snps_path)	Read SNPs CSV as a dictionary.

Attributes

CONVERSION_COMPLEMENT

dynast.preprocessing.aggregate_counts(df_counts, aggregates_path, conversions=frozenset([('TC',)]))

Aggregate conversion counts for each pair of bases.

Parameters

df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases
aggregates_path (str) – path to write aggregate CSV
conversions (list, optional) – conversion(s) in question, defaults to frozenset([('TC',)])

Returns

path to aggregate CSV that was written

Return type

str

dynast.preprocessing.calculate_mutation_rates(df_counts, rates_path, group_by=None)

Calculate mutation rate for each pair of bases.

Parameters

df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases
rates_path (str) – path to write rates CSV
group_by (list, optional) – column(s) to group calculations by, defaults to None, which combines all rows

Returns

path to rates CSV

Return type

str
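
As a rough illustration of what a per-base-pair mutation rate can look like, the sketch below assumes the rate for a pair such as TC is the number of observed T-to-C conversions divided by the total T content. That definition, the `mutation_rates` helper, and the row layout (conversion counts such as `TC` stored next to base counts such as `T`) are assumptions made for this example and may not match how dynast computes or stores rates.

```python
from itertools import permutations

BASES = "ACGT"
# All ordered pairs of distinct bases, e.g. "AC", "AG", ..., "TG".
CONVERSIONS = ["".join(pair) for pair in permutations(BASES, 2)]


def mutation_rates(rows):
    """Return {conversion: rate} over all rows (hypothetical helper).

    Each row is a dict holding one count per conversion ("TC", ...) and one
    count per base ("T", ...). The rate is conversions divided by the count
    of the original base, which is an assumed definition.
    """
    totals = {key: 0 for key in list(BASES) + CONVERSIONS}
    for row in rows:
        for key in totals:
            totals[key] += row.get(key, 0)
    rates = {}
    for conversion in CONVERSIONS:
        base_count = totals[conversion[0]]
        # Avoid dividing by zero when a base was never observed.
        rates[conversion] = totals[conversion] / base_count if base_count else 0.0
    return rates


rows = [
    {"T": 40, "C": 30, "TC": 2},
    {"T": 60, "C": 20, "TC": 3, "CT": 1},
]
rates = mutation_rates(rows)
print(rates["TC"])  # 5 conversions over 100 T bases
```

Here all rows are combined, which corresponds to the documented behavior of group_by=None; grouping would apply the same arithmetic once per group.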

dynast.preprocessing.merge_aggregates(*dfs)

Merge multiple aggregate dataframes into one.

Parameters

*dfs – dataframes to merge

Returns

merged dataframe

Return type

pandas.DataFrame

dynast.preprocessing.read_aggregates(aggregates_path)

Read aggregates CSV as a pandas dataframe.

Parameters: aggregates_path (str) – path to aggregates CSV
Returns: aggregates dataframe
Return type: pandas.DataFrame

dynast.preprocessing.read_rates(rates_path)

Read mutation rates CSV as a pandas dataframe.

Parameters: rates_path (str) – path to rates CSV
Returns: rates dataframe
Return type: pandas.DataFrame

dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)

dynast.preprocessing.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)

dynast.preprocessing.check_bam_contains_unmapped(bam_path)

dynast.preprocessing.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

Parameters

bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8

Returns

set of all tags found

Return type

set

dynast.preprocessing.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters

bam_path (str) – path to alignment BAM file
conversions_path (str) – path to output information about reads that have conversions
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
no_index_path (str) – path to no conversions index
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False
return_splits (bool, optional) – return BAM splits for later reuse, defaults to False

Returns

(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing the split BAM paths and the number of reads in each BAM.

Return type

(str, str, str) or (str, str, str, list)
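
Because the return value has three or four elements depending on return_splits, calling code may want to normalize it before use. The following is a minimal sketch of one way to do that; `unpack_parse_result` and the file names are hypothetical, and the tuples are stand-ins shaped like the documented return values rather than real output.

```python
def unpack_parse_result(result):
    """Normalize the 3- or 4-element return value into a fixed shape.

    Hypothetical helper: returns (conversions_path, alignments_path,
    index_path, splits), where splits is None when it was not returned.
    """
    if len(result) == 3:
        conversions_path, alignments_path, index_path = result
        return conversions_path, alignments_path, index_path, None
    if len(result) == 4:
        return tuple(result)
    raise ValueError(f"expected 3 or 4 values, got {len(result)}")


# Stand-in tuples shaped like the documented return values (made-up paths).
without_splits = ("conversions.csv", "alignments.csv", "conversions.idx")
with_splits = without_splits + ([("split_0.bam", 1200), ("split_1.bam", 950)],)

print(unpack_parse_result(without_splits)[3])
print(unpack_parse_result(with_splits)[3])
```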

dynast.preprocessing.read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters: alignments_path (str) – path to alignments CSV
Returns: alignments dataframe
Return type: pandas.DataFrame

dynast.preprocessing.read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters: conversions_path (str) – path to conversions CSV
Returns: conversions dataframe
Return type: pandas.DataFrame

dynast.preprocessing.select_alignments(df_alignments)

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.

Parameters: df_alignments (pandas.DataFrame) – alignments dataframe
Returns: set of (read_id, alignment index) that were selected
Return type: set

dynast.preprocessing.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters

bam_path (str) – path to alignment BAM file to sort
out_path (str) – path to output sorted BAM
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to sorted and indexed BAM

Return type

str

dynast.preprocessing.call_consensus(bam_path, out_path, gene_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, quality=27, add_RS_RI=False, temp_dir=None, n_threads=8)

dynast.preprocessing.complement_counts(df_counts, gene_infos)

Complement the counts in the counts dataframe according to gene strand.

Parameters

df_counts (pandas.DataFrame) – counts dataframe
gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

counts dataframe with counts complemented for reads mapping to genes on the reverse strand

Return type

pandas.DataFrame
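
To make the idea of complementing concrete: a T-to-C conversion in a transcript from the reverse strand appears as A-to-G relative to the reference, so base and conversion counts for such genes are swapped with their complements. The sketch below shows this on plain dictionaries. The `complement_row` helper, the row layout, and the `gene_strands` mapping are hypothetical; the actual function works on a pandas dataframe and gene_infos, and its column names may differ.

```python
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_key(key):
    """Complement every base in a column name, e.g. "TC" -> "AG", "A" -> "T"."""
    return "".join(BASE_COMPLEMENT[base] for base in key)


def complement_row(row, gene_strands):
    """Return a copy of row with base and conversion counts complemented
    when the row's gene lies on the reverse ("-") strand (hypothetical layout).
    """
    if gene_strands[row["gene"]] != "-":
        return dict(row)
    out = {}
    for key, value in row.items():
        # Only keys made entirely of A/C/G/T are treated as count columns.
        is_count = all(ch in BASE_COMPLEMENT for ch in key)
        out[complement_key(key) if is_count else key] = value
    return out


gene_strands = {"g1": "+", "g2": "-"}
forward = {"gene": "g1", "A": 10, "T": 4, "AG": 3, "TC": 0}
reverse = {"gene": "g2", "A": 10, "T": 4, "AG": 3, "TC": 0}

print(complement_row(forward, gene_strands))
print(complement_row(reverse, gene_strands))
```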

dynast.preprocessing.CONVERSION_COMPLEMENT

dynast.preprocessing.count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters

conversions_path (str) – path to conversions CSV
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
counts_path (str) – path to write counts CSV
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None
conversions (list, optional) – conversions to prioritize when deduplicating; only applicable for UMI technologies, defaults to None
dedup_use_conversions (bool, optional) – prioritize reads that have at least one conversion when deduplicating, defaults to True
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to counts CSV

Return type

str

dynast.preprocessing.deduplicate_counts(df_counts, conversions=None, use_conversions=True)

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following.

1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have larger sum of all conversions

Parameters

df_counts (pandas.DataFrame) – counts dataframe
conversions (list, optional) – conversions to prioritize, defaults to None
use_conversions (bool, optional) – prioritize reads that have conversions first, defaults to True

Returns

deduplicated counts dataframe

Return type

pandas.DataFrame
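
The priority order described for this function maps naturally onto a tuple sort key, since tuples compare element by element. The following is a sketch of that idea on plain dictionaries, not dynast's implementation: the `deduplicate` helper and the row fields (`transcriptome`, `score`, `conversions`) are hypothetical names chosen for the example.

```python
from collections import defaultdict


def deduplicate(rows, conversions=None, use_conversions=True):
    """Keep one row per (barcode, umi, gene) using the documented priority.

    Hypothetical row layout: dicts with "barcode", "umi", "gene",
    "transcriptome" (bool), "score" (int) and a "conversions" dict that maps
    names such as "TC" to counts.
    """

    def conversion_sum(row):
        counts = row["conversions"]
        if conversions is None:
            return sum(counts.values())
        return sum(counts.get(name, 0) for name in conversions)

    def priority(row):
        has_conversion = use_conversions and conversion_sum(row) > 0
        # Earlier tuple elements take precedence over later ones.
        return (has_conversion, row["transcriptome"], row["score"], conversion_sum(row))

    groups = defaultdict(list)
    for row in rows:
        groups[(row["barcode"], row["umi"], row["gene"])].append(row)
    return [max(group, key=priority) for group in groups.values()]


rows = [
    {"barcode": "AAAC", "umi": "GGT", "gene": "g1", "transcriptome": True,
     "score": 50, "conversions": {"TC": 0}},
    {"barcode": "AAAC", "umi": "GGT", "gene": "g1", "transcriptome": False,
     "score": 40, "conversions": {"TC": 2}},
    {"barcode": "AAAC", "umi": "CCA", "gene": "g1", "transcriptome": True,
     "score": 30, "conversions": {"TC": 1}},
]
kept = deduplicate(rows, conversions=["TC"])
print([row["score"] for row in kept])
```

With use_conversions=True the converted read wins the first group even though it has a lower alignment score; with use_conversions=False the transcriptome-aligned read is kept instead.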

dynast.preprocessing.read_counts(counts_path, *args, **kwargs)

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters: counts_path (str) – path to CSV
Returns: counts dataframe
Return type: pandas.DataFrame

dynast.preprocessing.split_counts_by_velocity(df_counts)

Split the given counts dataframe by the velocity column.

Parameters: df_counts (pandas.DataFrame) – counts dataframe
Returns: dictionary containing velocity column values as keys and the subset dataframe as values
Return type: dictionary
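
The shape of the result can be pictured with a small stand-in that groups rows by a `velocity` field. This is only a sketch on plain dictionaries; the `split_by_velocity` helper and the velocity labels used here are illustrative assumptions, and the real function returns pandas dataframes as the dictionary values.

```python
from collections import defaultdict


def split_by_velocity(rows):
    """Group rows by their "velocity" value (hypothetical row layout)."""
    splits = defaultdict(list)
    for row in rows:
        splits[row["velocity"]].append(row)
    return dict(splits)


rows = [
    {"read_id": "r1", "velocity": "spliced"},
    {"read_id": "r2", "velocity": "unspliced"},
    {"read_id": "r3", "velocity": "spliced"},
]
splits = split_by_velocity(rows)
print(sorted(splits))
print(len(splits["spliced"]))
```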

dynast.preprocessing.calculate_coverage(bam_path, conversions, coverage_path, alignments=None, umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, velocity=True)

Calculate coverage of each genomic position per barcode.

Parameters

bam_path (str) – path to alignment BAM file
conversions (dictionary) – dictionary of contigs as keys and sets of genomic positions as values that indicate positions where conversions were observed
coverage_path (str) – path to write coverage CSV
alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
temp_dir (str, optional) – path to temporary directory, defaults to None
velocity (bool, optional) – whether or not velocities were assigned, defaults to True

Returns

coverage CSV path

Return type

str

dynast.preprocessing.read_coverage(coverage_path)

Read coverage CSV as a dictionary.

Parameters: coverage_path (str) – path to coverage CSV
Returns: coverage as a nested dictionary
Return type: dict

dynast.preprocessing.detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)

Detect SNPs.

Parameters

conversions_path (str) – path to conversions CSV
index_path (str) – path to conversions index
coverage (dict) – dictionary containing genomic coverage
snps_path (str) – path to output SNPs
alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
conversions (set, optional) – set of conversions to consider
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
threshold (float, optional) – positions with conversions / coverage > threshold will be considered as SNPs, defaults to 0.5
min_coverage (int, optional) – only positions with at least this many mapping reads are considered, defaults to 1
n_threads (int, optional) – number of threads, defaults to 8
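
The threshold and min_coverage parameters combine as a simple per-position test: a position is flagged when it has enough coverage and its conversions / coverage ratio exceeds the threshold. The sketch below applies that rule to in-memory dictionaries. The `call_snps` helper and the `{contig: {position: count}}` layout are assumptions for illustration; the real function reads the conversions CSV and index, applies the quality filter, and writes its result to snps_path.

```python
def call_snps(conversion_counts, coverage, threshold=0.5, min_coverage=1):
    """Return {contig: set of positions} flagged as SNPs (hypothetical helper).

    conversion_counts and coverage are nested dicts of the form
    {contig: {position: count}}.
    """
    snps = {}
    for contig, positions in conversion_counts.items():
        for position, n_conversions in positions.items():
            depth = coverage.get(contig, {}).get(position, 0)
            # Skip positions without enough coverage (and avoid dividing by zero).
            if depth < min_coverage or depth == 0:
                continue
            if n_conversions / depth > threshold:
                snps.setdefault(contig, set()).add(position)
    return snps


conversion_counts = {"chr1": {100: 8, 200: 2, 300: 1}}
coverage = {"chr1": {100: 10, 200: 10, 300: 1}}

print(call_snps(conversion_counts, coverage, min_coverage=2))
print(call_snps(conversion_counts, coverage, min_coverage=1))
```

Position 300 shows why min_coverage matters: a single converted read gives a ratio of 1.0, which passes any threshold below 1 unless a minimum depth is required.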

dynast.preprocessing.read_snp_csv(snp_csv)

Read a user-provided SNPs CSV.

Parameters: snp_csv (str) – path to SNPs CSV
Returns: dictionary of contigs as keys and sets of genomic positions with SNPs as values
Return type: dictionary

dynast.preprocessing.read_snps(snps_path)

Read SNPs CSV as a dictionary.

Parameters: snps_path (str) – path to SNPs CSV
Returns: dictionary of contigs as keys and sets of genomic positions with SNPs as values
Return type: dictionary
