dynast.preprocessing

Submodules

Package Contents

Functions

aggregate_counts(df_counts, aggregates_path, conversions=frozenset([('TC', )]))

Aggregate conversion counts for each pair of bases.

calculate_mutation_rates(df_counts, rates_path, group_by=None)

Calculate mutation rate for each pair of bases.

merge_aggregates(*dfs)

Merge multiple aggregate dataframes into one.

read_aggregates(aggregates_path)

Read aggregates CSV as a pandas dataframe.

read_rates(rates_path)

Read mutation rates CSV as a pandas dataframe.

check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)

check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)

check_bam_contains_unmapped(bam_path)

get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

select_alignments(df_alignments)

Select alignments among duplicates.

sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

call_consensus(bam_path, out_path, gene_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, quality=27, add_RS_RI=False, temp_dir=None, n_threads=8)

complement_counts(df_counts, gene_infos)

Complement the counts in the counts dataframe according to gene strand.

count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode.

deduplicate_counts(df_counts, conversions=None, use_conversions=True)

Deduplicate counts based on barcode, UMI, and gene.

read_counts(counts_path, *args, **kwargs)

Read counts CSV as a pandas dataframe.

split_counts_by_velocity(df_counts)

Split the given counts dataframe by the velocity column.

calculate_coverage(bam_path, conversions, coverage_path, alignments=None, umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, velocity=True)

Calculate coverage of each genomic position per barcode.

read_coverage(coverage_path)

Read coverage CSV as a dictionary.

detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)

Detect SNPs.

read_snp_csv(snp_csv)

Read a user-provided SNPs CSV.

read_snps(snps_path)

Read SNPs CSV as a dictionary.

Attributes

CONVERSION_COMPLEMENT

dynast.preprocessing.aggregate_counts(df_counts, aggregates_path, conversions=frozenset([('TC',)]))

Aggregate conversion counts for each pair of bases.

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases

  • aggregates_path (str) – path to write aggregate CSV

  • conversions (list, optional) – conversion(s) in question, defaults to frozenset([('TC',)])

Returns

path to aggregate CSV that was written

Return type

str

dynast.preprocessing.calculate_mutation_rates(df_counts, rates_path, group_by=None)

Calculate mutation rate for each pair of bases.

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases

  • rates_path (str) – path to write rates CSV

  • group_by (list) – column(s) to group calculations by, defaults to None, which combines all rows

Returns

path to rates CSV

Return type

str
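
As a rough illustration of what a per-pair mutation rate can look like, the sketch below divides the summed count of each conversion (for example TC) by the summed content of its reference base (T), optionally within groups. The row layout, the `mutation_rates` helper and the formula itself are assumptions made for this example; they are not taken from the dynast source, and the real function writes a CSV rather than returning a dictionary.

```python
from collections import defaultdict


def mutation_rates(rows, conversions=("TC",), group_by=None):
    """Hypothetical helper: rate = sum(conversion) / sum(reference base)."""
    # Sum every needed column exactly once, even if conversions share a base.
    columns = set(conversions) | {conv[0] for conv in conversions}
    sums = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        # With group_by=None all rows fall into one group keyed by ().
        key = tuple(row[col] for col in group_by) if group_by else ()
        for col in columns:
            sums[key][col] += row[col]
    return {
        key: {
            conv: (total[conv] / total[conv[0]] if total[conv[0]] else 0.0)
            for conv in conversions
        }
        for key, total in sums.items()
    }


# Made-up counts: "TC" is the number of T>C conversions, "T" the T content.
rows = [
    {"barcode": "AAAC", "TC": 1, "T": 20},
    {"barcode": "AAAC", "TC": 3, "T": 20},
    {"barcode": "GGGT", "TC": 0, "T": 10},
]
overall = mutation_rates(rows)
per_barcode = mutation_rates(rows, group_by=["barcode"])
```

Here `overall` pools all rows, mirroring the documented behavior of group_by=None, while `per_barcode` computes one rate per barcode.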

dynast.preprocessing.merge_aggregates(*dfs)

Merge multiple aggregate dataframes into one.

Parameters

*dfs

dataframes to merge

Returns

merged dataframe

Return type

pandas.DataFrame

dynast.preprocessing.read_aggregates(aggregates_path)

Read aggregates CSV as a pandas dataframe.

Parameters

aggregates_path (str) – path to aggregates CSV

Returns

aggregates dataframe

Return type

pandas.DataFrame

dynast.preprocessing.read_rates(rates_path)

Read mutation rates CSV as a pandas dataframe.

Parameters

rates_path (str) – path to rates CSV

Returns

rates dataframe

Return type

pandas.DataFrame

dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)
dynast.preprocessing.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)
dynast.preprocessing.check_bam_contains_unmapped(bam_path)
dynast.preprocessing.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

Parameters
  • bam_path (str) – path to BAM

  • n_reads (int, optional) – number of reads to consider, defaults to 100000

  • n_threads (int, optional) – number of threads, defaults to 8

Returns

set of all tags found

Return type

set

dynast.preprocessing.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters
  • bam_path (str) – path to alignment BAM file

  • conversions_path (str) – path to output information about reads that have conversions

  • alignments_path (str) – path to alignments information about reads

  • index_path (str) – path to conversions index

  • no_index_path (str) – path to no conversions index

  • gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded

  • umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None

  • barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None

  • gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False

  • velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True

  • strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False

  • return_splits (bool, optional) – return BAM splits for later reuse, defaults to False

Returns

(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples, each containing a split BAM path and the number of reads in that BAM.

Return type

(str, str, str) or (str, str, str, list)
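
Because the length of the returned tuple depends on return_splits, calling code may want to normalize it before use. The helper below is a minimal sketch of one way to do that; `unpack_parse_result` and the example paths are hypothetical and not part of dynast.

```python
def unpack_parse_result(result):
    """Hypothetical helper: accept the 3- or 4-tuple and always return 4 values."""
    conversions_path, alignments_path, index_path, *rest = result
    # The fourth element is only present when return_splits=True was used.
    splits = rest[0] if rest else None
    return conversions_path, alignments_path, index_path, splits


# Stand-ins for the two documented return shapes (paths are made up).
without_splits = ("conversions.csv", "alignments.csv", "conversions.idx")
with_splits = without_splits + ([("split_0.bam", 1000), ("split_1.bam", 850)],)

a = unpack_parse_result(without_splits)
b = unpack_parse_result(with_splits)
```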

dynast.preprocessing.read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

alignments_path (str) – path to alignments CSV

Returns

alignments dataframe

Return type

pandas.DataFrame

dynast.preprocessing.read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

conversions_path (str) – path to conversions CSV

Returns

conversions dataframe

Return type

pandas.DataFrame

dynast.preprocessing.select_alignments(df_alignments)

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.

Parameters

df_alignments (pandas.DataFrame) – alignments dataframe

Returns

set of (read_id, alignment index) that were selected

Return type

set

dynast.preprocessing.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters
  • bam_path (str) – path to alignment BAM file to sort

  • out_path (str) – path to output sorted BAM

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to sorted and indexed BAM

Return type

str

dynast.preprocessing.call_consensus(bam_path, out_path, gene_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, quality=27, add_RS_RI=False, temp_dir=None, n_threads=8)
dynast.preprocessing.complement_counts(df_counts, gene_infos)

Complement the counts in the counts dataframe according to gene strand.

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe

  • gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

counts dataframe with counts complemented for reads mapping to genes on the reverse strand

Return type

pandas.DataFrame
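
To make the idea of complementing counts concrete, the sketch below renames base and conversion columns to their complements for a read assigned to a reverse-strand gene, so that a T>C conversion recorded against the reference is reported as A>G, and vice versa. The column names, the "-" strand marker and the `complement_row` helper are assumptions for illustration only; the actual function operates on a pandas DataFrame and on the gene_infos dictionary.

```python
BASES = "ACGT"
COMPLEMENT = dict(zip("ACGT", "TGCA"))
# Columns treated as counts: single bases and every pair of distinct bases.
COUNT_COLUMNS = set(BASES) | {a + b for a in BASES for b in BASES if a != b}


def complement_row(row, strand):
    """Hypothetical helper: swap count columns to their complements on '-'."""
    if strand != "-":
        return dict(row)
    return {
        ("".join(COMPLEMENT[base] for base in col) if col in COUNT_COLUMNS else col): value
        for col, value in row.items()
    }


row = {"GX": "geneB", "TC": 2, "AG": 0, "T": 15, "A": 9}
forward = complement_row(row, "+")
reverse = complement_row(row, "-")
```

Metadata columns such as GX are left untouched because they are not in `COUNT_COLUMNS`.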

dynast.preprocessing.CONVERSION_COMPLEMENT
dynast.preprocessing.count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters
  • conversions_path (str) – path to conversions CSV

  • alignments_path (str) – path to alignments information about reads

  • index_path (str) – path to conversions index

  • counts_path (str) – path to write counts CSV

  • gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None

  • conversions (list, optional) – conversions to prioritize when deduplicating; only applicable for UMI technologies, defaults to None

  • dedup_use_conversions (bool, optional) – prioritize reads that have at least one conversion when deduplicating, defaults to True

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to counts CSV

Return type

str

dynast.preprocessing.deduplicate_counts(df_counts, conversions=None, use_conversions=True)

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following:

1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have a larger sum of all conversions

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe

  • conversions (list, optional) – conversions to prioritize, defaults to None

  • use_conversions (bool, optional) – prioritize reads that have conversions first, defaults to True

Returns

deduplicated counts dataframe

Return type

pandas.DataFrame
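
The priority order described for this function can be expressed as a tuple-valued sort key, where earlier elements take precedence over later ones. The following is a minimal sketch under stated assumptions: reads are plain dictionaries, and the field names (`read_id`, `umi`, `transcriptome`, `score`) as well as the `priority` and `deduplicate` helpers are hypothetical. The real function works on a pandas DataFrame and may differ in detail.

```python
ALL_CONVERSIONS = [a + b for a in "ACGT" for b in "ACGT" if a != b]


def priority(read, conversions=None, use_conversions=True):
    """Hypothetical sort key; larger tuples win, compared element by element."""
    columns = conversions or ALL_CONVERSIONS
    total = sum(read.get(col, 0) for col in columns)
    return (
        (total > 0) if use_conversions else False,  # 1. has a conversion
        read["transcriptome"],                      # 2. exon-only alignment
        read["score"],                              # 3. alignment score
        total,                                      # 4. sum of conversions
    )


def deduplicate(reads, conversions=None, use_conversions=True):
    """Keep the highest-priority read per (barcode, UMI, gene)."""
    best = {}
    for read in reads:
        key = (read["barcode"], read["umi"], read["GX"])
        if key not in best or (
            priority(read, conversions, use_conversions)
            > priority(best[key], conversions, use_conversions)
        ):
            best[key] = read
    return list(best.values())


reads = [
    {"read_id": "r1", "barcode": "AAAC", "umi": "U1", "GX": "g1",
     "transcriptome": True, "score": 10, "TC": 0},
    {"read_id": "r2", "barcode": "AAAC", "umi": "U1", "GX": "g1",
     "transcriptome": False, "score": 5, "TC": 2},
    {"read_id": "r3", "barcode": "AAAC", "umi": "U2", "GX": "g1",
     "transcriptome": True, "score": 7, "TC": 1},
]
kept = [r["read_id"] for r in deduplicate(reads, conversions=["TC"])]
kept_no_conv = [
    r["read_id"]
    for r in deduplicate(reads, conversions=["TC"], use_conversions=False)
]
```

With use_conversions=True the converted read r2 beats r1 despite its lower score; with use_conversions=False the first criterion is neutral and r1 wins on transcriptome alignment.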

dynast.preprocessing.read_counts(counts_path, *args, **kwargs)

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

counts_path (str) – path to CSV

Returns

counts dataframe

Return type

pandas.DataFrame

dynast.preprocessing.split_counts_by_velocity(df_counts)

Split the given counts dataframe by the velocity column.

Parameters

df_counts (pandas.DataFrame) – counts dataframe

Returns

dictionary containing velocity column values as keys and the subset dataframe as values

Return type

dictionary
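
The shape of the returned dictionary can be pictured with a small grouping sketch. It uses plain dictionaries instead of a DataFrame; the `split_by_velocity` helper and the example velocity values are assumptions for illustration.

```python
from collections import defaultdict


def split_by_velocity(rows, column="velocity"):
    """Hypothetical helper: group rows by the value in the velocity column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


rows = [
    {"read_id": "r1", "velocity": "spliced"},
    {"read_id": "r2", "velocity": "unspliced"},
    {"read_id": "r3", "velocity": "spliced"},
]
splits = split_by_velocity(rows)
```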

dynast.preprocessing.calculate_coverage(bam_path, conversions, coverage_path, alignments=None, umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, velocity=True)

Calculate coverage of each genomic position per barcode.

Parameters
  • bam_path (str) – path to alignment BAM file

  • conversions (dictionary) – dictionary of contigs as keys and sets of genomic positions as values that indicate positions where conversions were observed

  • coverage_path (str) – path to write coverage CSV

  • alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

  • umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None

  • barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None

  • gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • velocity (bool, optional) – whether or not velocities were assigned

Returns

coverage CSV path

Return type

str

dynast.preprocessing.read_coverage(coverage_path)

Read coverage CSV as a dictionary.

Parameters

coverage_path (str) – path to coverage CSV

Returns

coverage as a nested dictionary

Return type

dict

dynast.preprocessing.detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)

Detect SNPs.

Parameters
  • conversions_path (str) – path to conversions CSV

  • index_path (str) – path to conversions index

  • coverage (dict) – dictionary containing genomic coverage

  • snps_path (str) – path to output SNPs

  • alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

  • conversions (set, optional) – set of conversions to consider

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • threshold (float, optional) – positions with conversions / coverage > threshold will be considered as SNPs, defaults to 0.5

  • min_coverage (int, optional) – only positions with at least this many mapping reads are considered, defaults to 1

  • n_threads (int, optional) – number of threads, defaults to 8
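
The threshold and min_coverage parameters together define a simple per-position rule, which the sketch below applies to in-memory dictionaries. The nested contig → position → count layout and the `call_snps` helper are assumptions for illustration; the actual function reads the conversions CSV and writes its results to snps_path.

```python
def call_snps(conversion_counts, coverage, threshold=0.5, min_coverage=1):
    """Hypothetical helper: flag positions where conversions / coverage > threshold."""
    snps = {}
    for contig, positions in conversion_counts.items():
        for position, n_conversions in positions.items():
            depth = coverage.get(contig, {}).get(position, 0)
            # Skip positions with too little coverage (and avoid dividing by 0).
            if depth < min_coverage or depth == 0:
                continue
            if n_conversions / depth > threshold:
                snps.setdefault(contig, set()).add(position)
    return snps


conversion_counts = {"chr1": {100: 9, 200: 1, 300: 2}}
coverage = {"chr1": {100: 10, 200: 10, 300: 2}}

strict = call_snps(conversion_counts, coverage, min_coverage=5)
lenient = call_snps(conversion_counts, coverage, min_coverage=1)
```

Position 300 is fully converted but covered by only two reads, so it is reported only when min_coverage is low enough to admit it.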

dynast.preprocessing.read_snp_csv(snp_csv)

Read a user-provided SNPs CSV.

Parameters

snp_csv (str) – path to SNPs CSV

Returns

dictionary of contigs as keys and sets of genomic positions with SNPs as values

Return type

dictionary

dynast.preprocessing.read_snps(snps_path)

Read SNPs CSV as a dictionary.

Parameters

snps_path (str) – path to SNPs CSV

Returns

dictionary of contigs as keys and sets of genomic positions with SNPs as values

Return type

dictionary