dynast.preprocessing.conversion

Module Contents

Functions

read_counts(counts_path, *args, **kwargs)

Read counts CSV as a pandas dataframe.

complement_counts(df_counts, gene_infos)

Complement the counts in the counts dataframe according to gene strand.

drop_multimappers(df_counts, conversions=None)

Drop or reassign multimappings that have the same read ID, depending on their transcriptome, gene, and velocity assignments.

deduplicate_counts(df_counts, conversions=None, use_conversions=True)

Deduplicate counts based on barcode, UMI, and gene.

drop_multimappers_part(counter, lock, split_path, out_path)

deduplicate_counts_part(counter, lock, split_path, out_path, conversions=None, use_conversions=True)

split_counts_by_velocity(df_counts)

Split the given counts dataframe by the velocity column.

count_no_conversions(alignments_path, counter, lock, index, barcodes=None, temp_dir=None, update_every=10000)

Count reads that have no conversion.

count_conversions_part(conversions_path, alignments_path, counter, lock, index, barcodes=None, snps=None, quality=27, temp_dir=None, update_every=10000)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

Attributes

CONVERSIONS_PARSER

ALIGNMENTS_PARSER

CONVERSION_IDX

BASE_IDX

CONVERSION_COMPLEMENT

CONVERSION_COLUMNS

BASE_COLUMNS

COLUMNS

CSV_COLUMNS

dynast.preprocessing.conversion.CONVERSIONS_PARSER
dynast.preprocessing.conversion.ALIGNMENTS_PARSER
dynast.preprocessing.conversion.CONVERSION_IDX
dynast.preprocessing.conversion.BASE_IDX
dynast.preprocessing.conversion.CONVERSION_COMPLEMENT
dynast.preprocessing.conversion.CONVERSION_COLUMNS
dynast.preprocessing.conversion.BASE_COLUMNS
dynast.preprocessing.conversion.COLUMNS
dynast.preprocessing.conversion.CSV_COLUMNS
dynast.preprocessing.conversion.read_counts(counts_path, *args, **kwargs)

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

counts_path (str) – path to CSV

Returns

counts dataframe

Return type

pandas.DataFrame

dynast.preprocessing.conversion.complement_counts(df_counts, gene_infos)

Complement the counts in the counts dataframe according to gene strand.

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe

  • gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

counts dataframe with counts complemented for reads mapping to genes on the reverse strand

Return type

pandas.DataFrame
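
As a rough illustration of what complementing by strand means, the sketch below works on plain dictionaries rather than a dataframe. The column names ("GX" for the gene, two-letter conversion columns such as "TC", single-letter base columns) and the gene_strands mapping are assumptions made for this example; the actual layout is defined by CONVERSION_COMPLEMENT, COLUMNS and gene_infos, which this page does not spell out.

```python
# Hypothetical column layout: "GX" is the gene, "TC" means T-to-C conversions,
# and single letters hold base content. These names are assumptions.
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_key(key):
    """Complement every base letter in a column name, e.g. "TC" -> "AG"."""
    return "".join(BASE_COMPLEMENT[base] for base in key)


def complement_row(row, gene_strands):
    """Return a copy of `row`; if its gene is on the reverse ("-") strand,
    move each count to the complementary column."""
    if gene_strands[row["GX"]] != "-":
        return dict(row)
    out = {"GX": row["GX"]}
    for key, value in row.items():
        if key != "GX":
            out[complement_key(key)] = value
    return out


# Stand-in for the strand information carried by gene_infos (assumed shape).
gene_strands = {"geneA": "+", "geneB": "-"}
rows = [
    {"GX": "geneA", "TC": 2, "AG": 0, "T": 10, "A": 7},
    {"GX": "geneB", "TC": 0, "AG": 3, "T": 5, "A": 9},
]
complemented = [complement_row(row, gene_strands) for row in rows]
```

In this sketch the forward-strand row is left unchanged, while the reverse-strand row has its "AG" count reported under "TC" and its "T" and "A" contents swapped.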

dynast.preprocessing.conversion.drop_multimappers(df_counts, conversions=None)

Drop multimappings that have the same read ID where
  • some map to the transcriptome while some do not – drop non-transcriptome alignments

  • none map to the transcriptome AND aligned to multiple genes – drop all

  • none map to the transcriptome AND assigned multiple velocity types – set to ambiguous

TODO: This function can probably be removed because BAM parsing only considers primary alignments now.

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe

  • conversions (list, optional) – conversions to prioritize, defaults to None

Returns

counts dataframe with multimappers appropriately filtered

Return type

pandas.DataFrame
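
One possible reading of the multimapper rules, written against plain dictionaries instead of a dataframe, is sketched here. The field names ("read_id", "transcriptome", "GX", "velocity") are hypothetical, and collapsing a group to a single row marked "ambiguous" is an assumption about how the last rule might be realized, not a description of the library's behavior.

```python
from collections import defaultdict


def resolve_multimappers(rows):
    """Apply the three multimapper rules to rows grouped by read ID."""
    by_read = defaultdict(list)
    for row in rows:
        by_read[row["read_id"]].append(row)

    kept = []
    for group in by_read.values():
        on_transcriptome = [r for r in group if r["transcriptome"]]
        if len(group) == 1 or len(on_transcriptome) == len(group):
            kept.extend(group)  # nothing for the rules to decide
        elif on_transcriptome:
            kept.extend(on_transcriptome)  # drop non-transcriptome alignments
        elif len({r["GX"] for r in group}) > 1:
            continue  # multiple genes, none on the transcriptome: drop all
        elif len({r["velocity"] for r in group}) > 1:
            # Multiple velocity types: keep one row marked ambiguous (assumption).
            kept.append({**group[0], "velocity": "ambiguous"})
        else:
            kept.extend(group)
    return kept


rows = [
    {"read_id": "r1", "transcriptome": True, "GX": "geneA", "velocity": "spliced"},
    {"read_id": "r1", "transcriptome": False, "GX": "geneB", "velocity": "unspliced"},
    {"read_id": "r2", "transcriptome": False, "GX": "geneA", "velocity": "unspliced"},
    {"read_id": "r2", "transcriptome": False, "GX": "geneB", "velocity": "unspliced"},
    {"read_id": "r3", "transcriptome": False, "GX": "geneC", "velocity": "spliced"},
    {"read_id": "r3", "transcriptome": False, "GX": "geneC", "velocity": "unspliced"},
    {"read_id": "r4", "transcriptome": False, "GX": "geneD", "velocity": "unspliced"},
]
resolved = resolve_multimappers(rows)
```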

dynast.preprocessing.conversion.deduplicate_counts(df_counts, conversions=None, use_conversions=True)

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following.
  1. If use_conversions=True, reads that have at least one such conversion

  2. Reads that align to the transcriptome (exon only)

  3. Reads that have the highest alignment score

  4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have a larger sum of all conversions

Parameters
  • df_counts (pandas.DataFrame) – counts dataframe

  • conversions (list, optional) – conversions to prioritize, defaults to None

  • use_conversions (bool, optional) – prioritize reads that have conversions first, defaults to True

Returns

deduplicated counts dataframe

Return type

pandas.DataFrame
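
The priority order can be pictured as a tuple-valued sort key, as in the minimal sketch below. It uses plain dictionaries instead of a dataframe; the field names ("barcode", "umi", "GX", "transcriptome", "score") and the two-column ALL_CONVERSIONS tuple are assumptions for illustration only.

```python
# Hypothetical conversion columns; the real set is given by CONVERSION_COLUMNS.
ALL_CONVERSIONS = ("TC", "AG")


def priority(row, conversions=None, use_conversions=True):
    """Rank a read; larger tuples win. Mirrors the four-step order above."""
    selected = conversions if conversions else ALL_CONVERSIONS
    n_selected = sum(row[c] for c in selected)
    has_conversion = use_conversions and n_selected > 0
    return (has_conversion, row["transcriptome"], row["score"], n_selected)


def deduplicate(rows, conversions=None, use_conversions=True):
    """Keep the highest-priority read per (barcode, UMI, gene)."""
    best = {}
    for row in rows:
        key = (row["barcode"], row["umi"], row["GX"])
        rank = priority(row, conversions, use_conversions)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


rows = [
    {"read_id": "r1", "barcode": "AAAC", "umi": "GGT", "GX": "geneA",
     "transcriptome": True, "score": 60, "TC": 0, "AG": 0},
    {"read_id": "r2", "barcode": "AAAC", "umi": "GGT", "GX": "geneA",
     "transcriptome": False, "score": 50, "TC": 1, "AG": 0},
    {"read_id": "r3", "barcode": "AAAC", "umi": "CCA", "GX": "geneA",
     "transcriptome": True, "score": 40, "TC": 0, "AG": 0},
]
with_conversions = [r["read_id"] for r in deduplicate(rows)]
without_conversions = [
    r["read_id"] for r in deduplicate(rows, use_conversions=False)
]
```

Here r1 and r2 share a barcode, UMI, and gene. With use_conversions=True the converted read r2 wins; with use_conversions=False the transcriptome-aligned read r1 wins instead.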

dynast.preprocessing.conversion.drop_multimappers_part(counter, lock, split_path, out_path)
dynast.preprocessing.conversion.deduplicate_counts_part(counter, lock, split_path, out_path, conversions=None, use_conversions=True)
dynast.preprocessing.conversion.split_counts_by_velocity(df_counts)

Split the given counts dataframe by the velocity column.

Parameters

df_counts (pandas.DataFrame) – counts dataframe

Returns

dictionary containing velocity column values as keys and the subset dataframe as values

Return type

dictionary
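
The shape of the returned dictionary can be sketched with plain Python rows standing in for the dataframe. The "velocity" field name and the values "spliced" and "unspliced" are assumptions for this example; only "ambiguous" is mentioned elsewhere on this page.

```python
from collections import defaultdict


def split_by_velocity(rows):
    """Group rows by their velocity value, preserving row order per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["velocity"]].append(row)
    return dict(groups)


rows = [
    {"read_id": "r1", "velocity": "spliced"},
    {"read_id": "r2", "velocity": "unspliced"},
    {"read_id": "r3", "velocity": "spliced"},
    {"read_id": "r4", "velocity": "ambiguous"},
]
by_velocity = split_by_velocity(rows)
```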

dynast.preprocessing.conversion.count_no_conversions(alignments_path, counter, lock, index, barcodes=None, temp_dir=None, update_every=10000)

Count reads that have no conversion.

Parameters
  • alignments_path (str) – alignments CSV path

  • counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed

  • lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time

  • index (list) – index for conversions CSV

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • update_every (int, optional) – update the counter every this many reads, defaults to 10000

Returns

path to temporary counts CSV

Return type

str
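
The counter, lock, and update_every parameters suggest a batched progress-reporting pattern, which might look roughly like the sketch below. The process_reads function is hypothetical and runs in a single process here; it only shows how batching updates keeps lock acquisitions infrequent.

```python
import multiprocessing


def process_reads(reads, counter, lock, update_every=10000):
    """Consume `reads`, adding to the shared counter once per batch."""
    pending = 0
    processed = 0
    for _ in reads:
        processed += 1
        pending += 1
        if pending == update_every:
            with lock:  # only one process may modify the counter at a time
                counter.value += pending
            pending = 0
    if pending:  # flush the final partial batch
        with lock:
            counter.value += pending
    return processed


counter = multiprocessing.Value("I", 0)  # shared unsigned integer
lock = multiprocessing.Lock()
total = process_reads(range(25), counter, lock, update_every=10)
```

A larger update_every means fewer lock acquisitions at the cost of a coarser progress readout.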

dynast.preprocessing.conversion.count_conversions_part(conversions_path, alignments_path, counter, lock, index, barcodes=None, snps=None, quality=27, temp_dir=None, update_every=10000)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode and gene. This function is used exclusively for multiprocessing.

Parameters
  • conversions_path (str) – path to conversions CSV

  • alignments_path (str) – path to alignments information about reads

  • counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed

  • lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time

  • index (list) – list of (file position, number of lines) tuples to process

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • update_every (int, optional) – update the counter every this many reads, defaults to 10000

Returns

path to temporary counts CSV

Return type

str
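
An index of (file position, number of lines) tuples lets each worker seek directly to its share of a file. The sketch below illustrates the idea under stated assumptions: build_index and read_chunk are hypothetical helpers written for this example, and the two-column file content is invented.

```python
import os
import tempfile


def build_index(path, lines_per_chunk):
    """Return (byte offset, line count) tuples covering the whole file."""
    index = []
    with open(path, "rb") as f:
        start, n = 0, 0
        for _ in iter(f.readline, b""):
            n += 1
            if n == lines_per_chunk:
                index.append((start, n))
                start, n = f.tell(), 0
        if n:  # trailing partial chunk
            index.append((start, n))
    return index


def read_chunk(path, position, n_lines):
    """Seek to `position` and read exactly `n_lines` lines."""
    with open(path, "rb") as f:
        f.seek(position)
        return [f.readline().decode().rstrip("\n") for _ in range(n_lines)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conversions.csv")
    with open(path, "w", newline="\n") as f:
        f.writelines(f"read{i},TC\n" for i in range(5))
    index = build_index(path, lines_per_chunk=2)
    chunks = [read_chunk(path, position, n) for position, n in index]
```

Each tuple can be handed to a separate process, which then reads only its own lines.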

dynast.preprocessing.conversion.count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters
  • conversions_path (str) – path to conversions CSV

  • alignments_path (str) – path to alignments information about reads

  • index_path (str) – path to conversions index

  • counts_path (str) – path to write counts CSV

  • gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None

  • conversions (list, optional) – conversions to prioritize when deduplicating; only applicable for UMI technologies, defaults to None

  • dedup_use_conversions (bool, optional) – prioritize reads that have at least one conversion when deduplicating, defaults to True

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to counts CSV

Return type

str
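
The quality and snps parameters can be read as two filters applied to each observed conversion, as in the sketch below. The record fields ("contig", "position", "quality") are hypothetical, and treating conversions at SNP positions as excluded is an assumption drawn from the parameter description.

```python
def filter_conversions(conversions, snps=None, quality=27):
    """Keep conversions with PHRED quality strictly greater than `quality`
    that do not fall on a listed SNP position."""
    # Convert position lists to sets once so membership tests are fast.
    snp_sets = {contig: set(pos) for contig, pos in (snps or {}).items()}
    kept = []
    for conv in conversions:
        if conv["quality"] <= quality:
            continue  # not strictly greater than the threshold
        if conv["position"] in snp_sets.get(conv["contig"], ()):
            continue  # coincides with a known SNP
        kept.append(conv)
    return kept


conversions = [
    {"contig": "chr1", "position": 100, "quality": 35},
    {"contig": "chr1", "position": 150, "quality": 27},
    {"contig": "chr1", "position": 200, "quality": 40},
    {"contig": "chr2", "position": 200, "quality": 40},
]
snps = {"chr1": [200]}
kept = filter_conversions(conversions, snps=snps, quality=27)
```

The second record is dropped because its quality equals the threshold rather than exceeding it, and the third because it sits on a listed SNP position.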