dynast.preprocessing.conversion
Module Contents
Functions
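- read_counts: Read counts CSV as a pandas dataframe.
- complement_counts: Complement the counts in the counts dataframe according to gene strand.
- drop_multimappers: Drop or relabel multimappings that have the same read ID.
- deduplicate_counts: Deduplicate counts based on barcode, UMI, and gene.
- drop_multimappers_part
- deduplicate_counts_part
- split_counts_by_velocity: Split the given counts dataframe by the velocity column.
- count_no_conversions: Count reads that have no conversion.
- count_conversions_part: Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.
- count_conversions: Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.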
Attributes
- dynast.preprocessing.conversion.CONVERSIONS_PARSER
- dynast.preprocessing.conversion.ALIGNMENTS_PARSER
- dynast.preprocessing.conversion.CONVERSION_IDX
- dynast.preprocessing.conversion.BASE_IDX
- dynast.preprocessing.conversion.CONVERSION_COMPLEMENT
- dynast.preprocessing.conversion.CONVERSION_COLUMNS
- dynast.preprocessing.conversion.BASE_COLUMNS
- dynast.preprocessing.conversion.COLUMNS
- dynast.preprocessing.conversion.CSV_COLUMNS
- dynast.preprocessing.conversion.read_counts(counts_path, *args, **kwargs)
Read counts CSV as a pandas dataframe.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
counts_path (str) – path to CSV
- Returns
counts dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.conversion.complement_counts(df_counts, gene_infos)
Complement the counts in the counts dataframe according to gene strand.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf
- Returns
counts dataframe with counts complemented for reads mapping to genes on the reverse strand
- Return type
pandas.DataFrame
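As a rough illustration of what complementing by strand means, the sketch below moves conversion counts to the complementary label (for example TC becomes AG) for a gene on the reverse strand. The COMPLEMENT mapping, both helper functions, and the 'strand' key in gene_infos are assumptions made for this example; they are not taken from dynast's implementation.

```python
# Hypothetical helpers for illustration; dynast's own logic lives in complement_counts.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_conversion(conversion):
    """Complement both bases of a two-letter conversion label, e.g. 'TC' -> 'AG'."""
    return "".join(COMPLEMENT[base] for base in conversion)


def complement_row(row, strand):
    """Return a copy of row, relabelling each count with its complementary
    conversion when the gene is on the reverse ('-') strand."""
    if strand != "-":
        return dict(row)
    return {complement_conversion(key): value for key, value in row.items()}


# Assumed layout: gene ID -> {'strand': '+' or '-'}
gene_infos = {"GENE_A": {"strand": "+"}, "GENE_B": {"strand": "-"}}
row = {"TC": 2, "AG": 0}

forward = complement_row(row, gene_infos["GENE_A"]["strand"])
reverse = complement_row(row, gene_infos["GENE_B"]["strand"])
```

Here forward is unchanged, while reverse carries the two TC counts under AG.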
- dynast.preprocessing.conversion.drop_multimappers(df_counts, conversions=None)
Drop multimappings that have the same read ID where
* some map to the transcriptome while some do not – drop non-transcriptome alignments
* none map to the transcriptome AND aligned to multiple genes – drop all
* none map to the transcriptome AND assigned multiple velocity types – set to ambiguous
TODO: This function can probably be removed because BAM parsing only considers primary alignments now.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
conversions (list, optional) – conversions to prioritize, defaults to None
- Returns
counts dataframe with multimappers appropriately filtered
- Return type
pandas.DataFrame
- dynast.preprocessing.conversion.deduplicate_counts(df_counts, conversions=None, use_conversions=True)
Deduplicate counts based on barcode, UMI, and gene.
The order of priority is the following.
1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have a larger sum of all conversions
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
conversions (list, optional) – conversions to prioritize, defaults to None
use_conversions (bool, optional) – prioritize reads that have conversions first, defaults to True
- Returns
deduplicated counts dataframe
- Return type
pandas.DataFrame
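The priority order above can be sketched as a sort key over plain dictionaries. This is a minimal sketch, not dynast's implementation: the record fields (read_id, transcriptome, score, counts) and both function names are hypothetical, and it assumes that when conversions is not provided, "at least one such conversion" means any conversion.

```python
def priority(read, conversions=None, use_conversions=True):
    """Build a tuple that compares higher for reads that should be kept."""
    # Sum only the requested conversions, or all of them if none were requested.
    keys = conversions if conversions else list(read["counts"])
    total = sum(read["counts"].get(key, 0) for key in keys)
    return (
        (total > 0) if use_conversions else False,  # 1. has a conversion
        read["transcriptome"],                      # 2. aligns to transcriptome
        read["score"],                              # 3. alignment score
        total,                                      # 4. sum of conversions
    )


def deduplicate(reads, conversions=None, use_conversions=True):
    """Keep the highest-priority read for each (barcode, UMI, gene) group."""
    best = {}
    for read in reads:
        key = (read["barcode"], read["umi"], read["gene"])
        rank = priority(read, conversions, use_conversions)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, read)
    return [read for _, read in best.values()]


reads = [
    {"read_id": "r1", "barcode": "AAAC", "umi": "GGT", "gene": "G1",
     "transcriptome": True, "score": 255, "counts": {"TC": 0}},
    {"read_id": "r2", "barcode": "AAAC", "umi": "GGT", "gene": "G1",
     "transcriptome": False, "score": 200, "counts": {"TC": 1}},
    {"read_id": "r3", "barcode": "AAAC", "umi": "CCA", "gene": "G1",
     "transcriptome": True, "score": 255, "counts": {"TC": 0}},
]

kept = [r["read_id"] for r in deduplicate(reads, conversions=["TC"])]
kept_no_conv = [r["read_id"] for r in deduplicate(reads, use_conversions=False)]
```

With use_conversions=True, r2 wins its group because it carries a TC conversion; with use_conversions=False, r1 wins because it aligns to the transcriptome.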
- dynast.preprocessing.conversion.drop_multimappers_part(counter, lock, split_path, out_path)
- dynast.preprocessing.conversion.deduplicate_counts_part(counter, lock, split_path, out_path, conversions=None, use_conversions=True)
- dynast.preprocessing.conversion.split_counts_by_velocity(df_counts)
Split the given counts dataframe by the velocity column.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
- Returns
dictionary containing velocity column values as keys and the subset dataframe as values
- Return type
dictionary
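A dictionary of this shape can be built with a pandas groupby. The snippet below is a sketch of the idea rather than dynast's code; the column names other than velocity and the example values are made up.

```python
import pandas as pd

# Toy counts dataframe; only the "velocity" column matters for the split.
df_counts = pd.DataFrame({
    "barcode": ["AAAC", "AAAC", "TTTG"],
    "gene": ["G1", "G2", "G1"],
    "velocity": ["spliced", "unspliced", "spliced"],
})

# One sub-dataframe per distinct value in the velocity column.
split = {
    velocity: df_part
    for velocity, df_part in df_counts.groupby("velocity")
}
```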
- dynast.preprocessing.conversion.count_no_conversions(alignments_path, counter, lock, index, barcodes=None, temp_dir=None, update_every=10000)
Count reads that have no conversion.
- Parameters
alignments_path (str) – alignments CSV path
counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed
lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time
index (list) – index for conversions CSV
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
temp_dir (str, optional) – path to temporary directory, defaults to None
update_every (int, optional) – update the counter every this many reads, defaults to 10000
- Returns
path to temporary counts CSV
- Return type
str
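The counter, lock, and update_every parameters suggest a batched progress counter: each worker counts locally and only takes the lock once per batch. The sketch below shows that general pattern with the standard library; process_reads is a hypothetical stand-in, not a dynast function, and it runs in a single process for simplicity.

```python
import multiprocessing


def process_reads(reads, counter, lock, update_every=10000):
    """Iterate over reads, adding to the shared counter in batches."""
    n = 0
    for _ in reads:
        n += 1
        if n == update_every:
            # Take the lock only once per batch to limit contention.
            with lock:
                counter.value += n
            n = 0
    # Flush the reads counted since the last batch update.
    with lock:
        counter.value += n


counter = multiprocessing.Value("I", 0)  # shared unsigned int
lock = multiprocessing.Lock()
process_reads(range(25), counter, lock, update_every=10)
processed = counter.value
```

A larger update_every means fewer lock acquisitions, at the cost of a coarser progress report.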
- dynast.preprocessing.conversion.count_conversions_part(conversions_path, alignments_path, counter, lock, index, barcodes=None, snps=None, quality=27, temp_dir=None, update_every=10000)
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode and gene. This function is used exclusively for multiprocessing.
- Parameters
conversions_path (str) – path to conversions CSV
alignments_path (str) – path to alignments information about reads
counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed
lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time
index (list) – list of (file position, number of lines) tuples to process
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
temp_dir (str, optional) – path to temporary directory, defaults to None
update_every (int, optional) – update the counter every this many reads, defaults to 10000
- Returns
path to temporary counts CSV
- Return type
str
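The snps and quality parameters describe two filters applied to each conversion. A minimal sketch of that filtering, assuming each conversion is described by a contig, a genomic position, and a PHRED score (keep_conversion is a hypothetical helper, not part of dynast):

```python
def keep_conversion(contig, position, phred, snps=None, quality=27):
    """Return True if a conversion passes the quality and SNP filters."""
    # Only conversions with quality strictly greater than the threshold count.
    if phred <= quality:
        return False
    # Skip conversions at known SNP positions on this contig.
    if snps and position in snps.get(contig, ()):
        return False
    return True


snps = {"chr1": [100, 250]}

passes = keep_conversion("chr1", 180, phred=30, snps=snps)
low_quality = keep_conversion("chr1", 180, phred=27, snps=snps)
at_snp = keep_conversion("chr1", 100, phred=30, snps=snps)
```

For large SNP lists, converting each list of positions to a set first would make the membership check faster.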
- dynast.preprocessing.conversion.count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.
- Parameters
conversions_path (str) – path to conversions CSV
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
counts_path (str) – path to write counts CSV
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None
conversions (list, optional) – conversions to prioritize when deduplicating; only applicable for UMI technologies, defaults to None
dedup_use_conversions (bool, optional) – prioritize reads that have at least one conversion when deduplicating, defaults to True
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
- Returns
path to counts CSV
- Return type
str