dynast.preprocessing
Submodules
Package Contents
Functions
aggregate_counts | Aggregate conversion counts for each pair of bases.
calculate_mutation_rates | Calculate mutation rate for each pair of bases.
merge_aggregates | Merge multiple aggregate dataframes into one.
read_aggregates | Read aggregates CSV as a pandas dataframe.
read_rates | Read mutation rates CSV as a pandas dataframe.
check_bam_contains_duplicate |
check_bam_contains_secondary |
check_bam_contains_unmapped |
get_tags_from_bam | Utility function to retrieve all read tags present in a BAM.
parse_all_reads | Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
read_alignments | Read alignments CSV as a pandas DataFrame.
read_conversions | Read conversions CSV as a pandas DataFrame.
select_alignments | Select alignments among duplicates.
sort_and_index_bam | Sort and index BAM.
call_consensus |
complement_counts | Complement the counts in the counts dataframe according to gene strand.
count_conversions | Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode.
deduplicate_counts | Deduplicate counts based on barcode, UMI, and gene.
read_counts | Read counts CSV as a pandas dataframe.
split_counts_by_velocity | Split the given counts dataframe by the velocity column.
calculate_coverage | Calculate coverage of each genomic position per barcode.
read_coverage | Read coverage CSV as a dictionary.
detect_snps | Detect SNPs.
read_snp_csv | Read a user-provided SNPs CSV.
read_snps | Read SNPs CSV as a dictionary.
Attributes
- dynast.preprocessing.aggregate_counts(df_counts, aggregates_path, conversions=frozenset([('TC',)]))
Aggregate conversion counts for each pair of bases.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases
aggregates_path (str) – path to write aggregate CSV
conversions (list, optional) – conversion(s) in question, defaults to frozenset([('TC',)])
- Returns
path to aggregate CSV that was written
- Return type
str
- dynast.preprocessing.calculate_mutation_rates(df_counts, rates_path, group_by=None)
Calculate mutation rate for each pair of bases.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe, with complemented reverse strand bases
rates_path (str) – path to write rates CSV
group_by (list, optional) – column(s) to group calculations by; defaults to None, which combines all rows
- Returns
path to rates CSV
- Return type
str
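To make the idea of a per-pair mutation rate concrete, the sketch below divides the number of observed conversions for each pair of bases by the number of occurrences of the reference base. This definition, the row layout (keys such as `TC` for conversion counts and `T` for base content) and the helper name `mutation_rates` are assumptions for illustration; they are not taken from dynast's implementation.

```python
from collections import defaultdict

BASES = "ACGT"
# All ordered pairs of distinct bases, e.g. "TC" means reference T read as C.
PAIRS = [a + b for a in BASES for b in BASES if a != b]


def mutation_rates(rows):
    """Return {pair: rate} over all rows combined (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        for key in PAIRS + list(BASES):
            totals[key] += row.get(key, 0)
    rates = {}
    for pair in PAIRS:
        ref_total = totals[pair[0]]
        # Leave the rate undefined when the reference base was never observed.
        rates[pair] = totals[pair] / ref_total if ref_total else None
    return rates


# Two made-up reads: base content plus conversion counts.
rows = [
    {"T": 40, "C": 30, "TC": 2},
    {"T": 60, "C": 20, "TC": 3, "CT": 1},
]
rates = mutation_rates(rows)
print(rates["TC"])  # 5 conversions over 100 reference T bases
```

Grouping (the `group_by` argument) would amount to running the same computation once per group of rows instead of over all rows.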
- dynast.preprocessing.merge_aggregates(*dfs)
Merge multiple aggregate dataframes into one.
- Parameters
*dfs – dataframes to merge
- Returns
merged dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.read_aggregates(aggregates_path)
Read aggregates CSV as a pandas dataframe.
- Parameters
aggregates_path (str) – path to aggregates CSV
- Returns
aggregates dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.read_rates(rates_path)
Read mutation rates CSV as a pandas dataframe.
- Parameters
rates_path (str) – path to rates CSV
- Returns
rates dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)
- dynast.preprocessing.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)
- dynast.preprocessing.check_bam_contains_unmapped(bam_path)
- dynast.preprocessing.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)
Utility function to retrieve all read tags present in a BAM.
- Parameters
bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8
- Returns
set of all tags found
- Return type
set
- dynast.preprocessing.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)
Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
- Parameters
bam_path (str) – path to alignment BAM file
conversions_path (str) – path to output information about reads that have conversions
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
no_index_path (str) – path to no conversions index
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False
return_splits (bool, optional) – return BAM splits for later reuse, defaults to False
- Returns
Tuple of (path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing split BAM paths and the number of reads in each BAM.
- Return type
(str, str, str) or (str, str, str, list)
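Because the length of the returned tuple depends on return_splits, calling code may want to normalize it. The following is a minimal sketch of one way to do that; `unpack_parse_result` and the file names are hypothetical and not part of dynast.

```python
def unpack_parse_result(result):
    """Normalize a 3- or 4-element result into a 4-tuple.

    The fourth element is the list of (split BAM path, number of reads)
    tuples when present, otherwise None.
    """
    conversions_path, alignments_path, index_path, *rest = result
    splits = rest[0] if rest else None
    return conversions_path, alignments_path, index_path, splits


# Made-up return values standing in for the two documented shapes.
without_splits = unpack_parse_result(
    ("conversions.csv", "alignments.csv", "conversions.idx")
)
with_splits = unpack_parse_result(
    ("conversions.csv", "alignments.csv", "conversions.idx", [("split_0.bam", 1000)])
)
print(without_splits[3], with_splits[3])
```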
- dynast.preprocessing.read_alignments(alignments_path, *args, **kwargs)
Read alignments CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
alignments_path (str) – path to alignments CSV
- Returns
alignments dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.read_conversions(conversions_path, *args, **kwargs)
Read conversions CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
conversions_path (str) – path to conversions CSV
- Returns
conversions dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.select_alignments(df_alignments)
Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.
- Parameters
df_alignments (pandas.DataFrame) – alignments dataframe
- Returns
set of (read_id, alignment index) that were selected
- Return type
set
- dynast.preprocessing.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)
Sort and index BAM.
If the BAM is already sorted, the sorting step is skipped.
- Parameters
bam_path (str) – path to alignment BAM file to sort
out_path (str) – path to output sorted BAM
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
- Returns
path to sorted and indexed BAM
- Return type
str
- dynast.preprocessing.call_consensus(bam_path, out_path, gene_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, quality=27, add_RS_RI=False, temp_dir=None, n_threads=8)
- dynast.preprocessing.complement_counts(df_counts, gene_infos)
Complement the counts in the counts dataframe according to gene strand.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf
- Returns
counts dataframe with counts complemented for reads mapping to genes on the reverse strand
- Return type
pandas.DataFrame
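The idea behind complementing can be sketched as follows: for a gene on the reverse strand, a conversion observed as A-to-G on the reference corresponds to T-to-C on the transcript, so base and conversion keys are swapped for their complements. The row layout, the `strand` key inside `gene_infos`, and the helper names are assumptions for illustration; the mapping below is a stand-in and not the contents of CONVERSION_COMPLEMENT.

```python
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_key(key):
    """Complement a base ("A") or conversion ("AG") key."""
    return "".join(BASE_COMPLEMENT[base] for base in key)


def complement_row(row, strand):
    """Return a copy of `row`, complementing base and conversion keys
    when the gene lies on the reverse strand ("-")."""
    if strand != "-":
        return dict(row)
    out = {}
    for key, value in row.items():
        # Only keys made up entirely of A/C/G/T are treated as counts.
        is_count_key = all(ch in BASE_COMPLEMENT for ch in key)
        out[complement_key(key) if is_count_key else key] = value
    return out


# Hypothetical gene information and count rows.
gene_infos = {"GENE1": {"strand": "+"}, "GENE2": {"strand": "-"}}
rows = [
    {"gene": "GENE1", "T": 10, "TC": 2},
    {"gene": "GENE2", "A": 12, "AG": 3},
]
complemented = [complement_row(r, gene_infos[r["gene"]]["strand"]) for r in rows]
print(complemented[1])
```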
- dynast.preprocessing.CONVERSION_COMPLEMENT
- dynast.preprocessing.count_conversions(conversions_path, alignments_path, index_path, counts_path, gene_infos, barcodes=None, snps=None, quality=27, conversions=None, dedup_use_conversions=True, n_threads=8, temp_dir=None)
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.
- Parameters
conversions_path (str) – path to conversions CSV
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
counts_path (str) – path to write counts CSV
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
snps (dictionary, optional) – dictionary of contig as keys and list of genomic positions as values that indicate SNP locations, defaults to None
conversions (list, optional) – conversions to prioritize when deduplicating; only applicable for UMI technologies, defaults to None
dedup_use_conversions (bool, optional) – prioritize reads that have at least one conversion when deduplicating, defaults to True
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
- Returns
path to counts CSV
- Return type
str
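The quality filter is a strict inequality: a conversion is counted only if its PHRED score is greater than the threshold, so a score equal to 27 is excluded by default. The sketch below illustrates this using Phred+33 ASCII characters as they appear in SAM text and FASTQ files. How dynast stores qualities internally is not stated here, so the character encoding, the sample data and the helper name are assumptions.

```python
PHRED_OFFSET = 33  # Phred+33 ASCII encoding used in SAM text and FASTQ


def passes_quality(quality_char, threshold=27):
    """True if the decoded PHRED score is strictly greater than `threshold`."""
    return ord(quality_char) - PHRED_OFFSET > threshold


# Hypothetical conversions: (conversion, quality character at that base).
observed = [("TC", "I"), ("TC", "<"), ("AG", "5")]
kept = [conv for conv, qual in observed if passes_quality(qual)]
print(kept)  # "I" decodes to 40, "<" to 27, "5" to 20
```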
- dynast.preprocessing.deduplicate_counts(df_counts, conversions=None, use_conversions=True)
Deduplicate counts based on barcode, UMI, and gene.
The order of priority is the following.
1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have a larger sum of all conversions
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
conversions (list, optional) – conversions to prioritize, defaults to None
use_conversions (bool, optional) – prioritize reads that have conversions first, defaults to True
- Returns
deduplicated counts dataframe
- Return type
pandas.DataFrame
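The priority order above can be expressed as a tuple comparison, where earlier elements take precedence over later ones. This is a minimal sketch on plain dictionaries rather than a dataframe; the field names (`barcode`, `umi`, `gene`, `transcriptome`, `score`) and the helper name are assumptions, not dynast's column names or implementation.

```python
BASES = "ACGT"
ALL_CONVERSIONS = [a + b for a in BASES for b in BASES if a != b]


def deduplicate(rows, conversions=None, use_conversions=True):
    """Keep the highest-priority row per (barcode, umi, gene)."""
    keys = conversions if conversions else ALL_CONVERSIONS

    def conversion_sum(row):
        return sum(row.get(key, 0) for key in keys)

    def priority(row):
        # Tuples compare element by element, mirroring priorities 1 to 4.
        has_conversion = use_conversions and conversion_sum(row) > 0
        return (has_conversion, row["transcriptome"], row["score"], conversion_sum(row))

    best = {}
    for row in rows:
        group = (row["barcode"], row["umi"], row["gene"])
        if group not in best or priority(row) > priority(best[group]):
            best[group] = row
    return list(best.values())


# Made-up reads; the first two share barcode, UMI and gene.
rows = [
    {"barcode": "AAAC", "umi": "GGTT", "gene": "G1", "transcriptome": True, "score": 90, "TC": 0},
    {"barcode": "AAAC", "umi": "GGTT", "gene": "G1", "transcriptome": False, "score": 80, "TC": 1},
    {"barcode": "AAAC", "umi": "CCAA", "gene": "G1", "transcriptome": True, "score": 70, "TC": 0},
]
with_conv = deduplicate(rows, conversions=["TC"])
without_conv = deduplicate(rows, conversions=["TC"], use_conversions=False)
print(with_conv[0]["score"], without_conv[0]["score"])
```

With use_conversions=True the converted read wins even though it is not exon-only; with use_conversions=False the transcriptome alignment takes precedence.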
- dynast.preprocessing.read_counts(counts_path, *args, **kwargs)
Read counts CSV as a pandas dataframe.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
counts_path (str) – path to CSV
- Returns
counts dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.split_counts_by_velocity(df_counts)
Split the given counts dataframe by the velocity column.
- Parameters
df_counts (pandas.DataFrame) – counts dataframe
- Returns
dictionary containing velocity column values as keys and the subset dataframe as values
- Return type
dictionary
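The shape of the return value can be illustrated with plain dictionaries: one entry per distinct value in the velocity column, each holding the matching rows. The velocity labels used below are made up for the example, and the helper is a sketch rather than the dataframe-based implementation.

```python
from collections import defaultdict


def split_by_velocity(rows):
    """Group rows by their "velocity" value (hypothetical helper)."""
    splits = defaultdict(list)
    for row in rows:
        splits[row["velocity"]].append(row)
    return dict(splits)


rows = [
    {"read_id": "r1", "velocity": "spliced"},
    {"read_id": "r2", "velocity": "unspliced"},
    {"read_id": "r3", "velocity": "spliced"},
]
splits = split_by_velocity(rows)
print(sorted(splits))
```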
- dynast.preprocessing.calculate_coverage(bam_path, conversions, coverage_path, alignments=None, umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, velocity=True)
Calculate coverage of each genomic position per barcode.
- Parameters
bam_path (str) – path to alignment BAM file
conversions (dictionary) – dictionary of contigs as keys and sets of genomic positions as values that indicates positions where conversions were observed
coverage_path (str) – path to write coverage CSV
alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
temp_dir (str, optional) – path to temporary directory, defaults to None
velocity (bool, optional) – whether or not velocities were assigned, defaults to True
- Returns
coverage CSV path
- Return type
str
- dynast.preprocessing.read_coverage(coverage_path)
Read coverage CSV as a dictionary.
- Parameters
coverage_path (str) – path to coverage CSV
- Returns
coverage as a nested dictionary
- Return type
dict
- dynast.preprocessing.detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)
Detect SNPs.
- Parameters
conversions_path (str) – path to conversions CSV
index_path (str) – path to conversions index
coverage (dict) – dictionary containing genomic coverage
snps_path (str) – path to output SNPs
alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
conversions (set, optional) – set of conversions to consider
quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27
threshold (float, optional) – positions with conversions / coverage > threshold will be considered as SNPs, defaults to 0.5
min_coverage (int, optional) – only positions with at least this many mapping reads are considered, defaults to 1
n_threads (int, optional) – number of threads, defaults to 8
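The interaction of threshold and min_coverage can be sketched as below: a position is called a SNP when its coverage is at least min_coverage and the fraction conversions / coverage is strictly greater than threshold. The nested dictionary layout (contig, then position, then count) and the helper name `call_snps` are assumptions for illustration.

```python
def call_snps(conversion_counts, coverage, threshold=0.5, min_coverage=1):
    """Return {contig: set of positions} passing both filters."""
    snps = {}
    for contig, positions in conversion_counts.items():
        for position, n_conversions in positions.items():
            depth = coverage.get(contig, {}).get(position, 0)
            # Skip positions that are uncovered or below the coverage floor.
            if depth == 0 or depth < min_coverage:
                continue
            if n_conversions / depth > threshold:
                snps.setdefault(contig, set()).add(position)
    return snps


# Made-up counts: position 200 sits exactly at the threshold and is not called.
conversion_counts = {"chr1": {100: 9, 200: 5, 300: 1}}
coverage = {"chr1": {100: 10, 200: 10, 300: 1}}
default_calls = call_snps(conversion_counts, coverage)
strict_calls = call_snps(conversion_counts, coverage, min_coverage=5)
print(default_calls, strict_calls)
```

Raising min_coverage removes position 300, which has a high conversion fraction but is supported by a single read.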
- dynast.preprocessing.read_snp_csv(snp_csv)
Read a user-provided SNPs CSV.
- Parameters
snp_csv (str) – path to SNPs CSV
- Returns
dictionary of contigs as keys and sets of genomic positions with SNPs as values
- Return type
dictionary
- dynast.preprocessing.read_snps(snps_path)
Read SNPs CSV as a dictionary.
- Parameters
snps_path (str) – path to SNPs CSV
- Returns
dictionary of contigs as keys and sets of genomic positions with SNPs as values
- Return type
dictionary
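For orientation, the returned structure (contigs as keys, sets of positions as values) can be built from a CSV as in the sketch below. The column names `contig` and `genome_i` and the helper name are assumptions; the actual file layout is not specified in this reference.

```python
import csv
import io


def read_positions_csv(handle):
    """Read rows of (contig, position) into {contig: set of positions}."""
    positions = {}
    for row in csv.DictReader(handle):
        positions.setdefault(row["contig"], set()).add(int(row["genome_i"]))
    return positions


# In-memory stand-in for a SNPs CSV on disk.
example = io.StringIO("contig,genome_i\nchr1,100\nchr1,300\nchr2,42\n")
snps = read_positions_csv(example)
print(snps)
```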