dynast.preprocessing.bam
Module Contents
Functions
read_alignments | Read alignments CSV as a pandas DataFrame.
read_conversions | Read conversions CSV as a pandas DataFrame.
select_alignments | Select alignments among duplicates.
parse_read_contig | Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.
get_tags_from_bam | Utility function to retrieve all read tags present in a BAM.
check_bam_tags_exist | Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
check_bam_is_paired | Utility function to check if BAM has paired reads.
check_bam_contains_secondary |
check_bam_contains_unmapped |
check_bam_contains_duplicate |
sort_and_index_bam | Sort and index BAM.
split_bam | Split BAM into n parts.
parse_all_reads | Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
Attributes
- dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']
- dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']
- dynast.preprocessing.bam.read_alignments(alignments_path, *args, **kwargs)
Read alignments CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
alignments_path (str) – path to alignments CSV
- Returns
alignments dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.bam.read_conversions(conversions_path, *args, **kwargs)
Read conversions CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
conversions_path (str) – path to conversions CSV
- Returns
conversions dataframe
- Return type
pandas.DataFrame
- dynast.preprocessing.bam.select_alignments(df_alignments)
Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.
- Parameters
df_alignments (pandas.DataFrame) – alignments dataframe
- Returns
set of (read_id, alignment index) that were selected
- Return type
set
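The selection rule itself is not spelled out above. As a purely illustrative sketch, assume deduplication keeps the highest-scoring alignment within each (barcode, umi, GX) group; this rule and the helper name select_best_alignments are hypothetical and may differ from what select_alignments actually does. The sketch works on plain dictionaries keyed by names from ALIGNMENT_COLUMNS rather than on a DataFrame.

```python
def select_best_alignments(rows):
    """Keep one (read_id, index) pair per (barcode, umi, GX) group.

    ASSUMPTION: "best" means highest score. This is a hypothetical rule used
    only for illustration, not the library's documented behavior.
    """
    best = {}
    for row in rows:
        key = (row["barcode"], row["umi"], row["GX"])
        # Replace the stored alignment only if this one scores strictly higher.
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    # Return a set of (read_id, alignment index) tuples, as documented above.
    return {(row["read_id"], row["index"]) for row in best.values()}


rows = [
    {"read_id": "r1", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "g1", "score": 90},
    {"read_id": "r2", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "g1", "score": 95},
    {"read_id": "r3", "index": 1, "barcode": "TTTG", "umi": "CCA", "GX": "g2", "score": 80},
]
selected = select_best_alignments(rows)
```

Here r1 and r2 share a barcode, UMI and gene, so only the higher-scoring r2 survives, while r3 is kept as the sole member of its group.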
- dynast.preprocessing.bam.parse_read_contig(counter, lock, bam_path, contig, gene_infos=None, transcript_infos=None, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, update_every=2000, nasc=False, velocity=True, strict_exon_overlap=False)
Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.
- Parameters
counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed
lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time
bam_path (str) – path to alignment BAM file
contig (str) – only reads that map to this contig will be processed
gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None
transcript_infos (dictionary) – dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, None (unstranded)
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
temp_dir (str, optional) – path to temporary directory, defaults to None
update_every (int, optional) – update the counter every this many reads, defaults to 2000
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – Whether to use a stricter algorithm to assign reads as spliced, defaults to False
- Returns
(path to conversions, path to conversions index, path to alignments)
- Return type
(str, str, str)
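The counter, lock and update_every parameters describe a common progress-reporting pattern: each worker tallies reads locally and only takes the lock once per batch. A minimal sketch of that pattern follows, assuming nothing about the actual read-parsing code. The names count_reads and run_demo are hypothetical, and the demo runs in the current process for simplicity, although the same function could be used as a multiprocessing.Process target.

```python
import multiprocessing


def count_reads(counter, lock, n_reads, update_every=2000):
    """Process n_reads items, flushing progress to the shared counter in batches."""
    pending = 0
    for _ in range(n_reads):
        # A real worker would parse one read here.
        pending += 1
        if pending == update_every:
            # Take the lock only once per batch to limit contention.
            with lock:
                counter.value += pending
            pending = 0
    # Flush whatever is left over after the loop.
    if pending:
        with lock:
            counter.value += pending


def run_demo(n_reads, update_every):
    # lock=False gives a raw shared value; the explicit Lock guards updates.
    counter = multiprocessing.Value("I", 0, lock=False)
    lock = multiprocessing.Lock()
    count_reads(counter, lock, n_reads, update_every)
    return counter.value


total = run_demo(n_reads=4500, update_every=2000)
```

Batching keeps lock contention low when many contigs are parsed in parallel, at the cost of the counter lagging by up to update_every reads per worker.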
- dynast.preprocessing.bam.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)
Utility function to retrieve all read tags present in a BAM.
- Parameters
bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8
- Returns
set of all tags found
- Return type
set
- dynast.preprocessing.bam.check_bam_tags_exist(bam_path, tags, n_reads=100000, n_threads=8)
Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
- Parameters
bam_path (str) – path to BAM
tags (list) – tags to check for
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8
- Returns
(whether all tags were found, list of not found tags)
- Return type
(bool, list)
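The (bool, list) return value can be illustrated without touching a BAM file. The sketch below assumes the set of tags has already been collected (for example by get_tags_from_bam) and only shows how the two return values relate. The helper name check_tags is hypothetical and is not part of the module.

```python
def check_tags(found_tags, tags):
    """Return (whether all tags were found, list of tags not found)."""
    # Preserve the order of the requested tags in the "not found" list.
    missing = [tag for tag in tags if tag not in found_tags]
    return len(missing) == 0, missing


# Hypothetical tag set, standing in for what a BAM scan might return.
found = {"CB", "UB", "GX", "NH"}
all_found, missing = check_tags(found, ["CB", "UB", "GN"])
```

With this input, all_found is False and missing is ["GN"], so a caller can report exactly which tags are absent.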
- dynast.preprocessing.bam.check_bam_is_paired(bam_path, n_reads=100000, n_threads=8)
Utility function to check if BAM has paired reads.
- Parameters
bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8
- Returns
whether paired reads were detected
- Return type
bool
- dynast.preprocessing.bam.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)
- dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path)
- dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)
- dynast.preprocessing.bam.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)
Sort and index BAM.
If the BAM is already sorted, the sorting step is skipped.
- Parameters
bam_path (str) – path to alignment BAM file to sort
out_path (str) – path to output sorted BAM
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
- Returns
path to sorted and indexed BAM
- Return type
str
- dynast.preprocessing.bam.split_bam(bam_path, n, n_threads=8, temp_dir=None)
Split BAM into n parts.
- Parameters
bam_path (str) – path to alignment BAM file
n (int) – number of splits
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
- Returns
List of tuples containing (split BAM path, number of reads)
- Return type
list
- dynast.preprocessing.bam.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)
Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
- Parameters
bam_path (str) – path to alignment BAM file
conversions_path (str) – path to output information about reads that have conversions
alignments_path (str) – path to alignments information about reads
index_path (str) – path to conversions index
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – Whether to use a stricter algorithm to assign reads as spliced, defaults to False
return_splits (bool, optional) – return BAM splits for later reuse, defaults to False
- Returns
(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing each split BAM path and the number of reads in that BAM.
- Return type
(str, str, str) or (str, str, str, list)
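Because the length of the returned tuple depends on return_splits, calling code may want to normalize it before use. The following is a minimal sketch of one way to do that. The helper name unpack_parse_result and the example paths are hypothetical, and the tuples are hand-written stand-ins rather than real parse_all_reads output.

```python
def unpack_parse_result(result):
    """Normalize a 3- or 4-tuple into (conversions, alignments, index, splits).

    splits is None when the result carries no BAM splits.
    """
    if len(result) == 4:
        conversions, alignments, index, splits = result
    else:
        conversions, alignments, index = result
        splits = None
    return conversions, alignments, index, splits


# Stand-in values shaped like the two documented return types.
without_splits = ("conversions.csv", "alignments.csv", "conversions.idx")
with_splits = without_splits + ([("part0.bam", 1200), ("part1.bam", 1100)],)

_, _, _, no_splits = unpack_parse_result(without_splits)
_, _, _, splits = unpack_parse_result(with_splits)
# Each split is a (path, number of reads) pair, so totals are easy to compute.
total_reads = sum(n_reads for _, n_reads in splits)
```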