`dynast.preprocessing.bam`

Module Contents

Functions

`read_alignments`(alignments_path, *args, **kwargs)	Read alignments CSV as a pandas DataFrame.
`read_conversions`(conversions_path, *args, **kwargs)	Read conversions CSV as a pandas DataFrame.
`select_alignments`(df_alignments)	Select alignments among duplicates.
`parse_read_contig`(counter, lock, bam_path, contig, gene_infos=None, transcript_infos=None, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, update_every=2000, nasc=False, velocity=True, strict_exon_overlap=False)	Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.
`get_tags_from_bam`(bam_path, n_reads=100000, n_threads=8)	Utility function to retrieve all read tags present in a BAM.
`check_bam_tags_exist`(bam_path, tags, n_reads=100000, n_threads=8)	Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
`check_bam_is_paired`(bam_path, n_reads=100000, n_threads=8)	Utility function to check if BAM has paired reads.
`check_bam_contains_secondary`(bam_path, n_reads=100000, n_threads=8)
`check_bam_contains_unmapped`(bam_path)
`check_bam_contains_duplicate`(bam_path, n_reads=100000, n_threads=8)
`sort_and_index_bam`(bam_path, out_path, n_threads=8, temp_dir=None)	Sort and index BAM.
`split_bam`(bam_path, n, n_threads=8, temp_dir=None)	Split BAM into n parts.
`parse_all_reads`(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)	Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Attributes

`CONVERSION_CSV_COLUMNS`
`ALIGNMENT_COLUMNS`

dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']
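As a rough illustration of how these column lists describe the CSV layout, the sketch below writes one conversions row and reads it back with Python's standard csv module. The column names are copied from the attribute above. The row values are invented for illustration, and the presence of a header row is an assumption rather than something this page states.

```python
import csv
import io

# Column names copied from CONVERSION_CSV_COLUMNS as listed above.
CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

# A made-up row, purely for illustration (not real dynast output).
example_row = {
    'read_id': 'read_1',
    'index': 0,
    'contig': 'chr1',
    'genome_i': 12345,
    'conversion': 'TC',
    'quality': 30,
}

# Write the row to an in-memory CSV (assumes a header row is present).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=CONVERSION_CSV_COLUMNS)
writer.writeheader()
writer.writerow(example_row)

# Read it back; csv returns every field as a string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
```

Note that the csv module returns every value as a string, whereas `read_conversions` returns a pandas DataFrame.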

dynast.preprocessing.bam.read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters: alignments_path (str) – path to alignments CSV
Returns: alignments dataframe
Return type: pandas.DataFrame

dynast.preprocessing.bam.read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters: conversions_path (str) – path to conversions CSV
Returns: conversions dataframe
Return type: pandas.DataFrame

dynast.preprocessing.bam.select_alignments(df_alignments)

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of (read_id, alignment index) tuples to use for coverage calculation and SNP detection.

Parameters: df_alignments (pandas.DataFrame) – alignments dataframe
Returns: set of (read_id, alignment index) that were selected
Return type: set
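This page does not state the rule used to choose among duplicates, so the following is only a sketch of what such a selection could look like. It assumes, hypothetically, that alignments sharing the same (barcode, umi, GX) are duplicates and that the one to keep is a transcriptome alignment if available, then the one with the highest score. The function name `select_alignments_sketch` and the use of plain dictionaries instead of a DataFrame are also illustrative choices.

```python
def select_alignments_sketch(alignments):
    """Pick one alignment per (barcode, umi, GX) group.

    `alignments` is a list of dicts keyed by ALIGNMENT_COLUMNS names.
    The grouping key and ranking rule are assumptions for illustration,
    not the documented behaviour of dynast.
    """
    best = {}
    for row in alignments:
        key = (row['barcode'], row['umi'], row['GX'])
        # Prefer transcriptome alignments, then higher scores.
        rank = (row['transcriptome'], row['score'])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, (row['read_id'], row['index']))
    # Return a set of (read_id, alignment index), matching the documented type.
    return {selected for _, selected in best.values()}


example = [
    {'read_id': 'r1', 'index': 0, 'barcode': 'AAA', 'umi': 'U1', 'GX': 'G1',
     'transcriptome': False, 'score': 10},
    {'read_id': 'r2', 'index': 0, 'barcode': 'AAA', 'umi': 'U1', 'GX': 'G1',
     'transcriptome': True, 'score': 5},
    {'read_id': 'r3', 'index': 1, 'barcode': 'AAA', 'umi': 'U2', 'GX': 'G1',
     'transcriptome': True, 'score': 7},
]
selected = select_alignments_sketch(example)
```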

dynast.preprocessing.bam.parse_read_contig(counter, lock, bam_path, contig, gene_infos=None, transcript_infos=None, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, update_every=2000, nasc=False, velocity=True, strict_exon_overlap=False)

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.

Parameters

counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed
lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time
bam_path (str) – path to alignment BAM file
contig (str) – only reads that map to this contig will be processed
gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None
transcript_infos (dictionary) – dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, None (unstranded)
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
temp_dir (str, optional) – path to temporary directory, defaults to None
update_every (int, optional) – update the counter every this many reads, defaults to 2000
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False

Returns

(path to conversions, path to conversions index, path to alignments)

Return type

(str, str, str)
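The `counter`, `lock` and `update_every` parameters describe a common progress-reporting pattern: each worker process counts locally and adds to a shared counter only once per batch, holding the lock while it does so. The sketch below shows that pattern in isolation. The names `count_items` and `run_demo` are hypothetical, the per-read work is omitted, and this is not dynast's implementation.

```python
import multiprocessing


def count_items(counter, lock, items, update_every=2000):
    """Iterate over items, adding to the shared counter in batches."""
    pending = 0
    for _ in items:
        # Per-read parsing would happen here.
        pending += 1
        if pending == update_every:
            # Take the lock only once per batch to limit contention.
            with lock:
                counter.value += pending
            pending = 0
    # Flush the remainder that did not fill a whole batch.
    if pending:
        with lock:
            counter.value += pending


def run_demo(n_workers=2, n_items=5000):
    """Start a few worker processes that share one counter and one lock."""
    counter = multiprocessing.Value('I', 0, lock=False)
    lock = multiprocessing.Lock()
    workers = [
        multiprocessing.Process(target=count_items, args=(counter, lock, range(n_items)))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


if __name__ == '__main__':
    print(run_demo())
```

Updating in batches rather than on every read keeps lock contention low when many contigs are parsed in parallel. The cost is that the reported count can lag the true count by up to `update_every` reads per worker.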

dynast.preprocessing.bam.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

Parameters

bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8

Returns

set of all tags found

Return type

set

dynast.preprocessing.bam.check_bam_tags_exist(bam_path, tags, n_reads=100000, n_threads=8)

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

Parameters

bam_path (str) – path to BAM
tags (list) – tags to check for
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8

Returns

(whether all tags were found, list of not found tags)

Return type

(bool, list)
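The (bool, list) return value can be pictured as a membership test against the set of tags observed in the BAM. The sketch below assumes the check is equivalent to comparing the requested tags against a set like the one `get_tags_from_bam` returns. The helper name `check_tags_exist_sketch` and the example tags are illustrative only.

```python
def check_tags_exist_sketch(found_tags, tags):
    """Return (whether all tags were found, list of tags not found).

    `found_tags` stands in for the set returned by get_tags_from_bam.
    """
    not_found = [tag for tag in tags if tag not in found_tags]
    return len(not_found) == 0, not_found


# Example tag names; which tags a real BAM carries depends on the aligner.
found = {'GX', 'CB', 'UB'}
all_found, missing = check_tags_exist_sketch(found, ['CB', 'UB', 'MD'])
```

Returning the missing tags alongside the boolean lets a caller report exactly which tags are absent instead of failing with a generic message.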

dynast.preprocessing.bam.check_bam_is_paired(bam_path, n_reads=100000, n_threads=8)

Utility function to check if BAM has paired reads.

Parameters

bam_path (str) – path to BAM
n_reads (int, optional) – number of reads to consider, defaults to 100000
n_threads (int, optional) – number of threads, defaults to 8

Returns

whether paired reads were detected

Return type

bool

dynast.preprocessing.bam.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)

dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path)

dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)
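The three functions above carry no description on this page. Judging by their names, they test whether a BAM contains secondary, unmapped, or duplicate-marked reads. In the SAM format these properties are encoded as bits of each read's FLAG field, so one plausible way to perform such checks is sketched below. The helper `any_flag_set` is hypothetical and operates on plain integers. How dynast actually implements these checks is not shown here.

```python
# Bit values defined by the SAM specification for the FLAG field.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400


def any_flag_set(flags, mask, n_reads=100000):
    """Return True if any of the first n_reads flags has a bit in mask set."""
    for i, flag in enumerate(flags):
        if i >= n_reads:
            break
        if flag & mask:
            return True
    return False


# Made-up FLAG values: mapped forward, mapped reverse, secondary, duplicate.
example_flags = [0, 16, 256, 1024]
has_secondary = any_flag_set(example_flags, FLAG_SECONDARY)
has_paired = any_flag_set(example_flags, FLAG_PAIRED)
```

Limiting the scan to the first `n_reads` reads keeps the check fast on large files, at the cost of missing flagged reads that appear later.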

dynast.preprocessing.bam.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters

bam_path (str) – path to alignment BAM file to sort
out_path (str) – path to output sorted BAM
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to sorted and indexed BAM

Return type

str

dynast.preprocessing.bam.split_bam(bam_path, n, n_threads=8, temp_dir=None)

Split BAM into n parts.

Parameters

bam_path (str) – path to alignment BAM file
n (int) – number of splits
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

List of tuples containing (split BAM path, number of reads)

Return type

list

dynast.preprocessing.bam.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters

bam_path (str) – path to alignment BAM file
conversions_path (str) – path to output information about reads that have conversions
alignments_path (str) – path to output alignment information about reads
index_path (str) – path to conversions index
no_index_path (str) – path to no conversions index
gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None
barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None
gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX
barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None
n_threads (int, optional) – number of threads, defaults to 8
temp_dir (str, optional) – path to temporary directory, defaults to None
nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False
velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True
strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False
return_splits (bool, optional) – return BAM splits for later reuse, defaults to False

Returns

(path to conversions, path to alignments, path to conversions index) If return_splits is True, then there is an additional return value, which is a list of tuples containing split BAM paths and number of reads in each BAM.

Return type

(str, str, str) or (str, str, str, list)
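Because the return value has three or four elements depending on `return_splits`, calling code may find it convenient to normalize it before use. The helper below is a hypothetical sketch of that handling, and the file names in the example are placeholders rather than paths dynast produces.

```python
def unpack_parse_result(result):
    """Normalize the 3- or 4-element return value into a 4-tuple.

    The last element is None when no splits were returned.
    """
    if len(result) == 4:
        conversions_path, alignments_path, index_path, splits = result
    else:
        conversions_path, alignments_path, index_path = result
        splits = None
    return conversions_path, alignments_path, index_path, splits


# Placeholder values standing in for a call made with return_splits=True.
result = ('conversions.csv', 'alignments.csv', 'conversions.idx',
          [('split_0.bam', 1200), ('split_1.bam', 1180)])
conversions_path, alignments_path, index_path, splits = unpack_parse_result(result)

# Each split is a (split BAM path, number of reads) tuple.
total_reads = sum(n_reads for _, n_reads in splits) if splits else 0
```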
