dynast.preprocessing.bam

Module Contents

Functions

read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

select_alignments(df_alignments)

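Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.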

parse_read_contig(counter, lock, bam_path, contig, gene_infos=None, transcript_infos=None, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, update_every=2000, nasc=False, velocity=True, strict_exon_overlap=False)

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.

get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

check_bam_tags_exist(bam_path, tags, n_reads=100000, n_threads=8)

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

check_bam_is_paired(bam_path, n_reads=100000, n_threads=8)

Utility function to check if BAM has paired reads.

check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)

check_bam_contains_unmapped(bam_path)

check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)

sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

split_bam(bam_path, n, n_threads=8, temp_dir=None)

Split BAM into n parts.

parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Attributes

CONVERSION_CSV_COLUMNS

ALIGNMENT_COLUMNS

dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']
dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']
dynast.preprocessing.bam.read_alignments(alignments_path, *args, **kwargs)

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

alignments_path (str) – path to alignments CSV

Returns

alignments dataframe

Return type

pandas.DataFrame

dynast.preprocessing.bam.read_conversions(conversions_path, *args, **kwargs)

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

conversions_path (str) – path to conversions CSV

Returns

conversions dataframe

Return type

pandas.DataFrame
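
The pass-through of extra arguments described for read_alignments and read_conversions can be sketched as below. This is a minimal illustration, not dynast's implementation: read_table is a hypothetical name, and the standard-library csv.DictReader stands in for pandas.read_csv so that the snippet has no third-party dependency. The tab delimiter and the row values are made up; only the column names come from CONVERSION_CSV_COLUMNS.

```python
import csv
import io


def read_table(handle, *args, **kwargs):
    """Hypothetical wrapper: forward every extra argument to the underlying reader."""
    # csv.DictReader stands in for pandas.read_csv in this sketch.
    return list(csv.DictReader(handle, *args, **kwargs))


# Header uses the names in CONVERSION_CSV_COLUMNS; the data row is invented.
text = (
    "read_id\tindex\tcontig\tgenome_i\tconversion\tquality\n"
    "r1\t0\tchr1\t100\tTC\t37\n"
)

# The caller-supplied keyword (delimiter) reaches the reader unchanged.
rows = read_table(io.StringIO(text), delimiter="\t")
print(rows[0]["conversion"])
```

With the real functions, the same idea would let a caller pass options such as usecols or nrows through to pandas.read_csv.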

dynast.preprocessing.bam.select_alignments(df_alignments)

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.

Parameters

df_alignments (pandas.DataFrame) – alignments dataframe

Returns

set of (read_id, alignment index) that were selected

Return type

set
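
One way to picture selection among duplicates is sketched below. The grouping key (barcode, umi, GX) and the highest-score rule are assumptions made purely for illustration; the documentation above only states that a set of (read_id, alignment index) pairs is returned. select_best is a hypothetical helper that works on plain dictionaries rather than a DataFrame, and the rows are invented.

```python
def select_best(alignments):
    """Keep one alignment per (barcode, umi, GX) group: the highest-scoring one.

    Assumption: both the grouping key and the scoring rule are illustrative only.
    """
    best = {}
    for row in alignments:
        key = (row["barcode"], row["umi"], row["GX"])
        # Replace the stored row only when a strictly better score appears.
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return {(row["read_id"], row["index"]) for row in best.values()}


alignments = [
    {"read_id": "r1", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "g1", "score": 10},
    {"read_id": "r2", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "g1", "score": 30},
    {"read_id": "r3", "index": 1, "barcode": "TTTG", "umi": "CCA", "GX": "g2", "score": 20},
]
selected = select_best(alignments)
```

Returning a set makes membership tests such as `(read_id, index) in selected` cheap in later steps.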

dynast.preprocessing.bam.parse_read_contig(counter, lock, bam_path, contig, gene_infos=None, transcript_infos=None, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, temp_dir=None, update_every=2000, nasc=False, velocity=True, strict_exon_overlap=False)

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.

Parameters
  • counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed

  • lock (multiprocessing.Lock) – semaphore for the counter so that multiple processes do not modify it at the same time

  • bam_path (str) – path to alignment BAM file

  • contig (str) – only reads that map to this contig will be processed

  • gene_infos (dictionary) – dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None

  • transcript_infos (dictionary) – dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True, defaults to None

  • strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, None (unstranded)

  • umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None

  • barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None

  • gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • update_every (int, optional) – update the counter every this many reads, defaults to 2000

  • nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False

  • velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True

  • strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False

Returns

(path to conversions, path to conversions index, path to alignments)

Return type

(str, str, str)
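
The interplay of counter, lock and update_every can be sketched as follows. This is a minimal illustration under assumptions, not the library's code: count_reads is a hypothetical worker loop, and a range stands in for the reads of a contig. It is called in-process here; in practice each worker would run in its own process and share the same counter and lock.

```python
import multiprocessing


def count_reads(counter, lock, reads, update_every=2000):
    """Hypothetical worker loop: flush a local tally to the shared counter in batches."""
    pending = 0
    for _ in reads:
        pending += 1
        if pending == update_every:
            with lock:  # only one process may modify the counter at a time
                counter.value += pending
            pending = 0
    # Flush whatever is left after the last full batch.
    if pending:
        with lock:
            counter.value += pending


# lock=False gives a raw shared value; the separate Lock guards every update.
counter = multiprocessing.Value("I", 0, lock=False)
lock = multiprocessing.Lock()
count_reads(counter, lock, range(4500), update_every=2000)
print(counter.value)
```

Batching the updates keeps lock contention low: the lock is taken once per update_every reads instead of once per read.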

dynast.preprocessing.bam.get_tags_from_bam(bam_path, n_reads=100000, n_threads=8)

Utility function to retrieve all read tags present in a BAM.

Parameters
  • bam_path (str) – path to BAM

  • n_reads (int, optional) – number of reads to consider, defaults to 100000

  • n_threads (int, optional) – number of threads, defaults to 8

Returns

set of all tags found

Return type

set

dynast.preprocessing.bam.check_bam_tags_exist(bam_path, tags, n_reads=100000, n_threads=8)

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

Parameters
  • bam_path (str) – path to BAM

  • tags (list) – tags to check for

  • n_reads (int, optional) – number of reads to consider, defaults to 100000

  • n_threads (int, optional) – number of threads, defaults to 8

Returns

(whether all tags were found, list of not found tags)

Return type

(bool, list)
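
The documented return shape, a flag plus the list of tags that were not found, can be sketched without reading a BAM. tags_exist is a hypothetical helper, and the set of found tags is invented; it plays the role of a set like the one get_tags_from_bam returns.

```python
def tags_exist(found_tags, tags):
    """Return (all tags found?, list of missing tags), mirroring the documented shape."""
    missing = [tag for tag in tags if tag not in found_tags]
    return not missing, missing


# Assumed example: tags observed in the first n_reads reads of some BAM.
found = {"CB", "UB", "GX", "NH"}
ok, missing = tags_exist(found, ["CB", "UB", "GN"])
print(ok, missing)
```

Because only the first n_reads reads are inspected, a tag reported as missing may still occur later in the file.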

dynast.preprocessing.bam.check_bam_is_paired(bam_path, n_reads=100000, n_threads=8)

Utility function to check if BAM has paired reads.

Parameters
  • bam_path (str) – path to BAM

  • n_reads (int, optional) – number of reads to consider, defaults to 100000

  • n_threads (int, optional) – number of threads, defaults to 8

Returns

whether paired reads were detected

Return type

bool
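
In the SAM format, bit 0x1 of a read's FLAG field marks a template with multiple segments, i.e. a paired read. The sketch below shows how a check limited to the first n_reads reads could use that bit. any_paired is a hypothetical helper that takes integer flags directly; whether the library tests the flag in exactly this way is an assumption.

```python
from itertools import islice

PAIRED = 0x1  # SAM FLAG bit: template has multiple segments (paired-end)


def any_paired(flags, n_reads=100000):
    """Hypothetical check: look at no more than n_reads flags, report if any is paired."""
    return any(flag & PAIRED for flag in islice(flags, n_reads))


# 0 and 16 are single-end flags; 99 (0x63) has the paired bit set.
flags = [0, 16, 99]
print(any_paired(flags))
print(any_paired(flags, n_reads=2))
```

The second call illustrates the sampling caveat: a paired read beyond the first n_reads reads goes unnoticed.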

dynast.preprocessing.bam.check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=8)
dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path)
dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8)
dynast.preprocessing.bam.sort_and_index_bam(bam_path, out_path, n_threads=8, temp_dir=None)

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters
  • bam_path (str) – path to alignment BAM file to sort

  • out_path (str) – path to output sorted BAM

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

path to sorted and indexed BAM

Return type

str
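
A BAM header can declare its sort order in the SO field of the @HD line (for example SO:coordinate). The sketch below shows one way an "already sorted" check could be expressed on header text. is_coordinate_sorted is a hypothetical helper, and it is an assumption that the skip described above relies on this header field.

```python
def is_coordinate_sorted(header_text):
    """Hypothetical helper: True if the @HD header line declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            fields = line.split("\t")[1:]
            return "SO:coordinate" in fields
    # No @HD line: the sort order is not declared.
    return False


sorted_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
unsorted_header = "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000"
print(is_coordinate_sorted(sorted_header))
print(is_coordinate_sorted(unsorted_header))
```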

dynast.preprocessing.bam.split_bam(bam_path, n, n_threads=8, temp_dir=None)

Split BAM into n parts.

Parameters
  • bam_path (str) – path to alignment BAM file

  • n (int) – number of splits

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

Returns

List of tuples containing (split BAM path, number of reads)

Return type

list
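
The returned list of (split BAM path, number of reads) tuples lends itself to simple bookkeeping, for instance computing a total for progress reporting. The paths and counts below are invented for illustration and do not reflect how the library names its split files.

```python
# Made-up example of the documented return shape: (split BAM path, number of reads).
splits = [
    ("tmp/split_0.bam", 1200),
    ("tmp/split_1.bam", 1180),
    ("tmp/split_2.bam", 1210),
]

# Total number of reads across all splits, e.g. as the target of a progress counter.
total_reads = sum(n_reads for _, n_reads in splits)

# The split with the most reads is the one likely to finish last.
largest_path, largest_count = max(splits, key=lambda split: split[1])
print(total_reads, largest_path)
```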

dynast.preprocessing.bam.parse_all_reads(bam_path, conversions_path, alignments_path, index_path, gene_infos, transcript_infos, strand='forward', umi_tag=None, barcode_tag=None, gene_tag='GX', barcodes=None, n_threads=8, temp_dir=None, nasc=False, control=False, velocity=True, strict_exon_overlap=False, return_splits=False)

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters
  • bam_path (str) – path to alignment BAM file

  • conversions_path (str) – path to output information about reads that have conversions

  • alignments_path (str) – path to alignments information about reads

  • index_path (str) – path to conversions index

  • no_index_path (str) – path to no conversions index

  • gene_infos (dictionary) – dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • transcript_infos (dictionary) – dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

  • strand (str, optional) – strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded

  • umi_tag (str, optional) – BAM tag that encodes UMI, if not provided, NA is output in the umi column, defaults to None

  • barcode_tag (str, optional) – BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column, defaults to None

  • gene_tag (str, optional) – BAM tag that encodes gene assignment, defaults to GX

  • barcodes (list, optional) – list of barcodes to be considered. All barcodes are considered if not provided, defaults to None

  • n_threads (int, optional) – number of threads, defaults to 8

  • temp_dir (str, optional) – path to temporary directory, defaults to None

  • nasc (bool, optional) – flag to change behavior to match NASC-seq pipeline, defaults to False

  • velocity (bool, optional) – whether or not to assign a velocity type to each read, defaults to True

  • strict_exon_overlap (bool, optional) – whether to use a stricter algorithm to assign reads as spliced, defaults to False

  • return_splits (bool, optional) – return BAM splits for later reuse, defaults to False

Returns

(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing each split BAM path and the number of reads in that BAM.

Return type

(str, str, str) or (str, str, str, list)
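
Because the length of the return value depends on return_splits, calling code may want to normalize it. The sketch below is one way to do so. unpack_result is a hypothetical helper, and the paths are invented.

```python
def unpack_result(result):
    """Hypothetical helper: normalize the 3- or 4-element return value.

    Always returns (conversions_path, alignments_path, index_path, splits),
    with splits set to None when no split list was returned.
    """
    conversions_path, alignments_path, index_path, *rest = result
    splits = rest[0] if rest else None
    return conversions_path, alignments_path, index_path, splits


# Made-up return values for the two documented shapes.
without_splits = ("conversions.csv", "alignments.csv", "conversions.idx")
with_splits = without_splits + ([("split_0.bam", 1200)],)

print(unpack_result(without_splits))
print(unpack_result(with_splits))
```

Note that the order here is (conversions, alignments, index), whereas parse_read_contig is documented as returning (conversions, index, alignments).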