dynast.preprocessing.snp

Module Contents

Functions

read_snps(snps_path)

Read SNPs CSV as a dictionary

read_snp_csv(snp_csv)

Read a user-provided SNPs CSV

extract_conversions_part(conversions_path, counter, lock, index, alignments=None, conversions=None, quality=27, update_every=5000)

Extract number of conversions for every genomic position.

extract_conversions(conversions_path, index_path, alignments=None, conversions=None, quality=27, n_threads=8)

Wrapper around extract_conversions_part that works in parallel

detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)

Detect SNPs.

Attributes

SNP_COLUMNS

dynast.preprocessing.snp.SNP_COLUMNS = ['contig', 'genome_i', 'conversion']

dynast.preprocessing.snp.read_snps(snps_path)

Read SNPs CSV as a dictionary

Parameters

snps_path (str) – path to SNPs CSV

Returns

dictionary of contigs as keys and sets of genomic positions with SNPs as values

Return type

dictionary
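
To illustrate the return shape described above, here is a minimal sketch that builds a contig-to-positions dictionary from CSV text. It assumes the file has a header row naming the SNP_COLUMNS fields. The helper read_snps_sketch and the sample rows are hypothetical, and this is not dynast's own implementation.

```python
import csv
import io
from collections import defaultdict


def read_snps_sketch(handle):
    """Hypothetical helper: map each contig to the set of positions listed."""
    snps = defaultdict(set)
    for row in csv.DictReader(handle):
        snps[row['contig']].add(int(row['genome_i']))
    return dict(snps)


# Made-up sample data, assuming a header row with the SNP_COLUMNS names.
sample = io.StringIO(
    'contig,genome_i,conversion\n'
    'chr1,100,TC\n'
    'chr1,250,TC\n'
    'chr2,7,AG\n'
)
snps = read_snps_sketch(sample)
```

Storing positions in sets makes a membership check such as `100 in snps['chr1']` cheap, which is convenient when filtering conversions by position.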

dynast.preprocessing.snp.read_snp_csv(snp_csv)

Read a user-provided SNPs CSV

Parameters

snp_csv (str) – path to SNPs CSV

Returns

dictionary of contigs as keys and sets of genomic positions with SNPs as values

Return type

dictionary

dynast.preprocessing.snp.extract_conversions_part(conversions_path, counter, lock, index, alignments=None, conversions=None, quality=27, update_every=5000)

Extract number of conversions for every genomic position.

Parameters
  • conversions_path (str) – path to conversions CSV

  • counter (multiprocessing.Value) – counter that keeps track of how many reads have been processed

  • lock (multiprocessing.Lock) – lock that guards the counter so that multiple processes do not modify it at the same time

  • index (list) – list of (file position, number of lines) tuples to process

  • alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

  • conversions (set, optional) – set of conversions to consider

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • update_every (int, optional) – update the counter every this many reads, defaults to 5000

Returns

nested dictionary that contains number of conversions for each contig and position

Return type

dictionary
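
The index, counter, lock and update_every parameters suggest a common worker pattern: seek to each recorded file position, read the stated number of lines, and update a shared counter in batches under the lock. The sketch below illustrates that pattern only. The function read_part_sketch, the sample rows, and the assumption that each index tuple corresponds to one read are all hypothetical and do not come from dynast.

```python
import multiprocessing
import os
import tempfile


def read_part_sketch(path, counter, lock, index, update_every=2):
    """Hypothetical worker: read the lines named by each index tuple."""
    lines = []
    n_reads = 0
    with open(path, 'rb') as f:
        for pos, n_lines in index:
            f.seek(pos)  # jump straight to the recorded byte offset
            for _ in range(n_lines):
                lines.append(f.readline().rstrip(b'\n'))
            n_reads += 1  # assumption: one index tuple is one read
            if n_reads % update_every == 0:
                with lock:  # update the shared counter in batches
                    counter.value += update_every
    with lock:  # flush whatever is left over from the last batch
        counter.value += n_reads % update_every
    return lines


# Made-up rows; record the byte offset at which each one starts.
rows = [b'read1,chr1,100\n', b'read1,chr1,250\n', b'read2,chr2,7\n']
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    offsets = []
    for row in rows:
        offsets.append(tmp.tell())
        tmp.write(row)
    path = tmp.name

counter = multiprocessing.Value('I', 0)
lock = multiprocessing.Lock()
index = [(offsets[0], 2), (offsets[2], 1)]
lines = read_part_sketch(path, counter, lock, index)
os.remove(path)
```

Updating the counter only every update_every reads keeps lock contention low when several processes share it.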

dynast.preprocessing.snp.extract_conversions(conversions_path, index_path, alignments=None, conversions=None, quality=27, n_threads=8)

Wrapper around extract_conversions_part that works in parallel

Parameters
  • conversions_path (str) – path to conversions CSV

  • index_path (str) – path to conversions index

  • alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

  • conversions (set, optional) – set of conversions to consider

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • n_threads (int, optional) – number of threads, defaults to 8

Returns

nested dictionary that contains number of conversions for each contig and position

Return type

dictionary
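
A parallel wrapper of this kind typically splits the index into chunks, hands each chunk to a worker, and merges the per-worker results. The sketch below shows only the split and merge steps, assuming the nested dictionary has the shape {contig: {position: count}}. That shape and the helpers split_index and merge_counts are assumptions made for illustration, not dynast's code.

```python
from collections import Counter


def split_index(index, n_parts):
    """Hypothetical helper: cut index into at most n_parts contiguous chunks."""
    size = max(1, -(-len(index) // n_parts))  # ceiling division
    return [index[i:i + size] for i in range(0, len(index), size)]


def merge_counts(parts):
    """Hypothetical helper: sum nested {contig: {position: count}} dicts."""
    merged = {}
    for part in parts:
        for contig, positions in part.items():
            merged.setdefault(contig, Counter()).update(positions)
    return {contig: dict(counts) for contig, counts in merged.items()}


chunks = split_index(list(range(5)), 2)
merged = merge_counts([
    {'chr1': {100: 2}},
    {'chr1': {100: 1, 250: 4}, 'chr2': {7: 1}},
])
```

Because the merge is a plain sum per contig and position, the order in which workers finish does not affect the result.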

dynast.preprocessing.snp.detect_snps(conversions_path, index_path, coverage, snps_path, alignments=None, conversions=None, quality=27, threshold=0.5, min_coverage=1, n_threads=8)

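Detect SNPs by flagging genomic positions where conversions / coverage exceeds threshold, and write them to snps_path.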

Parameters
  • conversions_path (str) – path to conversions CSV

  • index_path (str) – path to conversions index

  • coverage (dict) – dictionary containing genomic coverage

  • snps_path (str) – path to output SNPs

  • alignments (set, optional) – set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

  • conversions (set, optional) – set of conversions to consider

  • quality (int, optional) – only count conversions with PHRED quality greater than this value, defaults to 27

  • threshold (float, optional) – positions with conversions / coverage > threshold will be considered as SNPs, defaults to 0.5

  • min_coverage (int, optional) – only positions with at least this many mapping reads are considered, defaults to 1

  • n_threads (int, optional) – number of threads, defaults to 8
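
The threshold and min_coverage rules described in the parameters above can be sketched as a small filter. The example assumes both conversions and coverage are nested dictionaries of the form {contig: {position: count}}. That shape, the function call_snps_sketch, and the sample numbers are hypothetical, and writing the result to snps_path is left out.

```python
def call_snps_sketch(conversions, coverage, threshold=0.5, min_coverage=1):
    """Hypothetical helper: keep positions with conversions / coverage > threshold."""
    snps = {}
    for contig, positions in conversions.items():
        for pos, count in positions.items():
            cov = coverage.get(contig, {}).get(pos, 0)
            # Skip uncovered or poorly covered positions before dividing.
            if cov <= 0 or cov < min_coverage:
                continue
            if count / cov > threshold:
                snps.setdefault(contig, set()).add(pos)
    return snps


# Made-up counts for illustration.
conversions = {'chr1': {100: 6, 250: 5}, 'chr2': {7: 1}}
coverage = {'chr1': {100: 10, 250: 10}, 'chr2': {7: 1}}

default_calls = call_snps_sketch(conversions, coverage)
strict_calls = call_snps_sketch(conversions, coverage, min_coverage=2)
```

Note that the comparison is strict: a position where exactly half of the reads carry a conversion (chr1:250 here) is not called at the default threshold of 0.5. Raising min_coverage drops chr2:7, which is supported by a single read.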