dynast.estimation.p_c

Module Contents

Functions

read_p_c(p_c_path, group_by=None)

Read p_c CSV as a dictionary, with group_by columns as keys.

binomial_pmf(k, n, p)

Numba-compiled binomial PMF function for faster calculation.

expectation_maximization_nasc(values, p_e, threshold=0.01)

NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.

expectation_maximization(values, p_e, p_c=0.1, threshold=0.01, max_iters=300)

Run EM algorithm to estimate average conversion rate in labeled RNA.

estimate_p_c(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False)

Estimate the average conversion rate in labeled RNA.

dynast.estimation.p_c.read_p_c(p_c_path, group_by=None)

Read p_c CSV as a dictionary, with group_by columns as keys.

Parameters
  • p_c_path (str) – path to CSV containing p_c values

  • group_by (list, optional) – columns to group by, defaults to None

Returns

dictionary with group_by columns as keys (tuple if multiple)

Return type

dictionary
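To illustrate the return shape described above, here is a minimal standard-library sketch. It is not the dynast implementation: the helper name read_grouped_values, the value column name p_c, and the CSV contents are all assumptions made for the example, and it covers only the case where group_by is given.

```python
import csv
import os
import tempfile


def read_grouped_values(path, group_by, value_column="p_c"):
    """Hypothetical stand-in: map each group to its value.

    Keys are plain strings for a single group_by column and tuples
    when several columns are given.
    """
    result = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if len(group_by) == 1:
                key = row[group_by[0]]
            else:
                key = tuple(row[column] for column in group_by)
            result[key] = float(row[value_column])
    return result


# Made-up CSV contents, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("barcode,gene,p_c\nAAAC,GeneA,0.04\nAAAC,GeneB,0.05\n")
    demo_path = tmp.name

by_pair = read_grouped_values(demo_path, ["barcode", "gene"])  # tuple keys
by_gene = read_grouped_values(demo_path, ["gene"])  # string keys
os.remove(demo_path)
```

With two group_by columns each key is a (barcode, gene) tuple; with one column the key is the bare column value.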

dynast.estimation.p_c.binomial_pmf(k, n, p)

Numba-compiled binomial PMF function for faster calculation.

Parameters
  • k (int) – number of successes

  • n (int) – number of trials

  • p (float) – probability of success

Returns

probability of observing k successes in n trials with probability of success p

Return type

float
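For reference, the returned quantity is the standard binomial probability mass function, C(n, k) * p^k * (1 - p)^(n - k). The pure-Python sketch below computes the same value without Numba. The name binomial_pmf_reference is hypothetical and only serves this example; returning 0.0 for k outside 0..n is a choice made here, not documented behavior.

```python
import math


def binomial_pmf_reference(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0  # impossible outcome (assumed convention for this sketch)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Two fair-coin heads in four tosses: 6 * 0.5**4
two_of_four = binomial_pmf_reference(2, 4, 0.5)

# The PMF sums to 1 over all possible k.
total = sum(binomial_pmf_reference(k, 10, 0.3) for k in range(11))
```

A reference like this can be handy for spot-checking a compiled implementation on small inputs.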

dynast.estimation.p_c.expectation_maximization_nasc(values, p_e, threshold=0.01)

NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.

Parameters
  • values (numpy.ndarray) –

    array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where

    row: number of conversions
    column: nucleotide content
    value: number of reads

  • p_e (float) – background mutation rate of unlabeled RNA

  • threshold (float, optional) – filter threshold, defaults to 0.01

Returns

estimated conversion rate

Return type

float
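The (row, column, value) encoding of values can be pictured by expanding it into a dense table. The sketch below uses made-up counts, and a plain list of tuples stands in for the numpy.ndarray; triplets_to_dense is a hypothetical helper, not a dynast function.

```python
def triplets_to_dense(triplets):
    """Expand zero-indexed (row, column, value) triplets into a dense table."""
    n_rows = max(row for row, _, _ in triplets) + 1
    n_cols = max(col for _, col, _ in triplets) + 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in triplets:
        dense[row][col] = value
    return dense


# Each triplet: (number of conversions, nucleotide content, number of reads)
values = [
    (0, 2, 5),  # 5 reads with 0 conversions and nucleotide content 2
    (1, 2, 3),  # 3 reads with 1 conversion and nucleotide content 2
    (0, 3, 4),  # 4 reads with 0 conversions and nucleotide content 3
    (2, 3, 1),  # 1 read with 2 conversions and nucleotide content 3
]
dense = triplets_to_dense(values)
```

Entry dense[k][n] then holds the number of reads with k conversions and nucleotide content n; combinations that never occur stay at zero.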

dynast.estimation.p_c.expectation_maximization(values, p_e, p_c=0.1, threshold=0.01, max_iters=300)

Run EM algorithm to estimate average conversion rate in labeled RNA.

This function runs the following two steps.

  1. Constructs a sparse matrix representation of values and filters out indices that are expected to contain more than threshold proportion of unlabeled reads.

  2. Runs an EM algorithm that iteratively updates the filtered-out data and the estimation.

See https://doi.org/10.1093/bioinformatics/bty256.

Parameters
  • values (numpy.ndarray) –

    array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where

    row: number of conversions
    column: nucleotide content
    value: number of reads

  • p_e (float) – background mutation rate of unlabeled RNA

  • p_c (float, optional) – initial p_c value, defaults to 0.1

  • threshold (float, optional) – filter threshold; indices expected to contain more than this proportion of unlabeled reads are filtered out, defaults to 0.01

  • max_iters (int, optional) – maximum number of EM iterations, defaults to 300

Returns

estimated conversion rate

Return type

float
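The two steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the dynast implementation: the function name em_sketch, the dict input keyed by (conversions, nucleotide content), the tol stopping rule, and the exact filter rule (all reads with a given nucleotide content times the binomial PMF at p_e, used as an upper bound on unlabeled reads) are choices made for this example.

```python
import math


def binomial_pmf(k, n, p):
    """Plain-Python binomial PMF (stand-in for the compiled version)."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def em_sketch(counts, p_e, p_c=0.1, threshold=0.01, max_iters=300, tol=1e-8):
    """Estimate the conversion rate from counts[(k, n)] = number of reads."""
    # Total reads observed for each nucleotide content n.
    totals = {}
    for (k, n), count in counts.items():
        totals[n] = totals.get(n, 0.0) + count

    # Step 1 (assumed rule): keep a cell only if the unlabeled reads expected
    # there are at most `threshold` times the observed count.
    kept = {}
    for (k, n), count in counts.items():
        if k > n:
            continue  # impossible cell, ignore
        if totals[n] * binomial_pmf(k, n, p_e) <= threshold * count:
            kept[(k, n)] = count
    if not kept:
        raise ValueError("no cells left after filtering")

    kept_totals = {}
    for (k, n), count in kept.items():
        kept_totals[n] = kept_totals.get(n, 0.0) + count

    # Step 2: EM for a binomial with missing (filtered-out) cells.
    for _ in range(max_iters):
        conversions = 0.0
        nucleotides = 0.0
        for n, kept_total in kept_totals.items():
            kept_prob = sum(
                binomial_pmf(k, n, p_c) for k in range(n + 1) if (k, n) in kept
            )
            labeled_reads = kept_total / kept_prob  # implied labeled total
            for k in range(n + 1):
                if (k, n) in kept:
                    count = kept[(k, n)]
                else:
                    # E-step: impute the filtered-out cell under the current p_c.
                    count = labeled_reads * binomial_pmf(k, n, p_c)
                conversions += k * count
                nucleotides += n * count
        # M-step: conversions per nucleotide over observed plus imputed reads.
        new_p_c = conversions / nucleotides
        converged = abs(new_p_c - p_c) < tol
        p_c = new_p_c
        if converged:
            break
    return p_c


# Synthetic expected counts: 30% of reads labeled, true conversion rate 0.05.
true_p_c, p_e, labeled_fraction, reads_per_n = 0.05, 0.001, 0.3, 100_000
counts = {
    (k, n): reads_per_n
    * (
        (1 - labeled_fraction) * binomial_pmf(k, n, p_e)
        + labeled_fraction * binomial_pmf(k, n, true_p_c)
    )
    for n in (15, 20, 25)
    for k in range(n + 1)
}
estimate = em_sketch(counts, p_e)
```

On this synthetic input the low-conversion cells, where unlabeled reads dominate, are filtered out, and the estimate should land close to the true rate of 0.05. Imputing the filtered cells rather than dropping them is what keeps the estimate from being biased upward by the truncation.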

dynast.estimation.p_c.estimate_p_c(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False)

Estimate the average conversion rate in labeled RNA.

Parameters
  • df_aggregates (pandas.DataFrame) – Pandas dataframe containing aggregate values

  • p_e (float) – background mutation rate of unlabeled RNA

  • p_c_path (str) – path to output CSV containing p_c estimates

  • group_by (list, optional) – columns to group by, defaults to None

  • threshold (int, optional) – read count threshold, defaults to 1000

  • n_threads (int, optional) – number of threads, defaults to 8

  • nasc (bool, optional) – flag to indicate whether to use NASC-seq pipeline variant of the EM algorithm, defaults to False

Returns

path to output CSV containing p_c estimates

Return type

str