`dynast.estimation.p_c`

Module Contents

Functions

`read_p_c`(p_c_path, group_by=None)	Read p_c CSV as a dictionary, with group_by columns as keys.
`binomial_pmf`(k, n, p)	Binomial PMF function, compiled with Numba for faster calculation.
`expectation_maximization_nasc`(values, p_e, threshold=0.01)	NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.
`expectation_maximization`(values, p_e, p_c=0.1, threshold=0.01, max_iters=300)	Run EM algorithm to estimate average conversion rate in labeled RNA.
`estimate_p_c`(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False)	Estimate the average conversion rate in labeled RNA.

dynast.estimation.p_c.read_p_c(p_c_path, group_by=None)

Read p_c CSV as a dictionary, with group_by columns as keys.

Parameters

p_c_path (str) – path to CSV containing p_c values
group_by (list, optional) – columns to group by, defaults to None

Returns

dictionary with group_by columns as keys (tuple if multiple)

Return type

dictionary
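
As a rough illustration of the returned structure, the sketch below builds such a dictionary with pandas from an in-memory CSV. It is not dynast's implementation: the helper `p_c_dict_sketch`, the column names `barcode` and `gene`, and the assumption that the estimates live in a column named `p_c` are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the column names are assumptions for illustration.
CSV_TEXT = """barcode,gene,p_c
AAAC,GeneA,0.05
AAAC,GeneB,0.07
TTTG,GeneA,0.04
"""


def p_c_dict_sketch(source, group_by):
    """Map each combination of group_by values to its p_c value."""
    df = pd.read_csv(source)
    # Indexing by several columns yields tuple keys in the resulting dict.
    return df.set_index(group_by)["p_c"].to_dict()


by_pair = p_c_dict_sketch(io.StringIO(CSV_TEXT), ["barcode", "gene"])
print(by_pair[("AAAC", "GeneB")])
```

With two group_by columns, each key is a (barcode, gene) tuple, matching the "tuple if multiple" note above.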

dynast.estimation.p_c.binomial_pmf(k, n, p)

Binomial PMF function, compiled with Numba for faster calculation.

Parameters

k (int) – number of successes
n (int) – number of trials
p (float) – probability of success

Returns

probability of observing k successes in n trials with probability of success p

Return type

float
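
For reference, the quantity described is the standard binomial probability mass function, P(k; n, p) = C(n, k) p^k (1 - p)^(n - k). The pure-Python sketch below computes it with `math.comb`. The name `binomial_pmf_sketch` is hypothetical, and the sketch makes no attempt to reproduce the Numba-based implementation.

```python
from math import comb


def binomial_pmf_sketch(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two successes in four trials with p = 0.5.
print(binomial_pmf_sketch(2, 4, 0.5))  # 0.375
```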

dynast.estimation.p_c.expectation_maximization_nasc(values, p_e, threshold=0.01)

NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.

Parameters

values (numpy.ndarray) – array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where
    row: number of conversions
    column: nucleotide content
    value: number of reads
p_e (float) – background mutation rate of unlabeled RNA
threshold (float, optional) – filter threshold, defaults to 0.01

Returns

estimated conversion rate

Return type

float
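
The `values` layout may be easier to read once expanded. The sketch below uses made-up counts to turn a (row, column, value) array into a dense matrix whose entry [k, n] is the number of reads with k conversions and nucleotide content n. The variable names are illustrative only.

```python
import numpy as np

# Made-up example: each row is (conversions, nucleotide content, read count).
values = np.array([
    [0, 3, 120],
    [1, 3, 15],
    [2, 4, 4],
])

# Zero-indexed, so the dense shape is one more than the largest index.
n_rows = values[:, 0].max() + 1
n_cols = values[:, 1].max() + 1

dense = np.zeros((n_rows, n_cols), dtype=values.dtype)
dense[values[:, 0], values[:, 1]] = values[:, 2]

print(dense.shape)
```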

dynast.estimation.p_c.expectation_maximization(values, p_e, p_c=0.1, threshold=0.01, max_iters=300)

Run EM algorithm to estimate average conversion rate in labeled RNA.

This function runs the following two steps.

1) Constructs a sparse matrix representation of values and filters out indices that are expected to contain more than threshold proportion of unlabeled reads.

2) Runs an EM algorithm that iteratively updates the filtered-out data and the conversion rate estimate.

See https://doi.org/10.1093/bioinformatics/bty256.

Parameters

values (numpy.ndarray) – array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where
    row: number of conversions
    column: nucleotide content
    value: number of reads
p_e (float) – background mutation rate of unlabeled RNA
p_c (float, optional) – initial p_c value, defaults to 0.1
threshold (float, optional) – filter threshold, defaults to 0.01
max_iters (int, optional) – maximum number of EM iterations, defaults to 300

Returns

estimated conversion rate

Return type

float
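
The two steps can be sketched as follows. This is a simplified reading of the description above, not dynast's implementation: the function names, the exact masking criterion, and the `tol` convergence parameter are assumptions, and the example counts are made up.

```python
from math import comb

import numpy as np


def pmf(k, n, p):
    """Binomial PMF; zero when k is outside 0..n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k) if 0 <= k <= n else 0.0


def em_sketch(values, p_e, p_c=0.1, threshold=0.01, max_iters=300, tol=1e-8):
    # Dense matrix: counts[k, n] = reads with k conversions and content n.
    n_rows = int(values[:, 0].max()) + 1
    n_cols = int(values[:, 1].max()) + 1
    counts = np.zeros((n_rows, n_cols))
    counts[values[:, 0], values[:, 1]] = values[:, 2]

    # Step 1 (assumed criterion): mask cells where the reads expected from
    # unlabeled RNA exceed `threshold` times the observed reads.
    masked = np.zeros((n_rows, n_cols), dtype=bool)
    for n in range(n_cols):
        total = counts[:, n].sum()
        for k in range(n_rows):
            masked[k, n] = total * pmf(k, n, p_e) > threshold * counts[k, n]

    ks = np.arange(n_rows)[:, None]
    ns = np.arange(n_cols)[None, :]
    for _ in range(max_iters):
        # E step: impute masked cells from the current p_c, scaled so the
        # unmasked cells of each column keep their observed total.
        filled = counts.copy()
        for n in range(n_cols):
            probs = np.array([pmf(k, n, p_c) for k in range(n_rows)])
            keep = ~masked[:, n]
            kept_mass = probs[keep].sum()
            scale = counts[keep, n].sum() / kept_mass if kept_mass > 0 else 0.0
            filled[~keep, n] = scale * probs[~keep]
        # M step: conversions (observed or imputed) per convertible position.
        positions = (ns * filled).sum()
        if positions == 0:
            break
        new_p_c = (ks * filled).sum() / positions
        converged = abs(new_p_c - p_c) < tol
        p_c = new_p_c
        if converged:
            break
    return p_c


# Made-up counts: (conversions, nucleotide content, reads).
values = np.array([
    [0, 20, 8559], [1, 20, 912], [2, 20, 379], [3, 20, 119],
    [4, 20, 27], [5, 20, 4], [6, 20, 1],
])
estimate = em_sketch(values, p_e=0.001)
print(round(float(estimate), 3))
```

In this example the cells with zero or one conversion are dominated by background and get masked, so the estimate is driven by the cells with two or more conversions.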

dynast.estimation.p_c.estimate_p_c(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False)

Estimate the average conversion rate in labeled RNA.

Parameters

df_aggregates (pandas.DataFrame) – Pandas dataframe containing aggregate values
p_e (float) – background mutation rate of unlabeled RNA
p_c_path (str) – path to output CSV containing p_c estimates
group_by (list, optional) – columns to group by, defaults to None
threshold (int, optional) – read count threshold, defaults to 1000
n_threads (int, optional) – number of threads, defaults to 8
nasc (bool, optional) – flag to indicate whether to use NASC-seq pipeline variant of the EM algorithm, defaults to False

Returns

path to output CSV containing p_c estimates

Return type

str
