dynast.estimation.p_c
Module Contents
Functions
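read_p_c(p_c_path, group_by=None): Read p_c CSV as a dictionary, with group_by columns as keys.
binomial_pmf(k, n, p): Numbaized binomial PMF function for faster calculation.
expectation_maximization_nasc(values, p_e, threshold=0.01): NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.
expectation_maximization(values, p_e, p_c=0.1, threshold=0.01, max_iters=300): Run EM algorithm to estimate average conversion rate in labeled RNA.
estimate_p_c(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False): Estimate the average conversion rate in labeled RNA.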
- dynast.estimation.p_c.read_p_c(p_c_path, group_by=None)
Read p_c CSV as a dictionary, with group_by columns as keys.
- Parameters
p_c_path (str) – path to CSV containing p_c values
group_by (list, optional) – columns to group by, defaults to None
- Returns
dictionary with group_by columns as keys (tuple if multiple)
- Return type
dictionary
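The grouped return shape described above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the helper name `read_grouped_values`, the `barcode` column, and the assumption that the estimates live in a column named `p_c` are all hypothetical, and the sketch only covers the case where group_by is given.

```python
import csv
import io


def read_grouped_values(handle, group_by, value_column="p_c"):
    """Hypothetical helper: map group_by column values to a float column."""
    result = {}
    for row in csv.DictReader(handle):
        key = tuple(row[col] for col in group_by)
        # One grouping column gives a bare key; several give a tuple key.
        if len(key) == 1:
            key = key[0]
        result[key] = float(row[value_column])
    return result


# Made-up example data; the column names are assumptions for illustration.
example_csv = io.StringIO("barcode,p_c\nAAAC,0.04\nTTTG,0.05\n")
p_c_by_barcode = read_grouped_values(example_csv, group_by=["barcode"])
```

With two or more grouping columns, each key becomes a tuple of the column values in the order given by group_by.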
- dynast.estimation.p_c.binomial_pmf(k, n, p)
Numbaized binomial PMF function for faster calculation.
- Parameters
k (int) – number of successes
n (int) – number of trials
p (float) – probability of success
- Returns
probability of observing k successes in n trials with probability of success p
- Return type
float
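For reference, the quantity this function returns is the standard binomial probability mass, C(n, k) * p^k * (1 - p)^(n - k). The sketch below computes it with the standard library only; the name `binomial_pmf_reference` is hypothetical, and the library's own version is described as Numba-compiled, which this sketch does not attempt to reproduce.

```python
from math import comb


def binomial_pmf_reference(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    # comb(n, k) counts the ways to place k successes among n trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: exactly 2 conversions among 4 nucleotides at a 50% rate.
prob = binomial_pmf_reference(2, 4, 0.5)
```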
- dynast.estimation.p_c.expectation_maximization_nasc(values, p_e, threshold=0.01)
NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.
- Parameters
values (numpy.ndarray) – array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where row is the number of conversions, column is the nucleotide content, and value is the number of reads
p_e (float) – background mutation rate of unlabeled RNA
threshold (float, optional) – filter threshold, defaults to 0.01
- Returns
estimated conversion rate
- Return type
float
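The (row, column, value) layout of values can be made concrete with a small NumPy sketch that expands the triples into a dense count matrix. The data and the helper name `to_dense` are made up for illustration; this is not part of the module.

```python
import numpy as np

# Made-up triples: (number of conversions, nucleotide content, number of reads).
values = np.array([
    [0, 10, 120],
    [1, 10, 15],
    [2, 12, 3],
])


def to_dense(values):
    """Hypothetical helper: expand zero-indexed triples into a dense matrix."""
    rows, cols, counts = values[:, 0], values[:, 1], values[:, 2]
    dense = np.zeros((rows.max() + 1, cols.max() + 1), dtype=counts.dtype)
    # np.add.at accumulates correctly even if a (row, column) pair repeats.
    np.add.at(dense, (rows, cols), counts)
    return dense


dense = to_dense(values)
```

Here `dense[k, n]` holds the number of reads with k conversions and nucleotide content n; every entry not listed in the triples is zero.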
- dynast.estimation.p_c.expectation_maximization(values, p_e, p_c=0.1, threshold=0.01, max_iters=300)
Run EM algorithm to estimate average conversion rate in labeled RNA.
This function runs the following two steps.
1) Constructs a sparse matrix representation of values and filters out indices that are expected to contain more than threshold proportion of unlabeled reads.
2) Runs an EM algorithm that iteratively updates the filtered-out data and the conversion rate estimate.
See https://doi.org/10.1093/bioinformatics/bty256.
- Parameters
values (numpy.ndarray) – array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where row is the number of conversions, column is the nucleotide content, and value is the number of reads
p_e (float) – background mutation rate of unlabeled RNA
p_c (float, optional) – initial p_c value, defaults to 0.1
threshold (float, optional) – filter threshold, defaults to 0.01
max_iters (int, optional) – maximum number of EM iterations, defaults to 300
- Returns
estimated conversion rate
- Return type
float
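To give a feel for how an EM loop can recover a conversion rate from such triples, here is a deliberately simplified sketch. It swaps in a plain two-component binomial mixture EM (unlabeled reads at rate p_e, labeled reads at rate p_c, with an extra mixing weight `pi`) and does not implement the filtering step or the missing-data update described above. The names `mixture_em`, `pi`, and `tol` are hypothetical, and the input data is synthetic.

```python
from math import comb


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def mixture_em(values, p_e, p_c=0.1, pi=0.5, max_iters=300, tol=1e-8):
    """Simplified mixture EM; values holds (conversions, content, reads)."""
    for _ in range(max_iters):
        sum_k = sum_n = sum_r = total = 0.0
        for k, n, count in values:
            # E-step: posterior probability that such a read is labeled.
            labeled = pi * binomial_pmf(k, n, p_c)
            unlabeled = (1 - pi) * binomial_pmf(k, n, p_e)
            r = labeled / (labeled + unlabeled)
            sum_k += count * r * k
            sum_n += count * r * n
            sum_r += count * r
            total += count
        # M-step: weighted conversion rate and labeled fraction.
        new_p_c = sum_k / sum_n
        pi = sum_r / total
        converged = abs(new_p_c - p_c) < tol
        p_c = new_p_c
        if converged:
            break
    return p_c


# Synthetic expected read counts generated from known parameters, so the
# estimate can be compared against the rate that produced the data.
true_pi, true_p_c, p_e, n = 0.3, 0.05, 0.001, 50
values = [
    (k, n, 1e5 * (true_pi * binomial_pmf(k, n, true_p_c)
                  + (1 - true_pi) * binomial_pmf(k, n, p_e)))
    for k in range(n + 1)
]
estimate = mixture_em(values, p_e)
```

Because p_e is held fixed, only the labeled rate and the mixing weight are updated. On data this well separated, the estimate should settle close to the rate used to generate the counts.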
- dynast.estimation.p_c.estimate_p_c(df_aggregates, p_e, p_c_path, group_by=None, threshold=1000, n_threads=8, nasc=False)
Estimate the average conversion rate in labeled RNA.
- Parameters
df_aggregates (pandas.DataFrame) – Pandas dataframe containing aggregate values
p_e (float) – background mutation rate of unlabeled RNA
p_c_path (str) – path to output CSV containing p_c estimates
group_by (list, optional) – columns to group by, defaults to None
threshold (int, optional) – read count threshold, defaults to 1000
n_threads (int, optional) – number of threads, defaults to 8
nasc (bool, optional) – flag to indicate whether to use NASC-seq pipeline variant of the EM algorithm, defaults to False
- Returns
path to output CSV containing p_c estimates
- Return type
str