poli.objective_repository.gfp_cbas.cbas_alphabet_preprocessing

Functions

convert_aas_to_idx_array(X_aa)

Converts a list of amino acid sequences into an array of amino acid indices from AA_IDX.

convert_idx_array_to_aas(X_aa)

Converts an array containing indices of amino acids into the corresponding string amino acid sequences.
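The two conversions above are inverses of each other. A minimal sketch of the round trip follows, assuming an alphabet of the 20 standard amino acids in alphabetical order. The names `EXAMPLE_AA`, `EXAMPLE_AA_IDX`, `aas_to_idx`, and `idx_to_aas` are hypothetical and stand in for the module's own `AA_IDX` and functions, whose ordering and return types may differ.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids, alphabetical by one-letter code.
# The module's real AA_IDX mapping may use a different order.
EXAMPLE_AA = "ACDEFGHIKLMNPQRSTVWY"
EXAMPLE_AA_IDX = {aa: i for i, aa in enumerate(EXAMPLE_AA)}


def aas_to_idx(seqs):
    """Map equal-length sequences to an integer array of shape (batch, L)."""
    return np.array([[EXAMPLE_AA_IDX[aa] for aa in seq] for seq in seqs], dtype=int)


def idx_to_aas(idx):
    """Map an integer array of shape (batch, L) back to a list of strings."""
    return ["".join(EXAMPLE_AA[i] for i in row) for row in idx]


seqs = ["ACD", "WYA"]
idx = aas_to_idx(seqs)
round_trip = idx_to_aas(idx)
```

Under these assumptions, `round_trip` equals `seqs`, which is a cheap sanity check to run whenever the alphabet changes.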

convert_mutations_to_sequence(base_seq, ...)

Given the wild-type sequence and a formatted mutation string, returns the mutated sequence
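This page does not specify the mutation string format. As an illustration only, the sketch below assumes colon-separated tokens of the form `<wild-type residue><1-based position><new residue>`, for example `S2A:E5D`. The helper `apply_mutations` and that format are assumptions, not the module's documented behaviour.

```python
import re


def apply_mutations(base_seq, mutation_str, sep=":"):
    """Apply substitutions such as 'S2A:E5D' to base_seq (assumed format)."""
    if not mutation_str:
        return base_seq
    seq = list(base_seq)
    for token in mutation_str.split(sep):
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", token)
        if match is None:
            raise ValueError(f"Unrecognised mutation token: {token!r}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        # Positions are assumed to be 1-based; check the wild-type residue matches.
        if seq[position - 1] != wild_type:
            raise ValueError(f"Expected {wild_type} at position {position}")
        seq[position - 1] = new
    return "".join(seq)


mutated = apply_mutations("MSKGE", "S2A:E5D")
```

Checking the wild-type residue before substituting catches off-by-one errors in the position convention early.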

get_argmax(Xt_p)

Given a categorical probability distribution specifying the probability of amino acids at each position in a sequence, returns the most probable sequence
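Taking the most probable sequence amounts to an argmax over the alphabet axis at each position. The sketch below assumes the distribution is an array of shape `(L, alphabet_size)` and reuses an alphabetical 20-letter alphabet. The function name `most_probable_sequence` and the string return type are hypothetical; the module's function may return indices or a different shape.

```python
import numpy as np

# Assumed alphabet order; the module's own ordering may differ.
EXAMPLE_AA = "ACDEFGHIKLMNPQRSTVWY"


def most_probable_sequence(probs):
    """Return the sequence built from the highest-probability residue per position."""
    probs = np.asarray(probs)
    return "".join(EXAMPLE_AA[i] for i in probs.argmax(axis=-1))


# Three positions, each putting most of its mass on one residue.
probs = np.full((3, 20), 0.01)
probs[0, 10] = 0.81  # M
probs[1, 15] = 0.81  # S
probs[2, 8] = 0.81   # K
best = most_probable_sequence(probs)
```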

get_balaji_predictions(preds, Xt)

Given a set of predictors built according to the deep-ensemble method of Lakshminarayanan et al., 'Simple and scalable predictive uncertainty estimation using deep ensembles' (2017), returns the mean and variance of the total prediction.
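In the deep-ensembles paper, each ensemble member predicts a Gaussian mean and variance, and the ensemble is treated as a uniform mixture of those Gaussians. The sketch below computes the mixture's mean and variance from per-member outputs. It assumes the member predictions are already available as arrays; `ensemble_mean_and_variance` is a hypothetical name and does not reproduce the signature `get_balaji_predictions(preds, Xt)`.

```python
import numpy as np


def ensemble_mean_and_variance(means, variances):
    """Combine per-member Gaussian predictions of shape (n_members, n_points)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Mixture mean: average of the member means.
    mu = means.mean(axis=0)
    # Mixture variance: E[sigma^2 + mu_m^2] - mu^2.
    var = (variances + means**2).mean(axis=0) - mu**2
    return mu, var


member_means = [[1.0, 0.0], [3.0, 0.0]]
member_vars = [[0.5, 1.0], [0.5, 1.0]]
mu, var = ensemble_mean_and_variance(member_means, member_vars)
```

The total variance therefore grows both with each member's own predicted noise and with disagreement between members.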

get_experimental_X_y([random_state, ...])

For the GFP testing experiments.

get_gfp_X_y_aa(data_df[, functional_only, ...])

Converts the raw GFP data to a set of X and y values that are ready to use in a model

get_gfp_base_seq()

Returns the wild type GFP sequence

get_samples(Xt_p)

Samples from a categorical probability distribution specifying the probability of amino acids at each position in a sequence
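Sampling from a position-wise categorical distribution can be sketched by drawing each position independently. The example assumes an input of shape `(L, alphabet_size)` whose rows sum to 1. The helper `sample_indices`, its `n_samples` and `rng` parameters, and the integer-index output are assumptions made for illustration.

```python
import numpy as np


def sample_indices(probs, n_samples, rng):
    """Draw n_samples index sequences, one independent draw per position."""
    probs = np.asarray(probs, dtype=float)
    length, alphabet_size = probs.shape
    out = np.empty((n_samples, length), dtype=int)
    for pos in range(length):
        out[:, pos] = rng.choice(alphabet_size, size=n_samples, p=probs[pos])
    return out


rng = np.random.default_rng(0)
probs = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5]])
samples = sample_indices(probs, 100, rng)
```

Passing the generator in explicitly keeps the sampling reproducible when a seed is fixed.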

one_hot_encode_aa(aa_str[, pad])

Returns a one hot encoded amino acid sequence

one_hot_encode_aa_array(X_aa)

One-hot encodes an array: (batch_size, L) -> (batch_size, L, alphabet_size)
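The shape change `(batch_size, L) -> (batch_size, L, alphabet_size)` can be obtained by indexing an identity matrix with the integer array. This is a minimal sketch with a hypothetical helper `one_hot`; the module's function may differ in dtype and in how it determines the alphabet size.

```python
import numpy as np


def one_hot(idx, alphabet_size):
    """One-hot encode integer indices along a new trailing axis."""
    idx = np.asarray(idx)
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(alphabet_size, dtype=int)[idx]


idx = np.array([[0, 2],
                [1, 1]])
encoded = one_hot(idx, 3)
```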

one_hot_encode_dna(dna_str[, pad, base_order])

Converts a length-M string into an M x 4 tokenized array
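For DNA, each of the M bases maps to one of four columns. The sketch below assumes a column order of `"ACGT"`, which may not match the function's default `base_order`, and it omits the `pad` option entirely. `one_hot_dna` is a hypothetical name.

```python
import numpy as np


def one_hot_dna(dna, base_order="ACGT"):
    """Encode a DNA string as an (M, 4) array; base_order sets the column order."""
    lookup = {base: i for i, base in enumerate(base_order)}
    out = np.zeros((len(dna), len(base_order)), dtype=int)
    for row, base in enumerate(dna.upper()):
        out[row, lookup[base]] = 1
    return out


encoded_dna = one_hot_dna("GATC")
```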

partition_data(X, y[, percentile, ...])

Partition a (X, y) data set by a percentile of the y values
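Partitioning by a percentile of `y` means computing a threshold from the `y` values and splitting the rows on either side of it. The sketch below returns both halves. Whether the module's function returns one side or both, and how it treats ties at the threshold, is not stated here, so `split_by_percentile` and its return value are assumptions.

```python
import numpy as np


def split_by_percentile(X, y, percentile=40):
    """Split (X, y) into rows at or below the y-percentile and rows above it."""
    X = np.asarray(X)
    y = np.asarray(y)
    threshold = np.percentile(y, percentile)
    low = y <= threshold
    return (X[low], y[low]), (X[~low], y[~low])


X = np.arange(10).reshape(5, 2)
y = np.array([0.1, 0.9, 0.5, 0.3, 0.7])
(X_low, y_low), (X_high, y_high) = split_by_percentile(X, y, percentile=50)
```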

read_gfp_data([path, df_save_file])

Reads the GFP brightness data into a pandas DataFrame