Ehrlich functions

Type of objective function: discrete

Environment to run this objective function: poli base

About

Ehrlich functions, proposed by [Stanton et al., 2024], are closed-form optimization objectives for discrete sequences. They are maximized when a collection of motifs is present in the input. See their paper for the details.

Prerequisites

None. This black box runs out of the box.

How to run

from poli.objective_repository import EhrlichBlackBox, EhrlichProblemFactory

# You can either
# (i) Create a black box
f = EhrlichBlackBox(
    sequence_length=256,
    motif_length=8,
    n_motifs=4,
    quantization=8,
)

# or
# (ii) create a problem
problem = EhrlichProblemFactory().create(
    sequence_length=256,
    motif_length=8,
    n_motifs=4,
    quantization=8,
)
f, x0 = problem.black_box, problem.x0

# Example input:
print(x0)

# Querying:
y = f(x0)
print(y)

How to cite

[1] Stanton, S., Alberstein, R., Frey, N., Watkins, A., & Cho, K. (2024). Closed-form test functions for biophysical sequence optimization algorithms. arXiv. https://arxiv.org/abs/2407.00236

[2] González-Duque, M., Bartels, S., & Michael, R. (2024). poli: a library of discrete sequence objectives [Computer software]. MachineLearningLifeScience/poli


@misc{Stanton:Ehrlich:2024,
      title={Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms}, 
      author={Samuel Stanton and Robert Alberstein and Nathan Frey and Andrew Watkins and Kyunghyun Cho},
      year={2024},
      eprint={2407.00236},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.00236}, 
}


@software{Gonzalez-Duque:poli:2024,
author = {González-Duque, Miguel and Bartels, Simon and Michael, Richard},
month = jan,
title = {{poli: a library of discrete sequence objectives}},
url = {https://github.com/MachineLearningLifeScience/poli},
version = {0.0.1},
year = {2024}
}

API reference

class poli.objective_repository.ehrlich.register.EhrlichProblemFactory

A factory for creating Ehrlich functions and initial conditions.

References

[1] Stanton, S., Alberstein, R., Frey, N., Watkins, A., & Cho, K. (2024). Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms. arXiv preprint arXiv:2407.00236. https://arxiv.org/abs/2407.00236

create(sequence_length: int, motif_length: int, n_motifs: int, quantization: int | None = None, seed: int = None, return_value_on_unfeasible: float = -inf, alphabet: list[str] = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'], batch_size: int = None, parallelize: bool = False, num_workers: int = None, evaluation_budget: int = inf, force_isolation: bool = False) -> Problem

Creates an Ehrlich function problem (containing an Ehrlich black box and an initial condition).

Parameters:
  • sequence_length (int) – The length of the sequence to be optimized. This length is fixed, and _only_ sequences of this length are considered.

  • motif_length (int) – The length of the motifs.

  • n_motifs (int) – The number of motifs.

  • quantization (int, optional) – The quantization parameter. This parameter must be between 1 and the motif length, and the motif length must be divisible by the quantization. By default, it is None (which corresponds to the motif length).

  • seed (int, optional) – The seed for the random number generator. By default, it is None (i.e. no seed is set).

  • return_value_on_unfeasible (float, optional) – The value to be returned when an unfeasible sequence is evaluated. By default, it is -np.inf.

  • alphabet (list of str, optional) – The alphabet to be used for the sequences. By default, it is the list of the 20 amino acids.

  • batch_size (int, optional) – The batch size for the black box. By default, it is None (i.e. all sequences are evaluated in a vectorized way).

  • parallelize (bool, optional) – Whether to parallelize the evaluation of the black box. By default, it is False.

  • num_workers (int, optional) – The number of processors used in parallelization.

  • evaluation_budget (int, optional) – The evaluation budget for the black box. By default, it is infinite.
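The constraint on quantization described above can be checked before a problem is created. The helper below is a minimal sketch based only on that description; valid_quantizations and resolve_quantization are hypothetical names used for illustration and are not part of poli.

```python
from typing import List, Optional


def valid_quantizations(motif_length: int) -> List[int]:
    """List every quantization the parameter description allows.

    A value q is accepted when 1 <= q <= motif_length and
    motif_length is divisible by q.
    """
    return [q for q in range(1, motif_length + 1) if motif_length % q == 0]


def resolve_quantization(motif_length: int, quantization: Optional[int] = None) -> int:
    """Apply the documented default (None means motif_length) and validate."""
    if quantization is None:
        return motif_length
    if quantization not in valid_quantizations(motif_length):
        raise ValueError(
            f"quantization={quantization} is not valid for motif_length={motif_length}; "
            f"expected one of {valid_quantizations(motif_length)}"
        )
    return quantization


print(valid_quantizations(8))      # [1, 2, 4, 8]
print(resolve_quantization(8))     # 8
print(resolve_quantization(8, 4))  # 4
```

For the motif_length=8 used in the "How to run" example, the accepted values are therefore 1, 2, 4 and 8.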


class poli.objective_repository.ehrlich.register.EhrlichBlackBox(sequence_length: int, motif_length: int, n_motifs: int, quantization: int | None = None, seed: int = None, return_value_on_unfeasible: float = -inf, feasibility_matrix_temperature: float = 0.5, feasibility_matrix_band_length: int | None = None, alphabet: list[str] = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'], batch_size: int = None, parallelize: bool = False, num_workers: int = None, evaluation_budget: int = inf)

Ehrlich functions were proposed by Stanton et al. [1] as a quick-and-easy alternative for testing discrete sequence optimizers (with protein optimization in mind). They are designed to

  1. be easy to query,

  2. have feasible and unfeasible sequences,

  3. have uninformative random samples (i.e. randomly sampling and evaluating should not be competitive, as many of the samples should be unfeasible),

  4. be maximized when certain motifs are present in the sequence. These motifs can be long-range within the sequence, and are meant to be non-additive.

Check the references for details on the implementation.
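To make property 4 concrete, here is a toy sketch of how a quantized, multiplicative motif score can behave. It is an illustration written for this page and not poli's implementation: the function names (quantized_motif_presence, toy_objective), the example motifs and the example offsets are all hypothetical, and the feasibility check from property 2 is left out entirely. See the paper [1] for the exact definition.

```python
import math


def quantized_motif_presence(sequence, motif, offsets, quantization):
    """Toy score in [0, 1] for one motif (assumption, not poli's code).

    The motif element i is expected at position start + offsets[i].
    We take the best match count over all start positions and round it
    down to a multiple of 1 / quantization.
    """
    k = len(motif)
    span = max(offsets)
    best = 0
    for start in range(len(sequence) - span):
        matches = sum(
            sequence[start + off] == elem for elem, off in zip(motif, offsets)
        )
        best = max(best, matches)
    return math.floor(best * quantization / k) / quantization


def toy_objective(sequence, motifs, offsets, quantization):
    """Product of the per-motif scores; feasibility is ignored in this toy."""
    value = 1.0
    for motif, offs in zip(motifs, offsets):
        value *= quantized_motif_presence(sequence, motif, offs, quantization)
    return value


# Hypothetical motifs and spacings over the amino-acid alphabet.
motifs = ["ARND", "CEQG"]
offsets = [[0, 1, 3, 4], [0, 2, 3, 5]]

complete = "ARVNDVCVEQVG"  # both motifs fully present
partial = "ARVNDVCVEQVV"   # last element of the second motif is missing

print(toy_objective(complete, motifs, offsets, quantization=4))  # 1.0
print(toy_objective(partial, motifs, offsets, quantization=4))   # 0.75
print(toy_objective(partial, motifs, offsets, quantization=2))   # 0.5
print(toy_objective(partial, motifs, offsets, quantization=1))   # 0.0
```

Because the per-motif scores are multiplied, the score is not a sum of independent contributions: a motif that scores zero sets the whole product to zero. In this toy, a coarser quantization also rounds partial matches down more aggressively, so the same sequence gives less signal.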

Parameters:
  • sequence_length (int) – The length of the sequence to be optimized. This length is fixed, and _only_ sequences of this length are considered.

  • motif_length (int) – The length of the motifs.

  • n_motifs (int) – The number of motifs.

  • quantization (int, optional) – The quantization parameter. This parameter must be between 1 and the motif length, and the motif length must be divisible by the quantization. By default, it is None (which corresponds to the motif length).

  • seed (int, optional) – The seed for the random number generator. By default, it is None (i.e. no seed is set).

  • return_value_on_unfeasible (float, optional) – The value to be returned when an unfeasible sequence is evaluated. By default, it is -np.inf.

  • feasibility_matrix_temperature (float, optional) – The temperature parameter for the feasibility matrix’s softmax. By default, it is 0.5.

  • feasibility_matrix_band_length (int, optional) – The band length for the non-zero values in the feasibility matrix. By default, it is None (i.e. if the alphabet size is v, the band length is v - 2 * (v // 5)).

  • alphabet (list of str, optional) – The alphabet to be used for the sequences. By default, it is the list of the 20 amino acids.

  • batch_size (int, optional) – The batch size for the black box. By default, it is None (i.e. all sequences are evaluated in a vectorized way).

  • parallelize (bool, optional) – Whether to parallelize the evaluation of the black box. By default, it is False.

  • num_workers (int, optional) – The number of processors used in parallelization.

  • evaluation_budget (int, optional) – The evaluation budget for the black box. By default, it is infinite.
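As a worked example of the default for feasibility_matrix_band_length, the formula v - 2 * (v // 5) from the parameter description can be evaluated directly. The function name default_band_length is a hypothetical helper for this illustration, not a poli function.

```python
def default_band_length(alphabet_size: int) -> int:
    """Evaluate the documented default: v - 2 * (v // 5)."""
    return alphabet_size - 2 * (alphabet_size // 5)


# The default alphabet from the signature above (20 amino acids).
alphabet = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I',
            'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

print(default_band_length(len(alphabet)))  # 20 - 2 * 4 = 12
print(default_band_length(4))              # 4 - 2 * 0 = 4
```

With the default alphabet of size 20, the formula gives a band length of 12. For alphabets with fewer than 5 symbols the integer division is zero, so the formula returns the alphabet size itself.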

References

[1] Stanton, S., Alberstein, R., Frey, N., Watkins, A., & Cho, K. (2024). Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms. arXiv preprint arXiv:2407.00236. https://arxiv.org/abs/2407.00236

construct_optimal_solution(motifs: numpy.ndarray | None = None, offsets: numpy.ndarray | None = None) -> ndarray

Constructs an optimal solution for a given set of motifs and offsets.

If none are provided, the motifs and offsets of the black box are used.

construct_random_motifs(motif_length: int, n_motifs: int, seed: int = None) -> ndarray

Creates a given number of random motifs of a certain length.

construct_random_offsets(motif_length: int, n_motifs: int, seed: int = None) -> ndarray

Creates a given number of random offsets for the motifs.