5. Future Work

5.1. Brief General Conclusion to Date

Chemical informatics-based conformer generators are computationally cheaper than MD, and they predict geometrically plausible structures of cyclic peptides well. The more interesting question, however, concerns the dynamics of cyclic peptides: which of the geometrically plausible structures are relevant in solution, or when crossing cell membranes? The work presented so far is preliminary and lays the foundation for the next steps of this project. We developed tools and furthered our understanding of MD methods that capture the dynamics of CPs. We expect this initial work to pay off when we apply these insights to augment or replace the cheaper conformer generators with ML models that recover some of the lost dynamics.

Below, I outline some specific ideas for future work. First, I discuss how the MacroConf workflow and dataset (step 1) will be concluded. Then, I sketch what steps 2-4 (introduced in Outline of the General Project) may look like. I cover possible endpoints in step 4 first, before detailing steps 3 and 2, because the endpoints will determine the composition, size and features of the dataset in step 2.

5.2. Step 1 continued

At this point, the MacroConf dataset and workflows are well developed and have been tested on four compounds. We plan to scale the current analysis to more compounds of the dataset. Given the choice of the current force field, natural cyclic peptides will be the first compounds to simulate. To analyse the macrocyclic compounds of the dataset, additional force field parameters will be required[40] or, if these are not available, a more general all-atom force field (e.g., GAFF) will become necessary. Additionally, other enhanced sampling methods such as REMD could be added to the workflow to further validate the (G)aMD results. Other chemical informatics conformer generators could also be included to see whether they produce results comparable to OMEGA and RDKit.

The final task of this step will be to assess how the studied conformer generators (MD and chemical informatics) can be applied to the following steps. We need to analyse further how well the chemical informatics conformer generators match the MD ensembles, considering not just the backbone dihedral angles but also the sidechains. Suitable descriptors could be combinations of reduced dihedrals with distance-based features, or shape-based metrics such as the principal moments of inertia. We will also compute coarser, global properties of the MD ensembles and of the generated conformers, such as solvent-accessible surface area (SASA), polar surface area (PSA) and dipole moments, to assess how much the observed conformations vary in terms of these quantities.
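One way the backbone-dihedral comparison between generated conformers and MD ensembles could be set up is sketched below. This is a minimal illustration and not the MacroConf workflow itself: the function names, the toy angles, the 30 degree cutoff and the idea of scoring by circular RMSD to MD cluster centres are all assumptions made for this example.

```python
import math

def circular_diff(a, b):
    """Smallest signed difference a - b between two angles in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def dihedral_rmsd(conf_a, conf_b):
    """Circular RMSD (degrees) between two equally long lists of dihedral angles."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformers must have the same number of dihedrals")
    squares = [circular_diff(a, b) ** 2 for a, b in zip(conf_a, conf_b)]
    return math.sqrt(sum(squares) / len(squares))

def nearest_md_cluster(conformer, md_centres):
    """Return (index, rmsd) of the MD cluster centre closest to one conformer."""
    rmsds = [dihedral_rmsd(conformer, centre) for centre in md_centres]
    best = min(range(len(rmsds)), key=rmsds.__getitem__)
    return best, rmsds[best]

def coverage(md_centres, conformers, cutoff=30.0):
    """Fraction of MD cluster centres with at least one conformer within the cutoff."""
    covered = sum(
        1 for centre in md_centres
        if any(dihedral_rmsd(conf, centre) <= cutoff for conf in conformers)
    )
    return covered / len(md_centres)

# Hypothetical toy data: four backbone dihedrals per structure, in degrees.
md_centres = [[-60.0, -45.0, 175.0, 60.0], [-120.0, 130.0, -179.0, 70.0]]
generated = [[-65.0, -40.0, -178.0, 55.0], [60.0, 60.0, 60.0, 60.0]]

for i, conf in enumerate(generated):
    idx, rmsd = nearest_md_cluster(conf, md_centres)
    print(f"conformer {i}: nearest MD cluster {idx}, circular RMSD {rmsd:.1f} deg")
print(f"coverage of MD clusters: {coverage(md_centres, generated):.2f}")
```

The circular difference matters because dihedrals are periodic: angles of -178 and 175 degrees differ by 7 degrees, not 353. The same pattern could be extended to sidechain dihedrals or combined with distance-based features.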

5.3. Step 4: Prediction of Pharmacokinetic Properties of Cyclic Peptides

Before assembling a dataset in step 2, we first need to consider the desired outputs, which depend on what is most useful for accelerating the drug discovery process. Outputs are limited by data availability and by whether reliable computational methods exist to estimate the properties of interest. We currently see two possible starting points for further investigation: i) predicting target binding affinity, or ii) predicting membrane permeability.

i) To predict target binding affinity, a suitable target with available experimental or computational data must be selected. A promising starting point with available binding data could be RGD-containing cyclic peptides, which are being designed for the αvβ3 integrin target that is overexpressed in many cancers.[59] Additionally, BindingDB[60] lists 2174 hits for αvβ3 integrin. Further, it was demonstrated that GaMD simulations can be applied to inform binding of cyclic peptides to protein targets.[61] This may be useful to generate more training data.

ii) To predict membrane permeability, we will have to assemble or use a dataset of cyclic peptides with associated cell or membrane permeability data. Possible assays for membrane permeability are the parallel artificial membrane permeability assay (PAMPA)[62], which measures passive membrane permeability only, and Caco-2[63], which gives an indication of both passive and active transport. Permeability data are available from various sources, for example the DOS macrocycle set, which was used by Poongavanam et al. to predict cell permeability based upon 3D descriptors calculated via OMEGA.[12, 64] Other permeability data are also available, e.g., 200 non-peptidic macrocycles with experimental cell permeability data[65], among other sources.[32, 66]

5.4. Step 3: Exploring AI Methods to Learn & Predict Properties/3D Structure from Features of the Dataset in Step 2

The feasibility of step 3 depends on our final analysis of step 1. How well do the chemical informatics conformer generators reproduce and predict cyclic peptide solution structures, and is there a simple way of finding the relevant conformations depending on the environment? If the conformer generators are highly accurate, then it might not be necessary to predict 3D structures via ML. Rather, we could build ML models that assist with finding the bioactive or otherwise relevant conformations within the pool of geometrically plausible structures. Recent advances in multi-instance learning approaches to QSAR modelling are promising for this purpose.[67]

If we conclude that the accuracy of the chemical informatics conformer generators is not “good enough” compared to the MD methods, it will be valuable to predict the 3D structure from the sequence. For this, we would need to build a dataset of cyclic peptide solution structures in step 2 that covers the chemical space of cyclic peptides as much as possible. This dataset could be leveraged to predict the 3D structure from the sequence or from other features. A crucial part of this step will be to benchmark several different ML models and model architectures. Another important aspect will be how to represent the 3D structure, e.g., whether to predict Cartesian coordinates directly, or instead to use dihedral angles, distance matrices, etc. The success of this approach will depend on selecting appropriate input features derived from the sequence that enable accurate prediction of conformations (see step 2).
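The choice between Cartesian coordinates and internal representations can be illustrated with a small sketch. It shows that rotating a structure changes its Cartesian coordinates but leaves its distance matrix unchanged, which is one reason why internal representations are often considered as prediction targets. The four points are toy data, not a real peptide, and the helper names are hypothetical.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances for a list of (x, y, z) tuples."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rotate_z(coords, angle_deg):
    """Rotate every point about the z axis by angle_deg degrees."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

def max_abs_diff(m1, m2):
    """Largest element-wise absolute difference between two equally sized matrices."""
    return max(abs(a - b) for row1, row2 in zip(m1, m2) for a, b in zip(row1, row2))

# Toy "structure" of four points (arbitrary units).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.5, 1.4, 0.3)]
rotated = rotate_z(coords, 40.0)

# The Cartesian representation changes under rotation ...
coords_changed = any(math.dist(a, b) > 1e-6 for a, b in zip(coords, rotated))
# ... while the distance matrix stays the same up to floating-point error.
dm_diff = max_abs_diff(distance_matrix(coords), distance_matrix(rotated))

print("coordinates changed:", coords_changed)
print("distance matrix unchanged:", dm_diff < 1e-9)
```

One caveat for this design choice: a distance matrix is also unchanged by mirror reflection, so it cannot distinguish a structure from its mirror image, whereas signed dihedral angles can.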

5.5. Step 2: Building a Dataset / Using a Dataset of Cyclic Peptide Solution Structures (via a Suitable Method identified in Step 1)

As part of the first step, we compared and benchmarked different conformer generation methods on CP solution structures. Before computationally expanding the MacroConf dataset for use in step 3, we need to identify a method, or a combination of methods, that accurately reproduces experimental solution structures of CPs. We will leverage this method to build an extensive dataset of CP solution structures. We still need to decide on the composition of this dataset. We could (i) use an already existing dataset, e.g., the gigalibrary[68], (ii) build a new dataset based on a set of design criteria (equal representation of every amino acid, of pharmacokinetic properties, of 3D structural motifs, etc.), or (iii) expand a previously published dataset of cyclic peptides. An important aspect will be to cover the chemical space of cyclic peptides as broadly as possible. We also need to consider what type of features the dataset should have. The decision of which features to use and how to represent the 3D structure is linked to which AI models will be used in step 3, and to which other properties we will predict in step 4. Consequently, steps 2, 3, and 4 will not necessarily happen in strict succession. For predicting pharmacokinetic properties of cyclic peptides, the availability of high-quality data, or of methods to produce such data, will play a big role in determining the dataset. Further, depending on the objectives of step 4, we may need different datasets for different objectives.
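The design criterion of representing every amino acid equally could be monitored with a simple balance check, sketched below. The one-letter sequences are made-up examples, and the use of normalised Shannon entropy over the 20 canonical residues is an assumption chosen for illustration; non-canonical residues would need a different encoding.

```python
import math
from collections import Counter

def residue_frequencies(sequences):
    """Relative frequency of each residue across all sequences."""
    counts = Counter(res for seq in sequences for res in seq)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}

def normalised_entropy(freqs, alphabet_size=20):
    """Shannon entropy of the residue distribution divided by its maximum.

    1.0 means all residues of the alphabet are equally represented;
    values near 0 mean the dataset is dominated by very few residues.
    """
    entropy = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return entropy / math.log(alphabet_size)

# Hypothetical candidate dataset of cyclic peptide sequences (one-letter codes).
candidate_set = ["RGDFV", "GRGDSP", "AAAAGA"]

freqs = residue_frequencies(candidate_set)
balance = normalised_entropy(freqs)
most_common = max(freqs, key=freqs.get)

print(f"balance score: {balance:.2f}")
print(f"most over-represented residue: {most_common} ({freqs[most_common]:.0%})")
```

A score like this only captures composition. Balanced coverage of pharmacokinetic properties or 3D structural motifs would need analogous checks on those features.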