2. Methods

2.1. Overview

This section covers the different MD methods and parameters used (see Section 2.2). In Section 2.3, I introduce details about how the simulations were analysed, including reweighting, dimensionality reduction, clustering, and computing of NOEs. In Section 2.4, I present the workflow manager Snakemake, which was used to execute all simulation and analysis steps. Finally in Section 2.5, I explain details about the chemical informatics conformer generators.

2.2. Molecular Dynamics simulations

For the CPs studied in this work, we used unconstrained enhanced sampling methods. Kamenik et al showed that aMD describes cyclic peptide solution structures well.[25] Here, we applied two flavours of accelerated MD: aMD[27] and GaMD[28], which were both already used to describe cyclic peptides in solution.[37] For comparison, we also performed cMD simulations. All MD simulations were run in the Amber 18 software.[38] While all atom forcefields are available, we used the AMBER FF-14SB protein force-field[39], which limits our initial analysis to natural cyclic peptides. The simulations performed here could easily be extended to include modified, non-natural amino acids by using extended parameters such as Forcefield_Ncaa[40] or a more general all atom forcefield (GAFF or others).[30] When building the cyclic peptide topologies in Amber, a bond between the first and last amino acid had to be included manually. For water, the tip3p model was used.[41] For simulations in DMSO, the GAFF forcefield was used.[42] RESP charge parameters for DMSO were taken from literature.[43] In addition, the BCC charge derivation method was also used.[44] To simulate Chloroform, we used the Amber18 Chloroform model.[45] All solvent boxes were octahedral, with a distance of at least 12 Å from the molecule to any box edge. After topology parametrisation and solvation, a cascade of different energy minimization and equilibration steps were performed including, minimize solvent, relax solvent, minimize system, heat system (NVT), equilibrate solvent (NPT), equilibrate system (NPT)[46]. Production runs in (G)aMD or cMD were performed in the NPT ensemble. Prior to (G)aMD simulations, an additional equilibration step was performed to determine the boosting parameters. For simulation details, see the computational workflow.

2.3. Analysis of MD simulations

2.3.1. Reweighting of accelerated MD simulations

Reweighting of accelerated MD was performed via Maclaurin series expansion to the second order as implemented in the PyReweighting toolkit provided by Miao et al.[26] For this work, the toolkit was modified to work as a python function. Inputs for reweighting are the 1D or 2D variables of interest that depend on the simulation time (e.g., dihedral angle over trajectory, distance over trajectory…), and the boosting potential for the whole trajectory. The reweighted quantity is obtained as a potential of the mean force (PMF) distribution of corresponding dimensionality to the input quantity.

2.3.2. Dimensionality Reduction

The simulation trajectory describes the potential energy of the system as a function of the cartesian coordinates (3Natoms dimensional space). To generate an interpretable picture of the energetic landscape, dimensionality reduction is required. Here, we performed principal component analysis (PCA) of the reduced dihedral angles (dPCA), cartesian coordinates, the pairwise distances matrix, and combinations and subsets thereof. By considering the first two principal components that explain the majority of the variance observed of the input features, we obtained qualitative PES. dPCA is commonly used as this tends to give reasonably well separated energy landscapes.[47]

2.3.3. Clustering

Many structures found during an MD simulation are qualitatively redundant with only minor coordinate changes. We clustered the simulation trajectories to identify representative conformations. As in Cipcigan et al, we used t-SNE (with a perplexity of 50, a learning rate of 400, and 2000 timesteps) as the dimensionality reduction method to better separate different structures.[37] As above, we used the reduced dihedral angles as input features. We then clustered with the density-based spatial clustering of applications with noise (DBSCAN) algorithm to identify different clusters in the t-SNE representation. Parameter details are similar to Cipcigan et al.[37] To get a representative structure for each cluster, we averaged the cartesian coordinates of all structures belonging to the same cluster. To avoid unphysical representative cluster structures, we selected the closest (lowest RMSD) observed structure in the simulation trajectory as the representative structure for each cluster. Finally, we compared the obtained representative cluster structures to the PCA representation to see if minima coincide with the representative cluster structures.

2.3.4. Computing NOEs in the context of accelerated MD

NOE distance constraints can be derived from NOESY or ROESY experiments. We cannot simply average the observed distances of an MD simulation to obtain NOE values. Instead, we applied \(r^{-6}\) averaging of the observed simulation distances.[25, 48, 49] This is formalised in Equation (2.1), where the distance between two atoms is \(r_{i}\) in the simulation frame \(i\), over all simulation frames \(N\).

(2.1)\[d_{NOE} = \left \langle r^{-6} \right \rangle ^{-{1}/{6}} = \left ( \frac{1}{N} \sum_{i=1}^{N}\frac{1}{r^{6}_{i}} \right )^{-1/6}\]

We separately computed the NOEs for all relevant atom pairs for ambiguous NOEs (an experimental NOE value can arise from multiple different chemically identical atoms). We could average the resulting ambiguous NOEs, but instead we chose to individually compare the ambiguous NOE values to the experimentally observed value. This is because we are uncertain about how the ambiguous experimental NOEs were derived for some datapoints. For accelerated MD, we cannot use the \(r^{-6}\) averaging procedure because the trajectories are biased and thus unphysical.[25] For this reason, we first reweighed the relevant distances for each NOE value as described above. We then applied a weighted \(r^{-6}\) average, with the weights derived from the resulting PMF distribution as Boltzmann factors.[50]

2.4. Snakemake & automated workflows

The workflow manager Snakemake[51] was chosen to implement the computational workflow in a reproducible and reusable way. Snakemake performs the “heavy lifting” of data analysis, such as; handling file inputs, automatically deploying software, ensuring scalability, portability, and readability to guarantee sustainable data analysis.[52] Snakemake implements user-defined rules which have defined inputs and output files. Snakemake automatically infers which rules to connect by building a directed acyclic graph (DAG) and handles execution of the rules in the right order to produce a user requested output. Snakemake was selected over other workflow managers for its Python derived readable syntax, its popularity, active development, and the native Conda and Jupyter notebook integration.

2.5. Chemical informatics conformer generators

SMILES strings were used as inputs to the RDKit[11] and OMEGA[9] conformer generators to remove any kind of previous structural information.[9] SMILES strings were produced from the parametrised MD topology (see Section 3.1) to closely match the MD reference structure for later comparison. The resulting conformer generator outputs were renumbered to match the MD reference and in some cases Hydrogen atoms were added or removed to exactly match the MD topology.