Hybrid 2D/3D-quantitative structure–activity relationship and modeling studies perspectives of pepstatin A analogs as cathepsin D inhibitors
Aim: Cathepsin D, one of the attractive targets in the treatment of breast cancer, has been implicated in HIV neuropathogenesis with potential proteolytic effects on chemokines. Methodology/result: Diverse modeling tools were used to reveal the key structural features affecting the inhibitory activities of 78 pepstatin A analogs. Analyses were performed to investigate the stability, rationality and fluctuation of the analogs. Results showed a clear correlation between the experimental and predicted activities of the analogs as well as the variation in their activities relative to structural modifications. Conclusion: The insight gained from this study offers theoretical references for understanding the mechanism of action of cathepsin D and will aid in the design of more potent and clinically-relevant drugs.The global cancer burden has increased considerably in the past 30 years [1] with the number of cases rising from12.7 million in 2008 [2] to 14 million new cases in 2012 [3], this being predicted to climb to 21.7 million in 2030 [4]and/or keep increasing by 70% per year over the next 20 years [5]. Globally, in 2015, cancer was responsible for8.8 million deaths, with nearly one out of six attributed to cancer [3]. In 2016, it was estimated that in the USA alone, approximately 1.6 million new cancer cases were reported [6].The growing HIV epidemic has contributed to and impacted on the increasing number of cancer cases in various countries [7–13]. HIV has been one of the main causes of the health crisis in Africa [14]. Globally, in 2015, approximately 1.1 million people died from HIV-related causes [15], while by the end of that year, approximately36.7 million people were living with the virus, and an estimated 2.1 million had become freshly infected. In 2015, approximately 25.6 million people were living with HIV in sub-Saharan Africa, which is the most affected region, and accounts for two out of three global total new cases of HIV infections.
In sub-Saharan Africa, cervical and breast cancers are the leading causes of cancer death among women [16], and over the years, the alarming increase of HIV infections has contributed to its increased rate [17–21]. Cancer is now being estimated to contribute to one third of deaths in patients with HIV in developing countries [22–24].Proteases of the cathepsins family are generally optimally active at low pH, a characteristic common to HIV- 1 infection, although catheptic activity can occur at neutral pH for a limited period of time [25]. Cathepsin D, a lysosomal aspartyl protease that is expressed in all tissues [26], has been suggested to play a role in the metastatic potential of several types of cancer, including colon, prostate, ovarian, uterine and breast cancer [27]. Most metastatic breast cancer cell lines secrete high levels of procathepsin D [28], which stimulates their proinvasive and prometastatic properties by acting as a mitogen on both cancer and stromal cells. This abnormal secretion is due to both overexpression of the cathepsin D gene and an altered processing of the precursor protein. Besides their established role in the pathogenesis of cancer [29–31], cathepsins have been reported to be implicated in the neuropathogenesis of HIV with potential proteolytic effects on chemokines [32], although direct interactions have not been demonstrated to date [33]. Upon HIV infection, cathepsin D, an aspartic activity and collagenase activity are decreased in circulating monocytes [34], as it increases intracellular viral replication by reacting with the long terminal repeat [35]. Some human mammary epithelial cell cultures have also been reported to support the growth of HIV-1 [36].Cathepsin D (PDB Code 1LYA and 1LYB), like other aspartic proteases [37–40], has a two-domain structure [41].
A recent study by Arodola and Soliman discussed the flap dynamics of the human cathepsin D in its Apo and bound forms [42], and provided insight into the two important aspartic residues (at position 33 and 231 located on the 14-kDa and 34-kDa chains, respectively (Figure 1) [40]. Cathepsin D is upregulated by estrogens and growth factors (i.e., IGF1, EGF, insulin) in ER-positive breast cancer cell lines [41], while the same cannot be said for its overexpression in ER-negative [43], making the development of extracellular and intracellular cathepsin D inhibitors important.Pepstatin A, an inhibitor of nearly all-acidic aspartic proteases, is the most potent inhibitor of cathepsin D so far identified [44–50]. It is commonly used to study the role of cathepsin D in in vitro [51] and in vivo [52–54]. Experimental studies on the inhibition of HIV replication in cell culture by pepstatin A have also been reported [55]. Cathepsin D level represents a prognostic factor in an array of cancers and is therefore considered to be a potential target of anticancer therapy [56–67].Several compounds that contain hydroxyethyl amine isosteres with flexible alkyl amines lead to the development of aspartic protease inhibitors with poor bioavailability and shorter half-lives [68], and are therefore therapeutically useless. However, the use of tertiary isosteres with cyclic tertiary amines provided inhibitors with enhanced oral absorption [69].This study aimed to investigate the detailed binding modes and important structural features affecting the inhibition activities between cathepsin D and 78 pepstatin A analogs, the intention being to target cathepsin D in the treatment of HIV and cancer since it is implicated in both. Various approaches were used to provide a comprehensive landscape using different computational methods. The advantage of accelerated molecular dynamics (aMD) cannot be overemphasized as it has been demonstrated in studies to improve the efficiency of conventional MD (cMD) over a short timescale of simulation [70,71].
In this study, to provide insight into the rational modification of molecules for the design of more potent cathepsin D inhibitors, we performed a molecular modeling study combining enhanced docking, aMD, molecular mechanics/generalized born surface area (MM/GBSA) calculation, 2D- and3D-quantitative structure–activity relationship (QSAR) calculations. Understanding the mechanism of action of cathepsin D could provide improved treatment for HIV/cancer co-infected patients globally.In this study, 78 pepstatin A analogs and their biological activities were identified from the literature [72–74] and used as the dataset for the QSAR study (Figure 2). The biological activities were converted to their corresponding pIC50 (= -log IC50) values, which were used as dependent variables in the respective 2D and 3D QSAR study. The 78 compounds were randomly divided into training set (including 55 compounds) and test set (including 23 compounds) in an approximate ratio of 3:1. All structures and their corresponding pIC50 values are listed in Table 1 below. The training set was used to construct 2D and 3D QSAR models, and the test set to ensure model validation. All molecular modeling and QSAR studies were performed using Discovery Studio 2016 software package [75].The x-ray crystal structure of bound cathepsin D in complex with pepstatin A (PDB code 1LYB) was obtained from Protein Data Bank [76] with only chains A and B being retained for simulation in this study. Complex modification as well as visualizations is performed on Chimera [77], MMV [78] molecular modeling suite and Avogadro software [79].Autodock Vina 1.5.4 [80] was used to dock 78 pepstatin A analogs into the cathepsin D binding pocket. All water molecules were removed from the original PDB file, polar hydrogen atoms were added to the ligands (i.e., 78 analogs), Gasteiger partial atomic charges [81] were added to the ligands; and Kollman charges [82] and atomic solvation parameters were assigned to the protein using Autodock tools Vina.
The grid parameters were X = 40, Y = 48 and Z = 40 for the dimensions and X = 20.919, Y = 96.245 and Z = 7.842 for the center grid box with spacing of 0.375 ˚A. Pepstatin A was docked back into the cathepsin D binding pocket in order to validate the docking protocol and the top four analogs with better binding affinities were taken for further modeling studies.Accelerated molecular dynamics simulationcMD simulations of macromolecules have frequently encountered two major challenges – conformational sampling and accurate force-fields [83,84]. aMD is an improved sampling technique that uses a non-negative boost potential when the potential of the system is lower than the reference energy needed to decrease the energy barriers and accelerate transitions between different low-energy states [85–87] and to sample diverse biomolecular conformations and rare-crossing events that are not accessible to cMD simulations.System preparation for simulation protocolMolecular dynamic simulations of cathepsin D in complex with four analogs were executed on the GPU version of PMEMD in Amber14 [88,89]. The RED server [90] was used to parameterize nonstandard residues wherein Firefly 7.1G was used for ligand optimization and to calculate the electrostatic potential charges using functional HF/6–31G*. The fitting of the charges was done and the parameter file for the ligand in mol2 format extracted.The FF99SB force-field in Amber14 [91] defined the parameters for cathepsin D, with the LEaP module used to add hydrogen atoms to cathespin D and counter ions addition for the neutralization of the entire system [92]. The solvation of the system was done in a TIP3P water box such that cathepsin D surface was located 8 ˚A away from the water-box boundary [93]. The ionizable side chain residues were protonated at a pH of 3.7 with Na+ being added to achieve charge neutrality and closely imitate the biological environs. The protonation state was substantiated bycomparing the results generated on AMBER and the H++ server [94]. Periodic boundary conditions were applied with long-range electrostatic interactions treated with the particle-mesh Ewald method [95] of AMBER14 with a 10 ˚A van der Waals cut-off.
Energy minimization was performed with a restraint potential of 500 kcal/mol ˚A2 to treat the solute for 1000 steps using the steepest descent method followed by 2000 steps unrestrained conjugate gradient minimization. Owing to the lack of parameters in the Cornell et al. force-field [96], the optimization of the ligands was performed with the Gaussian 03 suite using the HF/6–31G* level [97]. The AMBER force-field (ff03) [98] derived the cathepsin D topology and parameter files based on the Cornell et al. [99] force-field model. MD simulations with canonical ensemble (NVT) were implemented for 50 ps, gradually heated from 0 to 300 K with harmonic restraints of 5 kcal/mol ˚A2 for solute atoms and a Langevin thermostat with 1 ps collision at random frequency. System equilibration was conducted unrestraint at 300 K in the NPT ensemble for 500 ps, with the pressure being maintained at 1 bar using a Berendsen barostat [100]. The SHAKE algorithm [101] was used to constrain all hydrogen bonds and a 2 fs MD runs using the SPFP precision model [102] was performed. A 2 ns production run was conducted prior to a conventional MD production run of 10 ns with random velocities read at 500 ps, 1 ns, 1.5 ns and 2 ns from the initial 2 ns MD production run with NPT ensemble and a Berendsen barostat-measured pressure of 1 bar and a 2 ps pressure coupling constant for each case. Accelerated MD simulation was then run for 10 ns and the coordinate files were saved every 1 ps time interval with the trajectories from all system simulations being analyzed. Post-MD analysis was performed using the Amber14-implemented CPPTRAJ and PTRAJ modules [103]. Plotting and visualizations were conducted respectively on the Origin data analysis tool [104] and Chimera molecular modeling tool [105].Thermodynamics calculationThe free-binding energies of the cathepsin D-analogs active site were calculated using the MM/GBSA method. Thermodynamic calculations rely on molecular simulations to compute binding-free energy within the context of a force-field, thereby, providing valuable data about the complex. This requires the extraction of 1000 snapshots from 10 ns trajectory with the mean values being computed. The binding-free energy (∆Gbind) of the complex was calculated thus: In Equation (2), −T∆S, where T and S are temperature and entropy respectively, is the conformational energy upon binding. S was derived by using the NMODE module of AMBER14 [106]. In Equation (3), Egas denotes the gas-phase energy; Eint denotes the internal energy; and Eele and Evdw denotes the electrostatic and van der Waals energies, respectively. In Equation (4), Gsol denotes the free solvation energy while GGB and GSA denotes the polar and nonpolar solvation contribution, respectively. In Equation (5), the SASA denotes solvent accessible surface area; water probe radius and surface tension constant γ was set to 1.4 ˚A and 0.0072 kcal/(mol·˚A2), respectively [107].
Results & discussion
A cathepsin D enzyme comprises three domains: the N-terminal, C-terminal and inter-domain region [108] wherein two important aspartic acids, Asp33 and Asp223, contribute majorly toward the catalytic active site. As described by Arodola and Soliman [42], the active site residues are Asp33, His77, Ty78, Gly79, Ser80, Phe123, Tyr197, Thr226, Ser227, Asp231, Pro304, Pro305, Pro306, Ser307, Gly308 and Pro309 with grid parameters being X = 40, Y = 48 and Z = 40 for the dimensions and X = 20.919, Y = 96.245 and Z = 7.842 for the center grid box. Four analogs complexes with the best docking affinities were subjected to subsequent MD simulations. As seen in Table 1, compounds 48, 50, 73 and 76 with binding affinities of -10.0, -10.2, -10.0 and -10.0 kcal/mol respectively were taken for further modeling studies so as to conserve computational resources.Molecular alignment of compounds is a crucial step in developing a QSAR model, especially 3D QSAR model [109]. To obtain the rational conformation for developing 3D-QSAR models, two alignment methods were applied, based on the ligand (dataset superimposition) and receptor (docking) (Figure 3). For the ligand alignment, one of the most potent inhibitors (compound 48) was chosen as a template to fit the remaining datasets using the alignment function available on Chimera visualization software where the result showed that all analogs have a common scaffold. For the receptor alignment, the predicted conformations of the compounds after docking were aligned and the result of the superimposition is shown in Figure 3. Furthermore, the distribution of electrostatic potentials (nonbonded interactions due to distribution of electrons) and van der Waals forces interaction energy (repulsion or attraction between nonbonded atoms) maps were generated on Discovery Studio 2016 client (Figure 4).
The obtained results could help scientists understand the binding process, and provide some insights into the further modification and development of new potent inhibitors.2D QSAR modelTo provide insight into the 2D structural similarity of the dataset and to generate a statistically significant 2D-QSAR model, multiple linear regression (MLR) was used to analyze the training set by correlating the variation in their pIC50 values (dependent variable) with the predicted MLR statistical model (independent variable). MLR uses the large numbers of descriptors to maximize the amount of information gained from the training set to build an improved fit as a model [111]. A total of seven molecular descriptors were used to build the MLR model. To further evaluate the reliability of the generated MLR model, cross-validation analysis was accomplished by leave- one-out method, wherein one compound was removed from the dataset and the model predicted its activity. A cross-validated correlation coefficient, Q2 (otherwise known as R2cv) was obtained as 0.365. The established MLR model exhibited good correlative ability with cross-correlation coefficient of determination (R2) value of 0.821 (Figure 5). Predicted and residual values for the derived 2D QSAR model are provided in Table 2, and the residual plot depicted in Figure 6. Predicted values are the experimental activities of the compounds and the residual values are the difference between the experimental biological activities and the predicted activities, which were found to below.3D QSAR modelTo investigate the 3D conformational similarity of the dataset with respect to cathepsin D, a statistically significant 3D-QSAR model was generated. Partial least squares (PLS) is a statistical approach that combines features from principal component analysis and multiple regressions [112,113] and is mostly considered useful to predict a set of dependent variables from a large set of independent variables.
Thus, to analyze the training set, the pIC50 values (dependent variable) were correlated with the predicted PLS statistical model (independent variable). To further evaluate the reliability of the generated PLS model, cross-validation analysis was accomplished by the leave-one-out method, wherein one compound was removed from the dataset and the model predicted its activity. A cross-validated correlation coefficient, Q2 (otherwise known as R2cv) was obtained (0.574). The established PLS model exhibited good predictive ability with an adjusted correlation coefficient (R2A) of 0.758 and cross-correlation coefficient of determination (R2) value of 0.780 (Figure 7). The residual plot for the 3D-QSAR model is shown in Figure 8.Overall, a good value of R2 between experimental and predicted activity does not necessarily mean that the predicted values are very close to their corresponding experimental value. There may be considerable numerical differences between the two, however, maintaining an overall good correlation coefficient value greater than 0.5 is important for any acceptable model. Thus, while the 2D-and 3D-QSAR models reported in this study can be said to be acceptable models, 2D-QSAR is impervious to the conformational arrangements of atoms, while 3D-QSAR needs information on the atoms position in three spatial dimension. A Q2 value is usually used as a diagnostic tool to evaluate the predictive power of a model equation. High Q2 (>0.5) often serves as ultimate proof of the high predictive power of the QSAR model.
Thus, in this study, it is safe to say the 3D-QSAR (Q2= 0.574) showed better predictive power when compared with 2D-QSAR (Q2= 0.365) with regards to their respective Q2 values.Furthermore, as can be seen from Figure 9, the 78 analogs used as a dataset in this study fitted into the grid box of the receptor with a correlation coefficient value of 0.994. However, the conformations exhibited by each one can be deduced from the calculated binding affinities obtained from our docking result, which is included in Table 1 and comprehensively explained in section 3.6.Solubility assessment is one of the most important preformulation parameters that have a substantial impact on the performance of a molecule. Thus, the majority of new chemical entities fail in the development stage due to pharmacokinetic reasons [114]. Distribution coefficient (KD) is the ratio of all unionized and ionized species in two immiscible solvents, and is an extremely important parameter, particularly when assessing the molecule’s potential to cross biological or lipid barriers [115]. A molecule is said to be lipophilic (soluble in organic phase) when KD >1 whereas it is hydrophobic (soluble in aqueous phase) when KD <1.The ADMET descriptors generated for the 78 analogs on Discovery studio 2016 is depicted in Figure 10. The ‘absorption’ descriptor predicts human intestinal absorption after oral administration at 95 and 99% absorption while the ‘aqueous solubility’ descriptor predicts the solubility of the molecules at 25◦C. ‘Blood–brain barrier’ descriptor predicts the blood–brain barrier penetration of a molecule as a ratio of the compound (solute) on both sides of the membrane after oral administration. As seen from the plot in Figure 10, the majority of the analogs fall within the Absorption 95 and Absorption 99 region. This indicates that the majority of the studied analogs have good bioavailability/oral absorption. As seen from Tables 1 & 2, modifications in the ring of the hydroxyethyl tertiary amine appear to have enhanced the potency of the inhibitors.According to the docking result, the 78 analogs used in this study have very good binding affinities that range from-8.0 to -10.2 kcal/mol with compounds 48, 49, 50, 73 and 76 having the highest binding affinities (-10.0, -10.2,-10.2, -10.0 -10.0 kcal/mol), respectively. The reason for the differences in the binding affinities may be due to any of the following observations: first, comparing compounds 48 and 50, as they both have the same structural core (core D), however, compound 50 has an ethyl group attached to its R1 position, which may have increased its binding affinity. Second, the presence of hydrogen atoms makes hydrogen bonding and hydrophobic interaction possible. As seen from Tables 1 & 2, modifications in the ring of the hydroxyethyl tertiary amine appear to have enhanced the potency of the inhibitors.Although a docking procedure can provide a starting point with respect to binding modes prediction, it ignores the protein flexibility and the effect of water solvation on the system. Thus, to further validate and rationalize the docking result by taking into account ligand-induced conformational changes, 10 ns cMD followed by 10 ns aMD simulations were undertaken in a state that mimics the biological environment with the aim of checking the stability of the complex in the aqueous medium and the flexibility of the residues upon ligand binding.Post-MD analysesSystem stability & molecular dynamics simulationsTo ascertain the stability and proper equilibration of the systems, the root mean square deviation and potential energy fluctuation were monitored and analyzed. 10 ns cMD followed by 10 ns accelerated MD analyses were carried out on five systems. As depicted in Figure 11, the convergence of the systems attained stable states at approximately 2 ns (2000 ps). Compound 48 appears to have a slight deviation at 1.8 to 2.8 ns. However, the overall average RMS deviation values for the backbone atoms of cathepsin D-analog complexes were not greater than 1 ˚A. Figure 12 depicts the negligible variations in the potential energy for all systems, which affirms the proper equilibration and stability during simulation.The MM/GBSA method is used to calculate the free energy difference between two different solvated conformations of the same molecule. As seen from Table 3, pepstatin A (compound 0) has the highest binding-free energy, which is the energy that can do work, compared to the other analogs. The conformation adopted by the drug that perfectly aligns with the active site residues defines an optimum binding affinity which subsequently induces an efficient therapeutic effect whereby Pepstatin A inhibited cathepsin D at a concentration of 0.002 μM [72], with a docking score of -8.6 kcal/mol and an estimated binding-free energy of -91.88 kcal/mol. This implies that Pepstatin A has a relatively high inhibitory activity compared to the other studied analogs. However, there is still a pressing need for the discovery of more inhibitors that can inhibit cathepsin D. This information could be useful in designing ‘tailored’ inhibitors with improved binding affinity against cathepsin D.Root mean square fluctuation analysisFigure 13 presents the root mean square fluctuation of amino acid residues from cathepsin D-analog complex simulation. As seen in Figure 12, residues such as 48–50, 79–80, 201–207 and 288–290 exhibited similar pattern of fluctuation in the five studied systems. This behavior was properly explained in a previous publication [42] wherein cathepsin D exhibited a different conformational change, rigidity and/or flexibility upon ligand binding. Furthermore, the flexibility of some of these residues may have been due to the fact that they are trying to facilitate a perfect binding pocket that will fit each analog.Ligand interaction & per-residue energy decomposition analysisPer-residue analysis of the cathepsin D backbone atoms was studied to obtain more insight on the structural flexibility of the protein. The individual contribution of the active site residues was computed by using the MM/GBSA method implemented in Amber14 [103]. An intriguing observation in Figure 14 is that hydrophobic interactions and hydrogen bonds are largely accountable for the stability of the ligands in the active site.Pepstatin A interacts with the enzyme with up to six hydrogen bonds (Gly79, Gly35, Ser80, Thr226, Gly225 and Ser227), a characteristic not found in the analogs. The energy contribution of individual active site residues is presented in Figure 15.As Figure 15 reveals, key residues including Asp33, Gly35, Tyr78, Gly79, Ser80, Tyr197, Thr226, Ser227, Gly225, His77, Asp223, Ser227 and Ala13 contributed the most toward the binding energy of the studied analogs. Although there may be observed variations in the energy contributed by these residues to the binding of each analogs to cathepsin D, it was observed that they are important in ligand binding. The data reported herein will not only provide insight into the effect of structural diversity/similarity of the studied analogs in the inhibitory profiling of cathepsin D, but could also serve as a route map for designing novel cathepsin D inhibitors to treat HIV/cancer co-infection. Conclusion This study identified potential cathepsin D inhibitors to treat HIV/cancer co-infection and focused on the structural diversity and similarity of pepstatin A analogs, the only approved inhibitor of cathepsin D. QSAR models were developed based on molecular, structural, physiochemical, 2D and 3D properties obtained via MLR and PLS models, respectively. The biological activities of the 78 compounds used as dataset in this study were obtained from literature and validated using 2D- and 3D-QSAR model. These models hold good predictive correlation coefficient R2 values of 0.821 and 0.780, respectively. It is noticed that modifications in the ring of the hydroxyethyl tertiary amine appear to have enhanced the potency of the inhibitors. The root mean square deviation results revealed that the studied analogs exhibit good stability in the binding pocket and the deviation was not more than 1 ˚A. The ligand interaction map of the studied compounds showed that the hydrophobic and hydrogen bonding played an important role in binding to cathepsin D. Furthermore, the per-residue decomposition analysis showed the degree of energy contributed by these active site residues in binding. The structural similarity of pepstatin A analogs identified by these QSAR models can be used to design more potent drugs that possess cathepsin D inhibitory properties, similar to those of the studied analogs against cathepsin D. We believe that the information provided in this study will influence the future design of potent cathepsin D inhibitors in the treatment of HIV/breast cancer co-infection. Future perspective So far, the cure for HIV has continued to elude scientists. Metastatic breast cancer has been met with selective therapeutic compounds. It is understood that cathepsin D plays a vital role in the progression of cancer and allows the permissiveness of HIV-1 cell replication. However, there are only a few potent inhibitors of HIV-1 inhibitors available to date. Of a fact, these inhibitors are still met with resistance in many patients. Several cancer drugs have suffered the same fate. Pepstatin A has been the only known inhibitor of cathepsin D so far identified. This study contributes to the field by identifying the common structural features of synthesized analogs with experimental cathepsin D activities. Future research will focus on the design of tailored pharmacophores which would be validated using diverse computational approach. The research will focus on the optimization of their physicochemical properties as well as the characterization of their inhibition activities. However, the impact of the selected cathepsin D inhibitors will need to be investigated through the demonstration of intra/extracellular effects and therapeutic responses possibly in vitro, in vivo and eventually in an animal model. Study of preclinical pharmacokinetics ADME for analysis of the newly discovered inhibitors would have considerable benefits for research in this Pepstatin A field.