Web-based Tools for Computational Enzyme Design

Enzymes are in high demand for very diverse biotechnological applications. However, natural biocatalysts often need to be engineered to fine-tune properties such as activity, selectivity, stability to temperature or co-solvents, and solubility for the end application. Computational methods are increasingly used in this task, providing predictions that significantly narrow down the space of possible mutations and can greatly reduce the experimental burden. Many computational tools are available as web-based platforms, making them accessible to non-expert users. These platforms are typically user-friendly, contain walk-throughs, and require neither deep expertise nor software installation. Here we describe some of the most outstanding recent web tools for enzyme engineering and formulate future perspectives in this field.


Introduction
Enzymes are the catalysts used by nature to perform the complex chemical reactions required to sustain life. They evolved over billions of years to achieve the high efficiency and specificity required by each lifeform to survive and thrive in its environment. Biotechnology has emerged as a way for mankind to exploit nature's creations, with numerous benefits over classical chemical processes. In many cases, however, technological applications require particular properties beyond what is available in naturally occurring biomolecules, such as specific activity, selectivity, stability, or solubility. In such cases, the enzymes have to be re-engineered on demand [1].
The global protein-engineering market was valued at USD 1.9 billion in 2018 and is projected to reach USD 3.9 billion in 2024, a compound annual growth rate (CAGR) of 12.4% over this period. Rational protein design accounted for the largest technology segment of the market in 2018, while biopharmaceutical companies accounted for the largest end-user segment [2]. These figures demonstrate the growing importance of protein engineering.
Directed evolution methods have been used extensively and successfully to improve natural biomolecules. However, they can be expensive and time-consuming. Therefore, the use of computational methods for rational design is becoming more and more common, and the predictive power of computational tools is gradually improving. Many of the existing tools now offer intuitive and user-friendly web-based platforms, which extend their usability to the broader community. Because they require neither software installation nor a Unix command-line environment, these platforms are ideal for non-specialists.
In this review, we focus on web-based computational tools for enzyme engineering. We describe the recently developed web servers that we consider the most outstanding (Table 1), selected from a larger pool of tools (Supplementary Tables S1-S6). We have organized them by focus: i) enzyme discovery, ii) protein solubility, iii) enzyme activity and specificity, iv) protein stability, v) protein dynamics, and vi) multipurpose. We omitted tools specialized in structure prediction or in the identification of protein-protein interactions to keep the focus on enzyme design.
a Some of the listed items are mandatory and some are optional; b ΔΔG is the stabilization energy and corresponds to the change in free energy upon each mutation from the wild-type or template; c ROSIE hosts many individual tools to be discussed individually.
A complementary strategy consists of identifying protein-ligand structural motifs from the protein structures in the RCSB Protein Data Bank (PDB) [38]. This approach relies on the rationale that such binding interfaces are substrate-specific, which in turn is an essential fingerprint of the catalytic process.

Engineering protein solubility and aggregation
A recurring problem when producing engineered protein variants is that they may suffer from diminished solubility or from aggregation. Several approaches relying on sequence and structure properties provide solutions for solubility prediction and optimization [40,41].
Among the methods requiring a 3D structure as input are Aggrescan3D 2.0 [8] and SOLart [9]. SOLart relies on structure-derived statistical potentials to infer the solubility of the query protein.
Differences in Gibbs free energy inferred from such statistical potentials, especially those considering backbone torsion angles, solvent accessibility and inter-residue distances, allow for accurate predictions of solubility when compared with experimental values, achieving Pearson's correlation coefficients of 0.67 and 0.51 on an independent validation set and on modelled proteins, respectively.
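For readers less familiar with this metric, the Pearson correlation coefficient quoted above can be computed as in the following minimal sketch. The function name and the example values are illustrative assumptions; they are not taken from SOLart or from its validation sets.

```python
import math

def pearson_r(predicted, experimental):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Covariance and variances (the common 1/n factor cancels in the ratio).
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)

# Hypothetical solubility scores in arbitrary units, for illustration only.
predicted = [0.2, 0.5, 0.4, 0.9, 0.7]
experimental = [0.1, 0.6, 0.3, 0.8, 0.9]
print(round(pearson_r(predicted, experimental), 2))
```

A value of 1 indicates a perfect linear relationship between predicted and measured values, 0 indicates none, and negative values indicate an inverse relationship.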
AggreRATE-pred [10] integrates amino acid physicochemical and structure-based properties with mutational and contact propensities in a multiple regression model to predict the effect of mutations on aggregation rates. The model applied depends on the protein length and on the type of secondary structure in which the mutation(s) occur. This strategy achieves a correlation between experimental and predicted values of up to 0.82 and also performs well on modelled proteins.
Interestingly, this approach does not rely on any structural information for short peptides (< 40 amino acids).
When the 3D structure of the protein to be engineered is not available, or obtaining a model is challenging, solubility can also be predicted from the protein sequence. Solubility-Weighted Index [11] [*] offers a pre-calculated compendium of per-residue flexibility propensities that were refined and optimized on a set of 12,216 target proteins from 196 different species that were expressed in E. coli using either a C- or N-terminal poly-histidine fusion tag. The strategy derives from the observation that, among almost 10,000 different protein properties studied, flexibility was the best predictor of solubility.
SoluProt [12] [*] is based on gradient boosting regression and predicts solubility from the protein sequence. The machine learning model was developed using a manually curated TargetTrack database. Considering the amino-acid singlet and dimer content of the polypeptide chain, their physicochemical properties, membrane propensity and similarity to the E. coli 3D proteome, this approach achieves an AUC of 0.60 on a newly compiled independent set. SoluProt is integrated in EnzymeMiner [7], providing an easy way to filter out proteins unlikely to be soluble in the process of novel enzyme discovery.
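To make the notion of amino-acid singlet and dimer content concrete, the sketch below computes such composition features from a sequence. It is only an illustration of this kind of sequence descriptor: the function name and the example sequence are hypothetical, and the actual feature set and preprocessing of SoluProt are not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence):
    """Return relative frequencies of the 20 single residues and 400 dipeptides."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    singles = Counter(seq)
    dimers = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    features = {}
    for aa in AMINO_ACIDS:
        features[aa] = singles[aa] / len(seq)
    for first in AMINO_ACIDS:
        for second in AMINO_ACIDS:
            # A sequence of length n contains n - 1 overlapping dipeptides.
            features[first + second] = dimers[first + second] / (len(seq) - 1)
    return features

# Hypothetical short sequence, for illustration only.
features = composition_features("MKTAYIAKQR")
print(len(features), features["K"], features["AK"])
```

A fixed-length vector of this kind can then be passed to a regression or classification model, regardless of the length of the input protein.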

Engineering enzyme activity and specificity
Enzyme activity and selectivity are the key features normally targeted in enzyme engineering.
Although activity and selectivity are very different properties, they can often be improved using similar computational approaches. Engineering the activity towards a substrate of interest is likely to also improve the selectivity towards this substrate. The most common strategy consists of introducing mutations in the active site to optimize it for the targeted substrate. Other approaches have also proven successful, namely the engineering of access tunnels, modification of dynamic properties, editing of recognition elements such as loops, and targeting of allosteric sites.
Important computational tools for engineering enzyme function, among which is the gold-standard Rosetta toolbox [42], have been reviewed [43,44]. The Rosetta-based web tool FuncLib [13] [**] was specially designed to introduce multiple-point mutations into the binding site. Taking into account evolutionary information and energetically favorable substitutions, single-point mutations are combined and ranked by the predicted stabilization free energies (ΔΔG). The FuncLib workflow ensures that no deleterious mutations are introduced, and it can account for potential epistatic effects resulting from combining multiple mutations.
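The idea of combining tolerated single-point mutations into ranked multiple-point designs can be sketched as follows. This is a deliberately simplified illustration and not the FuncLib implementation: the function name, the cutoff and the ΔΔG values are hypothetical, and the energies of combined mutations are assumed to be additive, which ignores the epistatic effects that the workflow described above is meant to capture.

```python
from itertools import combinations, product

def enumerate_designs(single_ddg, max_mutations=3, ddg_cutoff=0.0):
    """Combine tolerated single-point mutations and rank the designs by summed ddG.

    single_ddg maps a position to {amino_acid: predicted ddG}; negative = stabilizing.
    ASSUMPTION: energies are summed (additivity), so epistasis is not modelled.
    """
    # Discard substitutions predicted to be destabilizing.
    allowed = {}
    for position, substitutions in single_ddg.items():
        kept = {aa: ddg for aa, ddg in substitutions.items() if ddg <= ddg_cutoff}
        if kept:
            allowed[position] = kept
    designs = []
    for size in range(1, max_mutations + 1):
        for positions in combinations(sorted(allowed), size):
            choices = [sorted(allowed[p]) for p in positions]
            for residues in product(*choices):
                total = sum(allowed[p][aa] for p, aa in zip(positions, residues))
                designs.append((total, tuple(zip(positions, residues))))
    designs.sort()  # most negative (most stabilizing) first
    return designs

# Hypothetical ddG values in kcal/mol, for illustration only.
single_ddg = {10: {"A": -1.0, "W": 2.0}, 25: {"L": -0.5}, 101: {"F": -0.25}}
for total, design in enumerate_designs(single_ddg, max_mutations=2)[:3]:
    print(total, design)
```

Because the number of combinations grows quickly with the number of positions and allowed residues, restricting the input to a few evolutionarily and energetically tolerated substitutions per position is what keeps such an enumeration tractable.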
CaverDock [14], integrated into Caver Web [30] [**] (section 7), can be used for engineering enzyme activity and selectivity. This tool was designed to predict the trajectory and binding energy profile of the (un)binding of a ligand travelling through the enzyme access tunnels using a constrained molecular docking algorithm. The user can run calculations for different ligands or for multiple enzyme variants, and assess which combinations provide the best energy profiles. This is especially useful when the limiting steps in the catalysis involve substrate binding or product release.
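One simple way to compare such energy profiles between ligands or enzyme variants is to extract the highest energy barrier along the tunnel. The sketch below assumes a profile given as a list of binding energies ordered along the direction of ligand travel; the function name, the barrier definition (largest rise above the lowest energy visited so far) and the numbers are illustrative assumptions and are not part of CaverDock or its output format.

```python
def highest_barrier(profile):
    """Return (barrier, index_of_minimum, index_of_maximum) along a 1D energy profile.

    The barrier is the largest rise in energy above the lowest point visited
    so far while moving from the first to the last point of the profile.
    """
    if not profile:
        raise ValueError("profile must contain at least one energy value")
    best = (0.0, 0, 0)
    min_index = 0
    for index, energy in enumerate(profile):
        if energy < profile[min_index]:
            min_index = index
        rise = energy - profile[min_index]
        if rise > best[0]:
            best = (rise, min_index, index)
    return best

# Hypothetical binding energies (kcal/mol) along a tunnel, for illustration only.
wild_type = [-5.0, -6.5, -3.0, -4.0, -1.0, -2.0]
variant = [-5.0, -6.0, -4.5, -4.0, -3.5, -3.0]
print(highest_barrier(wild_type))
print(highest_barrier(variant))
```

Under these assumptions, a variant with a lower barrier for the same ligand would be the more promising candidate when ligand transport limits catalysis.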
Enzyme specificity can also be modified by engineering loops, which are flexible elements that can modulate substrate recognition and binding specificity. DaReUS-Loop [15] [**] (re)models loops in homology models and can search databases for new loop conformations suitable for introduction into the target structure. This can help users find new enzyme variants with diverse substrate specificities. nAPOLI [16] automatically identifies conserved protein-ligand interactions across a large data set, such as a list of PDB structures or any protein within a specified range of sequence identity. It compiles the types of interactions and the networks formed to find hotspots within the binding sites or to suggest mutations that can produce more favorable interactions with a specific substrate.

Engineering protein stability
Enzyme stability refers to the range of temperature, co-solvents, pH, and other general conditions that enzymes can withstand while remaining active. For many biotechnological purposes, it is desirable that enzymes survive for longer times or under harsher conditions than the native variants normally would. One can push those boundaries by engineering stability using: (i) energy calculations, (ii) phylogenetic analysis, (iii) machine learning, and (iv) combinations of these.
Ancestral sequence reconstruction (ASR) is a strategy that is increasingly used for protein stabilization. FireProt-ASR [17] [*] is the first fully automated platform for inferring ancestral sequences by phylogenetic analysis. Starting from a single protein sequence, the tool builds a dataset of homologous sequences and performs a multiple sequence alignment to build a phylogenetic tree and reconstruct the ancestral nodes. The method can be used not only to improve thermostability, but also to expand the catalytic promiscuity and increase the expressibility of enzymes.
Electrostatic interactions are crucial to protein folding and integrity. They also govern the effects of pH and ion concentration on protein stability. However, they are often underestimated or poorly predicted during enzyme engineering. The TKSA-MC [18] and pStab [19] tools tackle this issue by assessing unfavorable electrostatic interactions and identifying charged hot-spot residues for mutagenesis.
A very different approach is protein stabilization by the introduction of disulfide bonds. Yosshi [21] [*] and SSBondPre [22] are recent tools devoted to this strategy, the former using evolutionary analysis and the latter using machine learning. Most stability prediction methods have been developed for globular soluble proteins. mCSM-membrane [23] can predict the stability changes or the pathogenicity associated with mutations in membrane proteins.

Engineering protein dynamics
Proteins exist in dynamic, metastable conformational states, transitioning through an ensemble of possible local conformations. The motions resulting from such transitions can fundamentally influence the catalytic activity of an enzyme [49,50]. Thus, assessing and engineering enzyme dynamics may be crucial to achieve a desired activity output. CABS-flex 2.0 [26] also allows users to impose user-defined distance restraints and offers an improved graphical output.
ProSNEx [27] [*] models inter-residue interaction networks from the input 3D coordinates of the protein to be studied. Such contacts are weighted according to dynamical cross-correlation maps obtained from elastic network models or other normal mode applications, graph-theory-based spectral clustering of side chains, or energies derived from molecular dynamics simulations. These dynamics studies are enriched with subsequent network and sequence conservation analyses, and the results are presented in an easy-to-interpret, graphics-intensive interface.
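A dynamical cross-correlation map of the kind mentioned above can be computed from an ensemble of conformations as sketched below. The sketch assumes that all frames have already been superimposed on a common reference and that each residue is represented by a single point (for example, its Cα atom); the function name and the toy coordinates are hypothetical, and this is not the ProSNEx implementation.

```python
import math

def cross_correlation_map(frames):
    """Normalized cross-correlation C[i][j] of residue displacements.

    frames: list of conformations, each a list of (x, y, z) tuples, one per residue.
    ASSUMPTION: frames are pre-aligned, so only internal motions remain.
    """
    n_frames = len(frames)
    n_res = len(frames[0])
    # Average position of each residue over the ensemble.
    mean = [[sum(f[i][k] for f in frames) / n_frames for k in range(3)]
            for i in range(n_res)]
    # Displacement of each residue from its average position in each frame.
    disp = [[[f[i][k] - mean[i][k] for k in range(3)] for i in range(n_res)]
            for f in frames]

    def avg_dot(i, j):
        return sum(sum(d[i][k] * d[j][k] for k in range(3)) for d in disp) / n_frames

    matrix = [[0.0] * n_res for _ in range(n_res)]
    for i in range(n_res):
        for j in range(i, n_res):
            denom = math.sqrt(avg_dot(i, i) * avg_dot(j, j))
            value = avg_dot(i, j) / denom if denom > 0 else 0.0
            matrix[i][j] = matrix[j][i] = value
    return matrix

# Toy ensemble: residues 0 and 1 move together, residue 2 moves in the opposite direction.
frames = [
    [(0, 0, 0), (5, 0, 0), (10, 0, 0)],
    [(1, 0, 0), (6, 0, 0), (9, 0, 0)],
    [(-1, 0, 0), (4, 0, 0), (11, 0, 0)],
]
for row in cross_correlation_map(frames):
    print([round(value, 2) for value in row])
```

Values close to 1 indicate residues moving in the same direction, values close to -1 indicate anti-correlated motion, and such values can serve as edge weights in a residue interaction network.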
AlloSigMA 2 [28] studies allostery and is based on a structure-based statistical mechanical model. The server evaluates the allosteric free energy resulting from the perturbation of any residue in the input structure. It allows for testing the allosteric effects of introducing mutations and the impact of binding a ligand to the studied system. An intuitive graphical user interface provides a rapid interpretation of the protein regions whose dynamics changed.

LARMD [29] [*] automates the execution of fully atomistic molecular dynamics simulations up to 4 ns long. Untrained users can opt for the suggested, easy-to-set-up predefined conditions, while more experienced ones can fine-tune the execution parameters to their needs. The application focuses on deciphering the structural and dynamical effects of ligand binding, and to this end implements tunnel discovery tools such as CAVER 3.0 [52]. Furthermore, it offers a wide range of analyses of the obtained trajectories: (i) structural variability and fluctuation analyses, (ii) normal mode analysis, and (iii) trajectory clustering. The server provides a wide range of graphics and charts to ease the interpretation of the results.
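As an example of a fluctuation analysis, the root-mean-square fluctuation (RMSF) of each atom over a trajectory can be computed as in the minimal sketch below. It assumes pre-aligned frames stored as plain coordinate lists; the function name and the toy data are hypothetical, and the sketch does not reproduce how LARMD performs this analysis.

```python
import math

def rmsf(frames):
    """Root-mean-square fluctuation per atom.

    frames: list of snapshots, each a list of (x, y, z) tuples, one per atom.
    ASSUMPTION: snapshots are already superimposed on a common reference.
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation of the atom from its average position.
        msd = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3))
                  for f in frames) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy trajectory: the first atom oscillates along x, the second atom is static.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(2.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
print(rmsf(frames))
```

Residues with high RMSF values correspond to flexible regions such as loops, which makes the profile a convenient first view of how ligand binding changes local mobility.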

Engineering multiple properties
Some protein engineering web tools integrate multiple tasks in robust workflows. Caver Web [30] [**] can be used to identify molecular tunnels and channels in proteins with buried cavities and to predict the transport of ligands through these tunnels (Figure 2). The workflow starts with the identification of the relevant pockets and the computation of tunnels from the selected pocket to the surface using CAVER 3.0 [52]. The user then selects the tunnels and ligands for analysis of the transport using CaverDock (section 4). This integrated analysis allows identifying hotspots in the enzyme tunnels whose modification can remove barriers to the transport of the target substrates or products, or increase their specificity, and thus improve the enzymatic function. [...] are also available as web servers.
ProteinsPlus [33] [*] is a unified platform integrating multiple tasks of protein investigation, namely database exploration, structural quality assessment, conformational analysis, binding site analysis, 2D-interaction diagrams, pocket detection, etc. Although it is not devoted to enzyme engineering per se, it can provide comprehensive structural knowledge. pPerturb [34] [...].

Conclusions and perspectives
Here we reviewed recently published web-based tools specialized in different aspects of enzyme engineering, which can be valuable resources for experimental scientists. The advantages of web-based tools are their immediate use without tedious installation, optimal settings already selected by the developers, regular updates and maintenance, and shared computational resources. We observe a boom of new methods and approaches, especially the rise of predictors based on machine learning, for which the quality of the experimental data used for training is of paramount importance. However, this quality is not always guaranteed by the available databases, which would benefit greatly from stronger community efforts to supply high-quality, findable, annotated and curated data. Such data will also provide essential input for machine learning as well as for critical comparisons of newly developed tools. Modern high-throughput experimental technologies such as fluorescence-activated cell sorting, microfluidics, cell-free expression and deep mutational scanning will enable the collection of large and highly consistent data sets.
We observed a large number of tools devoted to enzyme discovery, although they mainly focus on predicting the potential enzymatic activity of a protein sequence rather than on retrieving potential catalysts from a collection of orphan proteins. We also see a shift in the strategies for engineering activity and specificity, as many recent tools focus on elements outside the active site, e.g., loops, tunnels, and highly flexible and allosteric regions. In general, tools for engineering catalytic activity, selectivity and protein solubility are insufficiently developed, and more reliable ones are needed to provide practically useful predictions.
With the constant increase of computational power, which allows more robust assessment of structural ensembles, we expect protein dynamics to become a more integral part of next-generation tools. We predict the same for the design of catalytic activity using high-level methods, i.e., quantum mechanics or hybrid quantum mechanics/molecular dynamics, which should be made accessible via web servers. We have witnessed a game-changing situation with the development of GPU cards and their use for computationally demanding tasks. We envisage another major breakthrough with gradually maturing quantum computing technologies.
agreement 814418 and 857560. The article reflects the author's view and the Agency is not responsible for any use that may be made of the information it contains.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as * or **.
* Generates protein structure networks from an input structure and weights the network edges according to dynamic cross-correlation values, calculated from elastic network models or obtained from molecular dynamics. Provides a comprehensive analysis of the predicted motions of a protein.
* Combines structural and sequence information to identify mutagenesis hotspots, based on functional features, conservation analysis, residue correlations, etc. Predicts stabilizing mutations using B-factors and back-to-consensus. Enables rational design of point mutations using Rosetta calculations as well as construction of smart libraries for directed evolution.
* Provides a common environment for hosting a large number of web-accessible Rosetta protocols for modeling and designing proteins and other biopolymers. This new implementation simplified the submission process and extended the number of protocols available.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 3 December 2020 | doi:10.20944/preprints202012.0089.v1
GSP4PDB [5] allows the user to design and define so-called Graph-based Structural Patterns as representations of the protein-ligand interface and then query the PDB for such patterns, thus returning proteins that could potentially accommodate the ligand. LIBRA-WA [6] is a web-based application that exploits network theory to identify binding pockets in an input protein. This identification is done by comparison with two precompiled databases: one of ligand-binding sites and the Catalytic Site Atlas [39].
A third strategy consists of finding existing protein sequences that could carry out an enzymatic function. While the first approach relied on precisely predicting the enzymatic function of an input sequence, here the challenge is to comprehensively identify the maximum number of protein sequences able to perform a given catalytic function. EnzymeMiner [7] [**] accepts several proteins with known enzymatic function as input, infers their essential or catalytic residues, and exploits different tools for the assessment of sequence similarity to identify such potential catalysts (Figure 1). In contrast with the tools presented in the second strategy, EnzymeMiner does not rely on knowledge of the 3D structure of proteins. The tool is fully automated, ranks sequences by their predicted solubility and provides annotations on source organism, extremophilicity, structure availability, etc., to guide the selection process.
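The filtering of homologous hits by the presence of essential residues can be illustrated with the following minimal sketch. It assumes that the hits have already been aligned to the query, so that essential positions can be addressed by alignment column, and it allows several residues per position to represent degenerate positions. The function names, the data layout and the sequences are hypothetical and do not describe the internals of EnzymeMiner.

```python
def has_essential_residues(aligned_sequence, essential):
    """Check that every essential alignment column holds an allowed residue.

    essential maps a 0-based alignment column to a string of allowed residues,
    e.g. {2: "DE"} accepts either Asp or Glu in the third column.
    """
    return all(
        column < len(aligned_sequence) and aligned_sequence[column] in allowed
        for column, allowed in essential.items()
    )

def filter_hits(aligned_hits, essential):
    """Keep only the hits that carry all essential residues."""
    return {
        name: sequence
        for name, sequence in aligned_hits.items()
        if has_essential_residues(sequence, essential)
    }

# Hypothetical aligned hits; "-" marks an alignment gap.
aligned_hits = {
    "hit1": "MSDHK",
    "hit2": "MSEHK",
    "hit3": "MSAHK",
    "hit4": "MS-HK",
}
essential = {2: "DE", 3: "H"}
print(sorted(filter_hits(aligned_hits, essential)))
```

Hits with a gap or a non-permitted residue at any essential position are discarded, which removes homologs that are similar in sequence but unlikely to retain the catalytic machinery.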

Figure 1. Illustration of the EnzymeMiner workflow [7] [**]. The web server accepts several sequences with the desired function. The user can also input 'other sequences' performing the desired function to help in the sequence filtering step. The server can retrieve catalytic residues from the Catalytic Site Atlas, or the user can define them (allowing for degenerate positions). The query proteins are used to search for homologs, and the obtained hits are subsequently clustered and filtered, ensuring the presence of the defined essential residues. Multiple annotations are retrieved to enrich the information on the filtered list of hits. The final results are presented in two interactively integrated views: (i) the Putative Hits view allows for prioritization according to any of the retrieved annotations and (ii) the Similarity Network view presents the sequences clustered according to their sequence similarity.
Aggrescan3D 2.0 [8] [**] is a well-established aggregation predictor that projects a pre-calculated intrinsic aggregation propensity scale onto the query protein structure. Thus, the aggregation propensity values used to produce the final prediction are modulated by the specific structural context of the evaluated region or patch. The newest version improves the predictions by considering protein flexibility and stability and by providing suggestions for optimizing solubility.

Enzyme dynamics may be crucial to achieve a desired activity output and also has an impact on predicting protein solubility and stability. DynaMut2 [25] [*] combines Normal Mode Analysis methods and graph-based signatures to investigate the effects of single- and multiple-point mutations on protein stability and dynamics. The server reports B-factors that characterize the predicted flexibility of the mutants and changes in the stability. Moreover, the server offers the possibility to independently run coarse-grained predictions on a structure using five different force fields. CABS [51] is a coarse-grained force field accounting for side chain contacts, main-chain hydrogen bond networks, and local geometric preferences. It was validated against molecular dynamics and nuclear magnetic resonance ensembles, and is part of AGGRESCAN 3D [8]. Freshly reimplemented in the web server CABS-flex 2.0 [26], it allows for evaluating larger proteins with up to 2000 residues.

Figure 2. Illustration of the Caver Web workflow [30] [**]. The user enters a PDB file or PDB code. The pockets in the 3D structure are calculated and one of them is used as a starting point to calculate the tunnels to the surface. The identified tunnels can be analyzed for their properties, bottleneck residues and tunnel-lining residues. The user can enter one or multiple ligands as files, drawings, SMILES or ZINC codes, and calculate their trajectories through the selected tunnels. The user can analyze the binding energy profiles of the ligand and determine energy minima, maxima and energy barriers. The ligand trajectory and the list of bottleneck residues forming the energy barriers can be downloaded.

Figure 3. Illustration of the HotSpot Wizard 3 workflow [31] [*]. The user enters a PDB structure or a sequence that will be used to predict the structure by homology modelling. A sequence of different calculations is performed, leading to four types of hot-spot predictions: (i) functional hot-spots (nonessential residues located in functional pockets or tunnels, ranked by mutability), (ii) correlated hot-spots (co-evolving pairs of residues, obtained from consensus and correlation analysis), (iii) stability from flexibility (hot-spots with higher B-factors), and (iv) stability from consensus (hot-spots recommended to be mutated to amino acids with higher frequency in the multiple sequence alignment). The user can select the hot-spots for mutagenesis based on the integrated overview of the suggested positions, such as mutability, secondary structure, amino acid frequency and mutational landscape. The user can predict the stabilization (ΔΔG) from all the selected single-point mutations on the selected hot-spots, and combine them into multiple-point mutations. The user can also calculate the optimal DNA codon content to build smart libraries for screening the selected positions with the desired set of amino acids.
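The "stability from consensus" idea in the caption above can be illustrated by a minimal back-to-consensus sketch: positions are flagged where the query residue differs from the most frequent residue in the corresponding column of a multiple sequence alignment. The function name, the frequency threshold and the sequences are illustrative assumptions and do not reproduce the HotSpot Wizard implementation.

```python
from collections import Counter

def back_to_consensus(query, alignment, min_frequency=0.5):
    """Suggest mutations of the query towards the alignment consensus.

    query and all alignment sequences must be aligned (equal length, "-" = gap).
    Returns tuples (1-based alignment column, wild-type, consensus, frequency).
    """
    suggestions = []
    for column, wild_type in enumerate(query):
        if wild_type == "-":
            continue
        residues = [seq[column] for seq in alignment if seq[column] != "-"]
        if not residues:
            continue
        consensus, count = Counter(residues).most_common(1)[0]
        frequency = count / len(residues)
        # Flag positions where a different residue dominates the column.
        if consensus != wild_type and frequency >= min_frequency:
            suggestions.append((column + 1, wild_type, consensus, frequency))
    return suggestions

# Hypothetical aligned homologs, for illustration only.
query = "MKTA"
alignment = ["MRTA", "MRSA", "MRTA", "MKTG"]
print(back_to_consensus(query, alignment))
```

In practice, the suggested substitutions would still be checked against functional sites and energy calculations before being combined, as the caption describes.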
28. Tan ZW, Guarnera E, Tee W-V, Berezovsky IN: AlloSigMA 2: paving the way to designing allosteric effectors and to exploring allosteric effects of mutations. Nucleic Acids Res 2020, 48:W116-W124.
29. Yang J-F, Wang F, Chen Y-Z, Hao G-F, Yang G-F: LARMD: integration of bioinformatic resources to profile ligand-driven protein dynamics with a case on the activation of estrogen Brief Bioinform 2019, doi:10.1093/bib/bbz141.
* Automates the execution of short molecular dynamics simulations for analysis of ligand transport via tunnels calculated by Caver. Non-expert users can easily set up the simulations by accepting the default settings while more experienced ones can adjust calculation parameters.
30. Stourac J, Vavra O, Kokkonen P, Filipovic J, Pinto G, Brezovsky J, Damborsky J, Bednar D: Caver Web 1.0: identification of tunnels and channels in proteins and analysis of ligand transport. Nucleic Acids Res 2019, 47:W414-W422.
** Integrates the calculation of protein cavities, molecular tunnels, and the trajectories of ligands moving through those tunnels. The result is a deep knowledge of the tunnel properties, tunnel residues, and energetic maxima and minima for the ligand transport.
31. Sumbalova L, Stourac J, Martinek T, Bednar D, Damborsky J: HotSpot Wizard 3.0: web server for automated design of mutations and smart libraries based on sequence input information. Nucleic Acids Res 2018, 46:W356-W362.

Table 1. Selected web-based computational tools for enzyme engineering, classified by focus and published between 2018 and 2020. The full list of the tools is available in the Supplementary Tables S1-S6.
SOLart | http://babylone.ulb.ac.be/SOLART | To predict the solubility of a protein from its three-dimensional structure
Engineering enzyme activity and selectivity
FuncLib | http://FuncLib.weizmann.ac.il | To redesign an active site and create multiple-point designs. Based on conservation analysis and energy calculations

A common strategy for getting a good catalyst for a given substrate is to find a new natural enzyme in the genomic databases. Interestingly, there exists a vast space allowing for the discovery of novel enzymes, since the proportion of protein sequences that have not yet been biochemically characterized is huge: only 1 in every 450 protein sequences present in the NCBI nr database [35] has a record potentially encompassing functional annotation in the manually curated UniProtKB/Swiss-Prot database [36]. Despite the availability of high-throughput methods for biochemical characterization of large numbers of gene expression products [37], in silico approaches can conveniently reduce the time and costs of the process.
The task of discovering new enzymes can be tackled in different manners. A straightforward strategy is to predict the enzymatic activity of a protein from its sequence. HEC-Net [3] is a deep learning tool that exploits strategies based on sequence pattern recognition, sequence similarity and amino-acid biochemical properties to achieve a prediction accuracy of over 90% at the fourth level of the Enzyme Commission (EC) classification. Also exploiting deep learning, Bio2Rxn [4] [*] produces a consensus prediction based on six individual predictors. One of them is based on convolutional neural networks that are trained exclusively on EC-number annotated protein sequences. The other five are more traditional predictors based on sequence similarity, identification of sequence patterns and amino-acid biochemical properties. Bio2Rxn retains high precision values (over 90%) while increasing recall (close to 60%) when compared to similar tools.
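The interplay between a consensus prediction and the reported precision and recall can be illustrated with the sketch below. The consensus rule used here, a simple majority vote with a minimum number of agreeing predictors, is an assumption made for illustration and is not claimed to be the scheme used by Bio2Rxn; the function names and EC numbers are likewise hypothetical.

```python
from collections import Counter

def consensus_ec(predictions, min_votes=2):
    """Return the EC number proposed by most predictors, or None if support is too low.

    predictions: one entry per predictor, either an EC string or None (no call).
    ASSUMPTION: plain majority voting; real consensus schemes may weight predictors.
    """
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    ec_number, count = votes.most_common(1)[0]
    return ec_number if count >= min_votes else None

def precision_recall(predicted, true):
    """Precision over sequences that received a call; recall over all sequences."""
    called = [(p, t) for p, t in zip(predicted, true) if p is not None]
    correct = sum(1 for p, t in called if p == t)
    precision = correct / len(called) if called else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall

# Hypothetical outputs of six predictors for one sequence.
print(consensus_ec(["3.8.1.5", "3.8.1.5", None, "3.1.1.1", "3.8.1.5", None]))

# Hypothetical consensus calls versus true annotations for four sequences.
print(precision_recall(["3.8.1.5", None, "1.1.1.1", "2.7.1.1"],
                       ["3.8.1.5", "3.1.1.1", "1.1.1.1", "2.7.1.2"]))
```

A predictor that abstains when support is weak can keep precision high, while recall is limited by the sequences for which no confident call is made, which is one way to read the precision and recall values quoted above.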