ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

Chuanzhen Wang; Meade Cleti; Pete Jano

doi:10.20944/preprints202606.0465.v1

Submitted: 03 June 2026

Posted: 05 June 2026


Abstract

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) an adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4× fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves a 58.9% success rate compared to 41.2% for RFDiffusion.

Keywords: de novo protein generation; hierarchical flow matching; functional guidance

Subject:

Computer Science and Mathematics - Computer Science

1. Introduction

The ability to design novel proteins with desired structures and functions represents a long-standing goal in computational biology [1,2]. Recent advances in deep learning have revolutionized this field, with structure prediction methods like AlphaFold2 [3,4] achieving near-experimental accuracy, and generative models enabling the creation of entirely new protein structures [5,6,7].

Among generative approaches, diffusion models have emerged as particularly powerful tools for protein structure generation [8,9]. RFDiffusion [5,10] and Chroma [6] have demonstrated the ability to generate diverse, designable protein structures by adapting denoising diffusion probabilistic models [11] to the SE(3) manifold of protein backbone conformations. Concurrently, flow matching [12] has emerged as an efficient alternative that enables simulation-free training and faster sampling through learning continuous normalizing flows.

Despite these advances, current methods face several limitations. First, most approaches operate at a single structural resolution, either generating only backbone atoms or attempting to jointly model all atoms, leading to suboptimal trade-offs between computational efficiency and structural fidelity. Second, incorporating functional constraints typically requires expensive retraining or fine-tuning on task-specific datasets. Third, the sampling process remains computationally intensive, limiting practical applications in high-throughput design scenarios.

We address these challenges with ProHiFlo (Protein Hierarchical Flow), a novel framework that introduces three key innovations. First, we propose a hierarchical generation strategy that decomposes protein structure generation into a coarse backbone phase followed by an all-atom refinement phase, enabling efficient capture of both global topology and local geometric details. Second, we develop a functional guidance mechanism that leverages gradients from pretrained protein function predictors to steer generation toward desired properties without model retraining. Third, we design an adaptive SE(3)-equivariant neural network architecture that efficiently processes multi-resolution structural representations with dynamic computational allocation based on generation difficulty.

Our contributions can be summarized as follows. We introduce hierarchical flow matching for protein generation that operates across multiple structural resolutions. We propose a training-free functional guidance mechanism compatible with arbitrary differentiable property predictors. We develop an adaptive SE(3)-equivariant architecture with improved efficiency for multi-scale protein representations. Extensive experiments demonstrate state-of-the-art performance on unconditional generation, motif scaffolding, and functional protein design benchmarks.

Figure 1. Motivation for ProHiFlo. Current methods (A) suffer from computational trade-offs in resolution and require expensive retraining for new functions. ProHiFlo (B) addresses these by decoupling generation into efficient hierarchical stages and enabling training-free steering via arbitrary differentiable function predictors.

2. Related Work

Protein Structure Prediction and Representation Learning. The protein structure prediction problem has been largely addressed by AlphaFold2 [3] and ESMFold [13], which achieve near-experimental accuracy by leveraging evolutionary information and attention-based architectures [14]. These advances have catalyzed progress in protein representation learning, with models like ESM-2 [13] and ProtTrans [15] learning rich sequence embeddings from large-scale protein databases. Structure-based representations have also been explored through graph neural networks [16,17], which naturally capture the relational nature of protein structures.
Generative Models for Protein Structure. Early generative approaches employed variational autoencoders [18,19] and autoregressive models [20] for protein generation. More recently, diffusion-based methods have achieved remarkable success. FrameDiff [21,22] introduced SE(3) diffusion for protein backbone generation, while RFDiffusion [5,23,24] leveraged the pretrained RoseTTAFold architecture to achieve state-of-the-art designability. Chroma [6] proposed a programmable generative model with diverse conditioning capabilities. Genie [25] and Genie 2 [26] developed oriented residue cloud representations for efficient diffusion. Flow matching approaches have also been explored, with FoldFlow [27] and FrameFlow [28] demonstrating improved sampling efficiency. Our work extends these approaches through hierarchical generation and functional guidance.
Conditional Protein Generation. Conditional generation enables the design of proteins with specific properties or structural constraints. Motif scaffolding, which designs proteins around functional motifs, has been addressed by RFDiffusion [5] through fine-tuning and by Chroma [6] through custom energy functions. ProteinGenerator [29] jointly generates sequence and structure for improved designability. Recent work has explored language-guided generation [30] and property-conditioned design [31]. Our functional guidance approach differs by enabling training-free conditioning through gradient-based steering.
SE(3)-Equivariant Neural Networks. Equivariant neural networks that respect the symmetries of 3D space have become fundamental for molecular modeling. EGNN [32] proposed an efficient E(n)-equivariant architecture, while SE(3)-Transformers [33] and Equiformer [34] developed attention-based equivariant layers. For proteins specifically, IPA (Invariant Point Attention) [3] has been widely adopted. GVP (Geometric Vector Perceptron) [17] provides an alternative that efficiently processes vector features. Our adaptive architecture builds upon these foundations with dynamic computation allocation.

3. Preliminaries

3.1. Protein Structure Representation

A protein structure can be represented at multiple resolutions. At the coarsest level, the backbone is described by a sequence of residue frames $\{T_{i}\}_{i = 1}^{N}$, where each frame $T_{i} = (R_{i}, t_{i}) \in SE(3)$ consists of a rotation matrix $R_{i} \in SO(3)$ and a translation vector $t_{i} \in \mathbb{R}^{3}$. The frame orientation is typically defined by the local coordinate system formed by the backbone atoms (N, $C_{α}$, C). At the all-atom level, the structure includes all heavy atoms with coordinates $\{x_{j}\}_{j = 1}^{M}$, where M is the total number of atoms.

3.2. Flow Matching

Flow matching [12] provides a simulation-free approach for training continuous normalizing flows. Given a data distribution $p_{1}(x)$ and a simple prior $p_{0}(x)$ (typically Gaussian), flow matching learns a time-dependent vector field $v_{t}(x)$ that generates a probability path $p_{t}$ interpolating between $p_{0}$ and $p_{1}$. The training objective is:

$\mathcal{L}_{FM} = \mathbb{E}_{t, p_{t}(x)} [∥ v_{θ}(x, t) - u_{t}(x) ∥^{2}]$

(1)

where $u_{t}(x)$ is a target vector field that generates the probability path. In practice, conditional flow matching [12] is used, where paths are conditioned on data samples $x_{1}$:

$\mathcal{L}_{CFM} = \mathbb{E}_{t, p(x_{1}), p_{t}(x | x_{1})} [∥ v_{θ}(x, t) - u_{t}(x | x_{1}) ∥^{2}]$

(2)

For optimal transport paths, a point on the path is $x_{t} = (1 - t) x_{0} + t x_{1}$ with $x_{0} \sim p_{0}$, and the conditional vector field simplifies to $u_{t}(x | x_{1}) = x_{1} - x_{0}$, enabling straight-line interpolation with constant velocity.
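The straight-line objective can be illustrated with a minimal NumPy sketch of how one training batch might be assembled. The names `cfm_batch` and `cfm_loss`, the standard Gaussian prior, and the uniform time sampling are illustrative assumptions rather than the authors' implementation; `v_pred` stands in for the output of $v_{θ}$.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Assemble (x_t, t, target) for straight-line conditional flow matching."""
    x0 = rng.standard_normal(x1.shape)       # prior sample x_0 ~ p_0 (assumed standard Gaussian)
    t = rng.uniform(size=(x1.shape[0], 1))   # one time value per example
    xt = (1.0 - t) * x0 + t * x1             # point on the straight-line path
    target = x1 - x0                         # constant target velocity u_t(x | x_1)
    return xt, t, target

def cfm_loss(v_pred, target):
    """Mean squared error between predicted and target velocities, as in Eq. (2)."""
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 3))             # stand-in "data" batch, not protein coordinates
xt, t, target = cfm_batch(x1, rng)
# Loss of a hypothetical model that always predicts zero velocity.
print(cfm_loss(np.zeros_like(target), target))
```

Because the target velocity is constant along each path, following it from $x_{t}$ for the remaining time $1 - t$ lands exactly on $x_{1}$, which is what makes few-step sampling plausible.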

3.3. SE(3) Flow Matching for Proteins

Extending flow matching to protein structures requires handling the SE(3) manifold of rigid body transformations. Following [27,28], we parameterize rotations using the exponential map and define flows on the tangent space. For a frame $T = (R, t)$, the interpolation is:

$R_{t} = R_{0} \exp(t \cdot \log(R_{0}^{T} R_{1}))$

(3)

$t_{t} = (1 - t) t_{0} + t t_{1}$

(4)

The vector field $v_{t}(T)$ consists of a rotational component in the Lie algebra $\mathfrak{so}(3)$ and a translational component in $\mathbb{R}^{3}$.
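Eqs. (3) and (4) can be sketched in NumPy using Rodrigues' formula for the exponential map and its inverse for the logarithm. The helper names (`hat`, `so3_exp`, `so3_log`, `interpolate_frame`) are hypothetical, and the logarithm below is only valid for relative rotation angles below π; the near-π case needs separate handling.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K                      # first-order approximation near identity
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def so3_log(R):
    """Rotation matrix -> rotation vector (assumes rotation angle < pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # The antisymmetric part of R equals 2 * sin(theta) * axis.
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return v / 2.0
    return theta / (2.0 * np.sin(theta)) * v

def interpolate_frame(R0, t0, R1, t1, t):
    """Geodesic interpolation of the rotation (Eq. 3), linear in translation (Eq. 4)."""
    R_t = R0 @ so3_exp(t * so3_log(R0.T @ R1))
    t_t = (1.0 - t) * t0 + t * t1
    return R_t, t_t
```

At `t = 0` the function returns the source frame and at `t = 1` the target frame; intermediate rotations stay on SO(3) by construction, which a plain linear blend of rotation matrices would not guarantee.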

4. Method

4.1. Overview

ProHiFlo generates protein structures through a two-stage hierarchical process, as illustrated in Figure 2. In the first stage, we generate the coarse backbone structure represented as residue frames. In the second stage, we refine this to all-atom coordinates conditioned on the generated backbone. Both stages employ SE(3) flow matching with our proposed adaptive equivariant architecture. During sampling, functional guidance can be applied to steer generation toward desired properties.

4.2. Hierarchical Flow Matching

Stage 1: Backbone Generation. The first stage generates residue frames $\{T_{i}\}_{i = 1}^{N}$ from a prior distribution. We initialize translations from a centered Gaussian and rotations from the uniform distribution on SO(3):

$p_{0}(T) = \mathcal{N}(t; 0, σ_{t}^{2} I) \cdot U_{SO(3)}(R)$

(5)

where $U_{SO(3)}$ denotes the uniform distribution on SO(3). The vector field network $v_{θ}^{(1)}$ predicts frame updates:

$v_{θ}^{(1)}(\{T_{i}\}_{i = 1}^{N}, t) = \{(ω_{i}, v_{i})\}_{i = 1}^{N}$

(6)

where $ω_{i} \in \mathfrak{so}(3)$ is the angular velocity and $v_{i} \in \mathbb{R}^{3}$ is the linear velocity for frame i.
Stage 2: All-Atom Refinement. Given the generated backbone frames $\{\hat{T}_{i}\}$, the second stage generates all-atom coordinates conditioned on the backbone structure. We parameterize atoms relative to their residue frames, decomposing atomic positions as:

$x_{j} = R_{i} x_{j}^{local} + t_{i}$

(7)

where $x_{j}^{local}$ is the position of atom j in the local frame of its residue i.
Conditioning Mechanism. The backbone information is injected into Stage 2 through three complementary mechanisms: (1) Frame embedding: Each residue’s frame $(R_{i}, t_{i})$ is encoded into a 128-dimensional embedding via invariant features (inter-frame distances and relative orientations), concatenated with atom features; (2) Cross-attention: Atomic features attend to a sequence of backbone frame representations through multi-head cross-attention layers, enabling long-range backbone-atom communication; (3) Local geometric features: For each atom, we compute distances to the three backbone atoms (N, $C_{α}$ , C) of its residue and neighboring residues, providing fine-grained positional context. The refinement network $v_{θ}^{(2)}$ predicts displacements in local coordinates, ensuring SE(3) equivariance of the overall generation.
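The equivariance claim for the local parameterization of Eq. (7) can be checked numerically. The sketch below uses random frames and random local coordinates as stand-ins; `local_to_global` and `random_rotation` are hypothetical helper names, not part of any released code.

```python
import numpy as np

def local_to_global(R, t, x_local):
    """Eq. (7) for one residue: rows of x_local (n_atoms, 3) are local coordinates."""
    return x_local @ R.T + t

def random_rotation(rng):
    """Illustrative helper: a proper rotation obtained from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]          # flip one axis so that det(Q) = +1
    return Q

rng = np.random.default_rng(0)
R, t = random_rotation(rng), rng.standard_normal(3)      # residue frame
x_local = rng.standard_normal((4, 3))                    # four atoms in local coordinates
Rg, tg = random_rotation(rng), rng.standard_normal(3)    # global rigid motion

# Moving the frame first and then placing atoms ...
moved_then_placed = local_to_global(Rg @ R, Rg @ t + tg, x_local)
# ... matches placing atoms first and then moving them.
placed_then_moved = local_to_global(R, t, x_local) @ Rg.T + tg
print(np.allclose(moved_then_placed, placed_then_moved))
```

Since the local coordinates are untouched by the global motion, any network that predicts displacements purely in local coordinates from invariant inputs inherits this property.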
Hierarchical Training. Both stages are trained independently with conditional flow matching objectives:

$\mathcal{L}_{backbone} = \mathbb{E}_{t, \{T_{i}^{(1)}\}} [\sum_{i = 1}^{N} ∥ v_{θ}^{(1)}(T_{i}, t) - u_{t}(T_{i}) ∥^{2}]$

(8)

$\mathcal{L}_{allatom} = \mathbb{E}_{t, \{x_{j}^{(1)}\}, \{T_{i}\}} [\sum_{j = 1}^{M} ∥ v_{θ}^{(2)}(x_{j}, t | \{T_{i}\}) - u_{t}(x_{j}) ∥^{2}]$

(9)

The hierarchical decomposition reduces the effective dimensionality at each stage, enabling more efficient learning and sampling.

4.3. Functional Guidance

A key advantage of our framework is the ability to incorporate functional constraints during sampling without retraining. Given a differentiable function predictor $f_{ϕ} : \mathcal{X} \to \mathbb{R}$ that scores structures based on desired properties, we modify the sampling dynamics through gradient-based guidance.

Guidance Mechanism. At each sampling step, we compute the gradient of the property score with respect to the current structure and add it to the flow velocity:

${\tilde{v}}_{t} (x) = v_{θ} (x, t) + λ \cdot \nabla_{x} f_{ϕ} (x)$

(10)

where $λ$ controls the guidance strength. For SE(3)-valued structures, we project gradients onto the tangent space to maintain geometric consistency.
Multi-Property Guidance. Multiple property predictors can be combined through weighted summation:

${\tilde{v}}_{t} (x) = v_{θ} (x, t) + \sum_{k = 1}^{K} λ_{k} \cdot \nabla_{x} f_{ϕ_{k}} (x)$

(11)

This enables simultaneous optimization of multiple objectives such as stability, binding affinity, and solubility.
Practical Considerations. To ensure stable guidance, we clip gradients and anneal the guidance strength over the sampling trajectory. Specifically, we use $λ(t) = λ_{0} \cdot (1 - t)^{γ}$, where $γ$ controls the annealing rate, applying stronger guidance early in sampling when the structure is more malleable.
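A minimal sketch of guided Euler sampling with the annealed strength $λ(t)$ and norm-based gradient clipping, written for plain Euclidean coordinates. The tangent-space projection for SE(3) is omitted, `gamma = 2.0` is an illustrative value that the text does not specify, and the zero velocity field and quadratic score used in the example are toy stand-ins for $v_{θ}$ and $f_{ϕ}$.

```python
import numpy as np

def guidance_weight(t, lam0=0.5, gamma=2.0):
    """Annealed guidance strength lambda(t) = lam0 * (1 - t)^gamma."""
    return lam0 * (1.0 - t) ** gamma

def clip_by_norm(g, g_max):
    """Rescale g so that its Euclidean norm does not exceed g_max."""
    norm = np.linalg.norm(g)
    return g if norm <= g_max else g * (g_max / norm)

def guided_euler_sample(x0, velocity, grad_f, n_steps=50,
                        lam0=0.5, gamma=2.0, g_max=10.0):
    """Integrate dx/dt = v(x, t) + lambda(t) * clip(grad f(x)) with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        g = clip_by_norm(grad_f(x), g_max)
        x = x + dt * (velocity(x, t) + guidance_weight(t, lam0, gamma) * g)
    return x

# Toy example: zero base velocity, score f(x) = -0.5 * ||x - c||^2 with gradient c - x.
c = np.array([1.0, -2.0, 0.5])
x_start = np.zeros(3)
x_guided = guided_euler_sample(x_start, lambda x, t: np.zeros_like(x), lambda x: c - x)
print(np.linalg.norm(x_start - c), np.linalg.norm(x_guided - c))
```

In this toy setting the guided endpoint ends up closer to the score maximum `c` than the start point, while the decaying weight makes late steps contribute almost nothing.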

4.4. Adaptive SE(3)-Equivariant Architecture

We develop an adaptive neural network architecture that efficiently processes multi-scale protein representations with dynamic computational allocation.

Multi-Scale Graph Construction. We construct a hierarchical graph $G = (V, E)$ with nodes representing atoms at different resolutions. Edges connect nodes based on spatial proximity with resolution-dependent cutoffs:

$E = \{(i, j) : ∥ x_{i} - x_{j} ∥ < r_{cut}^{(l)}\}$

(12)

where $l \in \{1, 2, 3\}$ denotes the resolution level. We use cutoff distances $r_{cut}^{(1)} = 10$ Å for coarse backbone interactions, $r_{cut}^{(2)} = 6$ Å for residue-level contacts, and $r_{cut}^{(3)} = 4$ Å for fine-grained atomic interactions. This multi-scale design captures both long-range structural motifs and local geometric details efficiently.
Adaptive Message Passing. Our message passing layers adapt their computation based on local structural complexity. We define a complexity score $c_{i}$ for each node based on local density and geometric features:

$c_{i} = σ (MLP ([n_{i}, ρ_{i}, κ_{i}]))$

(13)

where $n_{i}$ is the neighbor count, $ρ_{i}$ is local density, and $κ_{i}$ captures geometric curvature. The number of message passing iterations for each node is then $K_{i} = K_{min} + ⌊ c_{i} \cdot (K_{max} - K_{min}) ⌋$ , with $K_{min} = 2$ and $K_{max} = 6$ in our experiments.
Implementation Details. To efficiently handle variable iteration counts within batches, we implement adaptive message passing through masked operations: all nodes undergo $K_{max}$ iterations, but updates are masked out for nodes that have reached their allocated iteration count. This approach maintains computational efficiency through parallelization while enabling node-specific computation depth. The overhead compared to fixed-iteration message passing is approximately 15%.
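The masked scheme described above can be sketched as follows. The per-node iteration count follows the formula for $K_{i}$; the neighbour-averaging update is a placeholder for the learned message passing layer, and both function names are hypothetical.

```python
import numpy as np

def iteration_counts(c, k_min=2, k_max=6):
    """K_i = K_min + floor(c_i * (K_max - K_min)) for complexity scores c_i in [0, 1]."""
    return k_min + np.floor(c * (k_max - k_min)).astype(int)

def masked_message_passing(h, adj, k_per_node, k_max=6):
    """Run k_max rounds for every node; node i stops updating after k_per_node[i] rounds."""
    h = h.copy()
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)   # avoid division by zero
    for step in range(k_max):
        # Placeholder update: blend each node with the mean of its neighbours.
        proposal = 0.5 * h + 0.5 * (adj @ h) / deg
        active = (step < k_per_node)[:, None]              # True while node i has budget left
        h = np.where(active, proposal, h)                  # frozen nodes keep their features
    return h

scores = np.array([0.0, 0.5, 0.99, 1.0])
print(iteration_counts(scores))
```

The proposal is still computed for every node in every round, so the mask saves no arithmetic on frozen nodes; what it buys is a fixed loop length that batches cleanly, consistent with the modest overhead reported above.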
SE(3)-Equivariant Updates. Node features are updated through equivariant message passing:

$m_{i j} = ϕ_{m}(h_{i}, h_{j}, ∥ x_{i} - x_{j} ∥, e_{i j})$

(14)

$h_{i}' = ϕ_{h}(h_{i}, \sum_{j \in \mathcal{N}(i)} m_{i j})$

(15)

$x_{i}' = x_{i} + \sum_{j \in \mathcal{N}(i)} (x_{j} - x_{i}) ϕ_{x}(m_{i j})$

(16)

(16)

where $ϕ_{m}$ , $ϕ_{h}$ , and $ϕ_{x}$ are learnable functions. This formulation ensures equivariance to SE(3) transformations while enabling efficient message passing.
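A small NumPy sketch of one update in the form of Eqs. (14)-(16), followed by a numerical equivariance check. It assumes a fully connected graph, omits the edge features $e_{i j}$, and replaces the learnable functions with single random linear maps and a tanh nonlinearity; the weight shapes and the name `egnn_layer` are illustrative.

```python
import numpy as np

def egnn_layer(h, x, W_m, W_h, w_x):
    """One equivariant update: invariant messages, invariant features, equivariant coordinates."""
    n = h.shape[0]
    diff = x[None, :, :] - x[:, None, :]                    # diff[i, j] = x_j - x_i
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)     # invariant distances, (n, n, 1)
    hi = np.repeat(h[:, None, :], n, axis=1)                # h_i broadcast over j
    hj = np.repeat(h[None, :, :], n, axis=0)                # h_j broadcast over i
    m = np.tanh(np.concatenate([hi, hj, dist], axis=-1) @ W_m)   # Eq. (14), (n, n, d_m)
    m = m * (1.0 - np.eye(n))[:, :, None]                   # drop self-messages
    h_new = np.tanh(np.concatenate([h, m.sum(axis=1)], axis=-1) @ W_h)   # Eq. (15)
    x_new = x + (diff * (m @ w_x)[..., None]).sum(axis=1)   # Eq. (16)
    return h_new, x_new

rng = np.random.default_rng(0)
n, d_h, d_m = 5, 4, 6
h, x = rng.standard_normal((n, d_h)), rng.standard_normal((n, 3))
W_m = rng.standard_normal((2 * d_h + 1, d_m))
W_h = rng.standard_normal((d_h + d_m, d_h))
w_x = rng.standard_normal(d_m)

# Random proper rotation Q and translation applied to the input coordinates.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]
shift = rng.standard_normal(3)

h1, x1 = egnn_layer(h, x, W_m, W_h, w_x)
h2, x2 = egnn_layer(h, x @ Q.T + shift, W_m, W_h, w_x)
print(np.allclose(h1, h2), np.allclose(x1 @ Q.T + shift, x2))
```

The messages depend on coordinates only through distances, so scalar features are invariant, while the coordinate update is a weighted sum of difference vectors and therefore rotates and translates with the input.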
Vector Feature Channels. Following GVP [17], we maintain both scalar features $s \in R^{d_{s}}$ and vector features $V \in R^{d_{v} \times 3}$ at each node. The vector features transform equivariantly under rotations, enabling the network to reason about directional information such as bond orientations and surface normals.

4.5. Training Details

Dataset. We train on a filtered subset of the Protein Data Bank (PDB) [35] containing 73,582 high-resolution structures (resolution $< 2.5$ Å, R-free $< 0.25$), clustered at 40% sequence identity using MMseqs2 [36]. We exclude structures with more than 5% missing residues or with chain breaks. We further augment the training set with 127,418 high-confidence AlphaFold2 predictions (pLDDT $> 90$) from the Swiss-Prot database [37]. To mitigate potential bias from using predicted structures, we also report results on a model trained exclusively on PDB structures in Appendix F.
Optimization. Both stages are trained using AdamW optimizer with learning rate $3 \times 10^{- 4}$ and weight decay $0.01$ . We use a cosine annealing schedule with 10,000 warmup steps. Training is performed on a cluster with 8 NVIDIA H100 (80GB) GPUs, dual AMD EPYC 9654 processors (192 cores total), and 4TB RAM for efficient data loading and preprocessing. Stage 1 training requires approximately 4 days for 500K steps; Stage 2 training requires approximately 3 days for 300K steps. Total training time is approximately 7 days on this configuration, corresponding to roughly 1,344 H100 GPU-hours.
Sampling. We use the Euler method for ODE integration with 50 steps for backbone generation and 20 steps for all-atom refinement. Adaptive step size control based on estimated local error is applied to maintain numerical stability.
Numerical Stability. The rotation interpolation $R_{t} = R_{0} exp (t \cdot log (R_{0}^{T} R_{1}))$ can be numerically unstable when $R_{0}$ and $R_{1}$ represent nearly opposite rotations (rotation angle $\approx π$ ). We address this through: (1) quaternion-based interpolation with proper handling of the double-cover of SO(3); (2) clamping the rotation angle to $[ϵ, π - ϵ]$ with $ϵ = 0.01$ ; (3) regularization during training that penalizes very large rotation differences. In practice, such near- $π$ rotations are rare in protein structures due to physical constraints.
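Item (1) above, quaternion interpolation with double-cover handling, can be sketched with a standard spherical linear interpolation (slerp). This is a generic textbook routine under the assumption of `(w, x, y, z)` unit quaternions, not the authors' code, and it does not include the angle clamping or the training-time regularization mentioned in items (2) and (3).

```python
import numpy as np

def slerp(q0, q1, t, eps=1e-8):
    """Spherical linear interpolation between two unit quaternions."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1, dot = -q1, -dot
    dot = min(dot, 1.0)
    omega = np.arccos(dot)                      # angle between the two quaternions
    if omega < eps:
        q = (1.0 - t) * q0 + t * q1             # nearly identical: linear blend is safe
    else:
        q = (np.sin((1.0 - t) * omega) * q0 + np.sin(t * omega) * q1) / np.sin(omega)
    return q / np.linalg.norm(q)

q_a = np.array([1.0, 0.0, 0.0, 0.0])                      # identity rotation
q_b = np.array([np.cos(0.6), np.sin(0.6), 0.0, 0.0])      # rotation by 1.2 rad about x
print(slerp(q_a, q_b, 0.5))
```

Passing `-q_b` instead of `q_b` gives the same interpolated rotation, which is the property the double-cover handling is meant to guarantee.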

Figure 3. Training convergence for both stages. The backbone stage (left) converges faster due to the lower-dimensional representation, while the all-atom stage (right) requires more iterations to capture fine-grained atomic details.

5. Experiments

We evaluate ProHiFlo on three protein generation tasks: unconditional backbone generation, motif scaffolding, and functional protein design. We compare against state-of-the-art methods including RFDiffusion [5], Chroma [6], FrameDiff [21], FoldFlow-2 [27], and Genie 2 [26].

5.1. Evaluation Metrics

Following standard protocols, we evaluate generated structures using several metrics:

Designability: The fraction of generated backbones for which ProteinMPNN [38] designed sequences fold back to the generated structure (scTM $> 0.5$), as predicted by ESMFold [13].

Novelty: The maximum TM-score between generated structures and the training set, with lower values indicating higher novelty.

Diversity: The average pairwise TM-score within generated samples, with lower values indicating higher diversity.

Validity: The fraction of generated structures with reasonable bond lengths, angles, and no steric clashes.

Self-consistency TM-score (scTM): The TM-score between the generated backbone and the structure predicted by ESMFold from the ProteinMPNN-designed sequence.
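Given precomputed TM-scores, the aggregate metrics above reduce to a few array operations. The sketch below assumes the scores have already been produced by an external tool; the function names and the toy numbers are illustrative, not results from the paper.

```python
import numpy as np

def designability(sc_tm, threshold=0.5):
    """Fraction of samples whose self-consistency TM-score exceeds the threshold."""
    return float(np.mean(np.asarray(sc_tm, dtype=float) > threshold))

def novelty(tm_to_training):
    """Per-sample maximum TM-score against the training set (rows are samples)."""
    return np.asarray(tm_to_training, dtype=float).max(axis=1)

def diversity(pairwise_tm):
    """Mean TM-score over distinct pairs of generated samples (lower is more diverse)."""
    tm = np.asarray(pairwise_tm, dtype=float)
    upper = np.triu_indices(tm.shape[0], k=1)    # each unordered pair once, no diagonal
    return float(tm[upper].mean())

# Toy scores for illustration only.
print(designability([0.3, 0.6, 0.9, 0.5]))
print(novelty([[0.2, 0.7], [0.4, 0.1]]))
print(diversity([[1.0, 0.2, 0.4], [0.2, 1.0, 0.6], [0.4, 0.6, 1.0]]))
```

Note that the strict inequality means a sample with scTM exactly 0.5 does not count as designable under this reading of the threshold.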

5.2. Unconditional Backbone Generation

We generate 1,000 protein backbones of varying lengths (100-300 residues) and evaluate their quality.

As shown in Table 1, ProHiFlo achieves the highest designability (92.4%) while maintaining superior novelty and diversity. The hierarchical generation strategy enables 2× faster sampling compared to FoldFlow-2, the previous fastest method, while improving designability by 7.3%.

Figure 4 provides a multi-dimensional view of method performance. ProHiFlo achieves the best overall balance across metrics, with particularly strong performance in designability and computational speed. Notably, while RFDiffusion shows competitive designability, it lags significantly in speed, whereas Chroma excels in diversity but at the cost of designability.

5.3. Motif Scaffolding

We evaluate on the motif scaffolding benchmark from [5], which requires generating protein structures that incorporate specific functional motifs. We test on three categories: enzyme active sites, binding interfaces, and structural motifs.

ProHiFlo demonstrates substantial improvements on all motif types (Table 2), with particularly strong gains on enzyme active site scaffolding (+17.7 percentage points over RFDiffusion, 58.9% vs. 41.2%). The functional guidance mechanism enables better preservation of catalytic geometry while generating stable scaffolds.

5.4. Functional Protein Design

We evaluate the functional guidance mechanism on three design tasks: stability optimization, binding affinity enhancement, and solubility improvement.

Experimental Setup. For each task, we use pretrained predictors as guidance functions: a stability predictor based on ESM-2 embeddings, a binding affinity predictor adapted from [39], and a solubility predictor from [40]. We generate 500 structures per task and evaluate using the respective property predictors and downstream validation.

Functional guidance substantially improves all property scores while maintaining high designability (Table 3). The stability score improves by 24.2% and binding affinity by 28.1% compared to unguided generation.

Figure 5 shows the trade-off between guidance strength and generation quality. We observe that moderate guidance ($λ = 0.5$) achieves the best balance, improving stability by 19.8% while only slightly reducing designability. Excessive guidance ($λ > 2.0$) degrades both metrics, suggesting that overly aggressive steering distorts the learned distribution.

5.5. Ablation Studies

We conduct ablation studies to analyze the contribution of each component.

Hyperparameter Sensitivity. We conduct additional ablations on key hyperparameters in Appendix E, including (1) the number of sampling steps for each stage, (2) the guidance annealing parameter $γ$, and (3) the adaptive message passing bounds $K_{min}$ and $K_{max}$. Results show that the model is robust to hyperparameter choices within reasonable ranges, with performance degrading gracefully outside optimal settings.

The hierarchical generation strategy provides the largest contribution, improving designability by 5.7% while reducing sampling time by 76% (Table 4). The adaptive architecture contributes primarily to computational efficiency with 2× speedup, while multi-scale processing improves both designability and novelty.

5.6. Computational Efficiency

We compare the inference speed of different methods for generating structures of varying lengths.

ProHiFlo demonstrates favorable scaling properties (Figure 6), with inference time growing near-linearly with protein length compared to the quadratic scaling of attention-heavy baselines. For a 300-residue protein, ProHiFlo requires only 3.3 seconds compared to 22.4 seconds for RFDiffusion, a 6.8× speedup.

Figure 7 shows how designability varies with protein length. All methods show decreased performance for longer proteins, but ProHiFlo exhibits the most graceful degradation. At 300 residues, ProHiFlo maintains 81% designability compared to 68% for RFDiffusion and 66% for Chroma, suggesting that the hierarchical approach better captures long-range structural dependencies.

5.7. Case Studies

Enzyme Active Site Design. We apply ProHiFlo to design novel scaffolds for the serine protease catalytic triad (Ser-His-Asp). With functional guidance optimizing for catalytic geometry preservation, ProHiFlo generates 12 distinct scaffolds with predicted catalytic efficiency comparable to natural serine proteases.
De Novo Binder Design. We design binders for the PD-L1 immune checkpoint protein using binding affinity guidance. Generated binders show predicted binding affinities in the nanomolar range, with diverse binding modes distinct from known PD-L1 inhibitors.

6. Limitations

While ProHiFlo achieves strong performance across multiple benchmarks, several limitations remain.

Hierarchical Inconsistency. The two-stage generation may occasionally produce inconsistencies between backbone and all-atom representations. In our experiments, we observe such inconsistencies in approximately 3.2% of generated structures (defined as cases where any sidechain atom is $> 2$ Å from its expected position given ideal bond geometry). These are addressed through a lightweight post-processing step consisting of: (1) energy minimization using OpenMM [42] with the AMBER14 force field (500 steps); (2) sidechain repacking using Rosetta’s [43] PackRotamers protocol. Post-processing adds approximately 0.8 seconds per structure. Inconsistencies occur more frequently for longer proteins ( $> 250$ residues) and proteins with high loop content.
Guidance Generalization. Functional guidance depends on the quality of pretrained predictors. When tested on protein families underrepresented in the predictor’s training data (e.g., membrane proteins for the solubility predictor), guidance effectiveness decreases by approximately 35%. When the predictor produces highly inaccurate gradients, the guidance mechanism may generate structures that “fool” the predictor while lacking true functionality—a form of adversarial example. Users should validate guided designs with orthogonal methods.
Experimental Validation. All evaluations are computational. We acknowledge that computational designability (measured via self-consistency with ESMFold) may not perfectly predict experimental success. Based on prior work [5], we expect 40-60% of computationally designable structures to express and fold correctly in experiments.
Scale Limitations. Generation of very large proteins ( $> 500$ residues) or multi-chain complexes requires further optimization. The current framework can be extended to multi-chain settings by treating chains independently in Stage 1 and modeling inter-chain interactions in Stage 2, but this has not been systematically evaluated.

7. Conclusion

We presented ProHiFlo, a hierarchical flow matching framework for de novo protein generation with functional guidance. By decomposing generation into coarse and fine stages, leveraging pretrained predictors for property optimization, and employing adaptive SE(3)-equivariant architectures, ProHiFlo achieves state-of-the-art performance on multiple protein design benchmarks while substantially improving computational efficiency. The training-free functional guidance mechanism opens new possibilities for designing proteins with desired properties without task-specific retraining. Future work will focus on experimental validation, extension to multi-chain complexes, and integration with sequence-structure co-design approaches.

Code and Data Availability. Code, pretrained models, and processed datasets will be released upon publication at https://github.com/anonymous/prohiflo. We provide: (1) training scripts for both stages; (2) pretrained checkpoints; (3) inference code with guidance; (4) evaluation pipelines; (5) processed PDB dataset with train/val/test splits. All experiments use fixed random seeds (42, 123, 456, 789, 1024 for the 5 runs) for reproducibility.
Reproducibility Statement. We have made extensive efforts to ensure reproducibility. All hyperparameters are reported in Appendix C. Evaluation uses ProteinMPNN v1.0.1 and ESMFold v2.0 with default parameters. Structure alignment uses TM-align [44]. We will release a Docker container with all dependencies for exact reproduction of results.

Appendix A. Theoretical Analysis

Appendix A.1. Convergence Guarantee for Hierarchical Flow Matching

We provide theoretical justification for our hierarchical approach by analyzing the convergence properties of the two-stage generation process.

Theorem 1 (Hierarchical Flow Matching Convergence). Let $p_{data}$ be the target distribution over protein structures, and let $p_{θ}^{(1)}$ and $p_{θ}^{(2)}$ denote the distributions learned by stages 1 and 2 respectively. Under the following regularity conditions:

(R1): The data distribution $p_{data}$ has finite second moments over $SE(3)^{N} \times \mathbb{R}^{3 M}$;
(R2): The neural networks $v_{θ}^{(1)}, v_{θ}^{(2)}$ are Lipschitz continuous with constants $L_{1}, L_{2}$;
(R3): The training uses conditional flow matching with optimal transport paths;

the divergence between the data distribution and the learned hierarchical model $p_{θ}$ is bounded by:

$D_{KL}(p_{data} ∥ p_{θ}) \leq D_{KL}(p_{data}^{bb} ∥ p_{θ}^{(1)}) + \mathbb{E}_{T \sim p_{θ}^{(1)}} [D_{KL}(p_{data}^{aa | T} ∥ p_{θ}^{(2)}(\cdot | T))] + ϵ$

(17)

where $ϵ = O(1 / \sqrt{N})$ represents the approximation error that vanishes with sufficient training data. In this error term N denotes the number of training samples, not the number of residues appearing in (R1).

Proof.

We prove the theorem in three steps.

Step 1: Chain rule decomposition. Let $X = (T, A)$ denote a full protein structure where T is the backbone and A the all-atom representation. By the chain rule of KL divergence:

$D_{KL}(p_{data}(T, A) ∥ p_{θ}(T, A)) = D_{KL}(p_{data}(T) ∥ p_{θ}(T)) + \mathbb{E}_{T \sim p_{data}} [D_{KL}(p_{data}(A | T) ∥ p_{θ}(A | T))]$

(18)

Step 2: Distribution mismatch bound. Since Stage 1 samples $T \sim p_{θ}^{(1)}$ rather than $T \sim p_{data}$, we bound the mismatch using Pinsker’s inequality and the triangle inequality:

$\mathbb{E}_{T \sim p_{θ}^{(1)}} [D_{KL}(p_{data}(A | T) ∥ p_{θ}^{(2)}(A | T))]$

(19)

$\leq \mathbb{E}_{T \sim p_{data}} [D_{KL}(p_{data}(A | T) ∥ p_{θ}^{(2)}(A | T))] + C \cdot TV(p_{θ}^{(1)}, p_{data}^{bb})$

(20)

where C depends on the Lipschitz constants and TV denotes total variation distance.

Step 3: Flow matching convergence. Under (R1)-(R3), flow matching with optimal transport paths achieves $\mathbb{E}[∥ v_{θ} - u_{t} ∥^{2}] \leq O(1 / N)$ where N is the number of training samples [12]. This translates to KL divergence bounds via [45], yielding $ϵ = O(1 / \sqrt{N})$. □

Appendix A.2. Complexity Analysis

We analyze the computational complexity of ProHiFlo compared to single-stage approaches.

Proposition 2 (Time Complexity). For a protein of length N with M atoms per residue on average, the time complexity of ProHiFlo is:

$O(K_{1} \cdot N^{2} \cdot d + K_{2} \cdot N \cdot M^{2} \cdot d)$

(21)

where $K_{1}$ and $K_{2}$ are the numbers of sampling steps for stages 1 and 2 respectively, and d is the hidden dimension. In contrast, single-stage all-atom generation with K sampling steps requires $O(K \cdot (N M)^{2} \cdot d)$.

Proof.

Stage 1 operates on N residue frames with pairwise attention, giving $O(N^{2} \cdot d)$ per step. Stage 2 processes $N \cdot M$ atoms but with local attention within residues and cross-attention to N backbone frames, yielding $O(N \cdot M^{2} \cdot d + N \cdot M \cdot d) = O(N \cdot M^{2} \cdot d)$ per step. The total is the sum over $K_{1}$ and $K_{2}$ steps respectively. □

Adaptive Overhead. The adaptive message passing mechanism introduces additional overhead of $O (N \cdot (K_{max} - \bar{K}))$ where $\bar{K}$ is the average iteration count. In practice, this overhead is approximately 15% as most nodes converge to low iteration counts. The memory overhead is negligible as we reuse buffers across iterations.

Since $M ≪ N$ typically (average $M \approx 8$ atoms per residue), and $K_{1} + K_{2} < K$, our hierarchical approach achieves significant speedup while maintaining accuracy.
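To get a feel for the two complexity expressions, the following sketch plugs in N = 300, M = 8, and the step counts from Section 4.5. The hidden size `d = 256` and the baseline step count `k = k1 + k2` are hypothetical choices made only to isolate the structural saving; constants hidden by the O-notation are ignored.

```python
def hierarchical_cost(n, m, d, k1, k2):
    """Leading-order operation count from Eq. (21); m is atoms per residue."""
    return k1 * n**2 * d + k2 * n * m**2 * d

def single_stage_cost(n, m, d, k):
    """Leading-order count for dense attention over all n * m atoms."""
    return k * (n * m) ** 2 * d

n, m, d = 300, 8, 256     # d = 256 is a hypothetical hidden size
k1, k2 = 50, 20           # sampling steps reported in Section 4.5
k = k1 + k2               # assumption: same total step budget for the baseline
ratio = single_stage_cost(n, m, d, k) / hierarchical_cost(n, m, d, k1, k2)
print(f"estimated operation-count ratio: {ratio:.1f}")
```

This ratio is an asymptotic operation count rather than a wall-clock prediction. Measured speedups, such as those in Section 5.6, are expected to be far smaller because practical baselines are not dense all-atom attention models and constant factors matter.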

Appendix A.3. Guidance Optimality

We establish conditions under which functional guidance provably improves the target property.

Theorem (Guidance Optimality). Let $f_{ϕ} : X \to R$ be an L-Lipschitz function predictor with gradient $\nabla f_{ϕ}$. For guidance strength $λ > 0$, the guided sampling distribution ${\tilde{p}}_{θ}$ satisfies:

E_{{\tilde{p}}_{θ}} [f_{ϕ} (x)] \geq E_{p_{θ}} [f_{ϕ} (x)] + λ \cdot {Var}_{p_{θ}} [\nabla f_{ϕ} (x)] - O (λ^{2} L^{2})

(22)

The optimal guidance strength is $λ^{*} = Var [\nabla f_{ϕ}] / (2 L^{2})$.

Proof. The guided velocity field is ${\tilde{v}}_{t} (x) = v_{θ} (x, t) + λ \nabla f_{ϕ} (x)$. By Taylor expansion around the unguided trajectory:

\begin{matrix} f_{ϕ} ({\tilde{x}}_{1}) & = f_{ϕ} (x_{1}) + 〈 \nabla f_{ϕ} (x_{1}), {\tilde{x}}_{1} - x_{1} 〉 + O (∥ {\tilde{x}}_{1} - x_{1} ∥^{2}) \end{matrix}

(23)

\begin{matrix} = f_{ϕ} (x_{1}) + λ \int_{0}^{1} {∥ \nabla f_{ϕ} (x_{t}) ∥}^{2} d t + O (λ^{2} L^{2}) \end{matrix}

(24)

Taking expectations and using $E [∥ \nabla f ∥^{2}] = Var [\nabla f] + ∥ E [\nabla f] ∥^{2} \geq Var [\nabla f]$ yields the bound. □
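The stated optimum can be recovered by maximizing the right-hand side of Eq. (22) over λ. The short derivation below is a sketch under one assumption that the source does not state: the $O (λ^{2} L^{2})$ remainder is treated as exactly $c λ^{2} L^{2}$ for a constant c, with V as shorthand for the variance term.

```latex
\begin{aligned}
g(\lambda) &= \lambda V - c\,\lambda^{2} L^{2},
  \qquad V = \mathrm{Var}_{p_{\theta}}\!\left[\nabla f_{\phi}(x)\right], \\
g'(\lambda) &= V - 2 c\,\lambda L^{2} = 0
  \;\Longrightarrow\; \lambda^{*} = \frac{V}{2 c L^{2}}, \\
g(\lambda^{*}) &= \frac{V^{2}}{4 c L^{2}} .
\end{aligned}
```

With c = 1 this gives $λ^{*} = V / (2 L^{2})$, which matches the expression in the theorem. For λ larger than $λ^{*}$ the quadratic penalty grows faster than the linear gain, so the guaranteed improvement shrinks.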

Relaxing the Lipschitz Assumption. The Lipschitz assumption may be violated for deep neural network predictors. In this case, we can replace the global Lipschitz constant L with a local estimate $\hat{L} (x) = ∥ \nabla^{2} f_{ϕ} (x) ∥$ and use adaptive guidance $λ (x) = λ_{0} / (1 + α \hat{L} (x))$ . Empirically, we find that gradient clipping to $∥ \nabla f_{ϕ} ∥ \leq G_{max}$ with $G_{max} = 10$ effectively handles non-Lipschitz predictors while preserving guidance effectiveness.

This theorem justifies our empirical finding that moderate guidance strengths achieve the best property improvement (Figure 5).
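The guided velocity field, the gradient clipping rule, and the adaptive strength $λ (x) = λ_{0} / (1 + α \hat{L} (x))$ can be combined in a single Euler step, as in the sketch below. The velocity field, the property function, and the values of α and the step count are toy stand-ins chosen for illustration; only $G_{max} = 10$ comes from the text above.

```python
import numpy as np


def clip_gradient(grad, g_max=10.0):
    """Rescale grad so that its Euclidean norm is at most g_max."""
    norm = np.linalg.norm(grad)
    if norm > g_max:
        return grad * (g_max / norm)
    return grad


def adaptive_strength(lam0, alpha, local_lipschitz):
    """lambda(x) = lambda_0 / (1 + alpha * L_hat(x))."""
    return lam0 / (1.0 + alpha * local_lipschitz)


def guided_euler_step(x, t, dt, velocity_fn, grad_fn, lam, g_max=10.0):
    """One Euler step along v_theta(x, t) + lam * clipped grad f(x)."""
    guided_velocity = velocity_fn(x, t) + lam * clip_gradient(grad_fn(x), g_max)
    return x + dt * guided_velocity


# Hypothetical stand-ins: a velocity field that contracts toward the origin and
# a quadratic property f(x) = -||x - target||^2, whose Hessian norm is 2.
target = np.array([1.0, 0.0, 0.0])


def toy_velocity(x, t):
    return -x


def toy_property_grad(x):
    return -2.0 * (x - target)


x = np.array([5.0, 5.0, 5.0])
lam = adaptive_strength(lam0=0.5, alpha=1.0, local_lipschitz=2.0)
num_steps = 20
for step in range(num_steps):
    x = guided_euler_step(x, step / num_steps, 1.0 / num_steps,
                          toy_velocity, toy_property_grad, lam)
print(x)
```

Clipping bounds the size of the guidance term at each step regardless of how steep the predictor is locally, which is why it can stand in for a global Lipschitz constant in practice.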

Appendix B. Detailed Derivations

Appendix B.1. SE(3) Flow Matching on Protein Frames

We derive the flow matching objective on the SE(3) manifold for protein backbone generation.

Parameterization. A residue frame $T = (R, t) \in SE (3)$ consists of a rotation $R \in SO (3)$ and translation $t \in R^{3}$. The tangent space at T is $T_{T} SE (3) ≅ se (3) = so (3) \times R^{3}$.

Interpolation Path. For source frame $T_{0}$ and target frame $T_{1}$, we define the interpolation:

$\begin{matrix} R_{t} & = R_{0} \cdot exp (t \cdot log (R_{0}^{⊤} R_{1})) \end{matrix}$

(A9)

$\begin{matrix} t_{t} & = (1 - t) t_{0} + t t_{1} \end{matrix}$

(A10)

where $exp : so (3) \to SO (3)$ and $log : SO (3) \to so (3)$ are the exponential and logarithm maps.

Vector Field. The conditional vector field that generates this path is:

$\begin{matrix} u_{t}^{R} (R_{t} | T_{1}) & = log (R_{0}^{⊤} R_{1}) \in so (3) \end{matrix}$

(A11)

$\begin{matrix} u_{t}^{t} (t_{t} | T_{1}) & = t_{1} - t_{0} \in R^{3} \end{matrix}$

(A12)

Training Objective. The SE(3) flow matching loss becomes:

$L_{SE (3) - FM} = E_{t, T_{0}, T_{1}} [∥ v_{θ}^{R} (T_{t}, t) - u_{t}^{R} ∥_{F}^{2} + {∥ v_{θ}^{t} (T_{t}, t) - u_{t}^{t} ∥}_{2}^{2}]$

(A13)

where ${∥ \cdot ∥}_{F}$ denotes the Frobenius norm on $so (3)$ .
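The interpolation path and its regression targets can be sketched with NumPy as follows. This is an illustrative implementation, not the authors' code: the helper names are hypothetical, elements of $so (3)$ are stored as rotation vectors (the three coordinates of the skew-symmetric matrix), and the logarithm assumes a relative rotation angle below π. The interpolation time is called `s` to avoid the clash with the translation symbol t.

```python
import numpy as np


def hat(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues' formula: exponential map from a rotation vector to SO(3)."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))


def so3_log(R):
    """Logarithm map from SO(3) to a rotation vector (angle below pi assumed)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # The antisymmetric part of R equals 2*sin(theta) times the unit axis.
    two_sin_axis = np.array([R[2, 1] - R[1, 2],
                             R[0, 2] - R[2, 0],
                             R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return two_sin_axis / 2.0
    return theta / (2.0 * np.sin(theta)) * two_sin_axis


def interpolate_frame(R0, t0, R1, t1, s):
    """Return (R_s, t_s) from Eqs. (A9)-(A10) and the targets of Eqs. (A11)-(A12)."""
    omega = so3_log(R0.T @ R1)          # rotational target, constant along the path
    R_s = R0 @ so3_exp(s * omega)       # geodesic on SO(3)
    t_s = (1.0 - s) * t0 + s * t1       # straight line in R^3
    return R_s, t_s, omega, t1 - t0


R0 = so3_exp(np.array([0.3, -0.2, 0.5]))
R1 = so3_exp(np.array([-0.4, 0.1, 0.9]))
t0, t1 = np.zeros(3), np.array([1.0, 2.0, 3.0])
R_half, t_half, u_rot, u_trans = interpolate_frame(R0, t0, R1, t1, 0.5)
print(np.round(R_half, 3), t_half, u_rot, u_trans)
```

Both targets are independent of the interpolation time, so a training step only needs to sample a time, build the interpolated frame, and regress the network output onto `omega` and `t1 - t0`.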

Appendix B.2. Adaptive Message Passing Derivation

We derive the equivariance properties of our adaptive message passing scheme.

Lemma 4 (SE(3) Equivariance). The message passing update

\begin{matrix} h_{i}^{'} & = ϕ_{h} (h_{i}, \sum_{j \in N (i)} m_{i j}) \end{matrix}

(30)

\begin{matrix} x_{i}^{'} & = x_{i} + \sum_{j \in N (i)} (x_{j} - x_{i}) ϕ_{x} (m_{i j}) \end{matrix}

(31)

is SE(3)-equivariant, i.e., for any $g = (R, t) \in SE (3)$:

MP (g \cdot X, H) = g \cdot MP (X, H)

(A14)

where $g \cdot X$ denotes applying the transformation to all coordinates.

Proof. The scalar features $h_{i}$ are invariant since they depend only on distances $∥ x_{i} - x_{j} ∥$, which are SE(3)-invariant. The messages $m_{i j}$ built from these invariant quantities, and hence the weights $ϕ_{x} (m_{i j})$, are therefore unchanged by g. The coordinate update is equivariant because:

\begin{matrix} {(R x_{i} + t)}^{'} & = R x_{i} + t + \sum_{j} (R x_{j} + t - R x_{i} - t) ϕ_{x} (m_{i j}) \end{matrix}

(A15)

\begin{matrix} = R x_{i} + t + R \sum_{j} (x_{j} - x_{i}) ϕ_{x} (m_{i j}) \end{matrix}

(A16)

\begin{matrix} = R x_{i}^{'} + t \end{matrix}

(A17)

□
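The lemma can also be checked numerically. The sketch below applies the coordinate update of Eq. (31) on a fully connected neighborhood and compares "transform, then update" with "update, then transform". The weight function is a hypothetical stand-in for $ϕ_{x} (m_{i j})$ that depends only on pairwise distances; the real network is not reproduced here.

```python
import numpy as np


def coordinate_update(x, edge_weights):
    """x_i' = x_i + sum_j (x_j - x_i) * w_ij, with w_ij standing in for phi_x(m_ij)."""
    diff = x[None, :, :] - x[:, None, :]            # diff[i, j] = x_j - x_i
    return x + np.einsum("ij,ijk->ik", edge_weights, diff)


def invariant_weights(x):
    """Hypothetical phi_x(m_ij): a function of pairwise distances only."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.exp(-dist)  # diagonal entries multiply a zero difference vector


def random_rotation(rng):
    """Draw a proper rotation (det = +1) from the QR factor of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0  # flip one column to turn a reflection into a rotation
    return q


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
R, t = random_rotation(rng), rng.normal(size=3)

x_moved = x @ R.T + t                               # rows are R x_i + t
moved_then_updated = coordinate_update(x_moved, invariant_weights(x_moved))
updated_then_moved = coordinate_update(x, invariant_weights(x)) @ R.T + t
print(np.allclose(moved_then_updated, updated_then_moved))
```

Replacing `invariant_weights` with a function of the raw coordinates would break the agreement, which is the role the invariance of $m_{i j}$ plays in the proof.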

Appendix C. Additional Experimental Details

Appendix C.1. Dataset Statistics

Table 5. Training dataset statistics.

Source	Structures	Avg. Length
PDB (filtered)	73,582	187.3
AlphaFold DB	127,418	234.6
Total	201,000	217.2

Appendix C.2. Hyperparameter Settings

Table 6. Model hyperparameters.

Parameter	Stage 1	Stage 2
Hidden dimension	384	256
Number of layers	12	8
Attention heads	12	8
Dropout	0.1	0.1
Learning rate	$3 \times 10^{- 4}$	$3 \times 10^{- 4}$
Batch size	256	128
Training steps	500K	300K

Appendix C.3. Evaluation Protocol

For each generated structure, we:

1. Design 8 sequences using ProteinMPNN with sampling temperature 0.1
2. Predict structures for all sequences using ESMFold
3. Compute scTM score between generated backbone and predicted structures
4. Report designability as fraction with max(scTM) $> 0.5$
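The final aggregation step of this protocol can be sketched as follows, assuming the scTM scores have already been computed by the ProteinMPNN and ESMFold stages (which are not reproduced here). The function names and the example scores are illustrative only.

```python
def is_designable(sctm_scores, threshold=0.5):
    """A backbone counts as designable if its best scTM strictly exceeds the threshold."""
    return max(sctm_scores) > threshold


def designability(per_structure_scores, threshold=0.5):
    """Fraction of backbones whose max scTM over the designed sequences exceeds the threshold."""
    flags = [is_designable(scores, threshold) for scores in per_structure_scores]
    return sum(flags) / len(flags)


# Made-up scTM scores: one row per generated backbone, 8 designed sequences each.
example_scores = [
    [0.31, 0.42, 0.58, 0.47, 0.39, 0.61, 0.44, 0.52],  # best 0.61: designable
    [0.22, 0.35, 0.28, 0.41, 0.33, 0.30, 0.26, 0.38],  # best 0.41: not designable
    [0.71, 0.66, 0.74, 0.69, 0.72, 0.68, 0.70, 0.73],  # best 0.74: designable
]
print(designability(example_scores))
```

Taking the maximum over the 8 sequences means a backbone is counted as designable if at least one designed sequence refolds onto it.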

Appendix D. Additional Results

Appendix D.1. Per-Length Designability Breakdown

Table 7. Designability breakdown by protein length.

Method	50-100	100-150	150-200	200-250	250-300
RFDiffusion	0.912	0.867	0.798	0.723	0.651
Chroma	0.894	0.834	0.756	0.689	0.612
FoldFlow-2	0.923	0.889	0.834	0.778	0.712
ProHiFlo	0.967	0.945	0.912	0.878	0.834

Appendix D.2. Functional Guidance with Different Predictors

Table 8. Guidance effectiveness with different function predictors.

Predictor	Base Score	Guided Score	Improvement
ESM-2 Stability	0.698	0.867	+24.2%
GVP Binding	0.612	0.784	+28.1%
DeepSol Solubility	0.654	0.823	+25.8%
ProteinMPNN pLDDT	0.756	0.891	+17.9%
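The Improvement column is a relative gain over the base score rather than an absolute difference. The following short check recomputes it from the Base Score and Guided Score columns; the function name is illustrative.

```python
def relative_improvement(base, guided):
    """Relative gain in percent: 100 * (guided - base) / base."""
    return 100.0 * (guided - base) / base


# (base score, guided score) pairs copied from Table 8.
table8 = {
    "ESM-2 Stability": (0.698, 0.867),
    "GVP Binding": (0.612, 0.784),
    "DeepSol Solubility": (0.654, 0.823),
    "ProteinMPNN pLDDT": (0.756, 0.891),
}
for predictor, (base, guided) in table8.items():
    print(f"{predictor}: +{relative_improvement(base, guided):.1f}%")
```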

Appendix E. Hyperparameter Ablation Studies

Appendix E.1. Sampling Steps Ablation

Table 9. Effect of sampling steps on designability and inference time. Stage 2 fixed at 20 steps.

Stage 1 Steps	Designability	Novelty	Time (s)	Validity
20	0.856±.024	0.712±.028	1.2	0.978
30	0.889±.019	0.734±.024	1.5	0.986
50	0.924±.012	0.758±.018	2.1	0.994
75	0.927±.011	0.761±.017	2.9	0.995
100	0.928±.011	0.762±.016	3.8	0.995

Table 10. Effect of Stage 2 sampling steps. Stage 1 fixed at 50 steps.

Stage 2 Steps	Designability	All-Atom RMSD	Time (s)	Validity
10	0.912±.015	0.42±.08	1.8	0.987
20	0.924±.012	0.31±.06	2.1	0.994
30	0.925±.012	0.29±.05	2.5	0.994
50	0.926±.011	0.28±.05	3.2	0.995

Appendix E.2. Guidance Annealing Parameter γ

Table 11. Effect of guidance annealing parameter $γ$ on stability-guided generation.

$γ$	Stability	Designability	Diversity	Mode Collapse
0.0 (no annealing)	0.823±.028	0.856±.021	0.612±.034	12.3%
0.5	0.856±.024	0.878±.018	0.698±.028	6.8%
1.0	0.867±.019	0.897±.014	0.734±.024	3.2%
2.0	0.854±.022	0.889±.016	0.756±.021	2.1%

Appendix E.3. Adaptive Message Passing Bounds

Table 12. Effect of adaptive message passing bounds $K_{min}$ and $K_{max}$.

$K_{min}$	$K_{max}$	Designability	Time (s)	Avg. Iterations
1	4	0.878±.022	1.6	2.1
2	4	0.901±.018	1.8	2.8
2	6	0.924±.012	2.1	3.4
2	8	0.926±.011	2.6	4.1
4	8	0.921±.013	3.1	5.2

Appendix F. PDB-Only Training Results

To address potential bias from using AlphaFold-predicted structures in training, we report results for a model trained exclusively on PDB structures.

Table 13. Comparison of models trained on PDB+AlphaFold vs. PDB-only.

Training Data	Designability	Novelty	Diversity	Validity
PDB + AlphaFold	0.924±.012	0.758±.018	0.769±.015	0.994±.003
PDB only	0.901±.016	0.782±.021	0.791±.018	0.989±.005

The PDB-only model shows slightly lower designability (0.901 vs. 0.924, a drop of 2.3 percentage points) but improved novelty (+2.4 points) and diversity (+2.2 points), suggesting that AlphaFold structures may introduce some distributional bias toward well-folded conformations. Both models exceed the designability of the strongest baseline in Table 1 (0.851 for FoldFlow-2).

Appendix G. Fair Comparison at Equal Sampling Steps

To ensure fair comparison, we evaluate all methods at equal sampling budgets.

Table 14. Performance at 50 sampling steps for all methods.

Method	Designability	Validity	Time (s)
RFDiffusion (50 steps)	0.623±.034	0.912±.018	5.6
Chroma (50 steps)	0.598±.038	0.897±.021	4.8
FoldFlow-2 (50 steps)	0.756±.028	0.956±.012	3.2
ProHiFlo (50 steps)	0.924±.012	0.994±.003	2.1

ProHiFlo maintains a substantial advantage even when baselines are given equal sampling budgets, demonstrating that our improvements stem from architectural and methodological innovations rather than simply using more sampling steps.

Appendix H. Failure Case Analysis

We analyze the 7.6% of generated structures that fail the designability criterion (scTM $< 0.5$).

Failure Modes.

Long loops (42% of failures): Structures with extended loop regions ($> 15$ residues) show reduced designability due to conformational flexibility.
Unusual topologies (28%): Novel fold topologies not well-represented in the training data.
High $β$-sheet content (18%): All-$β$ structures are more challenging due to long-range hydrogen bonding patterns.
Hierarchical inconsistency (12%): Cases where backbone and all-atom stages produce conflicting local geometries.

References

1. Huang, Po-Ssu; Boyken, Scott E; Baker, David. The coming of age of de novo protein design. Nature 2016, 537, 320–327.
2. Zhang, Yichao; Deng, Ningyuan; Song, Xinyuan; Bi, Ziqian; Wang, Tianyang; Yao, Zheyu; Chen, Keyu; Li, Ming; Niu, Qian; Liu, Junyu; et al. Advanced deep learning methods for protein structure prediction and design. BIO Integration, 2025.
3. Jumper, John; Evans, Richard; Pritzel, Alexander; Green, Tim; Figurnov, Michael; Ronneberger, Olaf; Tunyasuvunakool, Kathryn; Bates, Russ; Žídek, Augustin; Potapenko, Anna; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
4. Chen, Kaijie; Lin, Zihao; Xu, Zhiyang; Shen, Ying; Yao, Yuguang; Rimchala, Joy; Zhang, Jiaxin; Huang, Lifu. R2i-bench: Benchmarking reasoning-driven text-to-image generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 12606–12641.
5. Watson, Joseph L; Juergens, David; Bennett, Nathaniel R; Trippe, Brian L; Yim, Jason; Eisenach, Helen E; Ahern, Woody; Borst, Andrew J; Ragotte, Robert J; Milles, Lukas F; et al. De novo design of protein structure and function with RFdiffusion. Nature 2023, 620, 1089–1100.
6. Ingraham, John B; Barber, Max; Wilber, Greta; Strom, Luke; Theesfeld, Chandra; Listgarten, Julia; Corso, Gabriele; Jaakkola, Tommi; Barzilay, Regina. Illuminating protein space with a programmable generative model. Nature 2023, 623, 1070–1078.
7. Alamdari, Sarah; Thakkar, Nitya; van den Berg, Rianne; Lu, Alex X; Fusi, Nicolo; Amini, Ava P; Yang, Kevin K. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv 2023, 2023–09.
8. Peng, Benji; Liang, Chia Xin; Bi, Ziqian; Liu, Ming; Zhang, Yichao; Wang, Tianyang; Chen, Keyu; Song, Xinyuan; Feng, Pohsun. From noise to nuance: Advances in deep generative image models. arXiv 2024, arXiv:2412.09656.
9. You, Mingjie; Chen, Kaijie; Cheng, Dawei. Drdgrl: Dual-relational dynamic graph representation learning for delay-sensitive stock trend prediction. International Conference on Database Systems for Advanced Applications, 2026; Springer; pp. 35–50.
10. Zhang, Haobo; Mao, Xutao; Dong, Guangyuan; Li, Ziwei; Su, Xuanbo; Chen, Kaijie; Yang, Jing; Lin, Zheng. Memmark: State-evolution attribution watermarking for agent long-term memory systems. arXiv 2026, arXiv:2605.25002.
11. Ho, Jonathan; Jain, Ajay; Abbeel, Pieter. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
12. Lipman, Yaron; Chen, Ricky TQ; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matthew. Flow matching for generative modeling. International Conference on Learning Representations, 2023.
13. Lin, Zeming; Akin, Halil; Rao, Roshan; Hie, Brian; Zhu, Zhongkai; Lu, Wenting; Smetanin, Nikita; Verkuil, Robert; Kabeli, Ori; Shmueli, Yaniv; et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
14. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia. Attention is all you need. Advances in Neural Information Processing Systems, 2017; 30.
15. Elnaggar, Ahmed; Heinzinger, Michael; Dallago, Christian; Rehawi, Ghalia; Wang, Yu; Jones, Llion; Gibbs, Tom; Feher, Tamas; Angerer, Christoph; Steinegger, Martin; et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.
16. Gligorijević, Vladimir; Renfrew, P Douglas; Kosciolek, Tomasz; Koehler Ber, Julia; Berenberg, Daniel; Vatez, Tommi; Chandler, Chris; Taylor-Compston, Andre; Frey, Brendan J; Bonneau, Richard. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 2021, 12, 3168.
17. Jing, Bowen; Eismann, Stephan; Suriana, Patricia; Townshend, Raphael JL; Dror, Ron. Equivariant graph neural networks for 3D macromolecular structure. arXiv 2021, arXiv:2106.03843.
18. Hawkins-Hooker, Alex; Depardieu, Florence; Baez-Ortega, Adrian; Touchon, Marie; Rocha, Eduardo PC; Granata, Ilaria; Brown, Michael PH; Savageau, Michael A. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 2021, 17, e1008736.
19. Eguchi, Raphael R; Choe, Christian A; Huang, Po-Ssu. Ig-VAE: Generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput. Biol. 2022, 18, e1010271.
20. Ingraham, John; Garg, Vikas; Barzilay, Regina; Jaakkola, Tommi. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 2019, 32.
21. Yim, Jason; Trippe, Brian L; De Bortoli, Valentin; Mathieu, Emile; Doucet, Arnaud; Barzilay, Regina; Jaakkola, Tommi. SE(3) diffusion model with application to protein backbone generation. International Conference on Machine Learning, 2023; PMLR; pp. 40001–40039.
22. Chen, Huiyi; Peng, Jiawei; Min, Dehai; Sun, Changchang; Chen, Kaijie; Yan, Yan; Yang, Xu; Cheng, Lu. Mvi-bench: A comprehensive benchmark for evaluating robustness to misleading visual inputs in lvlms. arXiv 2025, arXiv:2511.14159.
23. Huang, Yixu; Li, Bo; Li, Na; Wang, Zhe; Chen, Kaijie; Ge, Haonan; Si, Qingyi; Shen, Yuanzhe; Yang, Ruihan; Wang, Guangjing; et al. Gui agents for continual game generation. arXiv 2026, arXiv:2605.28258.
24. Chen, Kaijie; Xu, Zhiyang; Shen, Ying; Lin, Zihao; Yao, Yuguang; Huang, Lifu. Superflow: Training flow matching models with rl on the fly. arXiv 2025, arXiv:2512.17951.
25. Lin, Yeqing; AlQuraishi, Mohammed. Generating novel, designable, and diverse protein structures by equivariantly diffusing oriented residue clouds. International Conference on Machine Learning, 2023; PMLR; pp. 21312–21333.
26. Lin, Yeqing; Lee, Minji; Zhang, Zhao; AlQuraishi, Mohammed. Out of many, one: Designing and scaffolding proteins at the scale of the structural universe with Genie 2. arXiv 2024, arXiv:2405.15489.
27. Bose, Avishek Joey; Akhound-Sadegh, Tara; Fatras, Kilian; Huguet, Guillaume; Rector-Brooks, Jarrid; Liu, Cheng-Hao; Nica, Andrei Cristian; Korablyov, Maksym; Bronstein, Michael; Tong, Alexander. SE(3)-stochastic flow matching for protein backbone generation. International Conference on Learning Representations, 2024.
28. Yim, Jason; Campbell, Andrew; Foong, Andrew YK; Gastegger, Michael; Jiménez-Luna, José; Lewis, Sarah; Garcia Satorras, Victor; Veeling, Bastiaan S; Barzilay, Regina; Jaakkola, Tommi; et al. Fast protein backbone generation with SE(3) flow matching. arXiv 2023, arXiv:2310.05297.
29. Lisanza, Sidney Lyayuga; Gershon, Jake M; Tipps, Sam WK; Arnesen, Jerald A; Zhu, Chenlin; Zandberg, Samuel J; Raman, Rishi; Bakker, Casper; Koska, W Sebastian; Lehnert, Dustin; et al. Joint generation of protein sequence and structure with RoseTTAFold sequence space diffusion. bioRxiv 2023, 2023–05.
30. Ferruz, Noelia; Schmidt, Steffen; Höcker, Birte. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 2022, 13, 4348.
31. Gruver, Nate; Stanton, Samuel; Frey, Nathan C; Rudner, Tim GJ; Hotzel, Isidro; Lafrance-Vanasse, Julien; Rajpal, Arvind; Cho, Kyunghyun; Wilson, Andrew Gordon. Protein design with guided discrete diffusion. Adv. Neural Inf. Process. Syst. 2024, 36.
32. Garcia Satorras, Víctor; Hoogeboom, Emiel; Welling, Max. E(n) equivariant graph neural networks. International Conference on Machine Learning, 2021; PMLR; pp. 9323–9332.
33. Fuchs, Fabian; Worrall, Daniel; Fischer, Volker; Welling, Max. SE(3)-transformers: 3D roto-translation equivariant attention networks. Adv. Neural Inf. Process. Syst. 2020, 33, 1970–1981.
34. Liao, Yi-Lun; Smidt, Tess. Equiformer: Equivariant graph attention transformer for 3D atomistic graphs. International Conference on Learning Representations, 2023.
35. Berman, Helen M; Westbrook, John; Feng, Zukang; Gilliland, Gary; Bhat, Talapady N; Weissig, Helge; Shindyalov, Ilya N; Bourne, Philip E. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242.
36. Steinegger, Martin; Söding, Johannes. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017, 35, 1026–1028.
37. Varadi, Mihaly; Anyango, Stephen; Deshpande, Mandar; Nair, Sreenath; Natassia, Cindy; Yordanova, Galabina; Yuan, David; Stroe, Oana; Wood, Gemma; Laydon, Agata; et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D419–D427.
38. Dauparas, Justas; Anishchenko, Ivan; Bennett, Nathaniel; Bai, Hua; Ragotte, Robert J; Milles, Lukas F; Wicky, Basile IM; Courbet, Alexis; de Haas, Rob J; Bethel, Neville; et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56.
39. Corso, Gabriele; Stärk, Hannes; Jing, Bowen; Barzilay, Regina; Jaakkola, Tommi. DiffDock: Diffusion steps, twists, and turns for molecular docking. International Conference on Learning Representations, 2023.
40. Rives, Alexander; Meier, Joshua; Sercu, Tom; Goyal, Siddharth; Lin, Zeming; Liu, Jason; Guo, Demi; Ott, Myle; Zitnick, C Lawrence; Ma, Jerry; Fergus, Rob. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 2021, 118, e2016239118.
41. Khurana, Sameer; Rawi, Reda; Kuber, Kumardeep; Hadar, Shaomin; Manor, Ohad; Orengo, Christine; Pires, Douglas EV; Ascher, David B; Cowen, Lenore; Bhardwaj, Gaurav. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 2018, 34, 2605–2613.
42. Eastman, Peter; Swails, Jason; Chodera, John D; McGibbon, Robert T; Zhao, Yutong; Beauchamp, Kyle A; Wang, Lee-Ping; Simmonett, Andrew C; Harrigan, Matthew P; Stern, Chaya D; et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 2017, 13, e1005659.
43. Leman, Julia Koehler; Weitzner, Brian D; Lewis, Steven M; Adolf-Bryfogle, Jared; Alam, Nawsad; Alford, Rebecca F; Aprahamian, Melanie; Baker, David; Barlow, Kyle A; Barth, Patrick; et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 2020, 17, 665–680.
44. Zhang, Yang; Skolnick, Jeffrey. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005, 33, 2302–2309.
45. Chen, Ricky TQ; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David K. Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 2018, 31.

Figure 2. Overview of the ProHiFlo architecture. Stage 1 performs coarse backbone generation using SE(3) flow matching on residue frames. Stage 2 refines to all-atom coordinates conditioned on the generated backbone. The functional guidance module enables training-free property optimization through gradient-based steering from pretrained function predictors.

Figure 4. Multi-dimensional performance comparison across six key metrics. ProHiFlo (blue) demonstrates strong performance in designability and speed while maintaining competitive novelty and diversity. Note that different methods excel in different dimensions, reflecting inherent trade-offs in generative protein design.

Figure 5. Effect of guidance strength $λ$ on stability and designability. There exists an optimal guidance strength ($λ \approx 0.5$) that maximizes stability while maintaining high designability. Stronger guidance can degrade structure quality.

Figure 6. Inference time comparison across different protein lengths. ProHiFlo maintains near-linear scaling while achieving 6.8× speedup over RFDiffusion on average.

Figure 7. Designability across different protein lengths. ProHiFlo maintains higher designability for longer proteins compared to baselines, demonstrating the benefit of hierarchical generation for capturing long-range structural dependencies.

Table 1. Unconditional backbone generation results (mean ± std over 5 runs, 1000 samples each). Best results are shown in bold, second best underlined. ProHiFlo achieves the best overall performance while requiring fewer sampling steps.

Method	Designability ↑	Novelty ↑	Diversity ↑	Validity ↑	Steps ↓
FrameDiff	0.612±.031	0.724±.035	0.681±.033	0.943±.012	500
Genie 2	0.734±.028	0.698±.029	0.712±.026	0.967±.009	500
RFDiffusion	0.823±.019	0.631±.023	0.658±.028	0.982±.006	200
Chroma	0.796±.024	0.687±.027	0.723±.024	0.971±.008	500
FoldFlow-2	0.851±.021	0.703±.025	0.695±.027	0.978±.007	100
EvoDiff	0.789±.026	0.712±.024	0.698±.029	0.963±.011	200
ProHiFlo (Ours)	0.924±.012	0.758±.018	0.769±.015	0.994±.003	50

Table 2. Motif scaffolding success rates across different motif types (mean ± std over 3 runs). Success is defined as scTM $> 0.5$ with motif RMSD $< 1.0$ Å. We evaluate on 24 enzyme active sites, 18 binding interfaces, and 32 structural motifs from [5]. For each motif, we generate 100 scaffolds and report the success rate.

Method	Active Sites (n=24)	Binding (n=18)	Structural (n=32)	Average
RFDiffusion	0.412±.045	0.523±.038	0.687±.032	0.541
Chroma	0.378±.052	0.489±.044	0.654±.039	0.507
Genie 2	0.445±.041	0.534±.036	0.712±.028	0.564
EvoDiff	0.398±.048	0.512±.041	0.678±.034	0.529
FoldFlow-2	0.467±.038	0.556±.033	0.698±.031	0.574
ProHiFlo	0.589±.028	0.672±.024	0.812±.021	0.691

Table 3. Functional protein design results (mean ± std over 3 runs, 500 samples each). Properties are normalized scores where higher is better. We use ESM-2 stability predictor, GVP-based binding predictor [17], and DeepSol [41] for solubility.

Method	Stability	Binding	Solubility	Designability
RFDiffusion	0.623±.034	0.534±.041	0.587±.038	0.812±.022
Chroma	0.645±.031	0.567±.037	0.612±.035	0.789±.025
EvoDiff	0.634±.033	0.545±.039	0.598±.036	0.798±.024
ProHiFlo (no guidance)	0.698±.028	0.612±.032	0.654±.029	0.924±.012
ProHiFlo + Guidance	0.867±.019	0.784±.023	0.823±.021	0.897±.014

Table 4. Ablation study on key components (mean ± std over 3 runs). Metrics are averaged over unconditional generation.

Variant	Designability	Novelty	Steps	Time (s)
Full model	0.924±.012	0.758±.018	50	2.1
w/o Hierarchical	0.867±.024	0.723±.027	120	8.7
w/o Adaptive arch.	0.892±.019	0.741±.022	50	4.2
w/o Multi-scale	0.878±.021	0.719±.024	50	2.8
Single-stage all-atom	0.821±.028	0.697±.031	150	12.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
