Lightweight Machine Learning for Oyster-Reef Environmental Forecasting: A Random-Forest Baseline with Classical and Deep Benchmarks

Van Le; Ransford Antwi; Tan Le

doi:10.20944/preprints202606.1440.v1

Submitted: 17 June 2026

Posted: 22 June 2026


Abstract

Short-term forecasting of coastal environmental conditions is critical for supporting oyster-reef management, early detection of stress events, and informed decision-making in dynamic estuarine systems. While deep learning models have gained traction in environmental prediction, lightweight machine learning approaches remain compelling due to their low computational cost, robustness to noisy sensor data, and interpretability. This paper presents a systematic baseline study for multivariate forecasting using a 48-hour input window and a 24-hour prediction horizon across seven key oyster-reef environmental variables. Random Forest (RF) is evaluated as the primary model and compared against four widely used benchmarks—ARIMA, Gradient Boosting (GB), LSTM, and GRU—under identical windowing, time-ordered splits, and per-feature evaluation. Results show that RF delivers strong and stable performance across most variables, outperforming ARIMA and matching or exceeding GB and recurrent neural networks. The study provides a reproducible, low-complexity baseline that supports future research in coastal forecasting, environmental risk assessment, and real-time autonomous monitoring systems.

Keywords: environmental forecasting; oyster reefs; lightweight machine learning; random forest; multivariate time series

Subject: Environmental and Earth Sciences - Atmospheric Science and Meteorology

1. Introduction

Oyster reefs are highly sensitive to short-term fluctuations in temperature, salinity, dissolved oxygen, pH, and tidal dynamics, all of which can vary rapidly in estuarine systems such as the Chesapeake Bay [1,2]. Accurate 24-hour forecasts of these environmental variables are essential for identifying stress conditions, anticipating harmful events (e.g., hypoxia, salinity intrusions, thermal spikes), and supporting operational decision-making in coastal ecosystems [3]. While deep learning architectures have shown promise for environmental prediction [4], many coastal deployments operate under tight computational and energy constraints, motivating the use of lightweight, robust, and interpretable models suitable for edge devices and autonomous monitoring platforms.

In parallel, artificial intelligence and machine learning have transformed a wide range of networked and cyber-physical systems, including edge caching and computing [5,6,7,8], smart healthcare and activity recognition [9,10,11], and more recent quantum-augmented and trustworthy AI frameworks [12,13,14,15,16]. These works demonstrate that carefully designed AI/ML pipelines—combining domain structure, resource awareness, and interpretability—can deliver practical, deployable intelligence in constrained environments. However, in the context of oyster-reef monitoring, there remains a gap between complex, high-capacity models and simple, reproducible baselines that can be readily deployed on resource-limited platforms such as edge devices and autonomous coastal systems.

This paper addresses that gap by developing and rigorously evaluating a lightweight machine learning baseline for short-term oyster-reef environmental forecasting. We focus on Random Forest (RF) as the primary model due to its nonlinear modeling capacity, robustness to noisy and heterogeneous sensor data, and favorable computational profile. RF is compared against four widely used baselines—ARIMA, Gradient Boosting (GB), LSTM, and GRU—under a unified multivariate, multi-horizon forecasting setup with a 48-hour input window and a 24-hour prediction horizon. The analysis combines theoretical insights on RF variance reduction and regularization with a detailed empirical evaluation across seven environmental variables.

The contributions of this work can be summarized as follows:

Unified forecasting formulation. We formulate a multivariate, multi-horizon oyster-reef environmental forecasting problem with a 48-hour input and 24-hour output window, using consistent time-based splits and per-feature evaluation across seven key environmental variables.
Optimized Random Forest baseline. Building on RF theory (Section 3), we design an optimized RF configuration with structural regularization, subsampling, and leaf shrinkage (Section 4), tailored to noisy coastal sensor regimes.
Systematic model comparison. We perform a controlled comparison of RF against ARIMA, Gradient Boosting, LSTM, and GRU under identical data processing, windowing, and evaluation metrics, providing a clear ranking of lightweight and deep models for this task.
Empirical validation of theory. We show that the empirical performance of RF—strong average accuracy and per-feature robustness—aligns with its theoretical variance reduction and regularization properties, thereby validating the proposed optimization strategies.
Reproducible, extensible baseline. The resulting RF-based pipeline constitutes a reproducible, low-complexity baseline that can be extended in future work to joint environmental–biological forecasting and to more advanced deep learning architectures.

2. Dataset and Problem Formulation

2.1. Notation and Variables

Let F denote the number of environmental features and S the number of monitoring sites. At each site and time index t, we observe an environmental feature vector $x_{t} \in \mathbb{R}^{F}$, written componentwise as

x_{t} = [T_{t}, S_{t}, V_{t}, H_{t}, D_{t}^{sat}, D_{t}^{mg}, pH_{t}]^{\top}, \quad F = 7,

corresponding to temperature, salinity, vertical velocity, tide height, dissolved oxygen saturation, dissolved oxygen concentration, and pH.

2.2. Multivariate, Multi-Horizon Forecasting Setup

For each site $s \in \{1, \dots, S\}$ we observe a chronological sequence

\{x_{1}^{(s)}, x_{2}^{(s)}, \dots, x_{T_{s}}^{(s)}\},

where $T_{s}$ is the number of hourly observations. We construct supervised samples using a sliding window of past environmental conditions. Given a reference time t, the input window is

X_{t} = \{x_{t - 47}, \dots, x_{t}\} \in \mathbb{R}^{48 \times F},

representing the past 48 hours. The forecasting target, covering the next 24 hours, is

Y_{t} = \{x_{t + 1}, \dots, x_{t + 24}\} \in \mathbb{R}^{24 \times F} .

Thus each supervised sample is $(X_{t}, Y_{t})$, and the number of valid windows at site s is

N_{s} = T_{s} - 48 - 24 + 1 .

The full dataset is the union across sites, i.e.

D = \bigcup_{s = 1}^{S} \{(X_{t}^{(s)}, Y_{t}^{(s)})\}_{t = 48}^{T_{s} - 24} ,

where the reference time t runs from 48 to $T_{s} - 24$ so that both windows lie inside the observed sequence, giving exactly $N_{s}$ samples per site.
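The windowing above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' pipeline: the function name `make_windows`, the list-of-lists layout, and the synthetic series are assumptions introduced here.

```python
from typing import List, Tuple

Vector = List[float]   # one hourly observation x_t with F features
Window = List[Vector]  # a block of consecutive observations


def make_windows(series: List[Vector], n_in: int = 48, n_out: int = 24
                 ) -> List[Tuple[Window, Window]]:
    """Build (X_t, Y_t) pairs from one site's chronological series."""
    samples = []
    # `end` is the 0-based index of the last observation in the input window.
    for end in range(n_in - 1, len(series) - n_out):
        x_window = series[end - n_in + 1:end + 1]   # x_{t-47}, ..., x_t
        y_window = series[end + 1:end + 1 + n_out]  # x_{t+1}, ..., x_{t+24}
        samples.append((x_window, y_window))
    return samples


# Synthetic placeholder series: 100 hourly steps, F = 7 features.
series = [[float(t)] * 7 for t in range(100)]
pairs = make_windows(series)
print(len(pairs))  # 29, i.e. 100 - 48 - 24 + 1
```

The count of generated pairs matches the expression for the number of valid windows, which is a quick sanity check when adapting the window lengths.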

2.3. Forecasting Operator

We seek a model

f_{θ} : \mathbb{R}^{48 \times F} \to \mathbb{R}^{24 \times F},

parameterized by $θ$, that maps a 48-hour input window to a 24-hour multivariate forecast, i.e.

\hat{Y}_{t} = f_{θ}(X_{t}), \quad \hat{Y}_{t} \in \mathbb{R}^{24 \times F} .

Equivalently, for each forecast horizon $h \in \{1, \dots, 24\}$, we have

\hat{x}_{t + h} = f_{θ}^{(h)}(X_{t}) \in \mathbb{R}^{F} .

This defines a standard multivariate, multi-horizon forecasting operator.

2.4. Training, Validation and Test Splits

To avoid temporal leakage, we use strictly time-ordered splits within each site: 70% training, 10% validation, and 20% testing. Splits are applied before windowing so that no window crosses a boundary.
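A time-ordered split of this kind might look as follows. The sketch assumes that the earliest 70% of each site's series is used for training, the next 10% for validation, and the final 20% for testing; that ordering and the helper name `time_ordered_split` are assumptions, not details taken from the text.

```python
def time_ordered_split(series, train_pct=70, val_pct=10):
    """Cut one site's chronological series into train/val/test segments.

    Assumes the segments appear in the order train, validation, test.
    Integer percentages avoid floating-point rounding at the boundaries.
    """
    n = len(series)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]  # the remaining observations
    return train, val, test


hours = list(range(1000))  # placeholder hourly time indices
train, val, test = time_ordered_split(hours)
print(len(train), len(val), len(test))  # 700 100 200
```

Sliding windows would then be built inside each segment separately, which is what keeps any window from straddling a split boundary.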

2.5. Evaluation Metrics

Environmental forecasting accuracy is evaluated using: (1) RMSE and MAE per feature, (2) horizon-wise RMSE/MAE curves, and (3) aggregated average RMSE/MAE across features (excluding DOSAT when noted due to scale imbalance). These metrics quantify amplitude accuracy, bias, and temporal stability across environmental regimes.
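As an illustration of the per-feature metrics and the average that leaves out DOSAT, the following sketch uses plain Python lists. The function names, the nested-list layout `[samples][horizons][features]`, and the toy numbers are assumptions made for this example only.

```python
import math


def per_feature_errors(y_true, y_pred):
    """Return (rmse, mae), each a list with one value per feature.

    Both inputs are nested lists shaped [samples][horizons][features];
    errors are pooled over all samples and horizons.
    """
    n_feat = len(y_true[0][0])
    sq_sum = [0.0] * n_feat
    abs_sum = [0.0] * n_feat
    count = 0
    for true_sample, pred_sample in zip(y_true, y_pred):
        for true_step, pred_step in zip(true_sample, pred_sample):
            count += 1
            for j in range(n_feat):
                err = pred_step[j] - true_step[j]
                sq_sum[j] += err * err
                abs_sum[j] += abs(err)
    rmse = [math.sqrt(s / count) for s in sq_sum]
    mae = [a / count for a in abs_sum]
    return rmse, mae


def mean_excluding(values, names, skip=("dosat",)):
    """Average per-feature scores, leaving out the features named in `skip`."""
    kept = [v for v, name in zip(values, names) if name not in skip]
    return sum(kept) / len(kept)


# Toy data: one sample, two horizons, two features.
y_true = [[[0.0, 0.0], [0.0, 0.0]]]
y_pred = [[[3.0, 1.0], [4.0, -1.0]]]
rmse, mae = per_feature_errors(y_true, y_pred)
print(mae)  # [3.5, 1.0]
```

Horizon-wise curves follow the same pattern, with the accumulation keyed by horizon index instead of pooled over horizons.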

2.6. Data Quality and Preprocessing

Key preprocessing steps include: (1) Resampling and alignment: all sensor channels are resampled to hourly resolution and aligned across sites; (2) Imputation: short gaps are interpolated, whilst longer gaps are masked and excluded from training windows; (3) Normalization: each feature is standardized per site using training-set statistics; and (4) Outlier handling: extreme sensor spikes are clipped or removed based on physical plausibility thresholds.
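Steps (3) and (4) can be illustrated with a small stdlib-only sketch. The clipping bounds, the helper names, and the choice of the population standard deviation are illustrative assumptions; the plausibility thresholds actually used are not stated in the text.

```python
import statistics


def fit_standardizer(train_column):
    """Compute mean and standard deviation from training data only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    return mean, (std if std > 0 else 1.0)  # guard against constant channels


def clip_and_standardize(column, mean, std, lo, hi):
    """Clip values to a plausible range [lo, hi], then standardize."""
    return [(min(max(v, lo), hi) - mean) / std for v in column]


# Hypothetical temperature readings; the bounds -5 and 40 are placeholders.
mean, std = fit_standardizer([10.0, 12.0, 14.0])
scaled = clip_and_standardize([12.0, 99.0], mean, std, lo=-5.0, hi=40.0)
print(scaled[0])  # 0.0
```

Fitting the statistics on the training segment and reusing them for validation and test data keeps information from later periods out of the scaling.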

2.7. Problem Challenges and Modeling Implications

Environmental forecasting exhibits several challenges: (1) Nonstationarity: diurnal cycles, tidal oscillations, and episodic events (storms, freshwater pulses); (2) Multivariate coupling: strong nonlinear interactions among temperature, salinity, oxygen, and pH; (3) Autocorrelation: smooth variables (e.g. tide, vertical velocity) require models robust to temporal dependence; and (4) Sensor noise: coastal deployments exhibit dropouts, drift, and calibration shifts. These considerations motivate the use of lightweight, robust models such as Random Forests, as well as classical and deep benchmarks evaluated in the following sections.

3. Random Forests for Environmental Forecasting

This section builds on the notation and forecasting operator introduced in Section 2. Given an input window $X_{t} \in \mathbb{R}^{48 \times F}$, the goal is to predict the next 24 hours of environmental conditions. We describe the Random Forest (RF) model, its multi-output regression formulation, and theoretical properties relevant to multivariate, multi-horizon forecasting.

3.1. CART Regression Tree: Vector-Valued Leaves

A single CART regression tree partitions the input space into disjoint regions $\{R_{m}\}_{m = 1}^{M}$. For multivariate regression, each leaf stores a vector mean corresponding to the F environmental variables. Let the per-sample target be $y_{i} \in \mathbb{R}^{F}$, representing the environmental feature vector at a given forecast horizon. For leaf $R_{m}$, the vector leaf mean is

\bar{y}_{m} = \frac{1}{| R_{m} |} \sum_{i : x_{i} \in R_{m}} y_{i} \in \mathbb{R}^{F} .    (1)

The tree predictor is the piecewise-constant mapping

T(x) = \sum_{m = 1}^{M} \bar{y}_{m} \mathbf{1}(x \in R_{m}), \quad T(x) \in \mathbb{R}^{F} .    (2)

At a node with sample set S, a split on feature j and threshold t produces left/right subsets

S_{L}(j, t) = \{i \in S : x_{i j} \leq t\}, \quad S_{R}(j, t) = S \setminus S_{L}(j, t) .

We use the multivariate impurity (trace of within-leaf covariance):

I(S) = \frac{1}{| S |} \sum_{i \in S} \| y_{i} - \bar{y}_{S} \|_{2}^{2}, \quad \bar{y}_{S} = \frac{1}{| S |} \sum_{i \in S} y_{i} .    (3)

The optimal split minimizes the weighted post-split impurity:

(j^{*}, t^{*}) = \arg\min_{j, t} \left[ \frac{| S_{L} |}{| S |} I(S_{L}) + \frac{| S_{R} |}{| S |} I(S_{R}) \right] .    (4)
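The split rule in Equations (3) and (4) can be made concrete with a brute-force search. The sketch below is a didactic, unoptimized illustration with hypothetical function names and toy data; production tree libraries use sorted scans and incremental statistics instead.

```python
def impurity(targets):
    """Mean squared distance of target vectors to their mean, as in Equation (3)."""
    n, dim = len(targets), len(targets[0])
    mean = [sum(y[k] for y in targets) / n for k in range(dim)]
    return sum(sum((y[k] - mean[k]) ** 2 for k in range(dim)) for y in targets) / n


def best_split(inputs, targets):
    """Exhaustive search for the (feature, threshold) pair minimizing Equation (4)."""
    n = len(inputs)
    best = (None, None, float("inf"))
    for j in range(len(inputs[0])):
        # Candidate thresholds are observed values; the largest is skipped
        # because it would leave the right child empty.
        for t in sorted({x[j] for x in inputs})[:-1]:
            left = [y for x, y in zip(inputs, targets) if x[j] <= t]
            right = [y for x, y in zip(inputs, targets) if x[j] > t]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy node: feature 0 separates the two target clusters, feature 1 does not.
inputs = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 6.0]]
targets = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]]
print(best_split(inputs, targets))  # (0, 1.0, 0.0)
```

Because the impurity sums squared errors over all target components, one split is chosen jointly for every output variable rather than separately per variable.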

3.2. Bootstrap Aggregation and Multi-Output Ensemble

Random Forest constructs B trees $\{T_{b}\}_{b = 1}^{B}$ on bootstrap (or subsampled) datasets $D^{(b)}$. Each tree outputs a vector prediction as in Equation (2). The ensemble predictor averages the tree outputs:

\hat{f}_{RF}(x) = \frac{1}{B} \sum_{b = 1}^{B} T_{b}(x) \in \mathbb{R}^{F} .    (5)

For multi-horizon forecasting, one RF model is trained per horizon $h \in \{1, \dots, 24\}$, yielding

\hat{x}_{t + h} = \hat{f}_{RF}^{(h)}(X_{t}) \in \mathbb{R}^{F} .

3.3. Random Feature Subsampling and Variance Reduction

At each split, RF selects a random subset of features $F_{split}$ with $| F_{split} | = m$ [17]. Let $Z_{b}(x) = T_{b}(x)$ denote the random vector prediction of tree b. For any unit vector $u \in \mathbb{R}^{F}$, consider the scalar projection $u^{\top} Z_{b}(x)$. If

E[u^{\top} Z_{b}] = μ_{u}, \quad Var[u^{\top} Z_{b}] = σ_{u}^{2}, \quad Corr[u^{\top} Z_{b}, u^{\top} Z_{b'}] = ρ_{u} \ (b \neq b'),

then the ensemble variance satisfies the classical decomposition [18]:

Var[u^{\top} \hat{f}_{RF}(x)] = ρ_{u} σ_{u}^{2} + \frac{1 - ρ_{u}}{B} σ_{u}^{2} .    (6)

Thus RF reduces variance by (1) averaging many trees and (2) reducing inter-tree correlation via feature subsampling.
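A short numeric sketch of Equation (6) shows how the two mechanisms act. The values of the correlation, per-tree variance, and ensemble size below are hypothetical and chosen only to show the trend; they are not estimates from the paper.

```python
def ensemble_variance(rho, sigma2, n_trees):
    """Variance of the averaged prediction, following Equation (6)."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


# Hypothetical per-tree variance of 1.0 and two correlation levels.
for rho in (0.3, 0.1):
    for n_trees in (1, 10, 150):
        value = ensemble_variance(rho, 1.0, n_trees)
        print(f"rho={rho} B={n_trees} variance={value:.4f}")
```

Adding trees shrinks only the second term, so the variance approaches a floor equal to the correlation times the per-tree variance. Lowering the correlation through feature subsampling is what lowers that floor.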

3.4. Consistency and Practical Considerations

Under mild assumptions, RF is consistent for regression and converges to the Bayes predictor [19]. For environmental forecasting, the following considerations are important: (1) Vector-valued leaves: each tree predicts all F environmental variables jointly; (2) Regularization: control tree depth, minimum leaf size, and subsampling to avoid overfitting to sensor noise; and (3) Computational efficiency: ExtraTrees [20] (random thresholds) and smaller m reduce training time with modest accuracy tradeoffs.

3.5. RF Implementation and Link to Data

In our implementation, the parameter set $θ$ corresponds to the ensemble of tree structures and leaf vectors. The forecasting operator is implemented as

f_{RF}(X_{t}) = \hat{f}_{RF}(ϕ(X_{t})),

where $ϕ(\cdot)$ denotes deterministic feature extraction from the input window (e.g., flattening, lag features, or domain aggregates). Hyperparameters (tree depth, m, B, leaf size) are selected using the validation split. Additional theoretical details are summarized in Appendix A of the online technical report [21].

4. Optimizing Random Forest for Environmental Forecasting

This section presents practical modifications to Random Forest (RF) tailored to the multivariate, multi-horizon environmental forecasting problem defined in Section 2. The optimizations are grounded in the theoretical properties of RF described in Section 3 and the consistency sketch in Appendix A of the online technical report [21]. We focus on (1) structural regularization, (2) honest estimation and subsampling, (3) leaf shrinkage for stability, (4) temporal smoothing for multi-horizon consistency, and (5) efficiency improvements such as ExtraTrees and feature grouping.

4.1. Structural Regularization and Hyperparameters

To control variance and computational cost in noisy coastal sensor regimes, we impose the following structural constraints on each tree $T_{b}$:

(C1): depth(T_{b}) \leq d_{max},

(C2): | S_{L} |, | S_{R} | \geq n_{min},

(C3): | L_{b} | \geq ℓ_{min} .

These constraints ensure that (1) leaves remain localized, reducing bias (Appendix A [21]), and (2) each leaf contains enough samples for the empirical mean in Equation (1) to be reliable. A moderate ensemble size B (typically 100–200) balances variance reduction via Equation (6) with computational efficiency.

4.2. Honest Forests and Split/Estimation Separation

To reduce adaptive bias, RF can be trained in an honest manner [22]. For each tree b, the bootstrap (or subsample) $D^{(b)}$ is split into two disjoint sets:

D^{(b)} = D_{split}^{(b)} \cup D_{est}^{(b)}, \quad D_{split}^{(b)} \cap D_{est}^{(b)} = \emptyset .

Here, $D_{split}^{(b)}$ is used to select splits by minimizing the weighted post-split impurity (Equations (3) and (4)), while $D_{est}^{(b)}$ is used to compute the leaf means (Equation (1)). Honesty directly supports the consistency conditions in Appendix A [21]: it prevents the same samples from influencing both the partition and the leaf estimate, reducing adaptive bias and improving variance control.
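The partition of one tree's sample into a splitting half and an estimation half can be sketched as below. The 50/50 ratio, the random assignment, and the function name `honest_partition` are assumptions made for illustration; the text does not specify how the two sets are drawn.

```python
import random


def honest_partition(sample_indices, split_fraction=0.5, seed=0):
    """Divide one tree's sample into disjoint split and estimation index sets."""
    rng = random.Random(seed)
    shuffled = list(sample_indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split_fraction)
    split_set = shuffled[:cut]  # used only to choose features and thresholds
    est_set = shuffled[cut:]    # used only to compute leaf means
    return split_set, est_set


split_set, est_set = honest_partition(range(100))
print(len(split_set), len(est_set))  # 50 50
```

Tree structure would then be grown on `split_set` alone, and each leaf mean recomputed from the members of `est_set` that fall into that leaf.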

4.3. Leaf Shrinkage and Regularized Estimation

Leaf predictions may be unstable when the effective sample size in a leaf is small. Shrinkage stabilizes these predictions by pulling leaf means toward the global mean:

\tilde{y}_{L} = (1 - λ_{L}) \bar{y}_{L} + λ_{L} \bar{y}, \quad λ_{L} = \frac{τ}{τ + | L |},    (7)

where $τ > 0$ is a shrinkage hyperparameter, $| L |$ is the number of samples in leaf L, and $\bar{y}$ is the global mean. This ridge-style shrinkage reduces variance, complements the variance decomposition in Equation (6), and supports the “effective sample growth” condition in Appendix A [21].
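Equation (7) translates directly into code. In this sketch the function name and the numeric values are illustrative assumptions; the example shows how the shrinkage weight falls as the leaf grows.

```python
def shrink_leaf_mean(leaf_mean, global_mean, leaf_size, tau):
    """Pull a vector leaf mean toward the global mean, following Equation (7)."""
    lam = tau / (tau + leaf_size)  # weight on the global mean
    return [(1.0 - lam) * m + lam * g for m, g in zip(leaf_mean, global_mean)]


leaf_mean = [2.0, -1.0]
global_mean = [0.0, 0.0]

# Small leaf: 10 samples with tau = 10 gives equal weight to both means.
print(shrink_leaf_mean(leaf_mean, global_mean, leaf_size=10, tau=10.0))  # [1.0, -0.5]

# Large leaf: 90 samples with tau = 10 puts weight 0.1 on the global mean.
large = shrink_leaf_mean(leaf_mean, global_mean, leaf_size=90, tau=10.0)
```

With 90 samples the leaf mean moves only one tenth of the way toward the global mean, so well-populated leaves are left almost unchanged.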

4.4. Temporal Smoothing and Multi-Horizon Consistency

Environmental variables exhibit strong temporal continuity. To encourage smooth multi-horizon forecasts, we apply a post-hoc temporal smoothing step to the 24-step prediction sequence as

\hat{x}_{t + h} \leftarrow LPF(\hat{x}_{t + 1}, \dots, \hat{x}_{t + 24}),

where $LPF(\cdot)$ denotes a low-pass filter (e.g., moving average or exponential smoothing). This improves horizon-to-horizon consistency without modifying the RF training procedure and is especially useful for variables with strong diurnal or tidal periodicity.
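One possible low-pass filter is a centered moving average applied to each variable's forecast sequence. The window width of 3 and the truncated handling of the sequence edges are assumptions for this sketch; the filter settings actually used are not given in the text.

```python
def moving_average(sequence, width=3):
    """Centered moving average; the window is truncated at the sequence edges."""
    half = width // 2
    smoothed = []
    for h in range(len(sequence)):
        lo = max(0, h - half)
        hi = min(len(sequence), h + half + 1)
        window = sequence[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy 4-step forecast for a single variable with an artificial zigzag.
forecast = [0.0, 3.0, 0.0, 3.0]
print(moving_average(forecast))  # [1.5, 1.0, 2.0, 1.5]
```

The output has the same length as the input, so the filter can be applied per variable to the 24-step sequence after prediction without touching the trained models.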

4.5. Subsampling, ExtraTrees, and Efficiency

Subsampling without replacement reduces computation and inter-tree correlation, i.e.

D^{(b)} \sim Sample(D, αN, without replacement), \quad α \in (0, 1],

where $N = | D |$ is the number of training samples. ExtraTrees [20] further accelerate training by selecting random thresholds as

t \sim Uniform(\min_{i \in S} x_{i j}, \max_{i \in S} x_{i j}),

where the minimum and maximum of feature j are taken over the samples i at the current node S, avoiding exhaustive threshold search. Both techniques reduce the correlation $ρ$ in Equation (6), improving ensemble variance reduction and supporting the variance control conditions in Appendix A [21].
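Both mechanisms are easy to express with the standard library. The helper names and the toy values below are assumptions for illustration; the threshold is drawn between the smallest and largest value of one feature among the samples at the current node.

```python
import random


def subsample(dataset, alpha, rng):
    """Draw round(alpha * N) items without replacement for one tree."""
    size = max(1, round(alpha * len(dataset)))
    return rng.sample(dataset, size)


def random_threshold(node_values, rng):
    """ExtraTrees-style threshold: uniform over one feature's range at a node."""
    return rng.uniform(min(node_values), max(node_values))


rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(100))          # placeholder sample indices
subset = subsample(data, 0.6, rng)
threshold = random_threshold([2.5, 7.0, 4.0], rng)
print(len(subset))               # 60
```

Drawing a single random threshold replaces the scan over all candidate cut points, which is where the training-time saving comes from.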

4.6. Domain-Aware Feature Grouping

Environmental variables naturally cluster (e.g., temperature–oxygen, salinity–pH, tide–velocity). Instead of sampling individual features, we sample a feature group $G_{k}$ and then a feature within the chosen group. This encourages physically meaningful splits, reduces the likelihood of spurious splits on noisy channels, and improves interpretability without altering the theoretical guarantees.
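Two-stage sampling of this kind can be sketched as follows. The group names and the exact membership of each group are assumptions made here (for instance, placing both dissolved-oxygen variables with temperature); only the three pairings themselves come from the text.

```python
import random

# Hypothetical grouping built from the pairings named in the text.
FEATURE_GROUPS = {
    "temperature_oxygen": ["temp", "dosat", "domgl"],
    "salinity_ph": ["sal", "phn"],
    "tide_velocity": ["tide", "vert"],
}


def sample_split_feature(groups, rng):
    """Pick a group uniformly at random, then a feature uniformly within it."""
    group_name = rng.choice(sorted(groups))
    feature = rng.choice(groups[group_name])
    return group_name, feature


rng = random.Random(0)
group_name, feature = sample_split_feature(FEATURE_GROUPS, rng)
print(group_name, feature)
```

Under this scheme a feature in a small group is drawn more often than one in a large group, a side effect worth keeping in mind when defining the groups.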

4.7. Practical Guidance and Hyperparameters

A practical recipe for environmental forecasting with RF is:

$B = 150$ trees;
$m = max (1, ⌊ \sqrt{F} ⌋)$ or group-based feature sampling;
$d_{max} = 12$, $ℓ_{min} = 10$, $n_{min} = 5$;
subsample fraction $α = 0.6$ ;
optional temporal smoothing on the 24-step output sequence.

These optimizations improve robustness, reduce variance, and enhance temporal stability, and they align directly with the consistency conditions summarized in Appendix A [21].

5. Results

In this section, we present the essential experimental results to validate our proposed method and its theoretical analysis. Full details of results are provided in our technical report [21].

5.1. Experimental Setup

All models are evaluated under the multivariate, multi-horizon setup defined in Section 2. For each environmental feature, models receive a 48-hour input window and produce a 24-hour forecast.

Input length: 48 hours (past observations).
Prediction length: 24 hours (future horizons).
Evaluation metrics: root mean square error (RMSE) and mean absolute error (MAE), reported per feature and per model; we also compute averages across features (with and without DOSAT, as discussed below).
Training loss: mean squared error (MSE) for LSTM and GRU; tree-based and ARIMA models use their standard regression objectives.
Hardware: all classical models (ARIMA, RF, Gradient Boosting) run efficiently on CPU; LSTM/GRU can use CPU or GPU but remain lightweight at this scale.

All models share identical windowing, time-based splits, and evaluation metrics, ensuring a controlled comparison across modeling paradigms.

5.2. Baseline Models

We compare the optimized Random Forest against four representative forecasting baselines under the same 48-hour input and 24-hour output setup. Firstly, ARIMA(2,0,2) serves as a classical linear benchmark, fitted on the preceding 48 points for each test window. Secondly, Random Forest (our primary model) trains one independent regressor per forecast horizon using 200 trees. Thirdly, Gradient Boosting employs 300 estimators with depth 3 and learning rate 0.05, also trained per horizon. Finally, LSTM and GRU models use a single recurrent layer with hidden size 32 to map the 48-step univariate sequence to a 24-step forecast.

5.3. Quantitative Performance

Table 1 reports per-feature RMSE and MAE for all five models across the seven environmental variables. A clear pattern emerges: Random Forest achieves the lowest error on six of the seven features (temp, sal, tide, dosat, domgl, and phn), confirming its robustness and ability to capture nonlinear environmental dynamics. The only exception is vertical velocity (vert), where GRU attains the lowest RMSE and MAE, reflecting the advantage of gated recurrent architectures for smooth, strongly autocorrelated signals. Gradient Boosting is typically competitive and often ranks second, while LSTM and GRU provide moderate improvements over ARIMA but, except for vert, do not surpass the tree-based models. ARIMA consistently yields the highest errors, with particularly severe degradation on dosat due to strong nonstationarity and nonlinear variability. These results empirically support the theoretical variance reduction and nonlinear modeling capabilities of RF discussed in Section 3 and Appendix A [21].

Figure 1 summarizes the average predictive error of the five models across six environmental variables (temp, sal, vert, tide, domgl, phn). The dissolved oxygen saturation feature (dosat) is excluded from this aggregation because its errors are orders of magnitude larger than those of the other variables (RMSE $\approx 1700$ for ARIMA and $\approx 14$ for Random Forest), which would dominate and distort the overall averages. By removing dosat from the mean calculation, the figure provides a more balanced and interpretable comparison of model performance across the remaining features. A detailed, feature-specific analysis, including dosat, is presented in the subsequent figures.

As shown, Random Forest achieves the lowest average error (RMSE $= 0.369$, MAE $= 0.253$), followed by Gradient Boosting (RMSE $= 0.413$, MAE $= 0.291$). The recurrent models (LSTM, GRU) obtain comparable average RMSEs (LSTM $= 0.465$, GRU $= 0.474$) but higher MAEs than Random Forest. Classical ARIMA exhibits the largest average error (RMSE $= 0.707$, MAE $= 0.482$), driven by substantial errors on several features. This aggregate view orients the reader to the overall model ranking and motivates the per-feature breakdowns that follow.

Figures 2–8 present per-feature RMSE and MAE comparisons for temperature, salinity, vertical velocity, tide, dissolved oxygen (dosat and domgl), and pH, and they show a consistent pattern. The optimized Random Forest (RF) introduced in Section 4 achieves the lowest error for six of the seven environmental variables (temp, sal, tide, dosat, domgl, and phn), demonstrating strong generalization, robustness to heterogeneous noise, and the ability to model nonlinear multi-factor interactions. The only exception is vertical velocity (vert), where the GRU model attains the lowest RMSE and MAE, reflecting the advantage of gated recurrent architectures in capturing smooth, strongly autocorrelated temporal dynamics. Gradient Boosting typically ranks second across most features, while LSTM and GRU provide moderate improvements over ARIMA but, except for vert, do not match the accuracy of the tree-based methods. These empirical trends are consistent with the RF variance reduction and regularization mechanisms discussed in Section 3 and Section 4.

5.3.1. Per-Feature Interpretation and Validation of the Optimized Random Forest

The performance trends observed across all features align closely with the design principles and optimization strategies described in Section 4. Below, we contextualize the per-feature results in light of the physical characteristics of each variable and the modeling capabilities of the baseline algorithms.

Temperature (temp). Temperature dynamics arise from nonlinear interactions among meteorological forcing, tidal mixing, and salinity gradients. The optimized RF captures these interactions effectively and benefits from ensemble averaging, which stabilizes predictions under diurnal and weather-driven variability. ARIMA’s linear structure cannot represent these nonlinearities, GradientBoosting is more sensitive to noise due to sequential boosting, and the recurrent models tend to oversmooth extremes given the limited sequence length and dataset size.
Salinity (sal). Salinity exhibits threshold-like transitions driven by freshwater pulses and tidal mixing. The RF’s split-based structure naturally accommodates such discontinuities and remains robust to outliers and regime shifts. ARIMA fails to capture abrupt changes, GradientBoosting performs competitively but is slightly more sensitive to rare events, and LSTM/GRU struggle with the mixture of periodic and event-driven behavior when data volume is limited.
Vertical velocity (vert). Vertical velocity is the one feature where RF does not achieve the lowest error. This variable is smooth and strongly autocorrelated, making it well suited to recurrent architectures. GRU, in particular, leverages gated temporal memory to track subtle flow dynamics that static models cannot represent. RF performs reasonably but lacks explicit temporal state, while ARIMA cannot model nonlinear vertical mixing and LSTM is slightly less stable than GRU on this dataset.
Tide. Although tidal motion is quasi-periodic, it is modulated by nonlinear meteorological and bathymetric effects. The optimized RF captures these nonlinear distortions with low variance, whereas ARIMA only models idealized harmonic structure and degrades when the signal deviates from pure periodicity. GradientBoosting is competitive but more prone to overfitting local patterns, and the recurrent models tend to misalign phase or oversmooth when trained on limited sequences.
Dissolved oxygen saturation (dosat). Dosat is highly nonlinear and strongly influenced by temperature, biological activity, and turbulent mixing. The optimized RF handles these multi-factor interactions and remains stable under extreme values and nonstationarity. ARIMA breaks down entirely under these conditions, producing extremely large errors. GradientBoosting performs reasonably but is more sensitive to rare high/low excursions, while LSTM/GRU require larger datasets to reliably model the noisy, multi-driver dynamics.
Dissolved oxygen concentration (domgl). Dissolved oxygen depends on nonlinear coupling among temperature, salinity, and biological processes. The RF effectively models these interactions and handles heteroscedasticity and sharp drops or spikes. ARIMA misses nonlinear coupling, GradientBoosting is close but slightly more prone to overfitting, and LSTM/GRU benefit from temporal structure but are limited by noise and dataset size.
pH (phn). pH responds nonlinearly to CO₂ dynamics, temperature, and biological activity. The optimized RF captures these relationships and remains robust to small-scale fluctuations and sensor noise. ARIMA’s linear formulation is insufficient for carbonate chemistry, GradientBoosting is competitive but slightly noisier, and LSTM/GRU do not gain enough advantage from temporal structure given the limited data.

Overall, the per-feature results validate the design choices in Section 4: the optimized Random Forest provides a strong, stable, and interpretable baseline across a wide range of environmental variables, with recurrent models offering advantages only for the most strongly autocorrelated signals such as vertical velocity.

5.3.2. Validation of the Optimized Random Forest

Figures 9–12 present a subset of the most essential diagnostics for evaluating the optimized Random Forest (RF) model. For clarity and space considerations, we include representative results for temperature (TEMP), salinity (SAL), vertical velocity (VERT), and dissolved oxygen saturation (DOSAT); the full set of per-feature diagnostics, including tide, dissolved oxygen concentration, and pH, is provided in our online technical report [21].

Each figure contains four complementary views: prediction vs. truth, scatter plot with a 1:1 reference line, error distribution, and rolling RMSE. Across the displayed variables, the optimized RF demonstrates strong generalization and low bias, consistent with the theoretical properties discussed in Section 4. Prediction–truth overlays show tight alignment, and scatter plots cluster closely around the 1:1 line, indicating accurate amplitude tracking. Error histograms remain sharply centered near zero, reflecting balanced residuals and the stabilizing effect of structural regularization and leaf-shrinkage. Rolling RMSE curves exhibit stable temporal behavior, with only brief increases during short-lived environmental disturbances such as freshwater pulses or tidal transitions.

The DOSAT results highlight RF’s robustness under extreme dynamic range and nonstationarity, while the VERT results illustrate that RF maintains low error even on smooth, autocorrelated variables where recurrent models perform slightly better. Overall, the essential diagnostics shown here—and the extended results in the technical report—confirm that the optimized RF provides a strong, interpretable, and reliable baseline for multivariate environmental forecasting.

5.4. Discussion

The results demonstrate that the optimized Random Forest (RF) model provides a strong and stable baseline for multivariate environmental forecasting. Across six of the seven environmental variables, RF achieves the lowest RMSE and MAE, validating the theoretical advantages discussed in Section 3 and Section 4. In particular, the ensemble variance reduction (Equation (6)), structural regularization, and leaf-shrinkage mechanisms contribute to RF’s robustness under noisy coastal sensor conditions. Recurrent models such as GRU outperform RF only for the most strongly autocorrelated variable (vertical velocity), consistent with their ability to maintain temporal state. Overall, the empirical findings confirm that RF offers an effective balance of nonlinear modeling capacity, interpretability, and computational efficiency for short-term environmental forecasting.

6. Conclusions and Future Work

This paper establishes a lightweight, reproducible baseline for short-term oyster-reef environmental forecasting using a unified 48-hour input and 24-hour output framework. Across a diverse set of environmental variables, Random Forest consistently delivers strong and stable performance, outperforming ARIMA and matching or exceeding Gradient Boosting and recurrent neural networks. The empirical results align closely with the theoretical properties of RF—variance reduction, robustness to noise, and effective nonlinear modeling—outlined in Section 3 and Section 4.

Future work will extend this baseline toward richer modeling tasks, including multivariate deep learning architectures, uncertainty quantification, and integration with ecological risk indicators. A natural next step is to incorporate oyster-health variables into a joint environmental–biological forecasting framework, enabling a full multi-task extension of the methods introduced here.

Acknowledgments

This work was supported in part by the U.S. National Science Foundation (NSF) under Grant NSF 2101227 and by the Virginia Space Grant Consortium (VSGC).

References

Kemp, W.M.; Boynton, W.R.; Adolf, J.E.; Boesch, D.F.; Boicourt, W.C.; Brush, G.; Cornwell, J.C.; Fisher, T.R.; Glibert, P.M.; Hagy, J.D.; et al. Eutrophication of Chesapeake Bay: historical trends and ecological interactions. Mar. Ecol. Prog. Ser. 2005, 303, 1–29.
Dai, M.; Zhao, Y.; Chai, F.; Chen, M.; Chen, N.; Chen, Y.; Cheng, D.; Gan, J.; Guan, D.; Hong, Y.; et al. Persistent eutrophication and hypoxia in the coastal ocean. Camb. Prism. Coast. Futur. 2023, 1, e19.
Breitburg, D.; Levin, L.A.; Oschlies, A.; Grégoire, M.; Chavez, F.P.; Conley, D.J.; Garçon, V.; Gilbert, D.; Gutiérrez, D.; Isensee, K.; et al. Declining oxygen in the global ocean and coastal waters. Science 2018, 359, eaam7240.
Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
Pervej, M.F.; Tan, L.T.; Hu, R.Q. Artificial Intelligence Assisted Collaborative Edge Caching in Small Cell Networks. In Proceedings of the GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020; pp. 1–7.
Tan, L.T.; Hu, R.Q. Mobility-Aware Edge Caching and Computing in Vehicle Networks: A Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2018, 67, 10190–10203.
Tan, L.T.; Hu, R.Q.; Hanzo, L. Twin-Timescale Artificial Intelligence Aided Mobility-Aware Edge Caching and Computing in Vehicular Networks. IEEE Trans. Veh. Technol. 2019, 68, 3086–3099.
Wang, Q.; Tan, L.T.; Hu, R.Q.; Qian, Y. Hierarchical Energy-Efficient Mobile-Edge Computing in IoT Networks. IEEE Internet Things J. 2020, 7, 11626–11639.
Le, T.; Shetty, S. Artificial intelligence-aided privacy preserving trustworthy computation and communication in 5G-based IoT networks. Ad. Hoc Netw. 2022, 126, 102752.
Zahin, A.; Tan, L.T.; Hu, R.Q. Sensor-Based Human Activity Recognition for Smart Healthcare: A Semi-supervised Machine Learning. In Proceedings of the Artificial Intelligence for Communications and Networks; Springer International Publishing, 2019; pp. 450–472.
Zahin, A.; Tan, L.T.; Hu, R.Q. A Machine Learning Based Framework for the Smart Healthcare Monitoring. 2020 Intermountain Engineering, Technology and Computing (IETC), 2020.
Le, T.; Reisslein, M.; Shetty, S. Multi-Timescale Actor-Critic Learning for Computing Resource Management With Semi-Markov Renewal Process Mobility. IEEE Trans. Intell. Transp. Syst. 2024, 25, 452–461.
Tan, L.; Van, L.; Sachin, S. Privacy-Aware Framework of Robust Malware Detection in Indoor Robots: Hybrid Quantum Computing and Deep Neural Networks. TechRxiv, 2025.
Tan, L.; Van, L.; Sachin, S. Quantum-Augmented AI/ML for O-RAN: Hierarchical Threat Detection with Synergistic Intelligence and Interpretability. TechRxiv, 2025.
Le, T.; Le, V. DPFAGA-Dynamic Power Flow Analysis and Fault Characteristics: A Graph Attention Neural Network. In Proceedings of the The 2025 International Conference on the AI Revolution: Research, Ethics, and Society (AIR-RES 2025), 2025. [Google Scholar]
Le, V.; Le, T. Hybrid Quantum–Classical Encoding for Accurate Residue-Level pKa Prediction. In Proceedings of the International Conference on the AI Revolution: Research, Ethics, and Society (AIR-RES 2026), 2026. [Google Scholar]
Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar] [CrossRef]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Biau, G.; Devroye, L.; Lugosi, G. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 2008, 9. [Google Scholar]
Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
Le, V.; Antwi, R.; Le, T. Lightweight Machine Learning for Oyster-Reef Environmental Forecasting: A Random-Forest Baseline with Classical and Deep Benchmarks. Available online: https://www.dropbox.com/scl/fi/uqh17lvvkcqtp1kux5unn/Techreport_LMLEP.pdf?rlkey=knuvsiac74ih4jswtlp9xvlhh&st=krq8pdvy&dl=0.
Wager, S.; Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 2018, 113, 1228–1242. [Google Scholar] [CrossRef]

Figure 1. Average error across six features (dosat excluded) for baseline models. Bars show mean RMSE and MAE per model; RandomForest achieves the lowest average RMSE (0.369) and MAE (0.253), indicating the strongest overall performance.

Figure 2. Performance comparison for temperature forecasting across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest achieves the lowest RMSE (0.5470) and MAE (0.3416), indicating superior predictive accuracy for temperature. ARIMA exhibits the highest error values, while LSTM and GRU demonstrate comparable performance with moderate improvement over ARIMA but remain less accurate than tree-based models.

Figure 3. Performance comparison for salinity (sal) prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest and GradientBoosting achieve the lowest RMSE (0.3585 and 0.3631) and MAE (0.2585 and 0.2575), respectively, outperforming the neural models. ARIMA yields higher errors, while LSTM and GRU show moderate accuracy.

Figure 4. Performance comparison for vertical velocity (vert) prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). GRU achieves the lowest RMSE (0.1095) and MAE (0.0745), followed closely by LSTM and GradientBoosting. ARIMA shows the largest errors, while tree-based and recurrent models demonstrate strong predictive capability for this feature.

Figure 5. Performance comparison for tide prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest achieves the lowest RMSE (0.0763) and MAE (0.0502), indicating the best predictive accuracy for tide variation. ARIMA produces the highest errors, while GradientBoosting, LSTM, and GRU perform similarly with small differences.

Figure 6. Performance comparison for dissolved oxygen saturation (dosat) prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest achieves the lowest RMSE (13.93) and MAE (9.48), indicating the most accurate prediction for dosat. ARIMA exhibits the highest error values, while GradientBoosting, LSTM, and GRU yield similar intermediate performance.

Figure 7. Performance comparison for dissolved oxygen concentration (domgl) prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest achieves the lowest RMSE (0.9666) and MAE (0.6548), indicating the most accurate prediction for domgl. ARIMA exhibits the highest error values, while GradientBoosting, LSTM, and GRU yield similar intermediate performance.

Figure 8. Performance comparison for pH prediction across baseline models. Left panel shows the root mean square error (RMSE), and right panel shows the mean absolute error (MAE). RandomForest achieves the lowest RMSE (0.1210) and MAE (0.0877), demonstrating superior predictive accuracy for pH. ARIMA shows the largest errors, while GradientBoosting, LSTM, and GRU perform comparably with moderate improvement over ARIMA.

Figure 9. Optimized Random Forest performance for temperature (TEMP). Panels show: prediction vs. true time series, scatter plot with 1:1 line, error distribution, and rolling RMSE. The optimized RF captures nonlinear thermal dynamics with low bias and stable temporal error.

Figure 10. Optimized Random Forest performance for salinity (SAL). RF accurately models threshold-like transitions driven by freshwater pulses and tidal mixing, with symmetric residuals and low rolling RMSE.

Figure 11. Optimized Random Forest performance for vertical velocity (VERT). Although recurrent models slightly outperform RF on this smooth, autocorrelated variable, RF still maintains low error and stable temporal behavior.

Figure 12. Optimized Random Forest performance for dissolved oxygen saturation (DOSAT). Despite the large dynamic range of DOSAT, RF preserves balanced residuals and avoids systematic bias, validating the robustness of the optimized ensemble.
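
The diagnostics named in the captions of Figures 9–12 (rolling RMSE, residual balance, bias) can be illustrated with a minimal sketch. This is not the authors' plotting code: the function names, the 24-sample window (chosen here only to mirror the 24-hour horizon, assuming hourly sampling), and the synthetic series are all assumptions made for illustration.

```python
import numpy as np


def rolling_rmse(y_true, y_pred, window=24):
    """RMSE over each run of `window` consecutive samples.

    `window=24` is an illustrative default (assumes hourly data).
    """
    err2 = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    if window < 1 or window > err2.size:
        raise ValueError("window must be between 1 and the series length")
    # A uniform kernel turns the convolution into a moving average of squared errors.
    kernel = np.ones(window) / window
    return np.sqrt(np.convolve(err2, kernel, mode="valid"))


def residual_summary(y_true, y_pred):
    """Bias (mean residual), RMSE, and MAE of prediction minus truth."""
    res = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "bias": float(res.mean()),
        "rmse": float(np.sqrt((res ** 2).mean())),
        "mae": float(np.abs(res).mean()),
    }


if __name__ == "__main__":
    # Synthetic stand-in for a sensor series; NOT data from the paper.
    rng = np.random.default_rng(0)
    t = np.arange(240)
    truth = 20.0 + 2.0 * np.sin(2 * np.pi * t / 24)
    pred = truth + rng.normal(scale=0.3, size=t.size)
    print(residual_summary(truth, pred))
    print(rolling_rmse(truth, pred).round(3)[:5])
```

A bias close to zero corresponds to the "balanced residuals" described in the captions, and a rolling RMSE that stays flat over time corresponds to "stable temporal error".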

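Table 1. Per-feature RMSE and MAE comparison across baseline models. RandomForest achieves the lowest RMSE on six of the seven features and the lowest MAE on five; the exceptions are vert, where GRU is best on both metrics, and the sal MAE, where GradientBoosting is marginally lower (0.2575 vs. 0.2585).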

Feature	RMSE					MAE
	ARIMA	Random Forest	Gradient Boosting	LSTM	GRU	ARIMA	Random Forest	Gradient Boosting	LSTM	GRU
temp	1.0848	0.5470	0.7041	0.8626	0.8605	0.7209	0.3416	0.4847	0.5982	0.5911
sal	0.4444	0.3585	0.3631	0.5329	0.5194	0.3135	0.2585	0.2575	0.4225	0.3880
vert	0.2093	0.1350	0.1168	0.1129	0.1095	0.1386	0.0948	0.0839	0.0808	0.0745
tide	0.2079	0.0763	0.0850	0.0862	0.0841	0.1386	0.0502	0.0590	0.0574	0.0549
dosat	1665.95	13.93	15.11	21.25	20.41	38.30	9.48	10.32	15.69	15.04
domgl	2.0535	0.9666	1.0366	1.0683	1.0208	1.3064	0.6548	0.7121	0.7530	0.7139
phn	0.1948	0.1210	0.1285	0.1573	0.1616	0.1440	0.0877	0.0928	0.1211	0.1251
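
The comparison in Table 1 can be checked mechanically. The sketch below transcribes the table values and reports, for each feature, which model has the smallest error, plus the relative error reduction of one model against another. The helper names (`best_model`, `win_counts`, `relative_reduction`) are hypothetical and are not part of the authors' code.

```python
# Values transcribed from Table 1.
# Column order: ARIMA, Random Forest, Gradient Boosting, LSTM, GRU.
MODELS = ["ARIMA", "Random Forest", "Gradient Boosting", "LSTM", "GRU"]

RMSE = {
    "temp":  [1.0848, 0.5470, 0.7041, 0.8626, 0.8605],
    "sal":   [0.4444, 0.3585, 0.3631, 0.5329, 0.5194],
    "vert":  [0.2093, 0.1350, 0.1168, 0.1129, 0.1095],
    "tide":  [0.2079, 0.0763, 0.0850, 0.0862, 0.0841],
    "dosat": [1665.95, 13.93, 15.11, 21.25, 20.41],
    "domgl": [2.0535, 0.9666, 1.0366, 1.0683, 1.0208],
    "phn":   [0.1948, 0.1210, 0.1285, 0.1573, 0.1616],
}

MAE = {
    "temp":  [0.7209, 0.3416, 0.4847, 0.5982, 0.5911],
    "sal":   [0.3135, 0.2585, 0.2575, 0.4225, 0.3880],
    "vert":  [0.1386, 0.0948, 0.0839, 0.0808, 0.0745],
    "tide":  [0.1386, 0.0502, 0.0590, 0.0574, 0.0549],
    "dosat": [38.30, 9.48, 10.32, 15.69, 15.04],
    "domgl": [1.3064, 0.6548, 0.7121, 0.7530, 0.7139],
    "phn":   [0.1440, 0.0877, 0.0928, 0.1211, 0.1251],
}


def best_model(table):
    """Map each feature to the model with the smallest error."""
    return {feat: MODELS[vals.index(min(vals))] for feat, vals in table.items()}


def win_counts(table):
    """Count, per model, the number of features on which it has the smallest error."""
    counts = {name: 0 for name in MODELS}
    for winner in best_model(table).values():
        counts[winner] += 1
    return counts


def relative_reduction(table, feature, model, reference="ARIMA"):
    """Fractional error reduction of `model` relative to `reference` on one feature."""
    vals = table[feature]
    return 1.0 - vals[MODELS.index(model)] / vals[MODELS.index(reference)]


if __name__ == "__main__":
    print("Lowest RMSE per feature:", best_model(RMSE))
    print("Lowest MAE per feature: ", best_model(MAE))
    print("RMSE wins:", win_counts(RMSE))
    print("MAE wins: ", win_counts(MAE))
    print("RF vs ARIMA, temp RMSE reduction:",
          round(relative_reduction(RMSE, "temp", "Random Forest"), 3))
```

Because the errors are reported in each variable's own units, the tally compares models within a feature rather than averaging across features.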

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
