An Improved Deep Learning Framework for In-Situ Detection of Geometric Keypoints of Heliostats in Concentrated Solar Power Plants

Fen Xu; Hongyu Miao

doi:10.20944/preprints202606.0866.v1

Submitted: 10 June 2026

Posted: 11 June 2026


Abstract

In-situ detection of heliostat tracking poses from a single image can improve tracking accuracy and reduce the calibration workload in a large-scale concentrated solar power (CSP) plant, because traditional methods normally require a heliostat to be taken off tracking while it is calibrated. This paper presents a deep-learning-based keypoint detection framework for in-situ detection and calibration of heliostat pose. The proposed framework is built upon YOLOv8-Pose but integrates a high-resolution P2 feature branch to recover fine-grained spatial details that are otherwise lost in deep semantic layers. Further, a geometry-consistency loss is introduced to regularize the predicted quadrilateral and promote structural integrity under dynamically changing illumination. Experiments on a dataset of operating heliostats show that the proposed framework achieves an end-to-end inference speed of 25.14 FPS, which makes in-situ detection of heliostat tracking poses feasible in CSP plants. The mean end-point error (EPE) of the detected keypoints is about 1.89 pixels, and the stringent mAP@0.5:0.95 metric reaches 0.9847. In future work, the proposed framework can be integrated with in-field heliostat control systems to further improve the working efficiency of heliostats in large-scale CSP plants.

Keywords: heliostat; geometric keypoints; in-situ detection; deep learning

Subject: Engineering - Control and Systems Engineering

1. Introduction

Owing to their economic competitiveness and the relative safety of their large-scale energy storage, concentrated solar power (CSP) plants are playing an increasingly important role in modern renewable power systems [1,2]. The working efficiency of a CSP station depends largely on the tracking accuracy of the thousands of heliostats that act as its sun-tracking devices. If this accuracy is not well controlled, the station may fail to reach its designated output power because of substantial solar-flux spillage and non-uniform thermal loading on the central receiver, which can even shorten the receiver’s service life [3,4]. Calibrating heliostat tracking accuracy is therefore a routine requirement for the stable operation of a commercial CSP station [3,4,5].

At present, most calibration methods work offline or intrusively. For example, Beam Characterization Systems (BCS) require maneuvering individual heliostats away from the central receiver to direct their reflected beams onto a Lambertian-like target for measurement of tracking errors [6,7,8,9]. Using BCS to calibrate all heliostats of a 50 MW CSP plant may take weeks, during which dynamic operational errors accumulate without compensation. Consequently, modern solar engineering is progressively transitioning from static offline characterization toward dynamic online servo control. This paradigm shift necessitates the deployment of high-frequency, non-intrusive measuring gauges [10,11,12,13]. Recent studies have begun to apply learning-based vision directly to heliostat metrology, including a hybrid YOLOv5-plus-Hough-Transform pipeline for corner retrieval and downstream PnP pose estimation [14], differentiable ray-tracing surface reconstruction [15,16], and the broader landscape of AI-driven heliostat control surveyed in [17].

Embedding machine vision feedback into the control loop of a heliostat tracking system faces several challenges. First, the visual cues of heliostats are often corrupted by over-exposure and mirror occlusions of the heliostat field [18,19,20,21,22]. Traditional keypoint detectors and pose estimation networks [23,24,25,26,27,28] often suffer from structural instability and high-frequency coordinate jitter in such situations. Secondly, the highly reflective nature of the heliostat surface makes visual detection of objects difficult. A highly specular surface has little intrinsic texture and instead displays the texture of reflected objects. The signal that a generic keypoint detector relies on, namely a stable local appearance pattern at a fixed object-relative position, therefore does not exist for the mirror facet. The most reliable visual cues are the corner points of the mirror facets, which appear as a high-contrast quadrilateral against an inhomogeneous specular background. This observation directly motivates a corner-based, geometry-aware formulation of the measurement task rather than a holistic pose-regression formulation.

To close this gap in vision-based feedback control, this study presents a deep-learning-based keypoint detection framework explicitly optimized for metrological evaluation of heliostats. Two innovations are introduced in the framework: (1) a high-resolution P2 feature branch integrated into the network’s neck to recover fine-grained boundary cues; and (2) a physically-driven geometric consistency loss (

L_{geo}

) based on perspective-compatible geometric consistency that promotes topological stability against extreme optical noise [13,29]. The proposed system achieves a processing throughput of 25.14 FPS, a rate compatible with the latency and frequency budget typically required for closed-loop control of a servo system.

The contributions of this paper are summarized as follows. First, we adapt a YOLOv8-Pose backbone [30] into a heliostat-corner detection network by adding a high-resolution P2 feature branch that preserves the boundary cues otherwise destroyed by deep down-sampling, and we visualize the per-level feature energy to substantiate this design. Second, we introduce a geometry-consistency loss term

L_{geo}

that combines a log edge-ratio penalty (an image-plane structural consistency term) with a diagonal-intersection penalty (a projective-invariant term), both compatible with the perspective imaging geometry of an off-axis heliostat. Third, we report ablation studies and an application-oriented performance discussion of the framework, with emphasis on the quantitative metrics relevant to field deployment.

2. Related Work

2.1. Heliostat Calibration and Vision-Based Measurement

Heliostat calibration has long been dominated by Beam Characterization Systems (BCS) and their camera-augmented variants, which are accurate but intrinsically offline and intrusive (Section 1) [4,5,6,7]. Vision-based online alternatives have therefore been explored, including camera observations of the receiver flux pattern, fiducial-marker placement on the heliostat support structure, and direct image-based detection of the mirror facet. The latter family is closest in spirit to the present work: it does not require physical fiducials on the mirror, does not interrupt power generation, and can be sampled at video rate [10,11,12,13]. Recent learning-based instances of this family include a hybrid YOLOv5 detector combined with a Hough-Transform corner extractor for EPnP-based pose recovery [14] and differentiable-ray-tracing surface inverse-rendering for in-situ metrology [15,16]; compared with these detection-plus-classical-CV cascades, the present detector uses a single end-to-end, geometry-regularised keypoint regressor, removing the post-processing latency and the failure modes that the Hough stage introduces when the facet boundary is partially saturated. Open challenges in this family include robustness to specular glare, the lack of intra-facet texture [20,21], and the need to deliver geometrically consistent observations to a downstream controller.

2.2. Deep Learning-Based Keypoint Detection

Modern keypoint detection has been driven by two parallel directions: holistic top-down regressors and high-resolution heatmap predictors [23,24,25,26]. Single-stage detectors such as the YOLO-Pose family [27,30] attach a keypoint head to an object detector, allowing both localization and per-instance keypoint regression in one network pass; they are attractive for industrial deployment because they offer a favourable speed/accuracy trade-off on a single GPU. Heatmap-based pipelines instead predict per-keypoint spatial response maps and decode coordinates by spatial argmax, often with multi-resolution feature fusion [31,32]. Coordinate-classification methods such as SimCC reformulate localization as one-dimensional classification along each image axis [33], while RTMPose pushes the inference latency to the millisecond range on commodity hardware [28]. Both families are typically benchmarked on natural-scene human-pose datasets where targets exhibit rich, repeatable textures and where small misalignments between predicted keypoints are tolerable. Heliostat corner localization deviates from these assumptions: there is essentially no intra-facet texture, the four corners must form a topologically valid quadrilateral at every frame, and isolated coordinate jitter is not acceptable as it propagates directly into servo commands. Consequently, off-the-shelf keypoint detectors transfer imperfectly, motivating the addition of high-resolution branches and structural constraints described in Section 3.

2.3. Geometry-Constrained Keypoint Detection

A recurring strategy in industrial vision is to inject geometric priors into a learning-based pipeline so that predictions remain physically plausible under noise. Examples include pairwise distance and angle penalties, ratio-based shape losses, projective and homography-consistency objectives, and explicit reprojection terms tied to a known camera model [13,29]. The pertinent observation for the heliostat setting is that, under an off-axis pinhole camera viewing a tilted rectangular mirror, parallel-edge and right-angle constraints are violated by the perspective projection itself. Consequently, meaningful regularizers must either be invariant under the perspective projection (cross-ratios, line-incidence relations, and intersections of corresponding lines) or at least remain compatible with it. Our

L_{geo}

combines a log edge-length-ratio term with an offset term on the diagonal-intersection point relative to its ground-truth location. The diagonal-intersection term is a projective invariant (collinearity and the cross-ratio along each diagonal are preserved by perspective projection), whereas the edge-length-ratio term is best understood as image-plane structural consistency: it is not strictly invariant under perspective, but it is perspective-compatible in the sense that its target value is read directly from the ground-truth quadrilateral and therefore does not contradict the imaging geometry. Related ideas appear in scene-text quadrilateral detection, document-corner regression, and document dewarping, where a four-corner topology must be preserved against noisy cues [13,29].

3. Materials and Methods

3.1. Overall Pipeline

The proposed online tracking-error measurement system is a single-camera pipeline (Figure 1) that converts a continuous image stream from a fixed observation viewpoint into a time-resolved geometric error signal suitable for closed-loop servo feedback. At each time step, the camera captures an RGB frame containing one or several heliostat facets visible from the observation viewpoint. The frame is forwarded to the modified YOLOv8-Pose detector [30], which predicts, per visible facet, four ordered corner keypoints (top-left, top-right, bottom-left, bottom-right). These four 2D keypoints are then compared with the geometric configuration that the facet is commanded to have at the same instant, and the residual between observed and commanded geometry serves as the tracking-error signal that is forwarded to the heliostat controller [34].

This pipeline frames the deep network not as a stand-alone classifier but as a measurement instrument inserted into a control loop. Two design consequences follow. First, the output must be available at a frequency compatible with the servo control loop; for the prototype reported here the end-to-end throughput is 25.14 FPS (Section 5.1, Table 2), which is sufficient for the slow mechanical dynamics of a heliostat drive. Second, the per-frame output must be geometrically consistent: the four predicted corners must form a valid (non-self-intersecting, non-collapsed) quadrilateral that approximates the projected facet shape, otherwise the downstream controller would receive structurally invalid measurements and could execute large, abrupt commands.
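The validity requirement above can be made concrete with a small check. The following is a minimal sketch, not the authors' implementation: it assumes the TL, TR, BL, BR corner ordering used in this paper, treats a quadrilateral as valid when it is strictly convex (which also excludes self-intersection), and uses a hypothetical `min_area` threshold in square pixels to reject collapsed shapes.

```python
def is_valid_quad(corners, min_area=1.0):
    """Return True if four corners form a convex, non-collapsed quadrilateral.

    corners: [(x, y), ...] in TL, TR, BL, BR order (the ordering used in the paper).
    min_area: hypothetical threshold in square pixels, chosen for illustration.
    """
    tl, tr, bl, br = corners
    ring = [tl, tr, br, bl]  # walk the perimeter instead of the annotation order
    turns = []
    twice_area = 0.0
    for i in range(4):
        ax, ay = ring[i]
        bx, by = ring[(i + 1) % 4]
        cx, cy = ring[(i + 2) % 4]
        # z-component of the cross product of two consecutive edges
        turns.append((bx - ax) * (cy - by) - (by - ay) * (cx - bx))
        # shoelace term for the signed area
        twice_area += ax * by - bx * ay
    same_turn = all(t > 0 for t in turns) or all(t < 0 for t in turns)
    return same_turn and abs(twice_area) / 2.0 >= min_area


square = [(0, 0), (10, 0), (0, 10), (10, 10)]
twisted = [(0, 0), (10, 10), (0, 10), (10, 0)]   # TR and BR swapped
collapsed = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]
print(is_valid_quad(square), is_valid_quad(twisted), is_valid_quad(collapsed))
```

A frame that fails such a check could simply be skipped rather than forwarded to the controller; that policy is an assumption here, not something the paper specifies.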

3.2. Problem Formulation and Optical Challenges

The objective of this study is to perform high-speed and accurate localization of the four image-plane corners of a heliostat mirror from a monocular RGB image. This 2D measurement constitutes the foundational geometric observation required for the downstream closed-loop servo control of CSP systems. Treating the four corners as the primary observable is deliberate: a rectangular facet is fully determined, up to a rigid-body pose with respect to the camera, by the projected positions of its four corners, and four 2D–3D correspondences are sufficient for a downstream PnP-style pose estimator [13,35]. Compared with detecting the full facet contour, four-corner detection is also markedly more robust under partial occlusion because losing one corner still leaves enough geometric information to flag the frame as degenerate and skip it.

Under a standard pinhole camera model, a 3D point

P_{w} = {[X_{w}, Y_{w}, Z_{w}, 1]}^{⊤}

residing on the heliostat surface is mapped to the 2D image coordinate

p = {[x, y, 1]}^{⊤}

via the perspective projection matrix:

z_{c} p = K [R ∣ t] P_{w}

(1)

where

K

is the camera intrinsic matrix,

[R ∣ t]

denotes the extrinsic pose parameters relative to the mirror, and

z_{c}

is the depth of the 3D point in the camera frame (the homogeneous-coordinate scale factor). The symbol

z_{c}

is reserved for this geometric depth and is distinct from the scalar loss weights

λ_{geo}

and

λ_{inter}

introduced later. Because

z_{c}

varies continuously across the tilted mirror plane, affine properties—such as the parallelism between opposite edges of the rectangular mirror—are violated in the perspective image plane. Any structural regularization applied to the detection network therefore cannot rely on simplistic parallel-line assumptions, and must instead invoke projective invariants such as cross-ratios and line incidences [13].
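The loss of parallelism described above can be reproduced numerically. The sketch below applies Eq. (1) to a rectangle with hypothetical intrinsics, rotation angles and facet dimensions (none of these values come from the paper) and evaluates the 2D cross product of opposite edges, which would be zero if the edges stayed parallel in the image.

```python
import numpy as np


def project(K, R, t, pts_w):
    """Pinhole projection z_c * p = K [R | t] P_w for an (N, 3) array of points."""
    cam = pts_w @ R.T + t                  # points in the camera frame
    uvw = cam @ K.T                        # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]


def cross2(a, b):
    """z-component of the 2D cross product; zero for parallel vectors."""
    return a[0] * b[1] - a[1] * b[0]


# Hypothetical camera and facet pose, for illustration only.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 640.0],
              [0.0, 0.0, 1.0]])
ax, ay = np.deg2rad(40.0), np.deg2rad(30.0)
Rx = np.array([[1.0, 0.0, 0.0],
               [0.0, np.cos(ax), -np.sin(ax)],
               [0.0, np.sin(ax), np.cos(ax)]])
Ry = np.array([[np.cos(ay), 0.0, np.sin(ay)],
               [0.0, 1.0, 0.0],
               [-np.sin(ay), 0.0, np.cos(ay)]])
R = Ry @ Rx
t = np.array([0.0, 0.0, 10.0])

# A 2 m x 1 m rectangle in its own plane, ordered TL, TR, BL, BR.
rect = np.array([[-1.0, -0.5, 0.0], [1.0, -0.5, 0.0],
                 [-1.0, 0.5, 0.0], [1.0, 0.5, 0.0]])
corners, depth = project(K, R, t, rect)

top, bottom = corners[1] - corners[0], corners[3] - corners[2]
left, right = corners[2] - corners[0], corners[3] - corners[1]
print("depths z_c:", depth)
print("top x bottom:", cross2(top, bottom))
print("left x right:", cross2(left, right))
```

Both cross products are non-zero for this pose because the depth differs from corner to corner, which is why a parallel-edge penalty would contradict the imaging geometry.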

Given an input image tensor

X \in R^{H \times W \times 3}

, the detection framework predicts, per visible facet, a set of four corner keypoints

\hat{Q} = \{{\hat{q}}_{k}\}_{k = 0}^{3}

, where

{\hat{q}}_{k} = ({\hat{x}}_{k}, {\hat{y}}_{k})

denotes the predicted image coordinate of the k-th corner. The corresponding ground-truth set annotated by domain experts is denoted by

Q = \{q_{k}\}_{k = 0}^{3}

. We adopt a fixed corner ordering—top-left (TL), top-right (TR), bottom-left (BL) and bottom-right (BR)—during both annotation and prediction, so that diagonal correspondences (TL↔BR and TR↔BL) are well defined and can be used inside the geometric loss without combinatorial ambiguity.

The YOLOv8-Pose architecture is adopted as the baseline detector because of its efficient Cross Stage Partial backbone (not to be confused with the abbreviation CSP, which this paper reserves for concentrated solar power), which is well suited to industrial edge-computing deployments [30,36]. However, its default feature pyramid (relying heavily on the P3 to P6 levels) entails excessive spatial downsampling, which over-smooths the high-frequency geometric discontinuities essential for precision metrology. Pixel-accurate corner localization in this regime is fundamentally limited by the feature stride at the level from which the keypoint head reads its features; a pose head attached to P3–P6 sees an effective stride of at least 8 pixels relative to the input resolution, which is insufficient when the facet edge spans only a few tens of pixels in the captured image.
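The stride argument can be illustrated with simple arithmetic. The sketch below assumes the 1280-pixel input size reported in Section 4.1 and a hypothetical facet edge of 40 pixels (standing in for "a few tens of pixels"), and lists how many feature cells cover that edge at each pyramid level.

```python
INPUT_SIZE = 1280  # training resolution reported in Section 4.1
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}
FACET_EDGE_PX = 40  # hypothetical facet edge length, for illustration only


def cells_across(edge_px, stride):
    """Number of feature-map cells spanned by an edge of the given pixel length."""
    return edge_px / stride


for level, stride in STRIDES.items():
    grid = INPUT_SIZE // stride
    print(f"{level}: stride {stride:2d}, grid {grid}x{grid}, "
          f"{cells_across(FACET_EDGE_PX, stride):.2f} cells per facet edge")
```

Under these assumptions a 40-pixel edge spans 5 cells at P3 but fewer than one cell at P6, which gives a rough sense of why a finer level is attractive for corner localization.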

3.3. Resolution-Preserving Feature Fusion and Energy Analysis

To address the critical attenuation of micro-structures at mirror boundaries, a high-resolution P2 feature level is integrated into the network’s neck (Figure 2). In the standard YOLOv8-Pose-P6 stack, the keypoint head receives features from levels P3 through P6, whose spatial strides relative to the input range from 8 to 64. We extend the multi-scale fusion path so that the keypoint head additionally reads a P2 feature with an effective stride of 4, which approximately doubles the spatial precision available at the head input without altering the backbone’s semantic depth or the pretrained weight initialization for the deeper levels [31,32]. To empirically validate this architectural intervention, we extract and quantify the feature response intensity across all pyramid levels. For a given feature tensor

F^{(l)} \in R^{C_{l} \times H_{l} \times W_{l}}

at pyramid level l, the channel-wise activation is aggregated using root-mean-square (RMS) energy:

E^{(l)} (h, w) = \sqrt{\frac{1}{C_{l}} \sum_{c = 1}^{C_{l}} {(F_{c}^{(l)} (h, w))}^{2}}

(2)

Following a robust quantile normalization process to suppress background sensor noise, the resulting pseudo-color energy maps (Figure 2a–e) reveal a critical phenomenon: the P2 feature exhibits profound spatial convergence, tightly anchoring to the mirror’s physical boundaries. In contrast, deeper semantic levels (P5, P6) respond broadly to the global scene context, losing the localized precision needed for accurate few-pixel-level corner measurement. From the perspective of measurement, this is the expected behaviour: deeper levels integrate large receptive fields that are useful for object presence but have already discarded the boundary location information by spatial pooling, whereas P2 retains short-range gradient cues that are co-located with the actual facet corners. Adding P2 into the head input therefore restores the high-frequency information that subsequent loss minimization can exploit; the deeper levels continue to contribute by gating the per-anchor visibility classification and confidence estimation.
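A minimal NumPy sketch of the energy map in Eq. (2), followed by a quantile-based normalization, is given below. The 2% and 98% quantile levels and the synthetic feature tensor are assumptions made for illustration; the paper does not state which quantiles were used.

```python
import numpy as np


def rms_energy(feat):
    """Eq. (2): feat has shape (C, H, W); returns the (H, W) channel-wise RMS map."""
    return np.sqrt(np.mean(feat.astype(np.float64) ** 2, axis=0))


def quantile_normalize(energy, lo=0.02, hi=0.98):
    """Clip to the [lo, hi] quantiles and rescale to [0, 1] (quantile levels assumed)."""
    q_lo, q_hi = np.quantile(energy, [lo, hi])
    scaled = (energy - q_lo) / max(q_hi - q_lo, 1e-12)
    return np.clip(scaled, 0.0, 1.0)


# Synthetic stand-in for a pyramid feature F^(l): weak noise plus one strong "edge".
rng = np.random.default_rng(0)
feat = rng.normal(scale=0.1, size=(16, 32, 32))
feat[:, 8, 8:24] += 2.0
heat = quantile_normalize(rms_energy(feat))
print(heat.shape, heat.min(), heat.max())
```

The resulting array can be passed to any colormap routine to obtain a pseudo-color visualization comparable in spirit to Figure 2.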

3.4. Geometry Consistency Regularization Under Perspective Projection

The standard keypoint regression head in generic pose estimation architectures optimizes spatial coordinates by minimizing a visibility-masked distance metric:

L_{kpt} = \frac{1}{N} \sum_{k = 0}^{3} v_{k} \cdot D ({\hat{q}}_{k}, q_{k}),

(3)

where

v_{k} \in \{0, 1\}

indicates keypoint visibility, and

D

is typically derived from the Object Keypoint Similarity (OKS) metric, for example as 1 − OKS, because OKS itself is a similarity score rather than a distance [37]. However, this point-wise independent optimization ignores the holistic topological constraints of the quadrilateral. Under severe optical noise or partial glare, independent regression frequently yields topologically implausible concave shapes or severe twists. From a measurement-engineering point of view this is a critical defect: even if the average pixel error is small, a single mispositioned corner that crosses one of its neighbours converts the predicted facet into a self-intersecting quadrilateral, and the geometric error estimator downstream cannot interpret such a frame.

A naive remedy would be to penalize departures from a rigid rectangular shape, e.g. by enforcing parallel opposite edges or right angles. As discussed in Section 3.2, a rectangular facet projects to a non-rectangular quadrilateral under an off-axis pinhole projection, so such a penalty would push the network toward physically wrong predictions. The remedy must therefore be expressed in quantities that are either invariant under the perspective projection or at least compatible with it. We choose one quantity of each kind, a projective-incidence term and an image-plane ratio term, and combine them.

To explicitly enforce topological stability without violating perspective geometry, a novel geometry consistency loss (

L_{geo}

) is formulated using perspective-compatible geometric quantities, i.e. quantities whose target values are read from the ground-truth projection rather than from a planar-rectangle prior. The first component evaluates the scale-invariant geometric edge-ratio (

L_{ratio}

):

L_{ratio} = |log \frac{d ({\hat{q}}_{0}, {\hat{q}}_{1})}{d ({\hat{q}}_{2}, {\hat{q}}_{3})} - log \frac{d (q_{0}, q_{1})}{d (q_{2}, q_{3})}| + |log \frac{d ({\hat{q}}_{0}, {\hat{q}}_{2})}{d ({\hat{q}}_{1}, {\hat{q}}_{3})} - log \frac{d (q_{0}, q_{2})}{d (q_{1}, q_{3})}|

(4)

where

d (a, b) = {∥ b - a ∥}_{2}

denotes the Euclidean distance. The use of the logarithm has two pragmatic effects. First, it converts a multiplicative ratio mismatch into an additive scale-symmetric penalty, so that a 10% error on a small edge contributes the same loss magnitude as a 10% error on a large edge. Second, it removes the numerical asymmetry that a raw ratio would introduce when an edge is short. To prevent log singularities for very small edges, the implementation clamps both numerator and denominator to a small positive lower bound before taking the logarithm.
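The edge-ratio term of Eq. (4) can be sketched in a few lines of plain Python. This is an illustrative version rather than the authors' training code: the clamp value `EPS` is an assumption (the paper only mentions "a small positive lower bound"), and corners are taken in TL, TR, BL, BR order.

```python
import math

EPS = 1e-6  # assumed clamp value; the paper does not report the actual bound


def dist(a, b):
    """Euclidean distance d(a, b) between two 2D points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def log_ratio(p, q, r, s):
    """log of d(p, q) / d(r, s), with both lengths clamped away from zero."""
    return math.log(max(dist(p, q), EPS) / max(dist(r, s), EPS))


def ratio_loss(pred, gt):
    """Eq. (4): pred and gt are corner lists in TL, TR, BL, BR order."""
    top_bottom = abs(log_ratio(pred[0], pred[1], pred[2], pred[3])
                     - log_ratio(gt[0], gt[1], gt[2], gt[3]))
    left_right = abs(log_ratio(pred[0], pred[2], pred[1], pred[3])
                     - log_ratio(gt[0], gt[2], gt[1], gt[3]))
    return top_bottom + left_right


gt = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
stretched = [(0.0, 0.0), (11.0, 0.0), (0.0, 10.0), (10.0, 10.0)]  # top edge 10% longer
scale = lambda quad, k: [(k * x, k * y) for x, y in quad]
print(ratio_loss(gt, gt), ratio_loss(stretched, gt),
      ratio_loss(scale(stretched, 10), scale(gt, 10)))
```

The last two printed values agree up to rounding, which reflects the scale symmetry that motivates the logarithm: the same relative edge error yields the same penalty regardless of the facet's apparent size.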

The second component enforces diagonal-intersection consistency (

L_{inter}

). Let

L_{1} = ∥ q_{3} - q_{0} ∥

and

L_{2} = ∥ q_{2} - q_{1} ∥

denote the two diagonal lengths, and let

{\hat{d}}_{1} = \frac{q_{3} - q_{0}}{L_{1}}, {\hat{d}}_{2} = \frac{q_{2} - q_{1}}{L_{2}}

(5)

denote the corresponding unit-norm direction vectors (

∥ {\hat{d}}_{1} ∥ = ∥ {\hat{d}}_{2} ∥ = 1

). The two ground-truth diagonals are then parameterised by arc length as

ℓ_{1} (s) = q_{0} + s {\hat{d}}_{1}

and

ℓ_{2} (t) = q_{1} + t {\hat{d}}_{2}

, where

s \in [0, L_{1}]

and

t \in [0, L_{2}]

are measured directly in image-plane pixels. Their intersection

I

must satisfy

q_{0} + s {\hat{d}}_{1} = q_{1} + t {\hat{d}}_{2} .

(6)

Taking the 2D cross product of both sides with

{\hat{d}}_{2}

eliminates t and yields the closed-form solution

s = \frac{det (q_{1} - q_{0}, {\hat{d}}_{2})}{det ({\hat{d}}_{1}, {\hat{d}}_{2}) + ε} = \frac{det (q_{1} - q_{0}, {\hat{d}}_{2})}{sin θ + ε},

(7)

where

θ

is the angle between

{\hat{d}}_{1}

and

{\hat{d}}_{2}

, and

ε

is a small positive constant that prevents division by zero when the two diagonals become nearly collinear (

θ \to 0

). Because

{\hat{d}}_{1}

and

{\hat{d}}_{2}

are unit-norm,

det ({\hat{d}}_{1}, {\hat{d}}_{2})

is exactly

sin θ

, so

ε

acquires the clear physical meaning of a lower bound on the admissible angular separation of the two diagonals rather than the role of a purely numerical safeguard.

The ground-truth intersection follows from Eq. (7) in one line as

I = q_{0} + s {\hat{d}}_{1}

. The predicted intersection

\hat{I}

is computed by the same closed form applied to the predicted corner set

\hat{Q}

– i.e. by substituting

{\hat{q}}_{k}

for

q_{k}

throughout Eqs. (5)–(7) – and the Euclidean distance between the two intersection points serves as the structural penalty

L_{inter} = ∥ \hat{I} - I ∥_{2} .

(8)

Both intersection points are expressed in normalised image coordinates (pixel coordinates divided by the input image size) before this distance is evaluated, so

L_{inter}

is dimensionless and is directly comparable in magnitude with

L_{ratio}

.
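The closed form of Eqs. (5) to (8) translates directly into code. The sketch below is an illustration under stated assumptions: `eps` is a hypothetical value for ε, the input is assumed square so that a single `image_size` normalises both axes, and degenerate diagonals of zero length are not handled.

```python
import math


def det2(a, b):
    """2x2 determinant det(a, b) of two 2D vectors."""
    return a[0] * b[1] - a[1] * b[0]


def diagonal_intersection(corners, eps=1e-6):
    """Eqs. (5)-(7): corners in TL, TR, BL, BR order; returns I = q0 + s * d1."""
    q0, q1, q2, q3 = corners
    len1 = math.hypot(q3[0] - q0[0], q3[1] - q0[1])   # TL-BR diagonal length
    len2 = math.hypot(q2[0] - q1[0], q2[1] - q1[1])   # TR-BL diagonal length
    d1 = ((q3[0] - q0[0]) / len1, (q3[1] - q0[1]) / len1)
    d2 = ((q2[0] - q1[0]) / len2, (q2[1] - q1[1]) / len2)
    # det(d1, d2) equals sin(theta) because d1 and d2 are unit vectors
    s = det2((q1[0] - q0[0], q1[1] - q0[1]), d2) / (det2(d1, d2) + eps)
    return (q0[0] + s * d1[0], q0[1] + s * d1[1])


def inter_loss(pred, gt, image_size):
    """Eq. (8), evaluated on coordinates normalised by the input image size."""
    ip = diagonal_intersection(pred)
    ig = diagonal_intersection(gt)
    return math.hypot(ip[0] - ig[0], ip[1] - ig[1]) / image_size


gt = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in gt]
print(diagonal_intersection(gt), inter_loss(shifted, gt, 1280))
```

For the TL, TR, BL, BR ordering in image coordinates (y pointing down), `det2(d1, d2)` is positive for a correctly oriented facet, so adding a positive `eps` keeps the denominator away from zero in that case.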

Geometrically,

L_{inter}

penalises configurations in which the two predicted diagonals fail to meet at the same projected point as the ground-truth diagonals. Because the cross-ratio along each diagonal is preserved by the perspective projection [13], requiring the two predicted diagonals to intersect at the correct projected point is equivalent to requiring the projected facet to have the correct internal cross-ratio structure – without imposing parallelism, right angles, or rectangularity in the image plane.

L_{inter}

is therefore a perspective-invariant topological regulariser, complementary to the image-plane edge-length-ratio regulariser

L_{ratio}

of Eq. (4), which constrains scale consistency along the facet sides rather than the projective incidence of the diagonals.

The complete training objective combines this topological regulariser with the standard YOLO bounding-box, classification and keypoint terms:

L_{total} = L_{YOLO} + λ_{geo} (L_{ratio} + λ_{inter} L_{inter}) .

(9)

In our reference implementation the geometric weight

λ_{geo}

is held at a small value so that

L_{geo} = L_{ratio} + λ_{inter} L_{inter}

behaves as a stabiliser rather than as a primary objective: it shapes the search space to exclude topologically invalid quadrilaterals while leaving the bulk of the gradient signal to the standard keypoint and bounding-box terms. The loss is clipped to a finite range so that the gradient remains bounded when degenerate, near-parallel-diagonal configurations briefly appear early in training – that is, when

sin θ \to 0

in Eq. (7) and the closed-form s would otherwise dominate the optimisation step.
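How the terms of Eq. (9) combine, including the clipping of the geometric term, can be sketched as follows. Only the value of λ_geo (0.05, Section 4.1) comes from the paper; `LAMBDA_INTER` and `GEO_CLIP` are hypothetical placeholders, since neither the intersection weight nor the clipping range is reported.

```python
LAMBDA_GEO = 0.05    # value reported in Section 4.1
LAMBDA_INTER = 1.0   # hypothetical; the paper does not report this weight
GEO_CLIP = 10.0      # hypothetical upper bound for the clipped geometric term


def total_loss(yolo_loss, ratio_loss, inter_loss):
    """Eq. (9): L_total = L_YOLO + lambda_geo * (L_ratio + lambda_inter * L_inter)."""
    geo = ratio_loss + LAMBDA_INTER * inter_loss
    geo = min(max(geo, 0.0), GEO_CLIP)   # keep the geometric term in a finite range
    return yolo_loss + LAMBDA_GEO * geo


print(total_loss(1.0, 0.2, 0.1))   # ordinary step
print(total_loss(1.0, 1e6, 0.0))   # degenerate step: geometric term is clipped
```

With a small λ_geo and a bounded geometric term, a single degenerate sample can shift the total loss by at most `LAMBDA_GEO * GEO_CLIP`, which is the stabilising behaviour the text describes.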

4. Results

4.1. Dataset and Implementation Protocols

The proposed framework was evaluated on an industrial heliostat image dataset captured under outdoor conditions. The dataset comprises 300 high-resolution images of operating heliostats captured at various times of day. The four corners of each heliostat facet in an image are labeled in a fixed TL/TR/BL/BR order, together with a visibility flag, so that the geometric loss of Section 3.4 can be computed only over fully visible facets. The dataset is treated as a self-contained engineering benchmark rather than a public competition task. It serves as a testbed for evaluating deep-learning methods for the detection and control of heliostats in a CSP plant [11].

Each image contains a single heliostat composed of an

8 \times 8

grid of sub-mirror facets, yielding

64 \times 4 = 256

corner annotations per fully visible image. Corners that fall inside saturated (over-exposed) or shadowed (under-exposed) regions are intentionally not annotated and are excluded from the loss by the per-keypoint visibility flag

v_{k}

(see Eq. (3)); this convention propagates the optical-degradation statistics of the deployment scene into the supervision signal rather than into the labelling protocol. The 300 images are split at the image level into train, validation and test partitions in an

8 : 1 : 1

ratio (240 / 30 / 30 images), corresponding to up to

61,440 / 7,680 / 7,680

annotated corners under the fully visible upper bound.
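The split sizes and corner counts quoted above follow from simple arithmetic, reproduced here as a short check. The helper name `split_counts` is introduced only for this illustration.

```python
N_IMAGES = 300
CORNERS_PER_IMAGE = 8 * 8 * 4   # 8 x 8 facets, four corners each


def split_counts(n_images, ratios=(8, 1, 1)):
    """Image-level split sizes for the given integer ratio."""
    total = sum(ratios)
    return tuple(n_images * r // total for r in ratios)


train, val, test = split_counts(N_IMAGES)
upper_bounds = tuple(n * CORNERS_PER_IMAGE for n in (train, val, test))
print((train, val, test), upper_bounds)
```

These corner counts are upper bounds, because corners in saturated or shadowed regions are left unannotated.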

The model was trained on an NVIDIA RTX 4090D GPU using the AdamW optimizer [38] for 300 epochs, with input tensors resized to

1280 \times 1280

to fully leverage the high-resolution P2 capabilities. Mixed-precision (AMP) computation was enabled to reduce GPU memory pressure, and the keypoint-loss gain was raised relative to the default value to align the optimization budget with the precision-driven objective of this work. Geometric data augmentation was deliberately conservative (small rotation, small translation, small scale, no synthetic perspective warp) so that the model is exposed to operationally realistic facet shapes rather than to aggressively warped quadrilaterals; mosaic and a small mixup factor were retained so that the pose detector still benefits from multi-instance composition during training. The optimizer used an initial learning rate of

5 \times 10^{- 4}

with a cosine schedule, and the geometry weight was set to

λ_{geo} = 0.05

; all other settings followed the YOLOv8-Pose defaults.

For evaluation we use five metrics. mAP@0.5 reports the average precision under a relatively loose IoU threshold and quantifies whether the facet is found at all. mAP@0.5:0.95 averages precision over a sweep of stricter IoU thresholds and is therefore the most sensitive to localization accuracy of the bounding box that surrounds the four keypoints. AR@0.5 reports the average recall under the loose IoU threshold and complements precision by quantifying how often a present facet is missed. End-point error (EPE) is the mean Euclidean distance between predicted and ground-truth keypoint coordinates over matched detections. End-to-end frames per second (FPS) is reported as the wall-clock throughput including pre-processing, inference and post-processing.
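As an illustration of the EPE definition, the sketch below averages the per-corner Euclidean error over matched facets. Restricting the average to corners flagged as visible is an assumption made here for consistency with the annotation protocol; the paper does not spell out this detail, and the matching step itself is taken as given.

```python
import numpy as np


def end_point_error(pred, gt, visible):
    """Mean Euclidean error in pixels over visible corners of matched facets.

    pred, gt: arrays of shape (N, 4, 2) with pixel coordinates.
    visible:  array of shape (N, 4) with 0/1 flags (masking is an assumption).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    mask = visible.astype(bool)
    return float(dist[mask].mean()) if mask.any() else float("nan")


gt = np.zeros((2, 4, 2))
pred = gt + np.array([3.0, 4.0])     # every corner is off by 5 px
visible = np.ones((2, 4))
print(end_point_error(pred, gt, visible))
```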

4.2. Quantitative Analysis

To strictly isolate the contributions of the proposed architectural and mathematical modifications, a comprehensive ablation study was conducted (Table 1).

Table 1 shows that the baseline model can already detect the facets at a high rate. Its limitation lies in the localization precision of the corner points. Adding

L_{geo}

on top of an unmodified backbone (Only_Geo) does not by itself reduce EPE, because the underlying coordinates are still produced from low-resolution features that fundamentally cannot resolve the boundary at sub-pixel level. Adding the P2 branch (Only_P2) alone restores the high-frequency information and reduces EPE from 2.58 to 1.89 px, which is consistent with a stride-driven precision floor: halving the head’s effective stride approximately halves the unavoidable rounding error contribution, and the remaining 1.89 px is dominated by genuine optical ambiguity (glare-saturated edges, motion blur). Combining both interventions retains the EPE level of Only_P2 and additionally tightens the strict mAP@0.5:0.95 score, which is the metric most sensitive to localization quality. We interpret this as evidence that

L_{geo}

removes residual outlier configurations that would otherwise lower the strict-IoU precision but would barely affect the average pixel error—in other words,

L_{geo}

matters most where it is hardest to detect statistically, namely in the high-IoU regime that controls servo-loop usability.

4.3. Comparison with a Generic Pose Estimator

To benchmark the proposed framework against a generic baseline, the widely used open-source framework MMPose is evaluated on the same heliostat dataset under its default top-down configuration. Under identical conditions it reaches only mAP@0.5

= 0.3267

and AR@0.5

= 0.3207

, far below the proposed framework (mAP@0.5

= 0.9927

in Table 1); its EPE of

2.0139

px is computed only over the few facets it manages to detect and is therefore not directly comparable with the dense matches of our detector. This large detection gap shows that generic top-down detectors, designed for human-scale articulated poses with rich intrinsic texture [25,26,39], degrade severely on the texture-poor, highly specular heliostat facets, confirming that the task-specific design adopted here is necessary rather than incremental.

4.4. Qualitative Analysis

Qualitative comparisons across different ablation configurations are presented in Figure 3.

As shown in Figure 3, the Baseline model produces structurally distorted or collapsed quadrilaterals when the corner boundaries are blurred. The proposed framework substantially improves geometric resilience, producing a topologically valid quadrilateral for the majority of captured samples. Two qualitative regularities are worth noting. First, failure modes of the Baseline model cluster on samples where one corner sits in a saturated glare patch; the loss of local gradient information propagates almost exclusively to that corner, which then collapses toward an internal point and turns the predicted quadrilateral into a degenerate, near-triangular shape. Second, the residual failure modes of Only_P2 cluster on samples where two adjacent corners are simultaneously affected, in which case the unconstrained pairwise regression produces visually small but topologically incorrect twists; once

L_{geo}

is added, the diagonal-intersection penalty rules these twists out because they would force the two predicted diagonals to meet at the wrong projected point. The results in Figure 3 are consistent with the design intent of each component: P2 supplies localization sharpness,

L_{geo}

supplies topological stability.

5. Discussion

5.1. Error Justification for Vision-Based Feedback Control

Although the proposed framework reduces the mean EPE to 1.89 pixels, this residual static error may still appear large when compared with tightly controlled indoor metrology. However, it is an expected consequence of the operational environment and of our architectural trade-offs. First, the intense specular glare inherent to operating heliostats saturates local pixel gradients, which makes exact sub-pixel corner localization from a single monocular snapshot highly improbable. Second, to respect the latency constraints of active control loops, computationally expensive iterative sub-pixel refinement modules, which are common in static photogrammetry, were intentionally excluded from our pipeline.

This 1.89-pixel deviation should be interpreted in the context of its primary prospective application, dynamic closed-loop servo control, rather than as a metrological limitation. As depicted in Figure 4a (error curves) and Figure 4b (box plots), the proposed geometric regularization suppresses catastrophic outliers, that is, the extreme upper tail of the error distribution. Suppressing these large structural deviations is expected to reduce the abrupt coordinate jumps that a downstream controller would otherwise have to reject; whether this translates into smoother actuator behaviour can only be confirmed once the loop is closed, which is beyond the scope of the present open-loop evaluation.

A sun-tracking heliostat moves at only about $15^{\circ}$ per hour, so the sampling-rate requirement is modest, and the measured throughput of 25.14 FPS (Table 2) leaves ample margin for closed-loop control. Moreover, the bounded error tail, free of self-intersecting quadrilaterals, that $L_{geo}$ is designed to produce is well suited to short-window temporal filtering, so closed-loop control of the integrated detector, filter and actuator system becomes feasible in future deployment [34,40].

Table 2. Runtime profile and latency breakdown of the proposed measurement framework.

Performance Metric	Measured Value
End-to-end FPS (wall-clock) ↑	25.14
Inference-only FPS ↑	27.51
Preprocess time (ms) ↓	1.996
Inference time (ms) ↓	36.349
Postprocess time (ms) ↓	0.784
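
The entries of Table 2 can be cross-checked with a few lines of arithmetic. The sketch below only uses the values reported in the table and the tracking rate of about 15° per hour quoted in the text; the variable names are illustrative, and attributing the small remaining gap to per-frame overhead outside the three timed stages is an assumption, not something the measurements identify.

```python
# Values copied from Table 2.
PREPROCESS_MS = 1.996
INFERENCE_MS = 36.349
POSTPROCESS_MS = 0.784
END_TO_END_FPS = 25.14

# Tracking rate quoted in the text (approximate).
TRACKING_RATE_DEG_PER_HOUR = 15.0

# Sum of the three timed stages, in milliseconds.
stage_sum_ms = PREPROCESS_MS + INFERENCE_MS + POSTPROCESS_MS

# Throughput implied by the inference stage alone; this reproduces
# the "Inference-only FPS" row of the table.
inference_only_fps = 1000.0 / INFERENCE_MS

# Wall-clock period per frame and the part not covered by the timed stages
# (assumed here to be I/O and scheduling overhead).
frame_period_ms = 1000.0 / END_TO_END_FPS
untimed_overhead_ms = frame_period_ms - stage_sum_ms

# Angular motion of the heliostat between two consecutive frames.
deg_per_second = TRACKING_RATE_DEG_PER_HOUR / 3600.0
deg_per_frame = deg_per_second / END_TO_END_FPS

print(f"stage sum          : {stage_sum_ms:.3f} ms")
print(f"inference-only FPS : {inference_only_fps:.2f}")
print(f"frame period       : {frame_period_ms:.2f} ms")
print(f"untimed overhead   : {untimed_overhead_ms:.2f} ms")
print(f"motion per frame   : {deg_per_frame:.2e} deg")
```

Under these numbers the heliostat moves by well under a thousandth of a degree between frames, which is the sense in which the sampling-rate requirement is modest.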

5.2. Deployment Considerations

Translating the proposed vision-based measuring system into a deployable field system raises several issues that may affect its performance. First, camera installation stability is essential because the absolute error budget of the system is the sum of the detector’s pixel error and any residual misalignment between the camera’s optical axis and the heliostat field reference frame. A rigid mount, periodic auto-recalibration against fixed scene references, and a thermal shield on the camera enclosure are practical countermeasures; these are standard in industrial machine-vision deployments and can be reused without modification [12,13].

Second, although the training set used in this work covers strong-glare conditions, the cameras of the vision-based measuring system should still provide automatic exposure control and high-resolution imaging sensors. Sensor-level remediation tends to be more effective than pure post-processing because once a corner falls inside a saturated region, no amount of downstream filtering can recover the lost gradient [18,19].

Third, given the output rate of 25.14 FPS, the pose estimation accuracy can be further improved by integrating filters into the control loop [34,40]. Temporal filters can suppress the per-frame jitter over a short time window and maintain reasonable estimates when momentary detection failures occur.
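
As an illustration of such a temporal filter, the sketch below applies a scalar constant-position Kalman filter to one keypoint coordinate and simply holds the estimate when a detection is missing. It is a minimal example under stated assumptions, not the filter design of this work: the class name, the constant-position model, and the noise variances `process_var` and `meas_var` are hypothetical and would need tuning against real detector statistics.

```python
class ScalarKalman:
    """Constant-position Kalman filter for one keypoint coordinate (pixels)."""

    def __init__(self, process_var=0.01, meas_var=4.0):
        self.q = process_var  # assumed drift variance per frame (px^2)
        self.r = meas_var     # assumed detector noise variance (px^2)
        self.x = None         # current state estimate
        self.p = None         # variance of the estimate

    def update(self, z):
        """Feed one measurement z, or None if detection failed this frame."""
        if self.x is None:
            if z is None:
                return None  # nothing to report before the first detection
            self.x, self.p = float(z), self.r
            return self.x
        # Predict: the state is assumed unchanged, its uncertainty grows.
        self.p += self.q
        # Correct only when a detection is available; otherwise hold.
        if z is not None:
            gain = self.p / (self.p + self.r)
            self.x += gain * (z - self.x)
            self.p *= 1.0 - gain
        return self.x


def smooth_track(measurements, **filter_kwargs):
    """Filter a sequence of per-frame measurements (None marks a miss)."""
    kf = ScalarKalman(**filter_kwargs)
    return [kf.update(z) for z in measurements]
```

In use, one filter instance would be kept per coordinate of each corner, for example `smooth_track([100.0, 102.0, 98.0, None, 101.0, 99.0])` for the x-coordinate of a single corner with one missed frame.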

Fourth, edge-computing deployment is preferable to a centralized inference server. An embedded GPU or NPU at the edge is sufficient to maintain the reported FPS and avoids network-induced jitter on the control loop [36]. The predicted four-corner coordinates can be combined with a known 3D facet model and a calibrated camera intrinsic matrix to produce a 6-DoF pose via PnP, or combined with a calibrated reference homography to generate a 2D tracking-error vector directly for the angle control of heliostats [13,35].
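
The homography route mentioned above can be sketched without any external library. The example below estimates the homography that maps a unit square onto the four detected corners, uses it to locate the perspective-correct facet centre (which coincides with the intersection of the diagonals), and forms a 2D error vector against a reference centre. This is one possible reading of the text, not the authors' procedure: the function names, the unit-square parameterization, and the `reference_centre` input are hypothetical, and a full 6-DoF pose would additionally require camera intrinsics and a PnP solver, which are not shown.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def homography_from_corners(src, dst):
    """3x3 homography (h33 fixed to 1) mapping four src points onto dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(float(u))
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(float(v))
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(hm, point):
    """Map a 2D point through the homography hm."""
    x, y = point
    w = hm[2][0] * x + hm[2][1] * y + hm[2][2]
    return ((hm[0][0] * x + hm[0][1] * y + hm[0][2]) / w,
            (hm[1][0] * x + hm[1][1] * y + hm[1][2]) / w)


UNIT_SQUARE = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def facet_centre(corners):
    """Perspective-correct image position of the facet centre."""
    hm = homography_from_corners(UNIT_SQUARE, corners)
    return apply_homography(hm, (0.5, 0.5))


def tracking_error(corners, reference_centre):
    """2D pixel offset of the detected facet centre from a reference centre."""
    cx, cy = facet_centre(corners)
    return (cx - reference_centre[0], cy - reference_centre[1])
```

The corners must be supplied in the same consecutive order as `UNIT_SQUARE`; a degenerate detection (for instance three collinear corners) raises a `ValueError` instead of returning a meaningless vector.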

6. Conclusions

This study proposes a high-speed, geometry-aware visual measurement framework for in-situ detection and calibration of heliostats in concentrated solar power plants. By adapting the YOLOv8-Pose architecture with a high-resolution P2 branch and a perspective-invariant geometry consistency loss, the framework recovers fine-grained boundaries while maintaining topological stability under strong glare. The proposed framework achieved an end-to-end processing speed of 25.14 FPS and a mean end-point error of 1.89 px on the real-world dataset. Although the experiments were carried out at a small-scale CSP station, the methodology is applicable to large-scale commercial CSP plants, provided that a suitable dataset is constructed.

The current framework focuses on localization of 2D image keypoints instead of estimating the heliostat pose. The data-driven model outputs the image coordinates of the four corners of each facet, while the estimation of heliostat tracking pose is left to a downstream module that uses the predicted keypoints with EPnP-type algorithms [14]. This separation allows the keypoint detector to be easily integrated with different pose estimation algorithms [13,35].

The reported throughput (25.14 FPS) is based on the specified hardware. A production deployment will most likely run on an embedded edge accelerator next to the camera, where memory and power constraints differ; the network may need quantization, channel pruning, or knowledge-distillation-driven compression to retain real-time behaviour on the target platform [36]. Such compression is feasible because the geometric constraints are imposed at training time only and do not enlarge the inference graph.

Overall, the proposed detector provides fast and geometrically reliable 2D observations that can serve as a practical front-end for closed-loop heliostat calibration in CSP plants; closing the detector–filter–actuator loop on a physical drive remains the primary direction of future work.

Author Contributions

Conceptualization, F.X.; methodology, H.M.; software, H.M.; validation, F.X. and H.M.; investigation, H.M.; data curation, H.M.; writing—original draft preparation, H.M.; writing—review and editing, F.X.; visualization, H.M.; supervision, F.X.; project administration, F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially supported by an industrial cooperation project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors wish to express their gratitude to Dr. Feihu Sun for his help with data acquisition at the Badaling experimental CSP station. During the preparation of this work, the authors used ChatGPT for language polishing; they have checked all AI-generated text and take full responsibility for the content of the published article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Islam, M.T.; Huda, N.; Abdullah, A.B.; Saidur, R. A Comprehensive Review of State-of-the-Art Concentrating Solar Power (CSP) Technologies: Current Status and Research Trends. Renew. Sustain. Energy Rev. 2018, 91, 987–1018.
2. Ho, C.K. Advances in Central Receivers for Concentrating Solar Applications. Sol. Energy 2017, 152, 38–56.
3. Noone, C.J.; Torrilhon, M.; Mitsos, A. Heliostat Field Optimization: A New Computationally Efficient Model and Biomimetic Layout. Sol. Energy 2012, 86, 792–803.
4. Maiga, M.; N’Tsoukpoe, K.E.; Gomna, A.; Fiagbe, Y.A.K. Sources of Solar Tracking Errors and Correction Strategies for Heliostats. Renew. Sustain. Energy Rev. 2024, 203, 114770.
5. Berenguel, M.; Rubio, F.R.; Valverde, A.; Lara, P.J.; Arahal, M.R.; Camacho, E.F.; López, M. An Artificial Vision-Based Control System for Automatic Heliostat Positioning Offset Correction in a Central Receiver Solar Power Plant. Sol. Energy 2004, 76, 563–575. https://doi.org/10.1016/j.solener.2003.12.006.
6. Phipps, G.S. Heliostat Beam Characterization System–Calibration Technique. Technical Report SAND78-8038, Sandia Laboratories, 1978.
7. Röger, M.; Herrmann, P.; Ulmer, S.; Ebert, M.; Prahl, C.; Göhring, F. Techniques to Measure Solar Flux Density Distribution on Large-Scale Receivers. J. Sol. Energy Eng. 2014, 136, 031013.
8. Mehos, M.; Price, H.; Cable, R.; Kearney, D.; Kelly, B.; Kolb, G.; Morse, F. Concentrating Solar Power Best Practices Study. Technical Report NREL/TP-5500-75763, National Renewable Energy Laboratory, Golden, CO, USA, 2020.
9. Prahl, C.; Röger, M.; Hilgert, C. Air-Borne Shape Measurement of Parabolic Trough Collector Fields. Sol. Energy 2013, 91, 68–78.
10. Chaumette, F.; Hutchinson, S. Visual Servo Control. I. Basic Approaches. IEEE Robot. Autom. Mag. 2006, 13, 82–90.
11. Steger, C.; Ulrich, M.; Wiedemann, C. Machine Vision Algorithms and Applications, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2018.
12. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
13. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004.
14. Xu, F.; Li, C.; Sun, F. On-Line Measurement of Tracking Poses of Heliostats in Concentrated Solar Power Plants. Sensors 2024, 24, 6373.
15. Pargmann, M.; Quinto, D.M.; Schwarzbözl, P.; Pitz-Paal, R. Automatic Heliostat Learning for In Situ Concentrating Solar Power Plant Metrology with Differentiable Ray Tracing. Nat. Commun. 2024, 15, 6997.
16. Lewen, J.; Pargmann, M.; Löwe, M.C.; Maldonado Quinto, D.; Pitz-Paal, R. Inverse Deep Learning Raytracing for Heliostat Surface Prediction. Sol. Energy 2025, 289, 113312.
17. Carballo, J.A.; Bonanos, A.M.; Fernández-Reche, J.; Vasallo, M.J.; Marugán-Cruz, C.; Santana, D. Artificial Intelligence in Heliostat Control and Optimization for CSP Plants: A Critical Review. Renew. Sustain. Energy Rev. 2025, 214, 115530.
18. Nayar, S.K.; Fang, X.S.; Boult, T. Separation of Reflection Components Using Color and Polarization. Int. J. Comput. Vis. 1997, 21, 163–186.
19. Debevec, P.E.; Malik, J. Recovering High Dynamic Range Radiance Maps from Photographs. In Proceedings of SIGGRAPH 1997, 1997; pp. 369–378.
20. Qiu, J.M.; Jiang, P.S.; Zhu, Y.; Yin, Z.X.; Cheng, M.M.; Mu, T.J. Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 20823–20833.
21. Han, Y.; Yan, H.; Liu, Y.; Zhao, J.; Wang, K.; Wang, Y.; Wang, Y.S. NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 21001–21010.
22. Tang, J.; Zhang, C.; Yu, Y.T.; Wang, P.; Zhang, Y.; Liu, L.; Yang, Y.L.; Theobalt, C. SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
23. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision – ECCV 2016. Springer, 2016, Vol. 9912, Lecture Notes in Computer Science, pp. 483–499.
24. Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014; pp. 1653–1660.
25. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
26. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the Computer Vision – ECCV 2018. Springer, 2018, Vol. 11210, Lecture Notes in Computer Science, pp. 472–487.
27. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022; pp. 2637–2646.
28. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose, 2023. arXiv:cs.
29. Kendall, A.; Cipolla, R. Geometric Loss Functions for Camera Pose Regression with Deep Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 6555–6564.
30. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, 2023.
31. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
32. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 936–944.
33. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S.T. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the Computer Vision – ECCV 2022. Springer, 2022, Vol. 13666, Lecture Notes in Computer Science, pp. 89–106.
34. Bolton, W. Programmable Logic Controllers, 6th ed.; Newnes: Oxford, UK, 2015.
35. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An Accurate O(n) Solution to the PnP Problem. Int. J. Comput. Vis. 2009, 81, 155–166.
36. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision – ECCV 2014. Springer, 2014, Vol. 8693, Lecture Notes in Computer Science, pp. 740–755.
38. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, 2019.
39. MMPose Contributors. OpenMMLab Pose Estimation Toolbox and Benchmark. 2020.
40. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45.

Figure 1. Overall pipeline of the proposed heliostat sensing system. Stages (1)–(2) (blue dashed region) form the sensing front-end developed in this work: a fixed RGB camera (1) acquires frames at ∼25 Hz (the inset highlights the four predicted corners of a representative facet in yellow), and the modified YOLOv8-Pose detector (2), trained jointly with the geometry-consistency loss $L_{geo}$, predicts for each visible facet the four ordered corner keypoints $\{\hat{q}_{k}\}_{k=0}^{3}$, the bounding box, and the per-keypoint visibility flags. Stages (3)–(5) (orange dashed region) denote the prospective downstream deployment (geometric error estimation, PLC/EKF control, and the heliostat drive), which is outside the scope of this study and left as future work.

Figure 2. Overview of the proposed measurement framework. The lower section illustrates the modified YOLOv8-Pose architecture integrating a high-resolution P2 branch and the geometry consistency loss ($L_{geo}$). Panels (a)–(e) present the feature-response energy maps from P2 to P6, highlighting the superior spatial preservation of the P2 layer.

Figure 3. Qualitative comparison of corner detections across ablation configurations with various input images. Each column shows one representative single sub-mirror captured under challenging illumination. The top row (Input Image) gives the grayscale input; each subsequent row overlays the four predicted corners of that sub-mirror, connected into a quadrilateral, for one configuration: Baseline (blue), Only_Geo (orange), Only_P2 (yellow) and Ours (green). A well-formed, convex quadrilateral that tightly encloses the facet indicates correct topology, whereas a collapsed, twisted or self-intersecting outline indicates a structural failure of the prediction.

Figure 4. Pixel-error analysis on the test set. (a) Sorted absolute pixel-error curves, revealing the suppression of maximum errors; (b) error-distribution box plots, indicating a narrowed extreme upper tail.
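
The quantities plotted in Figure 4 can be derived from matched keypoint pairs as in the following sketch. It is a minimal illustration rather than the evaluation script used in this work: the function names are hypothetical, errors are aggregated per keypoint (the aggregation level is an assumption), and the nearest-rank percentile is just one of several common conventions.

```python
import math


def endpoint_errors(pred, gt):
    """Euclidean distance in pixels for each matched (x, y) keypoint pair."""
    return [math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(pred, gt)]


def summarize_errors(errors):
    """Mean, median, 95th percentile and maximum of a list of errors.

    The sorted list is what a sorted error curve would plot; the upper
    percentiles describe the tail that a box plot makes visible.
    """
    ordered = sorted(errors)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank definition: smallest value with at least p% below or equal.
        rank = max(1, math.ceil(p / 100.0 * n))
        return ordered[rank - 1]

    return {
        "sorted": ordered,
        "mean": sum(ordered) / n,
        "median": percentile(50),
        "p95": percentile(95),
        "max": ordered[-1],
    }
```

Comparing `p95` and `max` between two configurations gives a numerical counterpart to the narrowed upper tail described in the caption, independently of the mean.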

Table 1. Ablation study on measurement precision. The combination of P2 features and geometric constraints attains the highest strict-IoU (mAP@0.5:0.95) precision.

Model Configuration	mAP@0.5 ↑	mAP@0.5:0.95 ↑	EPE (px) ↓
Baseline YOLOv8-Pose	0.9903	0.9671	2.58
Only_Geo	0.9883	0.9657	2.67
Only_P2	0.9936	0.9823	1.89
Ours (P2_Geo)	0.9927	0.9847	1.89

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
