This section describes the workflow of the study, from data collection through analysis of the results to the proposed multimodal algorithm. An overview of the system model is given in Figure 1.
3.1. Data Collection Setup
Real-world driving experiments were conducted to collect data for our multimodal driver monitoring project. Although the work presented here is a pilot study, it includes a substantial amount of driving data, totaling 480 minutes, which provides sufficient data points for preliminary validation and statistical analysis. Five licensed participants, as detailed in Table 2, drove their own vehicles, which ensured that they were comfortable with the driving environment and minimized the influence of unfamiliar vehicle dynamics.
Before commencing the actual drives, participants underwent a familiarization period in a parking lot. During this period, all sensors were worn for at least 20 minutes while the copilot ensured correct sensor operation and participant comfort. Participants also drove short loops within the parking area to acclimate to the equipment, proceeding to public roads only when they felt safe and comfortable.
Two wearable sensing systems were deployed: Pupil Labs Invisible eye-tracking glasses [35] connected to a companion smartphone for continuous gaze data acquisition, and the Polar H10 chest-worn sensor [24] for HR monitoring. A dashcam recorded the forward road view for reference and potential manual labeling. The copilot was responsible for hardware handling, synchronization, and real-time monitoring of data quality. Figure 2 illustrates the complete setup in a real driving environment.
Data collection took place mostly during the daytime and under various weather conditions, including clear skies, overcast conditions, and light rain. Extreme weather scenarios (e.g., heavy rain, fog) were intentionally excluded to maintain consistent visual and physiological measurement conditions. Traffic conditions were naturally variable, ranging from low-density rural motorway segments to dense urban traffic, and included both high-flow and stop-and-go conditions. While traffic density and environmental variability are recognized as influencing factors, this pilot dataset reflects uncontrolled, realistic conditions. More systematic control of these variables is planned for future large-scale studies.
Driving scenarios were naturally mixed, with participants alternating between city and motorway environments over multiple sessions and on different days. The final dataset comprises approximately urban driving and motorway driving. In total, the data collection yielded 28,963 HR samples, 10,669 blink events, 63,300 fixation events, and 63,050 saccade events, forming a rich multimodal dataset for analysis.
All participants provided informal consent before the study, acknowledging the voluntary nature of participation and their understanding of the procedures involved. Given the pilot nature of this work and the use of commercially available sensors, formal institutional ethical approval was not sought at this stage; however, future studies involving broader participant recruitment will follow formal ethical approval protocols.
3.2. HR Data Collection and Processing
HR data were collected using the Polar H10 chest strap [24] (shown in Figure 2c), a device known for its high reliability and validity in measuring HR in various settings [36]. The Polar H10 records raw electrocardiogram (ECG) signals at a sampling frequency of 1000 Hz [36]. In this study, HR data were processed to output an HR value at 1 Hz, which is sufficient for capturing temporal changes in HR during real-world driving. The strap incorporates plastic electrodes on the underside to detect the heart’s electrical activity, and was securely positioned around the lower chest to ensure stable and accurate measurements throughout each driving session. Data collection was managed via the Polar Beat smartphone application, which allowed real-time monitoring and session control. Upon completion, the recorded data were uploaded to the Polar cloud service for secure storage and subsequently downloaded to a PC for further processing and analysis.

This study focuses on HR rather than heart rate variability (HRV). While HRV can provide additional information on autonomic nervous system activity, HR alone has demonstrated strong sensitivity to cognitive and environmental driving demands. State-of-the-art research [37] has shown that HR can achieve higher accuracy than HRV in differentiating cognitive load during urban and motorway driving scenarios. Nevertheless, HRV analysis is recognized as a valuable complementary measure and will be incorporated in future studies to provide a more comprehensive physiological assessment.
3.3. Gaze Data Collection and Processing
The gaze data were collected using Pupil Labs Invisible glasses [35], an advanced eye-tracking device designed for dynamic real-world scenarios, as shown in Figure 2b. The glasses are equipped with a front-facing scene camera and two infrared (IR) eye cameras that capture eye images at a resolution of 192 × 192 pixels and a frequency of 200 Hz. A companion device runs a neural network algorithm [38] to compute gaze data. During recording, the Pupil Invisible companion device calculates gaze data in real time, with the frame rate dependent on the companion device model. For this work, using a OnePlus 8, the frame rate was over 120 Hz. Once uploaded to the Pupil Labs cloud, the gaze data were recalculated at the maximum frame rate of 200 Hz. The Pupil Labs algorithms were then employed for gaze data processing. Blink detection is achieved through an XGBoost classifier trained on device-specific datasets, ensuring high recall and low false detection rates [39]. This algorithm processes optical flow patterns associated with eyelid closure and reopening events, resulting in accurate detection of blink events with a recall of approximately 95% [39].
Fixation events were identified using an extended I-VT (identification by velocity threshold) algorithm [40]. This method compensates for head and body movements through optical flow correction and applies adaptive velocity thresholds based on dynamic scene conditions. Events with durations below physiological plausibility thresholds were filtered, ensuring precise fixation detection even in challenging real-world scenarios. Saccade events were detected as transitions between fixations using thresholds for velocity, amplitude, and duration [40].
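To make the velocity-threshold idea concrete, the following is a minimal sketch of a plain I-VT pass. It is not the extended Pupil Labs algorithm: it omits the optical flow compensation and adaptive thresholds described above. The `GazeSample` type, the 30 deg/s velocity threshold, and the 60 ms minimum duration are illustrative assumptions rather than values taken from this study, and Euclidean distance in gaze angles is used as a small-angle approximation of angular distance.

```python
import math
from dataclasses import dataclass


@dataclass
class GazeSample:
    t: float  # timestamp in seconds
    x: float  # horizontal gaze angle in degrees (assumed input format)
    y: float  # vertical gaze angle in degrees (assumed input format)


def ivt_fixations(samples, velocity_threshold=30.0, min_duration=0.06):
    """Return (start, end) times of fixations found by a basic I-VT rule.

    velocity_threshold (deg/s) and min_duration (s) are hypothetical defaults.
    """
    fixations = []
    start = end = None  # bounds of the current run of slow samples
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        velocity = math.hypot(cur.x - prev.x, cur.y - prev.y) / dt
        if velocity < velocity_threshold:
            if start is None:
                start = prev.t
            end = cur.t
        else:
            # A fast sample closes the current run; keep it if long enough.
            if start is not None and end - start >= min_duration:
                fixations.append((start, end))
            start = None
    if start is not None and end - start >= min_duration:
        fixations.append((start, end))
    return fixations


# Synthetic 200 Hz recording: 20 samples at 0 deg, then 20 samples at 10 deg.
samples = [GazeSample(i * 0.005, 0.0 if i < 20 else 10.0, 0.0) for i in range(40)]
fixations = ivt_fixations(samples)
```

In this sketch the gap between two consecutive fixations corresponds to a saccade candidate, which would then be screened by velocity, amplitude, and duration thresholds as the text describes.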
3.4. Data Analysis
The collected gaze and HR data were systematically analyzed to extract meaningful metrics to assess the state of the driver. The analysis consisted of reading and synchronizing the data, cleaning outliers, and performing statistical analyses to uncover insights based on subjects and driving scenarios.
Initially, gaze data (blinks, fixations, and saccades) and HR data were read from their respective files. Synchronization was achieved by normalizing the timestamps to ensure temporal alignment across datasets. The normalized timestamp $t_{\text{norm}}$ was calculated as:

$$t_{\text{norm}} = t_{\text{orig}} - t_{\min}$$

where $t_{\text{orig}}$ is the original timestamp and $t_{\min}$ is the minimum timestamp in the dataset segment.
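As an illustration of this normalization step, the sketch below shifts each stream so that its earliest timestamp becomes zero. The function name and the example values are hypothetical, and the sketch assumes that the streams being aligned were started at the same moment; otherwise a shared reference timestamp would be needed instead of each stream's own minimum.

```python
def normalize_timestamps(timestamps):
    """Shift timestamps so the earliest one in the segment becomes zero."""
    if not timestamps:
        return []
    t_min = min(timestamps)
    return [t - t_min for t in timestamps]


# Hypothetical Unix timestamps (seconds) for an HR stream and a blink stream.
hr_times = normalize_timestamps([1700000000.0, 1700000001.0, 1700000002.0])
blink_times = normalize_timestamps([1700000000.5, 1700000003.5])
```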
Outlier removal was performed to enhance data quality by excluding physiologically implausible values. For each metric $m$, data points that fall outside the predefined lower and upper bounds $[L_m, U_m]$ were discarded, so that a value $x_m$ was retained only if:

$$L_m \le x_m \le U_m$$

This step ensured that the analyses were not skewed by erroneous measurements or artifacts. The limits used in this study are listed in Table 3.
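A minimal sketch of this bound-based filter is given below. The HR limits used here are placeholders chosen only for illustration; the limits actually applied in the study are those reported in Table 3.

```python
def remove_outliers(values, lower, upper):
    """Keep only values inside the closed interval [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


# Placeholder bounds in beats per minute, NOT the values from Table 3.
HR_BOUNDS = (40.0, 200.0)

hr_raw = [72.0, 75.0, 0.0, 250.0, 80.0]
hr_clean = remove_outliers(hr_raw, *HR_BOUNDS)
```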
To gain insights into driver behavior, we conducted a frequency distribution analysis and computed metric counts. Frequency distributions were generated for key metrics such as blink duration, fixation duration, saccade duration, saccade amplitude, saccade velocity, and HR. For each metric, data were binned, and the frequency percentage for each bin was calculated:

$$\text{Frequency}_b\,(\%) = \frac{n_b}{N} \times 100$$

where $n_b$ is the number of samples in bin $b$ and $N$ is the total number of samples of that metric.
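The binning and percentage computation can be sketched as follows. The bin edges and the blink durations are made-up example values, and the sketch assumes half-open bins and that values outside the edges are ignored (which has no effect once outliers have already been removed and the bins cover the retained range).

```python
def frequency_percentages(values, bin_edges):
    """Percentage of values in each half-open bin [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)  # values outside all bins are not counted
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * c / total for c in counts]


# Hypothetical blink durations in seconds and hypothetical bin edges.
blink_durations = [0.12, 0.18, 0.22, 0.28, 0.31, 0.45, 0.15, 0.25]
edges = [0.1, 0.2, 0.3, 0.4, 0.5]
percentages = frequency_percentages(blink_durations, edges)
```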
Analyzing these distributions enabled us to observe patterns and variations in HR and gaze behavior under different conditions. Ultimately, it also helped establish the thresholds for what can be considered normal or abnormal. Moreover, to compare scenarios and subjects, normalized counts per minute were calculated for blinks, fixations, and saccades. The counts per minute (or per second), which facilitated fair comparisons irrespective of the duration of each driving session, were computed as:

$$\text{Rate} = \frac{N_{\text{events}}}{D}$$

where $N_{\text{events}}$ is the number of detected events in a session and $D$ is the session duration in minutes (or seconds).
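For example, the per-minute normalization can be sketched as below. The function name and the session figures are hypothetical and serve only to show the arithmetic.

```python
def events_per_minute(event_count, duration_seconds):
    """Normalize an event count by session duration, in events per minute."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return event_count / (duration_seconds / 60.0)


# Hypothetical session: 180 blinks detected during a 10-minute (600 s) drive.
blink_rate = events_per_minute(180, 600.0)
```

Normalizing by duration in this way lets a short urban session and a long motorway session be compared on the same scale.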
To statistically validate scenario-based differences in HR and gaze metrics, pairwise comparisons were conducted between driving scenarios using Welch’s $t$-test [41], which does not assume equal variances. For each metric, samples $X_1$ and $X_2$ corresponding to two different scenarios were compared. The test statistic was computed as:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{X}_i$, $s_i^2$, and $n_i$ denote the sample mean, variance, and size for scenario $i$, respectively. The resulting $p$-values were used to assess the statistical significance of observed differences, with a $p$-value below the chosen significance level indicating a significant effect of the driving scenario on the corresponding metric. This analysis ensured that only metrics with sufficient sample size and statistically significant variation were used to inform the threshold and weight selection in the proposed multimodal algorithm, which is detailed next.
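The Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed with the standard library alone, as in the sketch below. The two HR samples are invented for illustration. Converting the statistic to a $p$-value requires the Student $t$ distribution; one option, assuming SciPy is available, is `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    v1, v2 = variance(a), variance(b)  # unbiased sample variances (n - 1)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-scenario HR samples (beats per minute).
hr_motorway = [70.0, 72.0, 74.0, 76.0, 78.0]
hr_urban = [80.0, 82.0, 84.0, 86.0, 88.0, 90.0]
t_stat, dof = welch_t(hr_motorway, hr_urban)
```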
3.5. Driver Multimodal Monitoring Algorithm
As discussed in Section 1 and Section 2, there is a need for a comprehensive algorithm that is both computationally lightweight and practical for real-time deployment in on-road driving environments. To address this, we propose a Driver Multimodal Monitoring Algorithm that integrates physiological and gaze-based metrics at the decision level. The fusion occurs through a weighted decision-making process, combining HR and gaze metrics into a unified state score. This approach not only enables real-time driver state estimation but also establishes a foundation for future refinements in parameter tuning based on scenario-specific labeling and deep learning models.
The collected data are divided into fixed-length segments of duration $T$ (set to 10 seconds in this study). Although the proposed algorithm can be adapted to any window length, a 10-second duration has been shown to effectively capture short-term changes in physiological and visual attention, while remaining responsive to transient events. Longer windows could capture more context but would reduce the temporal sensitivity of the system [42]. Optimizing the window size for each modality is left for future work.

For each segment $s$, key metrics are calculated, including the mean HR ($\overline{HR}_s$), mean blink duration ($\overline{BD}_s$), mean fixation duration ($\overline{FD}_s$), and mean saccade metrics: duration ($\overline{SD}_s$), amplitude ($\overline{SA}_s$), and velocity ($\overline{SV}_s$). Each metric is compared against statistically informed thresholds and classified as Normal or Abnormal.

To prioritize more informative modalities, a weighted scoring system is employed. The metric weights were initially determined based on statistical analyses of our dataset (Section 4.1), with additional validation through sensitivity analysis (Section 4.2.3). These initial values serve as a baseline for future refinement once larger, more comprehensively labeled datasets are available. The weighting framework follows methodologies from mental workload estimation [43] and cognitive workload fusion [44], emphasizing factors that contribute most significantly to classification performance [45]. The overall score $S_s$ for each segment is computed as:

$$S_s = \sum_{k \in \{HR,\, B,\, F,\, S\}} w_k \, \mathbb{1}\left(\text{metric } k \text{ is Abnormal in segment } s\right)$$

Here, $w_{HR}$, $w_{B}$, $w_{F}$, and $w_{S}$ are the respective metric weights for the HR, blink, fixation, and saccade metrics, and $\mathbb{1}(\cdot)$ is an indicator function returning 1 if the condition is true and 0 otherwise.
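The segmentation and weighted scoring described above can be sketched as follows. The normal ranges and weights are placeholders, not the values derived in Section 4, and the sketch simplifies the saccade modality to a single metric (mean saccade duration) because the way the three saccade metrics are combined is not specified here.

```python
from statistics import mean

WINDOW_S = 10.0  # segment duration T used in this study

# Placeholder normal ranges (low, high) and weights; NOT the study's values.
NORMAL_RANGE = {
    "hr": (60.0, 100.0),      # beats per minute
    "blink": (0.10, 0.40),    # mean blink duration, seconds
    "fixation": (0.10, 0.60), # mean fixation duration, seconds
    "saccade": (0.02, 0.10),  # mean saccade duration, seconds
}
WEIGHTS = {"hr": 0.4, "blink": 0.2, "fixation": 0.2, "saccade": 0.2}


def segment_means(samples, window_s=WINDOW_S):
    """Average (timestamp, value) samples within fixed-length windows."""
    buckets = {}
    for t, value in samples:
        buckets.setdefault(int(t // window_s), []).append(value)
    return {idx: mean(vals) for idx, vals in sorted(buckets.items())}


def segment_score(metric_means):
    """Sum the weights of all metrics whose segment mean is out of range."""
    score = 0.0
    for name, value in metric_means.items():
        low, high = NORMAL_RANGE[name]
        if not (low <= value <= high):  # indicator function: metric Abnormal
            score += WEIGHTS[name]
    return score


# Hypothetical HR samples spanning two 10-second segments.
hr_means = segment_means([(0.0, 70.0), (5.0, 74.0), (10.0, 110.0), (15.0, 114.0)])
score = segment_score(
    {"hr": hr_means[1], "blink": 0.2, "fixation": 0.3, "saccade": 0.05}
)
```

Because only comparisons and additions are involved per segment, a scheme of this kind stays computationally light, which matches the real-time motivation given in the text.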
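The final decision is derived by comparing the segment score $S_s$ against a decision threshold $\theta$, whose value was fixed for this pilot study. A binary classification (Normal vs. Abnormal) was intentionally adopted for real-time practicality, as it facilitates rapid intervention in safety-critical contexts without the computational overhead of multi-class classification [46]. The classification rule is:

$$\text{State}_s = \begin{cases} \text{Abnormal}, & \text{if } S_s \ge \theta \\ \text{Normal}, & \text{otherwise} \end{cases}$$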
Sensitivity analyses were conducted to examine how varying the decision threshold $\theta$ and the weights affects classification outcomes (Section 4.2.3). The results confirm that weight calibration strongly influences detection rates and highlight the importance of carefully selecting a value of $\theta$ that yields accurate driver state detection. Algorithm 1 presents a step-by-step pseudocode implementation of the proposed system, showing the complete flow from metric extraction to decision-level fusion.
Algorithm 1: Multimodal Driver Monitoring
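The binary decision step and a simple threshold sensitivity check of the kind discussed in this section can be sketched as follows. The segment scores and candidate thresholds are invented, and treating a score equal to the threshold as Abnormal is an assumption of this sketch rather than a detail confirmed by the text.

```python
def classify(score, threshold):
    """Label a segment from its weighted score (>= is an assumed convention)."""
    return "Abnormal" if score >= threshold else "Normal"


def abnormal_rate(scores, threshold):
    """Fraction of segments labeled Abnormal at a given decision threshold."""
    if not scores:
        return 0.0
    labels = [classify(s, threshold) for s in scores]
    return labels.count("Abnormal") / len(labels)


# Hypothetical per-segment scores and candidate thresholds for a sweep.
scores = [0.0, 0.2, 0.4, 0.6, 1.0]
sweep = {th: abnormal_rate(scores, th) for th in (0.25, 0.5, 0.75)}
```

Sweeping the threshold in this way shows how quickly the share of Abnormal segments changes, which is the kind of dependence that a sensitivity analysis is meant to expose.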