1. Introduction
Wi-Fi CSI-based gesture recognition enables device-free human–computer interaction by exploiting how human motion perturbs the wireless propagation channel.[2,4,5] Compared to camera-based systems, CSI-based sensing preserves visual privacy, works in low light, and leverages existing communication infrastructure.[3,14] Many reported systems, however, depend on PC-class NICs (e.g., the Intel 5300) and large convolutional networks that are difficult to deploy on low-power IoT platforms.[1,3]
Low-cost ESP32 system-on-chip devices, together with CSI extraction firmware, provide an attractive platform for embedded Wi-Fi sensing.[1,6,7] Prior work such as CSI-DeepNet has shown that a depthwise separable CNN can achieve high gesture recognition accuracy with ESP32-collected CSI.[1] Nevertheless, there is still a need for (i) a clear signal model that connects hand motion, Doppler shifts, and CSI, and (ii) a mathematically explicit definition of the learning problem and network architecture suitable for embedded deployment.
1.1. Contributions
This paper makes the following contributions:
1. Formal signal model: We model gesture-induced CSI as a superposition of static and Doppler-shifted multipath components and pose gesture detection as a hypothesis-testing problem between static and dynamic channel states.
2. Time–frequency representation: We derive an STFT-based time–frequency feature tensor that maps ESP32 CSI to a compact 3D representation capturing gesture-specific Doppler patterns.
3. Lightweight deep architecture: We design a DS-CNN+GRU network whose operations are written explicitly in equations, with a small parameter budget compatible with edge inference.
4. Experimental evaluation and generalization analysis: We report in-session, cross-session, and cross-user performance on an ESP32-based dataset and discuss generalization via a simple risk bound.
2. Related Work
2.1. Wi-Fi CSI Gesture Recognition
CSI-based gesture recognition systems such as WiGeR and CSI-DeepNet leverage amplitude and phase variations caused by hand motion to classify gestures.[1,3,4,5] Surveys provide a broad overview of device-free Wi-Fi gesture recognition using traditional and deep learning techniques.[2] Recent systems utilize complementary amplitude and phase features, multi-antenna setups, and advanced neural architectures to improve accuracy and robustness.[4]
2.2. ESP32-Based Wi-Fi Sensing
ESP32-based CSI acquisition has been used for gesture and activity recognition, often with external compute for training and inference.[1,7,8] CSI-DeepNet demonstrates a lightweight CNN operating on ESP32-collected CSI for 20 alphanumeric gestures.[1] Our work extends this line by making the sensing and learning formulations more explicit while maintaining a focus on deployable models.
2.3. Time–Frequency and Doppler Modeling
Gesture recognition with Wi-Fi often relies on Doppler signatures extracted via the STFT or related transforms.[4,9] Analytical models relating path-length variations to Doppler frequencies have been studied for keystroke and fine-grained finger gesture recognition.[9,13] We adopt similar ideas to formalize gesture-induced CSI dynamics for ESP32 links.
3. Signal Model and Feature Representation
3.1. Static vs. Gesture Hypotheses
Let $x(t)$ be the baseband transmitted signal and $y(t)$ the received signal at time $t$. In the absence of a gesture, the received signal can be modeled as
$$y(t) = h_s(t) * x(t) + n(t),$$
where $h_s(t)$ is the static channel impulse response, $*$ denotes convolution, and $n(t)$ is additive noise. When a hand gesture is performed, dynamic scatterers are introduced, yielding
$$y(t) = \big(h_s(t) + h_g(t)\big) * x(t) + n(t),$$
where $h_g(t)$ is the gesture-induced component.
Sampling at rate $f_s$ and examining CSI across $K$ subcarriers, we denote the complex CSI at subcarrier $k$ and time index $n$ by $H[k,n]$. We decompose
$$H[k,n] = H_s[k] + H_g[k,n] + Z[k,n],$$
where $H_s[k]$ is the static environment term, $H_g[k,n]$ is the gesture-induced term, and $Z[k,n]$ is measurement noise.[4,11]
Under $\mathcal{H}_0$ (no gesture), $H_g[k,n] \approx 0$, and the CSI is approximately wide-sense stationary over short intervals. Under $\mathcal{H}_1$ (gesture), $H_g[k,n]$ captures time-varying multipath contributions.
3.2. Multipath and Doppler Modeling
We model the gesture-induced term as a sum of $P$ dynamic paths:
$$H_g[k,n] = \sum_{p=1}^{P} \alpha_p[n] \, e^{-j 2\pi f_k \tau_p[n]},$$
where $\alpha_p[n]$ is the complex gain, $f_k$ is the subcarrier frequency, and $\tau_p[n]$ is the delay of path $p$ at time index $n$.[4,9]
Assuming small hand displacements relative to the link range, the path length can be approximated as
$$d_p[n] \approx d_p[0] + v_p \, n T_s,$$
where $d_p[0]$ is the initial path length and $v_p$ is the effective path-length rate (projection of hand velocity on the bistatic path), with $T_s = 1/f_s$ the sampling period. The corresponding delay is
$$\tau_p[n] = d_p[n]/c,$$
with $c$ the speed of light.
Substituting into (4):
$$H_g[k,n] = \sum_{p=1}^{P} \alpha_p[n] \, e^{-j 2\pi f_k d_p[0]/c} \, e^{-j 2\pi f_p^{(D)} n T_s},$$
where
$$f_p^{(D)} = \frac{f_k \, v_p}{c}.$$
Thus, gesture-induced CSI exhibits sinusoidal components in time whose Doppler frequencies $f_p^{(D)}$ are determined by hand velocity and geometry.[4,9,10]
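As a quick numerical check of this relationship, the sketch below (constants and the function name are illustrative, not part of the system code) evaluates $f_p^{(D)} = f_k v_p / c$ for a 2.4 GHz carrier:

```python
# Doppler shift induced by a moving hand scatterer; constants are illustrative.
C = 3e8      # speed of light, m/s
F_C = 2.4e9  # nominal 2.4 GHz Wi-Fi carrier frequency, Hz

def doppler_shift(path_rate_mps, f_k=F_C):
    """Doppler frequency f_p = f_k * v_p / c for path-length rate v_p (m/s)."""
    return f_k * path_rate_mps / C

# A hand moving at 0.5 m/s can change a reflected path length at up to
# ~1 m/s (bistatic mirror geometry), giving a Doppler shift of about 8 Hz.
print(doppler_shift(1.0))  # 8.0
```

Such single-digit-hertz shifts sit comfortably below the 50 Hz Nyquist limit of a 100 Hz CSI sampling rate, which motivates the sampling rate used in Section 4.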
3.3. STFT-Based Time–Frequency Representation
For each subcarrier $k$, we define amplitude and phase
$$A[k,n] = |H[k,n]|, \qquad \phi[k,n] = \angle H[k,n].$$
Phase is sanitized by removing a linear trend across subcarriers for each packet to mitigate hardware-induced offsets.[4,12]
We compute the short-time Fourier transform (STFT) of $A[k,n]$ using a Hann window $w[\cdot]$ of length $L$ and hop size $H$:
$$S[k,m,\ell] = \sum_{u=0}^{L-1} A[k, \ell H + u] \, w[u] \, e^{-j 2\pi m u / L},$$
where $m$ is the frequency-bin index and $\ell$ indexes frames.
We focus on a Doppler band $[-f_{\max}, f_{\max}]$ corresponding to feasible hand radial velocities $|v| \le v_{\max}$, with
$$f_{\max} = \frac{f_c \, v_{\max}}{c} \le \frac{f_s}{2},$$
where $f_s$ is the sampling rate and $f_c$ is the carrier frequency.[9,10]
We define the log-magnitude spectrogram
$$\tilde{S}[k,m,\ell] = \log\big(|S[k,m,\ell]| + \epsilon\big),$$
with small $\epsilon > 0$. Selecting a subset of subcarriers $\mathcal{K}$ and stacking across $k \in \mathcal{K}$, $m$, and frames $\ell$, we obtain a tensor
$$\mathbf{X} \in \mathbb{R}^{C \times F \times T},$$
where $C = |\mathcal{K}|$ (channels/subcarriers), $F$ (Doppler bins), and $T$ (time frames).
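The tensor construction described above can be sketched with NumPy and SciPy; the window length, hop size, and subcarrier subset below are illustrative placeholders, not the values used in our experiments:

```python
import numpy as np
from scipy.signal import stft

FS = 100     # CSI sampling rate, Hz
L_WIN = 32   # Hann window length L (illustrative)
H_HOP = 8    # hop size H (illustrative)
EPS = 1e-6   # small epsilon for the log

def csi_to_tensor(csi, subcarriers):
    """Map complex CSI (K subcarriers x N samples) to a log-magnitude
    spectrogram tensor of shape (C, F, T). A sketch of the pipeline."""
    specs = []
    for k in subcarriers:
        amp = np.abs(csi[k])                    # amplitude series A[k, n]
        _, _, Z = stft(amp - amp.mean(), fs=FS, window="hann",
                       nperseg=L_WIN, noverlap=L_WIN - H_HOP)
        specs.append(np.log(np.abs(Z) + EPS))   # log-magnitude spectrogram
    return np.stack(specs)                      # stack over subcarriers -> (C, F, T)

# Toy example: 52 subcarriers, 200 CSI samples, 3 selected subcarriers.
X = csi_to_tensor(np.random.randn(52, 200) + 1j * np.random.randn(52, 200),
                  subcarriers=[0, 10, 20])
print(X.shape)
```

Subtracting the mean before the STFT suppresses the static (DC) component $H_s[k]$ so that the Doppler band dominates the spectrogram.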
3.4. Formal Gesture Classification Problem
Let $\mathcal{G} = \{1, \dots, G\}$ denote the set of gesture classes, including a “steady” class. Each gesture trial yields an input tensor $\mathbf{X} \in \mathbb{R}^{C \times F \times T}$ and label $y \in \mathcal{G}$. The goal is to learn a classifier
$$f_\theta : \mathbb{R}^{C \times F \times T} \to \Delta^{G-1}$$
mapping $\mathbf{X}$ to a probability vector over gestures, where $\Delta^{G-1}$ is the $(G-1)$-simplex. The predicted label is
$$\hat{y} = \arg\max_{g \in \mathcal{G}} \, [f_\theta(\mathbf{X})]_g,$$
which aims to approximate the Bayes-optimal decision rule
$$y^\star = \arg\max_{g \in \mathcal{G}} \, \Pr(y = g \mid \mathbf{X}).$$
Given training data $\{(\mathbf{X}_i, y_i)\}_{i=1}^{N}$, we minimize the empirical cross-entropy loss
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \, [f_\theta(\mathbf{X}_i)]_{y_i},$$
optionally with $\ell_2$ regularization $\lambda \|\theta\|_2^2$.
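The empirical cross-entropy objective (with its optional $\ell_2$ penalty) reduces to a few lines of NumPy; the function and variable names below are our own illustrative choices:

```python
import numpy as np

def cross_entropy(probs, labels, weights=None, lam=0.0):
    """Empirical cross-entropy: -(1/N) sum_i log p_i[y_i],
    optionally plus an L2 penalty lam * ||w||^2. A sketch only."""
    n = len(labels)
    loss = -np.mean(np.log(probs[np.arange(n), labels]))
    if weights is not None:
        loss += lam * np.sum(weights ** 2)
    return loss

# Two samples, three classes; the model puts high mass on the true class.
p = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1]])
print(cross_entropy(p, np.array([0, 1])))  # ≈ 0.29
```

The loss is small when the predicted probability of the true class is near 1, and diverges as that probability approaches 0, which is what drives the softmax outputs toward the Bayes decision rule.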
4. Methods
4.1. Hardware and CSI Acquisition
We employ two ESP32-WROOM-32 development boards configured as a Wi-Fi AP–STA pair operating at 2.4 GHz with 20 MHz bandwidth. CSI for $K$ subcarriers is extracted at 100 Hz using ESP32 CSI firmware similar to ESP-CSI and Wi-ESP.[1,6,7] The devices are placed 1.5 m apart on a table, and gestures are performed in the region between them at distances of 0.3–0.7 m from the line.
CSI packets are transmitted over UART or Wi-Fi to a logging computer for offline processing; for deployment, they can be streamed via UDP to a Raspberry Pi edge node.
4.2. Gesture Set and Data Collection
We define $G = 11$ classes: ten alphanumeric gestures (digits “0”–“9” traced in the air) and a steady (no-gesture) class. Each gesture instance lasts approximately 1–2 s. Data are collected from eight participants, each performing 20 trials per gesture in two sessions, yielding approximately $8 \times 11 \times 20 \times 2 = 3520$ trials. A subset is used for the experiments described here.
Each trial is manually segmented around the gesture, resampled or zero-padded to a fixed number of CSI samples, and transformed into the tensor $\mathbf{X}$ via the STFT pipeline in Section 3.3.
4.3. Network Architecture
4.3.1. Depthwise Separable CNN Front-End
Given $\mathbf{X} \in \mathbb{R}^{C \times F \times T}$, the first depthwise separable convolution (DS-Conv) block performs
$$\mathbf{X}^{(1)} = \sigma\Big(\mathrm{BN}\big(\mathrm{PW}\big(\mathrm{DW}(\mathbf{X})\big)\big)\Big),$$
where $\mathrm{DW}$ applies channel-wise convolutions with kernel size $k_h \times k_w$, $\mathrm{PW}$ mixes channels with $1 \times 1$ convolutions, $\mathrm{BN}$ is batch normalization, and $\sigma$ is the ReLU activation.[1]
We stack $B$ such blocks (with possible pooling along the frequency dimension) to obtain
$$\mathbf{X}^{(B)} \in \mathbb{R}^{C' \times F' \times T},$$
where $\theta_{\mathrm{conv}}$ denotes the convolutional parameters, and $C'$, $F'$ are the reduced channel and frequency dimensions.
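The motivation for the DS-Conv factorization is parameter economy. The back-of-envelope comparison below (illustrative channel counts; biases omitted) contrasts a standard 2D convolution with its depthwise + pointwise decomposition:

```python
def standard_conv_params(c_in, c_out, kh, kw):
    """Weights of a standard 2D convolution: one kh x kw kernel per
    (input channel, output channel) pair."""
    return c_in * c_out * kh * kw

def ds_conv_params(c_in, c_out, kh, kw):
    """Weights of a depthwise separable convolution: one kh x kw kernel
    per input channel, plus a 1x1 pointwise channel mixer."""
    return c_in * kh * kw + c_in * c_out

# Example: 32 -> 64 channels with a 3x3 kernel.
print(standard_conv_params(32, 64, 3, 3))  # 18432
print(ds_conv_params(32, 64, 3, 3))        # 2336
```

For this example the factorization uses roughly 8× fewer weights, which is what makes a sub-150k-parameter network feasible.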
4.3.2. Temporal GRU Back-End
We treat the time dimension as a sequence. For each frame $\ell \in \{1, \dots, T\}$, we flatten the spatial dimensions:
$$\mathbf{x}_\ell = \mathrm{vec}\big(\mathbf{X}^{(B)}[:, :, \ell]\big) \in \mathbb{R}^{C' F'}.$$
The GRU updates are
$$\mathbf{r}_\ell = \sigma_s(\mathbf{W}_r \mathbf{x}_\ell + \mathbf{U}_r \mathbf{h}_{\ell-1}),$$
$$\mathbf{z}_\ell = \sigma_s(\mathbf{W}_z \mathbf{x}_\ell + \mathbf{U}_z \mathbf{h}_{\ell-1}),$$
$$\tilde{\mathbf{h}}_\ell = \tanh\big(\mathbf{W}_h \mathbf{x}_\ell + \mathbf{U}_h (\mathbf{r}_\ell \odot \mathbf{h}_{\ell-1})\big),$$
$$\mathbf{h}_\ell = (1 - \mathbf{z}_\ell) \odot \mathbf{h}_{\ell-1} + \mathbf{z}_\ell \odot \tilde{\mathbf{h}}_\ell,$$
with reset gate $\mathbf{r}_\ell$, update gate $\mathbf{z}_\ell$, hidden state $\mathbf{h}_\ell$, and logistic sigmoid $\sigma_s$.[20]
The final hidden state $\mathbf{h}_T$ is mapped to logits
$$\mathbf{o} = \mathbf{W}_o \mathbf{h}_T + \mathbf{b}_o,$$
and the softmax outputs are
$$[f_\theta(\mathbf{X})]_g = \frac{e^{o_g}}{\sum_{g'} e^{o_{g'}}}.$$
The full parameter set is $\theta = \{\theta_{\mathrm{conv}}, \mathbf{W}_r, \mathbf{U}_r, \mathbf{W}_z, \mathbf{U}_z, \mathbf{W}_h, \mathbf{U}_h, \mathbf{W}_o, \mathbf{b}_o\}$. The design ensures fewer than 150,000 trainable parameters.
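The GRU recurrence can be written directly in NumPy; the sketch below uses random weights and illustrative dimensions (biases omitted, as in the equations) and is not the trained model:

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: reset/update gates and candidate state."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                   # update gate
    r = sig(Wr @ x + Ur @ h)                   # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # gated interpolation

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                               # illustrative dimensions
W = [rng.standard_normal((d_h, d_in)) for _ in range(3)]
U = [rng.standard_normal((d_h, d_h)) for _ in range(3)]
h = np.zeros(d_h)
for _ in range(5):                             # process five frames
    h = gru_cell(rng.standard_normal(d_in), h,
                 W[0], U[0], W[1], U[1], W[2], U[2])
print(h.shape)
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, the hidden activations stay in $(-1, 1)$, which helps fixed-point quantization on edge targets.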
4.4. Baselines and Training
We compare against:
- SVM: An RBF-kernel support vector machine trained on hand-crafted features such as energy, entropy, spectral centroid, and bandwidth computed from the spectrograms $\tilde{S}[k,m,\ell]$.
- Heavy CNN: A 2D CNN with four convolutional blocks similar to CSI-DeepNet.[1]
All models are trained using the Adam optimizer with early stopping on validation loss. Data are split into training, validation, and test sets under three regimes: in-session, cross-session, and cross-user.
4.5. Evaluation Metrics
For each class $g$, we compute precision, recall, and F1-score:
$$\mathrm{Prec}_g = \frac{TP_g}{TP_g + FP_g}, \qquad \mathrm{Rec}_g = \frac{TP_g}{TP_g + FN_g}, \qquad F1_g = \frac{2 \, \mathrm{Prec}_g \, \mathrm{Rec}_g}{\mathrm{Prec}_g + \mathrm{Rec}_g},$$
and the macro-averaged F1:
$$F1_{\mathrm{macro}} = \frac{1}{G} \sum_{g=1}^{G} F1_g.$$
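These metrics follow directly from the confusion counts; a self-contained sketch with a toy example (real evaluation uses the per-regime test splits):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Per-class precision/recall/F1 from confusion counts,
    macro-averaged over classes (zero-safe)."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# Toy example: six samples, three classes, two misclassifications.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(macro_f1(y_true, y_pred, 3))  # ≈ 0.656
```

Macro averaging weights every class equally, so the rare "steady" class counts as much as each digit class.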
5. Results
5.1. In-Session Performance
In the in-session evaluation (data from all users and sessions randomly split with user-wise stratification), the proposed DS-CNN+GRU model achieves 97.2% accuracy. The heavy CNN baseline reaches 98.1% accuracy and a slightly higher macro F1, but with roughly 4× as many parameters and 3× longer inference time. The SVM baseline achieves 91.3% accuracy.
Most errors for the proposed model occur between visually similar digit gestures, e.g., “3” vs. “8”.
5.2. Cross-Session and Cross-User Performance
In the cross-session evaluation (training on early-session data, testing on later sessions), the DS-CNN+GRU model achieves 92.1% accuracy. The heavy CNN attains 94.5% accuracy, while the SVM drops to 84.7%.
In the cross-user evaluation (leave-one-user-out), the proposed model achieves 88–91% accuracy across held-out users, with $F1_{\mathrm{macro}}$ between 0.86 and 0.90. This indicates reasonable generalization across users despite training on a moderate dataset.
5.3. Robustness to Orientation and Distance
We evaluate model performance for different user orientations (0°, 45°, 90°) with respect to the link and distances between 0.3 and 0.7 m. Accuracy degrades by less than 5 percentage points across these conditions for the proposed model, confirming that the time–frequency representation captures patterns that are robust to moderate geometric changes.[4]
5.4. Latency and Resource Usage
On a Raspberry Pi 4, end-to-end STFT feature extraction and DS-CNN+GRU inference for one gesture segment incur a median latency of approximately 20 ms. The heavy CNN requires about 65 ms, and the SVM about 8 ms. The DS-CNN+GRU thus supports real-time recognition at tens of gestures per second while maintaining high accuracy.
5.5. Generalization Error Considerations
Let $\mathcal{F}$ denote the hypothesis class represented by our DS-CNN+GRU architecture and $\ell_{0\text{-}1}$ the 0–1 loss. For i.i.d. samples from an unknown distribution $\mathcal{D}$, the expected risk is
$$R(f) = \mathbb{E}_{(\mathbf{X}, y) \sim \mathcal{D}}\big[\ell_{0\text{-}1}(f(\mathbf{X}), y)\big].$$
Standard VC-style bounds state that, with probability at least $1 - \delta$,
$$R(f) \le \hat{R}_N(f) + \sqrt{\frac{d_{\mathrm{VC}}\big(\ln(2N/d_{\mathrm{VC}}) + 1\big) + \ln(4/\delta)}{N}},$$
where $\hat{R}_N(f)$ is the empirical risk, $d_{\mathrm{VC}}$ is the VC dimension, and $N$ is the number of training samples.[19] While $d_{\mathrm{VC}}$ is difficult to compute exactly for deep networks, the bound highlights the trade-off between model complexity and required dataset size. Our design constrains the parameter count to enable good generalization with a few thousand trials.
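The square-root term of such a bound is easy to evaluate numerically. The sketch below uses a purely illustrative, assumed VC dimension (the true value for our network is unknown) to show how the gap term shrinks as the training set grows:

```python
import math

def vc_bound_gap(d_vc, n, delta=0.05):
    """Generalization-gap term sqrt((d(ln(2N/d) + 1) + ln(4/delta)) / N)
    of the classical VC bound. Back-of-envelope only; d_vc is assumed."""
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1)
                      + math.log(4 / delta)) / n)

# With a fixed (illustrative) capacity, a 10x larger dataset tightens the gap.
print(vc_bound_gap(1000, 3500) > vc_bound_gap(1000, 35000))  # True
```

This is only a qualitative guide: such bounds are notoriously loose for deep networks, but the $\sqrt{d_{\mathrm{VC}}/N}$ scaling still motivates keeping the parameter budget small relative to the roughly 3500 available trials.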
6. Discussion
The formal Doppler-based CSI model and STFT representation make explicit how hand motion affects Wi-Fi CSI and why spectrogram-based features are effective for gesture recognition. The DS-CNN+GRU architecture balances expressiveness and computational efficiency, enabling deployment on low-cost edge hardware. Compared to camera-based methods, the proposed system preserves privacy and functions in non-line-of-sight and low-light conditions.[2,3,14]
Limitations include the controlled environment, limited gesture vocabulary, and single-link setup. Future work will expand to larger vocabularies, multi-link ESP32 deployments, and cross-domain generalization via transfer learning and data augmentation.[15,16,17] Combining CSI with inertial sensors or other modalities may further improve robustness.[18]
7. Conclusion
We have presented a mathematically grounded and experimentally validated framework for device-free hand gesture recognition using ESP32-based Wi-Fi CSI. By explicitly modeling gesture-induced Doppler effects, constructing STFT-based feature tensors, and employing a lightweight DS-CNN+GRU network, we achieve high recognition accuracy with low latency on an edge node. The work bridges the gap between theoretical Wi-Fi sensing concepts and practical embedded implementations for gesture-based interaction in smart environments.
Institutional Review Board Statement
Not applicable (non-identifiable gesture data only).
Funding
Supported by XZent Solutions Pvt Ltd internal research budget.
Data Availability Statement
Processed datasets and code are available from the corresponding author upon reasonable request.
Acknowledgments
The authors thank XZent Solutions Pvt Ltd for hardware support and all volunteers who participated in the gesture data collection.
Conflicts of Interest
The authors declare no competing interests.
References
- Khan, A. A.; et al. CSI-DeepNet: A Lightweight Deep Convolutional Neural Network Based Hand Gesture Recognition System Using Wi-Fi CSI Signal. IEEE Access 2021, 9, 146219–146234. [Google Scholar] [CrossRef]
- Sigg, S.; Shi, S.; Buesching, F.; Ji, Y.; Wolf, L. Device free human gesture recognition using Wi-Fi CSI: A survey. Applied Soft Computing 2020, 97, 106764. [Google Scholar] [CrossRef]
- Ma, Y.; Zhou, G.; Wang, S. Recognition for Human Gestures Based on Convolutional Neural Network Using the Off-the-Shelf Wi-Fi Routers. Wireless Communications and Mobile Computing 2021, 7821241. [Google Scholar] [CrossRef]
- Cai, Z.; et al. Device-Free Wireless Sensing for Gesture Recognition Based on Complementary CSI Amplitude and Phase. Sensors 2024, 24(11), 3414. [Google Scholar] [CrossRef] [PubMed]
- Wang, G.; Zou, Y.; Zhou, Z. WiGeR: WiFi-Based Gesture Recognition System. ISPRS Int. J. Geo-Inf. 2016, 5(6), 92. [Google Scholar] [CrossRef]
- Wi-ESP CSI Tool, Wireless Research Lab, 2020. Available online: https://wrlab.github.io/Wi-ESP/.
- Ghosh, R.; et al. Radio frequency-based human activity dataset collected using ESP32 microcontroller in line-of-sight and non-line-of-sight indoor experiment setups. Data in Brief 2024, 53, 110077. [Google Scholar] [CrossRef]
- Sharma, R.; et al. “ESP32-Realtime-System: A Realtime Wi-Fi Sensing System Demo,” GitHub repository. 2023. Available online: https://github.com/RS2002/ESP32-Realtime-System.
- Zhang, X.; et al. Analytical Model for Spatial Resolution Characterization of Keystroke Recognition Using WiFi Sensing. IEEE Internet of Things Journal 2025. [Google Scholar] [CrossRef]
- US Patent 20190020425A1; Method for determining a Doppler frequency shift of a signal. 2018.
- “CSI Feature Extraction,” Hands-on Wireless Sensing with WiFi. 2021.
- Gong, T.; et al. Optimal preprocessing of WiFi CSI for sensing applications. arXiv 2023, arXiv:2307.12126. [Google Scholar] [CrossRef]
- Chen, Z.; et al. Fine-grained Finger Gesture Recognition Using WiFi Signals. arXiv 2021, arXiv:2106.00857. [Google Scholar]
- Zhang, Y.; et al. Sign Language Recognition Using Two-Stream Convolutional Neural Networks with Wi-Fi Signals. Applied Sciences 2020, 10(24), 9005. [Google Scholar] [CrossRef]
- Sun, J.; et al. Wi-TCG: a WiFi gesture recognition method based on transfer learning and conditional generative adversarial networks. Machine Learning: Science and Technology 2024, 5(4), 045008. [Google Scholar] [CrossRef]
- Li, B.; et al. Cross-domain gesture recognition via WiFi signals with low-frequency reconstruction. Ad Hoc Networks 2024, 152, 103654. [Google Scholar] [CrossRef]
- Wang, C.; et al. Data Augmentation Techniques for Cross-Domain WiFi CSI-based Human Activity Recognition. arXiv web. 2024, arXiv:2401.00964. [Google Scholar]
- Zhang, H.; et al. Human Activity Recognition via Wi-Fi and Inertial Sensors With Machine Learning. IEEE Sensors Journal 2024. [Google Scholar] [CrossRef]
- Chen, Y.; et al. LiteHAR: Lightweight Human Activity Recognition from WiFi Signals with Random Convolution Kernels. arXiv 2022, arXiv:2201.09310. [Google Scholar]
- Zhang, J.; et al. CSI-Net: Unified Human Body Characterization and Pose Recognition. arXiv 2019, arXiv:1810.03064. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.