Preprint · Article · This version is not peer-reviewed.

Device-Free Hand Gesture Recognition with ESP32 Wi-Fi CSI: Formal Doppler Modeling and Lightweight Deep Learning

Submitted: 29 January 2026 · Posted: 02 February 2026


Abstract
Wi-Fi Channel State Information (CSI) has emerged as a powerful modality for device-free gesture recognition, enabling human–computer interaction without cameras or wearables. Existing systems, however, often rely on PC-class network interface cards (NICs) and computationally heavy neural networks, which limits deployment in resource-constrained IoT settings. This paper presents a complete, mathematically grounded pipeline for non-contact hand gesture recognition using low-cost ESP32 modules that expose CSI. We model gesture-induced CSI as a superposition of static and Doppler-shifted multipath components, derive a time–frequency representation based on short-time Fourier transforms (STFT), and pose gesture recognition as a multi-class classification problem on CSI spectrogram tensors. A lightweight depthwise separable CNN (DS-CNN) front-end and gated recurrent unit (GRU) back-end form a compact deep architecture with fewer than 150,000 trainable parameters. An ESP32 AP–STA testbed at 2.4 GHz collects CSI at 100 Hz for ten alphanumeric gestures plus a steady class, yielding approximately 2,000 labeled trials from eight users. The proposed model attains 97.2% accuracy and macro F1-score of 0.971 in in-session evaluation and 92.1% accuracy in cross-session tests, with 20 ms median inference latency on a Raspberry Pi 4 edge node. We compare against an SVM with hand-crafted features and a heavier CNN baseline, analyze robustness to user orientation and distance, and discuss generalization through a learning-theoretic lens. The results demonstrate that ESP32-based Wi-Fi CSI, coupled with principled signal modeling and lightweight deep learning, can support practical, privacy-preserving gesture interfaces in smart environments.

1. Introduction

Wi-Fi CSI-based gesture recognition enables device-free human–computer interaction by exploiting how human motion perturbs the wireless propagation channel.[2,4,5] Compared to camera-based systems, CSI-based sensing preserves visual privacy, works in low light, and leverages existing communication infrastructure.[3,14] Many reported systems, however, depend on PC-class NICs (e.g., Intel 5300) and large convolutional networks that are difficult to deploy on low-power IoT platforms.[1,3]
Low-cost ESP32 system-on-chip devices, together with CSI extraction firmware, provide an attractive platform for embedded Wi-Fi sensing.[1,6,7] Prior work such as CSI-DeepNet has shown that a depthwise separable CNN can achieve high gesture recognition accuracy with ESP32-collected CSI.[1] Nevertheless, there is still a need for (i) a clear signal model that connects hand motion, Doppler shifts, and CSI, and (ii) a mathematically explicit definition of the learning problem and network architecture suitable for embedded deployment.

1.1. Contributions

This paper makes the following contributions:
1. Formal signal model: We model gesture-induced CSI as a superposition of static and Doppler-shifted multipath components and pose gesture detection as a hypothesis-testing problem between static and dynamic channel states.
2. Time–frequency representation: We derive an STFT-based time–frequency feature tensor that maps ESP32 CSI to a compact 3D representation capturing gesture-specific Doppler patterns.
3. Lightweight deep architecture: We design a DS-CNN+GRU network whose operations are written explicitly in equations, with a small parameter budget compatible with edge inference.
4. Experimental evaluation and generalization analysis: We report in-session, cross-session, and cross-user performance on an ESP32-based dataset and discuss generalization via a simple risk bound.

2. Related Work

2.1. Wi-Fi CSI Gesture Recognition

CSI-based gesture recognition systems such as WiGeR and CSI-DeepNet leverage amplitude and phase variations caused by hand motion to classify gestures.[1,3,4,5] Surveys provide a broad overview of device-free Wi-Fi gesture recognition using traditional and deep learning techniques.[2] Recent systems use complementary amplitude and phase features, multi-antenna setups, and advanced neural architectures to improve accuracy and robustness.[4]

2.2. ESP32-Based Wi-Fi Sensing

ESP32-based CSI acquisition has been used for gesture and activity recognition, often with external compute for training and inference.[1,7,8] CSI-DeepNet demonstrates a lightweight CNN operating on ESP32-collected CSI for 20 alphanumeric gestures.[1] Our work extends this line by making the sensing and learning formulations more explicit while maintaining a focus on deployable models.

2.3. Time–Frequency and Doppler Modeling

Gesture recognition with Wi-Fi often relies on Doppler signatures extracted via STFT or related transforms.[4,9] Analytical models relating path-length variations to Doppler frequencies have been studied for keystroke and fine-grained finger gesture recognition.[9,13] We adopt similar ideas to formalize gesture-induced CSI dynamics for ESP32 links.

3. Signal Model and Feature Representation

3.1. Static vs. Gesture Hypotheses

Let $x(t)$ be the baseband transmitted signal and $y(t)$ the received signal at time $t$. In the absence of a gesture, the received signal can be modeled as

$$\mathcal{H}_0:\quad y(t) = h_0(t) * x(t) + n(t),$$

where $h_0(t)$ is the static channel impulse response, $*$ denotes convolution, and $n(t)$ is additive noise. When a hand gesture is performed, dynamic scatterers are introduced, yielding

$$\mathcal{H}_1:\quad y(t) = \big(h_0(t) + h_g(t)\big) * x(t) + n(t),$$

where $h_g(t)$ is the gesture-induced component.

Sampling at rate $1/T_s$ and examining CSI across $K$ subcarriers, we denote the complex CSI at subcarrier $k$ and time index $n$ by $H_k[n]$. We decompose

$$H_k[n] = H_k^{(0)} + H_k^{(g)}[n] + W_k[n],$$

where $H_k^{(0)}$ is the static environment term, $H_k^{(g)}[n]$ is the gesture-induced term, and $W_k[n]$ is measurement noise.[4,11]

Under $\mathcal{H}_0$, $H_k^{(g)}[n] \approx 0$ and the CSI is approximately wide-sense stationary over short intervals. Under $\mathcal{H}_1$, $H_k^{(g)}[n]$ captures time-varying multipath contributions.

3.2. Multipath and Doppler Modeling

We model the gesture-induced term as a sum of $P_g$ dynamic paths:

$$H_k^{(g)}[n] = \sum_{p=1}^{P_g} \alpha_p \, e^{-j 2\pi f_k \tau_p[n]},$$

where $\alpha_p \in \mathbb{C}$ is the complex gain, $f_k$ is the subcarrier frequency, and $\tau_p[n]$ is the delay of path $p$ at time index $n$.[4,9]

Assuming small hand displacements relative to the link range, the path length can be approximated as

$$d_p[n] = d_{p,0} + v_p \, n T_s,$$

where $d_{p,0}$ is the initial path length and $v_p$ is the effective path-length rate (the projection of the hand velocity onto the bistatic path). The corresponding delay is

$$\tau_p[n] = \frac{d_p[n]}{c} = \frac{d_{p,0}}{c} + \frac{v_p}{c}\, n T_s,$$

with $c$ the speed of light.

Substituting this delay into the multipath sum gives

$$H_k^{(g)}[n] = \sum_{p=1}^{P_g} \alpha_p \, e^{-j 2\pi f_k \left( \frac{d_{p,0}}{c} + \frac{v_p}{c} n T_s \right)} = \sum_{p=1}^{P_g} \tilde{\alpha}_p \, e^{-j 2\pi f_{D,p} n T_s},$$

where

$$f_{D,p} = \frac{f_k v_p}{c}, \qquad \tilde{\alpha}_p = \alpha_p \, e^{-j 2\pi f_k d_{p,0} / c}.$$

Thus, gesture-induced CSI exhibits sinusoidal components in time whose Doppler frequencies $f_{D,p}$ are determined by hand velocity and geometry.[4,9,10]
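The model is easy to simulate. The following Python sketch generates one subcarrier's CSI as a static term plus Doppler-shifted components, matching the decomposition above; the path gains, velocities, and noise level are hypothetical values chosen for illustration:

```python
import numpy as np

# Minimal sketch of the Section 3.2 signal model (all values illustrative).
fs = 100.0          # CSI sampling rate (Hz), as in our testbed
Ts = 1.0 / fs
fc = 2.4e9          # carrier frequency (Hz); f_k ~ f_c for a narrowband link
c = 3e8             # speed of light (m/s)
n = np.arange(200)  # 2 s of samples

# Hypothetical dynamic paths: (complex gain alpha_p, path-length rate v_p in m/s)
paths = [(0.8 + 0.2j, 0.5), (0.3 - 0.1j, -1.2)]

H_static = 1.5 + 0.7j                          # static term H_k^(0)
H_gesture = sum(a * np.exp(-1j * 2 * np.pi * (fc * v / c) * n * Ts)
                for a, v in paths)             # Doppler terms f_D,p = f_c v_p / c
noise = 0.05 * (np.random.randn(len(n)) + 1j * np.random.randn(len(n)))

H = H_static + H_gesture + noise               # observed CSI H_k[n]
print(np.abs(H)[:5])                           # amplitude A_k[n] feeds the STFT
```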

3.3. STFT-Based Time–Frequency Representation

For each subcarrier $k$, we define amplitude and phase

$$A_k[n] = |H_k[n]|, \qquad \phi_k[n] = \angle H_k[n].$$

Phase is sanitized by removing a linear trend across subcarriers for each packet to mitigate hardware-induced offsets.[4,12]

We compute the short-time Fourier transform (STFT) of $A_k[n]$ using a Hann window $w[n]$ of length $L$ and hop size $H$:

$$S_k[m, \ell] = \sum_{n=0}^{L-1} A_k[n + \ell H] \, w[n] \, e^{-j 2\pi m n / L},$$

where $m = 0, \dots, L-1$ is the frequency-bin index and $\ell$ indexes frames.

We focus on a Doppler band $\mathcal{M}_D = \{m_{\min}, \dots, m_{\max}\}$ corresponding to feasible hand radial velocities $|v_p| \le v_{\max}$, with

$$|f_{D,p}| \le \frac{f_c v_{\max}}{c}, \qquad m_{\max} \approx \left\lceil \frac{L \, f_c v_{\max}}{c \, f_s} \right\rceil,$$

where $f_s = 1/T_s$ is the sampling rate and $f_c$ is the carrier frequency.[9,10] For example, at $f_c = 2.4$ GHz and $v_{\max} = 2$ m/s, $|f_{D,p}| \le 16$ Hz, well within the 50 Hz Nyquist limit of the 100 Hz CSI sampling rate.

We define the log-magnitude spectrogram

$$X_k[m, \ell] = \log\!\left( |S_k[m, \ell]|^2 + \epsilon \right),$$

with small $\epsilon > 0$. Selecting a subset of subcarriers $\mathcal{K}$ and stacking across $k \in \mathcal{K}$, $m \in \mathcal{M}_D$, and frames $\ell$, we obtain

$$\mathcal{X} \in \mathbb{R}^{C \times F \times T},$$

where $C = |\mathcal{K}|$ (channels/subcarriers), $F = |\mathcal{M}_D|$ (Doppler bins), and $T$ is the number of time frames.
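This pipeline can be implemented in a few lines with SciPy. The window length, hop size, and Doppler cutoff below are illustrative assumptions, not the exact configuration used in our experiments; note also that the one-sided STFT of the real amplitude signal folds negative Doppler onto positive bins:

```python
import numpy as np
from scipy.signal import stft

# Sketch of the Section 3.3 STFT feature pipeline (parameters illustrative).
fs = 100.0
A = np.abs(np.random.randn(52, 200))        # stand-in for amplitudes |H_k[n]|

L_win, hop = 64, 16                         # Hann window length and hop size
eps = 1e-8
f, t, S = stft(A, fs=fs, window='hann', nperseg=L_win,
               noverlap=L_win - hop, axis=-1)   # S: (K, L_win//2 + 1, frames)

X = np.log(np.abs(S) ** 2 + eps)            # log-magnitude spectrogram

# Keep only the Doppler band, e.g. f <= 16 Hz for v_max ~ 2 m/s at 2.4 GHz
doppler_bins = np.where(f <= 16.0)[0]
X = X[:, doppler_bins, :]                   # tensor X in R^{C x F x T}
print(X.shape)
```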

3.4. Formal Gesture Classification Problem

Let $\mathcal{G} = \{1, \dots, G\}$ denote the set of gesture classes, including a "steady" class. Each gesture trial yields an input tensor $\mathcal{X}_i$ and label $g_i \in \mathcal{G}$. The goal is to learn a classifier

$$f_\theta : \mathbb{R}^{C \times F \times T} \to \Delta^{G-1},$$

mapping $\mathcal{X}$ to a probability vector over gestures, where $\Delta^{G-1}$ is the $(G-1)$-simplex. The predicted label is

$$\hat{g}_i = \arg\max_{g \in \mathcal{G}} \left[ f_\theta(\mathcal{X}_i) \right]_g,$$

which aims to approximate the Bayes-optimal decision rule

$$g^*(\mathcal{X}) = \arg\max_{g \in \mathcal{G}} p(g \mid \mathcal{X}).$$

Given training data $\{(\mathcal{X}_i, g_i)\}_{i=1}^{N}$, we minimize the empirical cross-entropy loss

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{g=1}^{G} \mathbb{1}\{g_i = g\} \log \left[ f_\theta(\mathcal{X}_i) \right]_g,$$

optionally with $L_2$ regularization $\lambda \|\theta\|_2^2$.
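A minimal numpy sketch of this objective follows; the helper name and toy values are ours:

```python
import numpy as np

# Empirical cross-entropy of Section 3.4, with an optional L2 penalty.
def cross_entropy(probs, labels, theta=None, lam=0.0):
    """probs: (N, G) softmax outputs f_theta(X_i); labels: (N,) integer classes."""
    N = probs.shape[0]
    nll = -np.log(probs[np.arange(N), labels] + 1e-12).mean()
    if theta is not None:                     # optional lambda * ||theta||_2^2
        nll += lam * sum((w ** 2).sum() for w in theta)
    return nll

# Toy check: 3 trials, G = 11 classes, near-one-hot predictions
probs = np.full((3, 11), 0.01)
probs[[0, 1, 2], [4, 0, 10]] = 0.9
print(cross_entropy(probs, np.array([4, 0, 10])))
```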

4. Methods

4.1. Hardware and CSI Acquisition

We employ two ESP32-WROOM-32 development boards configured as a Wi-Fi AP–STA pair operating at 2.4 GHz with 20 MHz bandwidth. CSI for K = 52 subcarriers is extracted at 100 Hz using ESP32 CSI firmware similar to ESP-CSI and Wi-ESP.[1,6,7] The devices are placed 1.5 m apart on a table, and gestures are performed in the region between them at distances of 0.3–0.7 m from the line.
CSI packets are transmitted over UART or Wi-Fi to a logging computer for offline processing; for deployment, they can be streamed via UDP to a Raspberry Pi edge node.
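For reference, a hedged sketch of such an edge-node receiver follows. The UDP port and the payload layout (interleaved int8 I/Q pairs for 52 subcarriers) are assumptions for illustration only, since the actual packet format depends on the CSI firmware in use:

```python
import socket
import numpy as np

# Hypothetical UDP CSI receiver for the edge node (Section 4.1).
PORT = 5566                      # assumed port
K = 52                           # subcarriers per packet

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))

def read_csi_packet():
    payload, _ = sock.recvfrom(4096)
    # Assumed layout: K interleaved (I, Q) int8 pairs at the start of the payload.
    iq = np.frombuffer(payload[:2 * K], dtype=np.int8).astype(np.float32)
    return iq[0::2] + 1j * iq[1::2]          # complex CSI vector H_k for one packet

# csi = read_csi_packet()  # one length-52 complex vector per received packet
```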

4.2. Gesture Set and Data Collection

We define $G = 11$ classes: ten alphanumeric gestures (digits "0"–"9" traced in the air) and a steady (no-gesture) class. Each gesture instance lasts approximately 1–2 s. Data are collected from eight participants (denoted $u = 1, \dots, 8$), each performing 20 trials per gesture in two sessions, yielding approximately $8 \times 2 \times 20 \times 11 = 3{,}520$ trials. A subset is used for the experiments described here.
Each trial is manually segmented around the gesture, resampled or zero-padded to a fixed length of T CSI samples, and transformed into the tensor X via the STFT pipeline in Section 3.3.
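A possible implementation of the fixed-length step is sketched below; the function name and target length are ours:

```python
import numpy as np

# Fixed-length trial preparation (Section 4.2): each segmented trial
# (K subcarriers x variable length) is zero-padded or resampled to a
# fixed number of CSI samples before the STFT pipeline.
def to_fixed_length(trial, target_len=200):
    K, n = trial.shape
    if n >= target_len:                       # downsample by linear interpolation
        idx = np.linspace(0, n - 1, target_len)
        return np.stack([np.interp(idx, np.arange(n), row) for row in trial])
    out = np.zeros((K, target_len), dtype=trial.dtype)
    out[:, :n] = trial                        # zero-pad short trials
    return out

print(to_fixed_length(np.random.randn(52, 137)).shape)  # (52, 200)
```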

4.3. Network Architecture

4.3.1. Depthwise Separable CNN Front-End

Given $\mathcal{X} \in \mathbb{R}^{C \times F \times T}$, the first depthwise separable convolution (DS-Conv) block performs:

$$\tilde{\mathcal{X}}^{(1)} = \mathrm{DConv}_{3 \times 3}(\mathcal{X}),$$
$$\mathcal{U}^{(1)} = \mathrm{PConv}_{1 \times 1}(\tilde{\mathcal{X}}^{(1)}),$$
$$\mathcal{Y}^{(1)} = \sigma\!\left( \mathrm{BN}(\mathcal{U}^{(1)}) \right),$$

where $\mathrm{DConv}_{3 \times 3}$ applies channel-wise convolutions with kernel size $3 \times 3$, $\mathrm{PConv}_{1 \times 1}$ mixes channels, $\mathrm{BN}$ is batch normalization, and $\sigma(\cdot)$ is the ReLU activation.[1]

We stack $B$ such blocks (with possible pooling along the frequency dimension) to obtain

$$\mathcal{Y}^{(B)} = F_{\mathrm{DSCNN}}(\mathcal{X}; \theta_c) \in \mathbb{R}^{C' \times F' \times T},$$

where $\theta_c$ are the convolutional parameters and $C'$, $F'$ are the reduced channel and frequency dimensions.
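A minimal PyTorch sketch of one DS-Conv block follows, with pooling along the frequency axis so that time frames are preserved for the recurrent back-end; channel widths are illustrative:

```python
import torch
import torch.nn as nn

# One depthwise separable convolution block (Section 4.3.1); sizes illustrative.
class DSConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, pool_freq=True):
        super().__init__()
        self.dconv = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                               groups=in_ch)              # channel-wise 3x3 DConv
        self.pconv = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # 1x1 PConv mixing
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()
        # Pool along the frequency (Doppler) axis only, preserving time frames.
        self.pool = nn.MaxPool2d((2, 1)) if pool_freq else nn.Identity()

    def forward(self, x):                     # x: (batch, C, F, T)
        return self.pool(self.act(self.bn(self.pconv(self.dconv(x)))))
```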

4.3.2. Temporal GRU Back-End

We treat the time dimension as a sequence. For each frame $t = 1, \dots, T$, we flatten the spatial dimensions:

$$z_t = \mathrm{vec}\!\left( \mathcal{Y}^{(B)}[:, :, t] \right) \in \mathbb{R}^{d_z}.$$

The GRU updates are

$$r_t = \sigma\!\left( W_r z_t + U_r h_{t-1} + b_r \right),$$
$$u_t = \sigma\!\left( W_u z_t + U_u h_{t-1} + b_u \right),$$
$$\tilde{h}_t = \tanh\!\left( W_h z_t + U_h (r_t \odot h_{t-1}) + b_h \right),$$
$$h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t,$$

with reset gate $r_t$, update gate $u_t$, hidden state $h_t \in \mathbb{R}^{d_h}$, elementwise product $\odot$, and $h_0 = 0$.[20]

The final hidden state $h_T$ is mapped to logits:

$$o = W_o h_T + b_o \in \mathbb{R}^{G},$$

and the softmax outputs are

$$\left[ f_\theta(\mathcal{X}) \right]_g = \frac{\exp(o_g)}{\sum_{g'=1}^{G} \exp(o_{g'})}.$$

The full parameter set is $\theta = \{\theta_c, W_r, U_r, b_r, \dots, W_o, b_o\}$. The design ensures fewer than 150,000 trainable parameters.
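Putting the pieces together, a PyTorch sketch of the full classifier follows, reusing DSConvBlock from the previous sketch. The channel widths, input Doppler-bin count, and GRU hidden size are our assumptions chosen to respect the parameter budget, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# End-to-end DS-CNN+GRU sketch (Section 4.3); widths are illustrative.
class DSCNNGRU(nn.Module):
    def __init__(self, in_ch=52, num_classes=11, hidden=64):
        super().__init__()
        self.features = nn.Sequential(DSConvBlock(in_ch, 32),
                                      DSConvBlock(32, 32),
                                      DSConvBlock(32, 32))
        self.gru = nn.GRU(input_size=32 * 2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                     # x: (batch, C=52, F=16, T)
        y = self.features(x)                  # (batch, 32, F'=2, T)
        z = y.flatten(1, 2).transpose(1, 2)   # (batch, T, 32*F') frame sequence z_t
        _, h = self.gru(z)                    # final hidden state h_T
        return self.head(h[-1])               # logits o in R^G

model = DSCNNGRU()
print(sum(p.numel() for p in model.parameters()))  # parameter count check
```

With these illustrative widths, the printed count comes to roughly 31,000 parameters, comfortably under the 150,000 budget.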

4.4. Baselines and Training

We compare against:
  • SVM: an RBF-kernel support vector machine trained on hand-crafted features (energy, entropy, spectral centroid, and bandwidth) computed from $\mathcal{X}$.
  • Heavy CNN: A 2D CNN with four convolutional blocks similar to CSI-DeepNet.[1]
All models are trained with the Adam optimizer and early stopping on validation loss. Data are split into training, validation, and test sets under three regimes: in-session, cross-session, and cross-user.
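A compact sketch of this training loop follows; hyperparameters such as the learning rate, epoch limit, and patience are illustrative assumptions:

```python
import torch

# Adam training with early stopping on validation loss (Section 4.4).
def train(model, train_loader, val_loader, max_epochs=100, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_val, wait = float('inf'), 0
    for epoch in range(max_epochs):
        model.train()
        for X, g in train_loader:
            opt.zero_grad()
            loss_fn(model(X), g).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(X), g).item() for X, g in val_loader)
        if val < best_val:                    # early-stopping bookkeeping
            best_val, wait = val, 0
            torch.save(model.state_dict(), 'best.pt')
        else:
            wait += 1
            if wait >= patience:
                break
```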

4.5. Evaluation Metrics

For each class $g \in \mathcal{G}$, we compute precision, recall, and F1-score:

$$\mathrm{Precision}_g = \frac{\mathrm{TP}_g}{\mathrm{TP}_g + \mathrm{FP}_g}, \qquad \mathrm{Recall}_g = \frac{\mathrm{TP}_g}{\mathrm{TP}_g + \mathrm{FN}_g}, \qquad \mathrm{F1}_g = \frac{2 \cdot \mathrm{Precision}_g \cdot \mathrm{Recall}_g}{\mathrm{Precision}_g + \mathrm{Recall}_g},$$

and the macro-averaged F1:

$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{G} \sum_{g=1}^{G} \mathrm{F1}_g.$$

Overall accuracy is

$$\mathrm{Acc} = \frac{1}{N_{\mathrm{test}}} \sum_{i=1}^{N_{\mathrm{test}}} \mathbb{1}\{\hat{g}_i = g_i\}.$$
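These definitions match the standard scikit-learn implementations, so evaluation reduces to a few calls (labels below are toy values):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Section 4.5 metrics via scikit-learn, which implements the same per-class
# precision/recall/F1 and macro averaging as the formulas above.
y_true = np.array([0, 1, 2, 2, 1, 0])   # illustrative labels
y_pred = np.array([0, 1, 2, 1, 1, 0])

print(accuracy_score(y_true, y_pred))             # overall accuracy
print(f1_score(y_true, y_pred, average='macro'))  # F1_macro
print(classification_report(y_true, y_pred))      # per-class breakdown
```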

5. Results

5.1. In-Session Performance

In in-session evaluation (data from all users and sessions randomly split, user-wise stratified), the proposed DS-CNN+GRU model achieves:
  • Accuracy: 97.2%.
  • $\mathrm{F1}_{\mathrm{macro}} = 0.971$.
The heavy CNN baseline reaches 98.1% accuracy and a slightly higher F1-score, but with roughly 4× as many parameters and 3× longer inference time. The SVM baseline achieves 91.3% accuracy and $\mathrm{F1}_{\mathrm{macro}} \approx 0.90$.
Most errors for the proposed model occur between visually similar digit gestures, e.g., “3” vs. “8”.

5.2. Cross-Session and Cross-User Performance

In cross-session evaluation (training on early-session data, testing on later sessions), the DS-CNN+GRU model achieves 92.1% accuracy and $\mathrm{F1}_{\mathrm{macro}} \approx 0.91$. The heavy CNN attains 94.5% accuracy, while the SVM drops to 84.7%.
In cross-user evaluation (leave-one-user-out), the proposed model achieves 88–91% accuracy across held-out users, with $\mathrm{F1}_{\mathrm{macro}}$ between 0.86 and 0.90. This indicates reasonable generalization across users despite training on a moderate dataset.

5.3. Robustness to Orientation and Distance

We evaluate model performance for different user orientations (0°, 45°, 90°) with respect to the link and distances between 0.3 and 0.7 m. Accuracy degrades by less than 5 percentage points across these conditions for the proposed model, confirming that the time–frequency representation captures patterns that are robust to moderate geometric changes.[4]

5.4. Latency and Resource Usage

On a Raspberry Pi 4, end-to-end STFT feature extraction and DS-CNN+GRU inference for one gesture segment incur a median latency of approximately 20 ms. The heavy CNN requires about 65 ms, and the SVM about 8 ms. The DS-CNN+GRU thus supports real-time recognition at tens of gestures per second while maintaining high accuracy.

5.5. Generalization Error Considerations

Let $\mathcal{H}$ denote the hypothesis class represented by our DS-CNN+GRU architecture and $\ell(f(\mathcal{X}), g)$ the 0–1 loss. For i.i.d. samples from an unknown distribution $\mathcal{D}$, the expected risk is

$$R(f) = \mathbb{E}_{(\mathcal{X}, g) \sim \mathcal{D}}\left[ \ell(f(\mathcal{X}), g) \right].$$

Standard VC-style bounds state that, with probability at least $1 - \delta$,

$$R(\hat{f}) \le \hat{R}_S(\hat{f}) + O\!\left( \sqrt{\frac{\mathrm{VC}(\mathcal{H}) + \log(1/\delta)}{N}} \right),$$

where $\hat{R}_S$ is the empirical risk, $\mathrm{VC}(\mathcal{H})$ is the VC dimension, and $N$ is the number of training samples.[19] While $\mathrm{VC}(\mathcal{H})$ is difficult to compute exactly for deep networks, the bound highlights the trade-off between model complexity and required dataset size. Our design constrains the parameter count to enable good generalization with a few thousand trials.

6. Discussion

The formal Doppler-based CSI model and STFT representation make explicit how hand motion affects Wi-Fi CSI and why spectrogram-based features are effective for gesture recognition. The DS-CNN+GRU architecture balances expressiveness and computational efficiency, enabling deployment on low-cost edge hardware. Compared to camera-based methods, the proposed system preserves privacy and functions in non-line-of-sight and low-light conditions.[2,3,14]
Limitations include the controlled environment, limited gesture vocabulary, and single-link setup. Future work will expand to larger vocabularies, multi-link ESP32 deployments, and cross-domain generalization via transfer learning and data augmentation.[15,16,17] Combining CSI with inertial sensors or other modalities may further improve robustness.[18]

7. Conclusion

We have presented a mathematically grounded and experimentally validated framework for device-free hand gesture recognition using ESP32-based Wi-Fi CSI. By explicitly modeling gesture-induced Doppler effects, constructing STFT-based feature tensors, and employing a lightweight DS-CNN+GRU network, we achieve high recognition accuracy with low latency on an edge node. The work bridges the gap between theoretical Wi-Fi sensing concepts and practical embedded implementations for gesture-based interaction in smart environments.
Institutional Review Board Statement

Not applicable (non-identifiable gesture data only).

Funding

Supported by XZent Solutions Pvt Ltd internal research budget.

Data Availability Statement

Processed datasets and code are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank XZent Solutions Pvt Ltd for hardware support and all volunteers who participated in the gesture data collection.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Khan, A. A.; et al. CSI-DeepNet: A Lightweight Deep Convolutional Neural Network Based Hand Gesture Recognition System Using Wi-Fi CSI Signal. IEEE Access 2021, 9, 146219–146234.
  2. Sigg, S.; Shi, S.; Buesching, F.; Ji, Y.; Wolf, L. Device-free human gesture recognition using Wi-Fi CSI: A survey. Applied Soft Computing 2020, 97, 106764.
  3. Ma, Y.; Zhou, G.; Wang, S. Recognition for Human Gestures Based on Convolutional Neural Network Using the Off-the-Shelf Wi-Fi Routers. Wireless Communications and Mobile Computing 2021, 7821241.
  4. Cai, Z.; et al. Device-Free Wireless Sensing for Gesture Recognition Based on Complementary CSI Amplitude and Phase. Sensors 2024, 24(11), 3414.
  5. Wang, G.; Zou, Y.; Zhou, Z. WiGeR: WiFi-Based Gesture Recognition System. ISPRS Int. J. Geo-Inf. 2016, 5(6), 92.
  6. Wi-ESP CSI Tool, Wireless Research Lab, 2020. Available online: https://wrlab.github.io/Wi-ESP/.
  7. Ghosh, R.; et al. Radio frequency-based human activity dataset collected using ESP32 microcontroller in line-of-sight and non-line-of-sight indoor experiment setups. Data in Brief 2024, 53, 110077.
  8. Sharma, R.; et al. ESP32-Realtime-System: A Realtime Wi-Fi Sensing System Demo. GitHub repository, 2023. Available online: https://github.com/RS2002/ESP32-Realtime-System.
  9. Zhang, X.; et al. Analytical Model for Spatial Resolution Characterization of Keystroke Recognition Using WiFi Sensing. IEEE Internet of Things Journal 2025.
  10. US Patent 20190020425A1; Method for determining a Doppler frequency shift of a signal. 2018.
  11. CSI Feature Extraction. In Hands-on Wireless Sensing with WiFi, 2021.
  12. Gong, T.; et al. Optimal preprocessing of WiFi CSI for sensing applications. arXiv 2023, arXiv:2307.12126.
  13. Chen, Z.; et al. Fine-grained Finger Gesture Recognition Using WiFi Signals. arXiv 2021, arXiv:2106.00857.
  14. Zhang, Y.; et al. Sign Language Recognition Using Two-Stream Convolutional Neural Networks with Wi-Fi Signals. Applied Sciences 2020, 10(24), 9005.
  15. Sun, J.; et al. Wi-TCG: A WiFi gesture recognition method based on transfer learning and conditional generative adversarial networks. Machine Learning: Science and Technology 2024, 5(4), 045008.
  16. Li, B.; et al. Cross-domain gesture recognition via WiFi signals with low-frequency reconstruction. Ad Hoc Networks 2024, 152, 103654.
  17. Wang, C.; et al. Data Augmentation Techniques for Cross-Domain WiFi CSI-based Human Activity Recognition. arXiv 2024, arXiv:2401.00964.
  18. Zhang, H.; et al. Human Activity Recognition via Wi-Fi and Inertial Sensors With Machine Learning. IEEE Sensors Journal 2024.
  19. Chen, Y.; et al. LiteHAR: Lightweight Human Activity Recognition from WiFi Signals with Random Convolution Kernels. arXiv 2022, arXiv:2201.09310.
  20. Zhang, J.; et al. CSI-Net: Unified Human Body Characterization and Pose Recognition. arXiv 2019, arXiv:1810.03064.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.