1. Introduction
Human-computer interaction (HCI) has been a prominent research area over the past few decades, with a growing number of studies published each year and a broad range of applications [1,2]. As software and hardware have evolved, HCI solutions have progressed from 2D, pointer-based graphical interfaces to modern wireless systems [3]. In parallel, advances in sensing devices and their increasing availability have expanded the set of modalities to include touch, gestures, eye tracking, body movements, biosignals, and more [4]. Together, these have created new opportunities to study and design more useful, efficient, and personalized interaction frameworks.
Among other modalities, gestures are the most natural form of communication, making them a valuable means of interaction with digital systems [5]. Gestures enable contactless operation, unrestricted movement, and a strong sense of presence in immersive environments [6]. With the wide range of sensing and tracking hardware available today, many types of gestures can be considered, such as static hand poses; hand, arm, or wrist movements recorded as simple paths represented as primitive shape images; or staged, discrete paths represented by vector trajectories [3,7,8,9]. Going beyond hand gestures, full-body postures and movements, including micro-movements, can also be interpreted as gestures, further expanding the range of methods and their potential applications [10].
Captured gesture data can be static or dynamic, and systems can use them independently or in combination [11,12,13].
13]. Static data lacks temporal information; therefore, the input for static methods typically includes images, depth data, gesture-trail images, spatial coordinates, radar images, and other relevant data. Dynamic methods usually combine static data with temporal information, most commonly with the timestamps of captured frames, and organize the collected frames into temporally ordered sequences for further analysis.
Gesture recognition also involves two contrasting training and inference strategies. The first is user-independent [14,15,16,17], in which a single model is trained on multiple datasets from various users. The second is a user-specific strategy, in which a separate model is trained for each user [6,18,19,20,21]. The second approach enables more flexible and personalized communication, which is important when users have different skills, abilities, or even disabilities [22].
The aim of this study was to design, develop, and evaluate a software research framework for static in-air two-dimensional (2D) stroke-based wrist-forearm personalized gesture recognition using convolutional neural networks (CNNs) and the Vicara Kai™ wearable controller. The target gestures are stored as binary images consisting of linear segments connecting points scanned by the capture device. The environment includes the controller purchased with its SDK (KaiSDK) and the Kai.WebsocketModule (a NuGet module from Vicara). The author's contributions include a specialized MyKai.dll library; the Main Module (a Windows™ UI application written in the .NET Framework); the dataset; a research-oriented Python component (MyKaiPy); the models and statistics database; and, finally, the End User Module (UI). The primary functions of the framework are to capture gestures, perform learning and inference processes, evaluate the system, and optimize the convolutional network. In addition, the software supports scanning incoming controller messages, profiling the UI modules, and configuring their vital parameters. The system was evaluated using 1,125 directly captured gesture trails (5 individuals × 15 classes × 15 samples) and 1,125 images of the same gestures smoothed using unions of cubic Bézier curves. The dataset also includes 100 rapid, casual gestures that can occur randomly during gesture capture and that the system should reject. The research procedure involved developing a lightweight CNN architecture, determining the image size, applying augmentations, setting the dropout rate, optimizing the number of learning samples (critical for personalized approaches), evaluating recognition performance on a test set, and examining the system's robustness against non-gesture (casual) inputs using receiver operating characteristic (ROC) curve analysis.
1.1. Related Work
This section briefly summarizes recent related work on gesture recognition systems. Given the domain’s highly diverse landscape, the review covers static, dynamic, and mixed approaches across various system types. It includes studies on hand movements — referred to in the literature as mid-air or air-drawn gestures — and gesture poses. Similarly, the different devices used to capture hand movements are considered. Overall, the cited works align with at least one feature of the approach presented in this study.
Lin et al. [18] proposed a dynamic personalization approach for hand-gesture recognition using data collected during 2D game navigation controlled by a META sEMG wristband. Their method employed a multi-armed bandit algorithm to personalize gesture recognition for each user using online, user-specific data collected during gameplay. The results demonstrated that the personalized model improved accuracy and allowed some participants who had initially failed with the baseline (user-independent) model to succeed. System performance was assessed using the false navigation percentage metric, which was significantly reduced. Additionally, user experience statistics, such as self-reported feelings of success and frustration, indicated an improved user experience following fine-tuning sessions.
In [19], Xu et al. investigated how to enable user-defined hand gesture customization on a wrist-worn device without degrading existing gesture recognition performance (in their work, gestures corresponded to hand poses). As part of the presented approach, the authors first collected a large-scale dataset of accelerometer and gyroscope data to train a robust, user-independent gesture recognition model that achieved high accuracy while remaining resilient to non-gesture (casual) movements. Using this pre-trained model, the authors developed a customization framework that enabled users to add their own gestures. The framework was evaluated using 12 novel gestures from 20 participants, showing that new gestures can be learned with high accuracy while maintaining the performance of the original hand posture set. The study also examined usability, demonstrating that users could create and use custom gestures with minimal effort and reported subjective satisfaction.
The study by Wang et al. [20], similar to the previously summarized work, explores hand poses in the context of gesture personalization. The paper presents a camera-based system that allows users to define their own gestures and map them to textual inputs. The authors investigated the challenge of variation in gesture styles, which limits the effectiveness of user-specified gesture datasets. In their approach, gestures are personalized using a lightweight Multilayer Perceptron (MLP) trained on a particular user's data. The study poses two questions: the first addresses the feasibility of training a neural network with a small dataset, and the second examines user acceptance of the developed interface, including users without a technical background. To answer them, the authors developed a UI-based system capable of detecting and recognizing hand gestures represented by a 21-point hand skeleton, whose coordinates were used to train three models: a 3-layer MLP with the Sparse Categorical Crossentropy (SCC) loss function, a double 3-layer MLP with SCC, and a double 3-layer MLP with a Contrastive Loss (CL) function. The third solution achieved the best performance, with 98.48% accuracy and a loss of 0.0430 after 369 epochs across 6 learned gestures. The second part of the system evaluation was a user study assessing attractiveness, efficiency, perspicuity, dependability, stimulation, and novelty. For each of these factors, users were asked several detailed questions, scored from 1 to 10. Overall, the results indicated a high level of system acceptance (average scores of approximately 7) and revealed potential avenues for further system development. In summary, the research path described by Wang et al., particularly their first question, aligns well with elements of the current study (see Sections 4.1 and 4.2).
Another personalization framework was proposed by Zou et al. [6]. The resulting solution, Gesture Builder, allows users to define custom dynamic hand gestures using only three demonstrations within a VR environment. The system decomposes gestures into static postures and wrist trajectories, which are represented using an unsupervised Vector Quantized Variational Autoencoder (VQ-VAE) [23]. (The k-means and DeepDPM [24] algorithms were also examined but achieved lower precision.) The VQ-VAE model is pre-trained on a large, unlabeled dataset of hand postures, which are clustered into discrete templates (latent labels). The next stage of the pipeline is customization, in which 3D hand joint coordinates serve as input and are assigned to the most similar latent labels, yielding a latent label sequence that is then transformed into a sequence pattern. If a newly defined gesture conflicts with existing gestures, the user may choose which version they prefer. The final part is the gesture-recognition process, which is partially similar to customization. The evaluation experiments were conducted with 16 participants using the Oculus Quest 2 headset. The reported overall accuracy was 90.08% (Figure 13a in the cited work). Independently, the system's usability was assessed using a questionnaire based on the System Usability Scale (SUS) [25]. Different aspects of the user experience, such as satisfaction, usability, and learnability, were rated at approximately 80±10%. Additionally, other aspects of the application, such as workload, customization capability, and user feedback, were assessed.
A personalized touch-gesture solution was presented by Ma et al. [21]. It utilized Near Field Communication (NFC) tags for back-of-device interaction with a smartphone. An NFC tag is a small, passive chip with an antenna that stores data and communicates with an NFC-enabled device (such as a smartphone) when brought close. The authors investigated rectangular tags of different sizes, used in pairs and arranged in T-shape layouts, mounted on the back of a typical smartphone cover. A USRP (Universal Software Radio Peripheral) served as the PC communication interface. The designed system exploits the phenomenon in which the user's finger impedance alters the amplitude and phase of the tag's backscattered NFC signal, which is preprocessed and used as input for machine learning algorithms. The system's performance was evaluated on a dataset of 10 individuals and an 8-gesture set for both user identification and personalized gesture recognition; an overall F1-score of 95.74% was reported.
Many researchers in the Hand Gesture Recognition (HGR) domain aim to assist people with impairments. Khanna et al. [22] presented an HGR system dedicated to blind users, utilizing smartwatches as the capture device. The authors found that blind users' gestures are noisier, slower, and include more pauses than those of sighted individuals, and that regular HGR systems are ineffective in this case. They conducted a comparative user study with blind and sighted participants, including three gesture categories: forearm movements, compound movements, and shape-like movements. They then proposed a gesture recognition system that relies solely on gyroscope data and focuses on short, user-invariant "micro-movements" (the gesture nucleus). Gestures in that approach were represented by direct device signals and 3D trajectories, and recognition required an ensemble model combining a multi-view CNN and a geometric-feature-based classifier. The system evaluation with blind users yielded an accuracy of 92%, a precision of 92%, and a recall of 91%.
In [26], Willems et al. presented feature-based recognition of multi-stroke gestures based on online pen trajectories. The study explores four gesture datasets (NicIcon, UNIPEN 1b, UNIPEN 1d, and IRONOFF) and applies four feature sets (global features g-48, stroke-level features, and coordinate-level features c-30 and c-60) together with three classifiers: Support Vector Machine (SVM), MLP, and Dynamic Time Warping (DTW). The performance comparison across feature sets and classifiers shows results that do not differ significantly, whereas performance across datasets varies slightly: the highest score was achieved for NicIcon and global features with SVM (99.2%), while the lowest was achieved for IRONOFF and c-60 with MLP (88.4%). Although the study is not recent, it provides a strong example of explicit feature engineering for 2D stroke gestures, offering a useful contrast to CNN-based methods that learn features implicitly.
Bernardos et al. [27] went a step further than gesture recognition, proposing a syntax grounded in human-like language, inspired by the vocative case and the imperative mood. Each command consists of a triplet of words: the first two words are obligatory, and the third is optional. The first pair of words determines the subject and the action to be performed on it. The third word is a supplementary part of the command that specifies additional activity details. The commands are associated with mid-air gestures that, together, form a two-level communication system, enabling natural interaction and creating an environment capable of controlling VR worlds, intelligent homes, assisted living systems, and more. The system was evaluated for user experience in three tests. The first test used a five-point Likert scale [28], ranging from strong disagreement to strong agreement. In the second test, users' social acceptance of the system was measured using a 10-point Likert scale, with users acting in different environments and in front of various audiences. In the third test, the System Usability Scale (SUS) [25] was used to assess the usability of the system running on three input devices: a camera, a phone, and a watch. Notably, the approach presented by Bernardos et al. could be a suitable companion to the method in this study, creating a potential real-world implementation scenario and paving the way for the next, more abstract level of gesture interpretation.
1.2. The Vicara Kai™ Controller
The Vicara Kai gesture-based wearable controller was launched as a project on the Indiegogo crowdfunding platform in May 2018 and became available for preorder. Independent tech and design sites also featured the Kai controller in July 2018, confirming that its first public exposure and description in the press occurred in mid-2018 [29]. Following this debut, the Kai gained wider attention as a device capable of interpreting simple hand gestures and postures to operate computers, VR/AR applications, and other software. Since that time, Vicara has continued to develop the hardware and software platform, refining its motion-tracking capabilities and positioning the Kai as a novel interface for human-computer interaction. The controller was also presented to the IT research community at the 17th International Conference on Computers Helping People (ICCHP), in the framework of the Young Researchers' Consortium (YRC), by Jankowski (paper not published, attached as an additional file), who implemented a program for text entry using some of the fixed, built-in capabilities of the controller. The Kai was also mentioned as an HCI wearable device in the study by Vedhagiri et al. [30], but it was not investigated in that work. At present, Vicara is building its brand as an entity operating in the field of generative artificial intelligence.
The Kai controller relies on embedded sensors and combines an inertial measurement unit (IMU) with multiple electro-optical detectors to capture hand motion, hand orientation, and finger activity. The captured data is transmitted wirelessly via Bluetooth Low Energy (BLE) through a USB dongle and processed by the Vicara Motion Engine (VME), which serves as the first level of interpretation of the sensor data within the Natural User Interaction (NUI) paradigm. The controller is purchased with two software components: the Kai Control Center and the Kai SDK. The Kai Control Center supports two main use cases: calibrating the controller and assigning actions to the simple gestures it recognizes. Calibration has two stages. During the first stage, the user is asked to position the sensor toward the screen and the desktop surface and to wait for the controller to perform its computations. In the second stage, the user must perform the simple gestures shown in the video: "swipe-up", "swipe-down", "swipe-right", "swipe-left", "swipe-side-up", "swipe-side-down", "swipe-side-right", and "swipe-side-left". The first four gestures are made with the hand directed toward the surface, and the next four with the hand held vertically. The second use case implemented by the Kai Control Center is assigning the specified gestures to actions, which users can record using the Record function.
The second software component of the Kai controller environment is the Kai SDK. It is a Windows system tray application that must run in the background when applications, including the Kai Control Center, need to communicate with the controller. The communication framework uses the JSON format to send messages that an application can handle. The application programming interface (API) is available for C# .NET, Python 3.x, and Node.js. For this study, the C# .NET Framework NuGet package Kai.WebsocketModule was used together with the .NET Framework WinForms library and Visual Studio 2022. The API supports 10 types of sensor data, as shown in Table 1. These data types are called capabilities in the API's vocabulary and can be enabled or disabled depending on application requirements. Kai controller messages can be handled by .NET Windows functions by adding a function delegate to a particular Kai object. For example, to invoke the function GestureHandler on incoming gesture data, the function has to be set as a handler of the Kai object as follows: Kai.Gesture += GestureHandler, where GestureHandler is a function with parameters such as object sender, EventArgs args. This is a basic scheme that was substantially expanded in the MyKaiLib.dll library created by the author for the purposes of this study. According to the author's tests, the latest version of the API does not support all capabilities. In addition, finger detection strongly depends on the device's battery level; when the battery is low, the data are unreliable (not all bent fingers are detected). Despite these drawbacks, the device was very useful for creating the gesture recognition system described in this work, which aimed to extend the usability of the controller by enabling the recognition of more complex, user-defined gestures based on the position of the controller given by its pitch, yaw, and roll (PYR) data.
1.3. Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a widely used machine learning technique and require no further introduction. Since the first publications, such as that by LeCun et al. [31], CNNs have become a game-changer, as evidenced by AlexNet's win in the 2012 ImageNet competition by Krizhevsky, Sutskever, and Hinton [32], an event that started the modern history of deep neural networks and their applications in various areas, including solutions that are today referred to as "artificial intelligence". CNNs are also widely used in gesture recognition, enabling robust recognition without the laborious feature selection required by other classifiers. In simple terms, CNNs, in contrast to Multilayer Perceptrons (MLPs), include several types of layers, each with a different function. The key concept is the convolutional layer, which includes learnable filters that automatically extract image features and pass the results to deeper layers. A network also includes fully connected (dense) layers enabling high-level reasoning; dropout layers that randomly disable neuron connections to prevent overfitting; pooling layers that reduce the spatial resolution of feature maps, lowering computational cost and providing a degree of translation invariance; activation layers that introduce nonlinearity into the model; and a softmax layer that converts the network output into a probability distribution over classes.
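To make these layer roles concrete, the sketch below assembles them into a small Keras model. The layer stack mirrors the network pipeline summarized later in Table 2 (three Conv2D/MaxPooling2D blocks, a 128-unit dense layer, dropout, and a softmax output); the helper name and the compilation settings are illustrative assumptions rather than the exact training configuration used in this study.

```python
# Minimal Keras sketch of the layer types described above; the layer stack follows
# the network pipeline in Table 2, while the compile settings are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gesture_cnn(num_classes: int, input_size: int = 64, dropout_rate: float = 0.4):
    model = models.Sequential([
        layers.Input(shape=(input_size, input_size, 1)),  # single-channel gesture image
        layers.Conv2D(32, 3, activation="relu"),          # learnable filters extract features
        layers.MaxPooling2D(),                            # reduce spatial resolution
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),             # high-level reasoning
        layers.Dropout(dropout_rate),                     # regularization against overfitting
        layers.Dense(num_classes, activation="softmax"),  # probability distribution over classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```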
2. Materials and Methods
2.1. Physical Environment Setup
A significant issue in gesture recording and inference is the setup of the physical environment, including the user's body position, hand location, possible hand movements, and the relative positions of the user and the computer screen. As shown in Figure 1, in the prepared environment, the user sits at the PC with the controller worn on their preferred hand, which is slightly clenched into a fist (see Figure 1a, label 1). The forearm can be supported on an armrest (see Figure 1a, label 2). In this position, the user must direct the controller towards the screen and control the gesture cursor, which can be moved within the fixed-size gesture viewport (see Figure 1b, label 1). The possible range of movement in this setup allows the user to move both the forearm and the wrist (see Figure 1b, label 2). For the best and smoothest gesture drawing, both degrees of freedom should be used, as it is hard to draw shapes effectively with either the forearm or the wrist alone.
2.2. Event Queue Decimation
The Kai controller generates asynchronous messages that can be mapped to .NET classes via the .NET event mechanism. Since there are multiple controller capabilities (see Section 1.2), a large stream of data is generated. When a given capability is active and handled, each message runs the handler method on a separate thread. This can lead to excessive resource consumption and delays due to increased memory requirements (capturing and storing more points) and parallel thread processing at the operating system level (handling more events simultaneously). Moreover, even when the messages are restricted to PYR events, there are still too many handler calls in a very short time, indicating that the sampling rate is too high.
For the above reasons, the event stream must be decimated, meaning that not every event should trigger the handler. However, it is important to note that users draw their gestures with different dynamics: some strokes are drawn slowly, while others are drawn quickly. If a gesture fragment is drawn faster than messages are processed, gesture quality suffers, and angular segments may appear in the gesture trails instead of smoother lines that better reflect the gesture shape. Conversely, when the number of messages is too large, there are too many gesture points to capture, store, and draw. Thus, the decimation factor (DF) is a vital system parameter that strongly influences the capture process and system usability, and a trade-off must be struck among shape quality, system performance, and memory requirements.
To determine the decimation factor accurately, two measurement series were conducted on a running system UI module. The first involved measuring the PYR event-triggering time interval as a function of DF: during the profiling session, the factor was increased by two every 20 event-handling cycles, and the time interval between two consecutive events was recorded; the results are shown in Figure 2a (as expected, the relationship is linear). The second part of the procedure was to measure the time required for the system to process all code between two handling procedures, including the procedure itself. The results obtained for 500 cycles are shown in Figure 2b. The dashed line indicates an average cycle duration of approximately 45 ms, implying a decimation factor of 4: only every fourth event is handled, and the remaining three are ignored. Tests performed by the author confirmed this approximate DF value.
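In code, the decimation reduces to a simple counter in the event handler: only every DF-th PYR message triggers the full capture logic. The Python sketch below illustrates the idea with hypothetical names (PyrDecimator, on_pyr_event); the actual implementation resides in the author's C# library and differs in detail.

```python
# Schematic sketch of PYR event-stream decimation (hypothetical names); only every
# DF-th event reaches the capture logic, the rest are dropped immediately.
DECIMATION_FACTOR = 4  # derived from the ~45 ms average handling-cycle measurement

class PyrDecimator:
    def __init__(self, factor: int = DECIMATION_FACTOR):
        self.factor = factor
        self.counter = 0

    def on_pyr_event(self, pitch: float, yaw: float, roll: float) -> None:
        """Called for every incoming PYR message."""
        self.counter += 1
        if self.counter % self.factor != 0:
            return  # skip this event cheaply, without touching the capture logic
        self.handle_pyr(pitch, yaw, roll)

    def handle_pyr(self, pitch: float, yaw: float, roll: float) -> None:
        # Placeholder for the real capture logic: map the orientation to a cursor
        # position and append a point to the current gesture trail.
        print(f"capture point from pitch={pitch:.2f}, yaw={yaw:.2f}, roll={roll:.2f}")
```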
2.3. The Dataset
The dataset used in this study was collected using the Main Module UI, described in detail in Section 3.1 (see Figure 5). The gesture capture process consisted of two stages. In the first stage, users were asked to draw gestures prompted by the system (Figure 3a). These prompts were intended to facilitate the acquisition process: a set of easy-to-draw shapes was presented to users (one at a time), and they were asked to draw a similar shape in their own way. In the second stage of gesture recording, users drew their own shapes without any prompt (Figure 3b). As a result of these two stages, the dataset comprises two primary partitions: gestures that were prompted (hereinafter referred to as the Prompted Set) and shapes defined and drawn by users (hereinafter referred to as the User Set). Eventually, during the study's research stage, both options could be evaluated for classification stability and efficacy. The key point is that both the Prompted Set and the User Set are treated as users' personalized gesture sets, since both reflect influences associated with their individual drawing styles.
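For the research component, each user's partition can then be loaded as an ordinary image dataset. The snippet below assumes, purely for illustration, a per-user directory layout such as dataset/prompted_set/user_01/<class>/<sample>.png and a helper named load_user_partition; the framework's actual file organization is not reproduced here.

```python
# Illustration only: loading one user's gesture images with an assumed directory
# layout (user_dir/<class_name>/<sample>.png); the real dataset layout may differ.
import tensorflow as tf

def load_user_partition(user_dir: str, image_size=(64, 64), batch_size=16):
    return tf.keras.utils.image_dataset_from_directory(
        user_dir,                  # e.g. "dataset/prompted_set/user_01"
        color_mode="grayscale",    # gesture trails are binary images
        image_size=image_size,
        batch_size=batch_size,
        shuffle=True,
        seed=123,
    )
```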
2.4. Bézier Approximations
As noted earlier, rapid cursor movements can affect the rendered gesture's shape. Although the phenomenon occurs relatively rarely, it was decided to develop a simple smoothing algorithm and evaluate its impact on recognition performance. The implemented solution uses Bézier curves to produce a smoothed gesture, and its mechanism is straightforward. First, the list of points is modified so that its length is divisible by four, with the additional requirement that the last point of each gesture is always preserved; if necessary, the preceding point or points are deleted. Then, every four consecutive points in the new point list are used to draw a cubic Bézier curve given by

B(t) = (1 - t)^3 P_0 + 3(1 - t)^2 t P_1 + 3(1 - t) t^2 P_2 + t^3 P_3,    t ∈ [0, 1],

where P_0, P_1, P_2, and P_3 are the points of the i-th point quadruple, the two control points (P_1 and P_2) lie on the gesture trail, and t is the parameter that controls how much influence the control points have. This is the simplest way to use Bézier curves for the intended purpose; nevertheless, it provides a noticeable improvement in shape. It should also be noted that placing the control points off the trail would require a much more sophisticated and time-consuming procedure. The described problem is illustrated in Figure 4, which shows gestures affected by fast strokes (a), their Bézier approximations (b), and both superimposed (c).
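A compact NumPy sketch of this smoothing step is given below. The data layout (a gesture as a list of (x, y) points), the trimming loop, and the helper names are assumptions made for illustration; the study's own implementation is part of the framework code.

```python
# Sketch of the Bezier smoothing described above (assumed data layout: a gesture is
# a list of (x, y) points). The point list is trimmed so its length is divisible by
# four while the last point is preserved, and each consecutive quadruple is rendered
# as a cubic Bezier curve whose two middle points serve as on-trail control points.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, samples: int = 20) -> np.ndarray:
    """Evaluate B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3."""
    t = np.linspace(0.0, 1.0, samples)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def smooth_gesture(points) -> np.ndarray:
    """Return a smoothed polyline built from per-quadruple Bezier curves."""
    pts = [np.asarray(p, dtype=float) for p in points]
    assert len(pts) >= 4, "a gesture needs at least four captured points"
    # Trim so the length is divisible by four; points just before the last one
    # are removed as needed so that the final point is always preserved.
    while len(pts) % 4 != 0:
        del pts[-2]
    curves = [cubic_bezier(*pts[i:i + 4]) for i in range(0, len(pts), 4)]
    return np.vstack(curves)

# Example: a jagged, fast-drawn fragment becomes a smoother trail.
trail = [(0, 0), (8, 3), (12, 14), (20, 15), (25, 13), (30, 4), (36, 2), (40, 0)]
print(smooth_gesture(trail).shape)  # (40, 2): 2 curves x 20 samples each
```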
2.5. Research Procedure
The general goal of this work was to create and evaluate a software framework for hand-stroke gesture learning and recognition using the Vicara Kai™ controller and a CNN classifier. The first activity in the research procedure was to lay the foundation for the library and the GUI module, enabling communication with the controller SDK and demonstrating how to receive and interpret controller input data. Once the software framework could read and transfer data correctly, the gesture-capturing interface was created, parameterized, and tested. Given the software's broad functional scope at this stage, this was an essential component of the project.
The next step of the research procedure was to collect the dataset. Gestures were recorded from five individuals aged 11 to 55 years. The obtained dataset comprises 15 classes per user and 15 samples per class, yielding 1,125 original samples in total (5 individuals × 15 classes × 15 samples) plus an equal number of Bézier approximations calculated from the gesture point coordinates. The first 10 gesture classes of each user were generated using shape prompts, whereas the remaining five were generated without prompting. In addition, 100 casual gestures are included for the ROC analysis experiment.
The following part of the project workflow involved preparing the software to develop an effective CNN pipeline. The starting point was a simple architecture inspired by the "Flowers" dataset example published in the TensorFlow documentation [33]. The decision to use this architecture as a starting point was based on the simplicity and versatility of the provided examples. However, flower images require a high level of detail, and with the much simpler gesture images the initial pipeline models tended to overfit. Because the gesture types considered in this study do not require such detailed analysis, the input layer size was reduced and dropout layers were applied. Since the personalized approach provides relatively few learning samples, the CNN pipeline was preceded by data augmentation, including random shifts, rotations, and scaling. The input and network pipeline investigation stage comprised four experiments (referred to as Experiments 1-4).
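The resizing and augmentation steps can be expressed with standard Keras preprocessing layers, as sketched below. The 64×64 target size and the 20% translation, rotation, and scaling factors follow the values reported in Tables 2 and 6; wrapping them in a single Sequential block applied only to the training data is an assumption made for illustration.

```python
# Sketch of the input pipeline (resizing, rescaling, random augmentation) using
# Keras preprocessing layers; the factor values follow Tables 2 and 6, while the
# exact wiring into the training loop is assumed.
import tensorflow as tf
from tensorflow.keras import layers

input_pipeline = tf.keras.Sequential([
    layers.Resizing(64, 64),             # reduce the gesture image resolution
    layers.Rescaling(1.0 / 255),         # normalize pixel values to [0, 1]
    layers.RandomTranslation(0.2, 0.2),  # random shift up to 20% of height/width
    layers.RandomRotation(0.2),          # random rotation (fraction of a full turn)
    layers.RandomZoom(0.2),              # random scaling up to 20%
])

# Example: applied on the fly to the training split only.
# train_ds = train_ds.map(lambda x, y: (input_pipeline(x, training=True), y))
```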
After developing the CNN pipeline, a series of experiments was conducted to evaluate key parameters of the developed research environment and its efficiency metrics. Because the number of learning samples is vital for a personalized approach, the fifth experiment was designed to determine the optimal training set size. The models were trained jointly on samples from the Prompted Set and the User Set, using 6 to 9 images from the learning set. The experiment was also repeated for images approximated with Bézier curves (the results are presented in Section 4.2). The sixth experiment evaluated the system's inference performance using confusion matrices as metrics (the results are discussed in Section 4.3). The final (seventh) experiment aimed to assess the system's ability to reject unknown or casual gestures, which is essential for real-world applications. Receiver operating characteristic (ROC) analysis was performed to evaluate the classifier's performance in rejecting gestures that do not belong to the training set of classes (see Section 4.4).
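One common way to run such a rejection analysis is to score every sample with the classifier's maximum softmax probability and compute a ROC curve over known versus casual samples. The sketch below follows that scheme with scikit-learn; using the maximum softmax probability as the rejection score is an assumption made for illustration, not necessarily the exact criterion applied in Experiment 7.

```python
# Sketch of the rejection analysis: known-gesture samples are labelled 1 and casual
# samples 0, each sample is scored with the classifier's maximum softmax probability
# (an assumed rejection score), and scikit-learn provides the ROC curve and AUC.
import numpy as np
from sklearn.metrics import roc_curve, auc

def rejection_roc(model, known_images, casual_images):
    scores_known = model.predict(known_images).max(axis=1)    # confidence on real gestures
    scores_casual = model.predict(casual_images).max(axis=1)  # confidence on casual input
    y_true = np.concatenate([np.ones(len(scores_known)), np.zeros(len(scores_casual))])
    y_score = np.concatenate([scores_known, scores_casual])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)
```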
This study included a gesture capture phase focused solely on data collection, carried out using equipment and software that had been tested prior to data recording. Data was captured under controlled conditions, without introducing behavioral interventions or usability assessments. No user reactions were observed, and no manipulations were introduced. Gesture scanning did not require excessive physical or mental effort when the data was collected. The study meets the conditions specified in Regulation No. 179/2025 of the Rector of the Silesian University of Technology, dated 27 October 2025, named "Regulation on the establishment and operating principles of the Committee on Ethics in Research Involving Human Subjects".
Generative AI technology has become increasingly ubiquitous, and AI features are now present in almost all popular applications. In this context, authors must carefully observe how these features work to avoid including AI-generated content in their documents. The author of this study used the Grammarly web application and Overleaf as the primary text editors, relying on their (supposed) AI-based functions solely for superficial text editing and minor rephrasing. AI tools such as Google, ChatGPT, and GitHub Copilot were also used as web search tools. Still, it is important to note that each sentence and paragraph in this paper is the result of the author's independent and in-depth reflection, regardless of the corrections suggested by the spelling, grammar, and punctuation correction tools. The more sophisticated functions of the aforementioned correctors were not used, as they produce AI-generated text, which was checked with the AI Detector function of Grammarly and with the ZeroGPT website. The use of AI in this study meets the conditions specified in Regulation No. 5/2026 of the Rector of the Silesian University of Technology, dated 15 January 2026, named "Regulation on the Policy for the Use of Artificial Intelligence in Scientific Research and Education at the Silesian University of Technology".
5. Discussion
This study investigated a personalized wrist-forearm static gesture recognition framework based on the Vicara Kai™ controller and convolutional neural networks. The results show that a user-specific approach based on static data enables reliable gesture recognition even with a limited number of learning samples (6-9). Training a separate CNN model for each user reduces the intra-class variability caused by differences in motor habits, gesture dynamics, and other biomechanical constraints, which allows good results to be achieved even with only a few samples available. The observed differences in gesture stability across individuals confirm this observation.
An important feature of the method used in this study is that a portion of the captured gestures was prompted to users from a predefined gesture vocabulary. This approach was used to facilitate gesture registration and reduce user burden. However, as the confusion matrix analysis shows, not all of the proposed gestures proved successful at the recognition stage (in particular, the gestures marked as "ALPV", "SIGM", and "PUxx"); the same was true for the users' own circle gestures. This indicates that the proposed solution should account for the difficulties involved in performing certain movements. A preliminary conclusion is that the difficulty arises when elements requiring rapid horizontal-axis movements are combined with simultaneous changes in vertical cursor position. If the axes are separated from each other ("TUxx"), the results are better.
While working on the presented experiments, it was observed that rapid movements sometimes affect the shape of the gesture, because the cursor movements they cause are too fast for the applied sampling rate (a vital system parameter selected by observing how the system handles messages from the controller). With that in mind, the author decided to apply a preprocessing step to reduce distortion: a simple cubic Bézier curve approximation was applied to the gestures to smooth their shapes. The effect was noticeable visually; however, as observed during the tests, it concerned a level of detail that the CNN does not analyze, since the network generalizes the structure of gestures. This can be explained by the fact that the Bézier approximation does not alter the semantic structure of gestures but rather improves signal quality.
The experiments confirmed that lightweight CNN architectures are sufficient for the task. Reducing the input resolution to 64×64 pixels improved generalization, significantly reduced computational cost, and eliminated overfitting. More complex architectures designed for natural image classification were unnecessary and prone to overfitting due to the limited dataset size. Data augmentation had a minor effect on training results; however, the results achieved with a 20% rate of translation, rotation, and scaling were slightly better. In addition, the inclusion of a dropout layer improved generalization performance. Three dropout rates were investigated: 0.2, 0.4, and 0.6, corresponding to low, medium, and strong regularization. Ultimately, a rate of 0.4 provided the best balance between regularization and learning capacity, with a minor drawback: the target loss and accuracy values are reached after more epochs than without dropout.
In addition to the described experiments, the study included an important software development stage. As part of the work, four software modules were created: the Main Module, the MyKai.dll library, the End User Module, and, in the final phase, the MyKaiPy research module. This software is an important contribution by the author that can be used in future work. Because the software's architecture is open, it can be further developed, refactored, and enhanced with functions that enable other modalities and output data.
Due to the significant diversity of approaches referred to as gesture recognition, even when limited to those focused on user-specific gestures and static data, it is difficult to compare the results achieved here with those in other works. The best performance reached in this study is a validation loss of 0.32, a validation accuracy of 0.92, a test set accuracy of 0.92, and a test set precision (macro) of 0.92. In the studies cited in the Introduction, performance, measured using various metrics across different classifiers, ranged from about 0.88 to 0.98. The most similar research problem was investigated by Wang et al. [20]; the authors reported a training validation accuracy of 0.985, but their task was to recognize hand poses, not gestures understood as hand movements.
The study has a few limitations that highlight directions for further development and investigation. The primary task for future research is to evaluate the proposed research pathway using a dataset encompassing a larger number of individuals. Although the number of collected samples is quite large (15 samples × 15 classes × 5 individuals × 2 partitions + 100 casual gesture samples = 2,350 files) and the method is stable, extending the dataset to include more participants is a crucial next step for future work. Especially interesting is the examination of differences between prompted and non-prompted gestures and, in particular, how the system would behave if all users' gestures were used to train a single model (a user-independent approach). Another important development and testing issue is conducting a usability study to assess user experience across both overall and detailed system functions, particularly those that support smooth user-computer interaction. Finally, once the method has been thoroughly investigated and improved, exploring practical use cases becomes worthwhile. There are several potential application areas and many directions to achieve this goal. One approach is to integrate the developed UI software with an input method (or methods) designed primarily for people with motion impairments. This use case was addressed by Jankowski (as described in the Introduction) and by the author in [34]. Another potential application area could be physical rehabilitation, where the developed software would be used to assess recovery progress after injuries, strokes, heart attacks, and other conditions. The online inference accuracy could be used as a measure of regained fitness. Additionally, the developed software could be used to control the system by running programs or executing their specific functions. A tree-like command structure could be used, allowing many commands to be controlled by a small set of gestures. This also applies to software associated with external systems, such as smart home or assisted living solutions.
6. Conclusions
The article presents a personalized (user-specific) approach for recognizing static wrist-forearm gestures using the Vicara Kai™ controller and a convolutional neural network. The developed software enables users to record their own gesture sets, either prompted by the system or created entirely by the users. The experiments in the study demonstrate the feasibility of training a lightweight CNN with a small number of gesture samples per class. Several input/network pipelines were investigated, resulting in an overall accuracy of 0.92. As mentioned in the discussion, some gesture types were recognized perfectly, while others were unstable; the main difficulty was moving the controller along both the X and Y axes simultaneously. Downsizing the image, data augmentation, and a medium level of regularization improved performance compared to the initial network architecture. The ROC analysis showed that the system is resilient to random, casual cursor movements. The final output of the study is a research software framework that the author believes can be further developed, expanded, and successfully applied in the areas outlined in the discussion.
Informed Consent Statements: Informed consent was obtained from all subjects involved in the study.
Acknowledgements: The author would like to express gratitude to his son, Szymon Szedel BSc, Eng., for his critical remarks and constructive suggestions, as well as for sharing his knowledge of machine learning Python libraries. Additional acknowledgments are due to Krzysztof Dobosz, PhD, Eng., for inspiring the author’s interest in the topic of gesture recognition.
Conflict of Interest: The author declares no conflict of interest.
Figure 1.
The physical setup of gesture recording: (a) The right-side view: the controller is worn on a hand lightly clenched into a fist (label 1); the forearm is supported on an armrest (label 2). (b) A top-down view: fixed-size gesture viewport (label 1); the user moves both the forearm and the wrist (label 2).
Figure 2.
Determining the PYR event decimation factor: (a) the PYR event-triggering time [ms] as a function of the event decimation factor, (b) the duration of the PYR event handling cycle [ms]; the results show that the decimation factor should be set to 4, meaning that only every fourth event is handled.
Figure 3.
Examples of gesture shapes; the gray square represents the viewport size: (a) gestures included in the Prompted Set (the set of gestures suggested to users during recording), (b) examples of gestures proposed by the users.
(a) Prompted Set gesture labels: alpha right, alpha vertical, "C" upper, finger, "h" lower, "P" upper, sigma upper, spiral, "T" upper, "W" upper.
(b) User-defined gesture labels (examples): circle, dash, "G" upper, less, more, pipe, "S" upper, square, tipi, "V" upper.
Figure 4.
Examples of gestures with fast-drawn fragments: (a) the original gesture trails (as captured), (b) the corresponding Bézier approximations, (c) the original gestures and their approximations superimposed.
Figure 5.
The system map showing its architectural building blocks together with their content and responsibilities; for better readability, hardware, software, and data artifacts are organized clockwise and labeled with numbers (1–10); informal notation is used. The following elements are depicted: the Kai controller (1), the PC with the Kai Dongle (2), the Kai SDK (3), the Kai.WebsocketModule (4), MyKaiLibrary (5), the Main Module UI (6), the Gesture System Dataset (7), the MyKaiPy Research Module (8), Models and Statistics (9), and the End User Module (10).
Figure 6.
Illustration of the gesture extraction process; the successive frames of the process view are superimposed here, whereas the user sees only one frame at a time.
Figure 7.
State machine of the gesture extraction scenario shown from the user’s perspective.
Figure 8.
State machine diagram that shows the detailed view of the gesture extraction scenario; t is the stabilization time, is the cursor movement distance, limit stands for both the time and distance limits.
Figure 9.
Loss and accuracy plots showing the learning results obtained for the initial pipeline (the shaded area corresponds to the 95% confidence level): (a) the plot of averages across experiments and epochs, (b) the worst individual result obtained for one of the participants with less stable gestures.
Figure 10.
Loss and accuracy plots presenting the learning results obtained for different input image sizes (continued): (b) 64x64, (c) 32x32.
Figure 11.
Three sets of gestures demonstrating different types of variations: (a) scale, (b) position, (c) rotation. Each set of three samples in sets a, b, and c belongs to a separate participant.
Figure 12.
Loss and accuracy plots showing the learning results for different augmentation ratios applied to scale, position shift, and rotation: (a) 10%, (b) 20%, and (c) 30%.
Figure 13.
Loss and accuracy plots illustrating the influence of a dropout layer applied after the dense layer for three dropout rates (continued): (b) medium regularization (0.4), (c) strong regularization (0.6).
Figure 14.
Results of the training process for different numbers of training samples: the average minimum validation loss, and the maximum validation accuracy.
Figure 15.
Loss and accuracy plots illustrating the training results for different numbers of learning samples: (a) the lowest number (N = 5), (b) the highest number (N = 9).
Figure 16.
Confusion matrices obtained for the set of test samples and randomly selected images representing each class; smoothed gestures (continued): (d) the worst result.
Figure 17.
Examples of casual (unwanted, rapid, random) gestures.
Figure 18.
ROC curves of Experiment 7: (a) the ROC with the highest AUC, (b) the ROC with the lowest AUC, (c) the ROC aggregated across all participants.
Table 1.
Detailed description of the Kai controller capabilities.
| Capability name | Flag | Meaning | Argument type | Available in v. 1.0.0.9 | Remarks |
|---|---|---|---|---|---|
| Gesture | 1 | Basic gesture | Gesture enum | Yes | Capability is always on |
| Linear Flick | 2 | Flick | string | No | No events received |
| Finger Shortcut | 4 | Fingers bent? | bool[] (four elements) | Yes | Incorrect if battery low |
| Finger Position | 8 | Finger position | int[] (four elements) | No | No events received |
| PYR | 16 | Pitch, Yaw, Roll | float (three values) | Yes | Large number of events |
| Quaternion | 32 | w, x, y, z | float (four values) | Yes | Large number of events |
| Accelerometer | 64 | ax, ay, az | Vector3 | No | No events received |
| Gyroscope | 128 | gx, gy, gz | Vector3 | No | No events received |
| Magnetometer | 256 | mx, my, mz | Vector3 | No | No events received |
Table 2.
CNN pipelines involved in this part of the study: '+' indicates that the element is included, '–' that it is not; pipeline element parameters are given in brackets.
| Layer | CNN pipeline: Initial | CNN pipeline: Augmentation | CNN pipeline: Dropout |
|---|---|---|---|
| Input pipeline | | | |
| Rescaling | + (1./255) | + (1./255) | + (1./255) |
| Resizing | – (180, 180, 3) | + (90x90, 64x64, 32x32) x 1 | + (64x64) |
| Random translation | – | + (0.1, 0.2, 0.3) | + (0.2) |
| Random rotation | – | + (0.1, 0.2, 0.3) | + (0.2) |
| Random scale | – | + (0.1, 0.2, 0.3) | + (0.2) |
| Network pipeline | | | |
| Conv2D | + (32, 3, 'relu') | + (32, 3, 'relu') | + (32, 3, 'relu') |
| MaxPooling2D | + | + | + |
| Conv2D | + (32, 3, 'relu') | + (32, 3, 'relu') | + (32, 3, 'relu') |
| MaxPooling2D | + | + | + |
| Conv2D | + (32, 3, 'relu') | + (32, 3, 'relu') | + (32, 3, 'relu') |
| MaxPooling2D | + | + | + |
| Flatten | + | + | + |
| Dense | + (128, 'relu') | + | + |
| Dropout | – | – | + (0.2, 0.4, 0.6) |
| Dense | + (no. of classes) | + (no. of classes) | + (no. of classes) |
Table 3.
Summary of parameters of the first experiment – evaluating the initial CNN pipeline.
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 180x180 pixels (one channel) |
| augmentation | none |
| number of learning samples per class | 6 |
| number of validation samples per class | 4 |
| total number of samples per class | 10 |
| dropout rate | 0 |
| optimizer | adam |
Table 4.
Summary of parameters of Experiment 2 – changing the input image size.
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 90x90, 64x64, 32x32 (one channel) |
| augmentation | none |
| number of learning samples per class | 6 |
| number of validation samples per class | 4 |
| total number of samples per class | 15 |
| dropout rate | 0.0 |
| optimizer | adam |
Table 5.
Summary of parameters of Experiment 3 – data augmentation.
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 64x64 (one channel) |
| augmentation | scale, translation, rotation: 10%, 20%, 30% |
| number of learning samples per class | 6 |
| number of validation samples per class | 4 |
| total number of samples per class | 15 |
| dropout rate | 0.0 |
| optimizer | adam |
Table 6.
Summary of parameters of the fourth experiment – dropout application.
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 64x64 (one channel) |
| augmentation | 20% |
| number of learning samples per class | 6 |
| number of validation samples per class | 4 |
| total number of samples per class | 15 |
| dropout rate | 0.2, 0.4, 0.6 |
| optimizer | adam |
Table 7.
Summary of all experiments related to the input pipeline, including experiment options and the resulting minimum average loss and maximum average accuracy for training and validation.
Experiment 1 – initial pipeline
| Option | Loss (train) | Loss (val) | Accuracy (train) | Accuracy (val) |
|---|---|---|---|---|
| average | 0.00 | 1.05 | 1.00 | 0.82 |
| worst (less stable) | 0.00 | 1.50 | 1.00 | 0.81 |

Experiment 2 – input image size
| Option | Loss (train) | Loss (val) | Accuracy (train) | Accuracy (val) |
|---|---|---|---|---|
| 90x90 pixels | 0.00 | 0.57 | 1.00 | 0.89 |
| 64x64 pixels | 0.01 | 0.56 | 0.99 | 0.90 |
| 32x32 pixels | 0.36 | 1.03 | 0.90 | 0.73 |

Experiment 3 – data augmentation
| Option | Loss (train) | Loss (val) | Accuracy (train) | Accuracy (val) |
|---|---|---|---|---|
| 10% (scale, shift, rotation) | 0.01 | 0.55 | 0.99 | 0.91 |
| 20% (scale, shift, rotation) | 0.01 | 0.54 | 0.99 | 0.91 |
| 30% (scale, shift, rotation) | 0.01 | 0.58 | 0.99 | 0.90 |

Experiment 4 – applying dropout layer
| Option | Loss (train) | Loss (val) | Accuracy (train) | Accuracy (val) |
|---|---|---|---|---|
| 0.2 (slight) | 0.10 | 0.54 | 0.97 | 0.90 |
| 0.4 (medium) | 0.22 | 0.47 | 0.92 | 0.90 |
| 0.6 (strong) | 0.61 | 0.54 | 0.79 | 0.88 |
Table 8.
Summary of parameters of Experiment 5 – investigating the number of learning samples (N).
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 64x64 (one channel) |
| augmentation | 20% |
| number of learning samples per class | 5, 6, 7, 8, 9 |
| number of validation samples per class | 4 |
| total number of samples per class | 15 |
| dropout rate | 0.4 |
| optimizer | adam |
Table 9.
Summary of Experiment 5 showing the minimum average loss and maximum average accuracy for training and validation.
Experiment 5 – number of learning samples
| Option | Loss (train) | Loss (val) | Accuracy (train) | Accuracy (val) |
|---|---|---|---|---|
| N = 5 | 0.37 | 0.55 | 0.88 | 0.87 |
| N = 6 | 0.23 | 0.48 | 0.92 | 0.90 (0.897) |
| N = 7 | 0.24 | 0.48 | 0.92 | 0.88 |
| N = 8 | 0.16 | 0.41 | 0.95 | 0.90 |
| N = 9 | 0.15 | 0.35 | 0.95 | 0.92 |
Table 10.
Summary of parameters of Experiment 6 – inference on the test set.
| Parameter | Value |
|---|---|
| dataset partition | gestures as captured, Bézier-smoothed |
| number of participants | 5 |
| number of repetitions per participant | 10 |
| number of epochs | 30 |
| image size | 64x64 (one channel) |
| augmentation | 20% |
| number of learning samples per class | 9 |
| number of validation samples per class | 2 |
| number of test samples per class | 4 |
| total number of samples | 15 |
| dropout rate | 0.4 |
| optimizer | adam |
Table 11.
Summary of Experiment 6 – test-set accuracy and precision for each participant, with references to the corresponding confusion matrices in Figure 16.
Experiment 6 – system performance on the test set

Gestures as captured:
| Participant | Accuracy | Precision | See Figure 16 |
|---|---|---|---|
| 1 | 0.88 | 0.89 | (a) |
| 2 | 0.94 | 0.94 | (b) |
| 3 | 0.92 | 0.93 | – |
| 4 | 0.91 | 0.93 | – |
| 5 | 0.90 | 0.91 | – |
| average | 0.91 | 0.92 | |

Smoothed gestures:
| Participant | Accuracy | Precision | See Figure 16 |
|---|---|---|---|
| 1 | 0.88 | 0.89 | (c) |
| 2 | 0.94 | 0.95 | (d) |
| 3 | 0.91 | 0.92 | – |
| 4 | 0.93 | 0.94 | – |
| 5 | 0.90 | 0.92 | – |
| average | 0.91 | 0.92 | |
Table 12.
Summary of Experiment 7 – AUC values and corresponding example figures.
Experiment 7 – rejection of unknown, casual gestures
| Participant | AUC | See Figure 18 |
|---|---|---|
| 1 | 0.96 | – |
| 2 | 0.99 | (a) |
| 3 | 0.95 | (b) |
| 4 | 0.98 | – |
| 5 | 0.97 | – |
| average (aggregated) | 0.97 | (c) |