Swarm Intelligence Based Deep Learning Approach for Human Activity Recognition in Wearable Internet of Healthcare Things (IoHT) Applications

Anil Kumar Chengali; Seetharamulu B.

doi:10.20944/preprints202606.0833.v1

Submitted: 09 June 2026

Posted: 10 June 2026


Abstract

Human Activity Recognition (HAR) is the automatic prediction of daily human activities such as walking, running, cooking, and office work. The medical sector can greatly benefit from it, especially those who work with the elderly, personal health care aides, and those who keep patient records for future reference. A HAR system can take as input (a) video or still images of people performing activities, or (b) data describing the body's motions during those activities, gathered from sensors in smart devices (accelerometers, gyroscopes, etc.), smart homes, eldercare settings, and the Internet of Things (IoT). HAR applications rely heavily on the latest developments in AI, such as Deep Learning (DL) and optimisation algorithms from Swarm Intelligence (SI). Here, we use open-source data from wearable sensors to construct a reliable HAR system that combines DL and SI. A lightweight feature extraction method called Residual Bidirectional Long Short-Term Memory (Res-BiLSTM) is developed. Based on the Marine Predator Algorithm (MPA), we present novel feature selection approaches to choose the best set of features. Using three publicly available HAR datasets from the UCI machine learning repository, we assess the performance of the suggested model and compare it against different DL architectures recently proposed for the HAR problem. The proposed model surpasses other state-of-the-art approaches, with an accuracy of 96.92%, a precision of 95.45%, a recall of 94.07%, and an F1 score of 96.15% on all three datasets. The suggested approach also outperforms several reported results in robustness and activity detection. In addition to adapting to activity characteristics, it has fewer parameters and improved accuracy.

Keywords: deep learning; human activity recognition; Internet of Things (IoT); multimodal sensing devices; swarm intelligence

Subject: Computer Science and Mathematics - Computer Science

1. Introduction

Human Activity Recognition (HAR) has been an active research area for several decades, driven by the significant societal benefits it enables when applied to human-centric, real-world scenarios. In parallel, the rapid advancement of microelectronics and sensor technologies, along with the widespread adoption of smartphones, has accelerated the growth of ubiquitous sensing, which focuses on extracting meaningful insights from data captured by pervasive sensors [1]. With the growing need for applications in fields including ambient assisted living, context-aware systems, pervasive and mobile computing, and surveillance-based security, smartphone-based HAR has become increasingly important. This approach can also detect activity unobtrusively, which makes it appropriate for everyday use [2]. The fundamental goal of HAR in many real-world contexts is to precisely identify the physical actions carried out by a single person or a small group. Some activities, such as running, jumping, walking, and sitting, are performed by a single person using the whole body [3]. Others, such as hand gestures, involve movements of specific body parts [4]. Still others, such as cooking, are performed by interacting with objects [5]. HAR can also cover unusual events, such as a fall [6]. The most common applications of HAR include healthcare monitoring, human-computer interaction, ambient assisted living, nursing homes, rehabilitation, and surveillance [7]. Because of its many potential uses, HAR has recently become a prominent research topic. According to the data types used, HAR can be broadly classified as either vision-based or sensor-based.
While sensor-based systems analyse data from accelerometers, gyroscopes, radars, and magnetometers as time series [8], vision-based techniques analyse camera data in the form of video or images. Because of its small size, low cost, and portability, the accelerometer is the most commonly used sensor for HAR. Figure 1 shows the general layout of a typical activity recognition system. Despite the challenges in deployment, object sensors such as radio frequency identification (RFID) tags find use in the home setting.

According to the studies, sensor-based HAR [9] is more private and easier to use than vision-based HAR [10]. Despite being less costly to create, vision-based HAR [11] is more affected by ambient factors such as camera angle, lighting, and overlap between individuals. Deep learning (DL) algorithms have gained significant traction recently because they excel at autonomously extracting meaningful characteristics from data, be it visual [12] or time-series. By learning high-level patterns directly from raw input, these models remove the need for manual feature engineering. This process not only preserves essential data relationships but also yields highly discriminative representations. In activity detection, DL algorithms have consistently beaten classical Machine Learning (ML) methods on classification metrics such as F1 score, recall, accuracy, and precision [13]. Human behaviour recognition with a deep learning architecture, particularly a CNN, is a composite system made up of several critical steps. Feature extraction is an essential step in every classification application, so isolating the relevant features can improve the classification accuracy of the approach used [14]. Recent deep learning algorithms have made feature extraction easier. In this work, we integrated the Marine Predator Algorithm (MPA) into a lightweight DL model (Res-BiLSTM). The convolution layers in the suggested model use a skip-connection (residual connection) mechanism and a BiLSTM to extract features from the wireless Inertial Measurement Unit (IMU) datasets. With its feature learning and extraction capabilities, the suggested model is fine-tuned for human activity classification on publicly available sensor data. Classifier performance is heavily influenced by the feature selection approach, and many techniques are employed to obtain the most essential features.
Metaheuristic (MH) optimisation methods, such as swarm intelligence (SI) optimisation algorithms, have recently demonstrated remarkable performance in feature selection and other complicated engineering problems [15]. Several MH methods have been used for feature selection, including GSA, artificial bee colony, the salp swarm algorithm, particle swarm optimisation, and differential evolution (DE) [16,17]. The MPA, an effective SI method, was developed by [18]. It has been applied to problems such as time-series forecasting, global optimisation, image segmentation, and feature selection. As far as we are aware, this is the first use of the MPA for HAR. MH algorithms are known to have limitations, and the no-free-lunch (NFL) theorem states that no single algorithm performs best on every problem; the MPA is no exception. We therefore tackled the feature selection problem in HAR applications using three variants of the MPA in addition to the original approach. These variants carry out binarization with two commonly used transfer functions: the MPAV employs the V-shaped transfer function, whereas the MPAS and MPAS10 both employ the S-shaped function. To develop a complete HAR methodology, we chose three publicly available datasets covering extensive and complicated activities: Opportunity, UCI-HAR, and DAPHNET. They include a variety of fall actions and everyday activities. To verify the quality of the MPA algorithms, we additionally compare them with well-known MH and SI algorithms. To summarise, the following are the primary goals of this research:

To improve HAR applications, we use DL and SI advancements. In addition, we investigate HAR feature selection using optimization algorithms in detail.
Construct a novel feature extraction technique that relies on the MPA to extract features from signals received by IMUs. In addition to convolution layers, skip connections, and BiLSTM, the Res-BiLSTM is made up of a number of distinct components, with the skip connection arranged in a parallel architecture.
Conduct comprehensive evaluation experiments to compare the proposed MPA variations to other cutting-edge DL algorithms and evaluate their performance.

The remainder of the paper is organised as follows. Section 2 reviews previous HAR studies. Section 3 covers the proposed approach and the dataset description. Section 4 details the results of the suggested method. Section 5 provides a summary and outlines future plans.

2. Related Work

Over the past few years, researchers have carried out a vast number of studies to investigate various sensing technologies, and a number of techniques have been presented to model and identify human behaviours [19]. HAR is a difficult research problem in the area of computer vision, and researchers around the world have long worked on it in pursuit of a nearly perfect recognition model. Prior research on HAR is extensive. The primary goal of this section is to give a rundown of previous work on the activities covered by the datasets we have selected. For instance, [20] assembled a remarkable body of HAR-related literature. The goal of HAR is to distinguish between activities of daily living (ADLs) such as sitting, lying, running, cycling, and exercising. Methods for addressing HAR have progressed from conventional ML algorithms, which were prevalent in the early 2000s, to DL-based techniques [21]. Today, DL is the backbone of solutions in many IT fields that deal with large data. Neural networks and deep learning have revolutionised numerous fields, including voice recognition, NLP, signal processing, and image recognition.

A number of recent articles [22] demonstrate DL models that can identify human activities, extract and select characteristics, and even put the identified behaviours to use in real-world applications. This HAR type has attracted the most attention because motion sensors are now standard on most devices. Activity recognition is described as a complex process in [23]. It consists of four main tasks: (1) selecting and deploying sensors to objects and environments to record user behaviour and environmental changes; (2) gathering, storing, and processing perceived information using data analysis techniques and/or knowledge representation formalisms at suitable abstraction levels; (3) building computational activity models to enable software systems/agents to reason and manipulate; and (4) They go on to say that there is a plethora of methodologies, technology, and tools at your disposal for each given endeavour. Often, the methodology chosen for a secondary problem can inform the approach applied to the primary one. For instance, to identify the most discriminative features in Human Activity Recognition (HAR), [24] introduced feature selection methods based on the Gradient-based Optimiser (GBO) [25] and the Grey Wolf Optimiser (GWO) [26]. In their work, a Support Vector Machine (SVM) was subsequently used for classification. Building on this concept of optimised feature selection, [27] proposed a system that minimises computational cost and complexity by employing a Recurrent Neural Network (RNN) where the Colliding Bodies Optimisation (CBO) meta-heuristic handles feature selection.

More recently, the field has been influenced by the success of deep learning in areas like image classification and speech recognition, leading researchers to apply these models to human activity detection. A notable example is [28], where data from three-axis accelerometers was first converted into an image-like format. This representation was then fed into a Convolutional Neural Network (CNN) featuring three convolutional layers and a single fully connected layer for activity recognition. Expanding on this deep learning approach, [29] combined a deep CNN with a Long Short-Term Memory (LSTM) network to successfully classify twenty-seven hand gestures alongside five distinct motion types. Similarly, [30] also transformed three-axis accelerometer signals into images, employing a comparable CNN architecture with three convolutional layers and one fully connected layer to identify human activities, further validating the effectiveness of this image-based transformation technique. To identify five movements and twenty-seven hand gestures, [31] suggested using a deep CNN with LSTM. Instead of conventional micro-Doppler image pre-processing, [32] suggested a new iterative CNN method with autocorrelation pre-processing, which can correctly categorise seven activities or five subjects. To automate feature definition and extraction, this approach took advantage of an iterative deep learning architecture. The convolutional neural network (CNN) has been widely adopted for image recognition and other tasks involving complex data and complex relations since the publication of the 'AlexNet' network in the ImageNet LSVRC-2010 contest [33]. With an even smaller number of parameters and significantly better accuracy, GoogLeNet (Inception-v1) [34] makes such relations even easier to discover.
The inception module, which has been improved multiple times since its introduction in 2015, is the central idea of the Inception-v1 architecture. These modules are stacked to form the GoogLeNet network, with max-pooling layers that halve the grid's resolution. A 1×1 convolutional layer is extensively utilised in Inception-ResNet to enhance the network's depth. In contrast, 1×1 convolutions are employed in [35] as dimensionality reduction modules to eliminate computational bottlenecks that would otherwise have constrained the size of the constructed networks. Table 1 describes the state-of-the-art algorithms for solving HAR.

This allows for an increase in both the depth and the breadth of the resulting networks without incurring a significant performance penalty. The second and third versions of Inception were unveiled in [36]. Inception-v2 [37] incorporated batch normalisation, and Inception-v3 introduced factorisation. According to [38], Inception-v4, the fourth version, was an upgrade over v3. In addition, Inception-ResNet is presented in [39], which we have adapted and used for our study. Although [40] presents contradictory evidence, several studies have shown that residual connections from [11] are essential for training extremely deep convolutional models. Unlike previous deep neural networks, the suggested model can grow substantially without losing performance, owing to the residual connection.

3. Methodology

Human activities are also hierarchical, in the sense that complex activities are made up of the simple moves or actions needed to perform them. In addition, they are translation-invariant, because different people perform the same kind of activity in different ways and because different parts of the same activity can appear at different times [1]. Although earlier DL methods improved HAR system performance, they tended to overfit because they did not scale well. To tackle the activity recognition problem, we propose the smart Res-BiLSTM, which builds on the MPA's achievements.

3.1. Dataset Description

The data from three sources that are available to the public is summarized in Table 2 [3,4,10,30]. The UCI-HAR dataset was built from the recordings of 30 participants, which is the highest number of volunteers compared to other datasets.

In comparison to the UCI-HAR dataset, the DAPHNET dataset comprises six activities; however, it contains the most samples. Later on, we will discuss how this dataset is imbalanced. The OPPORTUNITY dataset includes seventeen different actions. Accelerometers, gyroscopes, magnetometers, object sensors, and ambient sensors were the five kinds of sensors that gathered the data.

(1): UCI-HAR

The UCI-HAR dataset [30] was constructed from recordings of 30 individuals ranging in age from 19 to 48 years. Every participant was asked to follow a set protocol while the recording was underway, wearing a smartphone (a Samsung Galaxy S II) with built-in inertial sensors around the waist. The dataset covers six basic everyday activities: standing, sitting, lying down, walking, going upstairs, and going downstairs. The following postural transitions are also included in this dataset: sitting to standing, sitting to laying, laying to sitting, standing to laying, and laying to standing. There is a total of eight such transitions. Because postural transitions make up only a modest percentage of the data, only the six basic activities were chosen as input examples in this work. The experiments were videotaped so that the data could be manually annotated. The researchers recorded three-dimensional acceleration and three-dimensional angular velocity at a constant rate of 50 Hz. Table 3 displays the detailed information; the dataset contains 748,206 samples.

(2): DAPHNET

The DAPHNET dataset [31] contains 294,739 samples in total, and Table 4 displays the percentage of samples associated with each activity. DAPHNET is clearly an imbalanced dataset: walking accounts for 38.6% of the samples, while standing makes up only 4.4%. The data were collected from 36 participants, who went about their usual routines with an Android phone in their front leg pockets. The sensor used is an accelerometer sampling at 20 Hz; the smartphone also has an integrated motion sensor. Six activities were labelled: standing (Std), sitting (Sit), walking (Walk), going upstairs (Up), going downstairs (Down), and jogging (Jog). The data collection process was supervised to ensure high-quality data. To better understand the properties of the raw data on each axis, Figure 2 displays the acceleration waveform of each activity during a 2.56-second period (128 points in total).

(3): OPPORTUNITY

The 17 complicated motions and gestures included in the OPPORTUNITY dataset [32] were recorded in a sensor-rich environment. In total, it features four individuals engaged in various morning tasks in real-life settings. Various types of sensors were embedded in the environment, in objects, and on people’s bodies. Regarding the configuration of the sensors, the OPPORTUNITY challenge recommendations [33] were followed. We just took into account the on-body sensors, which comprise twelve Bluetooth 3-axis acceleration sensors, two InertiaCube3 sensors for the feet, and five inertial measurement units for the sports jacket. Table 5 summarises the gestures in this dataset, with the symbols of motions denoted by letters in parentheses.

3.2. Pre-Processing

The following pre-processing of raw data obtained by motion sensors is necessary to feed the suggested network with a certain data dimension and enhance the model’s accuracy.

(1): LINEAR INTERPOLATION

The subjects wear wireless sensors, and the datasets mentioned are realistic. Consequently, some data can be lost during collection; such data is typically denoted as NaN/0. To circumvent this issue, this work uses linear interpolation to fill in the missing values. Figure 3 illustrates the segmentation of sensor data.
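As a minimal sketch of this step, the function below fills missing readings in a single sensor channel by linear interpolation between the nearest valid neighbours. The function name `interpolate_missing`, the use of NaN as the only missing-value marker, and the choice to fill gaps at the edges with the nearest valid value are assumptions made for illustration; they are not taken from the paper.

```python
import math


def interpolate_missing(values):
    """Fill NaN entries of one channel by linear interpolation.

    Gaps at the start or end have only one valid neighbour, so they are
    filled with that neighbour's value (an assumption of this sketch).
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    if not known:
        return out  # nothing to interpolate from
    for i in range(len(out)):
        if not math.isnan(out[i]):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = out[right]
        elif right is None:
            out[i] = out[left]
        else:
            # Weight by the position of i between the two valid samples.
            w = (i - left) / (right - left)
            out[i] = out[left] + w * (out[right] - out[left])
    return out
```

If zeros are also used to mark lost samples, they would need to be converted to NaN before calling this function.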

(2): SCALING AND NORMALIZATION

It is important to normalise the input data to the range of 0 to 1, as in Equation (1), because training models directly on large values from some channels can introduce training bias.

X_{i} = \frac{X_{i} - X_{i_{min}}}{X_{i_{max}} - X_{i_{min}}} (i = 1, 2, \dots, n)

(1)

where $X_{i_{max}}$ and $X_{i_{min}}$ represent the maximum and minimum values of the $i^{th}$ channel, respectively, and $n$ is the number of channels.
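The per-channel scaling of Equation (1) can be sketched as follows. The data layout (a list of time steps, each holding one value per channel) and the handling of constant channels, which are mapped to 0 to avoid division by zero, are assumptions of this example.

```python
def min_max_scale(samples):
    """Scale every channel of `samples` to the range [0, 1].

    samples: list of time steps, each a list with one value per channel.
    """
    n_channels = len(samples[0])
    lo = [min(row[i] for row in samples) for i in range(n_channels)]
    hi = [max(row[i] for row in samples) for i in range(n_channels)]
    scaled = []
    for row in samples:
        scaled.append([
            (row[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
            for i in range(n_channels)
        ])
    return scaled
```

In practice the minimum and maximum would normally be computed on the training split only and reused for the test split.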

3.3. Proposed Residual Convolutional BiLSTM Network

The process begins with the collection of activity data using various devices such as Bluetooth, Wi-Fi, radar, and others. The data are then preprocessed before activities are identified from the signals of the wearable sensing devices. Current approaches are slow and fail to differentiate between very similar actions, such as going upstairs and going downstairs. To address these issues, we propose a new architecture, Res-BiLSTM with MPA. As shown in Figure 4, the network architecture has three main parts: the 1DCNN, the Res-BiLSTM, and the fully connected layer. The first part, the 1DCNN, extracts spatial features from the preprocessed data. It shortens the time series by adjusting the stride of the convolution kernel, which reduces the time the model needs to recognise activities. After the 1DCNN, time series features are extracted by the Res-BiLSTM network. The Res-BiLSTM component combines the strengths of BiLSTM with residual connections, which improves the model's capacity to capture long-term dependencies in the time series data. This integration enhances both the recognition accuracy and the capacity to model complicated temporal patterns. To enhance the final recognition features further, we introduce the MPA mechanism. By assigning weights to the feature information of the Res-BiLSTM network, this mechanism enables the model to focus on the most relevant aspects of the input data. Acting as an attention mechanism, it highlights the most relevant features, which increases the model's discriminative capacity and improves activity recognition accuracy. Finally, the fully connected layer and the SoftMax function classify the behaviour information. The output of this classification is the recognition result, a prediction of the current activity.
We will describe each part in depth in the sections that follow, including what they do and how they fit into our proposed model.

3.4. 1DCNN

Convolutional neural networks (CNNs) excel at image processing and human behaviour identification because of their powerful feature extraction capability on tensor data. In this research, we extract features using a one-dimensional convolutional neural network (1DCNN) [35]. As Figure 2 shows, in the collected sensing data the time series runs vertically and the multi-axis channel features obtained by the various sensors run horizontally. For spatial feature extraction, 1D convolution (a convolution in behavioural units) was chosen over the more conventional 2D convolution, because the former preserves the integrity of the sensor channels even when dealing with a large number of sensors, while the latter destroys it. The 1D convolution is computed as follows: the input data is convolved with each filter, and a nonlinear activation function is then applied:

X_{j} = f \left(\sum_{i = 1}^{n} (W^{i} \cdot x_{j} + b^{i})\right)

(2)

Here, $X_{j}$ denotes the output activation, $W^{i}$ represents the weight matrix of the $i^{th}$ filter, and $x_{j}$ corresponds to the input sensing data convolved with $W^{i}$. The term $b^{i}$ indicates the bias associated with the $i^{th}$ filter, $n$ is the total number of filters used in the layer, and $f (\cdot)$ denotes a non-linear activation function. In this work, the convolutional layers employ the Swish activation function [36]. Swish is particularly well suited for sensor-based data, as it alleviates the dead neuron issue associated with the negative input region in the ReLU activation function [37]. The mathematical formulation of the Swish function is given as follows:

\mathrm{swish} (x) = x \cdot \mathrm{sigmoid} (x)

(3)
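To make the convolution-plus-activation step concrete, the sketch below applies a single 1D filter along the time axis of a multi-channel signal and passes each response through Swish. The function names, the single-filter setup, and the use of unpadded ("valid") windows are simplifying assumptions for illustration; the paper's layer sizes are not reproduced here.

```python
import math


def swish(x):
    """Swish activation: x * sigmoid(x), as in Equation (3)."""
    return x / (1.0 + math.exp(-x))


def conv1d_swish(seq, kernel, bias=0.0, stride=1):
    """Slide one filter over time and apply Swish to each response.

    seq:    list of T time steps, each a list of C channel values.
    kernel: list of K taps, each a list of C weights.
    Only full windows are used (no padding), an assumption of this sketch.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        acc = bias
        for tap, step in zip(kernel, seq[start:start + k]):
            acc += sum(w * v for w, v in zip(tap, step))
        out.append(swish(acc))
    return out
```

Unlike ReLU, Swish returns a small non-zero value for negative inputs (for example, roughly -0.27 at x = -1), so the gradient does not vanish entirely in that region.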

The pooling layer is used for down sampling after the activation stage. This down sampling procedure uses the ‘same’ padding. Furthermore, the stride of the pooling kernel can be adjusted to shorten the length of time in the 1DCNN layer, and the length of the time series changes in the following way:

len_{out} = \frac{len_{input}}{s}

(4)

where $s$ is the stride of the pooling kernel, $len_{out}$ is the length of the pooled time series, and $len_{input}$ is the length of the input time series. Vanishing or exploding gradients can arise during neural network training because the probability distribution of the inputs to each layer changes continually.
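The effect of the pooling stride on the sequence length can be checked with a few lines of code. The sketch assumes that, with 'same' padding, a length that is not divisible by the stride is rounded up, which is how common deep learning frameworks behave; the stride of 2 and the 128-sample window in the usage note are illustrative values, not settings confirmed by the paper.

```python
import math


def pooled_length(len_input, stride):
    """Length after one pooling layer with 'same' padding (rounded up)."""
    return math.ceil(len_input / stride)


def length_after_blocks(len_input, stride, num_blocks):
    """Length after several stacked pooling layers with the same stride."""
    length = len_input
    for _ in range(num_blocks):
        length = pooled_length(length, stride)
    return length
```

For example, `length_after_blocks(128, 2, 4)` shrinks a 128-sample window to 8 time steps over four blocks, which is what makes the subsequent recurrent layers cheaper to run.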

This phenomenon is known as the internal covariate shift problem [38]. Batch normalisation (BN) was proposed by [19] in 2015 to mitigate it. The idea is to compute the mean and standard deviation of a batch of data and rescale the data so that it has a mean of 0 and a variance of 1. By integrating normalisation into the training process, batch normalisation speeds up convergence during gradient descent and thus shortens the training time of neural networks. The computation is given below.

\hat{x}_{i, k} = \frac{x_{i, k} - μ_{k}}{\sqrt{σ_{k}^{2} + ε}}

(5)

Here, $x_{i, k}$ denotes the $k$-dimensional component of the input sample $x_{i}$ from the training set $\{x\}$, $μ_{k}$ represents the mean of the $k$-dimensional feature computed across all training samples, and $\sqrt{σ_{k}^{2} + ε}$ corresponds to the standard deviation of the same feature, where $ε$ is a small constant added for numerical stability. Following each convolutional layer, a fixed sequence of operations was applied: convolution (Conv), batch normalization (BN), Swish activation, and max pooling, as illustrated in Figure 4. This block was repeatedly stacked to form a four-level hierarchical structure in the 1D-CNN component of the proposed model.
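The normalisation in Equation (5) can be sketched for a small batch as follows. The learnable scale and shift parameters that usually follow batch normalisation are omitted because Equation (5) covers only the normalisation itself, and the default value of `eps` is an assumption.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalise each feature of `batch` to zero mean and unit variance.

    batch: list of samples, each a list of d feature values.
    """
    n, d = len(batch), len(batch[0])
    mu = [sum(row[k] for row in batch) / n for k in range(d)]
    var = [sum((row[k] - mu[k]) ** 2 for row in batch) / n for k in range(d)]
    return [
        [(row[k] - mu[k]) / math.sqrt(var[k] + eps) for k in range(d)]
        for row in batch
    ]
```

After this step every feature has a mean of approximately 0 and a variance slightly below 1, the small difference being due to `eps`.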

3.5. Res-BiLSTM

Activity recognition cannot be accomplished exclusively through 1DCNN-based spatial feature extraction, since human actions are essentially temporal and the order in which events occur is crucial. RNNs perform well on time series data, but they are susceptible to information loss and vanishing gradients as the time series grows longer. The long short-term memory (LSTM) network was proposed in [39]. In contrast to basic RNNs, LSTM networks can efficiently store longer-term temporal information and outperform basic RNNs on longer time series. However, behavioural data at a given moment are influenced by both the preceding and the following moments. An LSTM network that processes data in both directions is called a bidirectional LSTM (BiLSTM). Because it captures bidirectional dependencies, BiLSTM improves time series feature extraction over LSTM and is therefore a good tool for extracting features from behavioural data. While BiLSTM networks excel at time series feature extraction, they fall short at capturing spatial features, and the vanishing gradient problem becomes more severe as the number of stacked layers increases during training. In 2015, a group from Microsoft Research developed ResNet, a residual network, to address the vanishing gradient problem [40]. That year, the network won the ILSVRC competition with a depth of 152 layers. Figure 5 shows the structure of the residual block. Each residual block can be expressed as:

x^{(i + 1)} = x^{(i)} + F (x^{(i)}, W_{i})

(6)

Each residual block consists of two parts: the residual part, denoted $F (x^{(i)}, W_{i})$, and the direct mapping, $x^{(i)}$. The encoder component of the Transformer model also makes use of this structure. Taking advantage of the strengths of the BiLSTM network, our study presents a residual structure built on that architecture.

Normalisation techniques can also be employed in BiLSTM networks. Layer normalisation (LN) is computed analogously to batch normalisation (BN) and offers recurrent neural networks the same benefits as BN [20]. It is expressed as:

\hat{x}^{(i)} = \frac{x^{(i)} - E (x^{(i)})}{\sqrt{\mathrm{var} (x^{(i)})}}

(7)

In this context, $x^{(i)}$ denotes the input vector corresponding to the $i^{th}$ dimension, while $\hat{x}^{(i)}$ represents the normalized output obtained after applying layer normalization. In this study, a novel architecture that integrates a residual connection with layer normalization within a BiLSTM network is introduced. This combined framework is referred to as Res-BiLSTM, and its overall structure is illustrated in Figure 6. The recursive feature information $y$ can be described as:

x_{t}^{f (i + 1)} = \mathrm{LN} (x_{t}^{f (i)} + L (x_{t}^{f (i)}, W_{i}))

(8)

x_{t}^{b (i + 1)} = \mathrm{LN} (x_{t}^{b (i)} + L (x_{t}^{b (i)}, W_{i}))

(9)

y_{t} = \mathrm{concat} (x_{t}^{f}, x_{t}^{b})

(10)

Here, LN denotes layer normalisation and $L$ denotes the processing of the input states by the LSTM network. The subscript $t$ in $x_{t}^{f (i + 1)}$ denotes the $t^{th}$ time step in the input time series, while the superscript $f$ represents the forward hidden state and $b$ indicates the backward hidden state. The term $(i + 1)$ corresponds to the index of the stacked layer in the network. The encoded representation $y_{t}$ at time $t$ is obtained by combining information from both the forward and backward states. As illustrated in Figure 7, the proposed Res-BiLSTM architecture consists of parallel forward and backward LSTM networks that jointly capture temporal dependencies in both directions.
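The data flow of one such layer (a recurrent pass in each direction, a residual addition, layer normalisation, and concatenation) can be sketched as below. To keep the example short and dependency-free, the LSTM is replaced by a toy element-wise tanh recurrence without learned weights; this stand-in, the function names, and the `eps` term are assumptions for illustration and do not reproduce the paper's network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_recurrent(seq, decay=0.5):
    """Stand-in for an LSTM pass: h_t = tanh(x_t + decay * h_{t-1})."""
    h = [0.0] * len(seq[0])
    states = []
    for x in seq:
        h = [math.tanh(xi + decay * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states


def res_bi_layer(seq):
    """One residual bidirectional layer with layer normalisation.

    seq: list of T time steps, each a list of d features.
    Returns T vectors of size 2 * d (forward and backward concatenated).
    """
    fwd = toy_recurrent(seq)
    bwd = toy_recurrent(seq[::-1])[::-1]  # run backwards, then re-align
    out = []
    for x, f, b in zip(seq, fwd, bwd):
        x_f = layer_norm([xi + fi for xi, fi in zip(x, f)])  # residual + LN
        x_b = layer_norm([xi + bi for xi, bi in zip(x, b)])
        out.append(x_f + x_b)  # list concatenation
    return out
```

The residual addition requires the recurrent output to have the same size as its input, which is why the stand-in keeps the feature dimension unchanged.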

3.6. Marine Predators’ Algorithm for Training the Weights and Hyperparameter Tuning

MPA is an optimisation algorithm that simulates the way marine predators search for prey. Its two foraging strategies are based on two random walks, Brownian motion and Lévy flight. The mathematical descriptions of these strategies are given below.

(1): BROWNIAN MOVEMENT

In this stochastic model, the step lengths are drawn from a probability function defined by a Gaussian distribution. The probability density function at the point $x$ is defined as follows:

f_{B} (x; μ, σ) = \frac{1}{\sqrt{2 π σ^{2}}} \exp \left(- \frac{(x - μ)^{2}}{2 σ^{2}}\right) = \frac{1}{\sqrt{2 π}} \exp \left(- \frac{x^{2}}{2}\right)

(11)

where $μ = 0$ and $σ^{2} = 1$.

(2): LEVY FLIGHT

The Lévy distribution can be used to express the step sizes of this random walk in the following way:

L (x_{j}) \approx {|x_{j}|}^{1 - α}

(12)

where $x_{j}$ stands for the flight length and $1 < α \leq 2$ signifies the power-law exponent. The integral formulation of the Lévy stable model is provided by [2]:

f_{L} (x; α, γ) = \frac{1}{π} \int_{0}^{\infty} \exp (- γ q^{α}) \cos (x q) \, dq

(13)

where $α$ is the distribution index that controls the scale properties of the model and $γ$ determines the scale unit. Equation (13) has a closed-form solution in two cases: for $α = 2$ it gives a Gaussian distribution, and for $α = 1$ it gives a Cauchy distribution. In addition, the integral in Equation (13) is solved using the series expansion method as $x$ approaches infinity:

f_{L} (x; α, γ) \approx \frac{γ Γ (1 + α) s i n (\frac{π α}{2})}{π x^{(1 + α)}}

(14)

were,

x

approaches infinity, Γ stands for the Gamma function, where Γ(1+α) is equal to α! for integers α. In order to create a Lévy stable model with an index distribution (α) whose values range from 0.3 to 1.99, the Mantegna algorithm was suggested in Mantegna (1994). To produce Lévy-distributed random integers, the Mantegna method is used as

L e v y (α) = 0.05 \times \frac{x}{| y |^{1 / α}}

(15)

where $x$ and $y$ are normally distributed variables with standard deviations $σ_{x}$ and $σ_{y}$, respectively:

x = Normal (0, σ_{x}^{2})

(16)

y = Normal (0, σ_{y}^{2})

(17)

with $σ_{x}$ given by Equation (18):

σ_{x} = {[\frac{Γ (1 + α) \sin (\frac{π α}{2})}{Γ (\frac{1 + α}{2}) α 2^{\frac{α - 1}{2}}}]}^{1 / α}, \quad σ_{y} = 1, \quad α = 1.5

(18)
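As a minimal sketch of Equations (15)-(18), the Mantegna procedure can be written with the Python standard library alone. The function names `levy_sigma_x` and `levy_step` are illustrative and not part of the original work; the default $α = 1.5$ and $σ_{y} = 1$ follow Equation (18).

```python
import math
import random


def levy_sigma_x(alpha):
    """Standard deviation of x in Equation (18); sigma_y is fixed to 1."""
    num = math.gamma(1 + alpha) * math.sin(math.pi * alpha / 2)
    den = math.gamma((1 + alpha) / 2) * alpha * 2 ** ((alpha - 1) / 2)
    return (num / den) ** (1 / alpha)


def levy_step(alpha=1.5, rng=None):
    """One Levy-distributed step following Equation (15)."""
    rng = rng or random
    x = rng.gauss(0.0, levy_sigma_x(alpha))  # Equation (16)
    y = rng.gauss(0.0, 1.0)                  # Equation (17), sigma_y = 1
    return 0.05 * x / abs(y) ** (1 / alpha)


rng = random.Random(0)  # fixed seed, chosen only for repeatability
steps = [levy_step(1.5, rng) for _ in range(1000)]
```

Most of the generated steps are small, with occasional long jumps when $|y|$ is close to zero, which is the heavy-tailed behaviour the Lévy flight relies on.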

(3): MPA FORMULATION

Like other population-based metaheuristics, MPA starts from an initial solution distributed uniformly over the search space:

X_{0} = X_{\min} + rand (X_{\max} - X_{\min})

(19)

where $X_{\min}$ and $X_{\max}$ are the lower and upper bounds of the variables, respectively, and rand is a vector of uniformly distributed random numbers between 0 and 1. Following the principle of natural selection, the fittest predators are assumed to be the most skilled foragers (Viswanathan et al., 1999). The fittest predator therefore represents the best solution and is used to construct the Elite matrix:

Elite = {[\begin{array}{cccc} X_{1,1}^{I} & X_{1,2}^{I} & \dots & X_{1, d}^{I} \\ X_{2,1}^{I} & X_{2,2}^{I} & \dots & X_{2, d}^{I} \\ \dots & \dots & \dots & \dots \\ X_{n, 1}^{I} & X_{n, 2}^{I} & \dots & X_{n, d}^{I} \end{array}]}_{n \times d}

(20)

where ${\vec{X}}^{I}$ is the fittest predator vector, which is replicated $n$ times to construct the Elite matrix. Here, $n$ is the number of search agents and $d$ is the number of dimensions. The rows of the Elite matrix guide the search for prey based on the prey's positions. Both predator and prey are treated as search agents. A second matrix, called Prey, has the same dimensions as the Elite matrix:

Prey = {[\begin{array}{cccc} X_{1,1} & X_{1,2} & \dots & X_{1, d} \\ X_{2,1} & X_{2,2} & \dots & X_{2, d} \\ \dots & \dots & \dots & \dots \\ X_{n, 1} & X_{n, 2} & \dots & X_{n, d} \end{array}]}_{n \times d}

(21)

where $X_{i, j}$ denotes the $j^{th}$ dimension of the $i^{th}$ prey. The predators update their positions based on the Prey matrix. The whole optimisation process of MPA relies on these two matrices, Elite and Prey.
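The initialisation in Equation (19) and the construction of the Elite matrix in Equation (20) can be sketched as follows. This is an illustration under stated assumptions: the problem is treated as a minimisation, and the `sphere` objective, the bounds, and the population size are hypothetical stand-ins for the actual fitness function and settings.

```python
import random


def init_prey(n, d, x_min, x_max, rng):
    """Equation (19): uniform initial positions for n agents in d dimensions."""
    return [[x_min[j] + rng.random() * (x_max[j] - x_min[j]) for j in range(d)]
            for _ in range(n)]


def build_elite(prey, fitness):
    """Equation (20): replicate the fittest agent n times (minimisation assumed)."""
    best = min(prey, key=fitness)
    return [list(best) for _ in prey]


def sphere(x):
    """Hypothetical objective used only for illustration."""
    return sum(v * v for v in x)


rng = random.Random(42)
prey = init_prey(n=5, d=3, x_min=[-1.0] * 3, x_max=[1.0] * 3, rng=rng)
elite = build_elite(prey, sphere)
```

Every row of `elite` is a copy of the same top predator, so Elite and Prey have the same $n \times d$ shape and can be combined entry by entry in the update rules.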

(4): Optimization scenarios in the MPA

The three main optimization stages of the MPA algorithm are based on different velocity ratios, simulating the entire life cycle of a predator and prey. The three stages are described in the following way:

First phase (Exploration phase): In this phase, the prey moves faster than the predator, using Brownian motion to move about rapidly in search of food, while the predator remains motionless and monitors the prey's movement. The exploration phase takes place during the first third of the iterations and is expressed mathematically as [12]:

while Iteration < \frac{1}{3} Maximum_Iteration

\vec{stepsize_{i}} = \vec{R_{Br}} \otimes (\vec{Elite_{i}} - \vec{R_{Br}} \otimes \vec{Prey_{i}}) for i = 1, \dots, n

\vec{Prey_{i}} = \vec{Prey_{i}} + P \cdot \vec{R} \otimes \vec{stepsize_{i}}

(22)

where $\vec{R_{Br}}$ is a vector of random numbers drawn from a Gaussian distribution, representing the Brownian movement, and $\otimes$ denotes entry-by-entry multiplication. The product $\vec{R_{Br}} \otimes \vec{Prey_{i}}$ simulates the movement of the prey. $P$ is a constant equal to 0.5, and $\vec{R}$ is a vector of uniformly distributed random numbers between 0 and 1. Maximum_Iteration is the total number of iterations allowed, and Iteration is the current iteration number.
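A minimal sketch of the exploration update in Equation (22) is given below. It assumes that the same Gaussian draw $R_{Br}$ is used in both places where it appears for a given dimension, which is how the equation reads; the function name `phase1_update` is illustrative.

```python
import random

P = 0.5  # constant from the text


def phase1_update(prey, elite, rng):
    """Equation (22): Brownian exploration step for every agent."""
    new_prey = []
    for prey_i, elite_i in zip(prey, elite):
        row = []
        for p, e in zip(prey_i, elite_i):
            r_b = rng.gauss(0.0, 1.0)   # Brownian component R_Br
            r = rng.random()            # uniform component R in [0, 1)
            step = r_b * (e - r_b * p)  # step size for this dimension
            row.append(p + P * r * step)
        new_prey.append(row)
    return new_prey


rng = random.Random(1)
prey = [[0.2, -0.4], [0.9, 0.1]]
elite = [[0.0, 0.0], [0.0, 0.0]]  # hypothetical top predator at the origin
moved = phase1_update(prey, elite, rng)
```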

Second phase (Transition phase between exploration and exploitation): In this phase, predator and prey move at almost the same pace, and the search gradually shifts from exploration to exploitation. The prey is responsible for exploitation and moves by Lévy flight, while the predator is responsible for exploration and moves by Brownian motion. The population is therefore divided into two equal halves, one assigned to exploration and the other to exploitation [4]:

while \frac{1}{3} Maximum_Iteration < Iteration < \frac{2}{3} Maximum_Iteration

For the first half of the population:

\vec{stepsize_{i}} = \vec{R_{Lv}} \otimes (\vec{Elite_{i}} - \vec{R_{Lv}} \otimes \vec{Prey_{i}}) for i = 1, \dots, n / 2

\vec{Prey_{i}} = \vec{Prey_{i}} + P \cdot \vec{R} \otimes \vec{stepsize_{i}}

(23)

where $\vec{R_{Lv}}$ is a vector of random numbers drawn from the Lévy distribution, representing the Lévy flight. The product $\vec{R_{Lv}} \otimes \vec{Prey_{i}}$ simulates the prey's Lévy movement, and adding the step size to the prey's position models its motion. Because Lévy step sizes consist mostly of small steps, they are well suited to exploitation. The second half of the population is updated as follows:

\vec{stepsize_{i}} = \vec{R_{Br}} \otimes (\vec{R_{Br}} \otimes \vec{Elite_{i}} - \vec{Prey_{i}}) for i = n / 2, \dots, n

\vec{Prey_{i}} = \vec{Elite_{i}} + P \cdot CF \otimes \vec{stepsize_{i}}

(24)

where

CF = {(1 - \frac{Iteration}{Maximum_Iteration})}^{(2 \frac{Iteration}{Maximum_Iteration})}

is the convergence factor (CF), an adaptive parameter that controls the step size of the predator's movement and so helps predators manage their search space during exploitation. The product $\vec{R_{Br}} \otimes \vec{Elite_{i}}$ models the predator's Brownian motion, and the prey's position is updated according to this Brownian movement of the predator.
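The transition phase can be sketched as follows, under the same reading of Equations (23) and (24): the first half of the agents takes Lévy steps and the second half takes Brownian steps scaled by CF. The helper names are illustrative, and the compact `levy` helper restates the Mantegna formula of Equations (15)-(18) with $α = 1.5$.

```python
import math
import random

P = 0.5  # constant from the text


def convergence_factor(iteration, max_iteration):
    """CF = (1 - t/T) ** (2 * t/T)."""
    ratio = iteration / max_iteration
    return (1 - ratio) ** (2 * ratio)


def levy(rng, alpha=1.5):
    """Mantegna step, Equations (15)-(18)."""
    num = math.gamma(1 + alpha) * math.sin(math.pi * alpha / 2)
    den = math.gamma((1 + alpha) / 2) * alpha * 2 ** ((alpha - 1) / 2)
    sigma_x = (num / den) ** (1 / alpha)
    return 0.05 * rng.gauss(0.0, sigma_x) / abs(rng.gauss(0.0, 1.0)) ** (1 / alpha)


def phase2_update(prey, elite, iteration, max_iteration, rng):
    """Equation (23) for the first half of the agents, Equation (24) for the rest."""
    cf = convergence_factor(iteration, max_iteration)
    half = len(prey) // 2
    new_prey = []
    for i, (prey_i, elite_i) in enumerate(zip(prey, elite)):
        row = []
        for p, e in zip(prey_i, elite_i):
            if i < half:                       # Equation (23): Levy move of the prey
                r_l = levy(rng)
                step = r_l * (e - r_l * p)
                row.append(p + P * rng.random() * step)
            else:                              # Equation (24): Brownian move of the predator
                r_b = rng.gauss(0.0, 1.0)
                step = r_b * (r_b * e - p)
                row.append(e + P * cf * step)
        new_prey.append(row)
    return new_prey
```

CF starts at 1 and decays to 0 at the last iteration, so the steps taken around the Elite positions shrink as the search proceeds.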

Third phase (Exploitation phase):

In this phase, the predator moves faster than the prey. The predator follows a Lévy flight to exploit the search space and capture its prey. This phase covers the final third of the iterations and is expressed mathematically as [5]:

while Iteration > \frac{2}{3} Maximum_Iteration

\vec{stepsize_{i}} = \vec{R_{Lv}} \otimes (\vec{R_{Lv}} \otimes \vec{Elite_{i}} - \vec{Prey_{i}}) for i = 1, \dots, n

\vec{Prey_{i}} = \vec{Elite_{i}} + P \cdot CF \otimes \vec{stepsize_{i}}

(25)

The product $\vec{R_{Lv}} \otimes \vec{Elite_{i}}$ represents the predator's movement under the Lévy strategy. Adding the step size to the Elite position simulates the predator's movement so that the prey's position can be updated.
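The exploitation update of Equation (25) can be illustrated with the short sketch below. The convergence factor is passed in as a plain number `cf`, and the `levy` helper is an illustrative restatement of the Mantegna formula with $α = 1.5$; neither name comes from the original work.

```python
import math
import random

P = 0.5  # constant from the text


def levy(rng, alpha=1.5):
    """Mantegna step, Equations (15)-(18)."""
    num = math.gamma(1 + alpha) * math.sin(math.pi * alpha / 2)
    den = math.gamma((1 + alpha) / 2) * alpha * 2 ** ((alpha - 1) / 2)
    sigma_x = (num / den) ** (1 / alpha)
    return 0.05 * rng.gauss(0.0, sigma_x) / abs(rng.gauss(0.0, 1.0)) ** (1 / alpha)


def phase3_update(prey, elite, cf, rng):
    """Equation (25): Levy-driven exploitation around the Elite positions."""
    new_prey = []
    for prey_i, elite_i in zip(prey, elite):
        row = []
        for p, e in zip(prey_i, elite_i):
            r_l = levy(rng)
            step = r_l * (r_l * e - p)
            row.append(e + P * cf * step)
        new_prey.append(row)
    return new_prey


prey = [[0.3, -0.2], [0.8, 0.5]]
elite = [[0.1, 0.1], [0.1, 0.1]]
final = phase3_update(prey, elite, cf=0.0, rng=random.Random(7))
```

With `cf=0.0`, the value CF takes at the last iteration, every agent lands exactly on the Elite position, which shows how the search contracts around the best solution at the end of the run.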

(5): Eddy formation with the effect from FADs

Environmental effects such as eddy formation and fish aggregating devices (FADs) have a significant impact on the behaviour of marine predators. According to [1], sharks spend most of their time in the vicinity of FADs and use the remaining time to take longer jumps in different dimensions, in search of areas with a different prey distribution. Modelling the FADs effect together with these longer jumps improves the algorithm's performance by preventing MPA from stagnating at local optima. The FADs effect is expressed as:

\vec{Prey_{i}} = \left\{\begin{array}{lll} \vec{Prey_{i}} + CF [\vec{X_{min}} + \vec{R} \otimes (\vec{X_{max}} - \vec{X_{min}})] \otimes \vec{U} & if & r \leq FADs \\ \vec{Prey_{i}} + [FADs (1 - r) + r] (\vec{Prey_{r1}} - \vec{Prey_{r2}}) & if & r > FADs \end{array}\right.

(26)

where $\vec{U}$ is a binary vector of zeros and ones, $FADs = 0.2$ is the probability of the FADs effect, and $r$ is a uniformly distributed random number between 0 and 1. The subscripts $r_{1}$ and $r_{2}$ denote random indices of the Prey matrix.
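Equation (26) can be sketched as below. Two details are assumptions of this sketch rather than statements of the text: $r$ is drawn once per agent, and each entry of the binary vector $\vec{U}$ is set to 0 when a uniform draw falls below FADs and to 1 otherwise.

```python
import random

FADS = 0.2  # probability of the FADs effect, from the text


def fads_update(prey, x_min, x_max, cf, rng):
    """Equation (26) applied to every agent."""
    n, d = len(prey), len(prey[0])
    new_prey = []
    for prey_i in prey:
        r = rng.random()
        if r <= FADS:
            row = []
            for j in range(d):
                # Assumption: U_j = 0 if a uniform draw is below FADS, else 1.
                u = 0.0 if rng.random() < FADS else 1.0
                jump = x_min[j] + rng.random() * (x_max[j] - x_min[j])
                row.append(prey_i[j] + cf * jump * u)
        else:
            r1, r2 = rng.randrange(n), rng.randrange(n)  # random prey indices
            scale = FADS * (1 - r) + r
            row = [prey_i[j] + scale * (prey[r1][j] - prey[r2][j]) for j in range(d)]
        new_prey.append(row)
    return new_prey
```

The first branch relocates an agent by a random point of the search space scaled by CF, while the second moves it along the difference of two randomly chosen agents, which gives the longer jumps described above.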

(6): Memory of the marine predators

Marine predators have a remarkable memory for feeding sites where foraging has been productive. MPA incorporates this ability, which helps it reach optimal solutions quickly while avoiding local ones, through a memory-saving step: after the positions are updated, the current solutions are compared with those of the previous iteration, and a solution is replaced only if the new one is fitter. The MPA pseudo-code is given in Algorithm 1:

Algorithm 1: Steps of MPA

Initialize a set of $n$ solutions (the Prey matrix).
while stop conditions are not met do
    Calculate fitness values and generate the Elite matrix.
    if $t < t_{max} / 3$ then
        Apply Equation (22) to update solution values;
    else if $t_{max} / 3 < t < 2 \times t_{max} / 3$ then
        for the first half of the solutions $(i = 1, \dots, \frac{n}{2})$
            Apply Equation (23) to update solution values;
        for the second half of the solutions $(i = \frac{n}{2}, \dots, n)$
            Apply Equation (24) to update solution values;
    else if $t > 2 \times t_{max} / 3$ then
        Apply Equation (25) to update solution values;
    end if
    Apply Equation (26) (FADs effect) to update the current solutions.
    Update memory and Elite.
end while
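The "Update memory and Elite" step of Algorithm 1 can be sketched as follows, assuming a minimisation problem. The `sphere` objective and the example positions are hypothetical and serve only to make the sketch runnable.

```python
def memory_update(old_prey, new_prey, fitness):
    """Keep, per agent, whichever of the old and new positions is fitter."""
    kept = []
    for old, new in zip(old_prey, new_prey):
        kept.append(list(new) if fitness(new) < fitness(old) else list(old))
    return kept


def sphere(x):
    """Hypothetical objective used only for illustration."""
    return sum(v * v for v in x)


old_prey = [[1.0, 1.0], [0.1, 0.0]]
new_prey = [[0.5, 0.5], [2.0, 2.0]]
prey = memory_update(old_prey, new_prey, sphere)
top_predator = min(prey, key=sphere)  # used to rebuild the Elite matrix
```

Here the first agent accepts its new position and the second keeps its old one, so no agent ever gets worse from one iteration to the next.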

Equations (27)–(29) give the transfer functions used in the three binary forms of MPA: two S-shaped forms (MPAs and MPA 10) and one V-shaped form (MPAV).

TF = 1 / (1 + e^{- X (i, j) - 0.5})

(27)

TF = 1 / (1 + e^{- 10 * X (i, j) - 0.5})

(28)

TF = |\frac{2}{π} \tan^{- 1} (\frac{π}{2} * X (i, j))|

(29)

where $X (i, j)$ is the $j^{th}$ dimension of the $i^{th}$ solution and TF is the value of the transfer function. Each solution is then updated by comparing the TF value with a randomly generated number in the range of 0 to 1.
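The three transfer functions can be sketched as below. The exponents are implemented exactly as Equations (27) and (28) are printed. The comparison rule in `binarize`, which selects a feature when the TF value exceeds the random number, is an assumption of this sketch, since the text does not state the direction of the comparison.

```python
import math
import random


def tf_s(x):
    """Equation (27) as printed: S-shaped transfer function."""
    return 1.0 / (1.0 + math.exp(-x - 0.5))


def tf_s10(x):
    """Equation (28) as printed: steeper S-shaped transfer function."""
    return 1.0 / (1.0 + math.exp(-10.0 * x - 0.5))


def tf_v(x):
    """Equation (29): V-shaped transfer function."""
    return abs(2.0 / math.pi * math.atan(math.pi / 2.0 * x))


def binarize(position, transfer, rng):
    """Assumed rule: select feature j when TF(X(i, j)) exceeds a uniform draw."""
    return [1 if transfer(x) > rng.random() else 0 for x in position]


rng = random.Random(0)
mask = binarize([0.9, -0.3, 0.1, 0.6], tf_s10, rng)  # 1 = feature kept
```

The resulting 0/1 mask has one entry per feature, so a continuous MPA position can be turned into a feature subset for the classifier.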

(7): Computational complexity

The proposed model operates in two phases: feature extraction using the Res-BiLSTM model, and feature optimization. In the second phase, MPA and its variants select the most relevant features to improve accuracy, with classification handled by SVM [15] and the random forest (RF) algorithm [5]. About 1.5 million parameters of the Res-BiLSTM model were trained on the datasets in question. The complexity of feature optimization, denoted by $T^{F}$, is given by Equation (30):

T^{F} = O (t_{max} \times (N_{s} d + CFE \times N_{s}))

(30)

where $t_{max}$ is the maximum number of iterations, $N_{s}$ is the total number of search agents, and $d$ is the dimension, i.e. the number of features. The cost function evaluation, abbreviated as CFE, is classifier-dependent. The training time of the SVM algorithm is $O (N_{TE}^{2})$, while the training time of the RF algorithm is $O (N_{TR} \times N_{TE} \log N_{TE} \times d)$. Here, $N_{TE}$ is the number of training instances and $N_{TR}$ is the number of trees in the RF algorithm.

4. Results

This section first describes the experimental setup: how the models were constructed and parameterised, how the datasets were configured, and the specifications of the machine on which the experiments were run. We then present and discuss the results obtained on each dataset and compare the outcomes of the different models. The proposed model architecture was implemented in TensorFlow [32] through the Keras API. TensorFlow is both an interface for expressing machine learning algorithms and an implementation for executing them. We use TensorFlow 2.4.0, which supports eager execution, and train on GPUs to speed up training. The Keras application programming interface (API) for TensorFlow simplifies the construction of artificial neural networks (ANNs) by hiding the underlying complexity.

4.1. Performance Measures

Three publicly available benchmark datasets are used to assess the performance of the proposed model on the HAR problem and to compare it with other models; each dataset is described in more detail alongside its results. For consistency and meaningful comparison, all models were trained on identical train, validation, and test sets. These datasets have been public for some time and have been used in many other studies, with competing claims of superiority [18,19,23,25,26]. To establish a more uniform benchmark for deep learning applications and for comparison with newer methods, we organised the datasets consistently in our experiments. For instance, since all datasets contain subject-specific information, we ensured that the training and test sets do not contain data from the same participants. The exception is the DAPHNET dataset, where data from some users appears in both the test and validation sets, although it comes from distinct experiments or runs. The reason is the scarcity of data belonging to the Freeze class. Further considerations are discussed in the subsections that follow.

Datasets of human activities collected in natural settings are often inherently class-imbalanced: some classes have a large number of samples and others very few. Among our three datasets, the UCI HAR dataset is the most evenly distributed, with the smallest class accounting for 13% of the training samples and the largest for 19%. The test and validation sets are similarly distributed. The Opportunity dataset is heavily skewed towards the Drink from Cup class, which makes up over 23% of the training data, compared with 2% for Close Drawer 2.

The most straightforward measure of a model's efficacy on a dataset is its accuracy, defined as the percentage of correctly predicted observations relative to the total number of observations. A high accuracy might suggest that the model is ideal. However, accuracy is a reliable metric only for symmetric datasets, where the numbers of false positives and false negatives are nearly equal. Other metrics must therefore be considered when assessing the model's performance. The model's accuracy (A) is calculated as:

A = \frac{T P + T N}{T P + F P + F N + T N}

(31)

where TP and TN are the numbers of true positives and true negatives (correct predictions), and FP and FN are the numbers of false positives and false negatives (incorrect predictions), respectively. Because a classifier tends to predict larger classes more accurately than smaller ones, the overall classification accuracy alone is not an appropriate metric for imbalanced data [16]. The F1 score gives equal weight to the correct classification of each class, taking into account each class's recall and precision. Precision (P) is defined as the proportion of correctly predicted positive observations relative to the total number of predicted positive observations:

P = \frac{T P}{T P + F P}

(32)

Recall (R), often referred to as sensitivity, measures the ability of a model to correctly identify positive instances. It is calculated as the ratio of correctly predicted positive observations to the total number of actual positive observations present in the dataset:

R = \frac{T P}{T P + F N}

(33)

The F1 score is the harmonic mean of P and R; to account for class imbalance, the per-class scores are weighted according to each class's share of the samples:

F 1 = 2 \times \frac{R \times P}{R + P}

(34)

where R denotes recall and P denotes precision. The performance of the models was assessed using several key metrics: total parameter count, accuracy, F1 score [3], and categorical or binary cross-entropy loss. Notably, larger and more complex architectures do not always yield superior results: some compact models achieved lower loss values, highlighting the efficiency of simpler designs.
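Equations (31)-(34) can be computed directly from the confusion counts, as in the sketch below. The `weighted_f1` helper shows one common way to weight per-class F1 scores by class share; the function names and the zero-division convention (returning 0.0) are choices of this sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Equations (31)-(34) for one class treated as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return accuracy, precision, recall, f1


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = total - tp - fp - fn
        score += (tp + fn) / total * binary_metrics(tp, fp, fn, tn)[3]
    return score


# Hypothetical labels: a majority class that is always predicted.
y_true = ["walk"] * 9 + ["freeze"]
y_pred = ["walk"] * 10
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = weighted_f1(y_true, y_pred)
```

In this hypothetical example the accuracy is 0.9 although the minority class is never detected, which is why the F1 score is reported alongside accuracy.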

4.1.1. UCI HAR Dataset

The UCI HAR smartphone dataset, introduced by [18], contains recordings of 30 participants performing basic activities of daily living (BADL) while carrying a waist-mounted smartphone equipped with inertial sensors. The primary objective is to classify six distinct activities using triaxial angular velocity and triaxial linear acceleration data, all sampled at a consistent rate of 50 Hz. These activities comprise three static postures (standing, sitting, and lying) and three dynamic movements (walking, walking downstairs, and walking upstairs). Data collection employed sliding windows of 2.56 seconds duration with a 50% overlap, resulting in 128 readings per window. To ensure signal quality, preprocessing involved applying a median filter alongside a third-order low-pass Butterworth filter with a 20 Hz cutoff frequency for noise reduction. Additionally, the acceleration signals were further decomposed into body acceleration and gravity components using another Butterworth low-pass filter. From this preprocessing pipeline, a total of nine signal channels were ultimately extracted for input into the deep learning models.
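The windowing arithmetic above (2.56 s at 50 Hz with 50% overlap) can be checked with a short sketch. The `sliding_windows` helper and the synthetic 1000-sample signal are illustrative and are not taken from the dataset's own tooling.

```python
def sliding_windows(signal, window_size, overlap):
    """Split a sequence of samples into fixed-width windows with fractional overlap."""
    step = int(window_size * (1 - overlap))  # 64 samples for a 50% overlap of 128
    return [signal[start:start + window_size]
            for start in range(0, len(signal) - window_size + 1, step)]


SAMPLING_RATE_HZ = 50
WINDOW_SECONDS = 2.56
window_size = round(SAMPLING_RATE_HZ * WINDOW_SECONDS)  # 128 readings per window

signal = list(range(1000))  # hypothetical single-channel recording
windows = sliding_windows(signal, window_size, overlap=0.5)
```

Each window starts 64 samples after the previous one, so consecutive windows share half of their readings.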

To create a reproducible and comparable benchmark, we used the datareader.py program from [34] to divide the dataset into train, validation, and test sets. The sets are split by subject, as shown in Table 6, so that all models produce the same, verifiable results and the trained models can be applied to any user. The minimum and maximum class percentages indicate that the dataset is balanced across all six classes. We evaluated several models on the UCI HAR dataset; the results are shown in Table 7. The proposed model achieves the best test accuracy (96.12%) and F1 score (95.15%) of all the models. The CNN-LSTM network comes second, with an F1 score of 89% and an accuracy that is 3.61% lower.

Although the CNN-LSTM network uses more than twice as many parameters as the proposed model, the vanilla LSTM network uses substantially fewer parameters (about 85,000 vs. 1,300,000). Figure 8 and Figure 9 show the accuracy and loss of the proposed model during training and validation. We trained all of the models for a maximum of 350 epochs with early stopping patience set to 100 epochs, and used a learning rate scheduler to lower the learning rate whenever training reached a plateau.
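The training control described above combines two rules: reduce the learning rate when the validation loss stops improving, and stop once it has not improved for a given number of epochs. The framework-independent sketch below replays a list of validation losses to show the logic. The initial learning rate, reduction factor, and scheduler patience are hypothetical values, since the text specifies only the early stopping patience and the maximum number of epochs. In Keras, this behaviour is typically obtained with the `EarlyStopping` and `ReduceLROnPlateau` callbacks.

```python
def train_with_plateau_control(val_losses, lr=1e-3, factor=0.5,
                               lr_patience=10, stop_patience=100):
    """Replay a validation-loss history with LR reduction on plateau and early stopping.

    Returns the epoch at which training ends and the final learning rate.
    """
    best = float("inf")
    since_best = 0     # epochs since the best validation loss
    since_lr_cut = 0   # epochs since the last improvement or LR reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor       # plateau detected: lower the learning rate
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr   # early stopping
    return len(val_losses), lr


# Hypothetical loss history: two improving epochs, then a plateau.
history = [1.0, 0.9] + [0.95] * 5
stopped_at, final_lr = train_with_plateau_control(
    history, lr=1e-3, factor=0.5, lr_patience=2, stop_patience=4)
```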

Figure 10 shows a comparison of the confusion matrices from all six of our models, revealing that the suggested model is the most effective at distinguishing between the classes. The sitting and standing classes are typically mistaken for one another in most models due to their shared properties.

4.1.2. Opportunity Dataset

The Opportunity activity recognition dataset contains real-life activities recorded in a very sensor-rich setting. It comprises data from 72 sensors of 10 modalities, embedded in objects, in the environment, and on the body, recorded from 12 participants over 15 networked sensor systems. This makes it an excellent choice for comparing different activity recognition methods. We considered only the inertial measurement unit data in columns 38 to 134. From these, we included the triaxial accelerometer, gyroscope, and magnetometer readings and excluded the quaternion readings, which left 77 channels (signals). The data was captured at 30 Hz; extracting 3-second windows gave 90 samples per window.

The Opportunity dataset is originally an 18-class classification problem; we remove the extra label known as the null class, leaving the 17 classes listed in Table 8. The “Drink from Cup” class contains the most data, whereas “Close Drawer 1” and “Close Drawer 2” are the smallest classes, each accounting for approximately 2.5% of the dataset. This unequal distribution across classes shows that the dataset is imbalanced. Table 9 summarises the training results on the Opportunity dataset.

The proposed model outperforms the other models on this dataset. With an F1 score of 93.23%, it performs better than the CNN-LSTM and stacked BiLSTM models. Its accuracy of 95.14% and loss of 0.23 are significantly better than those of the stacked LSTM and CNN-LSTM networks. Figure 11 compares the confusion matrices of the six models and shows that the proposed model is the best at distinguishing between the classes. As with the UCI HAR dataset, the proposed model's parameter count is lower only than those of the ResNet and LSTM models. Figure 12 and Figure 13 show the accuracy and loss of the proposed model during training and validation, respectively.

We trained all of the models, including the one we presented, for a maximum of 160 epochs, with early stopping patience set at 20 epochs. When training reached a plateau, we utilised a learning rate scheduler to iteratively cut the learning rate. Due to their shared properties, the “Drawer” related classes in most models are easily confused with one another.

4.1.3. Daphnet

The Daphnet dataset was created to benchmark automatic methods for recognising freezing of gait using acceleration sensors worn on the hips and legs. Freezing of gait (FOG), a sudden and temporary inability to walk, affects over half of the people with severe Parkinson’s disease (PD). It greatly diminishes quality of life and increases the risk of falls. Because gait deficits in PD are frequently resistant to pharmacologic treatment, effective non-pharmacologic treatments are of particular importance. The original study set out to test the feasibility of a wearable device that tracks a person’s gait in real time, analyses the data, and offers assistance according to user preferences. The resulting wearable detects FOG in real time and plays a signal from the moment a freeze is detected until the person starts walking again. The device was assessed in a study with ten individuals with PD, in which expert physiotherapists identified 237 FOG occurrences in post-hoc video analysis.

The dataset was captured in a controlled laboratory environment, with an emphasis on generating a large number of freeze events. Participants walked in a straight line, walked with many turns, and performed a more realistic activity of daily living (ADL) task in which they entered several rooms to retrieve coffee, unlock doors, and so on. The dataset contains two classes: Freeze and No Freeze. The data was captured at 64 Hz and segmented using 3-second fixed-width sliding windows with 50% overlap, giving 192 readings per window. As input to the DL models, we used the nine accelerometer signals, i.e. the triaxial accelerometers at the ankle, upper leg, and trunk. Figure 14 compares the confusion matrices of the six models on the DAPHNET dataset and shows that the proposed model is the best at distinguishing between the classes.

To create a reproducible and comparable benchmark, we used the datareader.py file from [34] to divide the dataset into train, validation, and test sets, based on the experiments conducted for each participant, as shown in Table 10. The large difference between the percentages of the No Freeze and Freeze classes shows that the dataset is severely imbalanced. Table 11 summarises the training results on the Daphnet dataset. The proposed model again outperforms the other models. With an F1 score of 94.07%, it performs 3% better than the LSTM, ResNet, and stacked LSTM models. Its accuracy of 96.32% and loss of 0.25 are better than those of the stacked LSTM and CNN-LSTM networks. As with the UCI HAR dataset, the proposed model has fewer parameters than the CNN-LSTM model. Figure 15 and Figure 16 show the accuracy and loss of the proposed model during training and validation. We trained the proposed model and the other models for a maximum of 160 epochs, with early stopping patience set to 20 epochs and a learning rate scheduler that reduces the learning rate when training plateaus.

Owing to the data imbalance, the “Freeze” class is under-recognised by the majority of models. Compared with the other models, the proposed one is the most effective at identifying these freeze events.

5. Conclusions

Building on advances in deep learning and swarm intelligence techniques, this study addressed human activity recognition using publicly available data acquired from wearable sensors. To address the HAR problem, we proposed a novel feature extraction strategy that uses a residual convolutional BiLSTM to extract pertinent features from the sensor input. To address feature selection, we used recent swarm intelligence algorithms, which have proven very effective in this area. We evaluated the approach on three publicly available benchmark datasets, Opportunity, UCI HAR, and Daphnet, and compared it with various optimization techniques. The results showed that the proposed Res-BiLSTM with MPA achieved the highest performance across several performance indicators and statistical tests. It improved classification accuracy and outperformed numerous optimization algorithms as well as state-of-the-art DL models. Further research is needed on other aspects of feature development, such as exploiting unlabeled data and lowering computation costs, to ease deployment in real-time HAR applications.

Funding

No funding was received for conducting this study.

Conflicts of Interest

The authors declare no competing interests.

References

Abdel-Basset, M.; Hawash, H.; Chakrabortty, R. K.; Ryan, M.; Elhoseny, M.; Song, H. ST-DeepHAR: Deep learning model for human activity recognition in IoHT applications. IEEE Internet Things J. 2020, 8(6), 4969–4979.
Zhou, X.; Liang, W.; Kevin, I.; Wang, K.; Wang, H.; Yang, L. T.; Jin, Q. Deep-learning-enhanced human activity recognition for Internet of healthcare things. IEEE Internet Things J. 2020, 7(7), 6429–6438.
Islam, M. M.; Nooruddin, S.; Karray, F.; Muhammad, G. Multi-level feature fusion for multimodal human activity recognition in Internet of Healthcare Things. Inf. Fusion 2023, 94, 17–31.
Javeed, M.; Abdelhaq, M.; Algarni, A.; Jalal, A. Biosensor-based multimodal deep human locomotion decoding via internet of healthcare things. Micromachines 2023, 14(12), 2204.
Yu, J.; Zhang, J. Monitoring and analysis of physical activity and health conditions based on smart wearable devices. J. Intell. Fuzzy Syst. 2024, (Preprint), 1–16.
Khalid, A. M.; Khafaga, D. S.; Aldakheel, E. A.; Hosny, K. M. Human Activity Recognition Using Hybrid Coronavirus Disease Optimization Algorithm for Internet of Medical Things. Sensors 2023, 23(13), 5862.
Algethami, S. A.; Alshamrani, S. S. A Deep Learning-Based Framework for Strengthening Cybersecurity in Internet of Health Things (IoHT) Environments. Appl. Sci. 2024, 14(11), 4729.
Priyadarshini, I.; Sharma, R.; Bhatt, D.; Al-Numay, M. Human activity recognition in cyber-physical systems using optimized machine learning techniques. Clust. Comput. 2023, 26(4), 2199–2215.
Hamza, K.; Riaz, Q.; Imran, H. A.; Hussain, M.; Krüger, B. Generisch-Net: A Generic Deep Model for Analyzing Human Motion with Wearable Sensors in the Internet of Health Things. Sensors 2024, 24(19), 6167.
Hemalatha, T.; Kalaiselvi, T. C.; Gnana Kousalya, C.; Rohini, G. Multimodal deep learning for activity detection from iot sensors. IETE J. Res. 2024, 70(5), 5006–5018.
Gaud, N.; Rathore, M.; Suman, U. MHCNLS-HAR: Multi-Headed CNN-LSTM Based Human Activity Recognition Leveraging a Novel Wearable Edge Device for Elderly Health Care. IEEE Sens. J. 2024.
Thakur, D.; Guzzo, A.; Fortino, G. Attention-based multihead deep learning framework for online activity monitoring with smartwatch sensors. IEEE Internet Things J. 2023, 10(20), 17746–17754.
Menaka, S. R.; Prakash, M.; Neelakandan, S.; Radhakrishnan, A. A novel WGF-LN based edge driven intelligence for wearable devices in human activity recognition. Sci. Rep. 2023, 13(1), 17822.
Al-qaness, M. A.; Dahou, A.; Trouba, N. T.; Abd Elaziz, M.; Helmi, A. M. TCN-Inception: Temporal Convolutional Network and Inception modules for sensor-based human activity recognition. In Future Generation Computer Systems; 2024.
Wazwaz, A.; Amin, K.; Semary, N.; Ghanem, T. Dynamic and Distributed Intelligence over Smart Devices, Internet of Things Edges, and Cloud Computing for Human Activity Recognition Using Wearable Sensors. J. Sens. Actuator Netw. 2024, 13(1), 5.
Waghchaware, S.; Joshi, R. Machine learning and deep learning models for human activity recognition in security and surveillance: a review. In Knowledge and Information Systems; 2024; pp. 1–32.
Ashwin, M.; Jagadeesan, D.; Raman Kumar, M.; Murugavalli, S.; Chaitanya Krishna, A.; Ammisetty, V. Novel hybrid optimization based adaptive deep convolution neural network approach for human activity recognition system. In Multimedia Tools and Applications; 2024; pp. 1–25.
Dahou, A.; Al-qaness, M. A.; Abd Elaziz, M.; Helmi, A. Human activity recognition in IoHT applications using arithmetic optimization algorithm and deep learning. Measurement 2022, 199, 111445.
Issa, M. E.; Helmi, A. M.; Al-Qaness, M. A.; Dahou, A.; Abd Elaziz, M.; Damaševičius, R. Human activity recognition based on embedded sensor data fusion for the internet of healthcare things. In Healthcare; MDPI, June 2022; Vol. 10, No. 6.
Bolhasani, H.; Mohseni, M.; Rahmani, A. M. Deep learning applications for IoT in health care: A systematic review. Inform. Med. Unlocked 2021, 23, 100550.
Nagarajan, S. M.; Deverajan, G. G.; Chatterjee, P.; Alnumay, W.; Ghosh, U. Effective task scheduling algorithm with deep learning for Internet of Health Things (IoHT) in sustainable smart cities. Sustain. Cities Soc. 2021, 71, 102945.
Helmi, A. M.; Al-Qaness, M. A.; Dahou, A.; Damaševičius, R.; Krilavičius, T.; Elaziz, M. A. A novel hybrid gradient-based optimizer and grey wolf optimizer feature selection method for human activity recognition using smartphone sensors. Entropy 2021, 23(8), 1065.
Islam, M. M.; Nooruddin, S.; Karray, F. Multimodal human activity recognition for smart healthcare applications. In 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC); IEEE, October 2022; pp. 196–203.
Thanarajan, T.; Alotaibi, Y.; Rajendran, S.; Nagappan, K. Improved wolf swarm optimization with deep-learning-based movement analysis and self-regulated human activity recognition. AIMS Math. 2023, 8(5), 12520–12539.
Bhattacharya, D.; Sharma, D.; Kim, W.; Ijaz, M. F.; Singh, P. K. Ensem-HAR: An ensemble deep learning model for smartphone sensor-based human activity recognition for measurement of elderly health monitoring. Biosensors 2022, 12(6), 393.
Jain, R.; Semwal, V. B. A novel feature extraction method for preimpact fall detection system using deep learning and wearable sensors. IEEE Sens. J. 2022, 22(23), 22943–22951.
Hnoohom, N.; Chotivatunyu, P.; Mekruksavanich, S.; Jitpattanakul, A. Multi-resolution CNN for lower limb movement recognition based on wearable sensors. In International Conference on Multi-disciplinary Trends in Artificial Intelligence; Springer International Publishing: Cham, November 2022; pp. 111–119.
Alonazi, M.; Alshahrani, H. M.; Kouki, F.; Almalki, N. S.; Mahmud, A.; Majdoubi, J. Deep convolutional neural network with symbiotic organism search-based human activity recognition for cognitive health assessment. Biomimetics 2023, 8(7), 554.
Waghchaware, S.; Joshi, R. Machine learning and deep learning models for human activity recognition in security and surveillance: a review. In Knowledge and Information Systems; 2024; pp. 1–32.
Bebortta, S.; Singh, S. K. An intelligent framework towards managing big data in internet of healthcare things. International conference on computational intelligence in pattern recognition, 2022, April; Springer Nature Singapore: Singapore; pp. 520–530.
Khaled, H.; Abu-Elnasr, O.; Elmougy, S.; Tolba, A. S. Intelligent system for human activity recognition in IoT environment. In Complex & Intelligent Systems; 2021; pp. 1–12.
Ronald, M.; Poulose, A.; Han, D. S. iSPLInception: an inception-ResNet deep learning architecture for human activity recognition. IEEE Access 2021, 9, 68985–69001.
Fra, V.; Forno, E.; Pignari, R.; Stewart, T. C.; Macii, E.; Urgese, G. Human activity recognition: suitability of a neuromorphic approach for on-edge AIoT applications. Neuromorphic Comput. Eng. 2022, 2(1), 014006.
Boudjema, A.; Titouna, F.; Titouna, C. AReNet: Cascade learning of multibranch convolutional neural networks for human activity recognition. Multimed. Tools Appl. 2024, 83(17), 51099–51128.
Zhao, Y.; Wang, J.; Zhang, Y.; Liu, H.; Chen, Z. A.; Lu, Y.; Gao, S. Flexible and wearable EMG and PSD sensors enabled locomotion mode recognition for IoHT-based in-home rehabilitation. IEEE Sens. J. 2021, 21(23), 26311–26319. [Google Scholar] [CrossRef]
Zheng, G. A novel attention-based convolution neural network for human activity recognition. IEEE Sens. J. 2021, 21(23), 27015–27025. [Google Scholar]
Ige, A. O.; Noor, M. H. M. A deep local-temporal architecture with attention for lightweight human activity recognition. Appl. Soft Comput. 2023, 149, 110954. [Google Scholar] [CrossRef]
Al-qaness, M. A.; Dahou, A.; Abd Elaziz, M.; Helmi, A. M. Human activity recognition and fall detection using convolutional neural network and transformer-based architecture. Biomed. Signal Process. Control 2024, 95, 106412. [Google Scholar]
Uddin, M. A.; Talukder, M. A.; Uzzaman, M. S.; Debnath, C.; Chanda, M.; Paul, S.; Aryal, S. Deep learning-based human activity recognition using CNN, ConvLSTM, and LRCN. Int. J. Cogn. Comput. Eng. 2024, 5, 259–268. [Google Scholar]
Choudhury, N. A.; Soni, B. In-depth analysis of design & development for sensor-based human activity recognition system. Multimed. Tools Appl. 2024, 83(29), 73233–73272. [Google Scholar]

Figure 1. An overall system architecture of human activity recognition.

Figure 2. 2.56-second acceleration waveform per activity.

Figure 3. Segmentation of sensor data.

Figure 4. The difference between (a) 1D CNN and (b) 2D CNN in human activity recognition.

Figure 5. Proposed Res-BiLSTM model architecture.

Figure 6. Residual block in ResNet.

Figure 7. The Res-BiLSTM network consists of a forward and a backward LSTM network. Each LSTM is combined with a residual connection and layer normalization (LN), and the final forward and backward encoded states are concatenated.

Figure 8. Proposed model accuracy graph on the UCI HAR dataset.

Figure 9. Proposed model loss graph on the UCI HAR dataset.

Figure 10. Proposed model confusion matrices on the UCI HAR dataset.

Figure 11. Proposed model confusion matrices on the Opportunity dataset.

Figure 12. Proposed model loss graph on the Opportunity dataset.

Figure 13. Proposed model accuracy graph on the Opportunity dataset.

Figure 14. Proposed model confusion matrices on the DAPHNET dataset.

Figure 15. Proposed model accuracy graph on the DAPHNET dataset.

Figure 16. Proposed model loss graph on the DAPHNET dataset.

Table 1. Some state-of-the-art algorithms to solve HAR.

Reference	Year	Method	Dataset	Results	Limitations
Thanarajan et al. [24]	2023	PSO-Optimized CNN	MHEALTH Dataset	Achieved 92.5% accuracy, reduced computational cost by 15% compared to standard CNNs, robust against noise.	High convergence time during optimization.
Bhattacharya et al. [25]	2022	GA-Enhanced LSTM	UCI HAR Dataset	Improved temporal prediction with 91.8% accuracy, reduced false positives by 12%.	Struggles with real-time data processing, affecting deployment in dynamic environments.
Jain et al. [26]	2022	Ant Colony Optimization (ACO) + CNN	WISDM Dataset	Enhanced minor activity detection, achieving F1-Score of 88.2%; computational overhead reduced by 10%.	Poor generalization to new sensor data; requires dataset-specific tuning.
Priyadarshini et al. [8]	2024	Firefly Algorithm + Bi-LSTM	Opportunity Dataset	Precision of 90.1%, detected overlapping activities effectively, handled long-term dependencies well.	Slow optimization and difficulty scaling to larger datasets.
Menaka et al. [13]	2024	PSO-Inception V3	PAMAP2 Dataset	Delivered 93.2% accuracy, handled multi-sensor fusion efficiently, reduced energy consumption by 18%.	Susceptible to overfitting on small training sets; requires regularization.
Hnoohom et al. [27]	2022	Swarm-Based RNN	RealWorld HAR Dataset	Achieved recall of 89.4%, with efficient identification of rare activities, improved battery life by 10%.	Complexity leads to high computational power demands for wearable devices.
Alonazi et al. [28]	2023	Bee Colony Optimization + DNN	HAPT Dataset	Achieved 90.5% accuracy, 20% faster convergence than standard methods, robust for varying user habits.	Limited scalability when new sensors or data types are introduced.
Waghchaware et al. [29]	2024	Particle Swarm Optimization (PSO) + CNN	WISDM Dataset	Detected walking and running with 88.9% accuracy, reduced training time by 25%, low memory usage.	Lacks privacy-preserving mechanisms for wearable healthcare devices.
Bebortta et al. [30]	2023	Grey Wolf Optimizer (GWO) + GRU	UCI HAR Dataset	Enhanced sequential motion recognition with 91.2% accuracy, reduced latency to 20ms per prediction.	Lower performance in high-noise scenarios, requiring pre-processing.

Table 2. Details of the public datasets.

Dataset	Sensors	Sampling rate	Volunteers	Samples
UCI-HAR	A, G	50 Hz	30	748,206
DAPHNET	A	20 Hz	36	294,739
Opportunity	A, G, M, O, A, M	30 Hz	4	701,366
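
The sampling rate in Table 2 and the 2.56-second window in Figure 2 together fix the window length for UCI-HAR: 50 Hz × 2.56 s = 128 samples. The sketch below shows one way such fixed-length segmentation could be done. The 50% overlap, the function names, and the synthetic signal are assumptions made for illustration; they are not taken from the paper.

```python
def window_length(sampling_rate_hz, window_seconds):
    """Number of samples in one window, rounded to the nearest integer."""
    return round(sampling_rate_hz * window_seconds)


def segment(signal, window, overlap=0.5):
    """Split a 1-D sequence into fixed-length windows.

    `overlap` is the fraction shared by consecutive windows (an assumed
    value, not one stated in the paper). A trailing remainder shorter
    than one window is dropped.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window * (1 - overlap)))
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, step)]


window = window_length(50, 2.56)   # UCI-HAR: 50 Hz, 2.56 s
stream = list(range(512))          # synthetic stand-in for one sensor axis
windows = segment(stream, window)  # 50% overlap gives a step of 64 samples
```

With these assumed settings, consecutive windows start 64 samples apart, so each reading away from the edges of the stream falls into two windows.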

Table 3. Activities of UCI-HAR.

Activities	Samples	Percentage
Walk	121,191	15.3%
Up	117,607	14.6%
Down	108,861	15.4%
Sit	125,577	15.9%
Stand	137,205	17.5%
Lay	137,765	17.3%

Table 4. Activities of DAPHNET.

Activities	Samples	Percentage
Walk	42,300	37.6%
Jog	41,277	32.2%
Down	21,769	10.2%
Up	90,327	9.2%
Stand	52,739	5.4%
Sit	46,297	4.3%

Table 5. Activities of Opportunity.

Door 1	Open Drawer 1
Door 2	Close Drawer 1
Fridge 1	Open Drawer 2
Fridge 2	Close Drawer 2
Door 1	Open Drawer 3
Door 2	Close Drawer 3
Clean Table	Open Drawer 1
Drink from Cup	Open Drawer 1

Table 6. Splitting the UCI HAR dataset.

Set	Subject	Total Samples	Min	Max
Train	1, 3, 5, 6, 11, 14, 15, 16, 17, 19, 21, 22, 23, 28, 29, 30	7342	13.5%	18.1%
Test	2, 9, 10, 13, 18, 24	1946	14.2%	17.2%
Validation	4, 12, 20	990	13.2%	18.2%
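
Table 6 splits the data by subject, so no volunteer contributes windows to more than one set. A minimal sketch of such a subject-wise split follows. The subject IDs are copied from Table 6; the helper names and the placeholder windows are hypothetical.

```python
# Subject IDs copied from Table 6.
SPLITS = {
    "train": {1, 3, 5, 6, 11, 14, 15, 16, 17, 19, 21, 22, 23, 28, 29, 30},
    "test": {2, 9, 10, 13, 18, 24},
    "validation": {4, 12, 20},
}


def check_disjoint(splits):
    """Map each subject to its split; raise if a subject appears twice."""
    owner = {}
    for name, subjects in splits.items():
        for subject in subjects:
            if subject in owner:
                raise ValueError(
                    f"subject {subject} is in both {owner[subject]} and {name}")
            owner[subject] = name
    return owner


def split_windows(windows, splits):
    """Group (subject_id, window) pairs by split; unlisted subjects are skipped."""
    owner = check_disjoint(splits)
    out = {name: [] for name in splits}
    for subject_id, window in windows:
        name = owner.get(subject_id)
        if name is not None:
            out[name].append(window)
    return out


# Hypothetical data: one placeholder window for each of subjects 1..30.
demo = [(s, f"window-of-subject-{s}") for s in range(1, 31)]
parts = split_windows(demo, SPLITS)
```

Table 6 lists 25 subject IDs, so this sketch skips windows from any subject that is not listed rather than assigning them to a set.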

Table 7. Quantitative evaluation on UCI HAR dataset.

Method	Accuracy	Precision	F1-score	Recall
ResNet	93.57	82.34	87.23	90.12
Inception	92.38	90.23	89.14	91.23
CNN+PSO	91.70	88.45	82.32	90.12
LSTM	89.72	87.54	85.23	86.14
BiLSTM	91.81	92.12	88.23	90.15
This Work	96.12	96.24	95.15	94.26
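
The four measures reported in Table 7 can all be derived from a confusion matrix such as those in Figure 10. The sketch below computes accuracy together with macro-averaged precision, recall, and F1-score. Macro averaging is an assumption here, since this section does not state which averaging was used, and the example matrix is made up for illustration.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, F1 from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up 3-class matrix for illustration only (not results from the paper).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
scores = macro_metrics(cm)
```

Macro averaging weights every class equally, which matters for imbalanced data; a support-weighted average would be an equally plausible reading of the reported scores.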

Table 8. Splitting the Opportunity dataset.

Set	Subject	Samples	Min	Max
Train	S1-2, S1-4, S1-5, S1-Drill, S2-1, S2-3, S2-4, S2-5, S3-4, S3-5, S4-1, S4-2, S4-Drill	3015	2.5%	22.1%
Test	S2-2, S2-Drill, S3-1, S4-5	1179	2.2%	21.1%
Val	S1-1, S3-2, S3-Drill, S4-4	1077	3.2%	18.9%

Table 9. Quantitative evaluation on the Opportunity dataset.

Method	Accuracy	Precision	F1-score	Recall
ResNet	82.24	81.13	79.23	80.45
Inception	82.41	79.54	81.45	80.12
CNN+PSO	77.79	75.12	77.16	75.24
LSTM	82.82	79.45	81.34	80.23
BiLSTM	80.90	80.14	79.45	78.23
This Work	95.14	94.45	93.23	95.00

Table 10. Splitting the DAPHNET dataset and data disparity.

Set	Subject	Number of Samples	Min	Max
Train	S1-1, S1-3, S3-1, S3-2, S6-1, S6-2, S7-1, S8-1, S9-1, S10-1	7935	91.3%	8.6%
Test	S2-1, S4-1, S5-1	2322	91.8%	7.1%
Validation	S2-2, S3-3, S5-1	1612	83.0%	15.0%

Table 11. Quantitative evaluation on DAPHNET dataset.

Method	Accuracy	Precision	F1-score	Recall
ResNet	91.97	90.23	93.00	89.12
Inception	90.97	91.45	90.34	90.23
CNN+PSO	94.22	94.67	93.10	92.27
LSTM	88.65	86.78	88.67	91.34
BiLSTM	91.41	90.89	92.02	90.37
This Work	96.92	95.45	94.07	96.15

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
