The methodology was executed on Google Colab using a T4 GPU with 16 GB of RAM [26]. The approach was implemented in a Colab Notebook [26], a web-based interactive platform that combines live code execution, visualization, and explanatory text. The implementation was written in Python and used libraries including scikit-learn and pandas throughout. TensorFlow, a widely used open-source deep learning framework for Python, provides tools for applications such as classification, regression, and clustering [27].
3.2 Dataset
The dataset used in this study, retrieved from [28], comprises 25,000 images. The images were captured using a Charge-Coupled Device (CCD) camera adapted to fit an optical microscope, a setup that yields precise depictions of squamous cells and makes the dataset relevant to cervical cancer analysis. The images are categorized into five groups (Dyskeratotic, Koilocytotic, Parabasal, Metaplastic, and Superficial-Intermediate) that together cover a broad spectrum of the morphological characteristics of cervical cells. Dyskeratotic cells, for instance, exhibit early and aberrant keratinization, which gives them distinctive visual features. Koilocytotic cells, on the other hand, are characterized by vesicular nuclei, often observed in binucleated or multinucleated arrangements.
To further enhance the dataset, we collected an additional 1,000 images per class from Sheikh Zayed Medical College, as shown in Figure 2, along with sample images from our own data collection in Figure 3 and from a Kaggle dataset in Figure 5. These additional images increase the dataset's diversity and improve its capacity for accurate classification and analysis.
Figure 4 shows a selection of sample images that illustrates the diversity of the dataset. This diverse dataset, with its well-defined morphological categories, forms the foundation for model training and evaluation in the subsequent stages of the research.
Figure 5.
Sample images from the dataset.
3.3 NFE Feature Extraction
The Neural Feature Extractor Model (NFEModel) is a customized neural network built on the architecture of the well-known VGG16 model and focused on extracting features from images. It adapts the VGG16 convolutional network, known for its strong image classification performance, to tasks that involve analyzing and identifying complex patterns in images, such as medical imaging for cervical cancer. The NFEModel uses pre-trained weights obtained from the ImageNet database, which give it broad prior knowledge of visual content. It excludes the top (classification) layers of the original VGG16 model so that the network can be customized to the requirements of the project [29]. After the retained convolutional layers have processed the images, the model applies a Global Average Pooling (GAP) layer. This layer condenses the output of the preceding convolutional layers by averaging over the spatial dimensions, making it more manageable [30]. The resulting compact set of features is then ready for further analysis or classification in later stages, which makes the NFEModel an essential first step in image-based diagnosis and classification systems.
The architecture is designed for efficient feature extraction from images, with a pre-trained VGG16 model as its backbone. VGG16, originally trained on the ImageNet dataset, has been shown to be highly effective at recognizing common visual patterns in large-scale image datasets. The top layers responsible for classification are excluded (include_top=False), so that only the convolutional layers are used for feature extraction. The input to the model consists of images resized to 512×512×3, representing the height, width, and three color channels (RGB).
The feature extraction process proceeds with a Global Average Pooling (GAP) layer, which significantly reduces the dimensionality of the feature maps produced by VGG16. Mathematically, for a 3D tensor $x \in \mathbb{R}^{h \times w \times d}$ representing the feature maps, where $h$ and $w$ are the spatial dimensions and $d$ is the depth, the GAP layer computes the average of all values in each feature map. This can be expressed as:
$$y_k = \frac{1}{h\,w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{i,j,k}, \qquad k = 1, \dots, d$$
This operation compresses spatial information, creating a lower-dimensional but representative feature vector $y \in \mathbb{R}^{d}$ that retains the most essential information from the input image [31].
Following the GAP layer, a Dense (fully connected) layer with 256 neurons is applied. The Dense layer performs the following transformation on the input $y$:
$$z = f(W y + b)$$
where $W \in \mathbb{R}^{256 \times d}$ is the weight matrix, $b \in \mathbb{R}^{256}$ is the bias vector, and $f$ is the ReLU (Rectified Linear Unit) activation function. The ReLU function, defined as $f(x) = \max(0, x)$, introduces non-linearity, allowing the network to learn complex relationships between the input features [32].
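As a rough illustration of this transformation, the sketch below applies a 256-unit dense layer with ReLU to a pooled feature vector in NumPy. The input size d = 512 and the random weights are assumptions for the example (trained weights would be used in practice), and the helper names `relu` and `dense_relu` are hypothetical.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)


def dense_relu(y: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Affine map W @ y + b followed by ReLU."""
    return relu(W @ y + b)


rng = np.random.default_rng(0)
d = 512                                   # assumed size of the pooled vector
y = rng.standard_normal(d)                # stand-in for the GAP output
W = rng.standard_normal((256, d)) * 0.01  # random weights, illustration only
b = np.zeros(256)

z = dense_relu(y, W, b)
print(z.shape)  # (256,)
```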
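To mitigate overfitting, the model includes a Dropout layer with a dropout rate of $p = 0.5$. During training, dropout randomly sets a fraction $p$ of the neurons to zero, as defined by:
$$\tilde{z}_i = m_i \, z_i, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$
where $z$ denotes the output of the preceding Dense layer and $m_i$ is a binary mask that equals zero with probability $p$. This technique helps the model generalize better by preventing it from relying too heavily on specific neurons or patterns in the data [33].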
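The masking step might look as follows in NumPy. This sketch uses the common "inverted dropout" convention, in which the surviving activations are scaled by 1/(1 - p) during training so that no rescaling is needed at inference; the Keras Dropout layer follows the same convention, but whether the authors relied on it is an assumption here. The function name `dropout` is hypothetical.

```python
import numpy as np


def dropout(z: np.ndarray, p: float, rng: np.random.Generator,
            training: bool = True) -> np.ndarray:
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        return z  # dropout is inactive at inference time
    keep_mask = rng.random(z.shape) >= p  # True with probability 1 - p
    return z * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
activations = np.ones(8)
print(dropout(activations, p=0.5, rng=rng))                  # zeros and 2.0s
print(dropout(activations, p=0.5, rng=rng, training=False))  # unchanged
```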
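The final layer is another Dense layer with 128 neurons, which further refines the extracted features. The mathematical operation is similar to the earlier Dense layer, with the final output represented as:
$$o = f(W' \tilde{z} + b')$$
where $\tilde{z} \in \mathbb{R}^{256}$ is the output of the Dropout layer, $W' \in \mathbb{R}^{128 \times 256}$ is the weight matrix, and $b' \in \mathbb{R}^{128}$ is the bias term.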
The data processing pipeline includes an ImageDataGenerator, which loads and preprocesses the images in batches. The images are resized to 512×512, and the preprocess_input function normalizes pixel values for compatibility with VGG16, ensuring the images are in a format the model can process efficiently [29].
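Putting the pieces together, the NFE architecture described in this section could be sketched in Keras roughly as follows. This is an illustrative reconstruction under stated assumptions rather than the authors' code: the function names `build_nfe_model` and `extract_features` are hypothetical, freezing the backbone is an assumption (the text does not say whether VGG16 is fine-tuned), and the ReLU on the final 128-unit layer follows the statement that it is "similar to the earlier Dense layer".

```python
import numpy as np
from tensorflow.keras import Model, layers
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input


def build_nfe_model(input_shape=(512, 512, 3), weights="imagenet") -> Model:
    """VGG16 backbone (no top) -> GAP -> Dense(256) -> Dropout(0.5) -> Dense(128)."""
    backbone = VGG16(include_top=False, weights=weights, input_shape=input_shape)
    backbone.trainable = False  # assumption: backbone used as a fixed extractor

    x = layers.GlobalAveragePooling2D(name="gap")(backbone.output)
    x = layers.Dense(256, activation="relu", name="dense_256")(x)
    x = layers.Dropout(0.5, name="dropout")(x)
    outputs = layers.Dense(128, activation="relu", name="dense_128")(x)
    return Model(inputs=backbone.input, outputs=outputs, name="nfe_model")


def extract_features(model: Model, images: np.ndarray) -> np.ndarray:
    """Apply VGG16 preprocessing to RGB images in [0, 255] and run the model."""
    batch = preprocess_input(images.astype("float32").copy())
    return model.predict(batch, verbose=0)
```

Calling `build_nfe_model()` with the defaults downloads the ImageNet weights on first use. Passing `weights=None` and a smaller `input_shape` gives a randomly initialized model that is convenient for checking shapes; in either case `extract_features` returns one 128-dimensional vector per image.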
Figure 6.
Architecture of the NFE Model.
The use of powers of 2 for layer sizes is justified by its computational efficiency, particularly in GPU-based frameworks like TensorFlow and PyTorch. These sizes align better with binary systems, leading to marginally improved memory usage and performance. Additionally, the powers of 2 provide a convenient scaling factor for hyperparameter tuning, simplifying experimentation and reducing the search space for model optimization. While not strictly necessary, this approach strikes a balance between performance and practical ease of use.
3.3.1 AutoInt Feature Extraction
The AutoInt model is a neural network architecture designed to learn and represent the complex interactions between features in high-dimensional datasets. Its strength lies in learning feature interactions automatically through self-attention mechanisms, which are well suited to capturing complex relationships between input features, especially in high-dimensional data such as medical images. Unlike traditional models that rely on manual feature engineering, AutoInt identifies important feature interactions at multiple levels and focuses on the most relevant combinations, which improves generalization and prediction performance in tasks involving intricate patterns, such as medical image classification. The core idea is to use self-attention layers, similar to those in Transformer models, to learn the importance of interactions between pairs of features [34]. This strategy allows the model to concentrate on the most pertinent feature combinations, which could enhance prediction performance in tasks such as classification, regression, and recommendation systems.
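To make the idea of attention over feature pairs concrete, the following NumPy sketch computes a single-head, Transformer-style scaled dot-product self-attention over a small set of feature embeddings. It is a simplified illustration only: it omits the multiple heads and residual connections used in full attention-based interaction models, and the dimensions, random projections, and function names are all assumptions.

```python
import numpy as np


def softmax(scores: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(E, Wq, Wk, Wv):
    """E has one embedding per feature, shape (m, e).

    Returns the attended features (m, k) and the (m, m) attention weights,
    where weights[i, j] scores how strongly feature i attends to feature j.
    """
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
m, e, k = 4, 8, 8                      # 4 features, embedding size 8 (assumed)
E = rng.standard_normal((m, e))
Wq, Wk, Wv = (rng.standard_normal((e, k)) for _ in range(3))

attended, weights = self_attention(E, Wq, Wk, Wv)
print(attended.shape, weights.shape)   # (4, 8) (4, 4)
```

Each row of `weights` sums to one, so it can be read as a distribution over which other features contribute to the updated representation of a given feature.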
The AutoIntModel is designed for automatic feature interaction learning through a series of dense (fully connected) layers. It processes input data hierarchically, gradually reducing the dimensionality while applying non-linear transformations, which enables the model to capture complex relationships between input features. Dense layers are fundamental for transforming and combining features, a critical step in many machine learning tasks such as classification and regression.
The first layer of the model, referred to as Dense Layer 1, consists of 128 units (neurons) and uses the Rectified Linear Unit (ReLU) activation function. The dense layer can be mathematically described as:
$$h_1 = f(W_1 x + b_1)$$
where $W_1 \in \mathbb{R}^{128 \times d}$ is the weight matrix, $b_1 \in \mathbb{R}^{128}$ is the bias vector, $x \in \mathbb{R}^{d}$ is the input vector of dimensionality $d$, and $f$ represents the ReLU activation function, defined as $f(z) = \max(0, z)$. This operation applies a linear transformation to the input data, followed by the ReLU activation, which introduces non-linearity into the model. The non-linearity enables the model to learn complex feature interactions by allowing certain neurons to become inactive (outputting zero), depending on the value of their inputs [32].
Following Dense Layer 1, the model passes the output to Dense Layer 2, which consists of 64 units and similarly applies the ReLU activation function. This layer can be described mathematically as:
$$h_2 = f(W_2 h_1 + b_2)$$
where $h_1 \in \mathbb{R}^{128}$ is the output of Dense Layer 1, and $W_2 \in \mathbb{R}^{64 \times 128}$ and $b_2 \in \mathbb{R}^{64}$ are the weight matrix and bias vector, respectively. Dense Layer 2 further refines the feature interactions learned in the first layer by applying another set of transformations and non-linear activations. This hierarchical structure, where each dense layer learns progressively more abstract representations of the input, is a standard approach in deep learning architectures for modeling complex patterns in data [35].
The final layer of the model, called the Output Layer, consists of 32 units and also uses the ReLU activation function. The transformation applied in the output layer is given by:
$$h_3 = f(W_3 h_2 + b_3)$$
where $h_2 \in \mathbb{R}^{64}$ is the output of Dense Layer 2, $W_3 \in \mathbb{R}^{32 \times 64}$, and $b_3 \in \mathbb{R}^{32}$. The output of this layer is a 32-dimensional vector, representing the final feature interactions learned by the model. These interactions are the result of the sequential transformations and activations applied through the dense layers. Depending on the specific task, this output could be used for further processing, such as classification or regression, in subsequent layers or models.
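The three-layer stack described above (128, 64, and 32 units, each with ReLU) could be written in Keras along the following lines. This is a sketch under assumptions, not the authors' implementation: the function name `build_autoint_dense_stack` is hypothetical, and the default input size of 128 assumes that the model consumes the 128-dimensional NFE features, which the text does not state explicitly.

```python
from tensorflow.keras import Sequential, layers


def build_autoint_dense_stack(input_dim: int = 128) -> Sequential:
    """Dense(128) -> Dense(64) -> Dense(32), all with ReLU activations."""
    return Sequential(
        [
            layers.Input(shape=(input_dim,)),  # input_dim is an assumption
            layers.Dense(128, activation="relu", name="dense_layer_1"),
            layers.Dense(64, activation="relu", name="dense_layer_2"),
            layers.Dense(32, activation="relu", name="output_layer"),
        ],
        name="autoint_model",
    )


model = build_autoint_dense_stack()
print(model.output_shape)  # (None, 32)
```

With a 128-dimensional input, the weight matrices have shapes 128×128, 64×128, and 32×64, matching the dimensions given in the equations when d = 128.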
The use of dense layers for automatic feature interaction learning has been widely studied and found to be effective in capturing intricate patterns in data [36]. The ReLU activation function is particularly advantageous in such models because it is computationally simple and mitigates the vanishing gradient problem during backpropagation, a common problem in deep neural networks. The dense structure, combined with ReLU activations, facilitates efficient and scalable learning from high-dimensional input data.
Figure 7.
Architecture of the AutoInt Model.