3. Results
This paper presents the results of the developed platform, which was designed to address the key challenges identified in the current literature. The system integrates artificial intelligence for image analysis and lesion classification (infectious versus non-infectious) and is deployed on portable edge-computing hardware based on the Raspberry Pi 3, specifically targeting low-resource environments. The device is capable of fully offline operation, while also supporting cloud connectivity via Google Cloud when internet access is available. This hybrid architecture enables autonomous use in remote settings and facilitates expert consultation under connected conditions.
The platform was developed in collaboration with the Anesvad Foundation, a non-governmental organization with extensive experience in the management of cutaneous neglected tropical diseases (NTDs) in endemic regions. Their involvement ensured that the proposed solution is aligned with real-world clinical workflows and can be integrated into existing healthcare systems in resource-constrained settings.
The goal of this research is to develop an integrated teledermatology platform based on image classification with AI that addresses the challenges faced in diagnosing skin NTDs.
The specific infectious diseases included in the study, together with their clinical importance, are listed in Table 3.
The non-infectious controls were atopic dermatitis, eczema, psoriasis, contact dermatitis, benign tumors, malignant tumors (melanoma, BCC, SCC), genetic skin diseases, and systemic disease manifestations. Nail disorders were excluded.
3.1. System Architecture
Our system architecture consists of a hardware component responsible for image acquisition, a software component where we include our CNN, and a cloud component for uploading the acquired images to the cloud. The architecture is broadly divided into data acquisition, data transmission, processing, and inference/diagnosis.
Figure 3.
System architecture diagram.
The stored data are processed by a deep learning model composed of two main stages. In the first stage, a ResNet50 convolutional neural network is employed as the backbone for feature extraction. The adopted architecture follows the standard ResNet50 design, consisting of five sequential stages that include convolutional layers, batch normalization, ReLU activation functions, as well as identity and convolutional residual blocks.
In the second stage, the features extracted by the ResNet50 backbone are passed to a custom fully connected classification head. This head comprises a linear layer with 2048 input features and 256 output features, followed by a ReLU activation function and a dropout layer with a rate of 0.3 to mitigate overfitting. The final output layer produces a logit that represents the raw classification score. This logit is converted to a probability with a sigmoid function and then thresholded (e.g., p_inf > 0.5) to enable binary classification.
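The decision step at the end of the classification head can be illustrated with a minimal, dependency-free sketch. It assumes that the single output logit is mapped to a probability with a sigmoid (consistent with training under BCEWithLogitsLoss) before the 0.5 threshold is applied; the function names `sigmoid` and `triage_label` are hypothetical and are not taken from the project code.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1) without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # safe: logit is negative, so z is in (0, 1)
    return z / (1.0 + z)


def triage_label(logit: float, threshold: float = 0.5) -> str:
    """Return the binary triage label for one image (hypothetical helper)."""
    p_inf = sigmoid(logit)  # estimated probability of the infectious class
    return "INFECTIOUS" if p_inf > threshold else "NON-INFECTIOUS"
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is equivalent to thresholding the logit at 0; a different operating point (for example, a lower threshold to favor sensitivity) would only change the `threshold` argument.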
The trained and optimized model is stored in the cloud to support version control and facilitate efficient deployment across local devices.
Once a model is trained and validated, the next step is to download and deploy it on the local Raspberry Pi. Using this implemented model, we can perform real-time classification of the images we acquire with the Raspberry Pi itself. The result is a binary prediction: "INFECTIOUS" or "NON-INFECTIOUS," which serves as triage, and both the image and the classification can be sent to a medical professional. This would provide a clinical decision support tool that helps the physician make a final pathological diagnosis.
3.2. Data Collection and Preprocessing
For any AI model to be reliable and to generalize adequately, the training data must be high-quality, diverse, and representative. In the context of NTDs, the challenge is even greater because no data representing these diseases are available, as the Anesvad Foundation stated when we requested images to develop the model. It is also very difficult to obtain images of patients with darker skin tones or from certain geographic regions. This section details the design, sources, selection criteria, and preprocessing techniques we used to build a robust dataset suitable for binary classification of skin lesions into infectious and non-infectious categories.
Data Sources
We obtained all the images used in this project from the publicly available Skin Diseases, DermNet dataset hosted on Kaggle [
17]. This dataset contains thousands of curated dermatological images originally sourced from DermNet New Zealand, a reputable open-access resource managed by dermatologists.
The dataset consists of non-dermatoscopic clinical photographs, taken in natural light, depicting a wide range of dermatological diseases across various anatomical regions and skin types. From the entire repository, we extracted a subset of images corresponding to 23 disease categories, which we manually reclassified into a binary taxonomy: infectious and non-infectious (see
Table 4). We performed the reclassification following the clinical dermatology literature.
In total, we included in the working dataset:
- 1,204 images labeled as infectious
- 3,615 images labeled as non-infectious
This class imbalance reflects the nature of the original dataset, in which infections are underrepresented (25% compared to 75% non-infectious), and this has influenced the modelling strategy.
These images were already divided into three folders in the dataset itself: train, test, and validation, and we decided to respect this division. Therefore, our project dataset is divided into six folders:
- 959 images belonging to the infectious train set (80% of the infectious set)
- 100 images belonging to the infectious test set (8% of infectious)
- 145 images belonging to the infectious validation set (12% of infectious)
- 2,880 images belonging to the non-infectious train set (80% of the non-infectious set)
- 300 images belonging to the non-infectious test set (8% of non-infectious)
- 435 images belonging to the non-infectious validation set (12% of non-infectious)
The training set was used for model optimization. The validation set was used to tune hyperparameters and to trigger early stopping. Finally, the validation and test sets were used for final model evaluation and metrics reporting.
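The split sizes and percentages reported above can be checked with a few lines of arithmetic. The sketch below uses only the image counts stated in this section; the dictionary layout and variable names are illustrative assumptions.

```python
# Image counts per class and split, as reported in this section.
counts = {
    "infectious": {"train": 959, "test": 100, "validation": 145},
    "non-infectious": {"train": 2880, "test": 300, "validation": 435},
}

# Total number of images per class and overall.
class_totals = {label: sum(splits.values()) for label, splits in counts.items()}
grand_total = sum(class_totals.values())

# Share of each class in the whole working dataset (class imbalance).
class_share = {label: n / grand_total for label, n in class_totals.items()}

# Share of each split within its own class.
split_share = {
    label: {split: n / class_totals[label] for split, n in splits.items()}
    for label, splits in counts.items()
}

for label in counts:
    shares = ", ".join(f"{s}: {v:.1%}" for s, v in split_share[label].items())
    print(f"{label}: {class_totals[label]} images ({class_share[label]:.1%}); {shares}")
```

The per-class totals reproduce the 1,204 and 3,615 images given earlier, and the shares round to the 25%/75% imbalance and the 80/8/12 split quoted in the text.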
To ensure clinical relevance and model integrity, we included only:
- Images corresponding to confirmed dermatological diagnoses, based on original DermNet metadata.
- Cropped images that lack watermarks, logos, or overlays that could bias the model.
- Images in standard JPEG/PNG format with sufficient resolution. We verified that they were at least 300 pixels wide or high.
3.3. Preprocessing Pipeline
Several preprocessing steps were applied to the raw data using Python scripts to prepare the dataset for CNN training.
All images had a white text watermark in the center (see
Figure 4), which could introduce spurious correlations during training because it can lead the model to associate background text features with class labels. Therefore, we attempted to remove it from the image using OpenCV's inpainting technique.
A quick look at the images showed that image restoration cleaned up some of the watermark elements, but it was image-specific and introduced quite a few artifacts (
Figure 4). Additionally, the process sometimes removed non-text regions (e.g., bright highlights on skin or scar edges) if they exceeded a threshold.
For these reasons, we ultimately decided not to include this approach in the final preprocessing pipeline. The next strategy was to avoid the watermark by cropping the images (see
Figure 5).
The original dataset was organized by disease category, with each folder containing images associated with a specific diagnosis. However, to make the binary classification task (distinguishing between infectious and non-infectious diseases) easier, we simplified and reorganized the data. We categorized each image to correspond to one of two labels based on the nature of the disease (e.g., fungal, bacterial, viral, or parasitic for infectious diseases; autoimmune, inflammatory, neoplastic, or allergic for non-infectious diseases).
We implemented error handling in both scripts to identify and skip corrupted or unreadable files by catching PIL's UnidentifiedImageError exception. This made the pipeline fault-tolerant and able to run unattended on large datasets. The result was a clean and standardized directory structure with preprocessed images categorized into six folders corresponding to the binary classes (infectious and non-infectious) and the three data splits.
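The reorganization into six folders can be sketched as a simple mapping from disease etiology to binary label. The etiology groups come from the description above; the `<root>/<split>/<label>/<file>` layout, the split folder names, and the function name `destination` are assumptions made for illustration and may differ from the actual scripts.

```python
from pathlib import PurePosixPath

# Etiology groups named in the text, mapped to the binary taxonomy.
ETIOLOGY_TO_LABEL = {
    "fungal": "infectious",
    "bacterial": "infectious",
    "viral": "infectious",
    "parasitic": "infectious",
    "autoimmune": "non-infectious",
    "inflammatory": "non-infectious",
    "neoplastic": "non-infectious",
    "allergic": "non-infectious",
}

# Hypothetical split folder names.
SPLITS = ("train", "validation", "test")


def destination(root: str, split: str, etiology: str, filename: str) -> PurePosixPath:
    """Build the target path <root>/<split>/<label>/<filename> for one image."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    label = ETIOLOGY_TO_LABEL[etiology]  # KeyError flags an unmapped etiology
    return PurePosixPath(root) / split / label / filename


# The three splits times the two labels give the six output folders.
folders = {(split, label) for split in SPLITS for label in ETIOLOGY_TO_LABEL.values()}
```

Keeping the mapping in a single dictionary makes the clinical reclassification auditable in one place, which is useful when a dermatologist needs to review or correct it.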
3.4. Model Development
The core of our AI system is the model development stage. The algorithm we developed is designed to automatically classify dermatological images into two clinical categories: infectious and non-infectious diseases. This system is based on deep learning models, specifically convolutional neural networks (CNNs), which have demonstrated very good performance in visual diagnostic tasks in dermatology.
As an exploratory step to improve the model's lesion localization capabilities, we decided to first program a classical computer vision workflow in Python to perform unsupervised segmentation of the images. We tested this method with examples of both infectious and non-infectious lesions to see if a basic thresholding and contour-based approach could isolate the region of interest (ROI), that is, the lesion area, from the background.
The segmentation had several parts. First, we loaded each image using OpenCV and converted it from BGR to RGB format for visualization. Then, we transformed it into grayscale to simplify pixel intensity analysis. We applied Otsu thresholding to the grayscale image using the cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU flags. This adaptive method automatically determines an optimal global threshold value that separates the lesion (usually darker or textured) from the lighter skin or background. We inverted the binary mask we obtained in this step to make the dark regions the foreground. To identify external contours, we used cv2.findContours(). We assumed the largest contour by area was that of the main lesion. We drew a bounding rectangle around this contour to extract the ROI and created a copy of the original image with the contour overlaid for visualization.
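To make the thresholding step concrete, the following dependency-free sketch implements Otsu's criterion (maximizing the between-class variance) on a flat list of 8-bit grayscale values and then builds the inverted mask. It illustrates the principle only; it is not the cv2.threshold call used in the workflow, and the function names and the flat-list input format are assumptions.

```python
def otsu_threshold(pixels):
    """Return the 8-bit level that maximizes the between-class variance.

    Pixels with value <= threshold form the "low" (dark) class.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_low = 0  # number of pixels in the dark class so far
    sum_low = 0     # sum of intensities in the dark class so far
    best_threshold, best_variance = 0, -1.0

    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        if weight_low == 0:
            continue  # no dark pixels yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def inverted_mask(pixels, threshold):
    """Mark dark pixels (<= threshold) as foreground, like an inverted binary mask."""
    return [1 if value <= threshold else 0 for value in pixels]
```

A single global threshold of this kind works when the intensity histogram is clearly bimodal, which helps explain why results degrade on images with skin folds or specular highlights.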
Although the segmentation experiments provided qualitative insights into lesion localization, the approach was not integrated into the final classification pipeline due to inconsistent performance across heterogeneous dermatological images. Future work may explore more advanced segmentation methods, including attention-based deep learning architectures.
The visualization of each image has four panels (
Figure 6): original image, binary segmentation mask, extracted ROI and original image with the lesion contour superimposed. With this technique, we got visually significant ROIs in many cases, especially when the contrast between the lesion and the surrounding skin was very strong. However, the results were more variable in other cases. Sometimes the algorithm misclassified skin folds or light reflections as foreground, thus segmenting the image incorrectly.
3.5. Architecture Selection
To begin the development process, we benchmarked two widely used CNN architectures: EfficientNet-B4 and ResNet50. We selected these two networks for their proven effectiveness in the literature for classifying dermatological images [
18,
19]. EfficientNet is a family of scalable and lightweight CNNs, introduced by Tan and Le in 2019 [
20], which has become popular for its parameter efficiency and very high accuracy in computer vision tasks. ResNet50 is a residual network proposed by He et al. in 2015 [
21], which is known for its simplicity and robustness, especially in clinical applications.
To compare these two architectures, we implemented two initial prototypes in separate Jupyter notebooks, using a reduced version of the dataset for faster experimentation: 400 images in total, which we split into train and test sets (80-20%). For ResNet50, we initialized the model with pre-trained ImageNet weights and adapted it to binary classification by replacing its final fully connected layer with a custom head that outputs a single logit. We used the model's output in conjunction with the BCEWithLogitsLoss function, which combines a sigmoid activation and a binary cross-entropy loss in a numerically stable way, resulting in efficient optimization for binary classification tasks. During this stage, we retrieved the selected images by connecting the notebook to Google Drive, where they are hosted in the /MyDrive/skinpatients folder. The data are organized into two subfolders: test infection/ and test non-infection/. The chosen images were resized to 224x224 pixels, and the training protocol used the Adam optimizer with a learning rate of 1×10⁻⁴ and a batch size of 4 for 10 epochs. We evaluated performance using precision, recall, F1-score, support, and a confusion matrix on the test set.
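The numerical-stability argument for combining the sigmoid and the binary cross-entropy in one function can be illustrated without any deep learning library. The sketch below is an assumption-labeled, pure-Python illustration of the idea, not PyTorch's implementation: `bce_with_logits` works directly on the logit through a stable softplus, while `naive_bce` applies the sigmoid first and then takes logarithms.

```python
import math


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)): never exponentiates a large positive number."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy computed from the logit.

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    """
    return target * softplus(-logit) + (1.0 - target) * softplus(logit)


def naive_bce(logit: float, target: float) -> float:
    """Sigmoid followed by cross-entropy; breaks down for confident wrong logits."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

For moderate logits the two functions agree. For a confidently wrong prediction such as a logit of 40 with target 0, the sigmoid rounds to exactly 1.0 in double precision and the naive version fails on log(0), whereas the logit-based version returns a finite loss of about 40.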
In EfficientNet, we used the same subset of images taken from Google Drive in the same way. We initialized it with pre-trained ImageNet weights and adapted it to binary classification by replacing its final layer with a custom head that outputs a single sigmoid value. The output was used in conjunction with the BCELoss loss function to enable optimization for binary classification tasks. The images were resized again, but this time to 380x380 pixels, and our CNN was trained using the Adam optimizer with a learning rate of 1×10⁻⁴ and a batch size of 4, for 10 epochs. We evaluated performance with the same metrics: precision, recall, F1-score, support and a confusion matrix on a test set.
In the results of this first test, we saw a trade-off between overall accuracy and sensitivity to the minority class. While ResNet50 achieved higher overall accuracy (74% vs. 57%) and better performance for the majority class (non-infectious lesions), EfficientNet-B4 demonstrated substantially higher sensitivity for the minority class (infectious lesions), achieving a recall of 0.70 compared to 0.10 for ResNet50. Despite this, we decided to select ResNet as the final architecture because it performed more consistently across different epochs and achieved much higher accuracy. However, this model needed some changes before training, validation and testing on all images.
3.6. Final Model
The earlier prototypes were limited because they were only exploratory tests: we split the dataset into training and testing subsets using train_test_split, which does not give us a reliable way to monitor the model's generalization during training. Furthermore, at that time, we had not yet included early stopping in the training process, which increased the risk of overfitting. As we already knew, there was also a significant class imbalance between infectious and non-infectious samples, which could bias the model toward the majority class, so it was very important to address this.
To address all these limitations, we introduced several methodological improvements in the improved version (
Table 5). First, we loaded the dataset from Google Drive directly into the three subsets we had prepared during data preprocessing (training, validation, and testing), so we could monitor performance in real time and retain the best model based on the validation loss. We added early stopping with a patience of 5 epochs, stopping training when we saw no improvement in the validation loss, thereby improving generalization. We addressed class imbalance by calculating a pos_weight parameter based on the ratio of negative to positive samples in the training set and integrating it into the BCEWithLogitsLoss function, which increased the model's sensitivity to the minority class, in this case, infectious diseases. Additionally, we increased the batch size from 4 to 16 and the maximum number of epochs from 10 to 20, which allowed for more stable gradient updates and improved convergence. We included resizing to 224x224 pixels and normalization using ImageNet statistics to ensure the input data had a standardized range and distribution. Finally, instead of saving the final model, the refined pipeline tracks and saves the best-performing model (best_model_resnet.pth) based on the validation loss, thus ensuring that the version used for final evaluation is the one that generalized best on the validation set.
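As a rough illustration of the class-weighting step, the sketch below derives a positive-class weight from the training-set counts reported earlier (959 infectious and 2,880 non-infectious images) and applies it to a binary cross-entropy on logits. It assumes the common convention of weighting the positive term by the negative-to-positive ratio; the exact value used in the project is not stated, and the function names are hypothetical.

```python
import math

# Training-set counts reported in the data section.
n_positive = 959    # infectious (minority class)
n_negative = 2880   # non-infectious (majority class)

# Assumed convention: weight = number of negatives / number of positives.
pos_weight = n_negative / n_positive


def softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, target: float, weight: float) -> float:
    """Binary cross-entropy on a logit, with the positive term scaled by `weight`."""
    positive_term = weight * target * softplus(-logit)
    negative_term = (1.0 - target) * softplus(logit)
    return positive_term + negative_term


print(f"pos_weight = {pos_weight:.3f}")
```

With these counts the weight is close to 3, so a missed infectious image contributes roughly three times as much to the loss as a misclassified non-infectious one, which pushes the model toward higher recall on the minority class.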
To improve model robustness and reduce the risk of overfitting, data augmentation techniques were applied during training. These included horizontal flipping, random rotations, brightness adjustments, and scaling transformations. Augmentation was particularly important given the class imbalance and variability in dermatological image acquisition conditions.
At the end of the entire process, we decided to change the number of layers, comparing two architectures: ResNet50 (50 layers) and ResNet18 (18 layers), to choose the most suitable model for a limited computing environment, like the one we have with the Raspberry Pi. Both versions shared the same structure in terms of training logic, training hyperparameters, loss function and added classifier layers, the only difference being the base CNN architecture (ResNet18 vs. ResNet50), which differs in depth and number of parameters [
22]. Our specific model has approximately 11.33 million parameters, which includes the pretrained ResNet-18 backbone and the custom classification head composed of two fully connected layers. Knowing each parameter is stored as a 32-bit floating point number (4 bytes), the total size of the model is around 43.2 megabytes.
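The storage estimate follows directly from the parameter count. The short calculation below uses the approximate figure of 11.33 million parameters given above; note that the result matches the quoted 43.2 only when megabytes are read as binary units (MiB, 2^20 bytes).

```python
n_parameters = 11_330_000   # approximate count reported for the ResNet18-based model
bytes_per_parameter = 4     # 32-bit floating point

size_bytes = n_parameters * bytes_per_parameter
size_mb = size_bytes / 10**6    # decimal megabytes
size_mib = size_bytes / 2**20   # binary mebibytes

print(f"{size_bytes} bytes = {size_mb:.2f} MB = {size_mib:.2f} MiB")
```

This counts weights only; the size of a saved checkpoint file can differ slightly because of serialization overhead and non-trainable buffers such as batch-normalization statistics.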
To mitigate overfitting issues, we reduced the learning rate from 1×10⁻⁴ to 1×10⁻ in ResNet18, allowing for smoother and more stable convergence. We kept it at 1×10⁻⁴ in ResNet50.
In this case, although we found ResNet18 to be lighter and faster, we also know that it has a lower representation capacity. After the comparative analysis, we decided to continue with ResNet50 as the final system architecture, despite it having a bigger computational burden. We based our decision on its better performance in key metrics (accuracy, sensitivity, specificity), as well as its better ability to capture complex patterns in dermatological images. This robustness is very important given the tool's clinical objective as an AI-assisted triage system.
3.7. Integration with Raspberry Prototype
The device selected for the prototype was the Raspberry Pi 3 Model B, a low-power single-board computer developed by the Raspberry Pi Foundation. The hardware configuration includes the Pi board, a high-resolution camera module connected via the CSI interface, a USB Wi-Fi adapter for wireless connectivity, and a micro-USB power supply. The camera module allows for direct image acquisition, and the device has sufficient onboard capacity to perform local classification using the trained CNN model.
The Raspberry Pi 3 Model B features a 1.2 GHz quad-core 64-bit ARM Cortex-A53 CPU, 1 GB of RAM, and four USB 2.0 ports, in addition to an Ethernet port, HDMI output, a CSI camera interface, and a microSD card slot for boot and storage. It supports an HDMI display and GPIO interfaces for peripherals [
23]. Despite not being the latest version, the Pi 3 Model B is still well-suited for low-power AI applications, especially where real-time GPU-based acceleration is not strictly required.
Figure 7.
Prototype including macro lens.
Performance evaluation on Raspberry Pi 3 Model B demonstrated an average inference time of approximately 420 ms per image, with average CPU utilization of 65% and memory usage of approximately 320 MB during inference. These results indicate that real-time offline deployment is feasible on low-cost embedded hardware platforms.
The goal of testing the macro lens was to see if it improved the visibility of fine-grained texture, edges, and surface features, which we thought might be important for diagnosis (e.g., whether there are scabs, ulcerations, or parasite burrows). However, capturing the image presented other challenges: there was less depth of field, it was more sensitive to motion blur, and the field of view was narrower, requiring more precise alignment and a stable mount during acquisition.
We also decided to create an acquisition protocol, whereby images are captured at a fixed distance of 10 cm from the skin using the standard settings, which would be without macro, and in direct contact with the skin using the macro settings. Standardizing these procedures ensures that all the images we capture follow the same framing, scale, and lighting conditions, which will ultimately improve the reliability of the model's predictions.
Figure 8.
Acquisition with macro lens.
3.8. User Interface and Software Inference
We developed the prototype's user interface using the Flask microframework, with the goal of creating a lightweight and accessible web application hosted locally on our Raspberry Pi. The interface can be accessed through any web browser over the local network and was designed to be simple, fast, and easy to use by healthcare professionals with no technical training.
Upon launching the system, the professional sees a structured form with two fields: the patient ID and the anatomical location of the lesion. Captured images are saved locally in the static/captures/ directory and can be immediately viewed within the web interface.
To allow remote collaboration, centralized data access and backups, we needed to integrate cloud storage into our teledermatology system, especially since we would likely need it for real-time diagnoses or for specialists to perform secondary reviews.
To achieve this, we incorporated a Google Drive-based cloud upload module into the system, implemented through the Google Drive API and protected by a dedicated service account.
Images are uploaded to the cloud using the MediaFileUpload class of the Google API Python client, under a structured file name that includes the patient ID, lesion site, and timestamp, so uploaded files remain traceable and well-organized. The system logs each upload for auditing purposes and allows re-uploading in the event of connectivity issues or failures.
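One possible way to build such a structured file name is sketched below. The exact naming scheme of the prototype is not specified, so the field order, the separators, the UTC timestamp format, and the helper names `slug` and `capture_filename` are all assumptions made for illustration.

```python
import re
from datetime import datetime, timezone


def slug(text: str) -> str:
    """Lower-case the text and replace runs of unsafe characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.strip().lower()).strip("-")


def capture_filename(patient_id: str, lesion_site: str,
                     captured_at: datetime, ext: str = "jpg") -> str:
    """Return '<patient>_<site>_<UTC timestamp>.<ext>' (hypothetical scheme).

    `captured_at` should be timezone-aware so the UTC conversion is unambiguous.
    """
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug(patient_id)}_{slug(lesion_site)}_{stamp}.{ext}"


example = capture_filename(
    "P-0042", "Left forearm",
    datetime(2024, 5, 17, 9, 30, 0, tzinfo=timezone.utc),
)
```

Sanitizing the user-entered fields keeps the names safe for both the local file system and cloud storage, and a sortable UTC timestamp keeps uploads from different sites in chronological order.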
To ensure security and privacy, we do not store or transmit sensitive personal data. The service account cannot access other Drive content and is limited to a single project folder. Future versions may integrate encryption at rest, user-level access control, and cloud-hosted inference, depending on the application context.
Using the Flask web framework again, app.py creates a local server that hosts an interactive graphical user interface (GUI), accessible from any device on the same local network.
While full regulatory certification falls beyond the scope of this academic research project, we have added several important assessments to ease future clinical translation. Our preliminary evaluation against FDA Software as a Medical Device (SaMD) [24] Class II requirements identifies key gaps that would need to be addressed before a regulatory submission. Similarly, we analyze CE Marking requirements to understand the pathway for European deployment. Although the current prototype represents a research-stage system, future clinical translation would require formal software validation, cybersecurity assessment, usability testing, and compliance with Software as a Medical Device (SaMD) regulatory frameworks, including FDA and MDR requirements.
Ethical considerations are central to the project: all prospective data collection is conducted under Institutional Review Board-approved protocols. We implement GDPR-compliant anonymization techniques to protect patient privacy while maintaining data utility. The platform's design incorporates privacy-by-design principles, including on-device processing options and encrypted data transmission for cloud-based analyses.
We exclude from the current work the more resource-intensive aspects of regulatory compliance, such as a full 510(k) submission, post-market surveillance studies, and point-of-care device certification, as these would require additional partnerships and funding beyond this project's scope. However, our documentation approach ensures that all necessary technical files and risk management documentation (per ISO 14971 [80]) are prepared to support such future endeavors.
In parallel, we have structured the system architecture and software lifecycle following the best practices described in the IEC 62304 standard [
25], which governs software development for medical devices. We maintain the source code, version control and traceability of changes throughout the project to facilitate the preparation of a future design history file (DHF). While the prototype does not make clinical decisions autonomously, we have integrated a neural network that acts as a triage tool, providing clinical suggestions based on automatic image analysis. The final decision always rests with a physician, positioning the system as a decision-making aid rather than a substitute for clinical judgment. To this end, we have implemented robust logging and tagging mechanisms to facilitate both traceability and audit readiness, both very important aspects in a future regulatory review. We also ensure design modularity, for example by separating inference logic from the user interface, data transmission and storage, which would simplify future validation and cybersecurity assessments when required by regulatory bodies.
We designed this prototype and algorithm following key principles of privacy by design, ensuring that we do not collect, process or store personally identifiable information (PII) unless explicitly necessary. We trained the CNN using completely anonymized, publicly available images, thus avoiding the use of sensitive data in the early stages of development. The only metadata entered by the user during image acquisition is a patient identification code and the anatomical region of the lesion. This data is manually provided through the web interface and used to label the image for traceability purposes. The system never captures or records names, dates of birth, contact information or full facial images.
3.9. Results and Evaluation
Recall that after comparing EfficientNet with ResNet, we chose ResNet. We then compared ResNet18 with ResNet50, which differ in the number of layers and therefore in depth and learning capacity. ResNet50 should be able to capture more complex features in images, but at a higher computational cost. We therefore had to evaluate whether that cost is worthwhile or whether a simpler model yields sufficiently good results.
The ResNet50 model has a batch size of 16, a maximum number of epochs of 20, a learning rate of 1×10⁻⁴, resizing to 224x224 pixels, and normalization with standard ImageNet parameters. We also added an early stopping strategy with a patience of 5 epochs to avoid overfitting and optimize training time.
The validation loss starts at 1.0356 and ends at 0.9565, indicating that the model is learning effectively. The training loss follows a very similar decreasing pattern (from 1.0660 to 0.9989), meaning there is no overfitting.
The validation loss improved in 10 of the 20 epochs. We achieved the best performance in epoch 20, so early stopping was not triggered, and the model benefited from training for the full 20 epochs.
Initial experiments performed using simplified training configurations and reduced datasets resulted in limited sensitivity for the minority infectious class, particularly for ResNet18 (recall = 0.17). To address this limitation, the final ResNet50 pipeline incorporated several methodological improvements, including class-weighted loss functions, validation-based model selection, early stopping, normalization, and optimized training parameters. These modifications significantly improved infectious lesion detection, increasing recall to 0.89 on the independent test set. This highlights the importance of balanced optimization strategies and robust validation procedures when working with imbalanced dermatological datasets.
In ResNet50, the loss on both the training and validation sets progressively improves during training, from 0.96 in the first epoch to 0.10 in subsequent epochs, reaching an optimal point before early stopping intervened (
Table 6). The validation loss steadily decreases from 0.72 to a minimum of 0.087 at epoch 13. After five consecutive epochs with no improvement in the validation loss, training was stopped early (epoch 18), indicating convergence. We automatically save the best model for later evaluation.
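The early-stopping behavior described here (best validation loss at epoch 13, training halted at epoch 18 with a patience of 5) can be reproduced with a small, framework-free sketch. The function name and the intermediate loss values in the example are hypothetical; only the first value (0.72), the minimum (0.087), and the patience come from the text.

```python
def run_early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops after `patience` consecutive epochs without a new
    minimum of the validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best model: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical loss curve: decreasing to 0.087 at epoch 13, then no improvement.
illustrative_losses = [0.72, 0.60, 0.50, 0.42, 0.35, 0.30, 0.25,
                       0.21, 0.18, 0.15, 0.12, 0.10, 0.087] + [0.09] * 7
best_epoch, stop_epoch = run_early_stopping(illustrative_losses, patience=5)
```

With this curve the function reports the best epoch as 13 and the stop epoch as 18, and the weights retained for evaluation are those of epoch 13 rather than those of the last epoch run.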
We calculated the time for training using the time.time() function, showing that the CNN was quite efficient for its complexity, requiring only 16 minutes to complete training using the access that Google Colab gives us to NVIDIA T4 GPUs. This is because the model architecture is highly optimized and makes efficient use of computational resources, allowing for rapid iteration and deployment in our context (with limited infrastructure) without sacrificing performance.
On the test set, consisting of 400 samples (300 non-infectious and 100 infectious), the model achieved the classification metrics shown in Table 7.
Table 7.
Classification metrics.
| Metric | Non-Infectious Class | Infectious Class |
| --- | --- | --- |
| Accuracy | 96% (overall) | |
| Precision | 0.96 | 0.96 |
| Recall | 0.99 | 0.89 |
| F1-score | 0.98 | 0.92 |
The model demonstrated high precision in both classes, indicating few false positives. Recall is lower for the infectious class (0.89 versus 0.99), reflecting the inherent difficulty of correctly identifying the minority class, although a recall of 0.89 still means high sensitivity.
Using a class-weighting strategy during training, we improved the model's ability to detect infectious cases by compensating for imbalances in the dataset. This is very important in clinical settings, where false negatives (undetected infections) have serious consequences.
The confusion matrix of the binary classification model is a visual summary of how well the model performed on the test set, classifying lesions as infectious or non-infectious. It is divided into four parts, representing, from left to right and top to bottom:
- True Negative (TN): an image from a non-infectious disease (negative) and classified as non-infectious (negative)
- False Positive (FP): an image from a non-infectious disease (negative) and classified as infectious (positive) (also known as a Type I error or false alarm).
- False Negative (FN): an image from an infectious disease (positive) and classified as non-infectious (negative) (also known as a Type II error).
- True Positive (TP): an image from an infectious disease (positive) and classified as infectious (positive) [26].
For our test set, the counts were:
- We correctly classified 296 images as non-infectious (TN).
- We incorrectly predicted 4 non-infectious images as infectious (FP).
- We missed 11 infectious images, and they were predicted as non-infectious (FN).
- We correctly classified 89 images as infectious (TP).
Figure 9.
Confusion matrix results.
The model performs very well, especially at detecting non-infectious cases, as we get very few false positives.
Although we have already calculated some of them directly in the code (Table 8), it is worth mentioning that these values are used to calculate:
Accuracy is the proportion of overall correct predictions. It answers the question “Out of all the predictions we made, how many were true?” [
27].
Precision is the proportion of true positives among all the positive predictions made by the model. It answers the question “Out of all the positive predictions we made, how many were true?” [27].
Recall focuses on how good the model is at finding all the positives. Recall is also called true positive rate and answers the question “Out of all the data points that should be predicted as true, how many did we correctly predict as true?” [27].
Specificity focuses on how good the model is at finding all the negatives. Specificity is also called true negative rate and answers the question: “Out of all the data points that should be predicted as false, how many did we correctly predict as false?”
The F1 score is the harmonic mean of precision and recall, which makes it useful when there is class imbalance. Because there is a trade-off between precision and recall, F1 can be used to measure how effectively a model balances the two.
One important feature of the F1 score is that the result is zero if either component (precision or recall) falls to zero. It thereby penalizes extremely low values of either component [85].
The negative predictive value (NPV) focuses on how much you can trust a negative prediction. It answers the question: “Out of all the data points that were predicted as false, how many were actually false?” It reflects the proportion of true negatives among all negative predictions that the model made.
The false positive rate (FPR) focuses on how often the model incorrectly labels negatives as positives. It answers the question: “Out of all the data points that should be predicted as false, how many did we incorrectly predict as true?” It represents the proportion of false positives among all actual negatives and is the complement of specificity.
The false negative rate (FNR) focuses on how often the model misses positive cases. It answers the question: “Out of all the data points that should be predicted as true, how many did we incorrectly predict as false?” It represents the proportion of false negatives among all actual positives and is the complement of recall.
Balanced Accuracy focuses on giving a fair evaluation of model performance when we have imbalanced datasets. It answers the question: “How well does the model correctly identify both positive and negative classes?” It is calculated as the average of recall and specificity, making sure that both classes are equally weighted in the final performance metric.
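The definitions above can be sketched as a short function that takes the four confusion-matrix counts and returns each metric. This is a minimal illustration using only the counts reported in this section (TN = 296, FP = 4, FN = 11, TP = 89); the function name `binary_metrics` is hypothetical, and the sketch assumes that no denominator is zero.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    the model predicts each class at least once).
    """
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),       # complement of specificity
        "fnr": fn / (fn + tp),       # complement of recall
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Counts taken from the confusion matrix described above.
metrics = binary_metrics(tn=296, fp=4, fn=11, tp=89)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 385/400 = 0.9625, recall is 89/100 = 0.89, and specificity is 296/300, or about 0.987. The gap between recall and specificity is what balanced accuracy is designed to expose.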
3.10. PR Curve
The precision-recall (PR) curve graphically represents the trade-off between precision and recall at different decision thresholds [28].
The area under the precision-recall curve (PR AUC) is a summary of the model's ability to maintain high precision and recall across all thresholds. A PR AUC close to 1.0 means the model is successfully identifying most positive cases with few false positives. A lower PR AUC means the model is failing to identify the minority (positive) class.
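As an illustration of how such a summary can be computed, the sketch below implements average precision as the mean of the precision values obtained each time a positive sample is reached when going down the ranked list of scores. The labels and scores are hypothetical toy values, not data from this study, and the function assumes there are no tied scores.

```python
def average_precision(labels, scores):
    """Step-wise area under the PR curve (assumes no tied scores)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data: 1 = infectious, 0 = non-infectious.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10]
print(round(average_precision(labels, scores), 4))

# A ranking that places every positive above every negative yields 1.0.
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```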
The ROC (Receiver Operating Characteristic) curve is a graph that plots recall against the false positive rate across a range of thresholds. The area under the ROC curve (ROC AUC) represents the probability that the model will rank a randomly selected positive instance higher than a randomly selected negative instance.
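The ranking interpretation of the area under the ROC curve can be checked directly by comparing every positive score with every negative score. The sketch below does this on hypothetical toy values, counting ties as half a win; the function name `roc_auc_pairwise` is an illustrative choice.

```python
def roc_auc_pairwise(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = infectious, 0 = non-infectious.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10]
print(round(roc_auc_pairwise(labels, scores), 4))
```

The pairwise loop is quadratic in the number of samples, so it is suitable only for small examples like this one.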
Although the ROC curve is widely used, it can offer an overly optimistic view when there is class imbalance, as in our case [29]. This is because the false positive rate is computed over the large number of negative samples, which can mask poor performance on the minority class. For this reason we chose to report the PR AUC metric. The PR curve is more informative under imbalance, as it focuses specifically on the precision and recall of the minority class. This matters when distinguishing between infectious and non-infectious skin lesions, where the minority class is precisely the clinically relevant one.
For this analysis we obtained two graphs. The first shows the trade-off between precision and recall as a function of the decision threshold (Figure 10).
Figure 10.
PR trade-off as a function of decision threshold.
This graph shows how precision and recall change when adjusting the threshold used to determine whether a lesion is infectious. At lower thresholds (for example, <0.3), the model detects almost all infectious lesions (high recall) but also misclassifies some non-infectious lesions as infectious (lower precision). As we increase the threshold, precision improves (meaning positive predictions are more reliable), but recall decreases, so some infectious lesions may be missed.
If the priority is to avoid missing infectious cases, it is beneficial to set the threshold slightly below the midpoint (for example, 0.4) to increase recall, at the cost of some precision.
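The effect of moving the threshold can be sketched with a small sweep. The labels and probabilities below are hypothetical toy values chosen for illustration, not outputs of the model in this study, and the helper `precision_recall_at` is an illustrative name. By convention in this sketch, precision is reported as 1.0 when no sample is predicted positive.

```python
def precision_recall_at(labels, probs, threshold):
    """Precision and recall when predicting positive for probs >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    # Convention (an assumption of this sketch): no positive predictions
    # means no false alarms, so precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = infectious, 0 = non-infectious.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.92, 0.81, 0.45, 0.28, 0.55, 0.33, 0.22, 0.15, 0.08, 0.05]

for t in (0.2, 0.4, 0.5, 0.6):
    precision, recall = precision_recall_at(labels, probs, t)
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, the lowest threshold recovers every infectious sample but admits the most false positives, while the highest threshold removes the false positives and misses half of the infectious samples.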
In Figure 11, we see the PR curve obtained for the final model, which reaches an average precision (AP), that is, an area under the PR curve, of 0.97. This indicates that the model distinguishes very well between infectious and non-infectious skin lesions. The model maintains high precision even at high recall levels, which is very important in medical applications where false negatives can have serious consequences: we need to reliably detect lesions of infectious origin, even in images that are not of the best quality. Because it achieves high recall without sacrificing much precision, the model can detect most infectious cases while avoiding many false positives, both of which are needed for the early detection and triage of NTDs.
3.11. Predictions on Unseen Images
To evaluate the model's classification ability on data it has not previously seen, such as a patient image captured in the field, we tested it with images from another dataset that were not included in the training, validation, or test sets. The first is an image of a skin lesion that we know to be infectious. We load the image using the PIL library and display it, then preprocess it by resizing it to the input size required by the model and converting it into a normalized tensor. The trained model, saved as best_model_resnet.pth, is loaded and set to evaluation mode.
We perform a forward pass without calculating gradients (torch.no_grad()) and pass the output through a sigmoid function to obtain a probability between 0 and 1. This probability is compared to a threshold (0.5 by default, although, as we saw in Figure 10, it could be lowered slightly) to determine the final class. We then display the result: the predicted class (infectious or non-infectious) and the associated confidence level (Figure 12).
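The final decision step described above (sigmoid followed by a threshold comparison) can be sketched without any deep learning dependency. The function names are hypothetical, the input is assumed to be the single raw output (logit) of the model, and reporting the confidence of a non-infectious prediction as one minus the infectious probability is an assumption of this sketch rather than a detail confirmed by the text.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify_from_logit(logit: float, threshold: float = 0.5):
    """Map a raw model output to a (class label, confidence) pair."""
    prob_infectious = sigmoid(logit)
    if prob_infectious >= threshold:
        return "infectious", prob_infectious
    # Assumption: confidence in the negative class is 1 - p(infectious).
    return "non-infectious", 1.0 - prob_infectious


# Hypothetical logits, for illustration only.
for logit in (2.0, -0.2, -3.0):
    label, confidence = classify_from_logit(logit)
    print(f"logit={logit:+.1f} -> {label} ({confidence:.2f})")

# Lowering the threshold flips a borderline case to "infectious".
print(classify_from_logit(-0.2, threshold=0.4)[0])
```

Keeping the threshold as a parameter makes it straightforward to trade precision for recall in the field without retraining the model.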
Figure 12.
Image predicted as infectious.
We repeat the process, this time giving the model a previously unseen image that we know to be non-infectious.
Figure 13.
Image predicted as non-infectious.
As we can see, both images were classified correctly, which suggests that the model could be used as a triage aid in places with a high prevalence of NTDs.
3.12. Raspberry Pi Results
We built the image capture functionality step by step, starting from an initial script and improving it. The first images were taken with the Thesis.py script (Figure 14). We then took images with the app.py and app1.py scripts, both with and without a macro lens for comparison.
Figure 14.
First images taken by the prototype.
Figure 15.
Image taken using the macro lens accessory, named 1_arm_20250514_14224. The name is standardized and represents (the patient ID)_(lesion area)_(date)_(time with minutes and seconds). The name of every image follows this structure for traceability purposes.
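A naming convention of this kind can be sketched as a small helper. This is an illustration under stated assumptions, not the code of the prototype scripts: the function name `build_image_name` is hypothetical, and the date and time fields are assumed to be zero-padded (`%Y%m%d` and `%H%M%S`), which may differ from the formatting the prototype actually uses.

```python
from datetime import datetime


def build_image_name(patient_id, lesion_area: str, captured_at: datetime) -> str:
    """Return '<patient ID>_<lesion area>_<date>_<time>' for one capture.

    Assumes zero-padded date (YYYYMMDD) and time (HHMMSS) fields.
    """
    area = lesion_area.strip().lower()
    return f"{patient_id}_{area}_{captured_at:%Y%m%d}_{captured_at:%H%M%S}"


# Hypothetical capture time, for illustration only.
name = build_image_name(1, "arm", datetime(2025, 5, 14, 14, 22, 4))
print(name)  # 1_arm_20250514_142204
```

Zero-padding keeps every name the same length within a patient and lesion area, so names sort chronologically and can be split back into their four fields on the underscore.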
Several limitations of the present study should be acknowledged. The dataset was derived primarily from publicly available DermNet images and may not fully represent the variability encountered in real-world clinical environments, particularly regarding skin tone diversity, acquisition conditions, and geographic variability. Furthermore, external validation on independent datasets was not performed and will be necessary before clinical implementation.