3.1. Data Preprocessing
3.1.1. Traps and Trap-Types
Traps listed in the “Trap” column of the database—T240, T240B, and T143—have missing data. For T240 and T240B, we have identified the address as 24 Lincoln Park with latitude 41.9187 and longitude -87.6715. The address for T143 is Norwood Park, with approximate GPS coordinates of 41.995 (latitude) and -87.799 (longitude). We have imputed these values for these traps only.
Some traps in the Chicago database are “satellite traps”: traps set up near an established trap (usually within 6 blocks) to enhance surveillance efforts. Satellite traps are suffixed with letters. For example, T220A is a satellite trap to T220 [32].
This dataset is organized so that when the number of mosquitoes found in the catch bag/bucket exceeds fifty, the catch is split into additional records (additional rows in the dataset), each capped at fifty mosquitoes. Therefore, the maximum number of mosquitoes per batch is fifty, with only 2 exceptions, in 2014 and 2022, with 77 and 61 mosquitoes respectively (probably outliers).
Regarding trap types, the OVI trap appears in only a single valid record, in 2007, and therefore has no influence on the statistics. There are other entries as well, but they lack the crucial infection status and are therefore dropped. There are 195 trap IDs in the dataset (see Table 1): 169 are GRAVID traps, 28 CDC, 12 SENTINEL and 1 OVI. These counts add up to 210 rather than 195 because some IDs appear under more than one TRAP_TYPE category (e.g., Trap 009), and summing per type counts each such ID once per category. This is either a data entry mistake, or trap IDs are reserved for a location while the type of trap deployed there can change during the monitoring period.
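The arithmetic behind the 195 versus 210 discrepancy can be checked with a few lines of pandas. The following is a minimal sketch on invented toy rows; only the column names TRAP and TRAP_TYPE follow the dataset, and the variable names are ours.

```python
import pandas as pd

# Toy records (invented for illustration); only the column names follow the dataset.
df = pd.DataFrame({
    "TRAP": ["T009", "T009", "T220", "T220A", "T143"],
    "TRAP_TYPE": ["CDC", "GRAVID", "GRAVID", "GRAVID", "SENTINEL"],
})

# Number of distinct trap IDs overall.
n_ids = df["TRAP"].nunique()

# Number of distinct trap IDs within each trap type.
ids_per_type = df.groupby("TRAP_TYPE")["TRAP"].nunique()

# IDs listed under more than one trap type are counted once per type,
# which inflates the sum of the per-type counts.
types_per_id = df.groupby("TRAP")["TRAP_TYPE"].nunique()
multi_type_ids = types_per_id[types_per_id > 1].index.tolist()

print(n_ids, int(ids_per_type.sum()), multi_type_ids)
```

On the toy rows the per-type counts sum to one more than the number of distinct IDs, and the difference is explained by the single ID that is listed under two trap types.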
3.1.2. Missing Values
After imputing the locations of traps T240, T240B, and T143, we drop all records that have no data entry (NaN value) in the RESULT column (infection status), as this is the most crucial variable, it is rare, and we refrain from imputing it. We end up with a database of 36233 rows, and we keep 14 relevant columns (variables): ’SEASON YEAR’, ’WEEK’, ’TEST ID’, ’BLOCK’, ’TRAP’, ’TRAP_TYPE’, ’TEST DATE’, ’NUMBER OF MOSQUITOES’, ’RESULT’, ’SPECIES’, ’COMMUNITY AREA NUMBER’, ’COMMUNITY AREA NAME’, ’LATITUDE’, ’LONGITUDE’.
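The two preprocessing steps (coordinate imputation for three traps, then dropping rows without an infection status) could be sketched as follows. The coordinates are those given in Section 3.1.1; the function name `preprocess`, the toy rows, and the choice to fill only missing coordinates rather than overwrite existing ones are our assumptions.

```python
import pandas as pd

# Coordinates reported in Section 3.1.1 for the traps with missing data.
TRAP_COORDS = {
    "T240": (41.9187, -87.6715),
    "T240B": (41.9187, -87.6715),
    "T143": (41.995, -87.799),
}

def preprocess(df):
    """Fill missing coordinates for the three listed traps, then drop rows with no RESULT."""
    out = df.copy()
    for trap, (lat, lon) in TRAP_COORDS.items():
        mask = out["TRAP"] == trap
        # Assumption: only missing values are filled; existing coordinates are kept.
        out.loc[mask, "LATITUDE"] = out.loc[mask, "LATITUDE"].fillna(lat)
        out.loc[mask, "LONGITUDE"] = out.loc[mask, "LONGITUDE"].fillna(lon)
    # The infection status is never imputed: rows without it are removed.
    return out.dropna(subset=["RESULT"]).reset_index(drop=True)

# Toy rows (invented for illustration).
raw = pd.DataFrame({
    "TRAP": ["T240", "T143", "T001", "T002"],
    "LATITUDE": [float("nan"), float("nan"), 41.90, float("nan")],
    "LONGITUDE": [float("nan"), float("nan"), -87.70, float("nan")],
    "RESULT": ["negative", "positive", None, "negative"],
})
clean = preprocess(raw)
print(clean)
```

Trap T002 keeps its missing coordinates in this sketch, since imputation is restricted to the three listed traps.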
3.2. Data Analytics
In this section we pose questions of practical value: questions that would affect policy decisions, improve public awareness, and help evaluate intervention strategies. The code is provided in the appendix and is applicable to any other infection and mosquito database with a corresponding structure.
3.2.1. The Distribution of WNV Positive Cases by Year
In Figure 1 we visualize the distribution of West Nile Virus presence by year to see whether there is a trend. Each bar represents the number of occurrences where the virus was detected in that specific year, so the histogram identifies which years had a higher or lower incidence of West Nile Virus. We see that the number of incidents in Chicago, based on this particular database, has been relatively stable over the years. This picture can be used to assess the impact of an intervention policy. While we are not aware of the specific intervention policies currently in place, Figure 1 does not show a steady decline in the phenomenon.
The mosquito species distribution of the whole database is gathered in Table 2, which lists all the species included in the database. The Culex pipiens/restuans categorization is the most prevalent, followed by Culex restuans and Culex pipiens. Culex pipiens and Culex restuans are distinct species, but because of their similar appearance and overlapping habitats, many mosquito surveillance programs use the combined term Culex pipiens/restuans when distinguishing between the two is difficult, especially in mixed pools of collected mosquitoes and without genetic testing. This vagueness in class attribution imposes an additional difficulty in the classification experiments. What this data definitely suggests is that the majority of the captured mosquitoes belong to the Culex genus, known for its role in transmitting diseases like West Nile Virus.
3.2.2. Mosquito Species Composition in Catches of Mosquito Traps over Time
Figure 2 provides an overview of mosquito trends in Chicago, focusing on variations in mosquito species and the prevalence of WNV over time. The analysis reveals the evolving population of different species, which may indicate changes in environmental factors, mosquito control measures, or virus prevalence.
Culex pipiens (orange line in Figure 2) shows an initial peak in 2007, reaching the highest count among all species at that time. The population declines rapidly in 2008 and remains consistently low from 2009 onwards, with only minor fluctuations in 2013 and 2014. Culex pipiens/restuans (green line in Figure 2) is the dominant category throughout most of the time period (note, though, that this is not a species but a collective characterization). Peaks can be observed in 2007, 2012, 2015, 2021, and 2023. The population shows a cyclical pattern, with significant rises and falls. Notably, there is a sharp decline in 2024 in attributions to the mixed class Culex pipiens/restuans, indicating either potential classification errors or some advancement in discerning these species.
Culex restuans (red line) initially has a low population but begins a steady increase from around 2013 to 2015. Counts fluctuate from year to year but generally stay moderate from 2015 onwards. Notable peaks occur in 2015 and 2023, with a general trend of maintaining a steady presence.
Culex erraticus (blue line) demonstrates very low numbers throughout the entire period. This species shows no significant spikes, suggesting either low prevalence or limited environmental suitability in the study area.
Culex salinarius, Culex tarsalis, Culex territans, Unspecified Culex (purple, yellow, grey lines) consistently have nearly zero to low populations throughout the time period.
This suggests that these species are either not as prevalent in the area or may be more challenging to trap using the specific trap types.
3.2.3. Most Probable Date to Detect WNV Infection in Batches of Traps’ Catches
Figure 3 is, in our view, the most significant figure in this work as it highlights the peak and distribution of WNV occurrences over time, providing insight into the seasonal pattern of outbreaks. It illustrates WNV-positive batches in relation to the weeks and months in which the virus was detected, across all years and species. This visualization is especially valuable for identifying potential seasonal trends, such as spikes in virus presence during specific months. It allows us to pinpoint periods of heightened activity, which can inform vector control strategies and public health responses. Notably, peak activity is observed between the last week of August and the first week of September.
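The week of peak WNV activity can be read directly from the WEEK and RESULT columns. The sketch below uses invented toy rows and assumes that positive batches are encoded as the string "positive"; the actual encoding in the database may differ.

```python
import pandas as pd

# Toy rows (invented); WEEK and RESULT are column names from the dataset.
df = pd.DataFrame({
    "WEEK": [31, 35, 35, 36, 34, 27],
    "RESULT": ["positive", "positive", "positive", "positive", "positive", "negative"],
})

# Assumed label encoding for WNV-positive batches.
positive = df["RESULT"].eq("positive")

# Number of positive batches per week, pooled over all years and species.
positives_by_week = df.loc[positive, "WEEK"].value_counts().sort_index()

# Week with the largest number of positive batches.
peak_week = int(positives_by_week.idxmax())
print(positives_by_week.to_dict(), peak_week)
```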
3.2.4. Effectiveness of Trap Types in Catching Mosquitoes and WNV-Infected Mosquito Batches, Species Composition
To compare the effectiveness of each trap type fairly, we need to account for the unequal distribution of traps among trap types (see again
Table 1). Since each TRAP_TYPE has a different number of traps, directly comparing the total mosquito counts would be biased. Normalizing by the number of traps within each TRAP_TYPE allows us to account for this imbalance, providing a fairer comparison of each trap type’s effectiveness.
Figure 4 visualizes the effectiveness of different trap types in catching mosquitoes, broken down by species. The data is grouped by the TRAP_TYPE and SPECIES variables of the database, aggregating the normalized number of mosquitoes caught (variable NUMBER OF MOSQUITOES) for each species by each trap type. Since each bar in the bar-plot of
Figure 4 represents a specific trap type and species combination, we can see which traps are more successful at capturing certain species. This insight can guide the deployment of different trap types to target specific mosquito populations more effectively, focusing on species that are major vectors of diseases like WNV. It is also helpful in optimizing trapping strategies by selecting the most effective trap types based on the target species in an area. The outcome of this analysis is that the GRAVID trap type is found to be the most effective trap in Culex catches followed by CDC. Note the difference in species caught by each trap type. The SENTINEL trap does not perform very well with Culex. We get almost the same picture when the y-axis holds the WNV positive cases normalized by the number of traps in each trap type (not presented here to avoid redundancy but can be found in the provided code).
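The normalization described above divides the total catch of each trap type and species by the number of distinct traps of that type. A minimal sketch on invented toy rows follows; the derived column name PER_TRAP is ours.

```python
import pandas as pd

# Toy rows (invented); column names follow the dataset.
df = pd.DataFrame({
    "TRAP": ["T1", "T1", "T2", "T3", "T3"],
    "TRAP_TYPE": ["GRAVID", "GRAVID", "GRAVID", "CDC", "CDC"],
    "SPECIES": ["CULEX PIPIENS", "CULEX RESTUANS", "CULEX PIPIENS",
                "CULEX PIPIENS", "CULEX RESTUANS"],
    "NUMBER OF MOSQUITOES": [50, 10, 30, 20, 4],
})

# Number of distinct traps deployed per trap type.
traps_per_type = df.groupby("TRAP_TYPE")["TRAP"].nunique()

# Total catch per (trap type, species) combination.
totals = df.groupby(["TRAP_TYPE", "SPECIES"], as_index=False)["NUMBER OF MOSQUITOES"].sum()

# Normalize each total by the number of traps of the corresponding type.
totals["PER_TRAP"] = totals["NUMBER OF MOSQUITOES"] / totals["TRAP_TYPE"].map(traps_per_type)
print(totals)
```

The same normalization applies unchanged if the summed quantity is the number of WNV-positive batches instead of the mosquito count.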
3.2.5. Best Traps for Mosquito Catches and WNV Infected Mosquito Batches
There are 195 unique mosquito Trap Numbers that the public health workers in Chicago set up and scattered across the city. In
Figure 5, we identify the top-performing mosquito traps based on two key metrics: West Nile Virus presence (variable RESULT in the left y-axis) and the total number of mosquitoes caught (in the right y-axis). If a trap has high mosquito counts but low WNV detections, it may indicate that the mosquitoes caught are not the primary carriers of the virus, suggesting a lower risk. Conversely, a high number of WNV detections, even with a moderate number of mosquitoes, points to a high concentration of infected mosquitoes, indicating that the location has a heightened risk of virus transmission.
Once we have the best-performing trap names, we list their addresses in Table 3.
Trap T009 is located at 91XX W HIGGINS RD and appears with two different trap types, CDC and GRAVID. This is either a data entry mistake, or the same physical trap location was reused over time with the trap type changed between trapping periods.
3.2.6. Locations in the City as Hotspots for WNV Positive Batches
We identify the geographic locations of traps associated with a high presence of WNV-positive cases. However, these locations do not necessarily correspond to true hotspots in the field, as the trap network only samples the mosquito population and is neither densely populated nor evenly distributed. The first approach is to find the community areas (variable COMMUNITY AREA NAME) associated with virus-positive cases and sort them by value. Then we derive heatmaps of the trap locations with the highest numbers of WNV-positive cases. The histogram in Figure 6 visualizes the distribution of WNV detections across different areas of the city.
Figure 6 helps identify high-risk locations and can guide targeted vector control, such as increased spraying, public awareness campaigns, or other preventive measures. Understanding which traps consistently detect the virus helps allocate resources efficiently: health authorities can use this information to optimize monitoring locations, ensuring that the most significant risk areas are continuously observed to prevent outbreaks. Figure 6 also shows which geographic areas have more instances of WNV presence, indicating potential hotspot areas such as O’Hare airport.
Figures 7a and 7b depict two types of geospatial visualization that can be used to analyze the spatial distribution of WNV presence in the region covered by the dataset. The heatmap displays the intensity of WNV occurrences geographically. Each point on the map is a trap location, given by its latitude and longitude, with the color intensity indicating the presence of the virus.
The heatmap helps in identifying hotspot regions where the density of infected mosquitoes is highest. Areas with darker colors indicate higher virus activity, suggesting areas of greater risk. Health officials can use this information to focus vector control efforts like pesticide spraying or mosquito breeding habitat elimination in the most affected regions.
The convex hull can be used to define the boundary of the region that needs to be monitored or controlled for WNV. It gives an idea of the geographical limits of areas where traps have detected the virus. By looking at how the traps are distributed within the convex hull, authorities can assess the spatial spread and identify areas where traps may be missing (i.e., identifying gaps in monitoring). Regions within the hull but with fewer traps could need additional monitoring.
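For readers who want to reproduce the hull boundary, the following is a minimal sketch using Andrew's monotone chain algorithm on (longitude, latitude) pairs. It is a generic textbook method, not necessarily the routine used for the figures, and the coordinates are invented.

```python
def convex_hull(points):
    """Andrew's monotone chain: returns the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product of vectors OA and OB.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# Invented (longitude, latitude) pairs of traps that detected the virus.
traps = [(-87.9, 41.7), (-87.5, 41.7), (-87.5, 42.0), (-87.9, 42.0), (-87.7, 41.85)]
hull = convex_hull(traps)
print(hull)
```

Traps that fall strictly inside the hull, such as the fifth point here, do not appear among the hull vertices; sparsely covered regions inside the boundary are the candidates for additional monitoring.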
Both figures are useful for effective resource allocation, monitoring coverage, public health interventions, and communicating risk to stakeholders and the public. They can be used in public health campaigns to inform communities of areas with a high risk of WNV transmission and encourage protective behaviors, such as avoiding outdoor activities at peak mosquito times or using insect repellent.
3.2.7. Identification of Outbreaks
To identify outbreaks of West Nile Virus (WNV), we can look for clusters of positive cases within a certain time period and/or geographic area. But how do we define an outbreak? In [34], the authors argue that any temporal anomaly from the expected number of cases is classified as an outbreak. This definition raises two problems: a) there are many ways to define an anomaly in the data, and b) the data on which an anomaly is to be detected are imperfectly sampled by health systems. In this work, we derive the outbreaks of the dataset for 2009 to 2024 in four different ways and gather them in Table 4.
Identifying outbreaks can be approached using various definitions, each providing different insights into the data and each resulting in a different catalogue of events. In the first approach, which is common in anomaly detection in general, an outbreak occurs when the weekly count of WNV-positive batches exceeds the historical average by a certain number of standard deviations (one std for two weeks for all traps pooled together in Table 4). This method accounts for natural fluctuations in the data and identifies unusually high counts.
The second approach tracks week-over-week growth. An outbreak is detected when there’s a significant increase (e.g., doubling of counts) in WNV positive batches compared to the previous week. Rapid increases may indicate the onset of an outbreak.
The third approach involves using cumulative counts over a specific period. An outbreak is defined when the cumulative sum of WNV-positive cases within a given timeframe (e.g., a month) exceeds a predetermined threshold, capturing sustained periods of heightened activity.
Finally, the moving average threshold approach identifies an outbreak when the seven-day moving average exceeds a predefined threshold. It smooths out short-term fluctuations and highlights longer-term trends.
Different definitions may capture different aspects of the data. The statistical threshold method is useful for identifying unusually high activity compared to historical averages, while the rate of increase method is sensitive to rapid changes, even if the absolute numbers are low. The moving average allows time-varying thresholds instead of a mean and standard deviation derived from all data.
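The four outbreak definitions can be sketched on a single series of pooled weekly counts of WNV-positive batches. All function names, window lengths, and thresholds below are illustrative assumptions, the toy counts are invented, and the sketch flags individual weeks without modeling any persistence requirement over consecutive weeks.

```python
from statistics import mean, pstdev

def zscore_outbreaks(counts, n_std=1.0):
    """Weeks whose count exceeds the overall mean by more than n_std standard deviations."""
    mu, sigma = mean(counts), pstdev(counts)
    return [i for i, c in enumerate(counts) if c > mu + n_std * sigma]

def growth_outbreaks(counts, factor=2.0):
    """Weeks whose count is at least `factor` times the (non-zero) previous week."""
    return [i for i in range(1, len(counts))
            if counts[i - 1] > 0 and counts[i] >= factor * counts[i - 1]]

def cumulative_outbreaks(counts, window=4, threshold=20):
    """Weeks where the sum over the trailing `window` weeks exceeds `threshold`."""
    return [i for i in range(window - 1, len(counts))
            if sum(counts[i - window + 1:i + 1]) > threshold]

def moving_average_outbreaks(counts, window=3, threshold=8.0):
    """Weeks where the trailing moving average exceeds `threshold`."""
    return [i for i in range(window - 1, len(counts))
            if mean(counts[i - window + 1:i + 1]) > threshold]

# Invented weekly counts of WNV-positive batches over one season.
weekly = [1, 2, 1, 4, 9, 12, 10, 3, 1]
print(zscore_outbreaks(weekly))          # weeks flagged by the statistical threshold
print(growth_outbreaks(weekly))          # weeks flagged by week-over-week growth
print(cumulative_outbreaks(weekly))      # weeks flagged by the cumulative count
print(moving_average_outbreaks(weekly))  # weeks flagged by the moving average
```

Even on this short toy series the four rules flag different sets of weeks: the growth rule fires early on small absolute counts, while the cumulative rule keeps firing after the weekly counts have already dropped.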
The results in Table 4 and Figure 8 indicate that the different methods do not consistently coincide. The various approaches yield different outbreak periods, reflecting significant differences in their detection criteria. Therefore, care is needed when referring to an ‘outbreak’.
By identifying periods of outbreaks with different criteria, public health authorities can pick the one that fits their needs, and plan and execute targeted interventions such as mosquito control, spraying campaigns, and public awareness initiatives. Knowing the precise periods when outbreaks tend to occur helps in taking proactive measures rather than reactive responses, thereby reducing the spread of WNV. By analyzing the timing of outbreaks over multiple years, authorities can understand whether they follow a predictable seasonal pattern or are influenced by certain environmental or climatic conditions. This information can be used to forecast future outbreaks, allowing for preparedness and mitigation planning, and to evaluate the effectiveness of previous public health interventions and mosquito control efforts. If the frequency or intensity of outbreaks decreases over time, it may indicate that current strategies are effective.
3.2.8. The Distribution of WNV Positive Cases over Mosquito Batch Size
In the Chicago database, 9.15% of the mosquito batches are classified as infected with WNV, meaning some mosquitoes in those batches tested positive for the virus. In
Figure 9, we examine the batch sizes when they were found to be WNV-positive. The histogram displays the distribution of the number of mosquitoes in each batch where the virus was detected. This visualization helps reveal the relationship between batch size and WNV presence, offering insights into the data distribution. As expected, larger batches of fifty mosquitoes were more likely to test positive, but positive cases were also observed in smaller batches.
3.3. Prediction of WNV Positive Batches
A prediction model in the context of this work infers which batches are going to be found infected based on the rest of the variables. Note that such an approach relies on reliable historical data from a network of traps at the same locations. This information can guide public health officials on when to implement interventions such as pesticide spraying, public awareness campaigns including alerting, and vector control measures. By understanding the probability of WNV occurrence over a time span, resources can be targeted instead of being distributed evenly across time and locations: for example, more frequent mosquito trapping and testing during peak times and reduced effort during periods of low risk. For instance, if the peak occurrence falls around mid-August, health authorities can plan proactive measures just before this peak, focusing on geographic locations (hotspots), to minimize mosquito populations and, consequently, the transmission of WNV.
In this work we are interested only in the accuracy of a single model implementing a core idea, and we do not examine approaches like stacking or voting of a group of classifiers. We also focus only on the data of the Chicago database, and we do not integrate environmental factors such as spraying records, temperature, precipitation, and humidity, which are not part of this database but are known to greatly affect mosquito activity each year.
Our contribution is twofold: a) we introduce a new approach based on a bivariate Normal fit with trap significance assessment (see Appendix for code and mathematical derivation); b) we introduce local regression, and we upload the dataset and the associated code so that different approaches can be tested for different splits of training and testing years of the Chicago database.
We build a base predictor by fitting a bivariate Normal jointly on two variables of the training dataset, separately for the WNV-positive and the WNV-negative class: ‘Dates’, re-indexed from the 1st of August, and the logarithm of ‘Number of mosquitoes’. This can be seen in Figure 10 and Figure 11 (see also Appendix for code and mathematical derivation). In both figures, the pdfs are partly disjoint. The bivariate fit is obtained on the training set and is assumed to characterize the test set as well. Using the dates and the corresponding number of mosquitoes of the test set, we can predict the probability that a batch of the test set is infected. Note also that the peak of positive cases comes about two weeks after the peak in mosquito catches.
Using basic Bayesian statistics, we can derive the probability of an infected batch given the observable test date and ’NumMosquitos’ variables, assuming that the same bivariate fit holds for the test set. The suggested approach alone gives an AUC of almost 80% without using any other variable or processing of the database. In our view this is a result of interest, as it returns an accuracy very close to that of more complex classification methods while remaining embeddable in microprocessors with a few lines of code (see Appendix). In Figure 11 we see that the bivariate Gaussian fits on Days and log(Number of Mosquitoes) for the WNV-positive and negative classes of the train set are partly disjoint. This allows us to extract some information on the probability of a batch being infected, as the distributions of virus-negative and virus-positive catches are better separated. Note also in Figure 12 that the peak of WNV-positive cases comes a bit after the peak in mosquito catches and is more concentrated in this 2-D feature space. In Figure 11 we can easily see gross decision boundaries: a) before August, WNV-positive batches are unlikely, especially in batches with a small number of catches; b) between the last week of August and the first week of September, in batches with a high number of mosquitoes, the probability of an infected batch is at its peak.
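The base predictor can be illustrated with a short, dependency-free sketch: one bivariate Normal is fitted per class on (days since the 1st of August, log of the batch size), and Bayes' rule combines the two densities with a class prior. This is our own minimal reading of the idea, not the code of the Appendix; the training points are invented, and the prior of 0.0915 is the share of positive batches reported for the whole database in Section 3.2.8, used here only as a stand-in for the training-set prior.

```python
import math

def fit_bivariate_normal(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def bivariate_pdf(point, params):
    """Density of the fitted bivariate Normal at `point`."""
    (mx, my), (sxx, sxy, syy) = params
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def posterior_positive(point, pos_params, neg_params, prior_pos):
    """Bayes' rule: P(positive | day, log count)."""
    lp = bivariate_pdf(point, pos_params) * prior_pos
    ln = bivariate_pdf(point, neg_params) * (1.0 - prior_pos)
    return lp / (lp + ln)

# Invented training points: (days since August 1st, log(number of mosquitoes)).
pos = [(20, math.log(50)), (25, math.log(40)), (30, math.log(50)),
       (28, math.log(30)), (22, math.log(45))]
neg = [(-30, math.log(5)), (-10, math.log(10)), (0, math.log(3)),
       (40, math.log(8)), (-20, math.log(20)), (10, math.log(2))]

pos_fit, neg_fit = fit_bivariate_normal(pos), fit_bivariate_normal(neg)
p_late = posterior_positive((26, math.log(50)), pos_fit, neg_fit, prior_pos=0.0915)
p_early = posterior_positive((-25, math.log(3)), pos_fit, neg_fit, prior_pos=0.0915)
print(round(p_late, 3), round(p_early, 3))
```

On the toy data, a full batch in late August receives a high posterior probability and a small batch in early July a negligible one, which mirrors the gross decision boundaries described above.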
3.3.1. The Trap Biases
We suggest an approach that combines statistical hypothesis testing with predictive modeling to enhance the detection of WNV presence based on mosquito trap data. By statistically validating traps and adjusting predictions accordingly, the model accounts for spatial variations in WNV activity, leading to more accurate and reliable predictions. The model incorporates several statistical techniques, including hypothesis testing, multivariate normal modeling, and adjustments for multiple comparisons.
We focus on validating whether certain mosquito traps have a significantly higher occurrence of WNV-positive cases than would be expected by chance. This is achieved through the hypergeometric test. Let N represent the total number of trials in the population (training dataset) and K the total number of “successes” (WNV-positive cases) in the population. The hypergeometric test assesses whether the number of WNV-positive cases observed in a specific trap is significantly higher than expected under the null hypothesis (i.e., the trap has the same positive rate as the overall population). We tried to adjust the p-values to account for multiple hypothesis testing using the Bonferroni correction, but it was a very conservative approach, and we removed it. A p-value less than 0.05 is considered significant. We then determine if a trap has a statistically significantly higher rate of WNV-positive cases compared to the overall population.
The weight is the ratio of the trap’s positive proportion to the overall positive proportion. This amplifies the influence of traps with higher-than-expected positive rates. If a trap is deemed non-significant, we assign a weight of 1.0, indicating no adjustment. If a trap has a significantly higher positive rate, its weight increases the predicted probability; the predicted probabilities are thus modified according to the trap weights determined from the hypergeometric test. The hypergeometric test is appropriate because each mosquito trap’s data represents a sample without replacement from the overall population, and the total numbers of observations and successes are known and finite. Adjusting the predicted probabilities using the weights derived from this test enhances the model’s ability to account for spatial heterogeneity in WNV prevalence. We have tried other tests, such as t-tests, but all returned inferior results in terms of ROC accuracy.
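A one-sided hypergeometric test and the resulting trap weight can be sketched with exact integer arithmetic from the standard library. N and K follow the notation above; n (number of batches from the trap) and k (number of positive batches from the trap) are symbols we introduce, as are the function names and the toy numbers.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def trap_weight(k, n, K, N, alpha=0.05):
    """Ratio of trap positive rate to overall positive rate if significant, else 1.0."""
    if n > 0 and hypergeom_sf(k, N, K, n) < alpha:
        return (k / n) / (K / N)
    return 1.0

# Invented example: 1000 training batches, 90 of them positive overall.
w_hot = trap_weight(k=12, n=40, K=90, N=1000)   # 30% positives in this trap
w_plain = trap_weight(k=4, n=40, K=90, N=1000)  # 10% positives, not significant
print(round(w_hot, 3), w_plain)
```

How the weight is then applied to the predicted probability (for instance, whether the product is clipped at 1) is not specified here and is left out of the sketch.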
3.3.2. Kernel-Weighted Regression by Applying Distances in Time and Locations
The core idea revolves around a Bayesian classification approach augmented with a form of local regression that adjusts predictions based on nearby mosquito counts. We suggest a kernel-weighted regression approach to estimate the expected mosquito count for each observation, considering neighboring data points in time and space. Kernel-weighted regression is a type of ‘local regression’ in which data points closer to a given input point are given more weight in estimating the probability of that batch being infected. The code estimates the number of mosquitoes per row based on the count of mosquitoes from similar rows. The “similarity” is defined by temporal proximity, spatial proximity, and optionally species or trap type. Since the distances in the Chicago area are small, we did not employ geodesic distance calculation: the spatial distance is simply the Euclidean distance in GPS coordinates, and the temporal distance is the absolute value of the difference in days counted from the First of August. The final estimate is computed as a weighted average, where the weights are determined by a distance function that considers the inverse of both the temporal and the spatial distance. The contribution of each nearby row is adjusted by its count of mosquitoes, and more weight is given to rows that are close in time and geographic location. This approach provides a more robust estimate of mosquito counts, particularly when the original data are sparse or inconsistent across the spatial and temporal dimensions. All these biases and their influence on the AUC score of the test set are gathered in Table 4 (see also Appendix).
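The kernel-weighted estimate can be illustrated as follows. The exact form of the kernel is not given in the text, so the inverse-distance weight below, the scale parameters `time_scale` and `space_scale`, the dictionary keys, and the toy rows are all assumptions made for the sake of a runnable sketch.

```python
import math

def kernel_weighted_count(target, rows, time_scale=7.0, space_scale=0.05):
    """Weighted average of neighbouring counts; closer rows in time and space weigh more.

    `target` and each row are dicts with keys 'day' (days from August 1st),
    'lat', 'lon'; rows also carry 'count'. The scales are hypothetical.
    """
    num = den = 0.0
    for r in rows:
        dt = abs(r["day"] - target["day"])  # temporal distance in days
        ds = math.hypot(r["lat"] - target["lat"], r["lon"] - target["lon"])  # Euclidean, in degrees
        # Assumed kernel: product of inverse-distance terms in time and space.
        w = 1.0 / (1.0 + dt / time_scale) / (1.0 + ds / space_scale)
        num += w * r["count"]
        den += w
    return num / den

# Invented rows: one close neighbour with a full batch, one distant row with a small batch.
rows = [
    {"day": 24, "lat": 41.95, "lon": -87.71, "count": 50},
    {"day": -40, "lat": 41.70, "lon": -87.55, "count": 2},
]
target = {"day": 25, "lat": 41.95, "lon": -87.70}
estimate = kernel_weighted_count(target, rows)
print(round(estimate, 2))
```

The estimate is pulled almost entirely toward the nearby row, since the distant row is down-weighted in both time and space.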
We partitioned the data into a training set containing all years from 2007 to 2022 and held out the years 2023 and 2024 as a test set. This presents a challenging scenario, as we need to predict two consecutive years based on training data that also includes years from the distant past. The results are gathered in
Table 5.
3.3.3. Other Classifiers
The Chicago Database is a tabular one, and this kind of data structure is typically treated with tree-based classifiers.
The following variables have been converted to categorical: ’TRAP_TYPE’, ’Species’, ’Trap’, ’Address’, ’COMMUNITY AREA NAME’. The columns used are [’Block’, ’Species’, ’TRAP_TYPE’, ’Trap’, ’Latitude’, ’Longitude’, ’month’, ’week’, ’NumMosquitos’, ’Address’, ’COMMUNITY AREA NAME’]. The results are gathered in Table 6.
The ROC curve rises above the diagonal (gray dashed line), indicating that the model is better than random chance in distinguishing between positive and negative classes (see
Figure 12).
The area under the curve (AUC) is 0.81, which suggests that the model has a reasonably good ability to discriminate between the two classes. An AUC closer to 1 would indicate a very strong classifier, while an AUC of 0.5 would represent a classifier that performs no better than random guessing. In Figure 13 the operational point is marked with a red dot. Based on its position: True Positive Rate (TPR) ≈ 0.78 and False Positive Rate (FPR) ≈ 0.28. This means that the model correctly identifies 78% of actual positive cases (i.e., it captures 78% of true cases as positives), while 28% of actual negative cases are mistakenly classified as positive (i.e., false alarms).
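The AUC and an operating point can be computed from labels and scores without any library. The sketch below uses the rank interpretation of the AUC (the probability that a random positive scores higher than a random negative); the labels, scores, and the threshold of 0.5 are invented for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rates_at_threshold(labels, scores, threshold):
    """True positive rate and false positive rate when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg

# Invented test labels (1 = WNV positive) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

area = auc(labels, scores)
tpr, fpr = rates_at_threshold(labels, scores, threshold=0.5)
print(round(area, 3), round(tpr, 3), round(fpr, 3))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; for a test set of thousands of batches a sort-based computation would be preferable.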
3.3.4. Additional Parameters
Spraying records and weather conditions can significantly, if indirectly, influence the probability of a mosquito batch testing positive for WNV. Spraying, a common mosquito control measure, can directly reduce mosquito populations, particularly those carrying WNV, thereby lowering the likelihood of WNV-positive batches. Weather factors such as temperature, humidity, and rainfall also play a crucial role; warmer temperatures and increased rainfall create favorable conditions for mosquito breeding, potentially raising WNV transmission risk. Therefore, integrating spraying records and weather data into predictive models could enhance accuracy by accounting for these critical environmental factors.