2.4. Design and Architecture
The
sensortoolkit package is designed to be highly extensible to a wide range of air sensor datasets and pollutants in a procedurally consistent and reproducible manner. This is achieved by organizing workflows associated with
sensortoolkit around an object-oriented approach. The objects and general flow of
sensortoolkit is shown by
Figure 1.
To get started, users can find an initial setup script in the online documentation (
https://sensortoolkit.readthedocs.io/en/latest/template.html#script-template, last accessed: 12/04/24). This script provides a line-by-line example for how to utilize
sensortoolkit to conduct sensor evaluations and is intended to be executed code-block by code-block. It includes various components to set the path, create directories, collect information about the testing organization and testing location, specify the sensor make/model, run the sensor and reference setup routines (
Section 2.4.1), and specify what pollutant report the user wishes to create. During this process, the user supplies information about the testing organization including the name of the test, organization conducting the test (e.g., name, division, type of organization), and contact details (e.g., website, email, phone number). The user can also supply information about the test location including site name, address, latitude, longitude, and air monitoring station’s Air Quality System (AQS) Identification Number (ID), if applicable.
The sensor and reference setup routines (
Section 2.4.1) prompt users to supply sensor data in .csv, .txt, and/or .xlsx format and reference monitor data either as a file or by pulling in reference data using AQS, AirNow, or AirNow-Tech APIs and the user’s existing authentication key. Using a series of prompts to describe the data being supplied, a set of three testing attribute objects (sensortoolkit.AirSensor, sensortoolkit.ReferenceMonitor, and sensortoolkit.Parameter) are created and specified for the air sensor, reference monitor, and pollutant to be analyzed, respectively. The testing attribute objects are then passed to the evaluation objects (sensortoolkit.SensorEvaluation and sensortoolkit.PerformanceReport) and the code calculates performance metrics and/or produces a testing report. A detailed discussion of testing attribute objects and evaluation objects is summarized in
Section 2.4.1 and
Section 2.4.2, respectively. Subsequent mention to either testing attribute objects or evaluation objects will omit the use of “sensortoolkit.[object]” terminology indicating it is a
sensortoolkit package object.
2.4.1. Testing Attribute Objects
Users who have not previously created a setup configuration file for the air sensor they intend to evaluate will start with the AirSensor.sensor_setup() method. A flow chart of this process is shown in
Figure 2. Users are prompted to specify the file type, location, header row, and describe each column within the file. The setup module will use this information to create a setup.json file. The AirSensor testing attribute object is then created. Users follow a similar procedure to start the ReferenceMonitor object reference_setup() module which creates and configures the ReferenceMonitor object. As noted previously, users may query reference monitor data from the AQS API (
https://aqs.epa.gov/aqsweb/documents/data_api.html). This method is preferred because the stored data has undergone rigorous quality control. There can be a delay in data availability as data is typically quality assured and uploaded quarterly. Data can also be pulled from the AirNow API (
https://docs.airnowapi.org/) which provides real-time air quality data from monitors managed by state, Tribal, local, and federal agencies. However, data from AirNow have not been as extensively validated and verified in the same manner as AQS data. Finally, data can also be pulled from AirNow-Tech but must be manually downloaded to the user’s computer from the AirNow-Tech website (
https://airnowtech.org/) prior to use with
sensortoolkit. Unpivoted datasets from AirNow-Tech are preferred by
sensortoolkit.
To reanalyze a sensor dataset at a later date with the same testing attribute objects, users can use the setup.json configuration files saved for the AirSensor and ReferenceMonitor objects eliminating the need to go through all of the setup prompts again. This idea also applies to new sensor evaluations that match sensor types that have been previously evaluated. With these files, users can import sensor and reference data and process files that have been previously converted to the
sensortoolkit data formatting scheme (SDFS) (
Section 2.4.2). The json file is saved in the user’s /data folder within their project directory.
Following setup, the AirSensor and ReferenceMonitor testing attribute objects contain information about the air sensor (AirSensor), reference monitor (ReferenceMonitor), and the pollutant or environmental parameter of interest for the evaluation (Parameter). The AirSensor and ReferenceMonitor objects house time series datasets at the original recorded sampling frequency and at averaging intervals specified by the Parameter object (e.g., PM
2.5 data are averaged to both 1-hour and 24-hour averages, O
3 data are averaged to 1-hour intervals). The averaging intervals are specified in accordance with the sampling and averaging intervals reported by FRMs and FEMs for the pollutant of interest and are not user specified. Both objects also store device attributes (e.g., make and model). Testing attribute objects (
Section 2.4.1) are passed on to the two evaluation objects (
Section 2.4.3 and
Section 2.4.4) to compute various quantities and metrics, create plots, and compile reports as recommended by Sensor Targets Reports [
21,
22,
23,
25].
2.4.2. Data Formatting Scheme
Converting both sensor and reference datasets into a common formatting standard allows for ease of use in accessing and analyzing these datasets. The SDFS is used by the library for storing, cataloging, and displaying collocated datasets for air sensor and reference monitor measurements. SDFS versions of sensor and reference datasets are created automatically following completion of the respective setup modules for the AirSensor and ReferenceMonitor objects.
Table S1 details the parameter labels and descriptions used in the SDFS. The list of parameters contains common pollutants including both particulates and gas phase compounds and many are deemed criteria pollutants by U.S. EPA. The SDFS groups columns in sensor datasets by the parameter, with parameter values in one column and the corresponding units in the adjacent column.
SDFS is intended for use with time series datasets recorded via continuous monitoring at a configuring sampling frequency, whereby a timestamp is logged for each consecutive measurement. Each row entry in SDFS datasets is indexed by a unique timestamp in ascending format (i.e., the head of datasets contain the oldest entries while the tail contains the most recently recorded entries). Timestamps are logged under the “DateTime” column, and entries are formatted using the ISO 8601 extended format. Coordinated Universal Time (UTC) offsets are expressed in the form ±[hh]:[mm]. For example, the timestamp corresponding to the last second of 2021 would be expressed as “2021-12-31T11:59:59+00:00” in the Greenwich Mean Time (GMT) Zone. By default, datasets are presented and saved with timestamps aligned to UTC.
2.4.3. Sensor Evaluation Object
The SensorEvaluation object uses instructions from the Sensor Targets Reports to calculate performance metrics and generate figures and summary statistics.
The SensorEvaluation.calculate_metrics() method creates data structures containing tabular statistics and summaries of the evaluation conditions. A pandas DataFrame is created which contains linear regression statistics for the sensor vs. FRM/FEM comparison. The linear regression statistics included are linearity (R2), bias (slope and intercept), RMSE, N (number of sensor-FRM/FEM data point pairs), and precision metrics (coefficient of variation (CV) and standard deviation (SD)). These statistics describe the comparability of the tested sensors, as well as the minimum, maximum, and the average pollutant concentrations. Data are presented for all averaging intervals specified by the Parameter object. A deployment dictionary, SensorEvaluation.deploy_dict, is also created which contains descriptive statistics, sensor uptime at 1- and 24-hour averaging intervals, and textual information about the deployment, including the details about the testing agency and site supplied by the user during setup. This information is written and saved to a json file and is accessed when creating the final report.
The SensorEvaluation object also creates several plots. These include time series, sensor vs. reference scatter plots, performance metric plots, frequency plots of meteorological conditions, and figures exploring how meteorological conditions impact sensor performance.
The SensorEvaluation_plot_timeseries() method plots display sensor and reference concentration measurements along the y-axis as a function of time on the x-axis. Data are presented for the time averaging intervals specified by the Parameter object. For particulate matter, one plot is created for 1-hour averaging intervals and another for 24-hour averaging intervals. For gases, one plot is created for 1-hour averaging intervals. The reference measurements are shown in solid black lines and sensor measurements are shown in colored lines with a different color representing each of the collocated sensors. The time averaging interval, sensor name, and pollutant of interest are displayed in the figure header. These graphs assist users in determining whether the sensor is capturing the trends, or changes in concentrations, similarly to the reference monitor. Users may also be able to identify time periods or conditions when the sensor under- or over-reports concentrations, if the sensor may be prone to producing outliers, and/or when data completeness was poor.
The SensorEvaluation.plot_sensor_scatter() method plots time-matched measurement concentration pairs with sensor measurements along the y-axis and reference measurements along the x-axis. The one-to-one line (indicating ideal agreement between sensor and reference measurements) is shown as a dashed gray line. Measurement pairs are colored by relative humidity measured by the independent meteorological instrument at the monitoring site or sensortoolkit will attempt to use onboard sensor relative humidity (RH) measurements. The comparison statistics, linear regression equation, R2, RMSE, and N are printed in the upper left of the graphs. Separate plots are created for each collocated sensor. The sensor name, time averaging interval, and pollutant of interest are displayed in the figure header. These graphs assist users in determining whether the sensor typically under- or over-reports concentrations and whether the sensor/reference comparison is consistent through the testing period. Users can also compare the graphs for each sensor to determine if the sensor/reference comparison is consistent among the three identical sensors.
The SensorEvaluation.plot_metrics()method creates a plot that visualizes how the performance metrics for the sensor compares to the target values proposed in Sensor Targets Reports. This plot is a series of adjacent subplots each corresponding to a different metric. Ideal performance values are indicated by a dark gray line across the plot although RMSE, normalized RMSE (NRMSE), CV, and SD would ideally all be zero making this line difficult to see. Target ranges are indicated by gray shaded regions. The calculated metric for each sensor (R2, slope, intercept) or the ensemble of sensors (RMSE, NRMSE, CV, SD) are shown as small circles. Metrics are calculated for each of the time averaging intervals specified in the Parameter object. These graphs assist users in quickly determining which target values are met, how close the sensors are to meeting the missed targets, if there is variation among the batch of sensors tested, and if time averaging interval affects the sensor’s ability to meet the targets.
The SensorEvaluation.plot_met_dist() method plots the relative frequency of meteorological measurements recorded during the testing period in a bar graph. Temperature (T) and relative humidity (RH) data reflects measurements by the independent meteorological instrument at the monitoring site. The time averaging interval is displayed in the figure header. These graphs assist users in determining whether the test conditions are similar to conditions in the intended deployment location. If deployment conditions are likely to be significantly different, further testing is advised before deployment.
The SensorEvaluation.plot_met_influence() method creates plots to visualize the influence of T or RH on sensor measurements. Sensor concentration measurements are normalized by (i.e., divided by) the reference concentration measurements and then plotted on the y-axis. T and RH, as measured by the independent meteorological instrument at the monitoring site, are plotted on the x-axis. These graphs assist users in determining whether T and RH influence the performance of the sensor. For instance, if the RH graph shows an upward trend, it would indicate that the sensor concentration measurements are higher during high humidity conditions. Thus, users may wish to include a RH component when developing a correction equation for the sensor measurements.
The SensorEvaluation.plot_wind_influence() method is an optional plot to visualize the influence of wind speed and direction on sensor measurements. For wind speed, sensor concentration measurements are shown as the concentration difference between the sensor and reference monitor and plotted on the y-axis. For wind direction, concentration measurements reported by the sensor are shown as the absolute value of the concentration difference between the sensor and reference monitor and plotted on the y-axis. These graphs assist users in determining whether wind speed and wind direction influence the performance of the sensor.
In addition to plots, the SensorEvaluation object can print descriptive summaries to the console. The SensorEvaluation.print_eval_metrics() method prints a summary of the performance metrics along with the average value and the range (min to max) for that metric for each of the collocated sensors. The SensorEvaluation.print_eval_conditions() method prints a summary of the mean (min to max) pollutant concentration as reported by both the sensor and reference monitor and meteorological conditions (T and RH) experienced during the testing period. These summaries are displayed for each of the time averaging intervals specified in the Parameter object.
2.4.4. Performance Report Object
The PerformanceReport object takes the output from the SensorEvaluation object to create a testing report utilizing the Sensor Targets Report reporting template. Information about the testing organization and testing site, specified in the initial setup script, are included at the top of the first page of the testing report. Details about the sensor and FRM/FEM reference instrumentation will be populated into the report from the information obtained in the sensor and reference data ingestion module but can also be adjusted or added manually after the report is created.
Several plots generated via the SensorEvaluation object are displayed on the first page including the time series plots, one scatterplot, performance metrics and target value plots, the T and RH frequency plots describing test conditions, and plots describing meteorological influences on the sensor data. One goal of the reporting template is to provide as much at-a-glance information as possible on this first page.
The second page of the report is populated with a tabular summary of the performance metrics including R2, slope, and intercept for each of the collocated sensors. The tabular summary makes it easier to compare the metrics with the performance targets. Scatterplots for each collocated sensor are displayed on this page or the subsequent page depending on space. Multiple scatterplots are presented here to allow users to visually compare performance among the cadre of sensors.
The subsequent pages of the reporting template are designed to document supplemental information, and testers can begin populating this material after the report is generated. In accordance with the requested information in the Sensor Targets Reports, a large table of supplemental information is printed. Testers can use the check boxes to note if information is available and can manually enter details or URLs as available. If testers require more space or choose to provide more information, additional information can be added to the end of the report.