Mapping Global Surface HCHO Distribution with Confidential Interval by Satellite Observation

Formaldehyde (HCHO) is one of the most important carcinogenic air contaminants. However, the lack of global monitoring of surface HCHO concentration currently hinders research on outdoor HCHO pollution. Traditional methods are either too naive or too data-demanding for global-scale research. To alleviate this issue, we trained two fully connected neural networks, one for point estimation and one for interval estimation of surface HCHO concentration in 2019, using vertical column density data from TROPOMI and in-situ data from the HAPs (hazardous air pollutants) monitoring network and the ATom mission. Our results show that the global average surface HCHO concentration is 2.30 μg/m3. Regionally, concentrations in the Amazon Basin, North China, Southeast Asia, the Bay of Bengal, and Central and Western Africa are among the highest. Our study compensates for the global shortage of surface HCHO monitoring and gives a clearer picture of the surface concentration distribution of HCHO. In addition, the interval estimation of surface HCHO concentration, obtained with the quality-driven algorithm, quantifies the uncertainty of our results. As an early work adopting interval estimation in AI-driven atmospheric pollutant research and the first to map the global surface distribution of HCHO, our paper paves the way for rigorous study of the health risk and economic loss caused by ambient HCHO worldwide, thus providing a basis for pollutant control policies.


Introduction
Formaldehyde (HCHO) is a carcinogenic trace gas and toxic pollutant in the atmosphere [1]. The U.S. Environmental Protection Agency (EPA) considers it one of the most important carcinogens in outdoor air among the 187 hazardous air pollutants (HAPs) [2], and it accounts for more than 50% of the total HAP-related cancer risk in the United States [3]. An estimated 13 out of every million people develop nasopharyngeal carcinoma after lifetime exposure to an average HCHO concentration of 1 μg/m3 [4]. As the most abundant aldehyde in the atmosphere, HCHO is one of the major volatile organic compounds (VOCs) and pollutants in the troposphere [5], and it is closely related to the formation and loss of O3 and NO2 in the atmosphere. HCHO pollution is a global-scale issue. Ambient HCHO is produced by both natural and anthropogenic sources, such as the photolysis of isoprene from vegetation [6,7], farmland emissions [8], energy production and automobile exhaust emissions [9,10].
Surface concentration indicates the amount of HCHO that people are exposed to and is the direct data source for health risk estimation. Nevertheless, despite the crucial role of HCHO in human health and atmospheric chemistry, it is difficult to monitor HCHO systematically and comprehensively with traditional ground-based methods because of their large errors and high cost [11]. As a result, there is still no regular or large-scale monitoring of HCHO over most regions of the world, and most countries and regions with serious pollution do not measure surface HCHO concentration. Only in the United States does the HAP sampling network collect HCHO data, and it is limited to cities and industrial sites [12].
In contrast, remote sensing can monitor long-term, large-scale dynamics while avoiding many interference factors. Many satellites have been recording HCHO vertical column density (VCD) [13], which provides the data foundation for much related research. The main sensors used to measure HCHO VCD include GOME-1 [14], GOME-2 [15], SCIAMACHY [16], OMI [17] and TROPOMI [18]. In terms of precision, TROPOMI is the most advanced atmospheric monitoring spectrometer, with the highest resolution in the world. Its swath width is 2600 km, covering nearly all parts of the world every day [19]. However, remote sensing provides only the column concentration, because of the limits of satellite data collection in the vertical dimension. Therefore, most studies on ambient HCHO focus only on the total amount in the vertical column in certain regions, such as North America [20], South America [21], Europe [22], Asia [23,24] and Africa [7], rather than on surface concentration.
With increasing attention to health risks and photochemical pollution, the demand for a global view of the HCHO concentration distribution is growing more urgent. Previous researchers have used fixed-form linear models to assess the relationship between VCD and in-situ concentration¹ of NO2, SO2, CO and PM [25], and used R² to assess the relationship between vertical column density and ground in-situ concentration [26]. However, these methods are either too naive or limited to specific pollutants. Among the few existing studies addressing HCHO surface concentration, the GEOS-Chem model was adopted to make use of remote sensing data [27]. However, as an atmospheric transport model, it needs numerous input parameters, which impedes its application at a global scale, where surface monitoring data are scarce. Therefore, our aim is to derive the global distribution of surface HCHO concentration from satellite HCHO vertical columns, using quite limited in-situ HCHO concentration data.
Neural networks, a powerful class of machine learning algorithms, have gained a reputation for revealing hidden patterns in data with startling accuracy in various fields, such as image classification [28], object detection [29], image denoising [30], image synthesis [31] and person re-identification [32]. However, a vanilla neural network does not attach a confidence level or confidence interval to its point estimates, although both are necessary for scientific estimation and public policy decisions. To quantify the uncertainty of neural network results, a variety of approaches have been adopted, including Bayesian neural networks [33], the delta method [34], the bootstrap [35], mean-variance estimation [35] and the interpretation of dropout as variational inference [36]. But these methods are either computationally demanding or require strong assumptions. The quality-driven (QD) method, which builds on LUBE (lower upper bound estimation) to derive intervals for a neural network by combining the uncertainty estimation loss and the neural network loss function into one [37], is not only compatible with gradient descent algorithms but also shrinks the average interval length by up to 10% compared with previous works [38]. Therefore, to enhance the credibility of our model, we use this method to obtain the interval estimation of surface HCHO concentration. Combining the point and interval estimations is expected to strike a balance between maintaining accuracy and controlling uncertainty at a pre-set confidence level.
The potential health impact of HCHO, together with the lack of global monitoring data, calls for an efficient way to understand the global surface distribution of HCHO with limited data. In this paper, we derive, for the first time, the global surface concentration of HCHO in 2019 by feeding TROPOMI VCD data and limited surface HCHO concentration data into neural network models. We also capture the seasonal changes in key areas and give an interval estimation of surface HCHO with the QD method. As an early work adopting interval estimation in AI-driven atmospheric pollutant research and the first to map the global surface distribution of HCHO, our paper paves the way for rigorous study of the health risk and economic loss caused by ambient HCHO worldwide, thus providing a basis for pollutant control policies.
¹ In-situ HCHO concentration includes surface concentration and high-altitude concentration from ATom flight data.

Sentinel-5P VCD Data
The vertical column density data of HCHO in this study come from TROPOMI (Tropospheric Monitoring Instrument), which is carried on Sentinel-5P [19]. Sentinel-5P is a global air pollution monitoring satellite launched by ESA on October 13, 2017, as part of the Copernicus programme. TROPOMI observes trace gases in the atmosphere around the world, including NO2, O3, SO2, HCHO, CH4, CO and other important indicators closely related to human activities, and also strengthens the observation of aerosols and clouds [39]. In terms of accuracy, TROPOMI is the most advanced atmospheric monitoring spectrometer, with the highest spatial resolution in the world. The satellite covers all parts of the world every day with an imaging resolution of 7 km × 7 km. It crosses the equator at about 13:30 local time on each orbit, which ensures the comparability of data across regions [19]. Sentinel-5P data are publicly available (see the Data Availability Statement). We use data from 2019 for two reasons. First, 2018 was the first year of Sentinel-5P operation, and the product algorithm was not yet stable. Second, 2020 witnessed the COVID-19 pandemic, which may have had an unusual impact on anthropogenic sources, making the results unrepresentative of the long-term status. Offline HCHO data from January 1 to December 31, 2019 were collected. Following the technical documents, data points whose quality index (qa_value) is less than 0.5 were removed. After mosaicking the datasets and applying Ordinary Kriging interpolation, we obtained the global distribution of average HCHO column concentration at 0.05° resolution. Because satellite data are sparse and human activity is scarce at high latitudes, data beyond 60°S and 60°N were discarded, which has little impact on health risk estimation.
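The two screening rules described above (quality index of at least 0.5, latitude within 60°S to 60°N) can be sketched as a simple filter. This is only an illustration: the record layout, the field names other than `qa_value`, and the sample values are hypothetical, and in the actual workflow the latitude cut is applied to the interpolated grid rather than to individual retrievals.

```python
def keep_pixel(pixel, qa_threshold=0.5, max_abs_lat=60.0):
    """Return True if a retrieval passes the quality and latitude screens."""
    good_quality = pixel["qa_value"] >= qa_threshold   # drop qa_value < 0.5
    in_band = abs(pixel["lat"]) <= max_abs_lat         # drop data beyond 60°S / 60°N
    return good_quality and in_band


# Hypothetical retrievals; the numbers are illustrative only.
pixels = [
    {"lat": 10.0, "lon": 100.0, "qa_value": 0.9, "vcd": 1.2e16},
    {"lat": 35.0, "lon": 115.0, "qa_value": 0.3, "vcd": 0.8e16},   # low quality
    {"lat": 72.0, "lon": 20.0, "qa_value": 0.8, "vcd": 0.2e16},    # beyond 60°N
    {"lat": -59.9, "lon": -70.0, "qa_value": 0.5, "vcd": 0.3e16},
]

kept = [p for p in pixels if keep_pixel(p)]
print(len(kept))
```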

In-situ Data
Since our study aims to estimate the surface concentration of HCHO at a global level, we need data from diverse types of underlying surfaces and different altitudes to train our model. Therefore, the following two data sources are chosen.
ATom flight data. NASA's Atmospheric Tomography Mission (ATom) was a systematic, global-scale sampling of the atmosphere conducted by the United States from 2016 to 2018, with continuous profiling from 0.2 km to 12 km. ATom flight data record the volume mixing ratio of HCHO in air. A large number of gas and aerosol payloads were deployed on NASA's DC-8 aircraft, and HCHO was measured by the ISAF instrument [40,41]. The instrument uses laser-induced fluorescence (LIF) to obtain the high sensitivity needed to detect HCHO in the upper troposphere and lower stratosphere, where it has an abundance of 10 parts per trillion. LIF also responds quickly enough to measure the abundance of HCHO in the fine-structure outflow of convective storms. These HCHO measurements are used to elucidate the mechanism of convective transport and to quantify the effects of boundary layer pollutants on ozone photochemistry and cloud microphysics in the upper atmosphere [42].
HAPs ground monitoring data. We obtained ground HCHO observations from the EPA SLTS network at https://www.epa.gov/outdoor-air-quality-data, which reports 24-hour average HCHO concentrations throughout the year. Here, we selected 5965 records from 109 sites in 2019, covering the whole country, as shown in Figure 2 (a).
These two datasets cover a wide range of latitudes, from -8.1977° S to 82.9404° N, and a diverse variety of landscapes in the U.S. The HAPs data ensure that the concentration distribution at ground level is emphasized in our model, and the ATom data ensure that our model can be generalized and applied at a global extent. Therefore, our dataset satisfies the requirements of this study.
Figure 2. (a) The geographical distribution of our data, where red represents ATom flight data points and green represents the HAPs ground monitoring network. (b) The meaning of "Height" and "Altitude" for ATom mission data.
Since ATom data are obtained far above the surface, and the vertical distribution of HCHO usually changes sharply from the ground to 1~2 km above it [43], we take "Height" as another input variable in our model to control for the vertical distribution along the column. For the HAPs ground monitoring data, we set the height to 0.

Global DEM Data
Since descriptive statistics show a negative relationship between surface altitude and in-situ concentration, with a Pearson correlation of r = -0.3907 in our in-situ dataset, we use global Digital Elevation Model (DEM) data as one of the input variables, "Altitude", in the estimation of in-situ concentration. The relationship between the variables "Height" and "Altitude" is shown in Figure 2 (b). In our study, we use the Shuttle Radar Topography Mission (SRTM) DEM product and resample it to a resolution of 0.05°. This dataset has an initial resolution of 90 m at the equator and is provided in WGS84 projection with 1 arc resolution [44].

Data Processing
After collecting and organizing the data into a structured format, we first visualize and preprocess them. Then, two neural networks are implemented for point and interval estimation using PyTorch, a well-known deep learning framework. Our code is available online³.
The preprocessed data with ground-truth in-situ HCHO concentration are then split into training and testing datasets to train our models. After that, global VCD data are fed into the models to derive the global in-situ HCHO concentration.

Preprocessing
Although a neural network can in theory handle input data from different distributions, we noticed a significant defect when training without preprocessing, owing to the highly imbalanced, skewed distribution of the HCHO concentration (both column and in-situ). The logarithm of the HCHO concentration data shows a bell-shaped distribution, and the resulting gain in estimation accuracy confirmed the effectiveness of the log-transformation.
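The effect of the log-transformation on a skewed sample can be illustrated with a small sketch. The concentrations below are hypothetical values chosen only to be right-skewed, and the moment-based `skewness` helper is an illustrative choice rather than the diagnostic used in this study.

```python
import math


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed concentrations (illustrative values only).
raw = [0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.5, 5.0, 12.0]

logged = [math.log(v) for v in raw]        # model inputs and targets live in log space
restored = [math.exp(v) for v in logged]   # predictions are mapped back with exp

print(skewness(raw), skewness(logged))
```

The skewness of the logged sample is much closer to zero than that of the raw sample, which is the bell-shaped behaviour described above; predictions made in log space have to be exponentiated before they are reported as concentrations.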

Neural Network Architecture
As a universal function approximator, the neural network plays a vital role in deriving the point and interval estimates of HCHO concentration. Instead of training a single network to produce these estimates jointly, we construct two separate neural networks, one for point estimation and one for interval estimation, because experiments indicate that a joint model always has to compromise between the two, which greatly damages the accuracy of point estimation.
Like ordinary fully connected neural networks, each network in our model contains three input nodes followed by three BFR blocks (the ReLU in the last block is disabled). The network for point estimation has one output node, and the network for interval estimation has two. The structure of our model is shown in Figure 3. To stabilize the training and prediction procedure, instead of stacking only fully connected and non-linear activation layers, we stack BFR blocks, each made up of a batch normalization layer, a fully connected layer and a ReLU activation layer in sequence.
Batch normalization (BN) was first introduced to address Internal Covariate Shift, a phenomenon referring to the unfavorable change of data distributions in the hidden layers. Like data standardization, BN forces the inputs of each hidden layer to have the same mean and variance in every dimension, which not only regularizes the network but also accelerates training by reducing the dependence of gradients on the scale of the parameters or of their initial values [45].
A fully connected (FC) layer follows immediately after the BN layer to provide a linear transformation, with the number of hidden neurons set to 50. The output of the FC layer is non-linearly activated by the ReLU function [46,47].
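A minimal PyTorch sketch of the BFR stacking described above is given below. It is an interpretation of the text rather than the released code: the class and function names are hypothetical, the assumption that the third block maps the 50 hidden units directly to the output nodes is ours, and so is the order of the three input variables.

```python
import torch
from torch import nn


class BFRBlock(nn.Module):
    """Batch normalization -> fully connected -> (optional) ReLU."""

    def __init__(self, in_features, out_features, use_relu=True):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_features)
        self.fc = nn.Linear(in_features, out_features)
        # The ReLU of the last block is disabled so outputs are unconstrained.
        self.act = nn.ReLU() if use_relu else nn.Identity()

    def forward(self, x):
        return self.act(self.fc(self.bn(x)))


def build_network(n_outputs, n_inputs=3, hidden=50):
    """Three stacked BFR blocks; assumes the last block emits the outputs."""
    return nn.Sequential(
        BFRBlock(n_inputs, hidden),
        BFRBlock(hidden, hidden),
        BFRBlock(hidden, n_outputs, use_relu=False),
    )


torch.manual_seed(0)
point_net = build_network(n_outputs=1)      # point estimate
interval_net = build_network(n_outputs=2)   # lower and upper bound

# Random placeholder batch standing in for (altitude, log VCD, height).
x = torch.randn(8, 3)
point_out = point_net(x)
interval_out = interval_net(x)
print(point_out.shape, interval_out.shape)
```

Keeping the two networks separate, as the text argues, means each can be trained with its own loss without the point estimate being pulled toward the interval objective.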

Loss function
Objective functions of a suitable form are crucial for stochastic gradient descent to converge during training. While point estimation only needs to take precision into consideration, two conflicting factors are involved in evaluating the quality of interval estimation: a higher confidence level usually yields a longer interval, and vice versa.
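Point estimation loss. Rather than an elaborate form, we found that a simple loss is sufficient for rapid training [loss expression missing in the source text].
Interval estimation loss. The interval estimation loss is more complex than the point estimation loss. The QD loss takes the confidence level and the interval length into consideration simultaneously [38]. Using the notation of [38]:
$$\mathrm{Loss}_{QD} = \mathrm{MPIW}_{capt} + \lambda \frac{n}{\alpha(1-\alpha)} \max\left(0,\ (1-\alpha) - \mathrm{PICP}\right)^2$$
where $n$ is the number of samples and $\lambda$ weights the coverage penalty against the interval length.
On one hand, to control the confidence level of the interval estimator, $\alpha$ is set to indicate the largest proportion of intervals that may fail to cover the true value. We set multiple values of $\alpha$, including 0.05, 0.10 and 0.20, in our model to derive interval predictions with various confidence levels and average interval lengths, and we verified that a higher $\alpha$ yields shorter intervals.
$\mathrm{PICP}$ indicates the covering rate of the intervals:
$$\mathrm{PICP} = \frac{1}{n}\sum_{i=1}^{n} k_i$$
where $k_i = 1$ if and only if $\hat{y}_{L,i} < y_i < \hat{y}_{U,i}$, and $k_i = 0$ otherwise. Here $y_i$ is the true value, and $\hat{y}_{L,i}$ and $\hat{y}_{U,i}$ are the predicted lower and upper bounds.
On the other hand, the average length of the intervals should be minimized subject to $\mathrm{PICP} > 1-\alpha$. However, intervals that fail to capture their corresponding data point should not be encouraged to shrink further. The average interval length to penalize is, therefore,
$$\mathrm{MPIW}_{capt} = \frac{1}{\sum_{i=1}^{n} k_i}\sum_{i=1}^{n} \left(\hat{y}_{U,i}-\hat{y}_{L,i}\right) k_i$$
During training, $k_i^{soft} = \sigma\left(s\,(y_i-\hat{y}_{L,i})\right)\,\sigma\left(s\,(\hat{y}_{U,i}-y_i)\right)$ works as a continuous approximation of the "hard" $k_i$, since the sigmoid function $\sigma$ is known for providing a differentiable alternative to discrete step functions; $s = 160$ is a hyperparameter for smoothness.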
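To make the interplay between coverage and interval length concrete, the following sketch evaluates a QD-style loss on a toy batch in plain Python. It follows the general form published in [38] as we understand it. The penalty weight `lam`, the small constant added to the denominator, the choice of using the soft indicator in the coverage penalty and the hard indicator in the width term, and the toy numbers are all assumptions of this illustration; only `s = 160` and the values of `alpha` come from the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def qd_loss(y, lower, upper, alpha=0.10, lam=15.0, s=160.0):
    """Return (loss, hard covering rate) for one batch of intervals."""
    n = len(y)
    # Hard indicator: 1 if the interval captures the true value.
    k_hard = [1.0 if lo < yi < up else 0.0 for yi, lo, up in zip(y, lower, upper)]
    # Soft indicator: product of two sigmoids, smooth in both bounds.
    k_soft = [sigmoid(s * (yi - lo)) * sigmoid(s * (up - yi))
              for yi, lo, up in zip(y, lower, upper)]
    picp_hard = sum(k_hard) / n
    picp_soft = sum(k_soft) / n
    # Mean width of the capturing intervals only (small constant avoids 0/0).
    mpiw_capt = sum((up - lo) * k for lo, up, k in zip(lower, upper, k_hard)) \
        / (sum(k_hard) + 1e-6)
    # Penalty is active only when coverage falls below 1 - alpha.
    shortfall = max(0.0, (1.0 - alpha) - picp_soft)
    penalty = lam * n / (alpha * (1.0 - alpha)) * shortfall ** 2
    return mpiw_capt + penalty, picp_hard


# Toy batch: three of four intervals capture the true value.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 2.5, 4.5]
upper = [1.5, 2.5, 3.5, 5.5]
loss, coverage = qd_loss(y_true, lower, upper)
print(loss, coverage)
```

With a coverage of 0.75 against a target of 0.90 the penalty term dominates the loss, so gradient descent would widen or shift the missed interval before shrinking the others. Once coverage reaches the target, only the mean width of the capturing intervals remains to be minimized.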

Point Estimation
Our point estimation model shows relatively high accuracy and is generally consistent with previous studies on the vertical distribution of HCHO. By loading the global DEM, the logarithm of VCD and the height (0 m at the surface) into the model, we obtain the annual average map of the global in-situ HCHO distribution. Figure 5 shows six regions where the in-situ HCHO concentration is generally high, namely the Amazon area, the southeastern U.S., Central and Western Africa, northeastern India, Southeast Asia, and North China, with an average concentration of more than 4 μg/m3. We take a closer look at the seasonal change of HCHO in these key areas in section 3.3.
The uneven distribution of HCHO concentration between sea and land surfaces also deserves mention. HCHO is lower and more homogeneous over the sea surface than over land, which the statistics in Table 2 confirm. We hypothesize that sea surfaces with a high concentration are often affected by transport from a nearby continent. This phenomenon is especially obvious in low-latitude regions. We call these areas "transmission paths", which are further discussed in section 4.2.
Cities, as the regions with the densest population, deserve specific attention because of the known and potential harm of HCHO to the people living there. Table 3 shows the in-situ HCHO concentration of some typical cities in these regions, where Jakarta and Singapore, two major cities in Southeast Asia, rank first and second, reaching 6.18 and 5.83 μg/m3.

Interval Estimation
Besides point estimation, our model also estimates the upper and lower bounds of the in-situ HCHO concentration, so that we can evaluate the uncertainty, or fluctuation, of the in-situ concentration. Figure 6 visualizes, in a 3D space, the relationship between the estimated upper bound, the estimated lower bound and the point estimation acquired in section 3.1. We would like to emphasize that the captured uncertainty, or the interval length, delineates the fluctuation range of the data itself, not a lower trustworthiness of our model or its estimations.
Figure 6. (a), (b).
The confidence level, together with the interval length, lays the foundation for the trustworthiness and precision of our interval prediction. As shown in Table 4, our interval estimation model obtains covering rates (the proportion of true values covered by the predicted intervals) of 94.41% and 88.74%, which exceed the pre-set confidence levels of 1 − α = 0.9 and 1 − α = 0.8, respectively.
Moreover, as expected in section 2.2.1.2, a higher confidence level yields a longer interval: the length is 4.530 for 1 − α = 0.9, 17% more than the 3.864 for 1 − α = 0.8. This is also confirmed by the minimum, maximum and mean values of the upper and lower bounds for the two confidence levels in the table. However, the standard deviation of the upper bounds is considerably greater than that of the lower bounds in both scenarios in Table 4. Plotting the upper and lower bounds as in Figure 7 shows that the upper bound is not well determined, although the interval estimation successfully covers the true values (and the point estimates, as discussed below) of in-situ concentration. Nevertheless, the seasonal changes of HCHO in some key areas, explored in section 3.3, suggest that seasonal variation of in-situ HCHO may contribute the majority of the uncertainty of interval estimation.
Figure 7. (a) The joint density distribution of the upper bound (x-axis) and the lower bound (y-axis). The upper and lower bounds are significantly positively related, and a majority of the predicted intervals lie in the region of 0.5~1.0 for the lower bound and 5~10 for the upper bound. (b) Relation between the point estimation (x-axis) and the predicted intervals (y-axis; red points for upper bounds and grey points for lower bounds). The black line, y = x, indicates the relative position of the true values within the predicted intervals. The predicted intervals cover the true values in most cases.

Seasonal Changes of HCHO in Some Key Areas
To better understand the seasonal variation of surface HCHO, we visualize four typical months for some key areas where the in-situ concentration is relatively high. We visualize typical months rather than seasonal averages because they provide a stronger contrast and represent the trend better.
America. Figure 9 shows the in-situ concentration in February, May, August and November in South America and around the Caribbean Sea. The Amazon Basin, Paraguay and eastern Central America have a high in-situ HCHO concentration in November and February, while the southeastern coast of the U.S. has its highest concentration in November and is almost free from HCHO pollution in February and May. The Andes Mountains show a markedly low concentration of less than 0.5 μg/m3.
Africa. As shown in Figure 10, there are two regions in Africa where the in-situ HCHO concentration is relatively high. One is in the south of D. R. Congo around the city of Kolwezi, a mining center with a humid subtropical climate. The in-situ concentration of HCHO here peaks in February. The other pollution belt stretches along the Gulf of Guinea, which is known for its rainforest climate.
Figure 11. In-situ concentration of HCHO in the Indo-Pacific region in some typical months.

Consistency and innovativeness
Through the work above, we have, for the first time, obtained the global surface distribution of HCHO in 2019 with point and interval estimation. As seen in Figure 12, our result is generally consistent with a previous mapping of surface HCHO obtained from OMI data and the GEOS-Chem model for 2005 to 2016, but shows less noise along the satellite tracks. The smaller figure on the right side gives a clearer view of this phenomenon. Our estimation shows a somewhat reversed trend in the Cordillera mountain area, which future research may validate. However, since this difference occurs in places where the population is sparse, it is unlikely to have a perceivable influence on the estimation of cancer risks.
The 2019 global surface concentration estimate gives us a closer look at the global distribution pattern of HCHO. We notice that HCHO tends to prevail over continental plains rather than over the ocean or high-altitude areas. According to previous studies, this can be attributed to the scarcity, over oceans and high-altitude areas, of VOC sources such as the chemical industry, combustion and rainforests, whose emissions are common precursors in the free-radical reactions that produce HCHO [46][47][48]. By mapping the distribution of HCHO, we can also preliminarily distinguish two kinds of sources around the world. One is plant-related, including the Amazon, Southeast Asia and the Gulf of Guinea; the other is human-related, including the North China Plain and the Pearl River Delta [49,50]. More work is needed to accurately identify the sources in these HCHO-polluted areas.
In addition, we introduce neural network interval estimation into the conversion to in-situ concentration for the first time, increasing the credibility of the model by providing uncertainty information. This idea can partly compensate for the limited interpretability of neural network models [51], and is thus useful for future applications of neural networks in atmospheric pollutant and health risk estimation.

HCHO Transmission path along the equator
As far as we know, the phenomenon of HCHO transmission paths along the equator has not been discussed in previous studies. Figure 13 shows four transmission paths of HCHO, namely the Central America-Pacific path, the Africa-Atlantic path, the India-Indian Ocean path and the Southeast Asia-Pacific path. These paths all lie around the equator and indicate the possibility that HCHO pollution has a significant cross-region transmission feature. The paths on the west side of America, Africa and Indonesia can be attributed to the constant west wind along the equator [52]. The path on the east side of Papua New Guinea, however, is hard to explain, since this region should be dominated by the southeast trade wind [53]. Future research may collect observations over a longer period to see whether this is a regular feature and whether it has any significant impact on global atmospheric processes.

Health risk of HCHO in major cities
HCHO, as one of the most important carcinogens in the outdoor environment [2], has long drawn little attention because most countries and regions lack ground-based HCHO monitoring, which leads to a shortage of knowledge about the health and economic losses it causes. Although the vertical column density of HCHO is currently available and addresses part of these concerns, the in-situ HCHO concentration is more useful, as it better reflects the HCHO concentration that people are actually exposed to. Taking 2019 as an example, we assume that the HCHO concentration always remains the same as in 2019. Based on the inhalation unit risk estimate from EPA and population data [4,54], the health risks in the main high-risk cities are calculated in Table 5. More than a thousand people are estimated to develop cancer from HCHO exposure in Jakarta, Dhaka, Bangkok, Kolkata, Beijing and Guangzhou. Jakarta has the highest number of cases, up to 2593. Meanwhile, Jakarta, Singapore, Kuala Lumpur, Dhaka and Lagos have the highest risk rates, at 80.34, 75.79, 72.93, 71.63 and 71.37 cases per million people, respectively. Notably, the high-risk cities are concentrated in Southeast Asia, a region neglected in HCHO pollution and health risk research, so Southeast Asia may become the next research focus in HCHO pollution.
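The per-million figures above are consistent with a linear unit-risk calculation, sketched below under stated assumptions. The factor of 13 cases per million people per 1 μg/m3 is the lifetime figure quoted in the Introduction [4], and the concentrations for Jakarta and Singapore are the values reported in the point estimation section. The function names and the population used in the last line are hypothetical and serve only to show how a case count would follow.

```python
# Lifetime cases per million people per 1 ug/m3 of HCHO (Introduction, [4]).
UNIT_RISK_PER_MILLION = 13.0


def lifetime_cases_per_million(concentration_ug_m3):
    """Linear unit-risk model: risk scales with the exposure concentration."""
    return UNIT_RISK_PER_MILLION * concentration_ug_m3


def expected_cases(concentration_ug_m3, population):
    """Expected lifetime cases for a given exposed population."""
    return lifetime_cases_per_million(concentration_ug_m3) * population / 1_000_000


jakarta = lifetime_cases_per_million(6.18)     # concentration reported in the text
singapore = lifetime_cases_per_million(5.83)   # concentration reported in the text
print(round(jakarta, 2), round(singapore, 2))  # 80.34 75.79

# Hypothetical population, for illustration only (not taken from [54]).
print(expected_cases(6.18, 10_000_000))
```

Because the model is linear, the ranking of cities by cases per million is simply the ranking by concentration, while the ranking by total cases also depends on population.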

Deficiencies and Prospects
First, the data on in-situ HCHO concentration are seriously insufficient in both spatial and temporal dimensions. Since only the United States monitors in-situ HCHO concentration routinely, in-situ data in low-latitude regions remain sparse even with ATom data included, which may bias estimates in low-latitude areas such as Asia and Africa. This sparsity has also been a major obstacle to improving the results by adding more covariates to our model: experiments with additional covariates, such as latitude and month, unfortunately produced degenerate or overfitted outputs. Moreover, the large gap between the true values and the upper bounds from our interval estimation model may suggest that the in-situ HCHO concentration is heterogeneous across months or seasons, as our model is required to give interval estimates on the scale of a whole year rather than on finer time scales. The seasonal changes of HCHO in some key areas, visualized in section 3.3, show this phenomenon directly.
Therefore, we expect that as HCHO in-situ monitoring networks develop, a larger amount of data from more diverse sites will enable a careful design of temporal data input and give a better estimation of in-situ HCHO concentration. Meanwhile, as Sentinel-5P accumulates more data, we expect that our model can take more factors, including latitude and season, into consideration, which could provide a more precise estimation of global-scale health risk and economic loss for specific regions and seasons. Beyond its significance for health risk, our study is also conducive to research on the generation of photochemical pollution and on the concentration of VOCs, NO2 and other pollutants related to photochemical reactions.

Conclusions
With the help of the quality-driven interval estimation algorithm designed for neural networks, we provide confidence intervals at different confidence levels and a precise point estimation of 2019 global surface HCHO with a limited amount of data. By mapping the HCHO concentration distribution, we find that Southeast Asia, North China, Central and Western Africa, and the rainforest area of Latin America have relatively more serious HCHO pollution. Major cities in these regions, such as Bangkok, Beijing, Guangzhou and Singapore, have an annual concentration over 5.00 μg/m3, the health effects of which deserve more attention from academia and governments.
Our work paves the way for research on formaldehyde-related cancers and provides guidance for policy making and insurance pricing. To the best of our knowledge, we are the first to map the global distribution of HCHO and provide insights on its potential health risks. As HCHO VCD data from Sentinel-5P accumulate, we can map the surface concentration of HCHO over a longer period and give a more precise estimation of the global risk distribution of formaldehyde-related cancers, which would be more statistically reliable with our confidence intervals.
Data Availability Statement: The data presented in this study are openly available at https://s5phub.copernicus.eu/dhus/#/home for Sentinel-5P VCD data; https://www.epa.gov/outdoor-air-quality-data for HAPs ground monitoring data; https://drive.google.com/drive/folders/0B_J08t5spvd8VWJPbTB3anNHamc for global DEM data. ATom flight data are available in a publicly accessible repository [40,41].