Developing High-resolution Gridded Rainfall and Temperature Data for Bangladesh: the ENACTS-BMD dataset

This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations with satellite products (for rainfall) and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981 and is updated monthly in real time, and outputs are available at daily, dekadal and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, are used to generate the dataset. The data processing includes the collection of station and gridded data, quality control of station data, downscaling of the reanalysis for temperature, bias correction of both the satellite rainfall and the downscaled reanalysis temperature, and the combination of station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website. It enhances the provision of services, helps overcome the challenges of data quality, availability and access, and at the same time promotes engagement and use by stakeholders.


INTRODUCTION
Meteorological observation data (historical records and near-real-time) are the backbone of National Weather Services (NWS). Long-term historical data allow the assessment of climate risks and long-term trends in rainfall and temperature, while near-real-time data are important for monitoring weather-related hazards, allowing governments and agencies to develop timely and actionable early warning systems (Maidment et al., 2017). Traditionally, records from the (in-situ) weather station networks maintained by NWS provide the most reliable means of obtaining accurate local information about weather and climate. However, station networks are often spatially sparse, and missing records leave gaps in the information. Therefore, high-quality datasets with appropriate spatial and temporal coverage are in high demand.
For Bangladesh, long-term rainfall records are available from the Bangladesh Meteorological Department (BMD) for multiple weather stations distributed all over the country (Fig. 1). However, some of the stations have long periods of missing data, and although the spatial coverage is adequate considering the size of the country, stations are sparsely distributed over some regions, especially in the Northeast, the rainiest region of the country. Consequently, large areas of Bangladesh have inadequate coverage, which, for example, limits the monitoring and analysis of small-scale, short-lived convective systems, because each rain-gauge measurement has to be taken as representative of an area of several square kilometers surrounding the station. Similarly, sparse temperature records may be representative of only a small area when factors such as topography modify the spatial patterns, limiting the use of neighboring stations for data filling. Moreover, since Bangladesh is very prone to extreme weather events leading to flash floods, droughts or heatwaves, high-quality rainfall and temperature data are of great importance for monitoring them, as well as for other applications such as services for agriculture and water resources management in remote areas (Nashwan et al., 2019). However, simple statistical interpolation of station data is only a partial solution, as the associated uncertainty can be high when the commonly used covariates explain a small proportion of the observed variance. Limitations associated with station observations can be more important over regions with complex topography, where station measurements are generally sparse or nonexistent. Satellite rainfall estimates and reanalysis data, which are easily and freely available, provide an alternative for obtaining climate data over regions where ground observations are limited or unavailable.
However, without direct reference to ground measurements, satellite rainfall estimates and reanalysis are subject to systematic bias. In principle, a suitable fusion of weather station data with satellite and reanalysis products is expected to provide sufficiently accurate spatial and temporal rainfall and temperature estimates on an operational basis for a wide range of data users. To overcome the unavailability of long-term quality-assured climate data, a high-resolution gridded rainfall and temperature dataset, the ENACTS-BMD product, is being developed as part of the "Enhancing National Climate Services" (ENACTS) initiative under the Columbia University's World Project "Adapting Agriculture to Climate Today, for Tomorrow" (ACToday; https://iri.columbia.edu/actoday/), in close collaboration between BMD and the International Research Institute for Climate and Society (IRI), Columbia University. IRI has a long history of generating high-resolution meteorological data for African countries under the ENACTS program (Dinku et al., 2017). The main processing steps to generate ENACTS-BMD are spatial downscaling and bias correction of the temperature reanalysis and satellite rainfall estimates, followed by the merging of the station data with the bias-corrected gridded data. In this study, we describe the construction and validation of the ENACTS-BMD dataset.

Development of ENACTS-BMD database
The development of the ENACTS-BMD dataset involves several steps: data collection, quality control of station data, spatial downscaling of the reanalysis for temperature, bias correction of both rainfall and temperature gridded data, and finally the combination of station and bias-corrected gridded data. Each of these steps is performed using the Climate Data Tools (CDT), an open-source R package based on a set of utility functions for meteorological data quality control, homogenization, and merging of station data with satellite data and other proxies such as reanalysis. The tool is developed and maintained by the International Research Institute for Climate and Society (IRI), Columbia University. All functions in CDT are available at https://github.com/rijafiri/CDT in graphical user interface (GUI) mode.

Station data availability
The main data source used to generate the ENACTS-BMD dataset is the set of daily observations provided by BMD. However, data prior to 1979 are known not to be of high quality and contain a large number of missing records. Although these data could also be considered for improvement, high-resolution satellite precipitation products are typically available only since 1980. Therefore, we decided that ENACTS-BMD begins in January 1981. The percentages of available (non-missing) data for 1981-2019 for rainfall and for minimum and maximum temperature are presented in figure 2. For rainfall, 35 out of 54 available stations have more than 80% of the data available, whereas most of them (54) report 90-100% of temperature data.

Quality control
As mentioned above, data quality control was performed using CDT. The typical workflow for quality control is as follows:
• Verification of the geographical coordinates of the in-situ stations.
• For daily rainfall data, the second step is checking for false zeros in a given month. A false zero is defined as an abnormal run of observed zero values for the whole month during the rainy season. The main source of false zeros is the ambiguity in coding days with no observations (missing values) and days without rain (zero values) when digitizing the paper records. In these cases, data entry operators leave the input empty for both no observations and no rain.
• The next step is to check for outliers in the time series. Outliers can be checked both temporally and spatially.
  o A temporal check is performed for each month to ensure that each observed value is consistent with the climatology of each station. All values detected by the test are flagged as suspicious or as outliers and must be checked very carefully.
  o A spatial check is performed by comparing each value of a given station with the values of neighboring stations for the same date.
  o Internal consistency checks are performed for temperature data, that is, checking whether the minimum temperature is greater than the maximum temperature.
• Erroneous values and outliers were identified accordingly. However, undetected observation errors are always possible, even when using objective methods. Since we did not have the original paper records for manual verification, the erroneous values were replaced by missing values. All days with precipitation greater than 600 mm were eliminated. For temperature data, whenever possible the outliers were replaced with data estimated from neighboring stations. Removing valid extreme values can cause errors in the merged data as easily as keeping erroneous extreme values.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 18 December 2020
An example of the spatial check results for rainfall and maximum temperature is presented in figure 4. The station in red reported an extremely high daily rainfall and a relatively low maximum temperature, both very different from the values recorded by the neighboring stations.
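The quality-control checks listed above can be illustrated with a minimal Python sketch. It is not CDT's implementation (CDT is an R package with its own tests). The function names, the `None` missing-value marker and the three-standard-deviation rule for the temporal check are assumptions made for illustration only; the 600 mm limit is the one stated in the text.

```python
from statistics import mean, stdev

MISSING = None  # hypothetical marker for a missing observation


def flag_false_zero_months(daily_rain_by_month, rainy_months):
    """Return the (year, month) keys of rainy-season months whose observed values are all zero."""
    flagged = []
    for (year, month), values in daily_rain_by_month.items():
        observed = [v for v in values if v is not MISSING]
        if month in rainy_months and observed and all(v == 0 for v in observed):
            flagged.append((year, month))
    return sorted(flagged)


def flag_temporal_outliers(values, n_sd=3.0):
    """Return indices of values more than n_sd standard deviations from the series mean.

    The n_sd rule is an assumed stand-in for a climatology-based test.
    """
    observed = [v for v in values if v is not MISSING]
    if len(observed) < 3:
        return []
    mu, sd = mean(observed), stdev(observed)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values)
            if v is not MISSING and abs(v - mu) > n_sd * sd]


def flag_tmin_above_tmax(tmin, tmax):
    """Return indices of days where minimum temperature exceeds maximum temperature."""
    return [i for i, (lo, hi) in enumerate(zip(tmin, tmax))
            if lo is not MISSING and hi is not MISSING and lo > hi]


def remove_extreme_rain(values, limit_mm=600.0):
    """Replace daily rainfall above limit_mm with the missing marker."""
    return [MISSING if (v is not MISSING and v > limit_mm) else v for v in values]
```

Each function only flags or masks values; deciding whether a flagged value is a genuine extreme still requires inspection, as the text stresses.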

Merging stations with gridded data
For constructing the gridded rainfall product, BMD station data were merged with satellite rainfall estimates from the Climate Hazards Group InfraRed Precipitation (CHIRP) product, developed by the Climate Hazards Group at the University of California, Santa Barbara, in association with the U.S. Geological Survey Earth Resources Observation and Science Center. CHIRP is a latest-generation, satellite-only rainfall product with high temporal (daily, pentad and dekadal) and spatial (0.05° × 0.05°) resolution and quasi-global coverage (50°S-50°N), with rainfall time series available from 1981 to the present. CHIRP precipitation is generated from satellite thermal infrared (TIR) measurements, with the mean bias removed using a satellite-improved station-based climatology. The detailed procedure for generating CHIRP can be found in Funk et al. (2014), and the data are available at ftp://chgftpout.geog.ucsb.edu/pub/org/chg/products/CHIRP/.
Similarly, the main source of temperature data in Bangladesh is the BMD station network. However, since the network is sparse over some regions, temperature values for locations that are not covered have to be computed from the neighboring stations when necessary. The use of reanalysis data as a proxy can potentially solve this problem. The main advantages of reanalysis products are that (1) they provide a multivariate, spatially complete and uniform record of multiple atmospheric variables, and (2) they are produced using a single version of a data assimilation system and are therefore not affected by changes in methods, which makes them suitable for studies of longer-term climate variability. In the present work, the JRA-55 reanalysis (Japanese 55-year Reanalysis), developed by the Japan Meteorological Agency and among the most recent long-term reanalysis projects, is used to generate the temperature datasets. It adopts a relatively high-resolution (∼55 km) atmospheric model and uses state-of-the-art assimilation techniques. The dataset extends from 1958 to the present with 3-hourly temporal resolution. Both the data and the documentation are available at https://jra.kishou.go.jp/JRA-55/index_en.html. These products were selected because they span a long time period (over 30 years), and because of their spatial and temporal resolution and their availability in near-real time.
The steps considered for merging station and gridded data are: 1) spatial downscaling of the reanalysis temperature data, 2) bias correction of the downscaled temperature and the satellite rainfall estimates, and 3) merging of the station data with the bias-corrected gridded data. There are many ways to correct specific errors and artifacts in the gridded data. They can be corrected regardless of the source of the error using ground-truth measurements such as station data. In this sense, CDT has 4 methods of bias correction, which are Multiplicative ...
The workflow used to generate the final gridded products for rainfall and temperature time series is as follows:
• Downscale: Since the JRA-55 reanalysis data have a spatial resolution of approximately 55 km, the reanalysis temperature data have to be downscaled to match the rainfall grid (0.05° × 0.05°). As temperature depends on altitude, a constant lapse rate (vertical temperature gradient) is computed for each month from station temperature data and elevation data from a digital elevation model. These coefficients are then applied to the reanalysis data using a linear model. A bilinear method is used for the interpolation.
• Bias correction: The bias coefficients are computed and interpolated onto the grid of the satellite rainfall estimates using inverse distance weighted interpolation. To interpolate a grid point, at least 3 and at most 9 stations within a radius of influence of 1.0 degree for rainfall and 1.5 degrees for temperature are used. To correct the bias, the satellite rainfall estimates and the downscaled reanalysis temperature data are multiplied by the interpolated bias coefficients.
• Merging: To merge the station data with the bias-corrected satellite rainfall estimates and downscaled reanalysis temperature, a nested merging method was used. The "Regression Kriging" merging method with Modified Shepard interpolation was used to generate the merged data at the daily and dekadal time scales. The merged data are generated in three passes, with the bias-corrected data used as the first guess.
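The downscaling and bias-correction steps of this workflow can be sketched as follows. This is a simplified illustration under stated assumptions, not CDT's code: the function names are hypothetical, the lapse rate is taken as the slope of an ordinary least-squares fit of station temperature on elevation, the bias coefficient is assumed to be the ratio of station to gridded totals, and distances are plain Euclidean distances in degrees. The station limits (3 to 9) and the radius of influence follow the text.

```python
import math


def fit_lapse_rate(elevations_m, temperatures_c):
    """Least-squares fit of T = intercept + slope * elevation; returns (intercept, slope)."""
    n = len(elevations_m)
    mean_z = sum(elevations_m) / n
    mean_t = sum(temperatures_c) / n
    sxx = sum((z - mean_z) ** 2 for z in elevations_m)
    sxy = sum((z - mean_z) * (t - mean_t)
              for z, t in zip(elevations_m, temperatures_c))
    slope = sxy / sxx
    return mean_t - slope * mean_z, slope


def adjust_to_elevation(coarse_temp_c, coarse_elev_m, fine_elev_m, slope):
    """Shift a temperature taken from the coarse grid to the elevation of a fine grid cell."""
    return coarse_temp_c + slope * (fine_elev_m - coarse_elev_m)


def station_bias_coefficient(station_values, gridded_values):
    """Multiplicative coefficient: station total divided by gridded total at the station."""
    return sum(station_values) / sum(gridded_values)


def idw_bias(grid_point, stations, radius_deg, min_stations=3, max_stations=9, power=2.0):
    """Inverse-distance-weighted bias coefficient at grid_point = (lon, lat).

    stations is a list of (lon, lat, coefficient). Returns None when fewer than
    min_stations fall inside the radius of influence.
    """
    lon0, lat0 = grid_point
    near = []
    for lon, lat, coef in stations:
        dist = math.hypot(lon - lon0, lat - lat0)
        if dist <= radius_deg:
            near.append((dist, coef))
    if len(near) < min_stations:
        return None
    near.sort(key=lambda pair: pair[0])
    near = near[:max_stations]  # keep only the closest stations
    if near[0][0] == 0.0:
        return near[0][1]  # grid point coincides with a station
    weights = [1.0 / dist ** power for dist, _ in near]
    return sum(w * coef for w, (_, coef) in zip(weights, near)) / sum(weights)
```

A bias-corrected value would then be the gridded value multiplied by the coefficient returned by `idw_bias`, with 1.0 degree as the radius for rainfall and 1.5 degrees for temperature.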

Technical Validation of ENACTS-BMD and comparison with other products
A meticulous validation of the ENACTS-BMD data against observations from BMD stations was carried out using CDT's leave-one-out cross-validation (LOOCV). In LOOCV, one of the n observations is set aside for validation and the merging is performed with the remaining n-1 observations. In a second step, the merged value at the location of the withheld station is extracted. This is repeated n times, so that each observation has been left out once by the end of the procedure. The evaluation is performed on daily and dekadal data. For daily data, time series from 2005 to 2018 are used for both rainfall and temperature, whereas for dekadal data the time series run from 1991 to 2018. A point-to-grid analysis was performed to compare and validate the performance of the gridded data in relation to station observations: the values of the grid cells containing the station locations were extracted.
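The leave-one-out loop described above can be sketched in Python as follows. The `estimate` callable stands in for the full merging procedure, and `mean_of_others` is a deliberately trivial, hypothetical placeholder used only to make the example runnable. Neither is part of CDT.

```python
def leave_one_out(stations, estimate):
    """Run leave-one-out cross-validation over station observations.

    stations: list of (lon, lat, value) tuples.
    estimate: callable(lon, lat, remaining_stations) returning the value
              estimated at (lon, lat) without the withheld station.
    Returns a list of (observed, estimated) pairs, one per station.
    """
    pairs = []
    for i, (lon, lat, observed) in enumerate(stations):
        remaining = stations[:i] + stations[i + 1:]  # the n-1 other stations
        pairs.append((observed, estimate(lon, lat, remaining)))
    return pairs


def mean_of_others(lon, lat, remaining):
    """Placeholder estimator: the mean of the remaining station values."""
    return sum(value for _, _, value in remaining) / len(remaining)


stations = [(90.0, 24.0, 10.0), (91.0, 24.0, 20.0), (90.0, 25.0, 30.0)]
pairs = leave_one_out(stations, mean_of_others)
```

The resulting observed and estimated pairs are the input to the skill scores used in the evaluation.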
The scatter plots of rainfall and of minimum and maximum temperature comparing station data with ENACTS-BMD values extracted at the station locations are shown in figures 7 and 8 for daily and dekadal data, respectively. In general, the association between ENACTS-BMD rainfall (temperature) and station data is moderate to strong for both daily and dekadal data. The performance of ENACTS-BMD was also assessed by computing a set of skill scores. These scores are described at https://www.cawcr.gov.au/projects/verification/ and are implemented within CDT. The performance of the rainfall results was evaluated using both categorical and continuous verification metrics, because rainfall has a mixed nature as both a categorical (occurrence/non-occurrence) and a continuous (amount) process. The evaluation includes pairwise continuous statistics to assess the performance of the rainfall estimates in terms of amount, as well as categorical statistics to assess the ability to detect rain and no-rain events. For the categorical statistics, a threshold value of 1.0 mm is used for both daily and dekadal data. For the ENACTS-BMD temperature data, the evaluation is performed using only continuous statistical indicators. Although CDT offers a long list of skill scores for performance assessment, for brevity we discuss only seven of them: five continuous and two categorical, for daily and dekadal rainfall and temperature data.
The five continuous skill scores are: (1) Pearson's correlation coefficient (CORR), which varies between -1 and 1, with 1 as the perfect score; (2) the Nash-Sutcliffe efficiency coefficient (NSE), which varies from minus infinity to 1, with 1 as the perfect score; (3) the multiplicative bias (BIAS), for which a value greater than 1 indicates overestimation and a value less than 1 indicates underestimation; (4) the mean absolute error (MAE), which has 0 as the perfect score and is less sensitive to outliers; and (5) the root mean square error (RMSE), which measures the average magnitude of the error; a low RMSE value indicates small errors overall and, in particular, few large (extreme) errors. The two categorical skill scores are the probability of detection (POD) and the false alarm ratio (FAR), which are computed from a contingency table in order to assess the successes and failures in detecting the occurrence of rain. POD indicates the fraction of rainy events in the station observations that are correctly detected by the merged data; it varies between 0 (no detection) and 1 (perfect detection). FAR is the fraction of events identified by the merged data but not confirmed by station observations; a value of 0 is the perfect score.
The results of the skill scores presented above for the ENACTS-BMD dataset with respect to the BMD station data, for all stations at the daily and dekadal scales, are presented in Table 1. The spatial distribution of each skill score was also calculated for each station location. The BIAS and CORR for rainfall and minimum temperature are shown in figures 9 and 10, respectively.
Table 1. Skill scores for ENACTS-BMD with respect to the station dataset for all station locations. As POD and FAR are categorical skill scores, they are not applicable (NA) to temperature data.
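For reference, the seven scores can be written out in a short Python sketch. It follows the standard textbook definitions and is not taken from CDT's source. The function names are hypothetical, and treating a value equal to the 1.0 mm threshold as rain is an assumption.

```python
import math


def continuous_scores(obs, est):
    """Return CORR, NSE, BIAS, MAE and RMSE for paired observed/estimated values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    sq_err = sum((e - o) ** 2 for o, e in zip(obs, est))
    return {
        "CORR": cov / math.sqrt(var_o * var_e),
        "NSE": 1.0 - sq_err / var_o,
        "BIAS": sum(est) / sum(obs),  # multiplicative bias
        "MAE": sum(abs(e - o) for o, e in zip(obs, est)) / n,
        "RMSE": math.sqrt(sq_err / n),
    }


def categorical_scores(obs, est, threshold=1.0):
    """Return POD and FAR from the rain/no-rain contingency table.

    A value >= threshold counts as rain (assumed convention).
    """
    hits = misses = false_alarms = 0
    for o, e in zip(obs, est):
        obs_rain, est_rain = o >= threshold, e >= threshold
        if obs_rain and est_rain:
            hits += 1
        elif obs_rain:
            misses += 1
        elif est_rain:
            false_alarms += 1
    pod = hits / (hits + misses) if hits + misses else None
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else None
    return {"POD": pod, "FAR": far}
```

With `obs` as the station values and `est` as the merged values extracted at the same locations, the two functions reproduce the continuous and categorical parts of the evaluation, respectively.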