1. Introduction
The study of atmospheric processes is crucial for understanding environmental change, predicting climate variations, and developing strategies to mitigate the effects of dangerous natural phenomena. Despite significant advances in atmospheric science, many aspects of the atmosphere's behavior remain poorly understood because of its complexity. Identifying connections between atmospheric characteristics and the external factors that affect its state is a challenging, multidimensional task [1, 2]. Lidar (Light Detection and Ranging) systems are a powerful tool for remote sensing of aerosol formations in the atmosphere, including high-level clouds (HLCs). However, processing and interpreting lidar data often requires complex analysis methods. In this context, it is relevant to consider machine learning (ML) techniques for processing data from remote sensing experiments on the natural environment, as well as for solving inverse problems in atmospheric physics, ecology, and related fields. These methods make it possible to detect and investigate interrelationships between various parameters in large volumes of data that would be difficult to identify through classical statistical analysis [3, 4, 5, 6].
The solar radiation flux reaching the Earth's surface is formed by a combination of direct and scattered components. Scattered radiation results from the interaction of direct solar radiation with atmospheric gases (molecular scattering), water droplets in clouds and fog, ice crystals in clouds, and aerosol particles. Most of the solar radiation energy that reaches the Earth's surface lies in the short-wavelength region of the solar spectrum, with wavelengths ranging from approximately 300 to 4,000 nm [7]. The scattered radiation flux depends on the transparency of the atmosphere and is primarily determined by the amount of cloud in the sky and by the optical and microphysical properties of the clouds. Aerosol particles in the atmosphere affect the Earth's radiation budget. They can scatter and absorb radiation, directly changing the amount of radiation reaching a particular location. In addition, aerosol particles affect the Earth's radiation budget indirectly by acting as condensation nuclei for water vapor and thereby accelerating cloud formation [8]. Under clear weather conditions, scattered radiation accounts for approximately 15–20% of the total radiation in both warm and cold weather [9].
The relationship between the solar radiation flux reaching the Earth's surface and the transparency of the atmosphere is studied worldwide because a change in the Earth's radiation budget may indicate increased air pollution associated with growing anthropogenic emissions, which leads to changes in regional weather and in the global climate [10]. For example, in [11], the characteristics of solar radiation in the surface layer of the atmosphere were studied under various air pollution conditions over Nanjing, China. Changes in the solar radiation flux affect the surface temperature, the evaporation and condensation of water vapor, the water cycle and, more generally, the Earth's ecosystem [12]. Cirrus clouds have less effect on the transmission of direct solar radiation than other clouds because of their small optical thickness. At the same time, the large horizontal extent of such clouds, which can reach a thousand kilometers [13], as well as the presence of horizontally oriented ice particles in them, significantly affect the fluxes of scattered solar radiation. A noticeable decrease in the flux of scattered solar radiation in the near-zenith region of the sky caused by specular areas of HLCs, compared with non-specular areas of the same cloud, was first established in laser polarization sensing experiments on HLCs with a ground-based lidar pointed at the zenith [14]. This phenomenon is observed even when the optical thickness of the specular area of the cloud is smaller than that of the non-specular one.
In [15], ML tools were employed to investigate the correlation between the elements of the backscattering phase matrix (BSPM) of HLCs and meteorological conditions. It is well known, however, that the cloud type and the Sun's altitude above the horizon have a significant impact on the radiation flux reaching the Earth's surface, for example, in the Arctic [16, 17]. Consequently, a hypothesis has been proposed that the HLC BSPM elements determined in polarization laser sensing experiments are independent of the Sun's zenith and azimuth angles. At night, the Sun is absent from the sky entirely, and during the day its effect is mitigated by correcting the lidar signals for the background noise. The methodological aspects of evaluating the lidar signal from clouds against the background of the daytime sky are described in [18, 19]. The purpose of this work was to examine the effect of scattered solar radiation, which depends on the Sun's zenith and azimuth angles, on the performance of the algorithms used in [15]. The next sections describe the experimental setup, data collection, and analysis methods used in this study.
2. Materials and Methods
The data on the HLC BSPM for this study were collected using the high-altitude matrix polarization lidar (HAMPL) developed at the National Research Tomsk State University (NR TSU) [20]. The measurements were performed from 2009 to 2024. To assess the meteorological situation at the altitudes of the examined clouds, data from the ERA5 reanalysis of the European Centre for Medium-Range Weather Forecasts were used. The lidar is located in Tomsk and is pointed vertically at the zenith. The HAMPL block diagram is shown in Figure 1.
The HAMPL uses a Nd:YAG laser operating at a wavelength of 532 nm, with a pulse energy of up to 400 mJ and a pulse repetition rate of 10 Hz, as the optical radiation source. A Cassegrain telescope with a primary mirror diameter of 0.5 m and a focal length of 5 m is used as the receiving antenna. The lidar receiving system includes the ThorLabs FL532-1 interference filter with a central wavelength of 532 ± 0.6 nm and a transmission half-width of 3 ± 0.6 nm, followed by a Wollaston prism, which divides the received backscattered radiation into two orthogonally polarized beams. These beams are registered with two photomultiplier tubes (PMTs) operating in the photon-counting mode with time strobing of the signal, which provides an altitude resolution from 37.5 to 150 m [21]. To suppress active backscattering interference from the near zone of the lidar (up to 3 km), electro-optical shutters (EOSs) based on a potassium dideuterium phosphate (DKDP) crystal are installed in front of the PMTs. The EOSs keep the PMT response linear even during daytime lidar operation at the maximum energy of the probing pulse.
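The link between time strobing and altitude resolution can be illustrated with the standard round-trip relation Δz = cτ/2. The sketch below is only an illustration: the gate durations it prints are derived from the quoted resolutions and are not stated in the text, and the function names are hypothetical.

```python
# Speed of light in vacuum, m/s (the refractive index of air is neglected here).
C = 299_792_458.0


def range_resolution(gate_s: float) -> float:
    """Altitude resolution (m) for a counting gate of duration gate_s (s)."""
    # The pulse travels to the scattering volume and back, hence the factor 1/2.
    return C * gate_s / 2.0


def gate_for_resolution(dz_m: float) -> float:
    """Gate duration (s) that corresponds to an altitude resolution dz_m (m)."""
    return 2.0 * dz_m / C


# Resolutions quoted in the text; the gate durations are inferred, not sourced.
for dz in (37.5, 150.0):
    gate = gate_for_resolution(dz)
    print(f"{dz:6.1f} m  <->  gate of about {gate * 1e9:.0f} ns")
```

Under this relation, the quoted limits of 37.5 m and 150 m correspond to gates of roughly 250 ns and 1 µs, respectively.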
During each sensing cycle, pulses of radiation with four different polarization states (three linear and one circular) were sent to the atmosphere one by one. For each pulse, the polarization state of backscattered radiation described by the Stokes vector was determined. Thus, 16 intensity vertical profiles, from which 16 BSPM elements were calculated, were measured in each sensing cycle. The HAMPL provided the registration of lidar returns from HLC in the parallel accumulation mode of 16 arrays of single-electron pulses. This mode allowed the intensity of all 16 lidar signals from the clouds, which were necessary to determine all elements of the BSPM, to be estimated with the same error. During sensing in this mode, there was a continuous change of polarization elements in the transmitting and receiving systems of the lidar, due to which the minimum time of a complete cycle of measurements for determining all BSPM elements was 2 s. Thus, the movement of the examined air volumes falling into the field of view of the telescope affected the measurements in the same way with each of the used combinations of the polarization states of sensing and received radiation [
20,
21]. Lidar signal processing is based on the laser sensing equation (LSE), which in vector form [22] reads:
\[
P(z)\,\mathbf{s}(z) = \frac{1}{2}\, c\, W_0\, A\, G(z)\, z^{-2}\, \mathbf{M}_\pi(z)\, \mathbf{s}_0 \exp\left\{ -2 \int_0^z \varepsilon(z', \theta, \varphi)\, dz' \right\},
\]
where P(z) and s(z) are the power and the normalized Stokes vector parameter of the radiation incident on the input of the lidar receiving system from the scattering volume located on the sensing path at a distance z from the source, respectively; c is the speed of light in the medium; W0 = P0Δt is the pulse energy of the lidar transmitter (P0 and Δt are the power and duration of the laser pulse, respectively); A is the effective area of the receiving antenna; G(z) is the geometric factor (when sensing high-level clouds, G(z) usually equals 1); Mπ(z) is the normalized BSPM of the scattering volume; s0 is the normalized Stokes vector parameter of the sensing radiation; ε(z', θ, φ) is the attenuation coefficient; and θ and φ are the polar and azimuthal angles, respectively. Note that the LSE in this form neglects the dependence of the radiation attenuation on the propagation direction relative to the axes characterizing the medium anisotropy. It also neglects the possible change in the polarization state of direct and scattered radiation as it passes through a section of the sensing path located in an anisotropic medium. Note also that P(z) is corrected for the background noise, i.e., the background level is subtracted from the initially measured value.
Special attention should be paid to the characteristics of the examined clouds that are determined in lidar measurements. The key one is the BSPM, which mathematically is the operator that transforms the Stokes vector parameter describing the polarization state of the sensing radiation into the Stokes vector parameter of the backscattered radiation received by the lidar. The effect of the scattering volume on the polarization state of radiation is determined by its microstructure, namely the parameters of the distributions of solid cloud particles by shape, size, and spatial orientation. In other words, physically, the BSPM contains information about the cloud microstructure. We draw the reader's attention to the fact that each BSPM element is determined by ratios of the components of the Stokes vector parameter of the received radiation for various polarization states of the sensing radiation. These components are in turn determined not from raw lidar signals, or even from background-corrected lidar signals alone, but also from their ratios. Thus, the background noise does not enter the BSPM elements explicitly, and, by taking the background level into account at the early stages of lidar signal processing, we free the HLC optical characteristics obtained in the lidar experiment from its effect.
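The role of ratios and of the background correction can be illustrated with a small numerical sketch. It assumes a hypothetical diagonal normalized BSPM with illustrative values and an unpolarized background that adds only to the intensity component; neither assumption comes from the measurements described here. The sketch shows that normalizing the Stokes vector removes a common multiplicative factor, whereas an additive background left in the signal biases the result, which is why the background is subtracted first.

```python
from typing import List, Sequence


def apply_matrix(m: Sequence[Sequence[float]], s0: Sequence[float]) -> List[float]:
    """Transform a Stokes vector s0 with a 4x4 matrix m."""
    return [sum(m[i][j] * s0[j] for j in range(4)) for i in range(4)]


def normalize(stokes: Sequence[float]) -> List[float]:
    """Divide all Stokes components by the intensity (first component)."""
    intensity = stokes[0]
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    return [component / intensity for component in stokes]


# Hypothetical normalized BSPM (m11 = 1); the values are illustrative only.
BSPM = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.0, 0.0, -0.6, 0.0],
    [0.0, 0.0, 0.0, -0.2],
]
S0_LINEAR = [1.0, 1.0, 0.0, 0.0]  # linearly polarized sensing radiation

received = apply_matrix(BSPM, S0_LINEAR)
scaled = [1.0e4 * x for x in received]  # same volume, different signal level

# Assumed unpolarized background: it adds to the intensity component only.
background = 500.0
biased = [scaled[0] + background] + scaled[1:]

print(normalize(received))  # reference normalized Stokes vector
print(normalize(scaled))    # common factor cancels in the ratios
print(normalize(biased))    # uncorrected background distorts the ratios
```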
The NR TSU HAMPL is located in the southern part of Tomsk (56°26' N, 84°58' E), about 0.5 km from the bank of the Tom’ River. Measurements are performed in the absence of low clouds, precipitation, and gusty winds. One series of lidar measurements usually lasts 16 minutes and 40 seconds. The distributions of the number of lidar measurements and the number of cases of HLC registration by year and season were published previously [20, 21]. To evaluate the meteorological conditions at the altitudes of the examined clouds, we relied on data from the ERA5 reanalysis provided by the European Centre for Medium-Range Weather Forecasts [21]. The most reliable source of data on the vertical profiles of meteorological parameters is radiosonde observations; however, the nearest stations to Tomsk where radiosondes are launched regularly (Kolpashevo and Novosibirsk) are approximately 200–230 kilometers away. Although the meteorological conditions at cloud formation altitudes are usually similar in both data sets, the ERA5 reanalysis offers a higher temporal resolution (1 hour compared to 12 hours). With its high spatial resolution (30 × 30 km), the ERA5 reanalysis also provides the vertical profiles of the meteorological parameters at the coordinates of Tomsk [23]. The input data for ERA5 are measurements from various sources around the globe: satellite radiometers, ground-based, ship-based and aircraft weather stations, weather buoys, balloon-borne sensors, and ground-based radars [24]. At the HLC formation altitude, the vertical resolution of the ERA5 data is 25–50 hPa, which is approximately equivalent to 0.5–1 km. To evaluate the meteorological situation at the formation altitudes of the examined clouds, the altitudes of their boundaries are determined from the lidar measurement data, and the corresponding reanalysis data are then selected by coordinates, date, time, and altitude.
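The conversion between a pressure step and an altitude step can be checked with the hypsometric relation Δz = (R T / g) ln(p_lower / p_upper). The following sketch is a rough check only: the pressure levels and layer-mean temperatures are assumed values typical of the upper troposphere, not data from this study, and the function name is hypothetical.

```python
import math

R_DRY = 287.05  # specific gas constant of dry air, J/(kg K)
G0 = 9.80665    # standard gravitational acceleration, m/s^2


def layer_thickness(p_lower_hpa: float, p_upper_hpa: float, mean_temp_k: float) -> float:
    """Hypsometric thickness (m) of the layer between two pressure levels."""
    if p_lower_hpa <= p_upper_hpa:
        raise ValueError("p_lower_hpa must exceed p_upper_hpa")
    return R_DRY * mean_temp_k / G0 * math.log(p_lower_hpa / p_upper_hpa)


# Assumed examples: a 50 hPa step near 400 hPa and a 25 hPa step near 250 hPa.
examples = [
    (400.0, 350.0, 240.0),
    (250.0, 225.0, 220.0),
]
for p_low, p_up, temp in examples:
    dz = layer_thickness(p_low, p_up, temp)
    print(f"{p_low:.0f} -> {p_up:.0f} hPa at {temp:.0f} K: about {dz:.0f} m")
```

With these assumed values, both steps come out between several hundred meters and about one kilometer, in line with the 0.5–1 km estimate.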
To test the hypothesis that the signals measured in lidar experiments are independent of the Sun's zenith and azimuth angles, these angles were calculated using a method similar to that described in [25]. The input parameters for the calculations were the coordinates and time zone of the observation site and the date and time of the measurement. A known issue with angular values is the discontinuity at 360 degrees: angles that are adjacent on the circle can differ greatly in their numerical representation. For example, the angles 359° and 1° are only 2° apart, but their numerical values differ by 358, which introduces a sharp transition and can lead to errors when the raw values are processed by the algorithms. To ensure that angles adjacent on the circular scale are handled consistently, we converted the angular values into a continuous form by calculating their sine and cosine components. The four resulting continuous variables, the sine and cosine of the azimuth and zenith angles, were then used in the analysis.
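The sine and cosine encoding described above can be sketched as follows. The function and key names are hypothetical, and the example angles are arbitrary; the point is only that 359° and 1° end up close together in the encoded space.

```python
import math
from typing import Dict


def encode_solar_angles(azimuth_deg: float, zenith_deg: float) -> Dict[str, float]:
    """Replace two angles in degrees with four continuous features."""
    az = math.radians(azimuth_deg)
    zen = math.radians(zenith_deg)
    return {
        "az_sin": math.sin(az),
        "az_cos": math.cos(az),
        "zen_sin": math.sin(zen),
        "zen_cos": math.cos(zen),
    }


# Two Sun positions just west and just east of north (arbitrary example values).
west = encode_solar_angles(359.0, 60.0)
east = encode_solar_angles(1.0, 60.0)

# Gap between the raw azimuth values versus the gap in the encoded plane.
raw_gap = abs(359.0 - 1.0)
encoded_gap = math.hypot(west["az_sin"] - east["az_sin"],
                         west["az_cos"] - east["az_cos"])

print(f"raw gap: {raw_gap:.0f} degrees")
print(f"encoded gap: {encoded_gap:.4f}")
```

Both the sine and the cosine are needed for each angle, because either one alone maps two different angles to the same value.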
In a previous study [15], the relationship between the HLC BSPM elements and altitude was investigated, and no significant correlation was found. Based on these results, in this study the median values over the experimental sample were used as the values of the BSPM elements for each series of lidar measurements. The article [15] also demonstrates that only the elements on the main diagonal of the matrix (m22, m33, and m44) have a specific dependence on meteorological variables. The block diagram of our methodology is shown in Figure 2.
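The median aggregation of BSPM elements over one measurement series might look like the sketch below. The data layout, the function name, and all numerical values are hypothetical; only the use of the median per element follows the text.

```python
from statistics import median
from typing import Dict, List


def median_bspm(series: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse the altitude samples of each BSPM element to their median."""
    return {name: median(values) for name, values in series.items()}


# Hypothetical values of the diagonal elements at several altitudes in a cloud.
series = {
    "m22": [0.58, 0.61, 0.60, 0.95, 0.59],
    "m33": [-0.55, -0.57, -0.56, -0.10, -0.58],
    "m44": [-0.20, -0.22, -0.18, -0.21, -0.19],
}

print(median_bspm(series))
```

The median is less sensitive than the mean to isolated outlying samples, such as the fourth value of m22 and m33 in this made-up series.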
As mentioned above, the dataset used in this research was built from the median values of the BSPM elements and the corresponding date, time, coordinates, and altitude, together with the values of the meteorological parameters from the ERA5 data [26]. This dataset was then extended with the sine and cosine values of the Sun's azimuth and zenith angles. Machine learning models were then trained on two versions of the dataset: one containing only meteorological data and the other also incorporating solar position data. Next, the trained models were evaluated and compared on a held-out sample. Random Forest (RF) models were used, including a version with data preprocessing by principal component analysis (RF+PCA). The held-out validation sample used to assess the accuracy of the models' predictions of HLC characteristics was formed from data covering the period from February 15, 2020, to September 22, 2023 (65 observations). This validation set was not used during the training phase, which allows an unbiased assessment of the models' predictive ability.
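The preparation of the two dataset versions and the held-out sample can be sketched as follows. This is an illustration under stated assumptions: the feature names are hypothetical, the records are invented, and the sketch assumes that every series inside the quoted date window goes to validation, which the text does not state explicitly. Model training itself (RF and RF+PCA) is not shown.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]

# Validation window quoted in the text.
VALIDATION_START = date(2020, 2, 15)
VALIDATION_END = date(2023, 9, 22)

# Hypothetical column names; the actual feature names are not given in the text.
METEO_FEATURES = ["temperature", "relative_humidity", "wind_u", "wind_v"]
SOLAR_FEATURES = ["az_sin", "az_cos", "zen_sin", "zen_cos"]


def split_by_date(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (training, validation) records based on the measurement date."""
    train: List[Record] = []
    validation: List[Record] = []
    for record in records:
        if VALIDATION_START <= record["date"] <= VALIDATION_END:
            validation.append(record)
        else:
            train.append(record)
    return train, validation


def feature_matrix(records: List[Record], with_solar: bool) -> List[List[float]]:
    """Build the input rows for one of the two dataset versions."""
    columns = METEO_FEATURES + (SOLAR_FEATURES if with_solar else [])
    return [[record[name] for name in columns] for record in records]


# Two invented records, one outside and one inside the validation window.
records = [
    {"date": date(2015, 6, 1), "temperature": 225.0, "relative_humidity": 60.0,
     "wind_u": 12.0, "wind_v": -3.0,
     "az_sin": 0.5, "az_cos": -0.87, "zen_sin": 0.64, "zen_cos": 0.77},
    {"date": date(2021, 3, 10), "temperature": 220.0, "relative_humidity": 55.0,
     "wind_u": 20.0, "wind_v": 5.0,
     "az_sin": -0.2, "az_cos": -0.98, "zen_sin": 0.9, "zen_cos": 0.44},
]

train, validation = split_by_date(records)
x_meteo = feature_matrix(train, with_solar=False)
x_meteo_solar = feature_matrix(train, with_solar=True)
print(len(train), len(validation), len(x_meteo[0]), len(x_meteo_solar[0]))
```

Training the same model type on both feature matrices and scoring each on the same validation records keeps the comparison between the two dataset versions like for like.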