1. Introduction
Low-light images are generally caused by factors such as insufficient ambient light, limitations or malfunctions of the capture equipment, sensor noise, or improperly set shooting parameters. Such images make it difficult to perform downstream vision tasks such as identity recognition and target detection. Low-light image enhancement algorithms are widely used in security surveillance, medical imaging and other fields [1,2,3,4]. How to enhance the luminance of an image, recover its colour and restore its texture details, while also keeping the computation efficient, is therefore a key question, and studying it is essential for improving images taken in low light. Low-light image enhancement is one of the hot topics in the field of image processing.
Image enhancement algorithms can be classified into two main categories: conventional methods based on manually constructed parameters, and deep learning-based methods.
Figure 1.
A comprehensive comparison of low-light image enhancement methods, encompassing state-of-the-art approaches including HEP [44], URetinex [19] and FMR-Net [20], through quantitative and qualitative analysis (PSNR/SSIM/parameters). Compared with other similar methods, the method in this paper achieves better enhancement results with fewer model parameters.
Development of traditional methods based on artificially constructed parameters began in the mid-to-late 20th century. These methods improve the overall visual effect by adjusting the illumination, contrast and colour of an image using hand-set parameters. Typical algorithms include histogram equalisation [5,6,7], gamma correction [8,9,10] and Retinex [11,12,13,14]. Histogram equalisation improves image contrast by redistributing pixel values; gamma correction adjusts the luminosity and contrast of an overexposed or underexposed image using a non-linear transformation; and the Retinex algorithm, proposed by Land and McCann, is rooted in the human visual system and corrects uneven lighting by separating the illumination and reflection components. Although these traditional algorithms can effectively improve image clarity and visual effect, they rely on expert experience, adapt poorly to new scenes, and offer only modest enhancement.
Deep learning-based image enhancement algorithms are gradually becoming the mainstream, as their results are generally better than those of traditional algorithms. According to the training strategy, they fall into two categories: supervised and unsupervised methods. Supervised methods require paired datasets captured under low and normal lighting conditions, obtained through synthesis, shooting or data augmentation, and use the paired data to train image enhancement models. Common low-light image enhancement datasets include the LOL dataset [15], the MIT-Adobe FiveK dataset [16], the LIME dataset [17], etc.
Zhang et al. [18] proposed KinD (Kindling the Darkness), which uses supervised learning to train the network on paired image datasets with good results. The network first decomposes the image into illumination and reflectance components; the illumination component is then used for lighting adjustment, while the reflectance component is used for removing degradations. However, it requires multi-stage training: the convolutional neural networks (CNNs) used to decompose colour images, denoise the reflectance, and adjust the illumination must be trained independently and then connected for end-to-end fine-tuning, which complicates and lengthens training. As an extension of the Retinex framework, Wu et al. [19] put forward a deep unrolling network called URetinex-Net, which makes use of paired data and implements image enhancement through hand-designed priors and an optimization-driven approach. This method achieves excellent results in noise suppression and detail preservation, but its strong dependence on training data and high computational complexity limit its generalization to complex unknown scenes. FMR-Net, proposed by Chen et al. [20], is a fast multiscale residual network that rapidly boosts low-light image quality by combining a highly optimised residual block with a multibranch structure while maintaining image details and contrast, but it suffers from serious loss of high-frequency details and insufficient noise suppression.
The above methods are mainly based on supervised learning. In order to minimise dependency on paired illumination datasets, many scholars have put forward unsupervised image enhancement methods for low-light conditions. Jiang et al. [21] proposed EnlightenGAN, which uses generative adversarial networks for unsupervised learning and introduces a self-regularization mechanism, but it suffers from limited generalization and unstable training. Guo et al. [22] proposed a zero-reference deep curve estimation method, Zero-DCE, which learns dynamic range adjustment through image-specific curves; its problems are limited accuracy under extreme conditions and the dependence of the non-reference loss functions on empirical parameters (e.g., exposure values). Ma et al. [23] proposed a self-calibrating illumination (SCI) learning framework, which achieves fast convergence of multi-stage outputs to a consistent state through a progressive illumination estimation module with shared weights and a self-calibration module, thus requiring only single-stage inference at test time. Combined with unsupervised training losses (fidelity and smoothing constraints), it improves the model's adaptability to complex low-light scenes. However, the network still suffers from detail loss and colour bias.
Currently, low-light image enhancement algorithms mainly suffer from degraded detail features, inadequate contrast enhancement, and a high computational burden. For this reason, this paper proposes a low-light image enhancement network founded upon multi-scale feature fusion and adaptive contrast adjustment, and designs a local-global image feature fusion module (LG-IFFB) and an adaptive image contrast enhancement module (AICEB). LG-IFFB extracts multi-scale image features through a local-global dual-branch structure and uses element-by-element multiplication to fuse local details with the global light distribution, alleviating the loss of image detail; AICEB combines a linear contrast enhancement formula with a confidence-stability-driven adaptive stopping mechanism to dynamically adjust the computational depth according to the confidence of the feature map, balancing contrast enhancement and computational efficiency. The primary contributions of this study are summarised below:
1. A low-light image enhancement network based on multi-scale feature fusion and adaptive contrast adjustment is proposed, achieving low-light image enhancement through a lightweight architecture that synergizes multi-scale image feature fusion with dynamic optimization of image contrast.
2. A local-global image feature fusion module (LG-IFFB) is designed, which adopts a dual-path structure of a local branch and a global branch to simultaneously extract local and global information at different scales of the image, realizing a balance between detail preservation and global light optimization and providing a parameter mapping better suited to complex low-light scenes for the subsequent luminance enhancement network.
3. An adaptive image contrast enhancement module (AICEB) is proposed, which consists of multiple iterative sub-modules, each of which dynamically generates contrast enhancement factors and luminance parameters through an adaptive attention normalization block (AANBlock). A confidence scoring mechanism is introduced in the module to realize adaptive contrast enhancement, effectively balancing contrast enhancement and computational efficiency.
2. Methods
The MSF-ACA model architecture is illustrated in Figure 2. First, the low-light image passes through a convolution that expands the number of feature-map channels to 24. Then the local-global image feature fusion module (LG-IFFB) extracts and fuses the multi-scale image features U. In the luminance enhancement network, the multi-scale features U are split uniformly along the channel dimension, and each group of the split features serves as a luminance-tuning parameter map used to adjust the luminance of the original image m. Specifically, through an optimization process of eight iterations, the luminance-enhanced feature map Θ is produced. Θ then forms part of the input to two subsequent cascaded adaptive image contrast enhancement blocks (AICEBs): the input to the first AICEB is the pair (Θ, Θ) and its output is the intermediate feature map Y; the input to the second AICEB is (Y, Θ) and its output is the intermediate feature map Z. Each AICEB dynamically generates contrast factors and luminance parameters, and adaptively terminates redundant computation through a confidence scoring mechanism to balance performance and efficiency. The two AICEBs optimize the features sequentially in a recursive form to progressively improve image contrast and brightness. Finally, MSF-ACA fuses the channels and reconstructs the image using a 3×3 convolution. The main structure of MSF-ACA follows the local-global residual connection learning mechanism of [24].
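Before detailing the individual modules, the overall data flow may be easier to follow as code. The sketch below is a schematic reading of Figure 2, not the authors' implementation: each sub-module is replaced by a plain convolution placeholder (LG-IFFB, the luminance network and the two AICEBs are sketched in the following subsections), and the output clamping is an assumption.

```python
import torch
import torch.nn as nn

class MSFACAFlow(nn.Module):
    """Data-flow sketch of MSF-ACA (Figure 2). Sub-modules are plain convolution
    stand-ins for LG-IFFB, the luminance enhancement network and the AICEBs
    detailed in Sections 2.1-2.3; only the wiring and tensor shapes follow the text."""

    def __init__(self, c=24):
        super().__init__()
        self.head = nn.Conv2d(3, c, 3, padding=1)        # expand image to 24 feature channels
        self.lg_iffb = nn.Conv2d(c, c, 3, padding=1)     # placeholder for LG-IFFB (Section 2.1)
        self.lum = nn.Conv2d(c + 3, c, 3, padding=1)     # placeholder for luminance network (Section 2.2)
        self.aiceb1 = nn.Conv2d(2 * c, c, 3, padding=1)  # placeholder for first AICEB (Section 2.3)
        self.aiceb2 = nn.Conv2d(2 * c, c, 3, padding=1)  # placeholder for second AICEB
        self.tail = nn.Conv2d(c, 3, 3, padding=1)        # fuse channels, reconstruct the image

    def forward(self, m):                                # m: (B, 3, H, W) low-light image in [0, 1]
        u = self.lg_iffb(self.head(m))                   # fused multi-scale features U
        theta = torch.relu(self.lum(torch.cat([m, u], dim=1)))         # luminance-enhanced features Θ
        y = torch.relu(self.aiceb1(torch.cat([theta, theta], dim=1)))  # first AICEB takes (Θ, Θ) -> Y
        z = torch.relu(self.aiceb2(torch.cat([y, theta], dim=1)))      # second AICEB takes (Y, Θ) -> Z
        return torch.clamp(self.tail(z), 0.0, 1.0)       # reconstructed enhanced image (assumed clamp)
```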
2.1. Local-Global Image Feature Fusion Block
The structure of the local-global image feature fusion block (LG-IFFB) is illustrated in Figure 2(B). Its processing is divided into two main steps: first, multi-scale feature extraction; second, image feature fusion.
Inspired by the structure of RFDB [24] and RRFB [24], LG-IFFB employs a multi-scale reparameterization technique for feature extraction, which outputs image features at six different scales (denoted L1 to L6) by cascading six kinds of convolutional kernels to preserve as much image texture information as possible. To reduce memory cost, an ADD operation is used instead of a Concat operation for feature fusion. In addition, we split the input feature channels and select only one quarter of the total number of channels for feature extraction and fusion; the remaining channels are then directly combined with the fused features through a residual connection. This design improves computational efficiency and reduces feature redundancy within the cascade structure.
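The channel split, cascaded multi-scale extraction and ADD fusion described above can be sketched as follows. The six kernel sizes are illustrative assumptions (the paper uses reparameterized RFDB/RRFB-style blocks [24]), and in the full module the six intermediate features L1 to L6 are additionally passed to the dual-branch fusion described next.

```python
import torch
import torch.nn as nn

class MultiScaleSplitExtract(nn.Module):
    """Sketch of LG-IFFB's extraction stage: only 1/4 of the input channels pass
    through a cascade of convolutions at six scales (L1..L6); the ADD-fused result
    is joined residually with the untouched 3/4 of the channels."""
    def __init__(self, channels=24):
        super().__init__()
        self.split = channels // 4                          # channels sent through the cascade
        kernel_sizes = [1, 3, 3, 5, 5, 7]                   # assumed six kernel scales
        self.cascade = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        xs, xr = x[:, :self.split], x[:, self.split:]       # 1/4 extracted, 3/4 kept as-is
        feats, h = [], xs
        for conv in self.cascade:                           # cascaded multi-scale features L1..L6
            h = conv(h)
            feats.append(h)
        fused = torch.stack(feats, dim=0).sum(dim=0)        # ADD fusion instead of Concat
        return torch.cat([fused + xs, xr], dim=1)           # residual join with remaining channels
```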
Traditional multi-scale feature fusion methods [43] usually perform only simple feature splicing or summation and cannot deeply exploit the global contextual information of the features. To realize adaptive fusion of image features, LG-IFFB is designed with a two-branch structure: a local branch and a global branch. The local branch focuses on the optimal fusion of local features. The usual practice is to extract local features with a single 3×3 convolution [25,26,27], which ignores the correlation between multi-scale local features. For this reason, in this paper the input multi-scale image features (L1 to L6) first undergo a 1×1 convolution that expands their dimensionality. Depthwise convolution is then used to extract local information; in particular, for each scale, the extracted local features are spliced with the local features extracted at the other scales to strengthen the correlation. The final locally optimized features L1^L, ..., L6^L are obtained by processing the spliced features with depthwise convolutions and activation functions. In the global branch, the image features at each scale are first summed to form a combined representation U′, and the global feature descriptor S is obtained by global average pooling (GAP). The global feature descriptor S is then processed by a fully connected (FC) network consisting of two linear layers, a ReLU activation layer and a Sigmoid layer to obtain the feature attention weight S′. The attention weight S′ is multiplied with each scale feature for calibration, yielding the globally calibrated features L1^G, ..., L6^G. Finally, the locally optimized features (L1^L, ..., L6^L) are multiplied element by element with the globally calibrated features (L1^G, ..., L6^G) and fused to output the final fused feature map U. The fused feature map U contains both local details and global context information, which significantly improves the feature representation capability. Compared with DCE-Net [22] in Zero-DCE, which relies primarily on single-branch convolution to learn the mapping from the input image to the best-fit curves, LG-IFFB couples multi-scale local details with the global light distribution, and can therefore better address detail loss and contrast imbalance in extremely low-light or high-dynamic-range scenes.
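A condensed sketch of the dual-branch fusion is given below: the local branch is simplified to a per-scale 1×1 expansion plus depthwise convolution (the cross-scale splicing step is omitted for brevity), the global branch follows the GAP → two-layer FC → Sigmoid recalibration described above, and the element-wise product of the two branches is accumulated over the six scales. The expansion ratio and FC bottleneck width are assumptions.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Sketch of LG-IFFB's dual-branch fusion of the six scale features L1..L6."""
    def __init__(self, channels, num_scales=6, expand=2):
        super().__init__()
        self.local = nn.ModuleList(                          # simplified local branch per scale
            nn.Sequential(
                nn.Conv2d(channels, channels * expand, 1),                        # 1x1 expansion
                nn.Conv2d(channels * expand, channels * expand, 3, padding=1,
                          groups=channels * expand),                              # depthwise conv
                nn.ReLU(),
                nn.Conv2d(channels * expand, channels, 1),
            ) for _ in range(num_scales)
        )
        self.fc = nn.Sequential(                             # global branch: two linear layers
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, channels), nn.Sigmoid(),
        )

    def forward(self, feats):                                # feats: list of 6 tensors (B, C, H, W)
        u_prime = torch.stack(feats, dim=0).sum(dim=0)       # combined representation U'
        s = u_prime.mean(dim=(2, 3))                         # global descriptor S via GAP
        s_attn = self.fc(s)[:, :, None, None]                # attention weight S'
        fused = 0
        for branch, f in zip(self.local, feats):
            l_local = branch(f)                              # locally optimized feature Li^L
            l_global = f * s_attn                            # globally calibrated feature Li^G
            fused = fused + l_local * l_global               # element-wise multiply, then ADD-fuse
        return fused                                         # fused feature map U
```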
2.2. Luminance Enhancement Network
This study proposes a luminance enhancement function for the luminance enhancement network, based on the illumination adjustment curve framework of Zero-DCE. The recurrent mapping function adheres to two fundamental design criteria: (1) preservation of inter-pixel intensity relationships through enforced monotonicity constraints, and (2) computational simplicity and differentiability to enable efficient error backpropagation. The luminance enhancement function is formulated as a parametric quadratic transformation and can be expressed as:
E_p(m) = E_(p−1)(m) + S_p(m) · E_(p−1)(m) · (1 − E_(p−1)(m))   (1)

where S_p(m) is the pixel-by-pixel parameter matrix, E_p(m) is the luminance-enhanced image after p iterations, and E_(p−1)(m) is the enhanced image after p−1 iterations. Based on (1), this paper designs a luminance enhancement network, shown in Figure 2(C). The network has two inputs: the original low-light image m ∈ R^(H×W×3) and the fused feature map U ∈ R^(H×W×24) produced by LG-IFFB. U is partitioned into 8 identically sized pixel-by-pixel parameter maps S_p(m) (p = 1, ..., 8), which take part in the iterations of the function. Starting from the initial input E_0(m) = m, after 8 iterations the network outputs a 3-channel enhanced image E_8(m), which is subsequently expanded to 24 channels by a 3×3 convolution to generate the luminance-enhanced feature map Θ = Conv(E_8(m)) ∈ R^(H×W×24).
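The iteration of Eq. (1) over the eight parameter maps split from U can be written compactly as below. This is a sketch of the behaviour described in the text; whether the parameter maps are additionally bounded (e.g. by a tanh, as in Zero-DCE) is not specified and is omitted here.

```python
import torch
import torch.nn as nn

class LuminanceEnhancement(nn.Module):
    """Sketch of the luminance enhancement network: split U (B, 24, H, W) into
    8 pixel-wise parameter maps S_p (3 channels each), apply the quadratic curve
    of Eq. (1) eight times to the image m, then expand back to 24 channels."""
    def __init__(self, channels=24, iterations=8):
        super().__init__()
        self.iterations = iterations
        self.expand = nn.Conv2d(3, channels, 3, padding=1)   # Θ = Conv(E_8(m))

    def forward(self, m, u):                                 # m: (B, 3, H, W), u: (B, 24, H, W)
        params = torch.chunk(u, self.iterations, dim=1)      # 8 maps S_1(m)..S_8(m), 3 channels each
        e = m                                                # E_0(m) = m
        for s in params:
            e = e + s * e * (1.0 - e)                        # E_p = E_{p-1} + S_p * E_{p-1} * (1 - E_{p-1})
        return self.expand(e)                                # luminance-enhanced feature map Θ
```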
2.3. Adaptive Image Contrast Enhancement Block
Direct contrast enhancement of input features (e.g., linear stretching) introduces noise or distortion because the feature distribution is unstable. In the style transfer task, Vedaldi et al. [28] argued that instance normalization can eliminate contrast differences between image features so that the network focuses on learning content structure rather than luminance or colour distribution. Drawing on this idea, this paper uses a channel normalization technique to improve the stability of contrast enhancement of image features. The channel normalization operation normalizes each channel of the input feature map separately so that its mean is 0 and its variance is 1, i.e.:
CN(h_c) = (h_c − μ_c) / (σ_c + ε)

where h_c is the feature map of channel c, μ_c and σ_c are the mean and standard deviation of channel c, and ε is a stabilisation coefficient.
On this basis, this paper proposes an adaptive image contrast enhancement block (AICEB); its framework is illustrated in Figure 3(a). The AICEB consists of several iterative sub-modules, each of which contains one adaptive attention normalization block (AANBlock) and one ReLU activation layer. Figure 3(b) shows the detailed network structure of the AANBlock.
The AANBlock in the kth iterative sub-module has two inputs: the luminance-enhanced feature map Θ produced by the luminance enhancement network, and the output feature map X_k of the preceding iterative sub-module. For the first iterative sub-module, X_1 is the luminance feature map Θ itself. By learning from the luminance-enhanced feature map Θ, the AANBlock generates the parameters a and b needed for contrast adjustment, allowing the network to adaptively adjust the image contrast according to the luminance distribution. The specific realization is as follows:
In each iterative sub-module (take the kth as an example), the features of the luminance feature map Θ are first further extracted using a 3×3 convolution and a 5×5 convolution to obtain the feature map Θ′. The global information of Θ′ is then computed by global average pooling (GAP) and global maximum pooling (GMP), and the outputs of GAP and GMP are concatenated along the channel dimension to obtain global statistical features. These spliced statistics are processed by a fully connected (FC) layer followed by a rectified linear unit (ReLU) activation to generate the channel attention weight T. Multiplying Θ′ by the channel attention weight T produces the feature map α*; the ReLU activation guarantees that the pixel values of α* are non-negative. Based on α*, a 7×7 convolutional layer combined with a Sigmoid activation is used to generate the feature map β*. In the linear contrast enhancement formula, the pixel values of the feature maps α* and β* act as the parameters a and b.
Employing the parameters a and b derived from the feature maps α* and β*, a linear contrast transform is applied to the channel-normalized features to obtain the contrast-enhanced result X′_(k+1). After ReLU activation, the result X_(k+1) is the output of the kth iterative sub-module. In particular, to ensure effective enhancement of low-light image contrast, a bias term of magnitude 1 is added to α*, i.e., a = α* + 1. The specific formulation is set out below:
X′_(k+1) = a · ((X_k − μ) / σ) + b,   X_(k+1) = ReLU(X′_(k+1))

where X_k and X′_(k+1) denote the image features to be enhanced and the image features after contrast enhancement, respectively; a and b are the contrast enhancement parameters, i.e. the pixel values of the feature maps α* + 1 and β*; σ and μ represent the standard deviation and mean of the feature map X_k, respectively.
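Putting the AANBlock pipeline and the contrast transform together, a minimal sketch is given below. The layer sequence (3×3 and 5×5 convolutions, GAP/GMP, FC + ReLU, 7×7 convolution + Sigmoid) follows the text; the FC width, the per-channel computation of μ and σ, and the ε value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AANBlock(nn.Module):
    """Sketch of one AANBlock iteration: derive a = α*+1 and b = β* from the
    luminance features Θ, then apply the linear contrast transform to the
    channel-normalized input X_k."""
    def __init__(self, channels=24, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.extract = nn.Sequential(                         # 3x3 conv + 5x5 conv -> Θ'
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
        )
        self.fc = nn.Sequential(                              # channel attention from [GAP; GMP]
            nn.Linear(2 * channels, channels), nn.ReLU(),
        )
        self.beta = nn.Sequential(                            # 7x7 conv + Sigmoid -> β*
            nn.Conv2d(channels, channels, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x_k, theta):                            # x_k, theta: (B, C, H, W)
        t_prime = self.extract(theta)                         # Θ'
        gap = t_prime.mean(dim=(2, 3))                        # global average pooling
        gmp = t_prime.amax(dim=(2, 3))                        # global maximum pooling
        t = self.fc(torch.cat([gap, gmp], dim=1))[:, :, None, None]  # attention weight T
        alpha = F.relu(t_prime * t)                           # α*, kept non-negative
        beta = self.beta(alpha)                               # β*
        mu = x_k.mean(dim=(2, 3), keepdim=True)               # per-channel mean μ
        sigma = x_k.std(dim=(2, 3), keepdim=True)             # per-channel std σ
        x_norm = (x_k - mu) / (sigma + self.eps)              # channel normalization
        x_contrast = (alpha + 1.0) * x_norm + beta            # linear contrast transform
        return F.relu(x_contrast)                             # X_{k+1}
```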
To balance the image contrast enhancement effect and computational efficiency, this paper proposes an adaptive stopping mechanism founded on the stability of the confidence level: while iteratively enhancing the image contrast, a contrast confidence score is calculated in real time to evaluate the enhancement result, and the computational depth of the network is adjusted adaptively. The flow is presented in Figure 3(a).
After an iterative sub-module completes its contrast enhancement, the confidence of the current result is generated by the contrast confidence calculation module C, which consists of a variance calculation layer (Contrast), a 1×1 convolution (Conv) layer and a Sigmoid activation layer:

confidence = Sigmoid(Conv(Contrast(x)))

where the variance calculation layer Contrast(x) is:
Contrast(x) = (1 / (w·h)) Σ_(u=1..w) Σ_(v=1..h) (x_(u,v) − μ)²

where w and h are the width and height of the feature map x, x_(u,v) is the pixel value at position (u, v) in x, and μ is the mean value of all pixels in x. Variance is widely used to measure the dispersion of image pixels: the higher the variance, the greater the difference between pixels, i.e., the higher the contrast. If the absolute difference of the confidence over three consecutive iterations is less than the preset threshold λ, the enhancement of the feature map is considered to have stabilized, and the iteration is terminated:
|confidence_f − confidence_(f−1)| < λ and |confidence_(f−1) − confidence_(f−2)| < λ

where confidence_k represents the confidence at the kth iteration and f is the index of the current iteration, with f ≥ 3. The preset threshold λ is determined by experiments on the validation set [15], taking into account the average number of iterations and the PSNR loss; it is set to 0.0005.
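A sketch of the confidence computation and the adaptive stopping loop follows. Applying the 1×1 convolution to the per-channel spatial variances is one plausible reading of the module description, and the helper run_aiceb, its argument names, and the default λ wiring are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ContrastConfidence(nn.Module):
    """Sketch of the confidence module C: variance (Contrast) -> 1x1 conv -> Sigmoid."""
    def __init__(self, channels=24):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, 1)                   # 1x1 convolution

    def forward(self, x):                                       # x: (B, C, H, W)
        var = x.var(dim=(2, 3), unbiased=False, keepdim=True)   # Contrast(x): spatial variance per channel
        return torch.sigmoid(self.conv(var)).flatten(1).mean(1) # scalar confidence per image

def run_aiceb(x, theta, submodules, confidence_fn, lam=5e-4):
    """Iterate AANBlock sub-modules and stop once the confidence change stays
    below λ over three consecutive iterations (adaptive stopping sketch)."""
    history = []
    for block in submodules:                                    # e.g. a list of AANBlock instances
        x = block(x, theta)
        history.append(confidence_fn(x).mean().item())
        if len(history) >= 3 and all(
            abs(history[-i] - history[-i - 1]) < lam for i in (1, 2)
        ):
            break                                               # enhancement considered stable
    return x
```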
2.4. Loss Function
To optimize MSF-ACA training, this paper uses spatial consistency loss, color consistency loss, mean absolute error loss, gradient-guided structural consistency loss, and perceptual loss to comprehensively evaluate the image enhancement effect.
Spatial Consistency Loss (Lspa): Structural consistency before and after enhancement is maintained by analyzing luminance differences in local regions of the image. First, the image is divided into M pixel blocks, and each block p is compared with its four-neighbourhood Φ(p); then the luminance differences between corresponding blocks of the reference image S and the enhanced image T are calculated; finally, the stability of the spatial relations is constrained by the mean squared error:

L_spa = (1/M) Σ_(p=1..M) Σ_(q∈Φ(p)) (|T_p − T_q| − |S_p − S_q|)²
Color Consistency Loss (Lcol): This loss ensures the stability of the colour distribution during image enhancement by constraining the intensity differences between different colour channels. For every pair of channels (m, n) in the set of channel pairs ε, the squared difference of their average intensity values J^m and J^n is summed:

L_col = Σ_((m,n)∈ε) (J^m − J^n)²
Mean Absolute Error Loss (L1): This loss mainly prevents localized overexposure or underexposure by comparing pixel-wise luminance differences between the reference image and the enhanced image. Let Y_i be the luminance value of the ith pixel in the reference image, O_i the luminance value of the corresponding pixel in the enhanced image, and N the total number of pixels in the image:

L_1 = (1/N) Σ_(i=1..N) |Y_i − O_i|
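The colour consistency and mean absolute error losses above translate almost directly into code; the sketch below assumes RGB inputs and batch averaging, which the text does not state explicitly.

```python
import torch

def l1_loss(enhanced, reference):
    """Mean absolute error between the enhanced image O and the reference image Y."""
    return (reference - enhanced).abs().mean()

def color_consistency_loss(enhanced):
    """Sum of squared differences between the mean intensities of each colour-channel
    pair (R,G), (R,B), (G,B) of the enhanced image, averaged over the batch."""
    j = enhanced.mean(dim=(2, 3))                 # per-channel mean intensities J^m, shape (B, 3)
    r, g, b = j[:, 0], j[:, 1], j[:, 2]
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```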
Gradient-guided Structural Consistency Loss (Lgsc): By imposing structural similarity constraints in the gradient domain, the consistency of luminance, contrast and structural features during image enhancement is effectively maintained. This paper adopts the gradient-based structural similarity loss (G_SSIM), which outperforms conventional SSIM on low-light and blurred images [29]. Let G_O(u, v) denote the gradient magnitude of the enhanced image O and G_Y(u, v) the gradient magnitude of the reference image Y, where (u, v) are the row-column coordinates of a pixel and C is a stabilisation constant that prevents the denominator from being zero:

L_gsc = 1 − (1 / (w·h)) Σ_(u,v) (2 · G_O(u, v) · G_Y(u, v) + C) / (G_O(u, v)² + G_Y(u, v)² + C)
Perceptual Loss (Lperceptual): The perceptual loss maintains semantic consistency between the enhanced image and the reference image through deep feature alignment. To achieve this, the study uses an ImageNet-pre-trained VGG network as a static feature encoder. Let Φ_l(x) and Φ_l(y) denote the layer-l features obtained from the pre-trained VGG network, where x is the low-light-enhanced output image, y is the reference image, and λ_l is the weight of the perceptual loss at layer l; ‖·‖_2 denotes the L2 norm. The VGG features are used to minimize the distance between the enhanced image and the reference image:

L_perceptual = Σ_l λ_l ‖Φ_l(x) − Φ_l(y)‖_2
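A sketch of the perceptual loss using a frozen torchvision VGG-16 as the feature encoder; the chosen layer indices and weights λ_l are placeholders, since the paper does not specify them here.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L2 distance between frozen VGG-16 features of the enhanced image x and the
    reference y, summed over selected layers with weights λ_l (illustrative settings)."""
    def __init__(self, layer_ids=(3, 8, 15), weights=(1.0, 1.0, 1.0)):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)                           # static feature encoder
        self.features = features
        self.layer_ids = set(layer_ids)
        self.weights = dict(zip(layer_ids, weights))

    def forward(self, x, y):
        loss, fx, fy = 0.0, x, y
        for idx, layer in enumerate(self.features):
            fx, fy = layer(fx), layer(fy)
            if idx in self.layer_ids:
                loss = loss + self.weights[idx] * torch.norm(fx - fy, p=2)
            if idx >= max(self.layer_ids):                    # no need to go deeper
                break
        return loss
```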
The total loss function L_total is the weighted combination of the above five loss functions:

L_total = ω_spa · L_spa + ω_col · L_col + ω_1 · L_1 + ω_gsc · L_gsc + ω_per · L_perceptual

where ω_i is the weight parameter of each loss function.
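The weighted combination can be expressed as a short helper; the weight values below are placeholders, not the paper's settings.

```python
def total_loss(l_spa, l_col, l_1, l_gsc, l_perceptual,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the five training losses (placeholder weights)."""
    w_spa, w_col, w_1, w_gsc, w_per = weights
    return (w_spa * l_spa + w_col * l_col + w_1 * l_1
            + w_gsc * l_gsc + w_per * l_perceptual)
```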