An Integration of Deep Network with Random Forests Framework for Image Quality Assessment in Real-Time

In recent years, data providers have been generating and streaming large numbers of images. In particular, processing images that contain faces has received great attention due to its numerous applications, such as entertainment and social media apps. The enormous number of images shared on these applications presents serious challenges and requires massive computing resources to ensure efficient data processing. However, in real application scenarios, images are subject to a wide range of distortions during processing, transmission, sharing, or a combination of these factors. There is thus a need to guarantee acceptable delivery content, even though some distorted images do not have access to their original versions. In this paper, we present a framework developed to estimate image quality while processing a large number of images in real-time. Our quality evaluation is measured using an integration of a deep network with random forests. In addition, a face alignment metric is used to assess the facial features. Experiments have been conducted on two artificially distorted benchmark datasets, LIVE and TID2013. We show that our proposed approach outperforms state-of-the-art methods, achieving a Pearson Correlation Coefficient (PCC) and a Spearman Rank Order Correlation Coefficient (SROCC) with subjective human scores of almost 0.942 and 0.931, respectively, while reducing the processing time from 4.8 ms to 1.8 ms.


Introduction
With the ongoing advances in technology, data providers are producing and streaming a significant amount of data. In particular, there is huge interest in the development and usage of multimedia-based applications. Since deep networks perform well on image tasks due to their deeper structure [9], several existing approaches [10][11][12] quantify image quality degradation based on CNNs. Although these approaches achieve high accuracy, they have high time complexity due to an excessive number of multiplications between the images and the layers, which increases their response time when classifying images. This limits their use in real-time applications. To clarify these points, we provide the following scenario.

Motivating scenario
Let us take, for example, the situation of a photo-sharing company (shown in Figure 1), which offers its customers the ability to share and publish images online. The company must instantly process these images, as it receives an unbounded stream of distorted and undistorted images. Moreover, the company provides additional services such as:
• Protecting users' identities to avoid disclosing sensitive information, using several techniques such as masking functions.
• Adapting their images to meet the limitations imposed by the available resources, for example, image compression.
Although the users benefit from these services, the images' content could be damaged or removed. Therefore, customers have started sending complaints about image quality degradation, and the company is looking for a new solution that satisfies the needs of its users. In summary, the framework to be implemented in the company must be able to:
• Estimate the images' quality: assessing the quality of distorted images that were already affected during the transmission or processing phase. These images are considered the most challenging cases since they do not have access to their reference images.
• Preserve the images' quality: ensuring that the modified images' features, such as color, shape, and texture, remain intact and can still be extracted from the output after applying the previously mentioned services.
• Handle the unbounded stream of images: processing multimedia data content requires higher bandwidth, bigger memory, and faster computational resources. It therefore imposes strict quality-of-service requirements and demands an efficient network architecture. For this reason, we need a way to treat the significant number of images by ensuring successful image processing and reliable delivery of the data instantly and in real-time, even though the previously mentioned services, along with the quality assessment process, could take much time.
Several existing works have tried to meet these objectives; the methods and techniques they use are cited, along with their limitations, in the next section, before we present our proposed approach.

Related Work
In this section, we compare various existing works to our approach based on the following criteria: 1) The extent of the useful images' information that remained intact when addressing business or users' needs, such as adaptation or protection, 2) The metrics used to estimate the images' quality; this will indicate the number of features learned as well as the quality prediction accuracy, and 3) The amount of time it takes to process the images, especially the proposed approaches that work in real-time.
Image quality assessment measures are typically classified into three categories: Full-Reference Image Quality Assessment (FR-IQA), reduced-reference (RR) IQA, and No-Reference (NR) IQA. To assess the quality of the distorted images, FR-IQA requires reference images. To compute the quality measure in RR-IQA, partial information from a reference image is needed. As for the NR-IQA, it provides the quality without the need for any reference image. In the following sections, we will present the techniques used to assess the images' quality and their limitations. However, we mainly focus on the FR and NR categories since they are mostly encountered in real-time streaming scenarios.

Full-Reference Image Quality Assessment
Although traditional signal fidelity measures like mean square error (MSE) and peak signal-to-noise ratio (PSNR) take no account of the characteristics of the image signal or the HVS, they are still widely used as FR measures [3]. However, they do not correlate well with perceived visual quality. This has resulted in the development of a wide variety of image quality metrics that agree better with human perception of image quality.
In [13], the authors proposed a lossy image compression method that generates files 2.5 times smaller than JPEG and JPEG 2000 while maintaining image quality. The approaches in [14,15] assessed distorted faces in real-time using objective quality methods. While these techniques provide valuable results, they assessed the distorted images by analyzing only the color and structure features, without considering other features that may be damaged and lead to content degradation. In [16], a hybrid feature-descriptor-based method, combining a spatial bag of features (SBoFs) with a spatial scale-invariant feature transform (SBoF-SSIFT), is proposed to recognize human emotions from facial expressions. The solution in [18] provides a novel DQAMLearn framework that aims to support mobile learners' seamless access to educational multimedia content from a variety of mobile devices with different characteristics. Moreover, as mobile users are increasingly becoming quality-aware, the framework integrates novel mechanisms for decreasing the video quality in a controlled way, with the aim of supporting a good learner quality of experience (QoE) even in resource-constrained situations. In [19], the authors propose application-layer and middleware-based solutions that increase network reliability and flexibility. In [20], the authors present the architecture of an adaptive multimedia learning service, whose engine enables users to identify the best combination of adaptive features of visual and audio content.
Machine learning has been partially adopted in Full-Reference image quality assessment (FR-IQA).
Instead of directly combining quality-related image features, some IQA metrics rely on learning techniques for feature discovery and integration. An obvious advantage of using machine learning techniques for feature integration is that the model can be mathematically optimized and therefore achieve superior performance.
A singular value decomposition (SVD) based measure was proposed in [21], and the authors first calculated the distance between the singular values of the reference image blocks and distorted image blocks. They then computed a global value from each block to represent the final quality. Liu et al. [22] introduced a novel parallel boosting measure that inherited the advantages of some state-of-the-art FR measures. Specifically, the authors utilized the SVR to integrate the quality features extracted by state-of-the-art FR measures.
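The SVD-based measure of [21] can be sketched in a few lines. The 8×8 block size and the global pooling rule (mean absolute deviation from the median block distance) are our simplified assumptions for illustration, not necessarily the authors' exact formulation.

```python
import numpy as np

def svd_block_distance(ref_block, dist_block):
    """Distance between the singular values of a reference block and a distorted block."""
    s_ref = np.linalg.svd(ref_block, compute_uv=False)
    s_dist = np.linalg.svd(dist_block, compute_uv=False)
    return float(np.sqrt(np.sum((s_ref - s_dist) ** 2)))

def svd_quality(ref, dist, block=8):
    """Pool per-block singular-value distances into one global score
    (here: mean absolute deviation from the median block distance)."""
    h, w = ref.shape
    dists = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            dists.append(svd_block_distance(ref[i:i+block, j:j+block],
                                            dist[i:i+block, j:j+block]))
    dists = np.array(dists)
    return float(np.mean(np.abs(dists - np.median(dists))))
```

A distortion-free image yields identical singular values in every block, so the global score is zero; larger values indicate stronger degradation.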
In [23], multiple features were extracted from the difference of Gaussian frequency bands and regressed onto the quality score.

No-Reference Image Quality Assessment
Most of the state-of-the-art IQA methods [24][25][26] train their network to predict the distorted images' quality using human subjective quality scores that are available in several datasets, e.g., the images in the TID2013 [27] and LIVE [28] databases. All the previously mentioned methods follow a two-step framework: feature extraction and model regression by human scores. The IQA metric in [24] creates and combines image pairs within individual databases as the training set, which effectively bypasses the issue of perceptual scale realignment. The authors compute a continuous quality annotation for each pair from the corresponding human opinions, indicating the probability of one image having better perceptual quality. The authors in [26] proposed an IQA metric based on deep meta-learning. They first trained the model based on a number of distortion-specific NR-IQA tasks to learn a meta-model. The latter can capture the humans' shared metaknowledge when evaluating images with various distortions, enabling fast adaptation to the NR-IQA task of unknown distortions.
There are several existing studies using no-reference IQMs to assess faces' quality. The "MagFace" approach from Meng et al. [29] expanded on the idea of FR with integrated Face Image Quality Assessment (FIQA). In contrast to previous approaches such as ProbFace [30], the data uncertainty learning approach from [31], or PFE [32], MagFace does not have separate quality or uncertainty output at all. Instead, the quality is directly indicated by the magnitude of the FR feature vector. The approach works by extending the ArcFace [33] training loss, changing the angular margin to a magnitude-aware variant, and adding magnitude regularization.

Motivated by the recent success of CNNs in image classification tasks, several deep neural network-based approaches [12,34,35] to image quality assessment have been proposed. They can be used for no-reference as well as full-reference IQA. The authors in [36] extract feature vectors from the distorted and reference images, which are then concatenated while assigning weights to each region. They showed superior performance compared to the state-of-the-art. However, the authors did not report the execution latency or time complexity. This method may take more time to assess an image since the authors consider each region and the distorted parts, limiting its use in real-time applications. In [37], the authors train a CNN to predict the objective scores of all metrics via their proposed Multi-Task Learning (MTL) framework. Afterward, they use this framework to extract features to train another small regression network for subjective score prediction.
The following Table summarizes the previously mentioned approaches along with their limitations.

Full-Reference IQA:
• [13][14][15] — Contribution: the proposed methods maintain the image quality while processing the images in real-time. Limitation: their use in many real-time streaming scenarios could be impractical, especially when the original image is not present.
• [18][19][20] — Contribution: they provide an end-to-end quality assessment framework that could guarantee a high level of QoS. Limitation: they assessed a limited number of features.
• [21][22][23] — Contribution: they train a regression model to predict the image quality score using multiple features in order to identify the remaining useful image features when adapting the content.

No-Reference IQA:
• [29][30][31] — Contribution: a universal 512-D face feature representation is provided to measure the quality of a given face.
• [26][34][35][36] — Contribution: the authors used FR-IQA methods to annotate and train their CNN models. Limitation: even though these techniques achieve state-of-the-art results, their use could be limited in real-time applications due to the structural complexity of the trained models.

Our work aims to find a fair trade-off between the quality of the altered content and the users' expected outcome in real-time, as detailed in the next section.

Contributions
According to [38], random forests have the best time complexity among all machine learning models, which allows us to process and predict the images' quality score within a short time. However, the authors also revealed that their use for large-scale multi-class image classification is impractical since they cannot classify high-dimensional data. For this reason, we propose a quality estimation of the faces contained in the images by integrating a Deep Neural Network with random forests while processing these images in real-time. Hence, we first apply a feature extraction process through a Deep Neural Network to reduce the images' dimensionality while keeping the useful information, before classifying the images according to their quality scores using the random forests. This combination minimizes the inference time compared to any single Deep Network. We assume that the faces to be evaluated on the fly have been affected by adaptation or protection functions.
Our contributions can be summarized as follows:
• We propose a quality estimation of the faces contained in images by integrating a Deep Neural Network with random forests. We opt for the Convolutional Neural Network (CNN) as our Deep Network since it is the most commonly used network for analyzing images. After extracting the features, we train the random forests using three Full-Reference metrics:
- Structural Similarity Index (SSIM) [2].
- CBIR.
- PCM.
• We introduce a second metric, based on face alignment, that allows us to assess the facial features in an image while reducing the dimension of the face feature vector.
It is worth mentioning that the two main differences between our method and state-of-the-art methods are: 1) it is the first time an integration of a CNN with random forests has been used to predict image quality while considering such a number of features to train the model, and 2) an unprecedented combination of a face alignment metric with the previously trained model.
• We develop a framework with the capability of evaluating a stream of images efficiently while estimating its quality.
The remainder of this article is organized as follows. Section §4 presents some definitions and terminologies used in our work. The Machine Learning Model for image quality assessment and the face alignment metric are described in section §5, while the proposed framework is then detailed in section §6. We evaluate our proposed approach in section §7 through a set of experiments. Conclusions and future work are summarized in section §8.

Definitions
In this section, we present the data model and data manipulation functions needed to fully understand the proposed framework.
Definition 1 (image) An image, denoted by im, is a basic data structure consisting of attributes that provide clues about its content. It is written as im = ⟨DESC, F, SO⟩, where:
• DESC is a user-provided set of textual descriptions, keywords, or annotations.
• F is the set of features that depicts an image. It can describe an entire image or a feature at a specific location.
• SO is a set of salient objects representing objects of interest in an image detailed in the following definition.
Definition 2 (salient object) It is an object of interest, denoted by so, in an image, for example, a person's face. It is defined as so = ⟨w, h, coord, DESC, F⟩, where:
• w and h are the width and height of the salient object.
• coord indicates the coordinates of so.
• DESC is a set of textual descriptions related to so.
• F is the set of features revealing a salient object's visual content such as color, texture, and shape.
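The structures of Definitions 1 and 2 can be sketched as plain data classes. The field names follow the definitions, while the concrete types (lists, dicts, tuples) are our own illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SalientObject:
    """Definition 2: an object of interest (e.g., a face) inside an image."""
    w: int                                         # width of the salient object
    h: int                                         # height of the salient object
    coord: Tuple[int, int]                         # coordinates of so in the image
    DESC: List[str] = field(default_factory=list)  # textual descriptions of so
    F: dict = field(default_factory=dict)          # visual features: color, texture, shape

@dataclass
class Image:
    """Definition 1: an image im = <DESC, F, SO>."""
    DESC: List[str] = field(default_factory=list)            # keywords/annotations
    F: dict = field(default_factory=dict)                    # image-level features
    SO: List[SalientObject] = field(default_factory=list)    # salient objects
```

An image can thus be built bottom-up: first its salient objects, then the image record that aggregates them.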

Definition 3 (entity)
It is a semantic object that exists by itself (e.g., person, vehicle) and is expressed as e.
Each entity is represented by a set of salient objects, which can be either distorted or not. A relationship, e → {so 1 , so 2 , ..., so n }, done via manual or automatic annotation, shows the salient objects {so 1 , so 2 , ..., so n } that are associated with entity e, where so represents a modified salient object.
Definition 4 (multimedia data stream) It represents an infinite sequence of images, designated as mds, that may contain a mix of distorted and distortion-free images. It is formally defined as mds = ⟨im_1, im_2, im_3, ...⟩.

Data manipulation functions
This section defines the functions used to modify the images in the multimedia data stream by either protecting or adapting the salient objects in these images. Our presumptions focus mainly on the identification of the salient objects that might be protected or adapted based on predefined rules in authorization or adaptation schemes, which are out of the scope of this paper. We assume that the protection and adaptation functions are known and that they can be called implicitly on a subset of specified entities.
Definition 5 (image manipulation function) It is a low-level function, designated by imf, that modifies, suppresses, or removes a set of features attributed to so in an image im. imf (so, im) takes a salient object so, the image im that contains so, and returns a modified salient object denoted by so'.
As previously mentioned, we focus on two types of functions: protection and adaptation. The first type is used to hide the images' content by deleting some of its features to conceal some sensitive information related to an entity (for example, his/her identity). The second type is used to meet resource constraints such as hardware limitations.
A group of manipulation functions could be applied to an entity that exists in the images. In our work, this group is known as an entity manipulation function, formally defined as follows.
Definition 6 (entity manipulation function) It is denoted by emf and defined as emf(e, mds) = {imf_1(so_1, im_1), ..., imf_i(so_n, im_n)}, where i and n ∈ N*. emf combines several image manipulation functions, which modify the salient objects representing entity e in the multimedia data stream mds by altering their features. As a result, this function returns the set of modified salient objects SO' that represents entity e.
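A minimal sketch of Definition 6, assuming each imf is a callable of the form imf(so, im) → so' (the function name and representation of salient objects are illustrative):

```python
def entity_manipulation_function(imfs, entity_sos, images):
    """Definition 6 (sketch): apply each image manipulation function imf_i
    to the salient object so_i it targets inside image im_i, and return
    the set of modified salient objects SO' representing the entity."""
    modified = []
    for imf, so, im in zip(imfs, entity_sos, images):
        modified.append(imf(so, im))   # imf(so, im) -> so'
    return modified
```

For example, applying a blurring function to one face and a pixelation function to another yields the set SO' of altered salient objects for that entity.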

Machine Learning Model For Image Quality Assessment
An overview of our Machine Learning Image Quality Assessment model is shown in Figure 2. It is composed of two main modules: • The Convolutional Neural Network.
• The set of random forests.
But before going into the details of each module, we first divide the dataset into three subsets, where each subset is used to train one random forest and is labeled using one of the following FR-IQA metrics:
• Structural Similarity Index (SSIM).
• CBIR.
• PCM.
We select these metrics since each one of them assesses the images' quality by considering different features. In our work, the labels represent quality scores ranging from 0 to 1, with a margin of 0.1 between consecutive scores, expressed as S = [s_1, ..., s_N] with 0 ≤ s_i ≤ 1.
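The labeling scheme above (scores from 0 to 1 with a 0.1 margin, i.e., eleven classes) can be sketched as a simple rounding step; the function name is our own.

```python
def score_to_class(s):
    """Map a continuous FR-IQA score s in [0, 1] to one of the eleven
    class labels {0.0, 0.1, ..., 1.0} used to train each random forest."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("quality score must lie in [0, 1]")
    # round to the nearest multiple of 0.1, then clean floating-point noise
    return round(round(s * 10) / 10, 1)
```

For instance, an SSIM value of 0.37 falls into the 0.4 class.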

Convolutional Neural Network Module
Usually, the number of parameters in a neural network grows rapidly as the number of layers increases; tuning so many parameters is a huge task and makes the model computationally heavy. The Convolutional Neural Network (CNN), however, reduces the time needed to tune these parameters since it effectively reduces the number of parameters without losing model quality. This sort of network is therefore the most widely used for processing and analyzing images, given their high dimensionality. For this reason, we employ a Convolutional Neural Network.
Normally, a CNN is composed of two basic parts:
• Feature extraction.
• Classification.
In our work, we adopt the CNN to perform feature extraction using several convolutional, max-pooling, and flatten layers, as shown in Figure 2. The feature extraction process allows us to reduce the image dimensionality while keeping the most useful information. Hence, it leads to:
• Minimizing the processing time, especially as our solution targets real-time applications.
• Facilitating the learning and classification process of the random forests since they are not able to classify data of high dimensions.
So, this module will return a feature vector, denoted by F Conv , for any given image im.
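The feature-extraction path (convolution, max-pooling, flattening) that produces F_Conv can be illustrated in miniature with NumPy. The single 3×3 kernel and one conv/pool stage below are toy assumptions for clarity, not the actual network architecture.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution: single channel, single filter."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max-pooling over size x size windows."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def extract_features(img, kernel):
    """F_Conv: convolve, apply ReLU, pool, then flatten to a 1-D feature vector."""
    x = np.maximum(conv2d(img, kernel), 0.0)   # ReLU non-linearity
    return max_pool(x).ravel()
```

Note how the dimensionality drops at each stage: an 8×8 input becomes a 6×6 feature map after a 3×3 convolution, a 3×3 map after pooling, and finally a 9-element vector F_Conv.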

Random Forests Module
The random forest consists of a large number of individual decision trees that operate as an ensemble. Hence, in a random forest with T trees we have t ∈ { 1, ..., T }. The random forest gets a prediction from each tree and selects the best solution by means of voting. We use the random forest as a classifier to know what class (the quality score in our case) an image belongs to. The random forest model works so well because a large number of relatively uncorrelated trees operating as a committee will outperform any of the individual constituent models.
The low correlation between trees is the key. Uncorrelated trees can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this wonderful effect is that the trees protect each other from their respective errors. In our work, we are ensuring that the trees are uncorrelated since they are trained separately using different features.
Usually, the random forest predicts, from each decision tree, the probabilities of the input image belonging to each given class. In our case, we define eleven classes for each decision tree, representing the ground-truth image quality scores provided by the FR-IQA metrics used to train these trees. Therefore, we define the ground-truth vector of probabilities as P = [p_{s_1}, ..., p_{s_N}], with Σ_{i=1}^{N} p_{s_i} = 1, where p_{s_i} represents the probability of a quality score falling in the i-th bucket. In our work, we make use of three random forests to predict the image quality score by training each one on a specific subset, as shown in Figure 2.
After finishing the training phase, and during the testing process, each feature vector F_Conv of an image im is simultaneously pushed through all random forests, where each one predicts an image quality score, denoted by IQS, as follows:

IQS = ŝ_i, with p̂_{ŝ_i} = max_{1 ≤ i ≤ N} p_{s_i}

where ŝ_i is the predicted quality score and p̂_{ŝ_i} is the probability of the predicted quality score ŝ_i. Finally, we calculate the final score by averaging the values from the three random forests:

µ = (1/3) Σ_{j=1}^{3} IQS_j

As a result, µ will return a value between 0 and 1, where higher values indicate image quality preservation.
Some facial landmark points are considered more representative than others. As stated in [39], the eyes and the mouth are very representative parts of a person's face; the authors showed that these two parts achieve an emotion classification accuracy of 81.5%. For this reason, we refer to the points that are relative to these parts as key points, denoted by k(x, y), where x and y are the coordinates of point k.
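The prediction step can be sketched as follows: each forest outputs a probability vector over the eleven score classes, its IQS is the score with the highest probability, and µ averages the three forests' outputs. The SCORES grid below encodes the eleven classes; the function names are ours.

```python
import numpy as np

# the eleven quality classes: 0.0, 0.1, ..., 1.0
SCORES = np.round(np.arange(0.0, 1.01, 0.1), 1)

def forest_iqs(class_probs):
    """IQS of one forest: the score s_i whose predicted probability p_si is highest."""
    return float(SCORES[int(np.argmax(class_probs))])

def final_score(prob_vectors):
    """mu: average of the IQS values predicted by the random forests."""
    return float(np.mean([forest_iqs(p) for p in prob_vectors]))
```

If the three forests are most confident about the 0.9, 0.8, and 0.7 classes respectively, µ is 0.8.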

Face Alignment
More specifically, we select six main key points, as shown in Figure 3a:
• The midpoints of the eyebrows (k_1 and k_3).
• The centroids of the two eyes (k_2 and k_4).
• The center of the face (k_5).
• The midpoint of the mouth (k_6).
Figure 3: (a) The key points. (b) The distances between key points.
The centroids of the two eyes are calculated from the six facial landmarks of each eye; these centroids represent the eyes' key points. We then find each eyebrow's midpoint using landmarks #18 and #22 for the left eyebrow and #23 and #27 for the right eyebrow. For the center of the face, we directly use landmark #31, while the midpoint of the mouth is found using landmarks #49 and #55. As shown in Figure 3b, we make use of these key points to find the following three distances:
• D_so(k_1, k_2): left eyebrow midpoint - left eye centroid
• D_so(k_3, k_4): right eyebrow midpoint - right eye centroid
• D_so(k_5, k_6): landmark #31 - mouth midpoint
These distances are calculated using the Euclidean distance:

D_so(k_i, k_{i+1}) = sqrt((x_i − x_{i+1})² + (y_i − y_{i+1})²) / h, with {i ∈ 1, ..., 5 | i ≠ 2, 4}

where h is the diagonal of the bounding box, used to normalize the value and computed as h = sqrt(so.w² + so.h²). These distances form a 3-D feature vector, denoted by FA, representing the face so. We then apply unsupervised learning by running k-means on the feature vectors. This mechanism produces K clusters, denoted by C = {c_1, c_2, ..., c_K}, with K ∈ N. Thus, each cluster c_i has a centroid represented by the mean feature vector of this cluster, designated as FA_i. After completing the training phase, we assess the facial landmarks of a distorted face by finding the deviation between its landmark points and the trained points using the relative error formula:

δ = |FA − FA_i| / FA_i

As a result, δ will contain a value between 0 and 1. The closer the value is to 0, the more likely the facial landmarks are aligned.
Therefore, after finding the deviation of the landmark points for each face, we compute the average deviation Δ(im) over all of the n faces contained in a distorted image im:

Δ(im) = (1/n) Σ_{j=1}^{n} δ_j

Finally, we subtract the result from 1, as shown below, to align the measure with the output score of our machine learning model, described in the previous subsection:

FA(im) = 1 − Δ(im)

Therefore, higher scores indicate face alignment preservation.
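The alignment pipeline, normalized key-point distances, relative-error deviation against a trained centroid, and the final subtraction from 1, can be sketched as follows. The element-wise averaging inside the deviation and the point ordering are our assumptions about how the formulas are aggregated.

```python
import numpy as np

def key_point_distances(k, w, h):
    """3-D alignment vector FA for one face: Euclidean distances between the
    key-point pairs (k1,k2), (k3,k4), (k5,k6), normalized by the bounding-box
    diagonal h = sqrt(w^2 + h^2)."""
    diag = np.sqrt(w ** 2 + h ** 2)
    pairs = [(0, 1), (2, 3), (4, 5)]   # i in {1, 3, 5}: pair (k_i, k_{i+1})
    return np.array([np.linalg.norm(np.subtract(k[a], k[b])) / diag
                     for a, b in pairs])

def deviation(fa, fa_centroid):
    """delta: relative error between a face's vector FA and its trained
    cluster centroid FA_i (averaged element-wise, our assumption)."""
    return float(np.mean(np.abs(fa - fa_centroid) / fa_centroid))

def face_alignment(deltas):
    """FA(im) = 1 - average deviation over all faces in the image."""
    return 1.0 - float(np.mean(deltas))
```

An undistorted face that matches its centroid exactly yields δ = 0 and therefore a perfect alignment score of 1.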

Image Score
In the end, we combine the previous methods to calculate the final image quality score. We compute the Image Score, denoted by IS, as follows:

IS = w_1 · µ + w_2 · FA

where w_1 and w_2 are weights between 0 and 1 whose sum equals 1. The administrator chooses these weights to indicate the importance of each method based on their preferred features. The value of IS ranges from 0 to 1, and higher scores indicate quality preservation.
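A minimal sketch of this weighted combination, with the administrator-chosen weights validated to lie in [0, 1] and sum to 1 (the default weights of 0.5 are illustrative):

```python
def image_score(mu, fa, w1=0.5, w2=0.5):
    """IS = w1 * mu + w2 * FA, where mu is the machine-learning score and
    FA the face alignment score, both in [0, 1]."""
    if abs(w1 + w2 - 1.0) > 1e-9 or not (0 <= w1 <= 1 and 0 <= w2 <= 1):
        raise ValueError("w1 and w2 must lie in [0, 1] and sum to 1")
    return w1 * mu + w2 * fa
```

With µ = 0.8 and FA = 0.6 under equal weights, the final score IS is 0.7.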

Framework
An overview of our framework is shown in Figure 4. It consists of two main modules:
• Stream Processing.
• Back-End.
In the following, we present in detail the framework's modules.

Stream Processing Module
This module is responsible for processing continuous, never-ending data streams that have no beginning or end and provide a constant feed of data that can be utilized or acted upon without being downloaded first. Using the Stream Processing Module, data streams can be processed and analyzed as they are generated in real-time. Therefore, administrators can query continuous data streams and detect conditions within a short time of receiving the data. In our work, we took Twitter as a source of multimedia data streaming while processing only images.
As shown in Figure 4, our framework has the ability to process two kinds of images: • A distorted image.
• A distortion-free image.
Moreover, the images in the stream are numbered from 1 to k, where k is unbounded, indicating that we process images without any limit. Finally, im is the resulting image returned from the back-end module.
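The routing behavior described above can be sketched as a generator over an unbounded stream. Here `is_distorted` is a hypothetical stand-in for the Distortion Detection component, and the route labels are illustrative.

```python
def process_stream(images, is_distorted):
    """Sketch of the Stream Processing Module: consume an (unbounded) image
    stream and route each image by its distortion status. Distorted images
    go to quality estimation; distortion-free ones go to manipulation."""
    for k, im in enumerate(images, start=1):   # k grows without bound
        route = "quality_estimation" if is_distorted(im) else "manipulation"
        yield k, route, im
```

Because it is a generator, the module never materializes the whole stream; each image is dispatched as soon as it arrives.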

Back-End Module
This module consists of two main submodules: a) Identity Recognition and b) Quality Estimation.
The first one is responsible for detecting and recognizing: • The entities.
• The distortion, if present.
These tasks are done through the use of the following components: 1) Face Detection, 2) Face Recognition, 3) Entities Knowledge Base, 4) Distortion Detection, 5) Distortion Recognition, and 6) Distortion Trained Repository.
The second submodule's duty is to assess the image quality with the help of the remaining components: the Entity Manipulation Function, Face Alignment, Machine Learning Model, and Image Score.

Face Recognition
This component attempts to recognize the individuals' faces by comparing the detected faces with those stored in the Entities Knowledge Base, which will be detailed in the next section. If a match is found, the image is forwarded to the Distortion Detection. Otherwise, the image will return to the Stream Processing Module.

Entities Knowledge Base
This component represents the database in which the trained entities reside. An administrator can add new entities to the database to build the schema and provide more training images for existing entities, which increases recognition accuracy.

Distortion Detection
This component is in charge of determining whether the image is distorted or not. If a distortion is detected, the image is sent to the Distortion Recognition component. Otherwise, it will be headed to the Entity Manipulation Function.

Distortion Recognition
After detecting the distortion, and in order to identify its type, this component compares the latter with those stored in the Distortion Trained Repository. If the distortion is recognized, the image is sent to the Quality Estimation Module to assess the quality of the distorted image. Otherwise, it will be redirected to the Stream Processing Module.

Distortion Trained Repository
This component represents the repository where different types of distortion are trained and stored. It can recognize five kinds of distortions: Median blurring, Gaussian blurring, Pixelate, additive Gaussian noise, and compression. In addition, an administrator can add more distortion types.

Entity Manipulation Function
This component modifies the salient objects of the entities, which in our work are the persons' faces. As shown in Figure 4, the administrator can add and impose constraints by applying a series of image manipulation functions on the salient objects of entity e. Hence, this component returns a set of altered salient objects. We recall that the image manipulation functions are divided into two types: • Protection, to hide the users' identity; we used four main protection functions: Pixelate, Gaussian blurring, Median blurring, and additive Gaussian noise.
• Adaptation to meet some limitations imposed by the available resources. We used two compression techniques: lossy and lossless.
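Two of the protection functions above can be sketched on grayscale NumPy arrays. These are illustrative implementations, not the framework's actual ones.

```python
import numpy as np

def pixelate(img, block=4):
    """Protection function sketch: replace each block x block region by
    its mean value (a.k.a. mosaicking)."""
    h, w = img.shape
    h2, w2 = h - h % block, w - w % block       # crop to a multiple of block
    out = img.astype(float).copy()
    small = out[:h2, :w2].reshape(h2 // block, block,
                                  w2 // block, block).mean(axis=(1, 3))
    out[:h2, :w2] = np.repeat(np.repeat(small, block, axis=0), block, axis=1)
    return out

def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Protection function sketch: additive Gaussian noise with std sigma."""
    rng = np.random.default_rng(seed)
    return img + rng.normal(0.0, sigma, img.shape)
```

Each function damages different features: pixelation destroys local texture while preserving coarse color layout, whereas additive noise perturbs every pixel uniformly.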
In fact, the functions differ in the number of features they preserve. For example, a median blur returns a modified image in which certain visual and multimedia features are damaged while the semantic features remain intact. Since each function maintains specific features, we apply the previously mentioned list of functions to find the most suitable one that can guarantee acceptable image quality, with the help of the following image quality assessment metrics.

Face Alignment
After modifying the faces, the first metric used to assess the image's quality is the Face Alignment component by extracting the facial features. In this component, we measure the divergence between the landmarks points of a distorted face and the trained landmarks points to ensure that the facial expressions can still be extracted and recognized from the distorted face. As a result, the Face Alignment will return a value between 0 and 1.

Machine Learning Model
This model will help us estimate the images' quality by considering a various number of features. We recall that we trained this model using three FR-IQA metrics: SSIM, CBIR, and PCM. When the training phase is completed, the model will determine the image quality by assessing several important features that are related to the FR-IQA metrics. Therefore, the model will output a value between 0 and 1 for any distorted image.

Image Score
The final component, which resides at the output, is the Image Score. It is responsible for: • Aggregating the scores returned from the Machine Learning Model and the Face Alignment component, assigning them weights based on the administrator's preferred feature selection.
• Selecting the image manipulation function that has preserved the images' features when imposing the constraints.
• Displaying the image score, which is calculated using equation 8.
Simultaneously, the distorted images (im') coming from the stream and the recently modified images will return to the Stream Processing Module to be then published.
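The aggregation step can be sketched as follows; since equation 8 is not reproduced here, a plain weighted mean serves as an illustrative stand-in, and the weights and per-function scores below are hypothetical:

```python
def image_score(model_score, face_alignment, w_model=0.4, w_align=0.6):
    """Illustrative stand-in for equation 8: a weighted mean of the
    Machine Learning Model score and the Face Alignment score, both
    in [0, 1]. The weights reflect the administrator's preferences."""
    return (w_model * model_score + w_align * face_alignment) / (w_model + w_align)

# Hypothetical scores for the three protection functions on one image.
candidates = {
    "pixelate": image_score(0.71, 0.62),
    "gaussian_blur": image_score(0.78, 0.85),
    "median_blur": image_score(0.74, 0.70),
}
# Select the manipulation function that preserved the most features.
best_function = max(candidates, key=candidates.get)
```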

Experiments
In this section, we first present the experimental setup and protocol before testing our framework's efficiency.

Experimental Setup
First of all, we start by training the Machine Learning Model using two different scenarios:
• In the first one, we used the CSIQ [39] dataset. It consists of 30 reference images and 866 distorted images with five different distortions: JPEG compression, JPEG-2000 compression, global contrast decrements, additive pink Gaussian noise, and Gaussian blurring. In this scenario, we aim to prove our approach's validity by comparing the model's prediction accuracy against the state-of-the-art methods.
For this purpose, we use ResNeSt269 as our backbone network because it achieves the highest accuracy, as highlighted in dark gray in Table 2.
• In the second one, we used the CelebA [40] dataset. It contains 202,599 face images of 10,177 identities, from which we selected 52,800 images to prepare our training set. We then applied three manipulation functions to these images while considering many distortion levels: Pixelation (a.k.a. mosaicking), Gaussian blurring, and Median blurring. These three functions are the most commonly encountered distortions in practical applications. Our goal here is to use this model to predict the faces' quality in real-time. For this reason, we change the backbone network to SE-ResNeXt101_64x4d, which offers an excellent trade-off between inference time and memory consumption, as highlighted in light gray in Table 2. We note that the inference time represents the time needed for feature extraction and classification. In our work, the CNN processing time is lower than the values presented in Table 2 since we only use this network for feature extraction.
In both scenarios, we divided the dataset into three subsets using the FR-IQA metrics mentioned in subsection §5.1, and we distributed the images among the eleven classes according to their quality score. Each manipulation function contributes 230 images per class; hence, each class contains 690 distorted images for training and 173 for validation. We recall that we chose eleven classes ranging from 0 to 1 with a margin of 0.1 between consecutive classes. We tried several scenarios by minimizing and maximizing the margin between the classes, and noticed that a margin of 0.1 gives the best accuracy.
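The class assignment described above, eleven classes from 0 to 1 with a margin of 0.1, can be sketched as:

```python
def quality_class(score, n_classes=11):
    """Map a quality score in [0, 1] to one of eleven classes whose
    centers are 0.0, 0.1, ..., 1.0 (a margin of 0.1 between classes)."""
    return min(n_classes - 1, int(round(score * (n_classes - 1))))

# A few representative scores and the classes they fall into.
classes = [quality_class(s) for s in (0.0, 0.53, 0.97, 1.0)]
```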
Moreover, we ran several configurations to select the ideal number of trees, and noticed that 505 trees with a maximum depth of 70 deliver the best result. To train the unsupervised neural network of the face alignment, we selected 100 celebrities, each with 50 images, from the CelebA dataset. We then extracted their facial landmarks to build a 3D face feature vector. We created 2 clusters in this training by referring to two main metrics: Inertia (intra-cluster distance) and the Dunn index (inter-cluster distance). Inertia measures how far apart the points within a cluster are; lower Inertia values therefore mean that clusters are internally coherent, as Inertia starts at zero and increases. The Dunn index aims to identify sets of clusters that are compact, with small variance between members of a cluster, and well separated. Thus, for a given assignment of clusters, a higher Dunn index indicates better clustering. We noticed that the Dunn index drops to 0.03 between 2 and 3 clusters, while Inertia decreases only slightly from 2.61 to 2.09. For this reason, the optimal number of clusters is 2.
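The two clustering criteria can be sketched as follows (NumPy-only; the clustering step itself is omitted and the cluster assignments are assumed given):

```python
import numpy as np

def inertia(points, labels, centers):
    """Sum of squared distances from each point to its cluster center
    (lower = internally more coherent clusters)."""
    return sum(((points[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centers))

def dunn_index(points, labels):
    """Smallest inter-cluster distance divided by the largest cluster
    diameter (higher = compact, well-separated clusters)."""
    clusters = [points[labels == k] for k in np.unique(labels)]
    inter = min(np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=-1).max()
               for c in clusters)
    return inter / diam

# Two well-separated synthetic blobs standing in for the feature vectors.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
lbl = np.array([0] * 20 + [1] * 20)
ctr = np.array([pts[:20].mean(0), pts[20:].mean(0)])
inertia_value = inertia(pts, lbl, ctr)
dunn_value = dunn_index(pts, lbl)
```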
Finally, we gathered images for each celebrity to train an identity recognition model. However, the CelebA dataset is not designed for face recognition tasks. Therefore, we first grouped images of the same individuals based on the identity annotations provided by the CelebA dataset. We then kept the celebrities with 30 or more images, which resulted in almost 2,600 identities.
After completing the training above, software was built in Java using Eclipse on a desktop computer with a 2.66 GHz Core 2 Duo and 4 GB of RAM running Linux Ubuntu 14.04 64-bit. After running the program on one computer, the framework described in section §6 was deployed on a distributed system called Apache Storm [44]. For the Storm cluster to run successfully, we must implement all of its components. To do so, we show in Tables 3 and 4 the Apache Storm configuration as well as the needed libraries, including the trained ResNet model used to recognize the distortions and the faces.

Experimental Protocol
We conducted three sets of experiments as shown below:
• Firstly, we evaluated our model's efficiency by measuring the quality score prediction accuracy of the No-Reference images from two datasets, LIVE [47] and TID2013 [27], each of which provides a subjective quality score. As in many previous works [36,48,49], we only consider three types of distortions that are common to the two databases: JPEG compression, additive Gaussian noise, and Gaussian blurring.
We then compare our results with the state-of-art methods.
• Secondly, we assessed the quality of images from the Twitter stream that may be affected after applying a manipulation function. We limit the number of processed images to 9,000, as our goal is to determine the image quality using the Machine Learning Model and the Face Alignment defined in section §5. We started by varying the number of faces in the images from 1 to 3 and applying the existing list of manipulation functions to find the appropriate one that returns the highest score and preserves most of the images' features. In these two scenarios, the framework is tested only on a local cluster without being uploaded to the distributed system.
• Thirdly, we implemented our framework on Apache Storm and evaluated its real-time performance in terms of:
1. Execution latency: the average amount of time it takes for an image to be processed.
2. Number of nodes: the number of supervisors engaged in processing the images.
To do so, the following scenario was executed: 1. We distributed the libraries on all nodes.
2. We uploaded the framework to the cluster while processing 50,000 images from Twitter stream.
We repeat step 2 several times, incrementing the number of nodes by 2 in each run, to evaluate Apache Storm's performance in terms of execution latency.

Performances Comparison
This test aims to evaluate the performance of our Machine Learning Model by comparing the prediction scores of the No-Reference images to the subjective ratings. As mentioned before, the two largest publicly available subject-related databases used are: LIVE [47], and TID2013 [27]. Two correlation coefficients between the prediction results and the subjective scores have been adopted to evaluate the performance of our method: • Spearman Rank Order Correlation Coefficient (SROCC) assesses how well the relationship between two variables can be described using a monotonic function.
• Pearson Correlation Coefficient (PCC) measures the degree of relationship between two random variables.
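Both coefficients are readily computed with SciPy; the predicted scores and MOS values below are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted quality scores vs. subjective MOS values.
predicted = np.array([0.91, 0.40, 0.75, 0.22, 0.60, 0.85])
mos       = np.array([0.88, 0.35, 0.80, 0.25, 0.55, 0.90])

pcc, _ = pearsonr(predicted, mos)     # degree of (linear) relationship
srocc, _ = spearmanr(predicted, mos)  # monotonic relationship on ranks
```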
A high correlation coefficient (close to 1) with the subjective MOS score indicates a good method. After processing the images from the LIVE and TID2013 datasets, we obtain the results shown in Table 5, alongside the state-of-the-art methods (Zhang et al. [24], NIMA [25], MetaIQA [26], UNIQUE [34], DeepFL-IQA [37], and CORNIA [50]). We first compare our method to the FR-IQA metrics PSNR and SSIM. Table 5 shows that our approach gives a good SROCC and PCC on both datasets and outperforms the PSNR metric. Moreover, our approach has slightly better results than SSIM, especially on the LIVE dataset. This correlation is due to the fact that SSIM was taken into consideration when training our model. In general, our approach achieves better results than the FR metrics across both datasets.
As for the NR-IQA metrics, we notice that our approach achieves state-of-the-art performance only on TID2013. More specifically, our method was not able to outperform DeepFL-IQA (the best method in Table 5) on the LIVE dataset, since its authors trained and tested their network on this dataset. However, our approach reaches better average values across both datasets. We believe this performance improvement arises for two main reasons: • Our model is able to assess an important feature, i.e., the color component. The human vision system and the subjective quality scores are very sensitive to color information, which is the most critical and straightforward feature that humans perceive when viewing an image.
• The random forest can generate new datasets from the existing data by creating samples with replacement.
Therefore, and in contrast to deep networks that need large datasets, the random forest achieves high accuracy on small datasets, since multiple resampled versions of the dataset are generated.
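The bootstrap sampling behind this property (bagging) can be sketched as:

```python
import numpy as np

def bootstrap_sample(data, rng):
    """Draw a sample of the same size with replacement: each tree in a
    random forest is trained on such a resampled version of the dataset."""
    idx = rng.integers(0, len(data), size=len(data))
    return data[idx]

rng = np.random.default_rng(42)
dataset = np.arange(100)
samples = [bootstrap_sample(dataset, rng) for _ in range(5)]
# On average ~63.2% of the distinct points appear in each bootstrap sample.
coverage = [len(np.unique(s)) / len(dataset) for s in samples]
```

Each tree therefore sees a different resampled view of the data, and aggregating the votes of many such trees reduces variance, which is why the forest stays accurate on relatively small datasets.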

Evaluating The Images Quality After Applying a Manipulation Function
Our objective is to estimate the images' quality that may be affected after applying a manipulation function.
Hence, our goal is to find the most suitable function that can guarantee an acceptable trade-off between the images' quality and the constraints imposed by an administrator/user. We use three manipulation functions in this study, mainly considered as protection functions: Pixelation (a.k.a. mosaicking), Gaussian blurring, and Median blurring. For each manipulation function, we choose fixed parameters while allowing users to select weights for the quality methods based on their preferred image features. Table 6 shows the manipulation functions' parameters and the quality methods' weights. We prioritize the Face Alignment over the remaining features due to its high importance in face detection and facial expression analysis. To find the most appropriate manipulation function, we process 9,000 images from the Twitter stream and partition them into three equal parts based on the number of persons (from one to three) contained in each image. After applying the manipulation functions to each image, we obtain Table 7 and the graph shown in Figure 5. These results represent the average values of the predicted quality scores from the Machine Learning Model (µ_im) and the Face Alignment (F_alignment) for each manipulation function over the 9,000 images.
According to Figure 5, the manipulation function with the highest image score is the Gaussian blur. Moreover, as stated in Table 6, this function meets the users' needs since it preserves several features, including structure, contrast, color, and, most importantly, the location of the facial landmark points. Hence, this function guarantees that the facial expressions are still recognizable while maintaining the remaining features. We recall that the image quality assessment is achieved through the Machine Learning Model (µ_im) and the Face Alignment (F_alignment).
Figure 5: The dependence between a manipulation function and the image quality score.

Evaluating The Framework in Real-Time
Since IQA measures are often used in real-time applications, speed is an important issue in determining whether an IQA measure can be used in these applications. For this purpose, we treat 50,000 images from the Twitter stream. We deployed our framework in Apache Storm distributed system to measure the execution latency at each component by conducting two sets of experiments. In the first one, we vary the number of images from 5,000 to 50,000 while fixing the number of nodes to 4. As a result, we obtain the graph shown in Figure 6. In this Figure, we can see that when the number of images is incremented by 5,000, the execution latency gets higher values due to the fact that a node needs to execute more images in each run.
In the second set, we vary the number of nodes from 2 to 7 while fixing the number of images to 50,000.
Therefore, the results are shown in Figure 7.
We notice that as the number of nodes increases, the time required to process the images decreases due to the larger number of workers, which improves Apache Storm's performance. More specifically, according to Table 2, the backbone network used in this test needs 2.62 ms to process an image. However, as shown in Figure 7, the maximum time needed for our Machine Learning Model to predict the quality score is 1.811 ms (marked with a red circle). The time is reduced because we use the CNN only for feature extraction, and the random forest has a lower time complexity than the CNN in the classification step. Consequently, our model is 3 ms faster than the best state-of-the-art approach (DeepFL-IQA).
Furthermore, according to the literature [31,32], face alignment takes an average of 1 s to verify and recognize the facial expressions. In our work, this time drops to 0.643 s (marked with a red rectangle), since we reduce the face feature vector's dimension by keeping only the most representative parts of the face.

Conclusion
In this paper, we presented a framework that assesses the quality of faces in images that may be distorted during the processing and transmission phases, while treating these images in real-time. The images' quality is estimated using a Machine Learning Model that integrates a deep network with random forests; we trained this model using three FR quality metrics. We also used face alignment as a second metric to estimate the faces' quality. Three sets of experiments were conducted to evaluate our approach.
In future work, we intend to provide a framework that can process a wide range of multimedia data, such as videos and audio. Moreover, we aim to consider more features in assessing the images to improve quality prediction accuracy while minimizing the execution latency, especially at the face detection and recognition components.