Vision Acuity Index: a novel metric to select the best-fit computer vision algorithm for smart cities and Industry 4.0

Computer vision is considered an ally to solve business problems that require human intervention, intelligence, and criteria. This topic of research has evolved in the XXI century at a fast pace, delivering various alternatives ranging from open source to commercial platforms. With so many options and a growing market, it becomes difficult to decide which one to use, or worse, to realize it was not suited for a given scenario. In this paper we analyze five arbitrarily selected options, tested on a dataset of 755 images to detect persons in an image using object detectors. We analyze the elapsed time to process an image, the error with respect to human observations, the number of persons detected, the correlation of time and person density, the detected object size, and the F1 Score, considering precision and recall. As we found score ties and similar behaviors among the available options, we introduce a novel index that takes into consideration the number of persons and their pixel size: the Vision Acuity Index of computer vision. The results demonstrate this is a good option to serve as an indicator for decision making. The proposed index also has the potential to be expanded to different business use cases, and to measure newly proposed algorithms in the future along with the traditional metrics used previously.

Figure 1. YOLOv3 architecture of Darknet-53. Source image: [6].

The blocks of the HOG descriptor are combined into a final total histogram. Developed by Dalal and Triggs [8] and summarized in an equation, it was originally implemented in C++ and is now available for Python.
where $T_h$ is the total number of features found by the HOG descriptor and $B_i$ are the blocks of the image.

Another performance evaluation is the Receiver Operating Characteristic (ROC) and the Area Under the Average ROC (AUC). The input indicators are recall and false positives; mean accuracy (mA) has also been used for object detection algorithms [11]. Equation 3 takes into consideration positives $P_i$, negatives $N_i$, and $\lambda$ as the number of attributes.
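As a concrete reference for the HOG-based option evaluated later (HOGCV), the following is a minimal sketch of pedestrian detection with OpenCV's pre-trained HOG descriptor and linear SVM; the image path and detection parameters are illustrative, not necessarily the exact configuration used in this study.

```python
import cv2

# Pre-trained HOG + linear SVM people detector shipped with OpenCV
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("sample.jpg")  # placeholder path
# Sliding-window detection over an image pyramid; returns (x, y, w, h) boxes
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), padding=(8, 8), scale=1.05)

print(f"{len(boxes)} persons detected")
for (x, y, w, h) in boxes:
    print(f"person at ({x}, {y}) with size {w}x{h} px")
```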
No matter what path is chosen, each option has a specific performance related to the number of correct objects labeled, the confidence percentage, and the speed of processing.

As proposed at the beginning of this paper, in a public place there will be persons walking by at different distances, and therefore there will be smaller objects that computer vision may or may not have difficulty detecting. In a human analogy, it could be expected that a smaller (more distant) person would be more difficult to identify than a larger person, who could be identified easily. This is a human medical and optical concept known as visual acuity. Since computer vision algorithms are usually measured by accuracy, precision, recall, F1-Score, or F2-Score, but not by how small a detectable object can be, it is important to define the vision acuity index (VAI) of computer vision (Υ) for algorithms (or commercial APIs) by equation 4, where ω is the confidence (probability) of the detected object in the range [0, 1], $S$ represents the proportional surface of the detected object in relation to the whole image surface, as defined in equation 5, and $F1_j$ is the F1 Score of the whole image.
$$S = \frac{w \times h}{W \times H} \qquad (5)$$

where $S$ is the proportional surface of the object, $w, h$ are the object's width and height in pixels returned by the algorithm, and $W, H$ are the image's width and height in pixels.

The methodology to find this index is described in Figure 2, with five steps detailed in the following subsections.
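A minimal sketch of equation 5, assuming the detector returns the bounding box width and height in pixels; the function name and example values are illustrative.

```python
def proportional_surface(w: int, h: int, W: int, H: int) -> float:
    """Equation 5: fraction of the image surface covered by the detected object."""
    return (w * h) / (W * H)

# Example: a 60x180 px person inside a 1920x1080 px frame covers about 0.52% of the image
S = proportional_surface(w=60, h=180, W=1920, H=1080)
print(f"S = {S:.4f}")
```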
from PIL import Image
import os, sys

• Processed images are stored at an independent location per algorithm, with the same filename as the original images, but transformed to grayscale. The pictures also have a bright green rectangle and a red dot at the object's center to quickly identify the detected person (see Figure 3).
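A minimal sketch of this storage step using PIL, assuming each algorithm returns a list of (x, y, w, h) boxes per image; the paths and the helper name are placeholders rather than the exact scripts used in the experiment.

```python
import os
from PIL import Image, ImageDraw

def save_annotated(src_path: str, out_dir: str, detections: list) -> None:
    """Store a grayscale copy with a bright green box and a red dot at each object's center."""
    img = Image.open(src_path).convert("L").convert("RGB")  # grayscale pixels, RGB mode for colored marks
    draw = ImageDraw.Draw(img)
    for (x, y, w, h) in detections:
        draw.rectangle([x, y, x + w, y + h], outline=(0, 255, 0), width=3)
        cx, cy = x + w / 2, y + h / 2
        draw.ellipse([cx - 4, cy - 4, cx + 4, cy + 4], fill=(255, 0, 0))
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, os.path.basename(src_path)))  # same filename, per-algorithm folder
```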

• Set a ground truth by human inspection. This is achieved by recording the human inspection of each picture regarding how many persons they detect. In our experiment, eight volunteer students from Instituto Tecnológico de Ciudad Juárez participated and recorded their findings.

This information is used to calculate the mean persons detected by humans (mPDH).

• Compare the persons detected by each method against the mPDH to quickly find the error in each image. Three values can be obtained when using equation 6: a) the error is negative, meaning the computer algorithm under-detected persons; b) the error is zero, meaning the computer algorithm was accurate; and c) the error is positive, meaning the algorithm detected more persons than the human eye.
$$\varepsilon_i = CV_i - mPDH_i \qquad (6)$$

where $\varepsilon$ is the measured error, $i$ is the $i$-th picture, and $CV$ is the number of persons detected by the computer vision algorithm (a short code sketch of this calculation is given after this list). The errors are then analyzed to identify anomalies and possible patterns in each algorithm/method.

• Finally, the calculation of the F1 Score for each image is required in order to apply equation 4 and obtain the VAI for that picture. The F1 Score is given by equation 9:

$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} \qquad (9)$$

The first analysis was each option's speed performance. From the data insight shown in Table 1, it can be observed that the fastest option is Yolo3Tiny and the slowest is HOGCV; Azure presented the maximum elapsed time, at more than 10 seconds, while Yolo3Tiny reported the minimum elapsed time. Figure 4 shows the resulting distribution for each option, where HOGCV presents a bimodal distribution, suggesting two main groups of elapsed times, while the rest of the options show a more regular distribution.
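Before analyzing the results further, the per-image error (equation 6) and F1 Score (equation 9) computations described in the methodology can be sketched minimally as follows; the function and variable names are illustrative only, not the scripts used in the study.

```python
def detection_error(cv_count: int, mpdh: float) -> float:
    """Equation 6: negative = under-detection, zero = exact match, positive = over-detection."""
    return cv_count - mpdh

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation 9: harmonic mean of precision and recall for one image."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: an option reporting 3 persons when the human mean (mPDH) is 4.5 under-detects
print(detection_error(cv_count=3, mpdh=4.5))        # -1.5
print(f"F1 = {f1_score(tp=5, fp=0, fn=1):.4f}")     # 0.9091
```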

Another datum coming from the observations is the number of objects detected per image. It is important to determine whether there is an observable correlation between the number of objects detected and the speed of the algorithm. Figure 5 shows Azure as the most uniform performance, with the least discernible pattern.
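A minimal sketch of how such a correlation check could be computed per option, assuming the per-image elapsed times and object counts have been collected; the arrays below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Placeholder measurements: per-image elapsed time (s) and objects detected for one option
elapsed_seconds = np.array([0.08, 0.11, 0.09, 0.21, 0.15])
objects_found = np.array([1, 2, 1, 6, 4])

# Pearson correlation between processing time and person density
r = np.corrcoef(elapsed_seconds, objects_found)[0, 1]
print(f"Pearson r = {r:.3f}")  # r near +1 suggests speed degrades as more persons appear
```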
Figure 6 shows a correlation between Yolo3Tiny's speed and the maximum number of objects detected.

It can be observed that it has the lowest elapsed time but also the lowest number of objects found; Azure follows a similar pattern, as does Yolo3. It is also interesting to observe HOGCV with the second worst behavior.

Taking mPDH as the true label to match, Table 3 shows the results of the error ε; for instance, $SEM_{HOGCV} = 0.1617$. The lowest SEM corresponds to the Yolo3 option and the second best is AWS, while the worst option is HOGCV. This metric is very useful, because just plotting the errors could mislead the interpretation. Figure 10 shows the comparison of the algorithm errors with the best and worst SEM, Yolo3 and HOGCV respectively. As can be seen in the figure, plotting their errors alone does not show a difference significant enough to conclude that HOGCV is worse than Yolo3.
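A minimal sketch of the SEM comparison, assuming each option's per-image errors ε are available as an array; the values below are illustrative, not the experiment's data.

```python
import numpy as np

def sem(errors: np.ndarray) -> float:
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    return errors.std(ddof=1) / np.sqrt(len(errors))

# Illustrative per-image errors for two options; the larger SEM flags the less consistent one
errors_option_a = np.array([0, -1, 0, 1, 0, 0, -1, 0])
errors_option_b = np.array([-3, 2, -4, 1, 0, -5, 3, -2])
print(f"SEM A = {sem(errors_option_a):.4f}, SEM B = {sem(errors_option_b):.4f}")
```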
Table 3. Error count per option, divided into fewer objects found (ε < 0), equal match (ε = 0), or more objects found (ε > 0).

For the 775 images in the dataset, all algorithms registered 14,757 objects found. Figure 11 describes a very regular object width and height, with two outstanding findings. First, Azure seems to be more relaxed in the location of the object within the image, presenting slightly higher values in both dimensions.

The second finding is that HOGCV has the tightest area among these five options, which could explain why its performance was not very good compared with the other four. Each object was detected with a specific confidence, and Figure 12 shows the distribution of those confidences for each available option. From this analysis, AWS seems to have the best confidence, with the majority of its detections above 0.9 confidence, and Yolo3 follows the same distribution pattern. Between these two options, AWS detected 1110 more objects than Yolo3; AWS and Yolo3 describe a very similar pattern, and the option most likely to approach this behavior is Azure (Figure 13).

The last example from the dataset is Figure 16, showing a person in the foreground and five more bystanders of different sizes. Again, Yolo3 detected five persons with an F1 Score of 0.9091; Yolo3Tiny and Azure found one person each, reaching an F1 Score of 0.2857; AWS found all six persons, therefore its F1 Score is 1; and HOGCV detected one person and produced a false positive, for an F1 Score of 0.2500.
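As a quick check of the Figure 16 values above, the Yolo3 case (five of six persons found and, consistent with the reported score, no false positives) works out under equation 9 as:

$$precision = \frac{5}{5} = 1, \qquad recall = \frac{5}{6} \approx 0.833, \qquad F1 = \frac{2 \cdot 1 \cdot 0.833}{1 + 0.833} \approx 0.9091$$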

These results are interesting but can also mislead the decision when selecting the best option for a specific application: an equal F1 Score was found for two algorithms that detected different numbers of persons.

To discuss in depth the results obtained by applying VAI, Table 5 shows the VAI obtained for each figure and available option. Figure 14 obtained an F1 Score of 1 with Yolo3, AWS, and Azure, meaning the three of them are excellent options. Calculating the VAI shows that Azure has a lower index compared to Yolo3 and AWS, and Yolo3 has a slight edge over AWS. That is explained by Yolo3 reporting a higher confidence level for each detected object. In Figure 15, where all persons are of similar dimensions, Yolo3, AWS, and Azure again obtained a perfect F1 Score. Once again, Azure has the lowest VAI of those three. This is explained by the confidence reported by Azure for each of the detected persons, which has a mean of 0.6758 and could lead to false negatives in real applications or multi-object detection.

Yolo3 and AWS again differ slightly in VAI: Yolo3 had a mean confidence of 0.9871 and AWS had 0.9746; however, AWS reported tighter boxes, meaning a more precise object location and therefore a higher VAI.

However, relying solely on this variable is cumbersome and requires a vast amount of time investment to reach a good evaluation. This eventually leads to evaluating the results with the F1 Score; nevertheless, it was also observed that the F1 Score is not enough to decide which option to use in an application.

The novel Vision Acuity Index provides a better comparison measurement to evaluate two or more options that may have the same F1 Score or similar metrics. This is very important when selecting an option for a business application where the only measurements available are commercial specifications, whether data scientists are training their own solutions or leveraging an existing one.

Business projects can benefit greatly from using this VAI as a method to select object detection algorithms.

The following abbreviations are used in this manuscript: