Darknet on OpenCL: A multiplatform tool for object detection and classification

The goal of this article is to overview the challenges and problems on the way from state-of-the-art CUDA-accelerated neural network code to multi-GPU code. For this purpose, the authors describe the journey of porting an existing, fully featured CUDA-accelerated Darknet engine on GitHub to OpenCL. This article presents the lessons learned and the techniques that were put in place for this porting. There are a few other implementations on GitHub that leverage the OpenCL standard, and a few have tried to port Darknet as well. Darknet is a well-known convolutional neural network (CNN) framework. The authors of this article investigated all aspects of the porting and achieved a fully featured Darknet engine on OpenCL. The effort focused not only on classification using the YOLO1, YOLO2, YOLO3, and YOLO4 CNN models; other aspects were also covered, such as training neural networks and benchmarks to identify weak points in the implementation. Compared with the standard CPU version, the GPU computing code substantially improves the Darknet computing time by using underutilized hardware in existing systems. Because the port is OpenCL-based, it is practically hardware-independent. The authors also improved the CUDA version as Darknet-vNext.

One observation that the authors made after approximately two years in the AI and GPU computing community is that most questions in forums and mailing groups concern very deep technical details. Therefore, we wanted to include such details in this study.

The GPU generic architecture
The best way to understand computing for any GPU device is to compare it with CPU computing, which is designed for long-running tasks consisting of threads and processes in the operating system. Any modern CPU has several compute cores fed by memory organized in three levels of cache memory: L1, L2, and L3, where L1 is the smallest and fastest level. Each core is a sophisticated computing machine organized to uncover independent instructions fetched from the memory in a stream and execute these instructions as quickly as possible in parallel.
The core architecture can be compared with a large Walmart supermarket. 5 Clients (instructions) enter the market (the core) sequentially and spread over the shopping hall to find the proper stands, store shelves, or lockers (execution units) that serve the purpose of their visit. Some clients (instructions) take something from the market, and others bring something to transform or leave for the next client (instruction). Clients (instructions) queue in front of the stands, waiting to be serviced or waiting for the arrival of goods (data). When a client's mission is accomplished, the client (the instruction) queues to pay and exits the market through very wide exit doors in the same order as they entered.
The goal of the core is to create the illusion that instructions are executed strictly sequentially, as they are ordered in the code. In reality, instructions are executed by the core in parallel, despite being mutually dependent in an unpatterned manner. The cores execute loosely connected threads and tasks that collectively give the impression of a centrally managed computational system.
Conversely, GPUs have thousands of cores. Each core executes instructions in exactly the same order as they are fetched from the local memory.
Instructions are pipelined, and the data they transform are streamed from the local memory. If the code consists of a myriad of loosely connected tasks that can be executed in parallel, and we can repeat the execution and data-exchange episodes in a well-established pattern, then we can schedule usable work for each of the thousands of GPU cores. In this manner, the task can be completed quickly. There is one more important aspect to consider regarding GPUs: video RAM (VRAM) versus system RAM (RAM). To allow fast operations on GPUs, we must transfer data from RAM to VRAM, because a GPU has no access to the RAM itself from OpenCL code. In fact, in some systems, such as certain Intel or NVidia systems, RAM and VRAM are not separated. However, those types of systems are not described in this article.
CPUs and GPUs are used simultaneously to achieve a synergistic effect. The above overview provides, in brief, a sufficient understanding of the general CPU/GPU architecture for practical computation usage.

OpenCL library for GPU computing
OpenCL 6 abstraction fits all modern GPUs and uses context, queue, kernel, and memory buffer abstractions. This design is very practical: a context covers all GPUs in the platform in the system, a queue is used for computing tasks on each GPU, and a kernel is code that can be compiled and added for execution in a queue. Furthermore, the memory abstraction of the VRAM is used to transfer data to and from the system's RAM. This practical definition provides readers with an understanding of GPU versus CPU hardware. It should be noted that a GPU has many more cores than a CPU, which allows small computation tasks to be performed on the GPU. Processing these tasks is much more efficient than on a CPU, which is optimized for long tasks, threads, and processes in the operating system. This is the essence of GPU computing practice. Figure 1 presents the OpenCL program execution steps. 7
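As a point of reference, the steps of Figure 1 map onto the host API roughly as follows. This is a minimal sketch in plain C, assuming a single GPU and omitting all error checking; it is illustrative, not the Darknet port's actual setup code.

#include <CL/cl.h>

/* Minimal OpenCL host flow: platform -> device -> context -> queue ->
   program -> kernel -> buffer -> enqueue -> read back. */
void run_once(const char *src, const char *name)
{
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue que = clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, name, NULL);

    float ram[1024] = {0};                                  /* RAM buffer  */
    cl_mem vram = clCreateBuffer(ctx, CL_MEM_READ_WRITE,    /* VRAM buffer */
                                 sizeof(ram), NULL, NULL);
    clEnqueueWriteBuffer(que, vram, CL_TRUE, 0, sizeof(ram), ram, 0, NULL, NULL);

    clSetKernelArg(kern, 0, sizeof(cl_mem), &vram);
    size_t global = 1024;
    clEnqueueNDRangeKernel(que, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(que, vram, CL_TRUE, 0, sizeof(ram), ram, 0, NULL, NULL);

    clReleaseMemObject(vram); clReleaseKernel(kern); clReleaseProgram(prog);
    clReleaseCommandQueue(que); clReleaseContext(ctx);
}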

Deep learning neural networks
Practical implementations of artificial neural networks (ANNs) exploded ten years ago. 9 ANNs have been studied as a specific tool for machine learning since the middle of the last century. Conceptually, they are inspired by the nervous systems of animals. The base element of an ANN is a neuron, an element that nonlinearly and nonreversibly maps multiple inputs to one output. This element implements the idea of separating the input data into two independent classes with one sword cut. Even one neuron has useful functionality, and under the name "perceptron," it was implemented over fifty years ago. 9 Meanwhile, the idea of multilayered ANNs emerged, in which each layer is fed by the output of the previous layer, the number of layers is counted in hundreds or thousands, and the number of neurons can reach millions. Such an ANN is known as a deep neural network (DNN). A neural net of this kind is especially suitable for classification problems, which arise in many different areas such as natural language translation, automatic face recognition, or automatic driver assistance systems, to name only a few.
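As a point of reference (a standard formulation, not specific to Darknet), a single neuron computes

$$ y = \varphi\left( \sum_{i=1}^{n} w_i x_i + b \right), $$

where the x_i are the inputs, the w_i are trainable weights, b is a bias, and φ is a nonlinear activation function; learning adjusts the w_i and b.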
Figure 1: OpenCL program execution. 7,8

For an ANN to be useful, each neuron's parameters must be trimmed in a process of learning that is very similar to a multiparameter optimization process. The number of trimming parameters can reach the hundreds of millions for a useful ANN, and the learning requires computing power that only a few supercomputers attain. However, a new concept emerged. It has been mathematically proven that for any sophisticated ANN, there exists a three-layered ANN that is functionally identical to the original one. Therefore, scientists limited themselves to training only three-layer ANNs in the hope of finding, in each case, the Holy Grail, that is, the simplest net that can realize the functionality.
However, the initial results were disappointing. Few neural nets could be trained to the point that they could solve real-world problems. The hypothesis at that time was that problems arose in the trimming parameters. Mathematically, real numbers comprise an infinite set, which is dense and continuous. However, these abstract mathematical properties are not easily imitated by computers. For computers, the trimming parameters are numbers obtained from a large yet finite set. Perhaps the problem was related to the learning algorithms used. Nevertheless, just 10 years ago, personal computers and workstations were not ready for the fast computing necessary to model deep-learning neural networks.
The training of some of the models that we use today is simply reserved for high-performance computing (HPC) servers, also often called computation grid clusters (CGCs). Currently, we have workstations with GPUs that can be compared to small HPC or CGC server farms. For example, the authors of this study used a workstation with two NVidia Titan RTX GPUs that together offered almost ten thousand compute cores and 48 GB of VRAM. This computing power accelerates the training of deep-learning neural networks. Moreover, mathematical models are ready to learn the features of, and differences between, cats, dogs, cars, buses, and other objects.
We tested Darknet on OpenCL using five models: CIFAR-10, YOLO1, YOLO2, YOLO3, and YOLO4. All the models are deep. CIFAR-10, VOC, 10 and COCO 11 are benchmarks for training and can be used to validate any classification algorithm.

PORTING METHODOLOGY
The Darknet CNN engine can model various types of DNNs and allows, by a simple change in configuration, the use of DNN or CNN models. 12 These models exhibit excellent performance owing to their unique architecture and the introduction of GPU acceleration. Unlike in region proposal classification networks (Fast R-CNNs) or deformable parts models (DPMs), input images are not processed in a deep pipeline. Object detection and classification are reduced to a regression problem by the YOLO neural network. Objects are searched for and classified simultaneously throughout the image. 13 In addition, the Darknet models have been accelerated by modern CUDA-compliant GPUs. Video streams can be examined in real time.
The authors of Reference 13 claim that YOLO's base network runs at 45 fps on the TITAN X graphics card, while the fastest, simplified "tiny" versions of the network can process 145 fps. The performance of the Darknet engine prompted the authors to move the engine to OpenCL to allow its use with all modern hardware and GPU-accelerated software.

GPU-computing challenges
Before explaining the porting methodology, we list the general problems that readers can find in any technical implementation of GPU and CPU computing. Problems that are solved early can reduce the implementation time.

Abstraction of VRAM
In OpenCL, VRAM cannot be directly addressed or accessed by the CPU, and the GPU cannot address or access conventional RAM. The data to be processed by the GPU must first be transferred from RAM to VRAM. After the calculation is completed, the most important results must be transferred back from VRAM to RAM. The transfer mechanism is hidden in the OpenCL implementation. A helpful "pair" rule is that every buffer in RAM should be permanently bound to a buffer in VRAM. The Darknet project uses "pull" and "push" conventions for VRAM: "pull" transfers data from the VRAM buffer to the RAM buffer, and "push" transfers data from RAM to VRAM.
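A minimal sketch of the two conventions, using the raw OpenCL API (the port wraps this in its own helpers, and the names here are illustrative):

/* "push": RAM -> VRAM for a permanently paired buffer. */
static void push_array(cl_command_queue que, cl_mem vram,
                       const float *ram, size_t n)
{
    clEnqueueWriteBuffer(que, vram, CL_TRUE, 0, n * sizeof(float),
                         ram, 0, NULL, NULL);
}

/* "pull": VRAM -> RAM for the same pair. */
static void pull_array(cl_command_queue que, cl_mem vram,
                       float *ram, size_t n)
{
    clEnqueueReadBuffer(que, vram, CL_TRUE, 0, n * sizeof(float),
                        ram, 0, NULL, NULL);
}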
To formalize the "pair" rule, a cl_mem_ext abstraction is introduced. This was the key to success after many different techniques were tried by the authors. The cl_mem_ext type can be used wherever the cl_mem structure is used; the latter represents VRAM in OpenCL. The cl_mem_ext structure contains a cl_mem as well as a pointer to the RAM associated with the VRAM space.
Because of the cl_mem_ext abstraction, the code not only gains readability but is also accelerated during data exchange between RAM and VRAM, and vice versa. This is because, on the Intel platform, creating a cl_mem (VRAM) is associated with direct access to I/O map pointers. Given that OpenCL runs quite slowly on Intel GPUs, even this slight acceleration of the memory exchange is noticeable. The abstraction described here was used wherever GPU computing was used. However, the authors noted that with CUDA, memory exchange between RAM and VRAM is approximately 10 times faster than in OpenCL in every situation, and this seems to be the main weakness of the OpenCL implementation.
After careful analysis of the code, it turned out that the "pair" rule is widely used in the CNN Darknet engine code. However, in the CUDA version, this rule is repeatedly broken. There are three significant exceptions to this rule in the OpenCL implementation, all in the training process of the YOLO1, YOLO2, YOLO3, and YOLO4 models. At the end of a training step, layer.output_gpu (cl_mem_ext) is rewritten to the net input (float *) to calculate LOSS (YOLO uses the sum-squared error, SSE, for the LOSS function), AVG IOU (intersection over union of the predicted and ground-truth rectangles), AVG CLASS (the class of detected objects), and other factors of the training step that can be computed efficiently only in the CPU code.
Before the cl_mem_ext abstraction was conceived, there were a few failures in the porting. Because OpenCL Darknet was tested with end-to-end CIFAR-10 training, the testing process was time-consuming, which created opportunities to introduce many bugs. The cl_mem_ext abstraction helped solve many of the problems encountered, and it can be considered a best practice for proper OpenCL abstraction usage (Listing 1).
typedef struct _cl_mem_ext cl_mem_ext;

typedef struct _cl_mem_ext {
    cl_mem mem;
    cl_mem org;
    size_t len;
    size_t off;
    size_t obs;
    size_t cnt;
    cl_mem_ext (*cln)(cl_mem_ext buf);
    cl_mem_ext (*inc)(cl_mem_ext buf, int inc, size_t len);
    cl_mem_ext (*dec)(cl_mem_ext buf, int dec, size_t len);
    cl_mem_ext (*add)(cl_mem_ext buf, int add, size_t len);
    cl_mem_ext (*rem)(cl_mem_ext buf, int rem, size_t len);
    void *ptr;
    void *map;
    cl_command_queue que;
} cl_mem_ext;

Listing 1: The cl_mem_ext abstraction

All cl_mem_ext fields are described in detail, one by one, below. The abstraction has a few more uses not covered in this section, but the detailed description helps in reading and understanding all of its purposes. It allows all "_gpu"-suffixed fields in structure types to carry all the valuable information about the VRAM, and it is an example of best practice in the C programming language.
• mem - general-usage cl_mem, that is, the VRAM buffer;
• org - the original copy of the cl_mem VRAM, read-only;
• len - length of the VRAM memory buffer;
• off - offset of a possible subbuffer of the VRAM;
• obs - object size, most often sizeof(cl_float);
• cnt - counter of subbuffer "jumps" on mem;
• cln - function that cleans the subbuffer state;
• inc - function used as a "+=" operator;
• dec - function used as a "-=" operator;
• add - function used as a "+" operator;
• rem - function used as a "-" operator;
• ptr - pointer to the RAM buffer, read-only in all cases;
• map - mapped VRAM buffer for to/from RAM transfers;
• que - reference to the OpenCL queue for this abstraction.
A sketch of creating such a paired buffer follows.
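This is a minimal sketch, assuming the cl_mem_ext type from Listing 1; the helper name and the flag choice are illustrative, not the port's exact code.

/* Creating a paired RAM/VRAM buffer: the RAM pointer and the VRAM
   buffer are bound permanently, following the "pair" rule. */
cl_mem_ext opencl_make_array(cl_context ctx, cl_command_queue que,
                             float *ram, size_t n)
{
    cl_mem_ext buf = {0};
    buf.len = n;
    buf.obs = sizeof(cl_float);
    buf.ptr = ram;                                    /* RAM side of the pair */
    buf.que = que;
    buf.mem = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(cl_float), ram, NULL); /* VRAM side */
    buf.org = buf.mem;
    return buf;
}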

Prevention of GPU-computing run-time errors
Code for the GPU cannot be debugged effectively. This concerns the kernels, that is, the functions running on the GPU. There is no memory protection, as there is in the CPU computing runtime. The GPU code is checked statically during compilation, and any errors found there can be fixed. During execution, however, if a parameter is specified that, for example, causes the code to overrun the GPU video memory outside the designated buffer, your computer's operating system may crash, and you must restart it and fix the issue. This is where the run-time tests begin. We used asserts to ensure that the parameters provided to critical kernels were correct just before the call. In the case of erroneous values, the assertion fires at run time in the CPU code, and the value can be easily fixed. In other words, an assert failure just before the call of the kernel function on the GPU saves implementation time because the provided parameters are checked in the assert. The execution time cost of the asserts is minimal and acceptable. Checking the "len" values is possible thanks to the cl_mem_ext abstraction. In the code from Listing 2, the assertion introduced in the CPU code checks the parameters given at runtime to the GPU code against range overruns in the tables passed to the GPU code for calculation. An example is copying data from one cl_mem VRAM buffer to another cl_mem VRAM buffer on the GPU with acceleration. Parameter N is the number of threads and the size of both buffers simultaneously; it is provided by a two-dimensional structure type. Checking the sizes inside the kernel code itself is impossible.
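A minimal sketch of this pattern follows; the kernel body and the host-side dispatch are simplified reconstructions of Listings 2 and 3, not the port's exact code.

/* Device side: the kernel has no way to validate N against buffer sizes. */
__kernel void copy_kernel(int N, __global const float *X, __global float *Y)
{
    int i = get_global_id(0);
    if (i < N) Y[i] = X[i];
}

/* Host side: cl_mem_ext carries "len", so the assert fires in CPU code
   just before the kernel call, where it is cheap to debug. */
#include <assert.h>
assert(N <= (int)x_gpu.len && N <= (int)y_gpu.len);
clSetKernelArg(kern, 0, sizeof(int), &N);
clSetKernelArg(kern, 1, sizeof(cl_mem), &x_gpu.mem);
clSetKernelArg(kern, 2, sizeof(cl_mem), &y_gpu.mem);
size_t global = (size_t)N;
clEnqueueNDRangeKernel(x_gpu.que, kern, 1, NULL, &global, NULL, 0, NULL, NULL);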
Multithreading for GPU-computing modeling

The data model for multithreading is the key to success in improving the computing speed on any workstation or multi-GPU server with a modern operating system. Multithreading issues should be solved using the right data model, which helps avoid synchronization techniques by correctly separating the data. Obviously, thread synchronization is necessary to combine the obtained values; however, because of the inevitable thread stalling, it should be used sparingly. In the solution described in this article, each thread runs in a separate OpenCL queue, which isolates the programs and kernels running in that queue, such as the Listing 3 kernel, from the other programs and kernels. Listing 4 shows the declaration of the most important data model used for multithreading. The variables declared in Listing 4 are described in detail below. This is a proven formula for multithreading modeling in the C programming language. It can be reused to separate problems using a data model whenever calculations take a long time and the results must be synchronized by combining events and task data. The "__thread" modifier in C declares thread-local storage: each thread has its own static copy of the value. Therefore, it is important to set its value, the appropriate OpenCL queue identifier, as early as possible in each thread. Only one value of the "opencl_context" type exists. Meanwhile, "opencl_queues" and "opencl_devices" are declared as dynamic run-time arrays whose lengths equal the number of physical GPUs to be used. In our Darknet port, this number can be specified using the "-gpus" command-line parameter; for example, the argument value "1,2" means using 2 GPUs with indexes 1 and 2 on the OpenCL platform. Separate array occurrences are also created for each kernel, that is, for each chunk of code executed on the GPU. The details of the thread model, sketched after this list, are as follows:
• *gpusg - pointer to the global array of the GPU indexes;
• ngpusg - global number, or counter, of all GPUs to use;
• opencl_device_id_t - GPU device id in a particular thread;
• opencl_device_ct_t - GPU device counter in a particular thread;
• opencl_context - the single global context of the OpenCL platform;
• opencl_queues - global array of OpenCL queues (for multi-GPU);
• opencl_devices - global array of OpenCL devices (GPUs).
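A sketch of the Listing 4 declarations, reconstructed from the descriptions above; the exact types in the port may differ.

int *gpusg;                        /* global array of GPU indexes          */
int ngpusg;                        /* global counter of GPUs to use        */

__thread int opencl_device_id_t;   /* GPU device id in this thread         */
__thread int opencl_device_ct_t;   /* GPU device counter in this thread    */

cl_context opencl_context;         /* one context for the OpenCL platform  */
cl_command_queue *opencl_queues;   /* one queue per physical GPU           */
cl_device_id *opencl_devices;      /* one device entry per physical GPU    */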
This solution has been proven able to ensure the separation of multithreading concerns by the data model. It is used not only in the "opencl_set_device" function. The entire multithreaded process also requires the separation of array pointers to the compiled GPU kernels. However, with "opencl_device_id_t" the correct entry can be easily accessed by indexing these arrays. An example of the training logs is shown in Figure 3.
A powerful GPU load-tuning mechanism was introduced while working on the Darknet OpenCL port. The idea is to use more than one thread, a local thread space, and data models for the function being called. For example, this idea is used for the fast_mean function. The essence of the solution is the "tuning" parameter, which determines the multiplication factor for the number of GPU threads assigned to the kernel function. Code execution with a tuning parameter equal to 1 is also multithreaded, but the threads are long-running; a tuning parameter equal to 16 results in 16 times more threads and 16 times less work per thread (Listing 6). It works very well and speeds up the execution of the innermost loop by the tuning value, which is computed dynamically by dividing the "filters" variable by 4. One last important note is that parameter "t" cannot be correctly checked in conditions or printed out. Listing 7 shows one of the core functions used to calculate the average values from the data collected in a three-dimensional "x" array (tensor) on a GPU.
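The following is a sketch of the Listing 7 kernel, reconstructed from the description that follows; the exact upstream signature may differ, and this sketch uses a work-group barrier for the final aggregation, which the port's data model may organize differently.

__kernel void fast_mean_kernel(int tuning, int filters, int batch, int spatial,
                               __global const float *x, __global float *mean)
{
    int i = get_group_id(0);   /* one work-group per filter index i        */
    int t = get_local_id(0);   /* "tuning" threads share the work for i    */

    __local float sums[64];    /* assumes tuning <= 64                     */

    float s = 0.0f;
    for (int b = 0; b < batch; ++b) {
        /* the innermost index advances by "tuning": each thread touches
           every tuning-th element of the spatial plane of filter i        */
        for (int k = t; k < spatial; k += tuning)
            s += x[(b * filters + i) * spatial + k];
    }
    sums[t] = s;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (t == 0) {              /* aggregation under the "t == 0" condition */
        float total = 0.0f;
        for (int n = 0; n < tuning; ++n) total += sums[n];
        mean[i] = total / (batch * spatial);
    }
}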
The calculated values are returned in a one-dimensional "mean" array. The function takes the value of variable "i," which is the index of the output array "mean." Many calls of the function with different values of the "i" parameter, ranging from 0 to "filters" − 1, fill in the "mean" table with the calculated values. Each call is a separate, concurrent thread on the GPU, and the "mean" array is not necessarily filled in order. Each call executes for a relatively long time because the variable "spatial" is at least 10 k. The issued threads do not occupy the entire GPU because the "filters" variable usually does not exceed 128 and is less than the number of GPU cores. This is optimized in the following manner. In the innermost loop, the "k" index increases by the value of the "tuning" variable. A value of the "tuning" variable greater than 1 means that the number of function calls must be the product of the "tuning" and "filters" variables for the calculations to work correctly (the variable "t" changes from 0 to "tuning" − 1 for all values of "i"). The values are collected in the array "sums," which is then aggregated under the condition "t == 0," for all values 0 to "filters" − 1, into the "mean" output array. Even with relatively small values of the "tuning" variable, all GPU cores participated in the calculation. The kernel code is completely free of locks and atomic operations. This optimization is based on a data model, not a synchronization model. In other words, it is an optimization method where the same code is called in separate threads as many times as the value of the "tuning" variable, and it can be considered an OpenCL optimization good practice.

Other challenges

The performance of the solution was tested by introducing the BENCHMARK compilation flag to collect a verbose log of one step of computation.
Calculation problems were solved with the built-in JetBrains CLion debugger, which helps the user easily dig into the source of errors. Changes were tracked and compared with the original code using Scooter Software's Beyond Compare. The rule-based comparison used in this tool is particularly helpful. A switch was implemented to enable or disable the GPU at runtime: an abandoned parameter scheduled in the original Darknet code for the same purpose was fixed and re-enabled as the "-nogpu" switch. Some of the computations were verified by a sandbox project in a separate environment. 15 The Darknet OpenCL port was tested on AMD Radeon VII, NVidia Titan RTX, Intel Iris 655, and Mali GPUs on macOS or GNU/Linux, depending on the computer device capability.

Porting Darknet to OpenCL
This subsection provides readers with information on the practical method used to port the Darknet engine from a CUDA-based to an OpenCL-based solution. First, the authors removed the entire CUDA code from the GitHub fork repository. Thus, the shortened code could not be compiled. To satisfy the compiler, all methods with the "cuda" prefix were renamed to start with the "opencl" prefix, and empty methods were created in the opencl.c file, with an opencl.h header file, to make compilation possible. Obviously, while the project now compiled as C, it did not work. Therefore, we needed to create a set of methods with similar signatures to create memory buffers and load data from RAM to VRAM, replacing the CUDA code with OpenCL code. In addition, at the replace stage, all "_gpu"-suffixed fields that were "float*" or "int*," for example in the layer type, were replaced with "cl_mem_ext," the special abstraction over "cl_mem." Why was the "cl_mem_ext" type introduced? This abstraction is required to store a few things, the most important of which is the pointer to the RAM buffer and the "cl_mem" VRAM buffer kept as a pair. In the OpenCL code, it is strictly necessary that each "cl_mem" has its own "float*" or "int*" pointer. This is a subtle aspect of the OpenCL memory transfer technique. In this way, the authors ensure that most data transfers between RAM and VRAM take place only between statically paired buffers. There were only three exceptions: data must be transferred in violation of this rule in the LOCAL (YOLO1 model), REGION (YOLO2 model), and YOLO (YOLO3 model) layers. There are two main reasons for retaining memory buffers as pairs. The first is performance, which allows for faster data transfer in the Intel OpenCL implementation. The second is to keep the C code consistent. The latter concern was neglected many times in the original Darknet code, because buffer pairing has no impact on performance in the CUDA code. With the OpenCL code, this is not the case, and we had to correct it.
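As an illustrative fragment of the replace stage (the field names follow the Darknet layer structure, but the real structure carries many more fields):

/* CUDA version (original):  float *output_gpu;  float *delta_gpu;        */
/* OpenCL port, after the replace stage:                                   */
typedef struct layer {
    /* ... CPU-side fields, e.g., */
    float *output;              /* RAM buffer                              */
#ifdef GPU
    cl_mem_ext output_gpu;      /* was "float *output_gpu" under CUDA      */
    cl_mem_ext delta_gpu;       /* was "float *delta_gpu" under CUDA       */
#endif
} layer;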
This abstraction ensured that all elements of the code were ready to be improved and filled with correct implementations. Some of the OpenCL code was taken from a few other computationally ready implementations on GitHub, for example, from the Reference 16 fork. The rest was created from scratch by the authors. We know that there are automatic translators from CUDA to OpenCL, such as CU2CL, 17 and some other projects on GitHub use that type of tool, but code made with an auto-translator needs to be checked, and sometimes it simply does not work. We decided to rely on the basic linear algebra subprograms (BLAS) approach, do all the work manually, and create all kernel code from scratch or from the CPU implementations to ensure that all the code would work.
Doing all of the work manually was a very important milestone that ensured that both computation and training worked for the CIFAR-10 model. This model was selected as the fastest option to test the engine and was used frequently to check CNN training processes on the CIFAR-10 image set. Then, all code was carefully and slowly reviewed to identify aspects that were missing or no longer worked compared with the original code. For example, a "-nogpu" switch was added to allow testing of all CPU aspects without recompilation. The first reason was to ensure that each compilation of the C code created 100% of the same binary code. The second reason is that it is better to have a switch to test and compare performance than to recompile the entire engine code every time. The authors also added new compilation switches, such as "BENCHMARK" and "LOSS_ONLY," for testing the performance of the solution and looking for weak points and bottlenecks. Owing to this testing, some OpenCL methods for allocating buffers and copying memory data between VRAM and RAM were improved. Owing to the redesigned multithreading models, CNN models can be trained on multi-GPU systems. Currently, multi-GPU training is possible only on macOS. The reason is that the authors' patched clBLAS library originally had a non-thread-safe implementation. All the above steps make the OpenCL version consistent and accurate. The code was then ready for tests that showed the clear potential of the authors' implementation on GPUs such as AMD-based GPUs.
Support for multi-GPU training was achieved by introducing a trivial general matrix multiplication (GEMM) implementation. It is not performance-optimized but is mathematically correct and multithreading-ready. This implementation is only a slightly tuned matrix multiplication; however, it clearly shows that the clBLAS 18 and CLBlast 19 solutions, which are fast and very well optimized, are not yet ready for multithreaded computation. We believe that these libraries will soon support multithreaded computation; the authors created such an enhancement for clBLAS. 18 Finally, the last improvement worth mentioning is a permutation of the input image set for training. Each time the image set is loaded from storage into RAM and pushed to VRAM, we want to make sure that this set contains only unique images.
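A trivial GEMM of the kind described can be sketched as follows: row-major C = alpha*A*B + beta*C, one work-item per output element, and no shared state, hence safe to call from many host threads. This is an illustration of the approach, not the port's exact kernel.

__kernel void gemm_nn(int M, int N, int K, float alpha,
                      __global const float *A, int lda,
                      __global const float *B, int ldb,
                      float beta, __global float *C, int ldc)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)          /* plain dot product, no tiling */
        acc += A[row * lda + k] * B[k * ldb + col];
    C[row * ldc + col] = alpha * acc + beta * C[row * ldc + col];
}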

TESTING ENVIRONMENT: HARDWARE AND SOFTWARE
The tests were performed on a fairly modern workstation based on an Asus Rampage V Edition 10 motherboard, an Intel i7-5960X 8-core, 16-thread CPU, and 64 GB of DDR4 2400 MHz RAM. The authors used the latest 2019 GPUs. The first tests were run on two NVidia Titan RTX 24 GB GDDR6 VRAM GPUs in an NVLink bridge configuration and two AMD Radeon VII 16 GB HBM2 VRAM GPUs. This configuration allowed the authors to test both single- and multi-GPU computing scenarios. The CUDA solution supports multi-GPU; for the OpenCL solution, the authors improved the original clBLAS library to correctly support multi-GPU on AMD cards as well, but only on macOS. 20 For the present research, the authors decided to use the latest GPUs available on the market, to step ahead and predict that in the near future the same power will be available for the automotive industry. The authors also used the Intel Iris GPU, which is available in almost every notebook with an Intel processor. This GPU is slower than AMD or NVidia processors but requires no additional investment, and all classifications are fully hardware-accelerated.
Last, but not least, definition files (for the Make and CMake tools) were created for compilation on different platforms. The implemented compilation switches were NVidia, AMD, and ARM. They allow GNU/Linux users to choose a supported platform and quickly perform not only tests but also benchmarks or simple training exercises on the lightweight CIFAR-10 model. The NVidia switch enables OpenCL from the CUDA toolkit. The AMD switch turns on OpenCL from the AMD GPU driver. The ARM switch not only enables proper use of OpenCL for ARM on a single-board computer but also enables the trivial GEMM implementation, for testing only. ARM support was tested on a single-board computer that natively supports OpenCL on a Mali T760 GPU. The authors also attempted to use OpenCL on a DSP with BeagleBoard AI and X15 computers to ensure that OpenCL works for detection (YOLO2) and training (CIFAR-10), but the tests passed only on the Mali T760 GPU.
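The switches can be pictured with a Makefile fragment of the following shape; the paths and variable names are illustrative, not the exact build files of the port.

NVIDIA ?= 0
AMD    ?= 0
ARM    ?= 0

ifeq ($(NVIDIA), 1)
    CFLAGS  += -DGPU -I/usr/local/cuda/include   # OpenCL headers from the CUDA toolkit
    LDFLAGS += -L/usr/local/cuda/lib64 -lOpenCL
endif
ifeq ($(AMD), 1)
    CFLAGS  += -DGPU
    LDFLAGS += -lOpenCL
endif
ifeq ($(ARM), 1)
    CFLAGS  += -DGPU -DARM                       # ARM also enables the trivial GEMM
    LDFLAGS += -lOpenCL
endif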

RESULTS OF TRAINING AND COMPUTATION
As readers may expect, CUDA on NVidia hardware runs faster than the OpenCL version of the Darknet CNN engine. 21 However, the OpenCL version can run on any OpenCL-compatible device to accelerate computing. Even the FPGA may be considered the fastest computation device to date, especially when the power consumption of FPGA cards 22 is compared. Another aspect that may be considered by the industry is the choice of hardware for AI computation: the industry must understand what equipment needs to be acquired to create a data center.
We believe that creating the OpenCL port enables the use of new hardware, such as AMD-based hardware, which has recently become strong and fast. The authors feel that researchers in this field may consider purchasing such hardware, but only if the software for both training and computation is ready. Therefore, the gaps in software support for these GPUs were resolved by the OpenCL version.
The authors believe that the dominance of one technology, CUDA, provided by NVidia, is not sustainable in the long run, and OpenCL-based software can contribute to a more sustainable technological development.
Considering the applications of the OpenCL-based engine developed by the authors for neural-network-based calculations, there are potentially many more applications for OpenCL-based software than for the CUDA engine.
For example, virtually all smartphones are equipped with a Mali GPU graphics chip (or similar) and can be used to classify data from many types of onboard sensors, because the Mali GPU supports OpenCL 1.2. 23

There are several other applications in the automotive industry. Mass-produced, cheap GPU-type acceleration chips can be introduced to make decisions essential for the cybersecurity of onboard electronic systems. 24 The important aspect in this case is that even if one of the sensors is damaged, a classifier based on convolutional neural networks will still make the right decision. 25 The use of this type of technology on industrial plant lines allows users to make decisions regarding the condition of their equipment with the help of vibration sensors. Such hardware and sensors enable early detection of a machine that is going to break down, which means that users can order a replacement and swap the machine immediately when a breakdown is predicted. 26 The value of such an early failure warning is very high, as any downtime in a factory involves a huge cost, and it is impossible to keep all replacements in stock. Additionally, there can be many similar consumer-grade applications, such as adjusting salinity in aquariums, lighting control, intelligent home heating, and water heating control. 27-29

Comparison of results on CNN models
The Darknet CNN engine offers several models for comparison purposes. In this section, the authors focus mostly on YOLO2 (Figure 4) and YOLO3, and share the accuracy results for the YOLO1, YOLO2, and YOLO3 models. Each model detects objects correctly, but with different detection accuracy percentages. Figure 7 compares one step of CIFAR-10 training on slower GPUs, whereas Figure 8 shows the comparison for faster GPUs.

Comparison of results on CUDA and OpenCL
First, the same backpropagation computation for the convolutional layer was compared (see Figure 8). The X-axis is logarithmic to better show the differences in timing, expressed as a number of "ticks" returned by the time() C function. All measurements were performed on the same workstation; the only difference was the GPUs used. For the CUDA version, we used 2x NVidia Titan RTX GPUs, and for the OpenCL version, 2x XFX AMD Radeon VII GPUs, with the latest drivers and computation libraries on Ubuntu 18.04 GNU/Linux and macOS.

Figure 7: CIFAR-10 on Darknet timings layer test (slower GPUs). 14
Figure 8: YOLO2 on Darknet timings layer test (faster GPUs). 12,14

Figure 8 shows that CPU computation is approximately 2 ⋅ 10⁵ times slower than the original CUDA version. The OpenCL implementations are several dozen to a thousand times faster than the CPU versions, depending on the hardware used and the implementation details. One part of the OpenCL version is the multiplication of matrices, and both clBLAS and CLBlast were measured. OpenCL with CLBlast is faster for training but does not work in all cases; for example, classification may fail when CLBlast is used. This is why clBLAS is the default library used in the solution. The authors believe that in the future, both the clBLAS and CLBlast libraries will evolve, and Darknet on OpenCL will choose the better one. For now, in the source code of this version, the reader will find in the "patches" folder the "clblast.patch" file, which is ready to apply to replace clBLAS with CLBlast. However, the most surprising result was achieved using OpenCL on macOS. The training time of the first convolutional layer conducted on the AMD GPU was only slightly longer than the training time for the same layer conducted on NVidia CUDA (see Figure 8). The sixty compute units/3840 stream processors of the AMD GPU cannot be easily compared to the NVidia GPU's 576 tensor cores. It seems that both chips are not far from each other in terms of performance, and that a careful implementation of OpenCL does not have to perform worse than the proprietary CUDA technology.
This was even more evident in Figure 9, where the authors provide a detailed computation performance test of all the CUDA and OpenCL kernels used in the GPU comparison test for the first and largest convolutional layer in the YOLO2 model. The technique behind the tests was based on the "ticks" returned by the time() function in the C programming language. In this instrumentation method, the specific method invocations are simply surrounded by tick measurements. This benchmark solution helped not only to compare solutions but also to quickly identify bottlenecks in the OpenCL-based Darknet during implementation. Each time-measurement setup was assumed to be the worst-case scenario: the first-layer matrix sized 608 × 608 (width × height). This case is enabled by the "BENCHMARK" compilation flag and produces a detailed output log for only one worst-case step. Therefore, it can be repeated without a starting-point hazard.
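The instrumentation pattern can be pictured with the following minimal sketch; the measured function name is illustrative, and clock() stands in here as the tick source.

#include <stdio.h>
#include <time.h>

static clock_t bench_t0;

static void bench_start(void) { bench_t0 = clock(); }

static void bench_stop(const char *what)
{
    /* prints the tick count consumed by the surrounded invocation */
    printf("%s: %ld ticks\n", what, (long)(clock() - bench_t0));
}

/* Usage around a measured call:
       bench_start();
       forward_convolutional_layer_gpu(l, net);   // hypothetical call
       bench_stop("forward_convolutional_layer_gpu");
*/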
Part of the log is shown in Figure 9 for a detailed comparison.

Figure 9: YOLO2 on Darknet timings kernels test. 12,14
Figure 10: Darknet on OpenCL comparison: "ganyc717", 30 "Kylin-PHYTIUM", 31 and our "sowson". 14

The measurement overhead can be calculated as the difference between each value in the last row and the sum of all the rows above it in one column. Each column is obtained from a separate log scenario. A green background marks the cases where the macOS/OpenCL kernel performed better than its CUDA counterpart. As can be observed, the use of the clBLAS library in the OpenCL implementation guaranteed a shorter backpropagation computation time than in CUDA. We observed that for some operations the "tick" times were a minimal 0 or 1. This means that the computation was very fast and, in one step, almost impossible to measure, because the parallel execution was very well optimized.
A comparison is presented in Figure 10, where the authors compared their solution 14 with the other two implementations from GitHub. 30,31 The solution described in this article is approximately 50% faster in training and 20% faster in detection on a 1080p@60 movie file.
In addition, the presented solution is the only one that supports the neglected YOLO1 model. Figure 11 plots the loss over time for 10 k steps of the YOLO3 training process for the original CUDA version of the Darknet engine. Figure 12 plots the loss over time for 10 k steps of the YOLO3 training process in the OpenCL version of the Darknet engine. In both figures, the Y-axis is logarithmic. When we compare the plots, we see that the shapes and the values on the Y-axis are almost the same. The only difference is the computation time, which is more than five times longer for OpenCL with the clBLAS version on GNU/Linux.

Example of the application
The Darknet CNN engine allows for almost any type of application. In most cases and models, it is used as a classifier. It is used mostly on images, thanks to the fact that DNNs are able to learn features and classify objects with very high accuracy. However, images can be considered as sets of pixels, and it is possible to use data from many sensors to train CNN models to recognize patterns and detect alerts. This recognition and detection are very important because even if some sensors are incorrect, the overall pattern will still be detected and classified correctly. 32 Another possible application may be based on an Intel® HD graphics engine integrated into an Intel® Core™ microarchitecture. 33 A huge base of industrial-grade PCs is deployed on production lines, for quality control, for charging fees, 34 and so forth; these fully support OpenCL acceleration and can accommodate new DNN/OpenCL-based applications. For example, in a car recognition system, after a car is detected in front of a door, an AI system may be started. Not only is the car's plate scanned, but the entire silhouette is recognized and classified, and the AI system can then open an access door and allow the car to drive to a suggested place in the garage, managed by the forecasting subsystem. 35

Implications for modern engineers
Modern engineers can find substantial value in the Darknet on OpenCL implementation. It can be used for computation and training without requiring recompilation. It is also compatible with the CUDA version; therefore, trained models can be used in both implementations. In total, the OpenCL port contains 142 changed files, 19,431 added lines, and 6650 removed lines of C code. The structure of both projects is the same, which means that as the CUDA version evolves, all new features are easily portable to the OpenCL version.

Unique capability of Darknet on OpenCL
As this article shows, the OpenCL version has a few additional built-in capabilities. Among them are the "BENCHMARK" and "LOSS_ONLY" compilation flags, which allow measuring the computation time of each GPU method and each layer. These computation times allow users to look for weak points and bottlenecks in the implementation.

CONCLUSION
The port of the Darknet engine to OpenCL is nontrivial. Several aspects and code changes were implemented in this study. The accomplishment of this project has received considerable research attention. Thanks to this port, Darknet may be used on macOS and GNU/Linux on OpenCL 1.2+ ready hardware, which brings great value to the entire AI open-source community. The OpenCL version is still slower than the CUDA-based version, in some cases even five times slower, but we believe that soon, thanks to the improved matrix multiplication (SGEMM) capabilities of the clBLAS and CLBlast projects, the OpenCL version will achieve similar or better performance, especially on macOS.