COMBINATORIAL OPTIMIZATION FOR MULTI-TARGET TRACKING

In the tracking-by-detection paradigm for multi-target tracking, target association is modeled as an optimization problem that is usually solved through a network flow formulation. In this paper, we propose a combinatorial optimization formulation and use bipartite graph matching to associate the targets in consecutive frames. Usually, the target of interest is represented by a bounding box, and the whole box is tracked as a single entity. However, in the case of humans, the body goes through complex articulation and occlusion that severely deteriorate the tracking performance. To partially tackle the problem of occlusion, we argue that tracking a rigid body part could lead to better tracking performance than whole-body tracking. Based on this assumption, we generate target hypotheses consisting only of the spatial locations of persons' heads in every frame. After head localization, a constant velocity motion model is used for the temporal evolution of the targets in the visual scene. Qualitative results are evaluated on four challenging video surveillance datasets, and promising results have been achieved.


INTRODUCTION
One of the primary tasks of machine learning is to enable computers to learn from data and automatically make well-founded predictions. Such capabilities have applications in airline scheduling [1], crowd modeling [2], and face recognition based fraud detection [3]. For visual data, machine learning helps to analyze and classify a visual scene [4,5]. In the realm of visual scene analysis, multi-target tracking is one of the most important low-level computer vision problems and provides a backbone for many high-level tasks such as autonomous driving [6,7], action recognition [8][9][10], behavior analysis [11,12], anomaly detection [9,13,14], crowd management [15,15], and sports player analysis [16,17], to name a few. Even though tracking on its own is a low-level computer vision problem, it intrinsically consists of other low-level tasks such as object segmentation [18], object detection [19], and motion modeling [20]. With the advancement in object detection algorithms [21,22], the tracking-by-detection paradigm has become the most suitable one for tracking multiple objects in a visual scene. However, the biggest question that arises is whether to track a part of the object or the whole object.
Until now, almost all tracking algorithms have used whole-body detection [23][24][25][26][27][28][29][30][31][32][33][34]. For example, Milan et al. [27] proposed a highly non-convex cost function for multi-target tracking in which different components such as appearance, detection, motion, and mutual occlusion of targets are combined in a weighted average function. A gradient descent based optimization is used to minimize the cost, and transdimensional jumps are used to escape local minima. Ullah et al. [26] proposed a bag of Bayesian filters to track multiple targets in the scene. Additionally, sparse coded deep features are incorporated to model the appearance of the targets. Schulter et al. [25] modeled multi-target tracking as a network flow graph. Instead of calculating the edge costs of the graph manually through handcrafted features, they learned the edge costs through backpropagation. Similarly, Ullah et al. [34] also generated a directed acyclic graph for multi-target tracking but used deep features for calculating the edge costs of the graph. Dehghan et al. [35] formulated the tracking problem as a multi-clique problem. Initially, a graph is generated from a batch of frames, and later each target trajectory is found as a maximum clique of the graph. Chu et al. [33] proposed a spatial-temporal attention mechanism for occlusion handling and for modeling the interaction among different targets. They extract features from different layers of a CNN to model the appearance of the target. In comparison, [24] proposed a Siamese neural network for modeling the appearance of targets and establishing the association between two targets.
One of the common attributes among all these tracking techniques is that they represent the whole human body as a rectangular rigid object, even though the human body goes through severe articulations. Compared to the standard approaches, in this paper we focus on a relatively rigid part of the human body, i.e., the head of a person. We first generate head hypotheses in all frames and then use combinatorial optimization to associate the target heads in consecutive frames. The organization of the paper is the following: In section 2, the proposed approach is briefly explained. The tracker details are given in section 3. The qualitative and quantitative results are given in section 4, and section 5 concludes the paper with future directions.
Fig. 1: Block diagram of the Kalman filter. It not only keeps track of the target state $x_t$ but also the uncertainty $P_t$ of the state. Initially, a prediction is made for the state of the target using the previously known state and the state transition model. In the update step, the measurement $z_t$ is used to correct the predicted state. In an iterative process, the target is tracked between two consecutive frames.

PROPOSED APPROACH
The proposed approach is based on the tracking-by-detection paradigm. Initially, target hypotheses are generated in every frame. In our case, a target hypothesis is the spatial position of a target's head. In theory, any rigid body part can be used as the target hypothesis. However, because the head is the most important and the most visible part of the body, we select it as the key location for tracking. Our tracker is based on the Kalman filter. The block diagram of the Kalman filter is given in Fig. 1. We assume a smooth, constant velocity motion model for the targets in the scene. This is a reasonable assumption because once a target appears in the visual scene, it cannot disappear abruptly. Similarly, the motion of the target is smooth as long as it stays in the scene. We model the target association as a combinatorial optimization problem. Association is important because at every time step we have M target hypotheses and N tracks. In order to track the targets as accurately as possible, each hypothesis should be assigned to its corresponding track. Hence, at every time instance, we build an N × M cost matrix and use the Hungarian assignment algorithm to obtain the associations. A brief description of the Kalman filter and the assignment algorithm is given in sections 3 and 3.1, respectively.
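As an illustration of the constant velocity assumption, one possible parameterization is written out below. The state layout (head position followed by head velocity in image coordinates) and the frame interval $\Delta t$ are assumptions made for this sketch and are not specified in the text; $F$ and $H$ denote the state transition and measurement matrices described in section 3.

```latex
\mathbf{x} =
\begin{bmatrix} p_x & p_y & \dot{p}_x & \dot{p}_y \end{bmatrix}^{\top},
\qquad
F =
\begin{bmatrix}
1 & 0 & \Delta t & 0 \\
0 & 1 & 0 & \Delta t \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix},
\qquad
H =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}
```

With this choice, $F$ moves the head position by one velocity step per frame and leaves the velocity unchanged, while $H$ states that only the head position is observed.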

KALMAN FILTER
The Kalman filter is an online filtering algorithm. Its graphical model is similar to a hidden Markov model. However, it assumes that the process noise $v_{k-1}$ and the measurement noise $n_k$, as well as the posterior pdf $p(x_k \mid z_{1:k})$, are normally distributed (Fig. 1). Moreover, the state transition function $f_k$ and the measurement function $h_k$ are linear. Based on these assumptions, the state transition and measurement equations take the following form:
$$x_k = F_k x_{k-1} + v_{k-1}$$
$$z_k = H_k x_k + n_k$$
The matrix $F_k$ is called the state transition matrix, and it helps to predict the current state of the target based on its previous state. Similarly, the matrix $H_k$ associates the observation $z_k$ with the target state $x_k$. The random variables $v_{k-1}$ and $n_k$ represent the process and measurement noise. They are zero-mean and normally distributed with covariance matrices $Q_{k-1}$ and $R_k$, respectively. A detailed description of the Kalman filter is beyond the scope of this paper. For details, readers may refer to [36,37]. In our problem, we instantiate one Kalman filter for each target in the visual scene.
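The prediction and update steps can be sketched as follows. This is a minimal illustration and not the implementation used in our experiments: it assumes that the two image axes are independent, so each axis is filtered by a two-state (position, velocity) constant velocity filter with a scalar position measurement. The class name `AxisKalman` and all noise values are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AxisKalman:
    """Constant velocity Kalman filter for one image axis (hypothetical helper).

    State is [position, velocity]; only the position is measured.
    """
    pos: float
    vel: float = 0.0
    # 2x2 state covariance P; the initial values are assumed, not from the paper.
    P: list = field(default_factory=lambda: [[10.0, 0.0], [0.0, 100.0]])
    q: float = 0.01  # process noise added to the diagonal of P (assumed)
    r: float = 1.0   # measurement noise variance (assumed)
    dt: float = 1.0  # time between consecutive frames

    def predict(self):
        # State prediction x = F x with F = [[1, dt], [0, 1]].
        self.pos += self.dt * self.vel
        # Covariance prediction P = F P F^T + Q.
        (a, b), (c, d) = self.P
        dt = self.dt
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.pos

    def update(self, z):
        # With H = [1, 0] the innovation covariance S is a scalar.
        (a, b), (c, d) = self.P
        s = a + self.r
        k0, k1 = a / s, c / s   # Kalman gain K = P H^T / S
        y = z - self.pos        # innovation (measurement minus prediction)
        self.pos += k0 * y
        self.vel += k1 * y
        # Covariance update P = (I - K H) P.
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# One filter per axis for a single head, initialised at its first detection.
head_x, head_y = AxisKalman(pos=120.0), AxisKalman(pos=45.0)
for zx, zy in [(122.0, 45.5), (124.1, 46.0), (126.0, 46.4)]:
    head_x.predict(); head_y.predict()
    head_x.update(zx); head_y.update(zy)
```

Under the independence assumption, running one such filter per axis is equivalent to a single four-state filter with block-diagonal covariances; a full implementation would typically use a matrix library instead.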

Hungarian Algorithm
The Hungarian algorithm is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. The tracking problem is modeled as a bipartite graph matching problem where the first set of nodes corresponds to the established trajectories and the second set of nodes corresponds to the target hypotheses, i.e., the measurements from the real world. In our case, the measurements correspond to the spatial locations of the heads in every frame. The input to the algorithm is a cost matrix with N rows and M columns, where N corresponds to the number of established trajectories and M corresponds to the number of measurements at time step t. There are a variety of ways to obtain the cost matrix [38]. A detailed description of an appearance model based on visual features is given in [39]. In our work, we mainly use spatial constraints for calculating the cost matrix. Specifically, we measure the Euclidean distance between a target's head locations in the current and previous frames and treat it as the cost. The nearer the targets are in consecutive frames, the smaller the cost, and the pair with the least distance most probably corresponds to the same target in the temporal domain. Similarly, targets that are far from each other in consecutive frames yield a high cost and most probably correspond to different targets. Once the cost matrix is obtained, the Hungarian algorithm [40] works in three steps as follows:
• Row reduction operation: Find the minimum cost of each row. Then subtract the corresponding minimum from each row entry to ensure at least one zero entry in each row.
• Column reduction operation: Repeat the same procedure for each column. It will ensure at least one zero entry in each column.
• Optimality test: Find the minimum number of straight lines needed to cover all the zeros in the cost matrix. If the number of lines equals the number of rows and columns, optimality is achieved. Otherwise, create additional zeros by subtracting the smallest uncovered entry from all uncovered entries and adding it to every entry covered by two lines, and repeat the test until the optimal assignment is found.
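The construction of the cost matrix and the two reduction steps can be sketched as below. This is an illustrative example under stated assumptions, not our Matlab implementation: the function names are hypothetical, the matrix is assumed to be square (N = M), and an exhaustive search over permutations stands in for the line-covering step, so it is only practical for a handful of targets. In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be used instead.

```python
from itertools import permutations
from math import hypot


def cost_matrix(tracks, detections):
    """Euclidean distance between each track's last head position (rows)
    and each detected head position in the current frame (columns)."""
    return [[hypot(tx - dx, ty - dy) for dx, dy in detections]
            for tx, ty in tracks]


def reduce_rows_and_columns(cost):
    """Row reduction followed by column reduction (first two steps)."""
    rows = [[c - min(row) for c in row] for row in cost]
    col_min = [min(col) for col in zip(*rows)]
    return [[c - m for c, m in zip(row, col_min)] for row in rows]


def best_assignment(cost):
    """Exhaustive reference solver for small square matrices.

    Stand-in for the optimality test of the Hungarian algorithm, which
    reaches the same minimum-cost assignment in polynomial time.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
    return list(enumerate(best))  # (track index, detection index) pairs


# Three tracked heads and three detections listed in a different order.
tracks = [(10, 10), (50, 12), (30, 40)]
detections = [(52, 13), (11, 12), (29, 43)]
C = cost_matrix(tracks, detections)
R = reduce_rows_and_columns(C)
print(best_assignment(C))  # -> [(0, 1), (1, 0), (2, 2)]
print(best_assignment(R))  # same pairs: the reductions keep the optimum
```

Subtracting a constant from a whole row or column changes the cost of every complete assignment by the same amount, which is why the reduced matrix has the same optimal assignment as the original one.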

EXPERIMENT
The proposed algorithm is implemented in Matlab on a Core i7 system with 16 GB RAM. To ensure real-time performance, we excluded the deep feature based appearance model [26], but it could easily be incorporated into the cost matrix. To evaluate the algorithm, we have chosen four datasets [41,42] that are commonly used for pedestrian tracking. It is also worth noting that all the standard datasets have annotations available, but only for the whole body, which is not useful in our case. Therefore, we annotated the head locations in the datasets to generate the target hypotheses in every frame. The qualitative results of the proposed method are given in Fig. 2. It is interesting to observe that head-based tracking works well when the heads are not covered by anything. Additionally, because the head is the most visible part of the body, it also helps in accurately tracking partially occluded targets. The proposed algorithm fails when people use an umbrella or cover the head with an opaque material. However, in the majority of surveillance scenarios where the head is visible, the proposed algorithm works well.

CONCLUSION
We proposed a multi-target tracking algorithm that tracks the heads of multiple humans in a visual scene. The tracking-by-detection paradigm is followed, where the spatial locations of the heads are generated in every frame and combinatorial optimization is used to establish the association between the corresponding targets. Specifically, the Hungarian algorithm is used to associate the heads of corresponding targets in consecutive frames, one frame pair at a time. In the future, we aim to extend our approach to other keypoints of the body so that, rather than tracking only the head of a person, different body parts are tracked. Tracking the individual body parts would be a promising research direction for pose estimation and high-level behavior analysis.