Beyond Tracking in Crowd: Analyzing Crowd based on Physical Characteristics

The safety of people in public places is an important concern nowadays, particularly in crowded places such as subway stations, universities, colleges, airports, shopping malls, and city squares. Therefore, an effective system based on the physical characteristics of the crowd layout is in significant demand. In this paper, we propose a novel automated and intelligent system for crowd event analysis based on a set of physical elements. For this purpose, we take into account optical flow and the spatio-temporal gradient, contour features, and Gaussian processes. Our method combines these characteristics into a single model to deal with the challenging problem of crowd event analysis. To evaluate our proposed method, we consider a benchmark dataset and a number of different performance metrics. These analyses demonstrate the robustness and effectiveness of our proposed method.


Fig. 1. Crowd scenes [1][2]. Different crowd areas and scenes are presented, illustrating their level of complexity. Note also that the level of occlusion is significantly high.

Methodology of Our Approach:
There are two common methods for describing motion: optical flow [15] [16] and the spatio-temporal gradient [17] [18]. Optical flow represents motion accurately; however, it incurs a large computational overhead. Detecting important events in real time is essential for people's safety. Therefore, we have to avoid calculating optical flow for every pixel. On the other hand, calculating the spatio-temporal gradient is computationally cheaper. However, that calculation relies on the extraction of contour features. In crowded areas, bodies overlap each other and their relative locations change significantly, so the extraction of contour features is a complex process. Taking all these elements into account, we instead use KLT corners as the crowd features that represent each crowded area. We then extract optical flow features. We also apply background subtraction to consolidate our method. In addition, we extract the velocity and direction of each feature to produce a motion vector, as shown in Fig. 2.

Fig. 2. The velocity normalization process. Each frame is divided into small patches; the motion vectors are shown in the middle, and the distribution of motion vectors is shown on the right-hand side.
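The step from tracked features to motion vectors can be illustrated with a minimal sketch. It assumes that each feature is given as a pair of positions in two consecutive frames (for example, from a KLT tracker) and that directions are accumulated into a speed-weighted histogram per patch. The function names, the patch size of 8 pixels, and the 16 direction bins are illustrative assumptions, not values confirmed by the method description above.

```python
import math

def motion_vector(p_prev, p_next):
    """Return (speed, direction) of one tracked feature between two frames."""
    dx = p_next[0] - p_prev[0]
    dy = p_next[1] - p_prev[1]
    speed = math.hypot(dx, dy)                   # pixels per frame
    angle = math.atan2(dy, dx) % (2 * math.pi)   # radians in [0, 2*pi)
    return speed, angle

def patch_direction_histogram(tracks, patch_size=8, n_bins=16):
    """Accumulate speed-weighted direction histograms per patch.

    tracks: list of ((x0, y0), (x1, y1)) positions in consecutive frames.
    Returns a dict mapping (patch_row, patch_col) to a list of n_bins weights.
    patch_size and n_bins are assumed values chosen for illustration.
    """
    hist = {}
    bin_width = 2 * math.pi / n_bins
    for p_prev, p_next in tracks:
        speed, angle = motion_vector(p_prev, p_next)
        if speed == 0.0:
            continue  # static feature: no motion to record
        key = (int(p_prev[1] // patch_size), int(p_prev[0] // patch_size))
        b = min(int(angle / bin_width), n_bins - 1)
        hist.setdefault(key, [0.0] * n_bins)[b] += speed
    return hist
```

Weighting each bin by speed rather than by a plain count is one possible design choice; it lets fast-moving features dominate the distribution of a patch.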
To develop our method, we combine coherent motion patterns in a novel way. If we modeled each feature as a separate pattern model, the number of models would grow significantly and would incur a large processing overhead. It is also important to note that several motion patterns do not match properly in both temporal and spatial order. Therefore, we combine coherent motion patterns into an atomic unit M. In crowded areas, we do not know a priori the total number or the different types of movements, and hence we do not know the possible number of coherent regions. Therefore, we compute the deviation among all motion patterns and create a new model if the smallest deviation is greater than a predefined threshold. Considering this case, we modify the parameters of our method according to the formulation:
In the above formulation, N_k is the total number of motion patterns associated with our pattern model M_k, and P_l is the updated motion pattern.
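The grouping rule described above can be sketched as an online assignment procedure. This is a minimal sketch under stated assumptions: the deviation is taken to be the Euclidean distance, a pattern that is close enough to an existing model is merged into it, and the merge uses a running mean over the N_k patterns of model M_k. The running-mean update and all function names are assumptions made for illustration; they are not the authors' formulation.

```python
import math

def deviation(a, b):
    """Euclidean distance between two motion patterns (equal-length vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_pattern(models, counts, pattern, threshold):
    """Assign `pattern` to the closest model, or start a new model.

    models: list of mean patterns M_k (mutated in place)
    counts: list of N_k, the number of patterns merged into each M_k
    Returns the index k of the model that received the pattern.
    """
    if models:
        devs = [deviation(m, pattern) for m in models]
        k = min(range(len(models)), key=devs.__getitem__)
        if devs[k] <= threshold:
            counts[k] += 1
            # Running-mean update: an assumed rule, not taken from the paper.
            models[k] = [m + (p - m) / counts[k]
                         for m, p in zip(models[k], pattern)]
            return k
    # No model yet, or even the closest model deviates too much: new model.
    models.append(list(pattern))
    counts.append(1)
    return len(models) - 1
```

Because the number of models grows only when a pattern deviates from every existing model, the number of coherent regions does not need to be fixed in advance.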
Our method does not depend on multiple stages of individual object or person detection or tracking. To keep the computational complexity low, we only consider the foreground areas. In our method, we fuse GMM background subtraction with motion patterns. Therefore, we consider only two holistic features, namely foreground pixels and motion features, as illustrated in Fig. 3. We weight our features according to a perspective map to avoid perspective distortions. We also relate dense and sparse features to the crowd size. To this end, we extract localized corner features and accommodate global corner features. After taking the perspective map into account, the total number of foreground patterns in each video frame is modified according to the formulation:
In the formulation, NT(y) is the total number of foreground pixels in the yth row. We relate local features to global features by considering the relationship between the number of features and the number of foreground pixels. We then develop a weighting model that treats the collected features as crowd events. Our target is to weight the collected features and use the calculated features for the magnitudes d_i, i = 1...M, where M is the total number of frames in the crowd video. We formulate the weight model as:
We developed our method so that it is invariant to changes in perspective and in the density of crowded areas. It handles the complexity and non-linearity of the mapping from the collected features to the changing density in the crowd. To handle more complex situations, we accommodate any unknown errors that could occur in the process of detecting any type of anomaly. We also exploit Gaussian process regression, which adapts to and accommodates unknown complexities of the crowd when the coherency of the crowd changes both locally and globally.
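Perspective weighting of the foreground count can be illustrated with a minimal sketch. It assumes that the per-frame total is a weighted sum of NT(y) over the rows y, and that the perspective map is a linear interpolation of weights between a far reference row and a near reference row. Both the weighted-sum form and the names perspective_weights and weighted_foreground_count are assumptions made for illustration; the exact formulation of the method may differ.

```python
def perspective_weights(height, y_far, y_near, w_far, w_near=1.0):
    """Linearly interpolate a per-row weight between two reference rows.

    Rows far from the camera (small objects) get the larger weight w_far.
    """
    weights = []
    for y in range(height):
        t = (y - y_far) / (y_near - y_far)
        t = min(max(t, 0.0), 1.0)  # clamp rows outside the reference band
        weights.append(w_far + t * (w_near - w_far))
    return weights

def weighted_foreground_count(mask, weights):
    """Sum of w(y) * N_T(y), where N_T(y) counts foreground pixels in row y.

    mask: 2-D list of 0/1 values, e.g. the output of background subtraction.
    """
    total = 0.0
    for y, row in enumerate(mask):
        n_t = sum(1 for v in row if v)  # N_T(y)
        total += weights[y] * n_t
    return total
```

For example, with weights 3.0, 2.0, and 1.0 for three rows, a foreground pixel in the top row contributes three times as much as one in the bottom row, which compensates for people appearing smaller when they are farther from the camera.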
Our proposed method carries out crowd analysis for anomaly detection iteratively on smaller time slots, thereby efficiently encoding anomalous crowd situations independently of the complexity of the crowd. Given its robustness, our method can be applied to various applications, including environmental monitoring for detecting behavioural anomalies. Our proposed model consists of multiple stages: the extraction and combination of features based on crowd motion data, the filtering of the collected features to be passed to the next stage, and the detection of anomalous regions.
Our proposed method internally exploits multiple features, including contrast, correlation, energy, and homogeneity. The borders of the video frame, where the remaining patches are incomplete, are padded with multiple columns and rows of zeros. A feature matrix is built with rows equal to the number of frames and columns equal to the total number of patches. Depending on the area of the person, the patch in which the person is identified is updated with the aforementioned four elements. For a given crowd area, the four elements are formulated as:
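Contrast, correlation, energy, and homogeneity are the names of four standard descriptors computed from a gray-level co-occurrence matrix (GLCM). The following minimal sketch assumes that this is the intended meaning; the text above does not name the GLCM explicitly, so this is an assumption. The definitions used here (energy as the sum of squared entries, homogeneity with an absolute-difference denominator) are one common convention, and other conventions exist. The function names are hypothetical.

```python
import math

def glcm(patch, levels, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for pixel offset (dx, dy).

    patch: 2-D list of integer gray levels in the range [0, levels).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(patch), len(patch[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = patch[r][c], patch[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions (symmetric)
    total = sum(map(sum, counts))
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]

def texture_features(p):
    """Return contrast, correlation, energy, homogeneity of a normalized GLCM."""
    idx = range(len(p))
    cells = [(i, j, p[i][j]) for i in idx for j in idx]
    mu_i = sum(i * v for i, j, v in cells)
    mu_j = sum(j * v for i, j, v in cells)
    var_i = sum((i - mu_i) ** 2 * v for i, j, v in cells)
    var_j = sum((j - mu_j) ** 2 * v for i, j, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v * v for i, j, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    denom = math.sqrt(var_i * var_j)
    # For a constant patch the correlation is undefined; 1.0 is a convention.
    correlation = (sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells) / denom
                   if denom > 0 else 1.0)
    return contrast, correlation, energy, homogeneity
```

A patch of horizontal stripes and a checkerboard patch give opposite correlation values under a horizontal offset, which illustrates how these descriptors separate different local structures.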

Experimental Analysis and Evaluation:
For experimental evaluation, we use the UMN dataset. This dataset consists of three different crowd scenes and contains 11 videos from these scenes, with a resolution of 240 × 320. Each video sequence contains a normal time slot, for example walking, and an abnormal time slot. The total numbers of video frames in scene 1, scene 2, and scene 3 are 1450, 4415, and 2145, respectively. Initially, we calculate the optical flow between consecutive video frames. We then extract combined features over different patches. There are 30 × 40 patches in a video frame, and the motion patterns have size 30 × 40 × 16. Subsequently, the features collected from the early, normal video frames are used to build the dictionary.
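A grid of 30 × 40 patches on a 240 × 320 frame corresponds to patches of 8 × 8 pixels. The following minimal sketch shows how a frame could be zero-padded and flattened into one value per patch, so that each frame becomes one row of a frames-by-patches feature matrix. The per-patch statistic used here (a plain sum) and the function names are placeholders chosen for illustration, not the features used in the experiments.

```python
def pad_to_patches(frame, patch):
    """Zero-pad a 2-D frame (list of rows) to multiples of `patch` per side."""
    rows, cols = len(frame), len(frame[0])
    pad_r = (-rows) % patch
    pad_c = (-cols) % patch
    padded = [list(r) + [0] * pad_c for r in frame]
    padded += [[0] * (cols + pad_c) for _ in range(pad_r)]
    return padded

def patch_sums(frame, patch):
    """Flatten a frame into one value per patch, in row-major patch order.

    The sum is a placeholder for whatever per-patch feature is used.
    """
    padded = pad_to_patches(frame, patch)
    n_r, n_c = len(padded) // patch, len(padded[0]) // patch
    return [
        sum(padded[r][c]
            for r in range(pr * patch, (pr + 1) * patch)
            for c in range(pc * patch, (pc + 1) * patch))
        for pr in range(n_r) for pc in range(n_c)
    ]

# One row per frame, one column per patch.
frames = [[[1] * 320 for _ in range(240)] for _ in range(2)]
feature_matrix = [patch_sums(f, 8) for f in frames]
```

With these inputs, each row of feature_matrix has 30 × 40 = 1200 entries, and no padding is needed because 240 and 320 are both multiples of 8.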
We used the receiver operating characteristic (ROC) curve to assess the robustness of our method. The ROC curve is a well-suited metric for representing the sensitivity and specificity of the collected features. This curve shows the relationship between sensitivity and specificity through a detailed graph. A larger area under the curve indicates better performance. The sensitivity is the true positive rate (TPR), and one minus the specificity is the false positive rate (FPR). The TPR is the fraction of abnormal events that are detected correctly, and the FPR is the fraction of normal events that are incorrectly detected as abnormal. We present the results in Fig. 4. In this figure, the result of scene 1 is shown in blue, the result of scene 2 in red, and the result of scene 3 in black. To present the differences between normal and abnormal frames in our method, both TPR and FPR are shown.
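The relationship between TPR, FPR, and the area under the curve can be illustrated with a minimal sketch. It assumes that each frame has a scalar anomaly score and a binary ground-truth label (1 for abnormal, 0 for normal), and that a frame is flagged as abnormal when its score reaches a threshold. The function names are illustrative, and the scores in the usage lines are made-up toy values, not experimental results.

```python
def roc_curve(scores, labels):
    """Return ROC points (FPR, TPR) obtained by sweeping a score threshold.

    labels: 1 for abnormal (positive) frames, 0 for normal frames.
    Both classes must be present, otherwise a rate would divide by zero.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last frame of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: two abnormal and two normal frames.
toy_points = roc_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_auc = auc(toy_points)
```

An area of 1.0 corresponds to a detector that ranks every abnormal frame above every normal frame, while an area of 0.5 corresponds to random ranking.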
Fig. 4 shows that the results are better in terms of the variation in the graphs: the curves are stable and smooth. Therefore, our proposed method shows significant improvement.

Conclusion:
The analysis of abnormal crowd events in different places is important for people's safety. However, detecting these behaviors is very challenging due to changing densities and consistencies in crowd movements. In this paper, we have considered several sources of information to propose and develop a robust method for crowd event detection. We integrated the collected features into a single method that works effectively on a challenging dataset.