Subject: Earth Sciences, Geoinformatics Keywords: indoor scene recognition; unsupervised representation learning; Siamese network; graph constraints
Online: 19 March 2019 (13:11:09 CET)
Indoor scene recognition is of great significance for intelligent applications such as mobile robots and location-based services (LBS). Wherever we are and whatever we do, we are in a specific scene. The human brain can discern a scene at a glance; for a machine to do the same, however, plenty of well-annotated data is usually required, which is time-consuming and labor-intensive to obtain. Moreover, effective visual representations are hard to learn due to the large intra-category variation and inter-category similarity of indoor scenes. To address these problems, this paper adopts an unsupervised visual representation learning method that learns from unlabeled data with a Siamese Convolutional Neural Network (Siamese ConvNet) and graph-based constraints. Specifically, we first mine relationships between unlabeled samples with a graph structure; these relationships then serve as supervision for representation learning with a Siamese network. Concretely, a k-NN graph is first constructed by taking each image as a node and linking it to its k nearest neighbors to form the edges. With this graph, cycle consistency and geodesic distance are used as criteria for mining positive and negative pairs, respectively. In other words, by detecting cycles in the graph, images with large appearance differences that lie on the same cycle can be treated as belonging to the same category (positive pairs); by computing geodesic rather than Euclidean distance between nodes, two nodes with a large geodesic distance can be regarded as belonging to different categories (negative pairs). Visual representations of indoor scenes are then learned by a Siamese network in an unsupervised manner, with the mined pairs as inputs. To evaluate the proposed method, we tested it on two scene-centric datasets, MIT67 and Places365.
Experiments with different numbers of categories were conducted to explore the potential of the proposed method. The results demonstrate that semantic visual representations of indoor scenes can be learned in this unsupervised manner. In addition, indoor scene recognition models trained with the learned representations and only a few labeled samples achieve competitive performance compared to state-of-the-art approaches.
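The graph-based pair-mining step described above can be sketched as follows. This is an illustrative toy implementation, not the authors' code: it assumes images have already been reduced to feature vectors, approximates cycle consistency by detecting 3-cycles (an edge whose endpoints share a third neighbor), and computes geodesic distance as hop count via breadth-first search; the `far` threshold is an assumed hyperparameter.

```python
import numpy as np
from collections import deque

def knn_graph(features, k):
    """Build an undirected k-NN graph: each sample links to its k nearest neighbors."""
    n = len(features)
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a node is never its own neighbor
    adj = [set(np.argsort(d[i])[:k]) for i in range(n)]
    for i in range(n):                       # symmetrize the edges
        for j in adj[i]:
            adj[j].add(i)
    return adj

def geodesic_distances(adj, src):
    """Hop counts (graph geodesic distance) from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def mine_pairs(adj, far=4):
    """Positive pairs: endpoints of an edge that also share a neighbor (a 3-cycle).
    Negative pairs: nodes at geodesic distance >= far hops (or disconnected)."""
    n = len(adj)
    pos, neg = [], []
    for i in range(n):
        dist = geodesic_distances(adj, i)
        for j in range(i + 1, n):
            if j in adj[i] and adj[i] & adj[j]:
                pos.append((i, j))
            elif dist.get(j, np.inf) >= far:
                neg.append((i, j))
    return pos, neg
```

The mined positive and negative pairs would then be fed to the Siamese network as same-category and different-category training examples, respectively.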
ARTICLE | doi:10.20944/preprints202108.0094.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Siamese networks; Ensemble of classifiers; Loss function; Discrete cosine transform
Online: 3 August 2021 (15:49:22 CEST)
In this paper, we examine two strategies for boosting the performance of ensembles of Siamese networks (SNNs) for image classification, using two loss functions (Triplet and Binary Cross Entropy) and two methods for building the dissimilarity spaces (FULLY and DEEPER). With FULLY, the distance between a pattern and a prototype is calculated by comparing two images using the fully connected layer of the Siamese network. With DEEPER, each pattern is described using a deeper layer combined with dimensionality reduction. The basic design of the SNNs takes advantage of supervised k-means clustering for building the dissimilarity spaces that train a set of support vector machines, which are then combined by sum rule for a final decision. The robustness and versatility of this approach are demonstrated on several cross-domain image data sets, including a portrait data set, two bioimage data sets, and two animal vocalization data sets. Results show that the strategies employed in this work to increase the performance of dissimilarity-space image classification using SNNs are closing the gap with standalone CNNs. Moreover, when our best system is combined with an ensemble of CNNs, the resulting performance is superior to the ensemble of CNNs alone, demonstrating that our new strategy extracts additional information.
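The two core mechanisms above, projecting patterns into a dissimilarity space and fusing classifier scores by the sum rule, can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: `embed` represents the Siamese network's learned embedding, and a plain Euclidean distance between embeddings replaces the FULLY/DEEPER comparison modules.

```python
import numpy as np

def dissimilarity_space(patterns, prototypes, embed):
    """Project each pattern into a dissimilarity space: feature d of a pattern
    is its distance to prototype d, computed in the embedding space (the role
    played by the Siamese comparison in the paper). Returns an (n, m) matrix."""
    e_pat = embed(patterns)        # (n, dim) pattern embeddings
    e_pro = embed(prototypes)      # (m, dim) prototype embeddings
    return np.linalg.norm(e_pat[:, None, :] - e_pro[None, :, :], axis=-1)

def sum_rule(score_list):
    """Fuse per-classifier class scores by the sum rule and pick the argmax.
    Each element of score_list is an (n, n_classes) score matrix from one SVM."""
    total = np.sum(score_list, axis=0)
    return total.argmax(axis=1)
```

Each row of the dissimilarity matrix is the vector-space representation on which an SVM would be trained; the sum rule then merges the decisions of the SVMs built from different dissimilarity spaces.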
ARTICLE | doi:10.20944/preprints201905.0160.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Siamese neural network, appearance model, contrastive loss, cross entropy.
Online: 13 May 2019 (13:32:25 CEST)
An appearance model plays a crucial role in multi-target tracking. In traditional approaches, the two steps of appearance modeling, i.e., visual representation and statistical similarity measurement, are modeled separately. Visual representation is achieved either through hand-crafted features or deep features, and statistical similarity is measured through a cross-entropy loss function. A loss function based on cross-entropy (KL-divergence, mutual information) finds closely related probability distributions for the targets; however, if the targets have similar visual representations, it ends up mixing them. To tackle this problem, we propose a synergetic appearance model named the Single Shot Appearance Model, based on a Siamese neural network. The network is trained with a contrastive loss function to find the similarity between different targets in a single shot. The input to the network is a pair of target patches, and based on their similarity the network outputs a contrastive score. The proposed model is evaluated with an accumulative dissimilarity metric on three datasets. Quantitatively, promising results are achieved against three baseline methods.
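The contrastive loss the network is trained with typically takes the following form; the squared-distance convention and the margin value are assumptions here, since the abstract does not specify them.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss over a batch of embedding pairs.
    same[i] = 1 if the two patches show the same target, else 0.
    Same-target pairs are pulled together (d^2); different-target pairs are
    pushed at least `margin` apart (hinge on the distance)."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)            # pairwise distances
    pos = same * d ** 2                                  # attract same-target pairs
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2  # repel different targets
    return np.mean(pos + neg)
```

Unlike cross-entropy over class probabilities, this loss operates directly on the distance between the two patch embeddings, which is what lets the model separate targets whose class-probability distributions would otherwise be confusable.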
ARTICLE | doi:10.20944/preprints202010.0526.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: audio classification; dissimilarity space; siamese network; ensemble of classifiers; pattern recognition; animal audio
Online: 26 October 2020 (13:57:01 CET)
The classifier system proposed in this work combines the dissimilarity spaces produced by a set of Siamese neural networks (SNNs) designed using 4 different backbones with different clustering techniques for training SVMs for automated animal audio classification. The system is evaluated on two animal audio datasets: one of cat and one of bird vocalizations. Different clustering methods reduce the spectrograms in the dataset to a set of centroids that generate (in both a supervised and unsupervised fashion) the dissimilarity space through the Siamese networks. In addition to feeding the SNNs with spectrograms, additional experiments process the spectrograms using the Heterogeneous Auto-Similarities of Characteristics. Once the dissimilarity spaces are computed, a vector space representation of each pattern is generated, which is then used to train a Support Vector Machine (SVM) to classify a spectrogram by its dissimilarity vector. Results demonstrate that the proposed approach performs competitively (without ad-hoc optimization of the clustering methods) on both animal vocalization datasets. To further demonstrate the power of the proposed system, the best stand-alone approach is also evaluated on the challenging Dataset for Environmental Sound Classification (ESC50). The MATLAB code used in this study is available at https://github.com/LorisNanni.
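The clustering and dissimilarity-vector steps can be sketched as follows, as a minimal illustration rather than the paper's MATLAB implementation. It assumes spectrograms have already been reduced to fixed-length descriptors, and plain k-means stands in for the assortment of clustering methods the paper compares; distances are computed directly in descriptor space rather than through a Siamese network.

```python
import numpy as np

def kmeans_centroids(X, k, iters=20, seed=0):
    """Plain k-means: reduce a set of descriptors (one per spectrogram) to k
    centroids, which act as the prototypes that span the dissimilarity space."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]  # random initial centroids
    for _ in range(iters):
        # assign every point to its nearest centroid, then recompute means
        labels = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):              # keep old centroid if cluster empties
                C[j] = X[labels == j].mean(axis=0)
    return C

def dissimilarity_vector(x, centroids):
    """Represent one pattern by its distances to every centroid; in the paper
    this vector is what the SVM classifies."""
    return np.linalg.norm(centroids - x, axis=1)
```

Each spectrogram's dissimilarity vector becomes one training example for the SVM, so the classifier operates on distances to prototypes rather than on the raw spectrogram.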