Self-Attention Autoencoder for Anomaly Segmentation

Abstract: Anomaly detection and segmentation aim at distinguishing abnormal images from normal images and further localizing the anomalous regions. Feature reconstruction has become one of the mainstream approaches for this task. It rests on two assumptions: (1) the features extracted by a neural network are a good representation of the image; (2) an autoencoder trained solely on the features of normal images cannot reconstruct the features of anomalous regions well. But these two assumptions are hard to meet. In this paper, we propose a new anomaly segmentation method based on feature reconstruction. Our approach mainly consists of two parts: (1) we use a pretrained vision transformer (ViT) to extract the features of the input image; (2) we design a self-attention autoencoder to reconstruct the features. We argue that the self-attention operation, which has a global receptive field, benefits feature-reconstruction-based methods in both feature extraction and reconstruction. Experiments show that our method outperforms the state-of-the-art approaches for anomaly segmentation on the MVTec dataset, and it is both effective and time-efficient.


Introduction
Anomaly detection aims at distinguishing abnormal images from normal images. Due to the lack of anomalous images and the diversity of anomalies, only defect-free images are used during training. In other words, the training data defines what is normal, and samples that deviate from the normal distribution are judged as anomalous. This is also known as one-class classification, because it actually tries to find the decision boundary of the training data. Nowadays, image-level anomaly detection is not enough: we want to further localize the anomalous regions at pixel level for better interpretability, which is known as anomaly segmentation. Anomaly segmentation in natural images is an important task in computer vision, with a wide range of applications in manufacturing, medicine and security.
In recent years, many anomaly segmentation methods have been proposed [1-8]. The reconstruction-based method [1-3,5] is one of the most commonly used. This kind of approach trains an autoencoder to reconstruct the input image. At test time, an anomaly score map can be calculated by comparing the input image with the reconstructed image. These methods suppose that an autoencoder trained solely on normal images cannot reconstruct anomalous regions well, so anomalous areas will have relatively high reconstruction error, and a pixel-level anomaly segmentation can be obtained easily from the anomaly score map. This requires the autoencoder to have a special reconstruction ability: it reconstructs normal images well, but it is not good at reconstructing anomalous regions, which means the reconstruction of anomalous regions does not look like the input.
Recently, compared to reconstructing the original image, reconstruction in feature space has achieved better results [3] and has the potential to become a new paradigm. This kind of method uses a deep convolutional neural network (CNN) [23] pretrained on large datasets to extract the features of the input image. Then it uses an autoencoder to reconstruct the feature representation and obtains an anomaly segmentation on the feature map. Finally, the anomaly score map is upsampled to the size of the input image to get a pixel-level segmentation. It assumes that an autoencoder trained solely on the representations of normal images cannot reconstruct the representation of an anomalous region well in the feature space. It also requires the feature map to have spatial correspondence with the original image, so that the anomaly segmentation of the whole image can come from the anomaly segmentation on the feature map. The biggest advantage of the feature reconstruction based method is that the features extracted by a CNN pretrained on large datasets contain enough semantic information, which is beneficial to the anomaly detection task. In addition, this method aggregates the feature maps generated by different convolution layers to build a dense regional feature representation. Since the receptive field of the convolution kernels increases from shallow layers to deep layers, the stacked feature maps contain a multiscale description of the original image. Thus, it is a better choice to reconstruct in feature space than in image space.
To improve the performance of feature reconstruction based methods, two parts are important: (1) constructing a good feature space; (2) building an effective autoencoder. A strong feature extractor should be pretrained on a large dataset and provide features with enough semantic information to represent the input image and transfer well to the anomaly detection task. An effective autoencoder should reconstruct normal representations well while suppressing the reconstruction quality of the representations of anomalous regions. To address these two points, we choose the vision transformer (ViT) [9] as the feature extractor and design a self-attention autoencoder based on the transformer architecture for feature reconstruction, which has a reconstruction ability better suited to anomaly detection than a normal autoencoder.
In this paper, we propose a new anomaly segmentation method based on feature reconstruction with a self-attention autoencoder (SAAE). Firstly, we use ViT to extract the features of the image. Many works in recent years have used models pretrained on large datasets, such as [3,4,6,7]. These models are trained for the large-scale image classification task and thus produce discriminative features. Anomaly detection is a one-class classification task, so these discriminative features, which contain high-level semantic information, are beneficial to anomaly detection and transfer easily. Compared to a CNN, more global context information is retained in the features extracted by ViT. At the same time, similar to a CNN, each location on the feature map of ViT perceives a corresponding spatial region of the input image. We then aggregate feature maps from different ViT layers to build a dense hierarchical feature representation.
Then, the hierarchical feature maps will pass through a self-attention autoencoder. The self-attention autoencoder will reconstruct the input feature maps. Finally, we compare the input feature maps and the reconstructed feature maps to get an anomaly score map and further upsample it to the size of the original image to get the anomaly segmentation of the whole image.
The self-attention autoencoder consists of a self-attention encoder and a self-attention decoder. The self-attention encoder is composed of a convolutional encoder and a transformer encoder; the self-attention decoder is composed of a transformer decoder and a convolutional decoder. It is essentially a transformer that operates in the latent space of a convolutional autoencoder. The self-attention encoder first performs dimension reduction with convolutions to extract low-level local features. Then the self-attention operation connects the information of each point on the hierarchical feature maps and fuses global semantic information. After that, the self-attention decoder reconstructs the feature maps according to the global context information and restores the dimensionality with convolution layers. The decoding process is parallel, which is more efficient than the original transformer [10]. There are two reasons why the self-attention autoencoder is more suitable for anomaly segmentation than ordinary autoencoders. Firstly, it can reconstruct normal regions well. The convolutional layers extract local, low-level features, and the self-attention layers extract global, high-level semantic information. With this combination of local and global information, the self-attention autoencoder has enough capacity to reconstruct normal images well. More importantly, it suppresses the reconstruction of anomalous regions to a certain extent. In contrast to the locality of convolution, self-attention has a global receptive field. Some abnormal regions can be reconstructed well from local features alone, but the global receptive field of self-attention forces the autoencoder to consider larger areas, which are mostly normal. Thus, a self-attention autoencoder trained solely on normal images tends to reconstruct anomalous regions to look like normal ones through the connections between pixels. This can be regarded as an inpainting ability, and it is well suited to anomaly detection.
Our experiments confirm the effect of the self-attention autoencoder. Our anomaly segmentation method is both effective and time-efficient: experiments show that our approach outperforms the current state-of-the-art methods on the MVTec AD dataset, which indicates that the self-attention autoencoder has a strong capability for the anomaly detection problem.

Anomaly Detection Methods
Reconstruction-based methods [1-3,5] have been among the most commonly used anomaly detection and segmentation methods over the past few years. They train autoencoders (AE) [1] to reconstruct normal images. At test time, we can compare the pixel-wise reconstruction error between the input image and the reconstructed image to get an anomaly segmentation. These methods assume that an autoencoder trained solely on normal images cannot reconstruct anomalous regions well, so anomalous regions will have relatively high reconstruction error. The reconstruction error can be measured by L2 distance or SSIM [12]. Generative adversarial networks (GAN) [22] can also be used for the same purpose. A variant of reconstruction-based methods is restoration-based methods [5,8]. This kind of method applies augmentations to images, such as random erasing [5], graying [8] or random rotation [8], to erase some attributes related to semantic information, and then trains an AE to restore the information. The assumption is similar to that of reconstruction-based methods: it is hard to restore the semantic attributes of anomalous images. It is a way to force the autoencoder to learn semantic information.
Another important trend in anomaly detection is the use of pretrained models. This kind of method uses neural networks pretrained on large-scale datasets to extract features and model the distribution of normal features. The most popular pretrained models are convolutional neural networks [23] pretrained on the ImageNet [13] classification task. Anomaly detection methods using pretrained models are much better than those trained purely on the normal dataset. Typical methods include Uninformed Students [7], Patch SVDD [4] and PaDiM [6]. These methods usually divide images into patches, extract features of the patches and build a classification model with machine learning algorithms. The key to these methods is a good feature representation of images or patches. In other words, one needs a suitable feature space that can distinguish between normal and abnormal samples. A CNN pretrained on large datasets for classification is useful because it produces discriminative features that contain high-level semantic information. So it is reasonable that methods using pretrained models perform much better than those trained solely on small datasets, which cannot extract rich features. From this point of view, it is essential for future anomaly detection methods to use pretrained models.
Reconstruction in feature space combines the advantages of reconstruction-based methods and pretrained-model-based methods. This framework was proposed by [3] and is called deep feature reconstruction (DFR). DFR first uses a pretrained CNN to extract the features of the image. Then it aligns and aggregates the output feature maps of different convolution layers to get a multi-scale dense regional feature representation of the image. Next, it uses a deep convolutional autoencoder with 1 × 1 kernels to reproduce the dense feature maps. Finally, it compares the feature maps with the reconstructed feature maps to get the anomaly score on the feature maps and upsamples it to the spatial size of the image to get a pixel-wise anomaly segmentation. It leverages a pretrained model to enhance the reconstruction-based method. It is also a multiscale method that can detect abnormal regions of different sizes, because the receptive field of the convolution kernels increases from shallow layers to deep layers. The anomaly segmentation can be obtained from the anomaly score of the feature maps because convolution keeps the spatial relationship and each point on the feature maps corresponds to a spatial region of the image. With these advantages, reconstruction in feature space is much better than reconstruction in image space. The framework is promising, but its components still need to be discussed: the upper limit of this framework is determined by the feature extractor and the autoencoder. A convolutional autoencoder has strong reconstruction ability, but in many cases the anomalous region is also reconstructed well, which is not good for anomaly detection. Compared to DFR, our approach mainly has two improvements: constructing a better feature space and designing a self-attention autoencoder that is suitable for anomaly detection and segmentation.

Visual Transformers
We introduce the transformer into the anomaly detection task. The transformer [10] was initially designed for natural language processing (NLP) tasks and achieved great success. Transformer-based models like BERT [11] and GPT [14] dominate many aspects of NLP. Recently, the transformer has been extended to vision tasks and shows competitive or even better performance than CNNs on plenty of tasks. It has now reached state-of-the-art results in image classification, object detection, semantic segmentation, and so on. DETR [21] is a simple end-to-end object detection framework. It is a hybrid structure of CNN and transformer, an early attempt at a visual transformer that attracted great attention. ViT is a pure transformer model proposed by [9]. It has reached state-of-the-art performance in image classification. IPT [16] is proposed for multiple low-level vision tasks, such as denoising and super-resolution. TransGAN [24] is a pure transformer version of GAN, which produces results comparable to state-of-the-art CNN-based GANs in image generation. In computer vision, the dominant neural network architecture is the CNN. The transformer now has the potential to be an alternative to the CNN, with a larger receptive field and less inductive bias. Inspired by the success of visual transformers, we apply the transformer to anomaly detection.
Self-attention is the basic operation of the transformer. It is an attention mechanism that relates different positions of a sequence to compute a representation of the sequence. In contrast to the locality of convolution, self-attention uses global information to extract features, which handles long-distance dependencies well. In computer vision tasks, this characteristic captures the relations between distant pixels in an image and enhances high-level semantic features. In anomaly detection, we speculate that a self-attention based autoencoder can reconstruct normal images well and reconstruct anomalous regions to look like the normal parts, due to its larger receptive field and the interaction of pixels across the image, most of which are normal.

Figure 1. An overview of our anomaly segmentation method. Firstly, we use a vision transformer (ViT) to extract the features of the input image. Then we reshape the output features of each ViT layer from a 1D sequence into a 2D feature map and concatenate the feature maps from different layers to get a dense hierarchical feature representation f. After that, we use a self-attention autoencoder to compress the feature representation and reconstruct it. Finally, we compute the L2 loss between the feature maps f and the reconstruction f̂ to get the anomaly score map and upsample it to the size of the input image. We can define a threshold T to binarize the anomaly map and get the anomaly segmentation of the whole image.
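To make the global receptive field concrete, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices Wq, Wk and Wv are illustrative stand-ins for learned weights; multi-head attention, masking and biases are omitted. Every output position is a weighted average over all N input positions, which is the property discussed above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x of shape (N, D).

    Returns the attended output (N, D) and the attention matrix (N, N).
    Each row of the attention matrix sums to 1, so every output position
    mixes information from all N positions: a global receptive field."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))  # (N, N) attention weights
    return attn @ v, attn
```

Because every entry of the attention matrix is strictly positive, no input position is ever fully ignored, in contrast to a convolution whose kernel only sees a local window.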

Method
Fig 1 shows the overall structure of our method. It can be divided into three steps. The first step is using the pretrained ViT to extract features and stack the feature maps together into a dense hierarchical feature representation. The second step is using the self-attention autoencoder to reconstruct the dense feature representation. The third step is comparing the dense feature representation with its reconstruction to get the anomaly score map, then upsampling the anomaly map to the size of the input image and obtaining the anomaly segmentation of the whole image with a threshold. The details of the three steps are described in the following paragraphs.

ViT Feature Extraction and Concatenation
We use a pretrained ViT [9] as the feature extractor. ViT is a pure transformer model designed for image classification. It follows the structure of the original transformer [10] as closely as possible. Since a transformer receives a sequence as input, the image is split into fixed-size patches, each of which is linearly embedded to form the transformer input. ViT consists of several encoders. Each encoder is composed of a multi-headed self-attention (MSA) layer and a multi-layer perceptron (MLP) layer. The output of each encoder is a feature representation of the input image. We reshape these outputs from 1D sequences into 2D feature maps and concatenate them to get a dense hierarchical feature representation of the input image, as shown in Fig 1. Suppose the input image has height H, width W and C channels. ViT receives a 1D sequence as input, so the image x ∈ R^(H×W×C) is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where the patch size is P × P and N = HW/P². Now we have a sequence of N patches, each of dimension P²·C. A linear projection layer then maps each patch to D dimensions. After that, we get a sequence of N patches with dimension D, called the patch embedding, which is the input of the transformer.
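The patch-splitting step above can be sketched in a few lines of NumPy; this is an assumed illustration of the standard ViT patchify operation, not the authors' code. With H = W = 384, P = 16 and C = 3 (the configuration used later in the paper), it yields N = 576 patches of dimension 768.

```python
import numpy as np

def patchify(img, P):
    """Split an H×W×C image into N = HW/P² flattened patches of dim P²·C,
    mirroring the ViT input preparation described above."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P·P·C)
    x = img.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return x.reshape(-1, P * P * C)
```

The learned linear projection to D dimensions would then be a single (P²·C) × D matrix multiplication applied to each row.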
The ViT consists of M encoders. The output of each encoder has size N × D. The N patches are flattened from the 2D image and each patch corresponds to a region of the input image, so we can reshape the output of each encoder into a 2D feature map, giving {z_1(x), z_2(x), ..., z_M(x)} where z_i(x) ∈ R^(H₀×W₀×D), H₀ = H/P and W₀ = W/P. Then we concatenate all the feature maps to get the stacked feature maps f(x). With self-attention, each point on the feature map of a ViT layer comes from a weighted average of all points on the output of the previous layer. This gives ViT less inductive bias and more freedom compared to a CNN, which leads to a good representation of the input image. The feature map contains enough discriminative semantic information due to pretraining on a large-scale dataset for the classification task. Besides, each point on the feature map corresponds to a spatial region of the input image, so the anomaly segmentation on the feature map can be mapped to a segmentation of the whole image. In addition, the feature maps of a CNN have various sizes, which need to be aligned manually, whereas the feature maps of ViT are naturally aligned. So the feature maps can be aggregated directly into a dense feature representation, which avoids the inaccuracy caused by the manual padding and interpolation needed to generate the dense feature representation of a CNN.
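The reshape-and-concatenate step can be sketched as follows; the function name `stack_vit_features` is ours, and the layer outputs are stand-ins for real ViT activations. Because every ViT layer outputs the same N × D shape, no padding or interpolation is needed.

```python
import numpy as np

def stack_vit_features(layer_outputs, H0, W0):
    """Reshape each ViT layer output (N, D) into (H0, W0, D) and
    concatenate along channels -> dense hierarchical map (H0, W0, M·D)."""
    maps = [z.reshape(H0, W0, -1) for z in layer_outputs]
    return np.concatenate(maps, axis=-1)
```

With M = 4 layers, D = 768 and H₀ = W₀ = 24 this produces a 24 × 24 × 3072 representation; adding the linear-projection output as a fifth map, as the paper does, gives the 24 × 24 × 3840 input reported in the training configuration.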
Finally, [15] shows that ViT is able to learn higher-level relationships very early because it uses global attention instead of local attention. This means we do not need to concatenate as many layers as with a CNN, which is more time-efficient.

Figure 2. The detailed structure of the self-attention autoencoder. The input feature maps pass through a convolutional encoder with solely 1 × 1 kernels and are compressed channel-wise at each spatial position separately. Then they are flattened into 2D patches and fed into a transformer structure. Learnable position encoding is added to the input embedding of both the transformer encoder and the transformer decoder. Finally, we use a convolutional decoder symmetric to the convolutional encoder to reproduce the input feature maps.

The detailed structure of the self-attention autoencoder is shown in Fig 2. In our method, only the self-attention autoencoder needs to be trained, and it is trained solely on the representations of the anomaly-free images. The input of the self-attention autoencoder is a dense feature representation f(x), and its aim is to reconstruct the input. We call the reconstructed feature maps f̂(x).

Self-Attention Autoencoder
The self-attention encoder (SAE) consists of a convolutional encoder (CE) and a transformer encoder (TE). The input representation f(x) has size W₀ × H₀ × U. It first passes through the CE, which consists of several convolution units. Each convolution unit consists of a convolution layer with 1 × 1 kernels, a batch norm layer [18] and a Rectified Linear Unit (ReLU) activation layer [20]. The output of the CE, g(x), has size W₀ × H₀ × V, where V ≪ U. The CE mainly compresses each point on the feature maps separately to a lower dimension. The structure of the TE is almost the same as ViT [9]. In order to handle the 2D input, we reshape it into a sequence of flattened 2D patches. These patches then pass through a trainable linear projection and are mapped to a fixed E dimensions, the constant latent vector size of the TE. The output of this projection is called the patch embeddings. Following the configuration in [9], we add standard learnable 1D positional encodings to the patch embeddings in order to retain positional information. The sum of the patch embeddings and the positional encodings is the input of the TE. The components of the TE are the same as in the standard transformer [10]: it is composed of a multi-head self-attention (MSA) layer and a feed-forward network (FFN). The self-attention operation connects the information between all points of the feature maps and uses global context information to produce the output. The FFN is a two-layer MLP; its input and output have the same dimension V, and the inner layer has dimension L, called the MLP size. In conclusion, the self-attention encoder extends ViT with several convolution layers for dimension reduction. Using only a linear projection for dimension reduction is also feasible, but it is too weak to capture enough information for reconstruction.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 August 2021 doi:10.20944/preprints202108.0570.v1
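Since a 1 × 1 convolution acts on each spatial position independently, the CE is equivalent to a per-position MLP. The sketch below illustrates this equivalence in NumPy; batch norm is omitted for brevity, and the weight matrices stand in for learned kernels, so this is an assumed simplification rather than the exact CE.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1_encoder(f, weights):
    """Convolutional encoder with only 1×1 kernels: each spatial position
    of the (H0, W0, U) feature map is compressed independently.

    `weights` is a list of (in_dim, out_dim) matrices; a 1×1 convolution
    is exactly a per-pixel linear layer, so matmul suffices here.
    Batch norm layers are omitted from this sketch."""
    g = f
    for W in weights:
        g = relu(g @ W)
    return g
```

Stacking matrices that shrink U down to V ≪ U reproduces the channel compression described above while leaving the W₀ × H₀ grid untouched.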
The self-attention decoder (SAD) consists of a transformer decoder (TD) and a convolutional decoder (CD). The TD is composed of two multi-head attention layers and an FFN. The first attention layer is self-attention. For the second attention layer, the query comes from the output of the first self-attention layer, while the key and value come from the output of the TE. The TD accepts three inputs: the decoder input embedding, the positional encoding and the output of the TE. The decoder input embedding is a learnable parameter, similar to the object queries described in [21], and contains position-related information. We directly use the positional encoding of the TE as the positional encoding of the TD. The decoding process is parallel, which differs from the original transformer, because the number of output patches is fixed and the information of each decoder input embedding can be transmitted in the self-attention layer. Parallel decoding is much faster than serial decoding. The main operation of the TD happens in the second attention layer: it queries the necessary information from the output of the TE to reconstruct the input feature maps. We add a linear projection layer at the tail of the TD. Finally, we use a CD, which has several convolution units with 1 × 1 kernels and is symmetric to the CE, to restore the channels of the output of the self-attention autoencoder to the same number as the input feature maps. The detailed structure of the CE and CD is listed in Table 1. We use the L2 distance between the input representation f(x) and its reconstruction f̂(x) as the loss function:

L(x) = (1 / (H₀ · W₀)) Σ_{i,j} ‖f(x)_{i,j} − f̂(x)_{i,j}‖₂²

Intuitively, the CE compresses each component of the feature maps and the CD rebuilds them again. The transformer autoencoder makes connections between the components of the feature maps. The SAD takes more account of global information due to the global receptive field of self-attention, which brings strong inpainting ability.
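The training objective above is straightforward to express in code; the following is a minimal sketch assuming f and f̂ are stored as (H₀, W₀, U) arrays.

```python
import numpy as np

def reconstruction_loss(f, f_hat):
    """L2 reconstruction loss between the input representation f(x) and
    its reconstruction f̂(x): squared error per spatial position,
    averaged over all H0×W0 positions on the feature grid."""
    per_position = ((f - f_hat) ** 2).sum(axis=-1)  # (H0, W0)
    return per_position.mean()
```

The per-position squared errors computed here are exactly the quantities reused at test time as the anomaly score map.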
The self-attention autoencoder is both an effective and efficient autoencoder for feature reconstruction based anomaly detection.

Anomaly Segmentation
The last step is to calculate the anomaly score map, upsample it to the size of the input image, and choose a threshold to binarize the anomaly score map into a pixel-wise anomaly segmentation. We first compare the difference between the image representation f(x) and its reconstruction f̂(x) by calculating the point-wise reconstruction error to get the anomaly score on the feature maps:

A(x)_{i,j} = ‖f(x)_{i,j} − f̂(x)_{i,j}‖₂²

where A(x) is the anomaly score map of the feature maps, of size W₀ × H₀. Then we bilinearly upsample the anomaly score map to the size W × H of the input image to get the pixel-wise anomaly score map. We assume that the self-attention autoencoder trained solely on the representations of normal images is unable to reconstruct the regional features of anomalous images, so anomalous regions will correspond to large reconstruction errors. We use the same method as described in [3] to decide a threshold T for anomaly segmentation, which uses an acceptable false positive rate (FPR) on the normal images to estimate the threshold. If the FPR is 0.005, then 0.5 percent of the pixels in the normal images will be misclassified as anomalous at that threshold.
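The whole segmentation step can be sketched as below. This is an assumed, dependency-free illustration: nearest-neighbour upsampling replaces the paper's bilinear interpolation to keep the sketch self-contained, and the threshold is taken as the (1 − FPR) quantile of the anomaly scores on normal pixels, following the FPR-based rule described above.

```python
import numpy as np

def anomaly_score_map(f, f_hat):
    """Per-position reconstruction error A(x) on the W0×H0 feature grid."""
    return ((f - f_hat) ** 2).sum(axis=-1)

def upsample_nearest(a, H, W):
    """Upsample the score map to image size H×W.

    The paper uses bilinear interpolation; nearest-neighbour is used
    here only to keep the sketch free of external dependencies."""
    h0, w0 = a.shape
    rows = np.arange(H) * h0 // H
    cols = np.arange(W) * w0 // W
    return a[np.ix_(rows, cols)]

def threshold_from_fpr(normal_scores, fpr=0.005):
    """Pick T so that roughly `fpr` of normal-image pixels exceed it."""
    return np.quantile(normal_scores.ravel(), 1.0 - fpr)
```

A binary segmentation is then simply `upsampled_scores > T`.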

Dataset
We evaluate our method on the MVTec Anomaly Detection (MVTec AD) dataset [1]. It mimics real-world industrial inspection scenarios and consists of 5354 high-resolution images of 10 object classes and 5 texture classes from different domains. The MVTec AD dataset provides 3629 images for training and 1725 images for testing. The training set contains only defect-free images. The test set contains both defect-free images and anomalous images, covering 73 different types of defects, such as contaminations, holes, misplacements and bends. It provides pixel-wise ground truth for all anomalies. The image resolutions range between 700 × 700 and 1024 × 1024 pixels. It has become a commonly used benchmark for anomaly detection and segmentation.

Training and Testing Configuration
In the training stage, we only use normal images. We choose ViT-B/16 [9] pretrained on ImageNet as the feature extractor. ViT-B/16 has 12 layers in total, and we aggregate the first four layers into the dense feature maps f_{1:4}(x) as the representation of the input image. Additionally, we concatenate the output of the linear projection layer with the dense feature maps. The weights of the pretrained ViT are frozen during training and testing. We choose Adam [19] as the optimizer and set the learning rate to 1e-4. All input images are resized to 384 × 384. The input of the SAAE has size 24 × 24 × 3840. The patch size of the SAAE is 1 × 1. We use Principal Component Analysis (PCA) to compute the latent dim V, as described in [3], retaining 80% of the information. The MLP size is V for texture classes and 2V for object classes. The depth of the transformer encoder and decoder is 1 and the number of heads is 3. We train our model on a single GTX 1080Ti for 700 epochs with batch size 32.
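The PCA-based choice of V can be sketched as below: V is the smallest number of principal components whose cumulative explained variance reaches 80%. This is an assumed reading of the procedure in [3]; the function name is ours.

```python
import numpy as np

def pca_latent_dim(features, retain=0.80):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `retain`, used to set the latent dim V.

    `features` is an (n_samples, n_dims) matrix of feature vectors."""
    x = features - features.mean(axis=0)
    # Singular values of the centred data give the component variances.
    s = np.linalg.svd(x, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, retain) + 1)
```

In practice the feature vectors would be the per-position columns of the dense representation collected over the normal training images.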

Evaluation
Following the measurements reported in the existing literature, we choose two threshold-independent metrics to evaluate the performance of anomaly segmentation. The first one is the area under the receiver operating characteristic curve (ROC-AUC). It measures the accuracy of pixel-level segmentation. We regard abnormal pixels as positive and normal pixels as negative, so the true positive rate (TPR) is the percentage of pixels correctly classified as abnormal and the false positive rate (FPR) is the percentage of pixels incorrectly classified as abnormal. ROC-AUC is simple and widely used, but it has the problem that a large correctly segmented area can offset many incorrectly segmented small areas. So we also use a second metric, the area under the per-region-overlap curve (PRO-AUC), proposed in [7]. In contrast to per-pixel ROC-AUC, it measures the performance of the anomaly segmentation at region level. It gives equal weight to all connected components within the ground truth anomalous region, which means it treats big and small anomalous regions equally. We compute the PRO-AUC as described in [7] and report the normalized PRO-AUC up to an average per-pixel FPR of 30%.

Figure 3. Qualitative results of our model. The first row shows the input images, the second row the ground truth anomaly segmentation and the third the anomaly score maps produced by our model. The last row shows the segmentation results given an FPR of 0 for texture classes and an FPR of 0.001 for object classes on the corresponding training set.

We evaluate our method on the MVTec AD dataset for anomaly segmentation. The quantitative results are shown in Table 2 and Table 3. We compare the performance of our approach with the current state-of-the-art methods. We quote the ROC-AUC and PRO-AUC results from the corresponding papers; Table 3 shows the PRO-AUC results. Our method performs well on both textures and objects, and improves the state-of-the-art result by 1.5% on average. In general, our method has better performance on the MVTec AD dataset than all of the other methods. In addition, it is noticeable that our model has balanced performance across the texture and object categories, which indicates that it is more general and stable. Compared to DFR [3], which is a basic feature reconstruction based method, we get similar or better results in all classes, which suggests that our feature extractor and autoencoder are quite useful. We improve by over 10% ROC-AUC on tile and transistor.

We visualize some anomaly segmentation results of our method in Fig 3. The ablation results are reported in Table 4.

In the first ablation study, we replace our self-attention autoencoder with a convolutional autoencoder (CAE). The structure of the CAE is the same as described in [7].

PCA is not suitable for every feature map. We use PCA to compute the latent dim V for the feature map of layer 4, retaining 80% of the information. For the feature maps of the other layers, the latent dim is set to be the same as layer 4, which is simple and works well in practice.

The results are shown in Table 5. Layer 4 produces the best result, and shallower or deeper layers are slightly worse than it. We notice that in most cases, the shallow layers

in the configuration of f_{1:12} [7]. It shows that our method is pretty time-efficient.