Staircase Recognition and Localization using Convolutional Neural Network (CNN) for Cleaning Robot Application

Multi-floor environments are usually ignored while designing an autonomous robot for indoor cleaning applications. However, for efficient operation in such environments, the ability of a robotic platform to traverse staircases is crucial. Staircase recognition and localization are highly important for planning the traversal of staircases by an autonomous robot. This paper describes a Robot Operating System (ROS) based deep learning approach, using Convolutional Neural Networks (CNNs), for staircase recognition and localization. We use an object detection network to detect staircases in images. We also localize these staircases using a contour detection algorithm to detect the target point, a point close to the center of the first step, and the angle of approach to the target point. Experiments are performed with data obtained from images captured on different types of staircases at different viewpoints/angles. Results show that the approach is very accurate in identifying the presence of a staircase in the working environment and is also able to locate the target point with reasonable accuracy.


Introduction
Robots are key to the automation of various tasks and have revolutionized many fields in the 21st century. They are extremely precise and efficient in doing tasks which would otherwise be difficult or impossible. One type of robot that is becoming increasingly important is the cleaning robot, programmed to work in indoor environments [1][2][3]. These robots have vast potential to enhance productivity in cleaning tasks in domestic and commercial settings and have witnessed a steep rise over the last two decades. It is estimated that between 2015 and 2018, about 25.2 million robotic cleaning units would be sold worldwide [4].
Indoor robots are designed for fully autonomous traversal of indoor environments [1]. Typical indoor traversal requires the robot to be able to avoid objects or obstructions by integrating sensing models and communication modules [5,6]. However, if the environment is multi-floor, the robot must be able to detect and localize staircases in order to perform cleaning tasks on them. Conventional indoor cleaning robots are usually designed for single-floor operation, i.e., they cannot climb the stairs and reach the next floor. However, many real-world indoor environments have multiple floors connected by staircases, which may be of different types, e.g., straight, spiral, or L-shaped. This severely limits the ability of conventional cleaning robots to be effective in such scenarios. To enable robots to traverse staircases, accurate detection and localization of staircases are highly critical. This would enable robots to plan and navigate through such environments, thereby making them much more effective for real-world indoor environments.
The problem of recognizing a staircase seems straightforward for humans. However, for robots, the task of detecting, recognizing and localizing staircases is much more complex. For this, the robot should be able to analyze incoming images and detect and recognize stair-like structures among many other objects. Further, knowledge of the location of the staircase is required, as is knowledge of the angle of approach, for aligning the robot with the staircase in order to start climbing. This makes staircase detection and localization a highly demanding task for autonomous robots.
In this article, we have designed our solution to work with the sTetro platform [7]. sTetro is a reconfigurable cleaning robot that can change its shape to climb staircases autonomously. In previous work, we validated the sTetro robot with respect to area coverage by benchmarking its performance against a fixed-morphology robot. The results indicated that the sTetro robot could achieve superior coverage performance through its shape-shifting ability. However, the validation was done by passing manual commands to navigate the robot towards the first step of the staircase, and no autonomous strategies were applied. In this paper, we extend our previous work by integrating the sensing and manipulation modules with sTetro in a ROS environment, which enables the robot to navigate autonomously to the staircase. For this, we use a deep learning approach to detect and recognize staircases in the incoming image stream. However, the robot also requires knowledge of how to move based on the staircase position/location. So, after detecting staircases, the next step is to localize the robot with respect to the stairs. For this, we use our own contour detection algorithm to find the target point (the center of the first step of the staircase) and the angle of approach. This enables the robot to approach the center of the staircase, align itself with the first stair and then start climbing. This article is organized as follows: we first discuss related works and other approaches used for staircase detection. This is followed by details of our proposed approach. Finally, we describe the experimental setup used to test our approach and the results obtained in this work.

Related Work
Many different approaches have been proposed in the literature to address the problem of staircase detection. These approaches can be grouped based on the sensors and algorithms used.
Standard sensors like RGB and RGB-D cameras have been used for staircase detection, with the prominent approach being the detection of parallel lines in the image [8]. Though this approach works in many scenarios, it has several limitations. The Hough transform [9], the preferred approach for detecting parallel lines, fails to detect staircases which are curved or spiral. Also, these algorithms assume that the robot is parallel to the staircase and facing it, which is not the case in most real-world scenarios: the robot must be able to detect the staircase and then plan its approach to it. These approaches are also not designed for staircase-climbing robots, which require alignment with the staircase and hence data pertaining to its first step. Other approaches using RGB or RGB-D cameras include the use of Gabor filters, 3D column maps [10], etc. These are able to handle approaching staircases from different angles. Even LiDARs have been used for detection of staircases [11][12][13]; however, these approaches also assume that the robot is parallel to the staircase and facing it. Some other sensors that have been used are monocular [14][15][16][17] and stereo sensors [18].
Object detection and classification by deep learning CNNs have been researched intensively over decades and have ushered in a revolutionary era in many applications, especially in robotics [19,20]. Deep learning using large neural networks can find the unique features of various objects autonomously, thereby reducing the need for pre-defined kernel-based solutions. Since the features of the objects can be identified, the most significant achievement of these deep learning-based methods is the ability to identify uncanny features in object classification. As a result, the effectiveness and accuracy of deep learning-based methods significantly outperform those of conventional methods [21]. However, the biggest challenge is training these large networks, as they require a large amount of computation to converge while estimating the various parameters defined in the network. Recently, technologies in parallel computing such as the Compute Unified Device Architecture (CUDA) [22] and the NVIDIA CUDA Deep Neural Network library (cuDNN) [23] enable parallel computing using multiple threads to process large calculations on separate graphics cards with their own Graphics Processing Unit (GPU). Consequently, the training time of these networks has reduced sharply. However, deploying these deep learning approaches on compact embedded controllers is still a big challenge for commercial applications. With the dawn of IoT devices, using a server-client model to reduce this real-time computation load is more practical.
There are many advantages to using deep learning for staircase detection. The model can be trained to detect different types of staircases, including spiral and curved, which prove to be very difficult for standard algorithms. Also, models can be trained to recognize staircases from different angles, which eliminates the need for the robot to be in front of the staircase. These models are highly accurate as well, because they learn features from data rather than having them specified in the algorithm. Recently, advances in object detection, including object localization and classification, have been driven by the success of state-of-the-art convolutional network (ConvNet)-based object detectors such as the Region-based Convolutional Network method (RCNN) [24]. The issue with deep learning approaches was the computation power required to run these models in real time.
However, with the recent updates to MobileNets [25,26] and the release of SSD (Single Shot multi-box Detector) [27], these models are now capable of running in real time on low-cost hardware. We also compute the target point and the angle of approach, which allows the robot to align itself with the center of the staircase, thereby having enough space to be able to climb the staircase. The proposed system runs on a ROS system; its ROS-based block diagram using a client-server model is shown in Figure 1. ROS topics transmitted in the ROS network through WiFi or data networks enable communication between ROS nodes. Specifically, the perception client first uses an image sensor to extract an image from the image stream and sends it over to the server. The server uses two steps to detect and localize staircases. The first step involves detection of staircases: features are extracted from an RGB image using the MobileNet architecture, and a bounding box is then classified and predicted using the SSD architecture. These rely on the CNN architecture, which is explained below. The second step has two components. First, we detect the first step of the staircase.

Methodology
The first step information is then used to detect the target point p_(x,y), a point close to the center of the first step. The angle of approach θ is also computed in this step. We then send this information to the portion of the client responsible for movement. Based on the angle of approach, we decide a movement strategy, which is executed by the onboard micro-controller with the help of the motors. While this happens, the perception client updates the server with new data. This is repeated until the target point is reached.

In a CNN, the Convolution, ReLU and Pooling layer(s) extract features from the image. These are fed to the Fully Connected layer(s) that classify the image. Figure 2 shows the structure of a simple CNN.

Convolution layer
This layer involves applying a filter over the input layer. Typically, multiple filters are applied on the input layer. Filters are applied using a sliding window, which slides over the image and applies the filter on the region covered by the window. The input layer is zero-padded so that the window can cover the entire input. The filter used is called the weight function. Equation 1 is used to compute the activation function f(x):

f(x) = w · x + b, (1)

where w represents the filter (weights), b represents the bias and x represents the input.
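As a minimal NumPy sketch (for illustration only; the function name `conv2d` is ours, not the paper's, and real networks use optimized library kernels), the sliding-window computation of Equation 1 for a single filter can be written as:

```python
import numpy as np

def conv2d(x, w, b, pad=1):
    """Slide a k x k filter w over the zero-padded input x and add bias b."""
    k = w.shape[0]
    xp = np.pad(x, pad)                      # zero padding
    oh = x.shape[0] + 2 * pad - k + 1        # output height
    ow = x.shape[1] + 2 * pad - k + 1        # output width
    out = np.empty((oh, ow))
    for i in range(oh):                      # sliding window over the input
        for j in range(ow):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w) + b
    return out
```

With a 3 × 3 filter and pad = 1, the output has the same spatial size as the input, which is the usual choice in the architectures discussed below.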

ReLU layer
This layer applies a ReLU function to the input layer. This layer is typically used after the Convolution layer and is used to remove the negative values which may have been produced by the filter. Equation 2 computes the ReLU activation function f(x):

f(x) = max(0, x), (2)

where x represents the input layer.
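As a one-line NumPy illustration (not part of the paper's code), Equation 2 applied element-wise is simply:

```python
import numpy as np

def relu(x):
    """Equation 2: element-wise f(x) = max(0, x)."""
    return np.maximum(0, x)
```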

Pooling layer
This layer partitions the input layer into non-overlapping regions, on each of which a pooling function is applied. Each region outputs a single value. This reduces the dimension of the input while retaining the important features.
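A minimal max-pooling sketch in NumPy (illustrative only; the reshape trick below assumes the input dimensions are trimmed to multiples of the region size):

```python
import numpy as np

def max_pool(x, size=2):
    """Split x into non-overlapping size x size regions; keep each region's max."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]      # drop ragged borders
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))
```

For a 4 × 4 input with size = 2 this yields a 2 × 2 output, halving each spatial dimension while keeping the strongest response in each region.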

Fully connected layer
This layer is a typical artificial neural network (ANN) layer. It is fully connected to the input layer. Multiple fully connected layers are used together to classify the image. These take the final pooling layer as input, which represents the extracted features.

Feature Extraction using MobileNets
Specialized CNN-based architectures like MobileNets [25,26], AlexNet [29], Inception [30], ResNet [31], etc. are highly accurate at classifying images. The recently proposed Inception-ResNet [32] has the highest accuracy. However, we use the MobileNet architecture due to its low computation cost, which allows for real-time image classification on mobile hardware [33]. Specifically, we use MobileNetV2 [26], which is an improvement over the standard MobileNet [25]. The basic blocks of both architectures are shown in Figure 3 and Figure 4. The features extracted by MobileNet are used by the SSD architecture to classify and localize staircases. The structure of SSD is discussed in the next section.

MobileNet architecture
The core idea behind MobileNet is depth-wise separable convolutions. These replace the typical convolution layers, as computing convolutions is highly expensive. Depth-wise separable convolutions first apply a depth-wise convolution on the input, which filters the input data. This is followed by 1 × 1 convolutions which combine these filters into features. These depth-wise separable layers approximately mimic the function of typical convolution layers but at much higher speed.
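The cost saving can be quantified by comparing weight counts. The following sketch (our illustration, with bias terms ignored) counts the weights of a standard convolution layer versus its depth-wise separable replacement:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (bias ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depth-wise k x k filter per input channel, then 1x1 point-wise combination."""
    return k * k * c_in + c_in * c_out
```

For a 3 × 3 layer with 32 input and 64 output channels, the standard convolution needs 3·3·32·64 = 18,432 weights while the separable version needs 3·3·32 + 32·64 = 2,336, roughly an 8x reduction, which is the source of MobileNet's speed advantage.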
Another optimization was to use ReLU6 instead of ReLU, which applies an upper limit of 6 on the activation function. Equation 3 is used to compute the activation function f(x):

f(x) = min(max(0, x), 6), (3)

where x is the input. The structure of MobileNet v1 is shown in Figure 3.

To calculate the position of the staircase, which is needed for computing the target point and the angle of approach, we need a bounding box over the staircase. For this purpose, object localization techniques such as the Single Shot Multi-Box Detector (SSD) [27], Faster R-CNN [34], YOLO [35], etc. can be used. These use the features extracted from CNN-based architectures to classify and localize objects. Generally, Faster R-CNN is the preferred method for object localization due to its superior accuracy.
However, SSDs have been shown to perform better in most scenarios for large objects, which is the case for staircases [33]. SSD is also extremely fast, since it requires only one forward pass to compute all bounding boxes. For these reasons, SSD is highly appropriate for the current scenario.

SSD architecture
SSD passes the input through multiple convolution layers which progressively decrease in size. Each of these layers generates a fixed set of predictions, which enables predictions at various sizes. In addition to this, SSD uses a set of default bounding boxes of different dimensions. These are applied to the predictions from different layers, which allows for predicting boxes of different sizes and scales. Although the bounding boxes may not be pixel-perfect, the loss in accuracy is found to be very minimal; in exchange, the speed is drastically increased.
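The default-box mechanism can be sketched as follows (a simplified illustration of the idea, not the paper's or the SSD authors' exact box-generation code; the aspect ratios are assumed values):

```python
import itertools
import math

def default_boxes(fmap_size, scale, ratios=(1.0, 2.0, 0.5)):
    """Center-form (cx, cy, w, h) default boxes, normalized to [0, 1],
    for one fmap_size x fmap_size feature map at a given scale."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size           # box center at the cell center
        cy = (i + 0.5) / fmap_size
        for r in ratios:                     # wider / taller variants
            boxes.append((cx, cy, scale * math.sqrt(r), scale / math.sqrt(r)))
    return boxes
```

Applying this over several feature maps of decreasing size, each with its own scale, is what lets SSD cover small and large objects in a single forward pass.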

Loss in SSD training
Since SSD involves predicting a bounding box along with the class for an object, typical loss computation cannot be followed. The loss in SSD training is a weighted sum of the losses due to two aspects: confidence and localization. Equation 4 is used for computing the total loss L:

L = (1/N) (L_conf + α L_loc), (4)

where N is the number of matched boxes and α is the weight term which balances the confidence loss L_conf and the localization loss L_loc. The confidence loss is the loss that occurs due to classification; it is computed as a Softmax over class confidences. The localization loss is the loss in predicting the bounding box of the object; it is computed as a smooth L1 loss between the predicted box and the ground-truth box.

Loss optimization
We use Root Mean Square propagation (RMSprop) [36] for loss optimization. The update rule is:

v_t = β v_(t−1) + (1 − β) g_t²
w_(t+1) = w_t − (η / √(v_t + ε)) g_t

where g_t is the gradient along weight w, w_t is the weight at time t, v_t is the exponential average of squared gradients, η is the initial learning rate, β is a hyperparameter to be tuned, and ε is a constant with a value close to zero to avoid division-by-zero errors.
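One RMSprop step can be sketched as follows (a minimal illustration of the update rule; the default η matches the learning rate used later in the experiments, while β and ε are common default values, not values stated by the paper):

```python
import numpy as np

def rmsprop_step(w, v, g, eta=0.004, beta=0.9, eps=1e-8):
    """v_t = beta * v + (1 - beta) * g^2;  w_{t+1} = w - eta * g / sqrt(v_t + eps)."""
    v = beta * v + (1 - beta) * g ** 2       # exponential average of squared gradients
    w = w - eta * g / np.sqrt(v + eps)       # scaled gradient descent step
    return w, v
```

Scaling each weight's step by the running magnitude of its own gradients is what lets RMSprop use one global learning rate across very differently scaled layers.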

First step detection
Through SSD and MobileNets, we are able to detect the staircase and predict a bounding box over it. Now, we require knowledge of the first step to determine the target point and the angle of approach. We do this by analyzing the staircase inside the bounding box. These values enable our sTetro robot to move towards the detected staircase autonomously. For detection of steps, the prominent method is the Hough transform [37], which detects straight lines. However, steps of staircases may be curved as well, and traditional contour detection has difficulty handling this situation. Also, we do not need to detect all the contours in the image, as we require only the first step in our calculations.
The proposed method for detecting the first step of the staircase can be divided into two parts. First, we detect edges using Canny edge detection; this is followed by contour detection.

Edge Detection
To be able to detect the first step in the image, we first need to detect edges present in the image.
For this, we use the Canny edge detection algorithm [38]. To enhance the accuracy of detected edges, the Canny algorithm is divided into four steps: a Gaussian filter step to remove noise, gradient calculation to find edge-pixel candidates, a non-maximum suppression step to thin the detected edge lines, and a hysteresis thresholding step to remove edges with weak gradient responses. This returns an edge representation of the image, i.e., only the pixels that constitute edges are positive; all other pixels are 0.
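The pipeline can be approximated as follows (a deliberately simplified NumPy sketch for illustration only; a real system would call a full implementation such as OpenCV's cv2.Canny, and the thresholds here are assumed values):

```python
import numpy as np

def simple_edges(img, lo=0.1, hi=0.3):
    """Simplified Canny-style edge sketch: smoothing, gradient magnitude,
    and double thresholding with one-step hysteresis (illustration only)."""
    img = img.astype(float)
    k = np.array([1.0, 2.0, 1.0]) / 4.0      # small Gaussian-like kernel
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, img)
    gy, gx = np.gradient(img)                # central-difference gradients
    mag = np.hypot(gx, gy)
    mag = mag / (mag.max() + 1e-12)          # normalize to [0, 1]
    strong = mag >= hi                       # definite edge pixels
    weak = (mag >= lo) & ~strong             # kept only next to a strong edge
    near = np.zeros_like(strong)
    near[1:-1, 1:-1] = (strong[:-2, 1:-1] | strong[2:, 1:-1] |
                        strong[1:-1, :-2] | strong[1:-1, 2:])
    return strong | (weak & near)
```

The output is a boolean edge map of the kind the contour detection algorithm below consumes: edge pixels are positive, all others are 0.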

Contour detection algorithm
This paper proposes a contour detection algorithm for determining the first step, as shown in Algorithm 1. The algorithm detects points within a certain distance and certain bounds while giving preference to points along the gradient of the most recently detected points. Before computation, we generate canny, which is the Canny edge representation of the image. We use two functions in this algorithm. Function getContour() returns the contour representing the first step of the staircase. This function uses canny to generate a contour representing the first step. It checks the image from bottom to top (represented by i) and left to right (represented by j). If an edge point is detected, it calls getContourWithIJ() with a contour consisting of only the point (i, j). Finally, it returns the contour only if its horizontal length is greater than a certain threshold, thres_filter. We set this to 60% of the width of the staircase during experiments. This is done using getBounds(), which gets the difference between the horizontal bounds of the contour.
Function getContourWithIJ() finds the remaining contour from the reference point (i, j) in canny. This is merged with the contour detected prior to this point, which is represented by contour. It consists of four steps to identify the next point in the contour. First, the equation of the line that fits the previous 5 points in the contour is generated. The function fitLine() does this using a line-fitting linear regression approach [39]. We also set a bound on the slope of the line. Candidate points within these bounds are pushed onto an ordered array; for each candidate whose corresponding canny pixel is positive and which lies within the bottom limit, getContourWithIJ() is called recursively with the candidate appended to the contour.

The working of the algorithm for three different scenarios, based on the slope of the line, is shown in Figure 5, which indicates in which iteration of the algorithm each pixel would be evaluated. For two pixels having the same iteration number, the one closest to the slope is evaluated first.
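The overall scan-and-grow idea can be sketched as follows (a simplified iterative illustration, not the authors' Algorithm 1: it drops the line-fitting and slope-ordering steps and just prefers the vertically nearest edge pixel; function names and the v_window parameter are ours):

```python
import numpy as np

def first_step_contour(canny, v_window=2, min_frac=0.6):
    """Scan the edge map bottom-to-top, left-to-right for a starting edge pixel,
    then grow the contour rightwards, preferring edge pixels near the previous
    row. Keep the contour only if it spans at least min_frac of the width."""
    h, w = canny.shape
    for i in range(h - 1, -1, -1):           # bottom to top
        for j in range(w):                   # left to right
            if not canny[i, j]:
                continue
            contour, row = [(i, j)], i
            for col in range(j + 1, w):      # extend to the right
                cand = [r for r in range(max(0, row - v_window),
                                         min(h, row + v_window + 1))
                        if canny[r, col]]
                if not cand:
                    break
                row = min(cand, key=lambda r: abs(r - row))
                contour.append((row, col))
            if contour[-1][1] - contour[0][1] >= min_frac * w:
                return contour               # long enough: accept as first step
    return None
```

Because the scan starts at the bottom of the image, the first sufficiently long contour found corresponds to the lowest step, which is exactly the one needed for the target-point computation.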

Determining the target point
Since we have the contour representing the first step, determining the target point p_(x,y) (the approximate central position of the first step of the staircase) is rather simple. We find the n points which are closest to the horizontal center. The distance here is the linear distance between X-coordinates, since there would not be multiple points with the same X-coordinate. We then fit a line through these points and predict the Y-coordinate at the horizontal center. The horizontal position x and the predicted vertical position y give the central target point p_(x,y). The slope of this line gives the angle of approach (θ). This is necessary so that the robot can rotate itself to align with the staircase. In general, a negative angle means the robot is located to the right of the staircase and should rotate clockwise while moving left, so that it becomes aligned with the staircase. Similarly, a positive angle means the robot is located to the left of the staircase and should rotate counter-clockwise while moving right. An angle close to zero means the robot is already aligned with the staircase. We use a small threshold around zero to accommodate slight variations which may occur during detection.
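The computation can be sketched as follows (our illustration of the procedure described above; the contour format matches the sketch in the previous section, and the sign of θ depends on the image coordinate convention, which is an assumption here):

```python
import math
import numpy as np

def target_point_and_angle(contour, n=5):
    """Fit a line through the n contour points nearest the horizontal center;
    return the target point p_(x, y) and the approach angle theta in degrees."""
    ys = np.array([p[0] for p in contour], dtype=float)   # rows (Y)
    xs = np.array([p[1] for p in contour], dtype=float)   # columns (X)
    cx = (xs.min() + xs.max()) / 2.0         # horizontal center of the step
    idx = np.argsort(np.abs(xs - cx))[:n]    # n points closest to the center
    slope, intercept = np.polyfit(xs[idx], ys[idx], 1)
    target = (float(cx), float(slope * cx + intercept))
    theta = math.degrees(math.atan(slope))   # angle of approach
    return target, theta
```

For a perfectly horizontal first step, θ comes out near zero and the target point sits at the middle of the contour, so the robot drives straight ahead without rotating.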

Experimental Setup
In this section, we discuss the data-set specification and the specifications of the models used.
The data-set was generated by capturing images of different staircases using an RGB camera; details are discussed below. The weights of the network were initialized to those of a network trained on the COCO data-set [40], a large data-set labelled with the 80 most common object classes. This was done for faster convergence during training and also helps in achieving better overall accuracy. We used a batch size of 20 while training, and an RMSprop optimizer for loss optimization with an initial learning rate of 0.004 and a decay factor of 0.9.

Obtaining Data-set
The sTetro robot [7] is used to capture images of the working environment. An RGB camera fitted on top of the sTetro robot is used to take the images, as shown in Figure 6. The final results are obtained from a set of 102 images not present in the training data-set. Approximately half of these images contain staircases; the other half includes images of corridors, open areas and other common environments for cleaning robots where objects with features similar to staircases may be present.

Results Discussion
In this section, we discuss the results obtained on the staircase data-set used. We divide this section into two categories: staircase detection and first step detection. The model is able to detect staircases with different orientations as well. In Figure 8h, the staircase is not detected; this can be attributed to bad lighting conditions and the larger distance between the robot and the staircase. The model did not detect ladders as staircases, which is a common issue with traditional methods. Also, the model is able to detect curved staircases, which prove to be a challenge for traditional linear line detection based approaches.

Staircases Detection
However, the model did fail in certain cases where the images were very similar to staircases. Some examples of these scenarios are given in Figure 9. The model is able to differentiate between staircases and tiled floors, which have parallel lines; one such example is Figure 9a. This is a very common scenario where traditional classifiers struggle, since the floor has parallel lines, a common feature used to detect staircases. Figure 9b is detected as a staircase. This can be attributed to the fact that the railings look like a staircase rotated by 90°. However, it is easy to differentiate between staircases and railings by considering the angle of the lines detected. In Figure 9c, the wall has features similar to a staircase, which causes the model to detect it as one. The model also does not detect ladders as staircases, as shown in Figure 9d. In Figure 9e, the combination of table and chair is also not recognized as a staircase, although they have similar features. However, when combined with lines on the floor, the model detects them as a staircase, as shown in Figure 9f. If we put a constraint on the height of the steps, i.e., a constraint on the distance between two contour lines, this false detection may be avoided. The overall results are given in Table 1.

First step detection results
This section discusses results pertaining to detection of the first step of staircases. We use 60% of the staircase width as the thres_filter to filter out small contours. Also, we filter out all vertical edges by using a thres parameter of ≈ 60°. This value is obtained by grid searching through possible values for thres and selecting the one with the best accuracy. The algorithm is able to detect both the target point p_(x,y) and the angle of approach θ with good accuracy. The results are showcased in Figure 11. Here, the first step of the staircase detected using contour detection is denoted by the line. The cross represents the computed target point p_(x,y). The detected angle of approach θ is also shown below each figure.
The algorithm is able to detect these with good accuracy, as shown in Figures 11a, 11b, 11c, 11e and 11f. Figure 11d is the image of a textured staircase, where the approach is not very consistent due to the extra edges produced by the texture. We also tested the system in real time: the robot was able to detect staircases by sending frames at regular intervals to a server, which processed them and returned a direction logic (i.e., the angle of approach) to the robot. The details of the target point were also sent so that the robot could move towards this location. In this test, the robot moved from the left to the right of the staircase. Detection of the staircase during this test is shown in Figures 12a and 12b. The predicted directions were accurate during our testing scenarios, with minor errors occurring on textured staircases, and the target point was predicted accurately. However, for real-time applications, the running time of the algorithm is also crucial; this is given in Table 2. The overall approach takes about 150 ms on average, which is feasible for real-time applications. The high computation times usually occur during the first run, which can be attributed to loading the model into memory.

Conclusion
In this paper, we described an approach to staircase recognition and localization. We first used a deep learning model to recognize a staircase in the environment. This was done using an object detection network consisting of the MobileNet and SSD architectures. We then used the Canny edge detector followed by our own contour detection algorithm to detect the first step of the staircase. Through this, we identified the target point, a point close to the center of the first step, and the angle of approach, which is used to determine the direction to the staircase. This scheme allows the robot to be aligned with the staircase so that it can start traversing it. We trained and tested our proposed model using our own data-set consisting of images of 11 different staircases captured with the sTetro robot from different viewpoints. We also tested our model against images that have features similar to staircases. The model is able to detect staircases and determine the target point and the angle of approach to the first step with good accuracy. We also tested this model in real-time scenarios and found that it can be used on slow-moving platforms like sTetro. In future, this work can be extended to also recognize the type of staircase, e.g., straight, curved, spiral, etc. This would allow the robot to switch between operation modes according to the type of staircase encountered.

Acknowledgement
This research was supported by Grant No. RGAST1702 from National Robotics Program Office, Singapore (NRPO) to the Engineering Product Development (EPD) at Singapore University of Technology and Design (SUTD).
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 December 2018 | doi:10.20944/preprints201812.0296.v1

Figure 5. In this figure, each box represents one pixel in the image. The value in each box represents the iteration in which it would be visited. The box with value −1 represents the current pixel on which getContourWithIJ() is called. The line represents the slope for the given scenario, with the value of the slope given below each figure. The algorithm clearly gives preference to points along the slope while also examining points in the area bounded by thres.

Figure 6. sTetro robot capturing images from different angles: (a) left of staircase, (b) in front of staircase, (c) right of staircase.

Figure 7. Some of the images present in the data-set.
Initially, the images were taken from different staircases at angles ranging from approximately 0° to 180° in front of the stairs. Images from different distances and different positions were also taken. All images have a resolution of 640 × 480 pixels. The data-set consists of 1025 images of 11 different staircases, including both straight and curved staircases. Staircases with curved steps were also included, as well as staircases made of different materials. A 90:10 split was used for training and testing respectively during the training phase. Figure 7 contains some of the images present in the data-set.

Figure 8. Staircase detection. Staircase detected in (a), (b), (c), (d), (e), (f) and (g); not detected in (h).

This section discusses results pertaining to detection of bounding boxes over staircases. Some images of staircases were taken along with images having features similar to staircases, including structures with parallel lines, ladders, floors with textures, etc. The model detected almost all staircases correctly, including different types of staircases and images taken from different angles and distances. This is shown in Figure 8, where the box represents the detected staircase. The confidence of the prediction is shown as a percentage above/below the box. We filter out boxes with confidence less than 80% for staircase localization. Figures 8a, 8b and 8c show that the model is able to detect staircases viewed from different angles. Figure 8d is another staircase with a different material. Figure 8e is a staircase with curved steps, which the model is able to recognize. Figures 8f and 8g are curved staircases.

Table 1. Results of staircase detection.

Table 2. Performance of proposed approach. Times do not include network delays.