An Improved Initialization Method for Monocular Visual-Inertial SLAM

: In the aim of improving the positioning accuracy of the monocular visual-inertial simultaneous localization and mapping (VI-SLAM) system, an improved initialization method with faster convergence is proposed. This approach is classiﬁed into three parts: Firstly, in the initial stage, the pure vision measurement model of ORB-SLAM is employed to make all the variables visible. Sec-ondly, the frequency of the IMU and camera was aligned by IMU pre-integration technology. Thirdly, an improved iterative method is put forward for estimating the initial parameters of IMU faster. The estimation of IMU initial parameters is divided into several simpler sub-problems, containing direction reﬁnement gravity estimation, gyroscope deviation estimation, accelerometer bias, and scale estimation. The experimental results on the self-built robot platform show that our method can up-regulate the initialization convergence speed, simultaneously improve the positioning accuracy of the entire VI-SLAM system.


Introduction
Visual simultaneous localization and mapping (VSLAM) techniques allow mobile robots [1,2] and VR/AR devices [3,4] to be aware of their surrounding scene, while carrying on the self-localization in the unknown environments. In recent years, many visual SLAM methods have been studied, such as multi-sensor fusion SLAM (e.g., visual-inertial, visual-LIDAR, and/or visual-GPS), deep learning SLAM, multi-agent SLAM, as well this technology has attracted a lot of interest in the emerging contexts of 5G/6G communications, since directional antenna arrays and higher bandwidths can be fruitfully exploited to achieve high accuracy and 5G/6G SLAM [5][6][7]. The SLAM system based on pure visual sensors has certain problems in robustness and accuracy, which limits its application in the field of terrestrial mobile robots. The monocular camera is not accurate in comparison with a binocular camera, but the computing complexity is lower, the IMU sensor can solve the problem of tracking failure and low precision when the monocular camera moves into the challenging environment (less texture and/or lighting changes) by using the IMU preintegration technology and EKF/nonlinear optimization methods. On the other hand, the visual sensor can make up for the cumulative drift of the IMU [8]. Indeed, such information is crucial when operating in harsh propagation environments (e.g., rich of multipath) where the typical GNSS information is highly inaccurate or completely unavailable [9]. Now, the monocular visual-inertial SLAM system has become a hot topic which contains strap down inertial measurement units (IMU) and monocular vision sensors to provide a low-cost, lightweight, and high-quality solution for most positioning and navigation applications in an indoor and outdoor environment. For simultaneous interpreting of multiple sensor measurements from various sensor frames, a process of initial parameters estimation and calibration is essential. The camera only needs to be calibrated once because it does not change over time, and the IMU sensor must be initialized before each use. This paper focuses on the IMU's initial values estimation. The IMU initialization process is designed to Martinelli, A. [15,16] Campos, C. [17] Disjoint method

•
The initialization estimation accuracy is high.
• Initial estimation is slow and unstable. • Rely on the monocular visual SLAM process.
Murata, R. [11] The joint visual-inertial initialization approach is introduced through Martinelli at first, which is named closed-form solution. However, this research [15] expressed only in theory and then demonstrated by the simulation of general Gaussian motions, the application of MAV is not feasible. So this method is later modified in [16], not only increasing the estimation of gyroscope bias but also is a successful implementation of actual data from quadrotor MAV. The latest work of [17] put forward a robust and fast initialization approach according to [15,16]. The accuracy is improved through several visual-inertial bundle adjustments (BA), and the robustness of the system is enhanced with the addition of consensus and observability tests. As it is tested on the dataset of Euros [18], it is proved to be consistently initialized with scale errors is less than five percent. However, those initialization methods have several limitations:

•
An ideal hypothesis in which all features are tracked in perspective should be contented. However, it can lead to bad solutions under conditions of spurious tracks. • Compared with [19], the disjoint visual-inertial initialization method, the accuracy of the joint method is lower. To improve it, a lot of frames and tracks are usually added, which leads to the computational cost being so high that the real-time performance is unfeasible.

•
The method in [17] works only at 20% of trajectory points. If the system requires to be started immediately, this may be a problem in robot use.
The disjoint visual-inertial initialization approach, i.e., loosely couple method, depends on a very accurate visual measurement model in the initial stage. This method is first applied by Mur-Artal and later adapted in [11,20] with a good performance on the public Electronics 2021, 10, 3063 3 of 15 dataset. In particular, the motion of MAV with metric scale can be recovered with a small error, and the accuracy of positioning is maintained at centimeter-level [11]. However, this approach also exists several limitations:

•
The process of initial estimation is slow and unstable. On account of the inertial parameters being evaluated through solving a set of the linear equations in various steps utilizing the least square method, it requires an excellent iterative strategy that makes fast convergence. However, the convergence speed in [11] is not reliable enough for all variables estimation. it can be a problem for many real applications. • Initialization is fragile. As the method requires running monocular visual SLAM in advance for finding the accurate inertial parameters. If the visual part gets lost, the inertial system will not be launched immediately.
In summary, there are several initialization methods have been studied for the monocular VI-SLAM system. However, few researchers have tried to improve it from the perspective of non-linear optimization. In the current work, an improved initialization approach that is by the disjoint method is proposed. First, in the initial stage, the pure vision measurement model of ORB-SLAM2 is employed to make all the variables visible. Second, the frequency of the IMU camera was aligned by IMU pre-integration technology [21]. Third, the IMU initialization process, which is highlighted in a dotted block diagram with red color. It is divided into several simpler sub-problems, containing direction refinement gravity estimation, gyroscope deviation estimation, scale estimation as well as accelerometer deviation. In this work, an improved iterative method is put forward for estimating the initial parameters of IMU faster. The experimental outcomes on a real mobile robot demonstrate excellent performance while our initialization method is integrated into the VI-SLAM system which is based on the ORB-SLAM2 skeleton [19,22].
The rest of the current paper is organized as below: We introduce the preparatory work in Section 2. Then the core part of this paper, the IMU initialization process, is illustrated in Section 3. Section 4 introduces the real-time experiment for the mobile robot. Section 5 gives the summaries and the future work.

Preliminaries
In the present section, the necessary notation and the monocular visual-inertial coordinate frames and visual measurement model are briefly reviewed, then the IMU preintegration on the manifold is described in the following sections.

Notation
In this paper, we aim to estimate gyroscope bias, gravity, accelerometer bias together with visual scale in the visual-inertial initialization stage. SO(3) represents a special orthogonal group: Lie group, and so(3) is the corresponding Lie algebra. The vectors are uniformly expressed in italics, the reference frame is marked with a right subscript, e.g., A V for the vector A expressed in frame {V}. If a vector describes the relative transformation from one reference frame to another frame, e.g., A CB for the vector that defines the translation from camera frame {C} to IMU body frame {B}. The correlations between camera frame {C} and IMU body frame {B} is defined by scale factor s is considered: in which s represents visual scale, P (.) and R (.) represent the translation and rotation vector between two coordinate frames, respectively.

Coordinate Frames
The transformation between the coordinate frames is shown in Figure 1. Since the measurements of inertial and visual odometry are changed over time. However, the absolute pose is needed in the pre-fixed reference frame. Therefore, it is assumed that the reference frame of our system coincides with the first keyframe which is determined by the pure visual SLAM. In this work, the G E represents the gravity in the inertial frame {E} of earth. The g w represents the gravity in the world coordinate system {W}. The first keyframe is assumed as a reference frame. The external parameter matrix T CB is described as follows 4 × 4 matrix: where R CB and T CB represent the rotation and translation matrix/vector between camera frame {C} and body frame {B} is calibrated in advance. The transformation between the coordinate frames is shown in Figure 1. Since the measurements of inertial and visual odometry are changed over time. However, the absolute pose is needed in the pre-fixed reference frame. Therefore, it is assumed that the reference frame of our system coincides with the first keyframe which is determined by the pure visual SLAM. In this work, the E G represents the gravity in the inertial frame {E} of earth. The w g represents the gravity in the world coordinate system {W}. The first keyframe is assumed as a reference frame. The external parameter matrix CB T is described as follows 4ⅹ4 matrix: where CB R and CB T represent the rotation and translation matrix/vector between camera frame {C} and body frame {B} is calibrated in advance.

Visual Measurement Model
The ORB-SLAM2 [19,22] visual measurement model is adopted for the initial pose estimation. The system consists of the following three parallel threads, i.e., tracing, local mapping along with loop closing. While the tracking together with local mapping threads is applied in the initialization stage. The tracking part is responsible for deciding whether to treat the new frame as a key. After inserting the new key, the associated IMU pre-integration model is calculated between two consecutive keyframes. In this work, we adopt the conventional visual model with the visual projection function (.) π , which converts 3- x y z are the coordinates of 3D points in the camera frame.

Visual Measurement Model
The ORB-SLAM2 [19,22] visual measurement model is adopted for the initial pose estimation. The system consists of the following three parallel threads, i.e., tracing, local mapping along with loop closing. While the tracking together with local mapping threads is applied in the initialization stage. The tracking part is responsible for deciding whether to treat the new frame as a key. After inserting the new key, the associated IMU preintegration model is calculated between two consecutive keyframes. In this work, we adopt the conventional visual model with the visual projection function π(.), which converts nates of 3D points in the camera frame.

IMU Pre-Integration
As the output of IMU and camera are at different rates, the IMU pre-integration technology for aligning the frequency of the IMU camera is introduced. The concept of IMU pre-integrated is pioneered in [23] and extended in [21] on the manifold space. Assumed that there are two consecutive keyframes at time j and i, and the IMU is synchronized with the camera and provides measurements at discrete times k. The associated IMU position P WB , velocity v WB , and orientation R WB can be calculated by summarizing all of the measurements during this period (i.e., Iterating the IMU integration for all ∆t intervals between two consecutive keyframes at times k = i and k = j): in which ∆t is the sampling interval of IMU, with ∆t ij = (j − i)∆t. The Exp(.) represents an exponential mapping operator that maps Lie algebra so(3) to the Lie group SO (3). It is assumed that the deviation remains unchanged in the course of pre-integration, and the effect of measurement noise of IMU is ignored (usually considered as Gaussian noise), w k B , a k B represent the angular rate and acceleration vectors in the IMU body frame, b k represent the bias of IMU (i.e., gyroscope and accelerometer) and the noise of measurement.
A small correction δb i (.) of the formerly estimated b i (.) could be considered to correct preintegrated outcomes. We can rewrite the expressions in Equation (4) as below: among them, the Jacobians J g (.) and J a (.) express how the measured value change owing to the change of deviation estimation. The biases b i g and b i a remain constant in the course of pre-integration and can be pre-calculated at the time i. The specific Jacobians calculation is shown in [21]. Subsequently, the ∆R ij , ∆v ij and ∆P ij pre-integration values can be directly calculated from the outputs of IMU between two keyframes, which are independent of the gravity and the states at the time i: where ∆R ik , ∆ v ik represents the rotation and velocity increment of the i-th keyframe in k-th interval time. Π(.) is cumulative multiplication operation, ∑(.) is accumulation operation.

IMU Initialization
In the present section, the initial IMU parameters are estimated, containing gravity g w , gyroscope bias b g , visual scale s and accelerometer bias b a . To make all the variables visible, the pure monocular visual SLAM system requires to work for a few seconds and then wait for the several keyframes to be formed (Section 2.2). The specific process of the estimation of IMU parameters is revealed below.

Gyroscope Bias Estimation
From the known direction of two consecutive keyframes, we can estimate the gyro bias. It is assumed that the variation of the deviation is negligible, that is, the bias b g is a Electronics 2021, 10, 3063 6 of 15 constant value, this constant value minimizes the difference between the relative direction calculated via ORB-SLAM2 and the gyro integral for all pairs of continuous keyframes: (7) in which N represents the keyframes number. R WC ·R CB is calculated from the calibration R CB and orientation R (.) WC . ∆R i,i+1 denotes the gyro integration between the two consecutive keyframes. Exp(.) and J g ∆R respectively represents the exponential mapping R 3 → SO3 together with the Jacobian matrix. The analytic Jacobian matrices of similar expression are exhibited in [21].

Gravity Direction Estimation
Due to the direction of gravity having a great effect on the acceleration estimation, the direction of gravity must be refined before estimating the accelerometer bias, gravity, and scale parameters. Particularly, a new constraint, gravity magnitude G(G ≈ 9.8), is introduced. As revealed in Figure 2. The inertial reference frame is defined as {I} and the world frame is defined as {W}, the gravity direction is defined as g I = {0, 0, 1}. According to frame {W}, the direction of gravity can be calculated as follows: from the angle θ between two direction vectors, we can calculate rotation R W I : with v = g I × g W g I × g W , θ = a tan 2( g I × g W , g I · g W ), thus the gravity vector can be described as below: in which R W I can be parametrized, only two angles around axis x and y are used in frame {I}, and the rotation around axis z has no influence in g W .

Improved Iterative Strategy
As Equation (7) is a classical problem of nonlinear least square, the generally used solution approach is the Gauss-Newton (G-N) algorithm, which is adopted in [19]. However, this method has several drawbacks. First, large iteration increment may result in slow convergence. Second, this algorithm requires the H (Hessian matrix) be positive definite, and invertible while the actual calculated data may not meet this requirement.
In this paper, an improved iterative method is proposed for improving the stability of convergence. In particular, an appropriate trust region μ is added to the increment x Δ . In the process of each iteration, it is assumed to be effective when the increment x Δ is located in the trust region. Otherwise, it is considered to be invalid, and the iteration

Improved Iterative Strategy
As Equation (7) is a classical problem of nonlinear least square, the generally used solution approach is the Gauss-Newton (G-N) algorithm, which is adopted in [19]. However, this method has several drawbacks. First, large iteration increment may result in slow convergence. Second, this algorithm requires the H (Hessian matrix) be positive definite, and invertible while the actual calculated data may not meet this requirement.
In this paper, an improved iterative method is proposed for improving the stability of convergence. In particular, an appropriate trust region µ is added to the increment ∆x. In the process of each iteration, it is assumed to be effective when the increment ∆x is located in the trust region. Otherwise, it is considered to be invalid, and the iteration may not be converged. The improved iteration method is displayed in Algorithm 1.  According to the formula in step 2: We add the constraint: D∆x 2 ≤ µ, where µ and D respectively is the radius of the trust-region and scaling matrix. When D is unit matrix I or not (for example, D is a diagonal matrix), the trust region is a sphere with radius µ or ellipsoid). To facilitate calculation, Lagrange multiplier is utilized to convert Formula (11) into the unconstrained optimization problem: here λ denotes the Lagrange multiplier, through the expansion of formula, a linear equation can be acquired to count the increment: with H = J T J, g = J T · f , and λ ≥ 0. Where J = J(x) and f = f (x). Formula (13) can be considered as the steepest descent algorithm when λ is small. To effectively adjust the range of trust region, the ratio between the approximate model and actual function after each iteration was calculated in step 3, as below: (14) in which the { f (x + ∆x) − f (x)} and {J(x).∆x} respectively is the actual function together with the approximate model. When ρ is close to 1, it indicates that the approximation performance is good. If ρ < the threshold set to be ρ < 0.25, it represents that in contrast to approximate reduction, the actual reduction is much smaller, so it is necessary to reduce the trust-region radius and set it to µ = 0.5µ. If ρ is greater than the threshold set to ρ > 0.75, it is necessary to expand the trust-region radius set to µ = 2µ.
In step 5, there exist two-stop criteria. At first, the stopping criteria of the algorithm should meet the following criteria: g ∞ ≤ η 1 (15) here η 1 is the small value, set to η 1 = 10 −6 , . ∞ represents an infinite norm. Secondly, when the increment ∆x is too small, we should consider stopping the iteration: in which η 2 represents the relative step size, set to η 2 = 10 −6 . Ultimately, we also set up a protection measure to prevent infinite loops that limit the maximum number of the iterations k MAX = 2000, when k ≥ k MAX , the iteration will be forced to stop.

Accelerometer Bias and Scale Estimation
Following the former sections (Sections 3.1-3.3). Once the accurate gravity vector and gyro bias are acquired, Equation (5) is applied for the pre-integration of positions and velocities, rotate the measurement of acceleration correctly to compensate for the gyro deviation. Subsequently, in consideration of the effect resulting from the accelerometer deviation, the rotation vector R W I is also adjusted, which can be described via a two degree of freedom disturbance δθ, the Equation (10) can be rewritten as below: with δθ= [δθ T xy , 0] T , δθ xy = [δθ x , δθ y T . Therefore, containing the influence of the accelerometer bias, we can get: In consideration of the constraints among the three consecutive keyframes, the velocities can be eliminated, and the linear relationship gets as follows: Here, we writing N keyframes i, I + 1, I + 2, . . . , I + N − 1 as 1, 2, 3, . . . , N for clarity of notation, thus λ (i) , ϕ(i), ζ(i), and ψ(i) are calculated as below: in which [] (:,1:2) denotes the top two columns of the matrix. By superimposing all the correlations between three consecutive keyframes (19), the linear system can generate the following equations A 3(N−2)×6 X 6×1 = B 3(N−2)×1 , which can be solved through the method of singular value decomposition (SVD). In this condition, it is composed of six unknown variables and 3(N-2) equations, and at least four keyframes are required to solve the system.

Experiments
The initialization method is applied in an unknown indoor environment with a selfbuild mobile robot platform. The platform structure is exhibited in Figure 3, the major components include a low-cost VI-camera (MYNT S1030-IR-120), an NVIDIA Jetson TX2, a Xsens MTI-300, and two 12V DC batteries for power supply. , which can be solved through the method of singular value decomposition (SVD). In this condition, it is composed of six unknown variables and 3(N-2) equations, and at least four keyframes are required to solve the system.

Experiments
The initialization method is applied in an unknown indoor environment with a selfbuild mobile robot platform. The platform structure is exhibited in Figure 3, the major components include a low-cost VI-camera (MYNT S1030-IR-120), an NVIDIA Jetson TX2, a Xsens MTI-300, and two 12V DC batteries for power supply.  The key parameters of MYNT S1030-IR-120 are shown in Table 2, it communicates with NVIDIA Jetson TX2 through USB 3.0 interface. In terms of the Xsens MTI-300, it outputs the high-frequency measurements of accelerometers and gyroscopes. In this work, we treat it as a reference system through the post-processing operation. As the low-cost equipment is used to collect datasets, the frequency of the IMU sensor is set to 150 Hz, while the frequency of the camera is set to 10Hz. All of the experiments are implemented by utilizing the computer with i7-9700 CPU (8 cores @3.00 GHz) and 16 GB RAM in the Ubuntu 18.04 + Melodic operating system. The external parameters of the IMU and camera are calibrated via the Kalibr tool [24] in advance which is shown in Table 3.  Notes: T CB is the Camera-IMU frames transformation matrix, k 1 , k 2 is the radial distortion coefficient and p 1 , p 2 is the tangential distortion coefficient.

Evaluation of the Initial Estimation
Our initialization method is first integrated into the VI-SLAM system. To evaluate the algorithms fairly, the original algorithm with the gauss-newton algorithm [19] and the proposed algorithms are detected on the same data set in which the mobile robot is controlled to perform several close-loop movements in the indoor environment. Besides, we only utilized the left camera image to test the performance of the monocular VI-SLAM system. Figure 4a-c shows the example image frame from the laboratory dataset. The comparison results of the initial parameter estimation are shown in Figure 5a-d, it can be known that all estimated variables, containing gravity, gyro deviation, scale factor and accelerometer deviation are converged to the stable values within 2 s to 11 s by using the proposed algorithm (dotted lines), while the gauss newton algorithm (solid lines) is converged within 6 s to 17 s. In particular, as exhibited in the Figure 5a, within 2 s, the gyro bias in x, y, z directions converges to −0.019, 0.023, and 0.081. It is well demonstrated that the iterative method acquired better performance. In Figure 5b,d, the characteristic curves of accelerometer deviation and gravity oscillate seriously within five seconds. This is owing to the mobile robot platform does not show enough excitation to the sensor kit in the slight disturbance and stationary stages, making it difficult to distinguish between gravity vector and accelerometer bias, but the proposed algorithm still has a good performance in convergence speed. In Figure 5c, the visual scale factor is converged 10 s later, and the gauss newton algorithm is converged after 17 s. In general, in the convergence speed, the algorithm is faster than the Gauss-Newton algorithm.
Electronics 2021, 10, x FOR PEER REVIEW 11 of 16 slight disturbance and stationary stages, making it difficult to distinguish between gravity vector and accelerometer bias, but the proposed algorithm still has a good performance in convergence speed. In Figure 5c, the visual scale factor is converged 10 s later, and the gauss newton algorithm is converged after 17 s. In general, in the convergence speed, the algorithm is faster than the Gauss-Newton algorithm.     (c) (d) Figure 5. Time-varying characteristic curves for initial estimations: (a) the gyroscope bias estimation, "gyro" denotes "gyroscope"; (b) the gravity estimation, "gw" denotes "gravity"; (c) the estimation of the visual scale factor; (d) the estimation of accelerometer bias, "acc" denotes "accelerometer". The dotted line denotes the algorithm utilizing the improved iterative method, and the solid lines are the outcomes of the algorithm based on Guass-Newton, which is employed in the VI-ORBSLAM system. The convergence time is represented via red and green vertical lines, respectively.

Evaluation of the Tracking Accuracy and Computational Complexity
In the present section, the property of this algorithm on the VI-SLAM system accuracy was assessed. Similar to the public dataset experiment, when our algorithm is tested on the self-collected dataset, the visual-inertial odometry is utilized as attitude and position feedback. The trajectories are aligned with the reference trajectory, i.e., the measurements of Xsens MTI-300. As exhibited in Figure 6, the dotted line denotes the ground truth

Evaluation of the Tracking Accuracy and Computational Complexity
In the present section, the property of this algorithm on the VI-SLAM system accuracy was assessed. Similar to the public dataset experiment, when our algorithm is tested on the self-collected dataset, the visual-inertial odometry is utilized as attitude and position feedback. The trajectories are aligned with the reference trajectory, i.e., the measurements of Xsens MTI-300. As exhibited in Figure 6, the dotted line denotes the ground truth trajectories, the yellow line represents the trajectory of OKVIS which is a binocular SLAM method, and the green line and red line represent the trajectories of the gauss-newton based algorithm (i.e., VI-ORBSLAM) and our proposed algorithm, respectively. It can be known that the trajectories can be tracked completely by them, but the three algorithms have different degrees of deviation. Due to the improved initialization process, the trajectory of ours is closer to the ground truth compared with VI-ORBSLAM and OKVIS. The quantitative evaluation results are obtained through the calculation of Equations (21) and (22). Which the RMSE errors are calculated as follows: 1) RMSE error of position:  (21) where, 2) RMSE errors of orientation: The quantitative evaluation results are obtained through the calculation of Equations (21) and (22). Which the RMSE errors are calculated as follows: (1) RMSE error of position: where, (x pos(k) , y pos(k) , z pos(k) ) denote the estimation of position with x, y, z-axis, (x pos(k) , y pos(k) , z pos(k) ) denote the true position with x, y, and z-axis, respectively.
As shown in Table 4  Which the orientation accuracy of ours is increased by (23.9%, 2.7%, 6.5%) and (32.9%, 17.47%, 39.31%) along x-axis, y-axis, and z-axis, respectively. Obviously, the improvement of position and orientation accuracy in the three-axis is evident. It also well confirms that the proposed initialization method processes a positive role in the positioning accuracy of the monocular VI-SLAM system. In addition, the CPU/memory utilization statistics and pre-frame process time of the three algorithms are also tested, it can be known from Table 5 and Figure 7 that the proposed algorithm has the lowest CPU and Memory usage with the smallest process times of pre-frame.  Table 5. Average values of CPU/memory usage and process times.   Figure 7. CPU/memory utilization statistics and pre-frame process time summarizing performance. The red color box represents VI-ORBSLAM, the blue color box represents OKVIS, Green color box represents OURS.

Conclusions and Future Work
In the present work, we put forward a new initialization algorithm for the monocular VI-SLAM system from the perspective of non-linear optimization. Firstly, in the initial stage, the pure vision measurement model of ORB-SLAM is employed to make all the variables visible. Secondly, the frequency of the IMU camera was aligned by IMU preintegration technology. Thirdly, an improved iterative method is put forward for estimating the initial parameters of IMU faster. Thanks to an improved iterative strategy, our initialization procedure provides high-quality initial seeds which contain gravity vector, gyroscope bias, visual scale as well as accelerometer biases. Besides, a real-world dataset was collected by self-built mobile robots to validate the proposal. The results demonstrate that this algorithm has excellent properties in system positioning accuracy and initial parameter convergence speed than the gauss-newton based algorithm (VI-ORBSLAM) in an indoor environment. A limitation of this strategy is the camera-IMU external parameters are assumed as constant values. The external parameters have an uncertain influence on

Conclusions and Future Work
In the present work, we put forward a new initialization algorithm for the monocular VI-SLAM system from the perspective of non-linear optimization. Firstly, in the initial stage, the pure vision measurement model of ORB-SLAM is employed to make all the variables visible. Secondly, the frequency of the IMU camera was aligned by IMU pre-integration technology. Thirdly, an improved iterative method is put forward for estimating the initial parameters of IMU faster. Thanks to an improved iterative strategy, our initialization procedure provides high-quality initial seeds which contain gravity vector, gyroscope bias, visual scale as well as accelerometer biases. Besides, a real-world dataset was collected by self-built mobile robots to validate the proposal. The results demonstrate that this algorithm has excellent properties in system positioning accuracy and initial parameter convergence speed than the gauss-newton based algorithm (VI-ORBSLAM) in an indoor environment. A limitation of this strategy is the camera-IMU external parameters are assumed as constant values. The external parameters have an uncertain influence on the initialization results. In future works, we will make additional online estimations of external parameters in the initialization stage to improve the property of the system.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare that there exists no conflict of interest in publishing this paper.