1. Introduction
The field of geomatics has been constantly changing and expanding due to numerous technological advances. The traditional and most consolidated surveying techniques relied primarily on the point-wise recording of discrete, precise measurements, requiring skilled operators and precise instruments and methods such as levels, theodolites, tacheometers, classical aerial photogrammetry, and GNSS (Global Navigation Satellite System). In recent years, the field has flourished with newer instruments and methods aimed at quickly recording the complete 3D geometry of a scene, producing a dense point-wise geometric description of the object surfaces, i.e., the point cloud, as is the case with terrestrial laser scanners, portable laser scanners, airborne LiDAR (Light Detection And Ranging), and Structure from Motion (SfM) with dense image matching. These dense 3D geometric recordings of reality have enabled many new applications and are now widely adopted in fields such as land mapping, construction, cultural heritage, archaeology, and infrastructure. Moreover, software development and algorithmic advances have made it possible to use non-specialized instruments for geomatics applications with great success, further expanding the field. A notable example is the democratization of photogrammetry thanks to modern image-based modelling software and its support for low-cost consumer-grade hardware such as DSLR (Digital Single Lens Reflex) cameras, smartphone cameras, and UAVs (Unmanned Aerial Vehicles). However, despite the advances achieved so far, such as the efficiency of laser scanning and SfM photogrammetry, there are applications where these techniques cannot be used effectively because of limitations in maneuverability, acquisition range, execution time, and error propagation. For example, narrow spaces, tunnels, and caves remain a challenge when accurate dense mapping is required [1,2].
Hand-carriable and backpack-mounted Mobile Mapping Systems (MMSs), such as the many commercial solutions now available on the market (Geoslam Zeb Horizon [3], Leica Geosystems BLK2GO [4], Gexcel s.r.l. Heron [5], NavVis VLX [6], etc.), are ideal instruments for indoor 3D mapping thanks to their maneuverability and the speed of the survey operations. However, when employed in extensive or meandering narrow spaces and tunnel-like environments, the global accuracy attainable from these devices is hampered by drift error propagation [7,8], thus leaving the problem of efficiently digitizing narrow spaces unsolved. As an example, the geometric 3D survey of narrow tunnels or spiral staircases [9] is a challenging task. With a terrestrial laser scanner (TLS), it is a burdensome and impractical process, even when employing the newest and most productive TLS solutions that allow data pre-registration in the field, such as the Leica RTC360 [4]. With a portable MMS, the field acquisition is optimized; nonetheless, the unpredictable drift of the sensor’s estimated trajectory forces practitioners to supplement the efficient MMS survey with traditional, burdensome ground control measurements.
Among portable range-based MMSs, those that are practically employable in narrow spaces, such as the Geoslam Zeb Horizon [3] and other commercial instruments [10,11,12] or similar devices from the research community [13,14], cannot rely on GNSS modules and can only house a compact, low-grade IMU (Inertial Measurement Unit). Thus, a refined estimate of the device’s position, movement, and trajectory is computed algorithmically, i.e., by SLAM (Simultaneous Localization and Mapping) algorithms [15]. SLAM methods compute the device movements in unknown environments by exploiting the 3D geometry acquired by the LiDAR mapping sensors, and they are prone to failure when ambiguous or featureless geometry is supplied. Even when suitable 3D geometry is available, SLAM is prone to drift error in long acquisitions; this error can be contained if loop closures are provided during the data acquisition. However, loop closures are usually inherently denied in tunnel-like environments. Indeed, the very scenarios in which hand-carriable MMSs would be most useful (tunnels, corridors) are the same scenarios that tend to prevent loops from being performed. The same is true for visual SLAM methods, which use image data instead of LiDAR acquisitions to compute movements [16]. The visual SLAM approach works with ambiguous and featureless geometries but fails when the image radiometric texture is poor. The visual approach is more promising for the survey of narrow spaces and tunnel-like environments, since these tend to be geometrically monotonous but rich in radiometric texture. The study in [9] hinted that the image-based approach might be the most promising solution for the effective survey of narrow spaces, providing good robustness to drift error and good global accuracy. Indeed, the redundancy and robustness that SfM gains from bundle adjustment can constitute a solution to the problem at hand. The literature offers many accounts of off-line SfM and on-line visual SLAM approaches applied to the survey of narrow spaces, from the use of low-cost action camera rigs [17,18] to custom stereo cameras and multi-camera prototypes [8,19,20,21,22,23,24,25,26].
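To make the drift argument above more concrete, the following is a toy sketch and not the behaviour of any specific SLAM or SfM implementation. It assumes a hypothetical constant heading bias added at every pose update along a nominally straight tunnel, with no loop closures and no control points; the function name, the step length, and the bias value are illustrative assumptions. Under that assumption, the end-point error grows faster than linearly with the length of the traverse.

```python
import math


def open_traverse_drift(n_steps, step_m, heading_bias_deg):
    """End-point error (m) of a dead-reckoned straight traverse.

    Toy model (assumption): every pose update adds the same small
    heading bias, and nothing (loop closure, GCP) ever corrects it.
    """
    x = y = 0.0
    heading = 0.0
    bias = math.radians(heading_bias_deg)
    for _ in range(n_steps):
        heading += bias                  # orientation error accumulates
        x += step_m * math.cos(heading)  # estimated position update
        y += step_m * math.sin(heading)
    true_x = n_steps * step_m            # true end point lies on the x axis
    return math.hypot(x - true_x, y)


# Hypothetical values: 1 m steps, 0.01 degrees of bias per step.
for length_m in (20, 40, 80):
    drift = open_traverse_drift(length_m, 1.0, 0.01)
    print(f"{length_m:3d} m traverse: end-point drift {drift:.3f} m")
```

In this model, doubling the tunnel length roughly quadruples the end-point error, which is one way to see why long unconstrained acquisitions are the critical case.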
The flourishing market of portable MMSs and the active research on fisheye photogrammetry and multi-camera rigs testify to the demand for a 3D digitization methodology that is practical, agile, and fast in the field as well as accurate in its results.
1.2. Research and Paper Objectives
As mentioned, the most consolidated geomatics techniques are not effective for the survey of narrow spaces. Both laser scanning and DSLR close-range photogrammetry are regarded as reliable and accurate techniques; yet, in elongated tunnel-like environments, they both require acquiring a large amount of data (scans or images), which usually makes the job impractical. The portable MMSs widely available on the market suit the task but are not regarded as reliable because of the drift error that accumulates in long unconstrained acquisitions: they are accurate locally but fail in global accuracy if they are not supported by control measurements.
Complex confined areas are not uncommon, and surveying them can be necessary in many fields that would benefit from a complete 3D digitization process and from extensive photographic documentation useful for restoration, inspection, and monitoring. In cultural heritage, there are narrow passages, stairways, and utility rooms; in archaeology, catacombs and underground burial chambers; in land surveying, natural formations such as caves; in infrastructure, tunnels, aqueducts, and sewers; mining is a further example. In all these types of spaces, there is a growing need to record 3D geometry, often quickly and repeatedly, safely, and cost-effectively.
The study described in this paper aims to provide a trustworthy and effective survey methodology for small, tunnel-like areas. Building on a prior study [9], the research primarily aims to leverage the robustness of SfM and understand drift behavior while streamlining the process of capturing large numbers of images in a repetitive tunnel-like environment. The objective was to develop a multi-camera system equipped with fisheye lenses that can collect data quickly, intuitively, and even in the most complex and challenging spaces, producing results that are accurate and reliable enough to meet the requirements of the architectural representation scale (2–3 cm error).
The key goals to achieve were:
(1) Cost-effectiveness: The system must be competitive in low-budget applications, such as geology and archaeology, and for the survey of secondary spaces for which laser scanning cannot be justified.
(2) Speed-effectiveness: Like the other MMSs, it must speed up the acquisition process regardless of the complexity of the space to be surveyed (narrow and meandering spaces).
(3) Reliability: The time saved on site must not be lost during data processing because of unreliable processes. This is probably the most important flaw of today’s MMSs, and it is also a problem encountered in the early tests with fisheye photogrammetry.
Therefore, the objective was to develop a multi-camera device that is compact, lightweight, and transportable by hand, and that houses multiple cameras to cover the entire environment surrounding the device, except for the operator. The cameras should be equipped with fisheye lenses to maximize the field of view and minimize the number of images needed to complete the survey. The compact structure should hold the cameras so as to ensure a robust, fixed baseline between all cameras in the system. This fixed, constrained design then allows for automatic scaling of the resulting three-dimensional reconstructions, introducing relative orientation constraints between the cameras and reducing the degrees of freedom of the photogrammetric network.
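The link between a fixed baseline and metric scale can be illustrated with a minimal sketch. It is a simplified post-hoc scaling, which is a different and weaker technique than imposing the constraint inside the bundle adjustment as described in this paper. The function name, the input layout (reconstructed camera centres in arbitrary model units), and the baseline value are assumptions made for illustration only.

```python
import math
from statistics import median


def scale_from_baseline(centers_a, centers_b, baseline_m):
    """Estimate the model-to-metre scale factor from a rigid camera pair.

    centers_a, centers_b: reconstructed centres (x, y, z), in arbitrary
    model units, of two rigidly mounted cameras at the same rig stations.
    baseline_m: known physical distance between the two cameras (m).
    """
    ratios = []
    for a, b in zip(centers_a, centers_b):
        d_model = math.dist(a, b)
        if d_model > 0.0:
            ratios.append(baseline_m / d_model)
    if not ratios:
        raise ValueError("no usable camera pairs")
    # The median limits the influence of a few badly oriented stations.
    return median(ratios)


# Hypothetical rig: two cameras 0.25 m apart, reconstructed 2 units apart.
cam_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
cam_b = [(0.0, 2.0, 0.0), (10.0, 2.0, 0.0)]
scale = scale_from_baseline(cam_a, cam_b, 0.25)
print(f"scale: {scale} m per model unit")
```

Every rig station contributes one observation of the same known distance, so the scale is observed redundantly along the whole trajectory rather than only where control points exist.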
The paper describes the research that led to the design of a working prototype of a novel instrument, a fisheye multi-camera called Ant3D, which resulted in a patent in 2020 and has already been tested and compared multiple times in the field against other approaches [8,24,27].
1.3. The Beginning of the Research: The FINE Benchmark Experience
At the beginning of the research on fisheye photogrammetry and fisheye multi-camera applications, in 2019, an open-access benchmark dataset, the FINE Benchmark (Fisheye/Indoor/Narrow spaces/Evaluation), was designed to provide data for evaluating the performance of different image-based processing methods when surveying complex spaces, specifically the performance of low-cost multi-camera rigs. Participants from academia and research institutes were invited to use the benchmark data and demonstrate their tools, codes, and processing methods by processing two image datasets for the 3D reconstruction of narrow spaces (Figure 1). The benchmark dataset was first presented during the 3D-ARCH 2019 conference held in Bergamo, where a special session was dedicated to presentations dealing with the benchmark.
The benchmark data were acquired in the internal spaces of the Castagneta Tower of San Vigilio Castle, located at the very top of Città Alta (Bergamo, Italy). The case study was chosen because of the co-existence of challenging conditions that can be exploited to stress the techniques and processing strategies. All the indoor spaces of the castle are poorly illuminated, and the two main environments of the tower include some narrow passages 70–80 cm wide. They differ in their surface features: artificial, refined flat surfaces in one area and rough natural rock surfaces in the other.
The benchmark was composed of two datasets referring to the two connected environments:
(1) Tunnel: a dark underground tunnel (around 80 m long) excavated in the rock, with a muddy floor and humid walls. In some areas, the ceiling is lower than 1.5 m.
(2) Tower: an artificial passage composed of two rooms with a circular/semi-spherical shape, connected by an interior path that starts from the tower’s ground floor and leads to the castle’s upper part. It features staircases, planar surfaces, sharp edges, walls of squared rock blocks, and a relatively uniform texture.
The FINE Benchmark provided several kinds of data, including the image datasets and a laser scanner ground truth point cloud. For the acquisition of the low-cost multi-camera datasets, an array of action cameras was used to perform a rapid video acquisition of both the tunnel and tower areas. The rig consists of six GoPro cameras mounted rigidly on a rectangular aluminium structure (Figure 2). Continuous light is provided by two LED illuminators mounted on the back.
The rig was designed to have a sufficient base distance between the six cameras in relation to the width of the narrow passages. The design was intended to allow the object geometry to be reconstructed at every single position of the rig. Two cameras are mounted on the top (G6) and the bottom (G5) of the structure, tilted roughly 45° downwards and upwards. Four cameras are mounted on the rig's sides: two of them (G1, G2) in a convergent manner, oriented horizontally, and two (G3, G4) in a divergent manner, oriented vertically.
The FINE Benchmark provided the basis for an in-depth test of the low-cost multi-camera approach. The processing carried out by the author comprised the synchronization of the individual video sequences of the six GoPro cameras using the available audio tracks and the subsequent extraction of timestamped keyframes to form the image datasets to be used for SfM. The extracted images were then processed using a pipeline implemented in the commercial software Agisoft Metashape that accounts for the rigid constraints of the known baselines between the cameras. Three keyframe extraction rates were tested: 1 fps, 2 fps, and 4 fps. The resulting 3D reconstructions were evaluated in two ways: (i) by checking the error on checkpoints (CPs) available along the narrow environments and extracted from the ground truth laser scanner point cloud, and (ii) by checking the cloud-to-cloud deviation of the obtained sparse point cloud from the reference ground truth. For both evaluations, the multi-camera reconstructions were registered to the reference point cloud using a few ground control points (GCPs) at the tunnel start in order to check the maximum drift at the opposite end.
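The keyframe extraction step can be sketched as follows. This is a minimal illustration and not the pipeline actually used: it assumes that the audio-based synchronization has already produced, for each camera, the time on a common clock at which its recording started, and it only computes which frame index to extract from each video. The function name, the camera start times, and the 30 fps video frame rate are hypothetical.

```python
def keyframe_indices(start_times_s, video_fps, keyframe_fps, duration_s):
    """Frame indices to extract from each video for synchronized keyframes.

    start_times_s: dict camera name -> recording start on a common clock (s),
                   e.g. as estimated from the audio tracks (assumption).
    video_fps:     native frame rate of the videos.
    keyframe_fps:  desired keyframe rate (e.g. 1, 2, or 4).
    duration_s:    length of the common time window to sample (s).
    """
    # First instant at which every camera is already recording.
    t0 = max(start_times_s.values())
    n_keyframes = int(duration_s * keyframe_fps)
    indices = {}
    for cam, start in start_times_s.items():
        indices[cam] = [
            round((t0 + k / keyframe_fps - start) * video_fps)
            for k in range(n_keyframes)
        ]
    return indices


# Hypothetical example: G2 started recording 0.5 s after G1.
frames = keyframe_indices({"G1": 0.0, "G2": 0.5}, 30, 2, 2.0)
print(frames)  # {'G1': [15, 30, 45, 60], 'G2': [0, 15, 30, 45]}
```

Rounding to the nearest frame leaves a residual synchronization error of up to half a frame period per camera, which matters when the rig is moving during acquisition.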
Table 1 shows the error on the checkpoints resulting from the 3D reconstruction of the tunnel environment for the 1, 2, and 4 fps datasets. Figure 3 shows the corresponding cloud-to-cloud deviation from the laser scanner ground truth point cloud. For all results reported, the baselines between the cameras were rigidly constrained in the bundle adjustment using the scale bar function available in Metashape. Overall, the errors obtained exceeded the target accuracy, and the processing proved highly unreliable.
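The two evaluation measures described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the software used for Table 1 and Figure 3: the function names and the coordinates are hypothetical, and the cloud-to-cloud distance is a brute-force nearest-neighbour search that is only practical for small point sets (real point clouds would need a spatial index).

```python
import math


def checkpoint_errors(estimated, reference):
    """RMSE and maximum 3D error (m) over paired checkpoints."""
    errors = [math.dist(e, r) for e, r in zip(estimated, reference)]
    rmse = math.sqrt(sum(d * d for d in errors) / len(errors))
    return rmse, max(errors)


def cloud_to_cloud(points, reference_cloud):
    """Distance (m) from each point to its nearest reference point."""
    return [min(math.dist(p, q) for q in reference_cloud) for p in points]


# Hypothetical checkpoints: reconstructed vs. ground truth coordinates (m).
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
rmse, worst = checkpoint_errors(estimated, reference)
print(f"RMSE {rmse:.4f} m, max {worst:.4f} m")

# Hypothetical sparse point against a two-point reference cloud.
print(cloud_to_cloud([(0.0, 0.0, 1.0)], [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0)]))
```

Because the reconstructions are registered only at the tunnel start, the checkpoint errors near the far end are the ones that expose the accumulated drift.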
The FINE Benchmark experience revealed several problems with the multi-camera implementation based on commercial action cameras. Nevertheless, the results confirmed the potential of the image-based multi-camera approach, which allowed the complete acquisition in a short time despite the complexity of the environment. However, reaching architectural accuracy (2–3 cm) was impossible without using the coordinates of known points measured with a total station to optimize the three-dimensional reconstruction.
The main limitations were:
(1) The geometry of the multi-camera. The configuration used, consisting of six GoPro cameras oriented mainly in the frontal direction, combined with the surface roughness of the rock walls, resulted in an insufficient number of tie points to connect the images acquired in the forward direction with those acquired in the backward direction.
(2) The rolling shutter of the sensors used. The distortions introduced by acquiring in motion with rolling shutter sensors made it impossible to accurately calculate the cameras' interior orientation parameters or to impose constraints on the relative orientation of the cameras without high uncertainty. Nevertheless, the constraints on the distances between the cameras were effective in reducing the drift error compared to non-constrained processing.
The FINE Benchmark experience highlighted that, in order to achieve the aforementioned goals, a custom system was necessary to overcome the limitations of low-cost hardware. The following chapters describe the hardware and design choices that led to the definition of the current system.