Serverless GEO labels for the Semantic Sensor Web

With the increasing amount of sensor data available online, it is becoming more difficult for users to identify useful datasets. Semantic Web technologies can improve such discovery via meaningful ontologies, but the decision of whether a dataset is suitable remains with the users. Users can be aided in this process through the GEO label, which provides a visual summary of the standardised metadata. However, the GEO label is not yet available for the Semantic Sensor Web. This work presents novel rules for deriving the information for the GEO label’s multiple facets, such as user feedback or quality information, based on the Semantic Sensor Network Ontology and related ontologies. Thereby, this work enhances an existing implementation of the GEO label API to generate labels for resources of the Semantic Sensor Web. Further, the prototype is deployed to serverless cloud infrastructures. We find that serverless GEO label generation is capable of handling two evaluation scenarios for concurrent users and burst generation. Nonetheless, more real-world semantic sensor descriptions, an analysis of requirements for GEO label facets specific to the Semantic Sensor Web, and an integration into large-scale discovery platforms are needed.

2012 ACM Subject Classification: Information systems → Question answering


Introduction
different approaches depending on the platform. Demand is either small, intermittent, and unpredictable, if labels are generated on demand with a discontinuous workload depending on users in interactive sessions, or demand is schedulable in isolated but large bulk events, if labels are generated and stored regularly for all available metadata. To serve both scenarios, we propose to deploy the GEO label API to cloud computing infrastructures. Finally, regarding the third gap, tackling such organisational or strategic issues is out of scope for this work. Nevertheless, closing the former two gaps indirectly helps stakeholders and operators of public or open infrastructures to adopt the label in practice.
The main contributions of this work are (1) creating a mapping between metadata fields of the Semantic Sensor Web and the GEO label facets, (2) implementing a prototype of this mapping which conforms to the GEO label API, and (3) evaluating the prototype in serverless computing infrastructures with respect to intermittent and bulk generation of labels. In the remainder of this work, we first identify suitable sources of information in ontologies of the SSW and related ontologies. Then we evaluate existing GEO label API implementations and different cloud computing providers to identify suitable base software and cloud platforms for a prototypical implementation. Finally, we evaluate the prototype's performance. See the Supplement section for information about the software prototype, the test data used, and online deployments of the prototype.

GEO label for the Semantic Sensor Web
The SSW's main ontology is the Semantic Sensor Network Ontology (SSN, [12]). To create a mapping between the SSN and the GEO label, we first evaluated the modular SSN for suitable fields which can provide meaningful information for the different GEO label facets. This evaluation included SSN's core ontology SOSA (Sensor, Observation, Sample and Actuator) and the SSN's aligned modules, such as the Provenance Interchange Ontology (PROV-O) [17]. Each of the over 50 classes and properties of SSN and SOSA and their aligned modules (100+ classes and properties) was checked one by one against the eight facets of the GEO label. Next, we extended the search to include ontologies often used in conjunction with SSN, starting from the examples in the SSN specification. From those, we adopted generic properties for names and descriptions, e.g., using the Friend of a Friend ontology (FOAF) [4]. Finally, we looked more broadly for ontologies on topics related to the facets not yet covered, using the Linked Open Vocabularies catalogue. This search eventually led to the usage of the Dataset Usage Vocabulary (DUV) [10] and the Bibliographic Reference Ontology (BiRO) [9]. For example, the facet Producer Comments is set to available if a document contains an rdfs:comment, because we can assume that such a comment stems from an entity involved in the creation of the metadata record. For the facet Compliance with Standards, the mapping checks if one of the used URIs contains w3.org and thereby denotes usage of a vocabulary that was developed under the auspices of the World Wide Web Consortium (W3C). For the facet User Feedback, an observation, sosa:Observation, must be connected to a duv:UserFeedback via the duv:hasUserFeedback property. However, not all mappings are as direct or simple; some allow different options.
For example, for the facet Producer Profile an SSNO class such as sosa:Sensor can be connected to a prov:Agent using either prov:wasAttributedTo or prov:wasAssociatedWith and the respective PROV subclasses, and for the facet Lineage Information any of the relations ssn:implements, ssn:implementedBy, or sosa:usedProcedure can connect a sensing system with its procedure documentation. Table 1 summarises the result of the manual process and briefly explains the reasoning behind the chosen mapping. See Section 3.2 for the full details on the mapping and the technical realisation. The table shows the ontologies, classes, and properties we identified as suitable sources for the GEO label's facets. The used ontologies and prefixes are listed in Table 2. Note that we did not add new alignments between the SSN and other ontologies, as that is beyond the scope of this work.
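The facet rules described above can be illustrated with a small sketch. It is not the prototype's implementation, which uses XPath expressions in a JSON transformation file; it only mirrors three of the rules with Python's standard library. The function name `facet_availability`, the dictionary keys, and the sample document are assumptions made for illustration, and the sketch only covers the nested RDF/XML serialisation.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as used in the text.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "sosa": "http://www.w3.org/ns/sosa/",
    "ssn": "http://www.w3.org/ns/ssn/",
    "duv": "http://www.w3.org/ns/duv#",
}

# Hypothetical sample document, nested serialisation only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:sosa="http://www.w3.org/ns/sosa/"
    xmlns:duv="http://www.w3.org/ns/duv#">
  <sosa:Observation rdf:about="http://example.org/obs/1">
    <rdfs:comment>Example comment</rdfs:comment>
    <duv:hasUserFeedback>
      <duv:UserFeedback rdf:about="http://example.org/feedback/1"/>
    </duv:hasUserFeedback>
  </sosa:Observation>
</rdf:RDF>"""


def facet_availability(rdf_xml: str) -> dict:
    """Return availability flags for three GEO label facets (illustrative)."""
    root = ET.fromstring(rdf_xml)
    # Lineage Information: any of these relations counts, as stated in the text.
    lineage_props = ("ssn:implements", "ssn:implementedBy", "sosa:usedProcedure")
    return {
        # Producer Comments: an rdfs:comment anywhere in the document.
        "producer_comments": root.find(".//rdfs:comment", NS) is not None,
        # User Feedback: an observation linked to a duv:UserFeedback.
        "user_feedback": root.find(
            ".//sosa:Observation/duv:hasUserFeedback/duv:UserFeedback", NS
        ) is not None,
        "lineage_information": any(
            root.find(f".//{prop}", NS) is not None for prop in lineage_props
        ),
    }


print(facet_availability(SAMPLE))
```

For the sample document, the comment and the user feedback are found, while no lineage relation is present.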

Serverless GEO label Generation

Serverless Computing
Serverless computing allows developers to deploy custom code in a shared infrastructure [2], whereby the application is maintained in a scalable way by a platform provider. The automated scaling enables both handling of large spikes of high demand and reducing costs when there is little or no demand. These properties make serverless computing a good fit for the GEO label generation usage scenarios. The GEO label generation can be deployed to a serverless infrastructure quite easily, i.e., without a complex setup including multiple services or a database, because each generation of a label is a relatively small, stateless, atomic operation. The only external dependency of label creation is the metadata source for which a label is requested. However, depending on the usage scenario, requests for labels can be erratic and unpredictable. To demonstrate the applicability of the prototype, the evaluations were conducted within the free tier of the following service providers: Google Cloud Run (GCR) and Amazon Web Services (AWS) Lambda. A comparison of the costs, while relevant for potential operators, is out of scope for this work.

GEO label API Implementation
The GEO label API is implemented in two software projects, one in Java and one in PHP. In this work, the Java-based implementation is used because PHP is not supported by the serverless computing providers and the PHP project is no longer maintained. The rendering of the GEO label is based on an SVG template file. Labels are generated from the template file according to XPath expressions [7], which detect the presence of certain elements in a provided XML document. To use XPath, the RDF graph must be serialised as RDF/XML. Both implementations support a bespoke JSON-based configuration file format, which allows one to update the rules for transforming metadata documents into labels without changes to the source code and to deploy these updates to GEO label API instances without updating the installation. To realise the conceptual mapping described above, we created a new transformation file. The file is activated when the implementation is provided with an RDF/XML document, i.e., if the XPath boolean(/*[local-name()='RDF']) testing the document's root element evaluates to true. Of note, the implementation of hoverover and drilldown features lies beyond the scope of the proof-of-concept implementation. Table 1 shows excerpts of the XPaths realising the conceptual mapping. The test data was created based on the example data for the SSN vocabulary, which was converted with two online converters that produce different RDF/XML serialisations. MyBluemix RDF Validator and Converter uses rdf:resource attributes to define elements at one level (Listing 1), whereas Easy RDF Converter uses the class names as XML elements and nests the objects (Listing 2). These examples illustrate the reason for the complexity of the XPaths, which must accommodate both ways of serialising triples from an RDF graph in RDF/XML.
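To illustrate why both serialisations have to be handled, the following sketch re-implements the root element test and a Producer Profile check in Python instead of XPath. It is an assumption-laden illustration, not the transformation file itself: the helper names (`uri_of`, `is_agent`, `has_producer_profile`), the two sample documents, and the chosen set of prov:Agent subclasses are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PROV = "http://www.w3.org/ns/prov#"
# Assumption: prov:Agent and its three PROV-O subclasses count as producers.
AGENT_CLASSES = {PROV + n for n in ("Agent", "Organization", "Person", "SoftwareAgent")}
LINK_PROPERTIES = {PROV + n for n in ("wasAttributedTo", "wasAssociatedWith")}

HEAD = f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:prov="{PROV}" xmlns:sosa="http://www.w3.org/ns/sosa/">'
# Flat style: resources at one level, linked via rdf:resource attributes.
FLAT = HEAD + """
  <rdf:Description rdf:about="http://example.org/obs">
    <prov:wasAssociatedWith rdf:resource="http://example.org/org"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/org">
    <rdf:type rdf:resource="http://www.w3.org/ns/prov#Agent"/>
  </rdf:Description>
</rdf:RDF>"""
# Nested style: class names as elements, objects nested in properties.
NESTED = HEAD + """
  <sosa:Observation rdf:about="http://example.org/obs">
    <prov:wasAssociatedWith>
      <prov:Agent rdf:about="http://example.org/org"/>
    </prov:wasAssociatedWith>
  </sosa:Observation>
</rdf:RDF>"""


def uri_of(tag: str) -> str:
    """Turn ElementTree's '{namespace}local' notation into a plain URI."""
    return tag[1:].replace("}", "", 1) if tag.startswith("{") else tag


def is_rdf_document(root: ET.Element) -> bool:
    """Same test as boolean(/*[local-name()='RDF']) on the root element."""
    return root.tag.rsplit("}", 1)[-1] == "RDF"


def is_agent(el: ET.Element) -> bool:
    """Agent by element name (nested style) or by rdf:type child (flat style)."""
    if uri_of(el.tag) in AGENT_CLASSES:
        return True
    types = el.findall(f"{{{RDF}}}type")
    return any(t.get(f"{{{RDF}}}resource") in AGENT_CLASSES for t in types)


def has_producer_profile(root: ET.Element) -> bool:
    """True if a PROV link property points to an agent in either style."""
    agents = {el.get(f"{{{RDF}}}about") for el in root.iter() if is_agent(el)}
    agents.discard(None)
    for el in root.iter():
        if uri_of(el.tag) not in LINK_PROPERTIES:
            continue
        if el.get(f"{{{RDF}}}resource") in agents:  # flat: reference by URI
            return True
        if any(is_agent(child) for child in el):    # nested: agent as child
            return True
    return False


for doc in (FLAT, NESTED):
    root = ET.fromstring(doc)
    print(is_rdf_document(root), has_producer_profile(root))
```

Both sample documents describe the same triples, yet the check needs two separate branches, which mirrors the doubled paths in the XPath expressions.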
In GCR, the API can be deployed in a container, which allows one to run the whole GEO label API with the existing Java Servlet. In AWS Lambda, however, the Java Servlet application cannot be run, so a subset of the GEO label API was implemented with a bespoke request handling class. This handler exposes the existing internal methods for generating SVGs based on URLs to metadata documents provided by the API caller. Then, the API, i.e., the request parameters and allowed HTTP methods, is configured in the Amazon API Gateway. Figure 1 shows a GEO label rendered by the prototype implementation developed as part of this work.

Performance Evaluation
Two usage scenarios were evaluated with a scripted Apache JMeter test plan. For all API queries, the URL of the example RDF serialisation file MBC_all_factes_available_ip68smartsensor.rdf hosted on GitHub is passed via a GET request query parameter.

Listing 1: Observation, converted with MyBluemix RDF Validator and Converter.
For both serverless computing providers, the default configurations were used for the evaluations. GCR allows users to configure the number of containers, the number of parallel requests handled by one container, and the required minimum response time. The GCR deployment used zone europe-west1 with 256 Mebibyte working memory and 1 CPU, at a concurrency setting of 80. AWS Lambda starts more instances of a Lambda function as needed, limited by a configurable concurrency parameter (default value: 1000) for the number of running functions in the used region eu-central-1. The working memory on AWS is set to 1 Gibibyte with the default values for scaling.
Scenario A simulates a geospatial catalogue service with 1000 users whose browsing of the catalogue user interface results in 1 request per second per user. Figures 2 and 3 show the response times during the test execution for GCR and AWS Lambda, respectively. All sent requests have a non-failure status code (HTTP 200). The two different colours in the plots denote the requests that take less than ("Success") or longer than ("Failure") 1 second. This threshold is used because interactions below one second were found to not interrupt a user's train of thought and are therefore suitable for interactive use [20]. The mean times to complete the request are 414 milliseconds for GCR and 943 milliseconds for AWS Lambda.

Listing 2: Observation, converted with Easy RDF Converter.

    <sosa:Observation rdf:about="http://example.org/data/iceCore/12#observation">
      <rdf:type rdf:resource="http://www.w3.org/ns/prov#Activity"/>
      <prov:wasAssociatedWith>
        <prov:Agent rdf:about="http://example.org/data/Org/exampleOrg">
          <rdf:type rdf:resource="http://www.w3.org/ns/prov#Organization"/>
          <foaf:name>Example Organization</foaf:name>
        </prov:Agent>
      </prov:wasAssociatedWith>
      <sosa:observedProperty>
        <rdf:Description rdf:about="http://example.org/data/iceCore/12#CO2">
          <ssn:isPropertyOf rdf:resource="http://example.org/data/iceCore/12"/>
        </rdf:Description>
      </sosa:observedProperty>
    </sosa:Observation>

Scenario B tests the batch generation of labels where an operator of a sensor catalogue wants to generate 100 labels at once. Here we measure the overall time for processing all requests, and the operations were repeated 5 times. There is no threshold as in Scenario A. For GCR, this initially led to failures due to the memory limit, but the test was completed with a memory of 1 Gibibyte and 2 CPUs per container instance. The resulting data is shown in Figure 4. GCR's need for additional resources can be traced back to the overhead of the full Java Servlet, from which the comparatively minimal Lambda function handler does not suffer. With the increased resources in GCR, the duration was up to 45 seconds for the first run and decreased to about 3 seconds for the fifth repetition. For AWS Lambda, the processing took about 8 seconds on the first run and dropped to around 1 second in the second to fifth repetitions, as shown in Figure 5.
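The measurement in Scenario B, i.e., the overall time for a batch of parallel label requests over several repetitions, can be sketched as follows. The evaluation itself used an Apache JMeter test plan; this Python version is only an illustration. The function `run_batch` and the `generate_label` callable are hypothetical, and the stub below stands in for an HTTP request to a deployed GEO label API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batch(generate_label, metadata_urls, repetitions=5):
    """Time each repetition of a fully parallel batch, in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        # One worker per request, so that all requests are sent at once.
        with ThreadPoolExecutor(max_workers=max(1, len(metadata_urls))) as pool:
            # list() forces completion of all requests before the clock stops.
            list(pool.map(generate_label, metadata_urls))
        durations.append(time.perf_counter() - start)
    return durations


def stub_generate_label(url):
    """Placeholder for a request to the label API (assumption)."""
    return "<svg/>"


# 100 labels at once, repeated 5 times, as in Scenario B.
print(run_batch(stub_generate_label, ["http://example.org/metadata.rdf"] * 100))
```

Comparing the first entry of the returned list with the later ones would expose the drop between the first run and the repetitions reported above.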
A variant of the batch generation is a test scenario with 1000 parallel requests. This scenario could not be completed by either platform with the maximum available hardware configurations. The error messages (Connection reset and SSL handshake terminated) hint that the services blocked the large number of parallel requests, such that users would need more powerful (and more costly) deployments. Reducing the number of parallel requests eventually led to successful scenario executions at 600 requests in 51 seconds for GCR and 300 requests in 9 seconds for AWS Lambda (see data files GCR_Scenario_4_2_V3 and AWS_Scenario_4_2_V4).

Discussion
The mapping of GEO label facets to properties in the Semantic Sensor Web was an iterative process. While we were able to find data sources for all GEO label facets, the mapping is limited by the availability of realistic SSW datasets. First, the variability of real-world data may not be adequately captured. Second, the nature of the mapping does not capture cases where concepts behind the GEO label's facets do not unambiguously match concepts behind SSW elements. Compared to the centrally managed data sources and industry-driven OGC standards of the original GEO label, we find no need to make distinctions between metadata given by providers and by third parties, e.g., commenting servers. However, such multi-stakeholder perspectives could mitigate shortcomings in the creation process of the GEO label mapping for SSW. More real-world metadata could improve the scope of the facet data sources, e.g., by deriving from common practices if a comment is actually about a relevant part of a sensor's properties, and not about some less relevant part of the RDF graph. The iterative, example-based approach taken here could also be contrasted with first creating an ontology for the GEO label facets and then aligning this GEO label ontology with existing (SSW) ontologies. The alignment-based approach could also improve the scalability of the mapping for a larger variety of use cases and SSW datasets.
Concerning the mapping's implementation, we found that a document-based approach using RDF/XML could be built quickly on the existing implementation. However, such an approach does not leverage the power of the Semantic Web, e.g., reasoning on dynamically built graphs and aligned ontologies. Furthermore, for some facets, e.g., producer profile, relying on serialisations into an RDF/XML document caused clear complications. A different approach based on native Semantic Web technologies, such as SPARQL [13], could help address the limited coverage of the presented mapping, and in addition take into account relationships between linked resources by, for example, measuring the distance between connected resources in a graph. The GEO label's option to have "half-filled" facets, which denotes availability of information at a higher level, could expose such more complex scenarios. Most critically, the presented approach is limited by the design process starting only from the current GEO label facets. That is why specific discovery challenges of the SSW may not be adequately addressed. While the label itself may be interactive, the majority of information behind the label is seen as rather static. This may partly be attributed to the GEO label's origin in GEO, with more traditional roles of provider and user. The SSW's potentially very dynamic nature, for example live data streams, and flexible distributed architecture, in which anybody can create and publish new ontologies and datasets, may require additional facets or a more sophisticated presentation of sources and currentness of the data behind a label.
The evaluation results of Scenario A show no discernible cold start effect, which one might have expected when resources need to be activated for the first request or additional resources are added over time. Only a few requests take over 1 second to complete, and only relatively few outliers exist on the same order of magnitude. For AWS Lambda, both mean and median of elapsed time to complete a request are close to 1 second. For GCR, the elapsed time is well below 0.5 seconds. These results imply that the serverless label generation is suitable for interactive use, with slight advantages of GCR, which has overall shorter durations. A limitation for this scenario is that only the generation of the label is tested, whereas for users additional time would be taken up by the client-side rendering of the images. The effects of dropping durations for batch processing in Scenario B were likely achieved by a combination of autoscaling in the underlying platforms and the built-in caching of the GEO label API Java Servlet. Especially on AWS Lambda, the drop after the first iteration is considerable, even though no internal caching mechanism exists.
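The summary statistics discussed for Scenario A (mean, median, and the share of requests above the 1-second threshold) can be derived from a list of elapsed times as in the sketch below. The function name `summarise`, the result keys, and the example values are assumptions for illustration and are not taken from the evaluation data. Requests taking exactly the threshold time are counted as successful here, which is a choice of this sketch.

```python
from statistics import mean, median


def summarise(elapsed_ms, threshold_ms=1000):
    """Summarise elapsed request times given in milliseconds."""
    # "Failure" in the plots means slower than the threshold, not an HTTP error.
    slow = [t for t in elapsed_ms if t > threshold_ms]
    return {
        "mean_ms": mean(elapsed_ms),
        "median_ms": median(elapsed_ms),
        "share_over_threshold": len(slow) / len(elapsed_ms),
    }


# Hypothetical elapsed times for four requests.
print(summarise([400, 500, 1500, 600]))
```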
Regarding the platforms we used, the data might further point to an advantage for the reduced implementation of the AWS Lambda functions compared to the full Java Servlet running in containers on GCR, though both showed scaling mechanisms of serverless computing to be effective. The results make clear that specific evaluations for each use case and platform are warranted. More test scenarios could include varying allocated resources at the cloud providers to optimise performance versus costs, touching both on the respective configuration parameters and the client-side implementation. For example, the steep drop in Scenario B on AWS Lambda could be exploited by first warming up a service instance, which may have relatively small resources, with a portion of the data for batch processing, and then following up with several chunks. In contrast, the shorter average response times of GCR may make it more suitable for a scenario with more constant load even if fewer resources are allocated.
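The warm-up idea mentioned above can be sketched on the client side as a simple split of a batch into a small first portion and several follow-up chunks. This was not evaluated in this work; the function `warm_up_plan` and its parameters are hypothetical, and suitable sizes would have to be determined per platform.

```python
def warm_up_plan(items, warm_up_size, chunk_size):
    """Split a batch into a warm-up portion followed by equally sized chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    warm_up = items[:warm_up_size]
    rest = items[warm_up_size:]
    # The last chunk may be shorter than chunk_size.
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return ([warm_up] if warm_up else []) + chunks


# Ten hypothetical metadata documents: two for warm-up, then chunks of four.
print(warm_up_plan(list(range(10)), warm_up_size=2, chunk_size=4))
```

Each list in the returned plan would be sent as one parallel batch, waiting for the previous one to finish before starting the next.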

Conclusions
In this study, we transferred the goal of the GEO label, which is to improve data discovery by providing a visual overview of available information in machine-readable metadata, to the Semantic Sensor Web. While we were able to find data sources for all GEO label facets using a document-centric approach, the mapping is limited by available datasets and does not leverage the potential of using reasoning in the SSW. Ideally, the creation of a more sustainable mapping and potentially even adaptation of GEO label facets in the future is based on a larger body of public sensor metadata in SSNO format, on a consultation of multiple stakeholders, and on a complementary perspective derived from the SSW's discovery challenges.
We found that the serverless platforms proved suitable for realistic test scenarios, though, naturally, the free tiers used have limits. It also became clear that the different cost models and configurations make serverless solutions difficult to compare. Future evaluations may utilise a strictly cost-based comparison of scenarios with resources tuned to deliver similar performance in the user-facing API.
Finally, the usefulness of the GEO label remains to be demonstrated in broad deployments with many users and extensive user studies. With the practical solutions for label generation introduced in this work, the actual spreading of labels will require leading organisations to add and maintain labels on their widely used geospatial catalogues. In the meantime, a bottom-up approach with client-side label integration [23] could provide the benefits of GEO labels to interested users, and the GEO label can be examined in relation to recent developments on scientific data publication such as the FAIR Guiding Principles [26].