Submitted:
29 May 2026
Posted:
03 June 2026
Abstract
Keywords:
1. Introduction
- Q1. In which ways can LLMs assist or further increase the degree of automation in service publication and discovery?
- Q2. Which of these ways leads to better service discovery performance and accuracy?
- Our methodology leads to enriched OpenAPI specifications with high enrichment accuracy
- This enrichment enables increasing the accuracy of existing service discovery algorithms
- Our new algorithms lead to higher service discovery accuracy as they better exploit the semantics incorporated in the OpenAPI specifications
- Our LLM-based service discovery algorithm reaches high accuracy levels. This suggests that LLMs can not only assist service discovery but also realise it end to end.
2. Related Work
3. LLM-Based Service Discovery Methodology
3.1. Assumptions
- We rely on the OpenAPI standard [20] as it is well adopted by the industry. However, in principle, our methodology is independent of the RESTful service description formalism as it can be adapted to support any existing formalism.
- We consider that existing OpenAPI specifications of RESTful services are inappropriate for service discovery purposes for the following reasons: (a) the specifications can drift with respect to the service implementation; (b) in many cases, the specifications are poorly described, as they are usually generated automatically by tools or libraries that cover only the technical service interface. In fact, it has been observed that textual descriptions of service operations and their I/O are often lacking, while operation and parameter names or identifiers take arbitrary forms and thus convey no semantics. As such, a single source of truth is needed from which rich OpenAPI specifications can be produced. We believe that the service source code can play that role, and that LLMs can inspect it to produce a semantically enriched OpenAPI specification.
- While LLMs can produce semantically enriched OpenAPI specifications, such specifications merely have more textual descriptions and more meaningful operation and I/O parameter names. However, service discovery practice has shown that structural textual description is not enough. Thus, annotations from existing ontologies must be incorporated. LLMs can also support this latter task, and our work investigates this.
- We focus on the case that service requesters seek specific service operations. We believe that this is more realistic in the context of developing applications or BPs by also considering the very nature of RESTful services, which usually realise management operations over business-related entities.
- Based on the previous assumption, we consider that service requesters do not possess special skills, so they supply a short textual description of the desired service operation. While this description can be ambiguous, we consider this scenario highly realistic, and our work attempts to mitigate the incurred ambiguity in various ways.
- Due to our focus on service operation matching, we consider service matching to be a side-effect of the former. In this respect, we match service operations against the service requests, and then we also supply the services that offer these matching operations. This is also related to the very nature of RESTful services, which do not map to specific overall I/O parameters that would satisfy an overall functional capability. Thus, they are seen merely as collections of functional capabilities implemented by their operations. However, this does not mean that services are neglected in service operation matching. On the contrary, they are accounted for in various algorithms that we propose, as we believe that they can enhance discovery accuracy.
- Our focus is on functional service matchmaking; non-functional service matchmaking will be covered in our future work.
3.2. Methodology Core
3.3. Service Representation Models
- COMPLETE: maps to service operations that fully match all aspects of the service request.
- PARTIAL: includes service operations that might require some customisation to fully match the service request. This signifies that there can be some intent or parameter gaps.
- POSSIBLE: maps to service operations that topically match the service request in an incomplete manner. Thus, they are insufficient to completely satisfy the user request.
- PERFECT: the service operation is semantically equivalent to the request.
- PLUGIN: the service operation produces a more specific output (a child concept) than the one demanded by the request.
- SUBSUMES: it is similar to PLUGIN but includes service operations where their output is a sub-concept of the request output, while each request input is the same or a sub-concept of a service operation input.
- PARTIAL: the service operation partially matches the request. Thus, it might need to be combined with other operations to completely fulfil it.
3.4. Automatic LLM-Based OpenAPI Specification Generation
3.4.1. Service Source Code Filtering

- the file name should end with “DTO” (case insensitive)
- the file’s content should include the “data class” declaration – this is Kotlin-specific and used to denote immutable domain/data objects
- the file’s content must include one of the following annotations: “@Entity”, “@Document”, “@ApiModel”, “@Schema”, or “@JsonInclude”. These annotations relate to using specific data-oriented frameworks or technologies like Object-Relational Mapping (ORM) (e.g., JPA). In particular, “@Entity” is a JPA annotation for domain entities, “@Document” represents MongoDB domain documents, “@Schema” is used for documentation purposes of domain models, and “@JsonInclude” is used to enforce Jackson serialisation control for domain/DTO objects.
- the file is included in a directory named as “error” or “exception”
- the file’s name ends with “Error” or “Exception”
- the file’s class implements or extends “Exception” or “Throwable”, which is checked against the file’s content
- the file’s content incorporates one of the following annotations: “@SpringBootApplication”, “@Configuration”, “@Bean”, “Provider” and “@Singleton”. All these annotations are utilised in the configuration and dependency injection layer of an application, so they map to irrelevant kinds of classes in the context of OpenAPI specification generation.
- the file’s class extends module, converter, validation or (de)serializer classes (e.g., StdDeserializer). Such classes might have “DTO” as a substring of their name or carry the “@Component” annotation, yet they are irrelevant for OpenAPI specification generation purposes.
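The filtering rules above can be sketched as a pair of predicates. This is a minimal illustration and not the authors' implementation: it assumes that the first three rules are inclusion criteria for domain/DTO files, that the remaining five are exclusion criteria, and that exclusion takes precedence. The function names and both regular expressions are hypothetical.

```python
import re
from pathlib import PurePosixPath

INCLUDE_ANNOTATIONS = ("@Entity", "@Document", "@ApiModel", "@Schema", "@JsonInclude")
EXCLUDE_ANNOTATIONS = ("@SpringBootApplication", "@Configuration", "@Bean",
                       "Provider", "@Singleton")

# Hypothetical patterns approximating "extends/implements X" in Java and Kotlin.
THROWABLE_BASE = re.compile(r"\b(?:extends|implements|:)\s*[\w.]*(?:Exception|Throwable)\b")
HELPER_BASE = re.compile(
    r"\b(?:extends|:)\s*\w*(?:Module|Converter|Validator|Serializer|Deserializer)\b")


def is_excluded(path: str, content: str) -> bool:
    """Return True when any of the exclusion rules fires."""
    p = PurePosixPath(path)
    in_error_dir = any(part.lower() in ("error", "exception") for part in p.parts[:-1])
    named_error = p.stem.endswith(("Error", "Exception"))
    return (in_error_dir
            or named_error
            or THROWABLE_BASE.search(content) is not None
            or any(a in content for a in EXCLUDE_ANNOTATIONS)
            or HELPER_BASE.search(content) is not None)


def is_domain_file(path: str, content: str) -> bool:
    """Return True when the file looks like a domain/DTO class worth keeping."""
    if is_excluded(path, content):  # assumption: exclusion overrides inclusion
        return False
    p = PurePosixPath(path)
    return (p.stem.lower().endswith("dto")
            or "data class" in content
            or any(a in content for a in INCLUDE_ANNOTATIONS))
```

Checking exclusion first reflects the remark that helper classes may carry “DTO” in their name yet remain irrelevant for specification generation.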
3.4.2. OpenAPI Specification Incremental Construction
- include in the specification’s Info section detailed text for the “title” and “description” fields, along with the value of 1.0.0 for the “version” field. The main goal is to introduce an overview of the service’s functionality that can facilitate its operations’ discovery.
- incorporate in the specification’s Paths section all the service’s operations as derived from its source code. In each operation, the right HTTP method must be used, the “operationId” field’s value should be equal to the name of the method that implements this operation, and a detailed textual description of the operation’s functionality must be incorporated in the “summary” field. Further, it is instructed that all I/O parameters should have clear names, types and locations along with a detailed analysis of their semantics in the “description” field. In case of complex types, they need to properly reference the correct schemas at the specification’s Components section. In addition, one example must be given for complex types in I/O parameters. Finally, no duplicate keys should be supplied in the operation path. The main goal of the above rules is twofold: (a) guarantee that the service operation signature fully matches that of the respective class method in the service’s source code; (b) detail the semantics of each service operation and its I/O parameters based on the implemented behaviour in the respective class method. As such, a more precise discovery of the service operations can be facilitated.
- include in the Components section the schemas of all composite types mapping to the service operations’ I/O parameters. The schemas must reflect the structure of the information exchanged as reflected in the service’s source code (e.g., in terms of the involved domain models or DTOs). Such structured schema definitions can definitely assist in the semantic enhancement of I/O parameters in the OpenAPI specifications, as they can match ontology concepts.
- ensure that the generated specification conforms to OpenAPI standard version 3.0.0, without causing any validation error. In addition, POST operations should not include plain input parameters but a requestBody part. Finally, no duplicate information should be incorporated in the specification. All these rules enforce the syntactic and structural validity of the generated OpenAPI specification, guaranteeing its proper processing (e.g., by other methods that implement the proposed research methodology).
3.5. Automatic LLM-Based OpenAPI Specification Annotation
3.5.1. I/O Annotation of OpenAPI Specifications
First Task: Internal Ontology Creation
- I/O Parameter Extraction: This operation, named as parseOpenAPISpecification(), extracts the main I/O parameters and their types from a service’s OpenAPI specification.
- Ontology Chunk Construction: This operation, named as computeIntegrateOntologyChunk(), produces an ontology chunk out of an OpenAPI specification’s extracted I/O parameters.
- Ontology Chunks/Parts Consolidation: This operation, named as consolidateOntologyChunks(), takes as input a set of ontology chunks or parts (e.g., N out of M, where M is the total number of OpenAPI specifications) and creates one overall ontology part out of them.
Internal Ontology Construction Algorithm

First Algorithm Operation: I/O Parameter Extraction
- name: the I/O parameter’s name
- description: its textual description
- isComposite: whether it maps to a composite type
- simpleType: its simple type (where applicable)
- fatherParameter: the parameter in which it is contained. This covers the case that the parameter is a field that is included in a composite type represented by another OntologyParameter
- fatherClasses: these are super-classes if the current parameter is composite and thus can be considered as a specific class. Parameter sub-classing relies on the use of the allOf element in the definition of the sub-class type
- a simple-type parameter named equivalently to a field in a composite type is considered as a datatype property included in the composite type’s class. So, for this parameter, the fatherParameter must be completed.
- a parameter that maps to a composite type is structurally represented by this type. Thus, we create an OntologyParameter for it, constructed from its composite type. The latter parameter structure will have the isComposite field as true and might have super-classes if the allOf element is included in its specification. Further, it might have a fatherParameter, if there is another composite type that has a field with the current type as its data type.
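To make the OntologyParameter structure concrete, the following sketch models it as a Python dataclass and fills it from an OpenAPI-style schema dictionary. The field names follow the list above; the `extract` helper, its signature and its handling of `allOf` are assumptions for illustration, and `$ref` fields are deliberately not resolved.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyParameter:
    name: str
    description: str = ""
    is_composite: bool = False
    simple_type: Optional[str] = None
    father_parameter: Optional["OntologyParameter"] = None
    father_classes: list[str] = field(default_factory=list)


def extract(name: str, schema: dict,
            father: Optional[OntologyParameter] = None) -> list[OntologyParameter]:
    """Flatten one named schema into OntologyParameter records (hypothetical helper)."""
    all_of = schema.get("allOf")
    parts = all_of if all_of is not None else [schema]
    # Only $ref entries inside allOf are treated as super-classes.
    supers = [p["$ref"].rsplit("/", 1)[-1] for p in (all_of or []) if "$ref" in p]
    props: dict = {}
    for part in parts:
        props.update(part.get("properties", {}))
    composite = bool(props) or bool(supers)
    param = OntologyParameter(
        name=name,
        description=schema.get("description", ""),
        is_composite=composite,
        simple_type=None if composite else schema.get("type"),
        father_parameter=father,
        father_classes=supers,
    )
    result = [param]
    for field_name, field_schema in props.items():
        # Each field becomes a parameter whose father is the enclosing type.
        result.extend(extract(field_name, field_schema, father=param))
    return result
```

For a schema that combines a reference to `Country` with extra properties via `allOf`, the first record returned is the composite parameter with `Country` among its `father_classes`, followed by one record per field.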
Second Algorithm Operation: Ontology Chunk Construction
- an ontology parameter with a simple type that does not have a father parameter is mapped to a global datatype property not encompassed in any concept. The id (i.e., the fragment part of this entity’s URI) of the datatype property and its rdfs:label map to the parameter name, while the property’s type (rdfs:range) is the parameter’s XSD type (e.g., xsd:string). Further, the datatype has its rdfs:comment equal to the parameter’s (textual) description. For example, the identifier parameter will have as id & rdfs:label “identifier”, the xsd:string as rdfs:range and as rdfs:comment “the id of the entity”.
- a parameter with a simple type that has a father parameter will become a datatype property of the father parameter’s class. We construct the same RDFS (triple) statements for this parameter as in the previous case, but we add an extra rdfs:domain statement, pointing to the father parameter’s class id. For instance, the “streetName” datatype property will have :Address as its rdfs:domain value (i.e., pointing to a concept representing an address).
- a parameter with a complex type will become an ontology concept (i.e., will map to an rdf:type owl:Class statement). This concept will have the parameter name as its id and rdfs:label, and the parameter’s description as its rdfs:comment. Further, in case it has super-classes, multiple rdfs:subClassOf statements will be added, each pointing to a different super-class. For example, the EuropeanCountry concept will have rdfs:subClassOf equal to :Country. Moreover, in case it has a father parameter, we need to create an object property (named as “has” plus the parameter name) connecting this parameter to its father one. For instance, if Country has Address as a father parameter, we will create an object property named “hasCountry” with rdfs:domain equal to :Address and rdfs:range equal to :Country.
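The three mapping rules above can be illustrated by emitting Turtle statements from a parameter record. This is only a sketch: the compact `OntologyParameter` class, the `to_turtle` function and the OpenAPI-to-XSD type table are assumptions, and the `:`, `rdfs:`, `owl:` and `xsd:` prefixes are assumed to be declared elsewhere.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Assumed mapping from OpenAPI primitive types to XSD datatypes.
XSD = {"string": "xsd:string", "integer": "xsd:integer",
       "number": "xsd:decimal", "boolean": "xsd:boolean"}


@dataclass
class OntologyParameter:
    name: str
    description: str = ""
    is_composite: bool = False
    simple_type: Optional[str] = None
    father_parameter: Optional["OntologyParameter"] = None
    father_classes: list[str] = field(default_factory=list)


def to_turtle(p: OntologyParameter) -> str:
    """Render the statements for one parameter as Turtle text."""
    comment = p.description.replace('"', '\\"')
    common = [f'rdfs:label "{p.name}"', f'rdfs:comment "{comment}"']
    stmts: list[tuple[str, list[str]]] = []
    if not p.is_composite:
        # Simple type: a datatype property, global unless a father exists.
        lines = ["a owl:DatatypeProperty", *common,
                 f"rdfs:range {XSD.get(p.simple_type, 'xsd:string')}"]
        if p.father_parameter is not None:
            lines.append(f"rdfs:domain :{p.father_parameter.name}")
        stmts.append((p.name, lines))
    else:
        # Complex type: a class, plus a "has<Name>" object property if contained.
        lines = ["a owl:Class", *common,
                 *[f"rdfs:subClassOf :{s}" for s in p.father_classes]]
        stmts.append((p.name, lines))
        if p.father_parameter is not None:
            stmts.append((f"has{p.name}",
                          ["a owl:ObjectProperty",
                           f"rdfs:domain :{p.father_parameter.name}",
                           f"rdfs:range :{p.name}"]))
    return "\n".join(f":{sid} " + " ;\n    ".join(ls) + " ." for sid, ls in stmts)
```

Calling `to_turtle` on a `Country` parameter whose father is `Address` yields both the `:Country` class and the `:hasCountry` object property described in the text.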
Third Algorithm Operation: Ontology Chunk / Part Consolidation
- Equivalent Class/Property Merging: Utilise one canonical URI per class/property and retain only one from classes/properties when they have the same or similar meaning and semantics. During class merging, retain the name that better matches the respective domain. During properties merging: (a) retain the most representative label and comment; (b) select the most general type for them (rdfs:range); (c) apply owl:unionOf directly in the rdfs:domain when the properties belong to different, non-merged classes.
- Structural Integrity Maintenance: The LLM is instructed not to assign the same resource to both ‘owl:Class’ and ‘owl:ObjectProperty’ / ‘owl:DatatypeProperty’. Further, duplicate rdfs:domain and rdfs:range triples for a property must be removed. Finally, references to undefined resources must be eliminated.
- Namespace Consistency Preservation: The LLM must utilise the same namespace in all unified concepts and properties. Further, unused or duplicate prefixes must be removed or consolidated.
- Output Format: The LLM is instructed to generate only the consolidated ontology’s specification in valid Turtle syntax with no explanations / commentary.
- Class Hierarchy Enrichment: generate a common super-class when multiple classes share structure or semantics and move in that class their common data/object properties. Further, if an object property signifies the containment of a specific class with a structure that is fully inherited and extended, model the contained class as a subclass of the containing using rdfs:subClassOf and remove the object property.
- External Ontology Mappings Addition: add an owl:equivalentClass or owl:equivalentProperty statement when an element in Schema.org is semantically equivalent to a class or property in the consolidated ontology, respectively. Further, do not hallucinate.
- Structural Integrity Finalisation: guarantee that all rdfs:domain and rdfs:range statements are valid. Further, delete any residual duplicate or conflicting triples. In addition, utilise owl:equivalentClass or owl:equivalentProperty statements only when needed and only between two resources. Finally, break circular rdfs:subClassOf references by retaining only the most semantically valid rdfs:subClassOf links.
- Namespace Consistency Preservation & Output Format: these two sections are actually equivalent to those in the first step’s prompt.
Second Method Activity: I/O Annotation
- Ontology Element Selection: map each I/O parameter to an element from the internal ontology. If no match is found, match the parameter with an element from Schema.org. When no match is found in any of the ontologies, set “UNKNOWN” as the mapped element.
- Output Format: return only a JSON array with no commentary or explanation. Each array member must be unique and include an identified mapping from the current service’s I/O parameters to ontology elements. This mapping should include: (a) the name of the matched I/O parameter – for input parameters, we retain their name while for output parameters, the name of their data type/schema; (b) the full URI of the ontology element mapped to the parameter; (c) the name of the internal ontology or Schema.org as the source ontology including the mapped element.
- Selection Criteria: conduct the mapping by considering the semantics, data type and context (of the I/O parameters) within the OpenAPI specification. Output parameters must be mapped to ontology concepts while input parameters to data/object properties or ontology concepts, depending on the current usage/context. Finally, I/O parameters should be mapped only to valid/existing elements from the two ontologies considered.
- Constraints: do not annotate individual fields of schemas but only complete schemas. Do not utilise concepts/properties belonging to other ontologies than the considered. Further, do not hallucinate by using non-existing elements. Finally, do not provide duplicate array members.
- Output Constraints: produce only a syntactically and semantically valid JSON array. Further, ensure that all URIs in the array are valid and resolvable.
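Since the prompt constrains the LLM output to a duplicate-free JSON array over existing ontology elements, a post-check along the following lines could catch violations before the annotations are stored. This is a sketch under assumptions: the key names `parameter`, `uri` and `source` and the `validate_mappings` function are hypothetical, as the text does not give the exact JSON field names.

```python
from __future__ import annotations

import json

REQUIRED_KEYS = {"parameter", "uri", "source"}  # assumed field names


def validate_mappings(raw: str, known_uris: set[str]) -> list[str]:
    """Return a list of problems found in the LLM's annotation output."""
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(entries, list):
        return ["top-level value is not a JSON array"]

    problems: list[str] = []
    seen: set[tuple[str, str]] = set()
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict) or not REQUIRED_KEYS <= entry.keys():
            problems.append(f"entry {i}: missing required keys")
            continue
        key = (str(entry["parameter"]), str(entry["uri"]))
        if key in seen:
            problems.append(f"entry {i}: duplicate mapping")
        seen.add(key)
        # "UNKNOWN" is the documented placeholder when no element matches.
        if entry["uri"] != "UNKNOWN" and entry["uri"] not in known_uris:
            problems.append(f"entry {i}: unknown ontology element {entry['uri']}")
    return problems
```

An empty result means the array satisfies the structural constraints; it does not say whether the chosen mappings are semantically appropriate.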
3.5.2. Action Annotation of OpenAPI Specifications
- Ontology Element Selection: the LLM is instructed to pick the action from Schema.org that best matches the service operation at hand (based on the main verbs used in its description). If no suitable match is found, the LLM must select the most general action, mapping to the URI https://schema.org/Action. In addition, this section covers corner cases occurring in specific domains, two of which are given below:
  – In the transaction domain, if the verbs used in the operation description indicate the initiation, confirmation, or finalisation of a transaction or process, the most suitable action to select is https://schema.org/TradeAction or any suitable one of its subtypes (like https://schema.org/PayAction).
  – In the mathematical/evaluation domain, if the operation verbs indicate computation, transformation, or numeric/text calculation, the best possible action is https://schema.org/SolveMathAction. Otherwise, if the verbs indicate evaluation, validation, or condition checking, the action to be selected is https://schema.org/AssessAction.
  Finally, if no verb is found in the operation description, then:
  – If the description relates to diagnostics, error handling, health check, or status verification, https://schema.org/AssessAction must be selected.
  – If it relates to data fetching or retrieval, https://schema.org/ReadAction must be selected.
- Verb Extraction Rule: This section indicates to the LLM the sources for identifying the right verb to consider in the service operation’s specification. The primary source is the operation identifier (the first verb-like token within the identifier), while the secondary source is the start of the operation’s textual description. The secondary-source verb overrides the primary-source one when it is imperative and the two conflict. The LLM is also instructed to ignore nouns that include action words. Finally, a corner case is covered where the verb ‘options’ in the operation identifier (e.g., HTTP OPTIONS verb/method) should be mapped to https://schema.org/AssessAction, as it indicates the necessity to probe or check the service’s operations.
- Selection Criteria: the LLM is instructed to match based on the semantics of verbs by considering also the context (i.e., the operation’s textual description). Further, only action-related valid terms from Schema.org must be used. Finally, it is repeated that the best possible match should be supplied per each service operation relating to a subclass of https://schema.org/Action. If no match is found, then https://schema.org/Action should be the matched action.
- Constraints: the LLM is instructed not to utilise any element from a different ontology than Schema.org. Further, it is indicated that the LLM should not hallucinate, constructing action-related concepts that do not exist in Schema.org.
- Output Constraints: the LLM is instructed to always provide a full, valid and resolvable URI for the ontology concept mapped.
- Output: the LLM is instructed to output only a JSON array, including as entries the matched operation’s identifier and the full URI of its matching action from Schema.org.
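In the methodology the action choice is delegated to the LLM, but the precedence of the fallback rules can be illustrated deterministically. The sketch below is a simplification under assumptions: the verb table is illustrative (only the ‘options’ entry and the PayAction hint come from the text), the hint keywords are hypothetical, and the description-based rules fire whenever the identifier’s first token is not in the table rather than only when no verb is present.

```python
import re

SCHEMA = "https://schema.org/"

# Illustrative table; in the paper this decision is made by the LLM.
VERB_TO_ACTION = {
    "options": "AssessAction",
    "pay": "PayAction",
    "delete": "DeleteAction",
    "create": "CreateAction",
    "update": "UpdateAction",
    "search": "SearchAction",
}
DIAGNOSTIC_HINTS = ("diagnostic", "error", "health", "status")
RETRIEVAL_HINTS = ("fetch", "retriev")


def first_token(operation_id: str) -> str:
    """Return the first camel-case or upper-case token of an identifier, lowercased."""
    tokens = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", operation_id)
    return tokens[0].lower() if tokens else ""


def fallback_action(operation_id: str, description: str) -> str:
    """Pick a Schema.org action URI using the documented fallback order."""
    verb = first_token(operation_id)
    if verb in VERB_TO_ACTION:
        return SCHEMA + VERB_TO_ACTION[verb]
    text = description.lower()
    if any(h in text for h in DIAGNOSTIC_HINTS):
        return SCHEMA + "AssessAction"
    if any(h in text for h in RETRIEVAL_HINTS):
        return SCHEMA + "ReadAction"
    return SCHEMA + "Action"  # most general action when nothing matches
```

Such a rule table could also serve as a sanity check on the LLM’s JSON output, flagging operations whose returned action disagrees with the documented corner cases.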
3.6. Automatic Transformation of OpenAPI Specifications
- service vector: it is produced based on the following information: the service’s name, title, summary and (textual) description in the specification’s Info section, the names and (textual) descriptions of all service’s tags in the specification’s Tags section and the information utilised to construct the vectors of all operations in the specification.
- operation vector: it is produced based on the following information: the operation name (mapping to the operationId field or, when it is absent, to a combination of the HTTP verb and relative path of the operation), the operation’s textual description and summary, the operation’s tags, and information related to the operation’s I/O vectors.
- operation input vector: it is constructed per each operation’s input parameter depending on the parameter’s kind. In case of single (i.e., not entity/class-based) parameters, the parameters’ name and description are only considered. On the other hand, for complex (i.e., class-based) parameters, we also consider the names of their properties/fields.
- operation output vector: only one such vector is constructed per operation. It is treated similarly to the case of input vectors, as an operation output can map either to a simple or a complex type.
- Inserting a space between adjacent lowercase and uppercase letters in a word. This enables splitting a camel-case word into multiple sub-words
- Replacing underscore characters with a space character to split snake-case words
- Replacing non-letter characters with a space
- Lowercasing the remaining characters
- Splitting the overall string into multiple tokens/words based on white space characters
- Removing stop-word tokens
- Lemmatising the remaining tokens based on the Stanford CoreNLP pipeline
- Producing a final array of String-based tokens
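The preprocessing steps above can be sketched as a single function. Two parts are stand-ins and should be read as assumptions: the stop-word list is a tiny illustrative sample, and `lemmatise` is an identity placeholder for the Stanford CoreNLP lemmatiser used in the paper. Letters are restricted to ASCII here for simplicity.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to", "and", "or",
                        "in", "for", "by", "is", "with"})


def lemmatise(token: str) -> str:
    """Placeholder for a real lemmatiser (the paper uses Stanford CoreNLP)."""
    return token


def preprocess(text: str, stop_words: frozenset = STOP_WORDS) -> list:
    # 1. Split camel-case words at lowercase/uppercase boundaries.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # 2. Split snake-case words.
    text = text.replace("_", " ")
    # 3. Replace non-letter characters with a space (ASCII letters only here).
    text = re.sub(r"[^A-Za-z]", " ", text)
    # 4. Lowercase, 5. tokenise on white space.
    tokens = text.lower().split()
    # 6. Remove stop words, 7. lemmatise, 8. return the token array.
    return [lemmatise(t) for t in tokens if t not in stop_words]
```

For instance, `preprocess("getUserById")` yields `["get", "user", "id"]` with the sample stop-word list, since “by” is filtered out.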
3.7. Automatic LLM-Based Service Request Structuring and Enrichment
3.7.1. Automatic LLM-Based Service Request Structuring
- request vector: it is computed similarly to an operation vector.
- request input vector: it is computed similarly to an operation input vector.
- request output vector: it is computed similarly to an operation output vector.
First Method Activity: LLM-based Request Structuring
- Action-focused: indicates to the LLM that when multiple verbs are involved in the textual request description, the one mapping to the request’s intent must be selected. Further, detailed mappings are supplied from typical and non-typical verbs to specific “standard” actions from the action hierarchy in Schema.org. Each mapping groups semantically relevant verbs under a “standard” verb. For instance, “list”, “find” and “search” are mapped to the “search” action.
- I/O-focused: dictates that single text must be returned per input/output and not structured/nested fields. Further, when multiple entities are involved (in the request), the one fetched or affected by the requested service operation should be the (single) output, while entities playing the role of subject, filter, scope, location or qualifier must be mapped to input parameters. When an action (implicitly) imposes identifying an entity to be processed or returned, an input parameter mapping to the identifier of that entity must be added. Finally, when the input is not specified or is considered irrelevant, the input parameter section should correspond to an empty array.
- Fallback: the fallback logic affects the handling of two main cases:
  1. when the request’s textual description includes alternatives of entities or parameters, the most general or representative term must be returned
  2. when the request’s textual description is ambiguous in terms of the designated action or I/O, the LLM should make the best reasonable guess for the request’s extracted sections, instead of returning “unknown” or “unspecified”.
- Output format: The output format actually corresponds to the Function entity in the service representation model (without the TFIDF and LLM-based embeddings vectors). Further, the LLM must not add commentary in the response, structured/nested fields in inputs & output sections, and not make up terms.
- Examples: Three examples were supplied, covering three main management operations (update, delete and collection retrieval) that can be performed on business entities. They were deemed necessary as some LLMs did not fully follow the previous instructions and, e.g., did not include as input the identifier of an entity to be deleted or did not include as output the entity being updated.
3.7.2. Automatic LLM-Based Service Request Annotation
4. System Architecture & Implementation
4.1. System Architecture
- Controller: it is the most central component, playing the role of the Controller in the CSR pattern. It exposes the service registry interface to the outer world. Further, it is responsible for receiving requests, delegating their processing to the right service at the service layer and returning the respective response. Finally, it validates incoming requests and returns appropriate error responses with the correct HTTP status code when validation fails.
- OpenAPI Service: it handles the first three phases of our research methodology, as it covers the management of RESTful services and especially their OpenAPI specifications. To realise this management, this service communicates with the Code Repository Service to fetch a service’s source code and filter it, and then with the LLM Service to produce the service’s OpenAPI specification. Further, it communicates with the LLM Service to construct the internal ontology, with the Ontology Service to store this ontology and then back with the LLM Service to annotate the services’ OpenAPI specifications. Finally, it communicates with the RESTful Service Repository to store the managed services (according to our service representation model) and their OpenAPI specifications.
- Discovery Service: it handles the last two phases of our research methodology. It communicates with the LLM Service to structure and annotate the incoming service discovery requests. It also interacts with the Ontology Service when executing semantic-based service matchmaking algorithms. This component is a wrapper of all service matchmaking algorithms we have implemented. Thus, it can be configured to execute any of them.
- LLM Service: it interacts with the LLM Proxy to issue LLM prompts or request the calculation of LLM-based embedding vectors.
- LLM Proxy: it interfaces with an external LLM API, which might be offered by an LLM provider or an LLM marketplace, so as to allow the LLM Service to flexibly utilise the LLM of its choice.
- Code Repository Service: it fetches via Git a service’s source code based on its Git-based Code Repository URL and places it at an appropriate folder within the local file system.
- Ontology Service: it is responsible for ontology management in cooperation with the underlying Knowledge Base. Further, it supplies reasoning and semantic querying facilities over this Knowledge Base, which can be exploited by semantic functional service matchmaking algorithms. It can be imagined as a wrapper of the underlying Knowledge Base that exposes only the necessary interface, thus hiding low-level ontology management details.
- RESTful Service Repository: it plays the role of a Repository, which stores and updates service objects (according to our service representation model) in cooperation with the underlying Database by exploiting the ORM technology.
- Database: it is a relational DB enabling the transactional storage, updating and querying of information related to RESTful services, including their OpenAPI specification.
- Knowledge Base: it enables storing ontologies (e.g., internal and Schema.org) while also providing reasoning and querying operations over them.
4.2. Implementation Details
1. the well-known Smile library, used to preprocess textual descriptions, produce TFIDF vectors out of them and compute their similarity based on the cosine similarity measure.
2. Apache’s Jena library, used to interact with the underlying Knowledge Base to query or reason over ontology-based data.
5. Service Discovery Algorithms
5.1. Introduction
5.2. Request Paraphrasing
5.2.1. Request Paraphrasing & Results Merging

5.2.2. LLM-Based Request Paraphrasing
5.3. TFIDF-Based Matchers
5.3.1. Service Operation Categorisation
- COMPLETE_THRESHOLD: maps to the threshold for the COMPLETE lexical category. If a similarity score is above this threshold, the respective service operation is classified under this category.
- PARTIAL_THRESHOLD: concerns the threshold for the PARTIAL lexical category. If a similarity score is above this threshold and below COMPLETE_THRESHOLD, the respective service operation is classified under this category.
- POSSIBLE_THRESHOLD: corresponds to the threshold for the POSSIBLE lexical category. If a similarity score is above this threshold and below PARTIAL_THRESHOLD, the respective service operation is classified under this category.
- – we consider the percentile of 10% of all values as a threshold such that we cover most of the best similarity values (90%).
- – we consider the percentile of 80% of all values as the threshold, so we are more conservative as there is a need to have strong evidence that a specific match has a similarity value that signals a partial match.
- – this corresponds to a heuristic baseline that is satisfactory for our purposes. However, in the future, we will opt for a more sophisticated technique, which considers the overlap region between positive and negative distributions.
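The threshold-based classification can be sketched as follows. The function name and the numeric thresholds in the usage note are hypothetical; only the ordering of the three categories and the "above this threshold" comparison come from the text. Scores at or below POSSIBLE_THRESHOLD are returned as `None`, which is an assumption about how non-matches are represented.

```python
from typing import Optional


def categorise(score: float, complete_threshold: float,
               partial_threshold: float, possible_threshold: float) -> Optional[str]:
    """Map a similarity score to a lexical category, or None for no match."""
    if not complete_threshold > partial_threshold > possible_threshold:
        raise ValueError("thresholds must be strictly decreasing")
    if score > complete_threshold:
        return "COMPLETE"
    if score > partial_threshold:
        return "PARTIAL"
    if score > possible_threshold:
        return "POSSIBLE"
    return None
```

With illustrative thresholds of 0.8, 0.5 and 0.2, a score of 0.65 falls into PARTIAL, while a score of exactly 0.8 is also PARTIAL because the comparison is strict.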
5.3.2. Core TFIDF Matcher
5.3.3. Structured TFIDF Matcher
- action section: we rely on information that purely covers the actual action realised by a service operation without incorporating any kind of input or output information. To this end, we compute a partial operation vector for the operation at hand, which is similar to the respective operation vector without including information from the respective operation input and output vectors. Thus, this new vector is computed from the service operation’s name, its textual description and summary and its tags. At the service request side, we first structure the request so as to easily obtain its three main actions, and we take only its action section to compute its vector that we call partial request vector. Finally, we compute the cosine similarity of the partial request and operation vector and assign it to .
- input section: the computation here is more involved, as it distinguishes four cases:
- 1.
- Neither the request nor the operation has input parameters, so their input similarity is 1.0.
- 2.
- Both the request and the operation have input parameters. In this case, we construct overall input vectors for them. The overall operation input vector is constructed by concatenating, with a space as a separator, the information concerning each of the operation’s input parameters. The overall request input vector is constructed by concatenating the names of the request input parameters (in the request input section), again separated by a space. Finally, we compute the cosine similarity between these two vectors and assign it to the input similarity. We prefer this approach because it considers the overall similarity between all service operation and request input parameters. By contrast, a naive approach that constructs separate vectors per input parameter, “matches” them and then computes an overall similarity would fail, as the probability of leaving input parameters unmatched due to uncommon terminology would increase.
- 3.
- The operation has input parameters, while the request does not. This is a rather problematic situation, as the requester will not be able to call the service operation because they expect no input parameter for the service. Further, it can be a signal of different intents between the service operation and the request. As such, the input similarity is penalised based on a penalty value, configured to take the high value of 0.8, and on the number of the operation’s input parameters. Thus, the input similarity gets even smaller as the number of input parameters in the service operation increases.
- 4.
- The operation does not have input parameters, but the request does. This situation is less problematic than the previous one. However, as it still indicates a potential intent mismatch, we apply a smaller penalty value, configured by default to 0.5, and the input similarity additionally depends on the number of the request’s input parameters.
- output section: similarly to the input section similarity calculation, we distinguish three cases:
- 1.
- Neither the service request nor the operation has an output, so the output similarity is set to 1.0.
- 2.
- Both the service request and the operation have an output. Then, we compute the cosine similarity between the operation output vector and the request output vector (see Section 3.6 and Section 3.7.1) and assign it to the output similarity.
- 3.
- One of them has an output, and the other does not. In this case, the output similarity is penalised based on a penalty value that is configured to be high (0.8).
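The first two cases of the input section can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, and plain term counts stand in for the TF-IDF weighting. The penalised cases 3 and 4 are deliberately left out because their exact formulas are not reproduced here.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts (an assumed stand-in for TF-IDF weights)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_input_similarity(op_params: List[str],
                              req_params: List[str]) -> Optional[float]:
    """Cases 1 and 2; returns None for the penalised cases 3 and 4."""
    if not op_params and not req_params:
        return 1.0  # case 1: nothing to compare on either side
    if op_params and req_params:
        # case 2: one overall vector per side, parameters joined by a space
        op_vector = term_vector(" ".join(op_params))
        req_vector = term_vector(" ".join(req_params))
        return cosine(op_vector, req_vector)
    return None  # cases 3 and 4 need the penalty formulas of the text
```

Joining all parameters into one document per side means that a request parameter named with unusual terminology lowers the score gradually instead of producing a hard unmatched parameter.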
5.3.4. Ontology-Based TFIDF Matcher
5.4. LLM-Based Embeddings Matchers
5.4.1. Core LLM-Based Embeddings Matcher
5.4.2. Onto LLM-Based Embeddings Matcher
5.4.3. Structured LLM-Based Embeddings Matcher
Vector Terminology
- partial operation vector: constructed by considering the service operation’s name, textual description and summary.
- semi-semantic request vector: constructed based on a String that concatenates the annotations in the structured request (action + input parameter annotations + output annotation).
- semi-semantic operation input vector: constructed based on a String that concatenates the annotations of all input parameters in the service operation.
- semi-semantic request input vector: constructed based on a String that concatenates the annotations of all input parameters in the request.
- semi-semantic request output vector: maps to an LLM-based embeddings vector, computed based on the request output’s annotation.
- semi-semantic operation action vector: constructed from the action annotation of the service operation after removing the “Action” postfix.
- semi-semantic request action vector: constructed from the action annotation of the request after removing the “Action” postfix.
Matcher’s Logic
- action similarity: computed via the cosine similarity between the semi-semantic operation action vector and the semi-semantic request action vector. Thus, we consider the action annotations in the service operation and request to compute it.
- input similarity: computed according to the configured input matching strategy. As this computation is more involved, we detail it under Input Similarity.
- output similarity: the computation of this component is also involved, so we detail it under Output Similarity.
- context similarity: the (service operation) context could be better covered via PE constraints. However, such constraints are not available. As such, we consider that the textual descriptions of the service operation and of the request approximate the intended context. Thus, context similarity is computed as the cosine similarity between the partial operation vector and the request vector.
- domain similarity: as we do not have a description of the domain in the service operation and request, plus no relevant tags or annotations, we make the following assumption. A service covers multiple functionalities within a domain, so it includes sufficient information to cover it. Thus, the domain section for a service’s operation can be the service vector. However, at the request part, we do not have any relevant information apart from the requested operation. Thus, the overall request document is considered the request’s domain. Due to this under-representation, we decided to give a very small weight to this similarity component. Thus, eventually, domain similarity equals the cosine similarity between the service vector and the request vector.
Input Similarity
- 1.
- There are no input parameters in the service operation and request. Then, input similarity is 1.0.
- 2.
- The operation has input parameters, but the request does not. In this case, the input similarity is derived from max_input_sim() applied to the operation’s input parameter set, penalised by a specific configuration property that accounts for the non-existence of input parameters in the request. This property is initially configured to 0.8. The max_input_sim() function computes the maximum similarity between each operation input parameter (i.e., the respective operation input vector) and the semi-semantic request vector. The main rationale is that we attempt to find whether there is a meaningful (LLM-based embeddings) similarity between any operation input parameter and the request’s ontological description, which could indicate that the request is somehow able to cover this input parameter. Further, we rely on the request’s ontological description to remove any kind of ambiguity.
- 3.
- The request has input parameters, but the operation does not. This case is symmetric to the previous one: the penalised input similarity is now computed over the request’s input parameter set, and max_input_sim() computes the maximum similarity between each request input parameter (i.e., its respective request input vector) and the partial operation vector. Again, we check in this way whether there is a semantic correlation between any request input parameter and the operation’s partial description that could signify this parameter’s coverage.
- 4.
- Both the service operation and the request have input parameters. In this case, the input similarity is computed as follows: $sim_{in} = w_{sem} \cdot sim_{in}^{sem} + w_{lex} \cdot sim_{in}^{lex}$, where $w_{sem}$ and $w_{lex}$ are relative weights given to the semantic and lexical-oriented component similarities, respectively. These weights are configurable, and their sum should be equal to 1.0. The semantic input similarity $sim_{in}^{sem}$ is computed via the cosine similarity between the semi-semantic operation input vector and the semi-semantic request input vector. On the other hand, the computation of the lexical-oriented similarity $sim_{in}^{lex}$ depends on the input matching strategy as follows:
- COMBINED_INPUT (default): In this strategy, we regard each input parameter set (of the service operation/request) as a unified (complex) parameter, and we compute these parameters’ overall similarity. As such, $sim_{in}^{lex}$ is computed as the cosine similarity between the overall operation input vector and the overall request input vector.
- DIFF_INPUT_AVERAGE: In this strategy, we find the maximum similarity between each operation input parameter and all the request’s input parameters. This means that we compute M similarities for each operation input parameter (where M is the number of request input parameters) and identify the maximum, which is added to a running sum. In the end, we compute the average over these maximum similarities by dividing this sum by the number N of operation input parameters. Each individual similarity is computed as the cosine similarity between the operation input vector and the request input vector of the respective input parameters being matched. More formally: $sim_{in}^{lex} = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le M} \cos(v^{op}_{i}, v^{req}_{j})$, where the inner maximum is the maximum lexical similarity between the operation input vector $v^{op}_{i}$ and the request’s input parameters (actually their request input vectors $v^{req}_{j}$).
- DIFF_INPUT_COVERAGE: This strategy is similar to the previous one. However, instead of computing the maximum similarity between each operation input parameter and the request’s input parameters, we check whether one of these similarities is above a specific threshold $\theta$, configured by default to 0.7. If such a similarity is found, we consider the current operation input parameter as matched/covered (by the request). Otherwise, if all similarities are below $\theta$, we consider the operation input parameter as unmatched. By continuing this process, we count how many operation input parameters were covered and then divide the result by N (the number of the operation’s input parameters) to compute an average value. More formally: $sim_{in}^{lex} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\max_{1 \le j \le M} \cos(v^{op}_{i}, v^{req}_{j}) > \theta]$, where the indicator checks whether the similarity between the current operation input parameter (i.e., its vector $v^{op}_{i}$) and any request input parameter is above $\theta$.
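The three input matching strategies can be sketched as below. This is a minimal sketch under stated assumptions: embeddings are plain lists of floats, both parameter sets are non-empty (the case in which the strategies apply), and the function names are hypothetical labels that mirror the strategy names rather than an actual API.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_input(op_overall: Vector, req_overall: Vector) -> float:
    """COMBINED_INPUT: compare one overall vector per side."""
    return cosine(op_overall, req_overall)


def diff_input_average(op_vectors: List[Vector],
                       req_vectors: List[Vector]) -> float:
    """DIFF_INPUT_AVERAGE: mean of each operation parameter's best match."""
    best = [max(cosine(o, r) for r in req_vectors) for o in op_vectors]
    return sum(best) / len(op_vectors)


def diff_input_coverage(op_vectors: List[Vector],
                        req_vectors: List[Vector],
                        threshold: float = 0.7) -> float:
    """DIFF_INPUT_COVERAGE: share of operation parameters with a match above the threshold."""
    covered = sum(
        1 for o in op_vectors
        if any(cosine(o, r) > threshold for r in req_vectors)
    )
    return covered / len(op_vectors)
```

Both per-parameter strategies divide by the number of operation input parameters, so an operation that demands many parameters the request cannot cover is scored lower than one that demands few.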
Output Similarity
- Both the service operation and request do not have any output. In this case, output similarity is equal to 1.0.
- Only one of the service operation and the request has an output. In this case, output similarity is penalised through the penalty property already introduced, which is by default equal to 0.8. Please note that the penalisation is independent of the sub-case (i.e., whether it is the service operation or the request that has no output). This is because the lack of an output is a major signal of a (requested or offered) operation’s intent, so it should lead to a major penalty when the other specification being matched does require or produce a specific output. For instance, consider a service operation that returns a specific news article and a requested operation that deletes a news article. Both specifications have the same input (the article’s identifier), but the offered operation returns the news article, while the requested operation does not return the deleted news article.
- Both the service operation and the request have an output. In this case, output similarity is computed analogously to input similarity (in the corresponding case where input parameters exist on both sides) based on the following formula: $sim_{out} = w_{sem} \cdot sim_{out}^{sem} + w_{lex} \cdot sim_{out}^{lex}$. The weights $w_{sem}$ and $w_{lex}$ are the same as in the corresponding case of input similarity computation. $sim_{out}^{sem}$ corresponds to the semantic output similarity, computed as the cosine similarity between the semi-semantic operation output vector and the semi-semantic request output vector. On the other hand, $sim_{out}^{lex}$ corresponds to the lexical output similarity, computed as the cosine similarity between the operation output vector and the request output vector.
Performance and Accuracy Analysis
5.5. LLM-Based Matchers
Introduction
Prompt Design
- 1.
- Inputs: details the two main inputs given to the LLM, i.e., the textual service request and the description of all RESTful services. The latter corresponds to a list of service documents, i.e., the information based on which the service vectors were constructed. Note that this information does not include ontology-based annotations.
- 2.
- Matchmaking Logic: This is the most detailed prompt section, comprising the following six (6) main sub-sections, which explicate the core service matching logic:
- (a)
- Confidence Scoring: explicates that the service operation similarity (or confidence score) is computed as the weighted sum of the semantic (based on request and operation name & description), intent (based on request and operation action/intent), parameter (based on request and operation I/O) and domain context (based on topic overlap, tags and keywords) similarities. More formally: $conf_{op} = w_{sem} \cdot sim_{sem} + w_{int} \cdot sim_{int} + w_{par} \cdot sim_{par} + w_{dom} \cdot sim_{dom}$, where $w_{sem}$, $w_{int}$, $w_{par}$ and $w_{dom}$ are the relative weights given to these similarity-based components, which should sum to 1.0. By default, the semantic similarity receives the highest weight, followed by the intent and I/O similarities. This configuration is similar to the one we utilised for the Structured LLM-based Embeddings matcher, giving the highest cumulative weight to IOPE-based similarity components.
- (b)
- Match Types: Here the prompt explicates the two different match categorisation dimensions and their categories’ semantics. Please refer to Section 3.3 for a detailed analysis.
- (c)
- Service Confidence: While our focus is on supplying service operation matches, we considered it an interesting feature to also match the services themselves via a confidence score and rank them. While this is not reflected in the final output produced for this matcher in our implementation, we intend to properly implement this feature in all matchers so as to present two separate rankings, one corresponding to matched services and one to matched service operations. In this respect, via service ranking, programmers will be able to check first those services that better match their request and then focus on which operations in these services can be used to implement their intended functionality. The service confidence score or similarity is calculated as follows: $conf_{srv} = w_{s} \cdot sim_{srv} + w_{o} \cdot \max_{op} sim_{op}$, where $w_{s}$ and $w_{o}$ are the relative weights given to the two similarity components, respectively, with a sum equal to 1.0, $sim_{srv}$ is the actual similarity of the service with the request, and $\max_{op} sim_{op}$ denotes the maximum of the similarities between the service’s operations and the request. Thus, a service’s confidence score depends on its actual similarity with the request as well as on the best similarity between the request and its operations. The default values of these weights give higher relative importance to the actual service-to-request similarity.
- (d)
- Ranking: To support the future service-first ranking, we require the LLM to produce only this ranking. We then transform it into the currently supported operation-first classification and ranking. This service-first ranking relies on the ranking of services at the outer level and then the ranking of the service operations at the inner level.
- (e)
- Include Contributions: it is requested to include per matched operation the values of the individual similarity components to have a clear view of their contribution degree towards the overall operation similarity. This is another interesting matcher feature not included in the other matcher families and our service representation model. However, we intend to update the latter model to cover it.
- (f)
- Include Explanations: it is requested to provide explanations of why a specific operation was matched and why it got the respective confidence score and was classified in the respective (structural and semantic) categories.
- 3.
- Output: it is prescribed to produce a JSON object with a specific format, including the original request and the set of matched services. Each matched service is featured by its rank, name and confidence score as well as its matched operations. Each matched operation in turn is featured by its rank, name, similarity score, structural category, semantic category, an explanation of its matching and the contributions of the similarity components to the overall operation similarity score.
- 4.
- Additional Instructions: Six extra instructions are supplied, the most important of which indicate that the LLM should not consider historical usage in the calculation of the confidence scores and all scores (overall and component) should be normalised in the range [0.0, 1.0].
- 5.
- Task: Here, the main task to be performed by the LLM is repeated. Further, the LLM is instructed not to add commentary or explanations and thus output only the prescribed JSON object.
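The scoring and ranking rules that the prompt prescribes can also be checked outside the LLM. The sketch below is illustrative only: the dictionary keys (`name`, `operations`, `similarity`) are assumed field names rather than the prescribed JSON schema, and the weights are passed in by the caller because the default values are not reproduced here.

```python
from typing import Dict, List


def weighted_sum(components: Dict[str, float],
                 weights: Dict[str, float]) -> float:
    """Operation confidence: weighted sum of the similarity components."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * components[name] for name in weights)


def service_confidence(service_similarity: float,
                       operation_similarities: List[float],
                       w_service: float,
                       w_best_operation: float) -> float:
    """Service confidence: service similarity combined with its best operation."""
    if abs(w_service + w_best_operation - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return (w_service * service_similarity
            + w_best_operation * max(operation_similarities))


def operation_first(service_first: List[dict]) -> List[dict]:
    """Flatten a service-first ranking into one operation-first ranking."""
    flat = [
        {**operation, "service": service["name"]}
        for service in service_first
        for operation in service["operations"]
    ]
    return sorted(flat, key=lambda op: op["similarity"], reverse=True)
```

Recomputing the overall score from the reported component contributions is one way to detect responses in which the LLM’s stated confidence does not follow from its own components.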
Outlook
6. Experimental Evaluation
6.1. Evaluation Setup & Structure
Action Annotation Evaluation
- annotation precision: examines how precise the suggested annotations are. It is formally computed by dividing the number of correctly annotated operations ($N_{c}$) by the total number of annotated operations ($N_{a}$) as follows: $P = N_{c} / N_{a}$.
- relaxed annotation precision: examines how precise the suggested annotations are under a relaxed interpretation of precision, which also accepts an action that does not perfectly match but strongly correlates with the operation’s intended semantics. More formally, this metric is computed as follows: $P_{rel} = (N_{c} + N_{r}) / N_{a}$, where $N_{r}$ is the number of relevant but not perfect matches. Please note that when the latter matches are considered, precision increases. Further, this metric serves as a discrimination criterion in case some LLMs achieve equivalent annotation precision. Thus, we can discern the best among them based on its ability to attain a higher relaxed precision.
- annotation recall: examines the ratio of the number of correctly annotated service operations to the total number of service operations. More formally, it is defined as follows: $R = N_{c} / N_{o}$, where $N_{o}$ is the total number of operations in the enhanced EMB dataset. In essence, there is a trade-off between precision and recall. So, an LLM is better when it is able to find the best possible balance between these two metrics.
- relaxed annotation recall: examines the ratio of the number of properly annotated service operations (i.e., with best and relevant matches) to the total number of service operations. More formally, it is defined as follows: $R_{rel} = (N_{c} + N_{r}) / N_{o}$. We expect relaxed annotation recall to be higher than annotation recall. Combined with relaxed annotation precision, it enables exploring which LLM achieves the best possible balance between them.
- annotation F1: it is defined as the harmonic mean of annotation precision and recall. More formally: $F1 = 2 \cdot P \cdot R / (P + R)$. This is a better metric than precision and recall as it combines them both into a single accuracy value. Further, by applying the harmonic mean over these two metrics, it punishes their relative imbalance, meaning that it can attain high values only when both have high values, too.
- relaxed annotation F1: it is defined similarly to normal annotation F1 as the harmonic mean of relaxed annotation precision and recall: $F1_{rel} = 2 \cdot P_{rel} \cdot R_{rel} / (P_{rel} + R_{rel})$. As this metric calculates a single (composite) accuracy score, it allows us to better explore which LLM achieves the best possible balance between relaxed precision and recall.
- operation coverage: examines the proportion of service operations that have indeed been annotated. It is formally defined as: $OC = N_{a} / N_{o}$. This metric is similar to a recall metric, but instead of considering the number of correct matches, it accounts for the number of total matches, either correct or not. The rationale of its usage is to explore how many of the service operations are mapped to ontology-based actions. Obviously, the more operations are covered, the better. However, how much better depends on the precision, i.e., whether a greater number of mapped operations is precisely annotated. Again, this metric can be utilised as a discriminating factor between LLMs to select an LLM that has both the highest F1 and operation coverage.
- hallucination percentage: this metric computes the percentage of hallucinations within all the matches returned by an LLM across all OpenAPI specifications. It is formally defined as: $HP = N_{h} / N_{a}$ (expressed as a percentage), where $N_{h}$ is the total number of hallucinations returned by the LLM. A hallucination concerns the proposition by an LLM of a non-existing ontology element for annotating a service operation. Obviously, the lower the value of this metric, the better. Its use is crucial as it signifies whether annotation precision issues relate to the suggestion of hallucinations. Further, it can be a discrimination criterion between LLMs, as it would be much better to select an LLM that hallucinates rarely, if at all.
- service coverage: relates to the proportion of services which were indeed annotated. It is formally computed by dividing the number of outputs produced ($N_{out}$) by the number of services ($N_{s}$) in the (enhanced) EMB dataset as follows: $SC = N_{out} / N_{s}$. This metric enables examining whether an LLM can successfully produce the annotation output for each service without exhibiting any errors. Thus, the higher it is, the better. This is another discriminating factor for LLMs in case they are equivalent in other important metrics, like annotation precision and recall.
- syntax validity: as the output from the LLM is a JSON-formatted array, we desire that this output is syntactically valid, such that we can successfully process it and proceed with the actual annotation of the OpenAPI specification. As such, we need to explore the proportion of syntactically valid outputs ($N_{valid}$) among all outputs ($N_{out}$) returned by an LLM. More formally, this metric is defined as: $SV = N_{valid} / N_{out}$. Obviously, the higher its value is, the better. Again, this metric can be used as a discriminating factor between LLMs with similar performance on the most crucial metrics.
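The operation-level metrics above reduce to a few ratios over counts. The following sketch assumes hypothetical argument names and takes the number of annotated operations as the denominator of the hallucination share, which is one reading of “all the matches returned”; it is meant as a worked illustration, not as the evaluation code used in the experiments.

```python
from typing import Dict


def ratio(numerator: float, denominator: float) -> float:
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def harmonic_mean(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall."""
    return ratio(2 * precision * recall, precision + recall)


def annotation_metrics(correct: int, relevant: int, annotated: int,
                       total_operations: int,
                       hallucinated: int) -> Dict[str, float]:
    """Strict and relaxed precision/recall/F1 plus coverage and hallucination share."""
    precision = ratio(correct, annotated)
    relaxed_precision = ratio(correct + relevant, annotated)
    recall = ratio(correct, total_operations)
    relaxed_recall = ratio(correct + relevant, total_operations)
    return {
        "precision": precision,
        "relaxed_precision": relaxed_precision,
        "recall": recall,
        "relaxed_recall": relaxed_recall,
        "f1": harmonic_mean(precision, recall),
        "relaxed_f1": harmonic_mean(relaxed_precision, relaxed_recall),
        "operation_coverage": ratio(annotated, total_operations),
        "hallucination_share": ratio(hallucinated, annotated),
    }
```

For example, with 6 correct and 2 relevant annotations out of 10 suggested over 12 operations, strict precision is 0.6 and recall 0.5, while the relaxed variants rise to 0.8 and roughly 0.67.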
Request Structuring and Annotation Evaluation
- Action structuring precision: the precision in correctly determining the right action of the functional service request. It is formally defined as follows: $SP_{act} = S^{act}_{c} / S^{act}$, where $S^{act}_{c}$ is the number of correct actions suggested and $S^{act}$ is the number of all actions suggested by the LLM.
- Input structuring precision: the precision in correctly determining the right input parameters of the functional service request. It is formally defined as follows: $SP_{in} = S^{in}_{c} / (S^{in}_{c} + S^{in}_{w})$, where $S^{in}_{c}$ is the number of correct input parameters suggested, and $S^{in}_{w}$ is the number of wrongly suggested input parameters.
- Output structuring precision: the precision in correctly determining the right output of the functional service request. It is formally defined as follows: $SP_{out} = (S^{out}_{c} + S^{out}_{n}) / N_{req}$, where $S^{out}_{c}$ is the number of correctly suggested outputs, $S^{out}_{n}$ is the number of times the LLM correctly did not suggest any output, and $N_{req}$ is the total number of requests in the EMBR dataset. Please note that, as a service request could signify the requirement not to create any output (e.g., when creating, updating or deleting a specific entity), the correct behaviour of an LLM is to not suggest an output in this case. This justifies the numerator in the above formula.
- Structuring precision: this is the overall metric covering the global precision in structuring service requests. It is formally defined as follows: $SP = (S^{act}_{c} + S^{in}_{c} + S^{out}_{c} + S^{out}_{n}) / (S^{act} + S^{in}_{c} + S^{in}_{w} + N_{req})$. As can be seen, we divide the sum of all correct section-specific suggestions by the sum of all section-specific suggestions, where the latter sum includes the number of all suggested actions, the number of requests in EMBR for the sake of outputs, as well as the number of correct and wrong input parameters (suggestions).
- Action annotation precision: it corresponds to the precision in correctly annotating the action section of a functional service request. It is formally defined as follows: $AP_{act} = A^{act}_{c} / A^{act}$, where $A^{act}_{c}$ is the number of correctly annotated request actions and $A^{act}$ is the number of action annotations suggested by an LLM. Please note that we do not consider an annotation as correct if it partially matches the intended service request action. Thus, we count only perfect matches.
- Relaxed action annotation precision: it is similar to the previous metric, but it also considers as correct those annotations that partially match the request’s intended action (i.e., they are strongly correlated with it but do not fully match it). It is formally defined as follows: $AP^{rel}_{act} = (A^{act}_{c} + A^{act}_{p}) / A^{act}$, where $A^{act}_{p}$ is the number of annotations that partially match the request’s intended action.
- Action annotation recall: indicates the recall in action annotation. It is formally defined as follows: $AR_{act} = A^{act}_{c} / N_{req}$, where $N_{req}$ is the total number of requests in EMBR, as each request has one action to be annotated.
- Input annotation precision: it corresponds to the precision in correctly annotating the input section of a functional service request. It is formally defined as follows: $AP_{in} = A^{in}_{c} / A^{in}$, where $A^{in}_{c}$ is the number of correctly annotated request input parameters and $A^{in}$ is the number of input parameter annotations suggested by an LLM. Please note that we do not consider an annotation as correct if it partially matches an intended service input parameter. Thus, we count only perfect matches.
- Relaxed input annotation precision: it is similar to the previous metric, but it also considers as correct those annotations that partially match a request’s intended input parameter. It is formally defined as follows: $AP^{rel}_{in} = (A^{in}_{c} + A^{in}_{p}) / A^{in}$, where $A^{in}_{p}$ is the number of annotations that partially match a request’s intended input parameter.
- Input annotation recall: indicates the recall in input annotation, i.e., the ability to provide correct annotations for all input parameters of all requests (in EMBR). It is formally defined as follows: $AR_{in} = A^{in}_{c} / (A^{in}_{c} + A^{in}_{p} + A^{in}_{w} + A^{in}_{h} + A^{in}_{n})$, where $A^{in}_{w}$ is the number of wrong input annotations, $A^{in}_{h}$ is the number of hallucinated annotations for input parameters, and $A^{in}_{n}$ is the number of input parameters with no annotations suggested. Please note that the following holds: $A^{in} = A^{in}_{c} + A^{in}_{p} + A^{in}_{w} + A^{in}_{h}$. As such, the denominator of the previous formula could be simplified as follows: $AR_{in} = A^{in}_{c} / (A^{in} + A^{in}_{n})$. However, we supplied its expanded form to stress the kinds of input annotation suggestions that can be delivered by an LLM.
- Output annotation precision: it corresponds to the precision in correctly annotating the output section of a functional service request. It is formally defined as follows: $AP_{out} = A^{out}_{c} / A^{out}$, where $A^{out}_{c}$ is the number of correctly annotated request outputs and $A^{out}$ is the number of output annotations suggested by an LLM. Please note that we do not consider an annotation as correct if it partially matches the intended service request output. Please also note that the suggested output annotations can include partially correct, wrong and hallucinated output annotations.
- Relaxed output annotation precision: it is similar to the previous metric, but it also considers as correct those annotations that partially match the request’s intended output. It is formally defined as follows: $AP^{rel}_{out} = (A^{out}_{c} + A^{out}_{p}) / A^{out}$, where $A^{out}_{p}$ is the number of annotations that partially match the request’s intended output.
- Output annotation recall: indicates the recall in output annotation, i.e., the ability to provide correct output annotations for all requests (in EMBR). It is formally defined as follows: $AR_{out} = A^{out}_{c} / (N_{req} - N_{no})$, where $N_{no}$ is the number of requests with no outputs. Thus, the formula’s denominator signifies the number of requests that do have an output.
- Annotation precision: indicates the overall precision in annotating all sections of a functional service request. It is formally defined as follows: $AP = (A^{act}_{c} + A^{in}_{c} + A^{out}_{c}) / (A^{act} + A^{in} + A^{out})$.
- Annotation recall: indicates the overall recall in annotating all sections of a functional service request. It is formally defined as follows: $AR = (A^{act}_{c} + A^{in}_{c} + A^{out}_{c}) / (2 \cdot N_{req} - N_{no} + A^{in} + A^{in}_{n})$. This metric divides the sum of all section-specific correct annotations by the sum of twice the number of requests minus the requests with no outputs (this covers all actions and outputs to be annotated), plus the number of suggested input annotations and the number of input parameters not annotated.
- Annotation F1: signifies a composite and balanced metric of annotation accuracy, formally computed as the harmonic mean of annotation precision and recall: $AF1 = 2 \cdot AP \cdot AR / (AP + AR)$.
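Two of the less obvious request-level metrics can be sketched as follows. This is a minimal sketch with hypothetical argument names; it assumes that the suggested input annotations consist of correct, partially correct, wrong and hallucinated ones, which is how the prose describes the denominator of input annotation recall.

```python
def structuring_precision(correct_actions: int, suggested_actions: int,
                          correct_inputs: int, wrong_inputs: int,
                          correct_outputs: int, correct_no_outputs: int,
                          requests: int) -> float:
    """Overall structuring precision across action, input and output sections."""
    # Correctly omitting an output counts as a correct output decision.
    numerator = (correct_actions + correct_inputs
                 + correct_outputs + correct_no_outputs)
    # One output decision is expected per request, hence the request count.
    denominator = (suggested_actions + correct_inputs
                   + wrong_inputs + requests)
    return numerator / denominator if denominator else 0.0


def input_annotation_recall(correct: int, partial: int, wrong: int,
                            hallucinated: int, missing: int) -> float:
    """Correct input annotations over all input parameters to be annotated."""
    suggested = correct + partial + wrong + hallucinated
    denominator = suggested + missing
    return correct / denominator if denominator else 0.0
```

With 9 of 10 suggested actions correct, 15 correct and 5 wrong input parameters, and 6 correct outputs plus 3 correct omissions over 10 requests, the overall structuring precision is 33/40 = 0.825.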
Functional Service Discovery Evaluation
6.2. Action Annotation Evaluation
6.3. Request Structuring and Annotation Evaluation
6.4. Functional Service Discovery Evaluation
6.4.1. TFIDF Matchers
6.4.2. LLM-based Embeddings Matchers
6.4.3. LLM-based Matchers
1st Subset - Single-Service Requests
2nd Subset - Multi-Service Requests
Overall Evaluation Results
6.5. Discussion
- Our research methodology increases the automation degree in service publication and discovery by incorporating specific LLM-based methods and their encompassing techniques.
- Service publication automation is increased as OpenAPI specifications can be automatically generated from the RESTful services’ source code, while they are automatically annotated via the automatic production and use of internal ontologies and the complementary use of external ontologies, like Schema.org.
- Service discovery automation is increased via the automatic structuring and annotation of service requests, which can then be exploited by any service matching algorithm.
- The accuracy of all our methods has been experimentally validated both in our previously published work [21,23] and in the current article (in this section). The evaluation results indicate that the automation degree achieved by our research methodology and its implemented LLM-based methods does not sacrifice accuracy. On the contrary, the accuracy in all of the method activities or tasks is very high, making them suitable for increasing (functional) service discovery accuracy.
- The main benefits of our research methodology can be observed by the experimental evaluation of our implemented matchers. As has been derived, there is an increase in service discovery accuracy when utilising the incorporated semantics in service specifications and requests.
- All of the above supports the suitability and added value of our research methodology, which increases the automation degree in both service discovery and publication while also leading to an increase in service discovery accuracy.
7. Conclusions & Future Work
7.1. Conclusions
7.2. Future Work
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Georgakopoulos, D.; Papazoglou, M.P. Service-Oriented Computing; Cooperative Information Systems; MIT Press, 2008.
- OASIS. UDDI Version 3.0.2. Standard, OASIS. 2004.
- Dong, X.; Halevy, A.; Madhavan, J.; Nemes, E.; Zhang, J. Similarity search for web services. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB ’04), Toronto, Canada, 2004; pp. 372–383.
- Rodriguez, J.M.; Zunino, A.; Mateos, C.; Segura, F.O.; Rodriguez, E. Improving REST Service Discovery with Unsupervised Learning Techniques. In Proceedings of the 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, Santa Catarina, Brazil, July 2015; pp. 97–104.
- Lo, W.; Yin, J.; Wu, Z. Accelerated Sparse Learning on Tag Annotation for Web Service Discovery. In Proceedings of the 2015 IEEE International Conference on Web Services, New York, NY, USA, June 2015; pp. 265–272.
- Zeng, K.; Paik, I. Semantic Service Clustering With Lightweight BERT-Based Service Embedding Using Invocation Sequences. IEEE Access 2021, 9, 54298–54309.
- Yang, Y.; Qamar, N.; Liu, P.; Grolinger, K.; Wang, W.; Li, Z.; Liao, Z. ServeNet: A Deep Neural Network for Web Services Classification. In Proceedings of the 2020 IEEE International Conference on Web Services (ICWS), Beijing, China, October 2020; pp. 168–175.
- Baryannis, G.; Kritikos, K.; Plexousakis, D. A specification-based QoS-aware design framework for service-based applications. Serv. Oriented Comput. Appl. 2017, 11, 301–314.
- Bener, A.B.; Ozadali, V.; Ilhan, E.S. Semantic matchmaker with precondition and effect matching using SWRL. Expert Syst. With Appl. 2009, 36, 9371–9377.
- Plebani, P.; Pernici, B. URBE: Web Service Retrieval Based on Similarity Evaluation. IEEE Trans. Knowl. Data Eng. 2009, 21, 1629–1642.
- Klusch, M.; Fries, B.; Sycara, K. Automated semantic web service discovery with OWLS-MX. In Proceedings of the AAMAS, Hakodate, Japan, 2006; pp. 915–922.
- Hogan, A. The Semantic Web: Two decades on. Semant. Web 2020, 11, 169–185.
- Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, May 2023; pp. 31–53.
- Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–79.
- Smardas, A. Semantic Service Discovery Supported by LLMs. PhD Thesis, University of the Aegean, 2026.
- Arcuri, A. RESTful API Automated Test Case Generation with EvoMaster. ACM Trans. Softw. Eng. Methodol. 2019, 28, 1–37.
- Obidallah, W.J.; Raahemi, B.; Rashideh, W. Multi-Layer Web Services Discovery Using Word Embedding and Clustering Techniques. Data 2022, 7, 57.
- Liu, F.; Deng, D.; Jiang, J.; Tang, Q. Event-Driven Semantic Service Discovery Based on Word Embeddings. IEEE Access 2018, 6, 61030–61038.
- Nabli, H.; Ben Djemaa, R.; Ben Amor, I.A. Efficient cloud service discovery approach based on LDA topic modeling. J. Syst. Softw. 2018, 146, 233–248.
- OpenAPI Initiative. OpenAPI Specification v3.1.0. Standard; Linux Foundation, 2021.
- Smardas, A.; Kritikos, K. Towards the Automatic Production of OpenAPI Specifications from Source Code. In Proceedings of the 2025 12th International Conference on Future Internet of Things and Cloud (FiCloud), Istanbul, Turkiye, August 2025; pp. 342–349.
- White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. 2023.
- Smardas, A.; Kritikos, K. Towards LLM-Assisted Automatic Semantic Annotation for RESTful Services. In 14th International Conference on Emerging Internet, Data & Web Technologies (EIDWT-2026); Lecture Notes in Data Engineering and Communication Technologies (LNDECT); Springer Nature Switzerland: Cham, 2026.
- Mainas, N.; Bouraimis, F.; Karavisileiou, A.; Petrakis, E.G.M. Annotated OpenAPI Descriptions and Ontology for REST Services. Int. J. Artif. Intell. Tools 2023, 32, 2350017.
- Saaty, T.L. The Analytic Hierarchy Process; McGraw-Hill, 1980.
- Liu, M.; Tu, Z.; Zhu, Y.; Xu, X.; Wang, Z.; Sheng, Q.Z. Data correction and evolution analysis of the ProgrammableWeb service ecosystem. J. Syst. Softw. 2021, 182, 111066.
| 1 | |
| 2 | |
| 3 | they convey similar or equivalent semantics to standard REST verbs like create, update, read and delete |
| 4 | correspond to specialised actions like search, download or pay |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
| 17 | one concerning the API itself based on its image, one for liteLLM and one for PostgreSQL |
| 18 | in Schema.org, all action concepts have this postfix in their names |
| 19 | We consider only the name, textual description and summary of the service operation rather than its whole content, which tends to be dominated by I/O parameter information |
| 20 | in principle, vector computation is faster than request structuring and annotation, but here we assume that they all map to the same cost |
| 21 | K is the number of operation input parameters and M is the number of request input parameters |





| EMBR Subset | No. of Requests | Avg./Min/Max No. of Matching Services | Avg./Min/Max No. of Matching Operations |
|---|---|---|---|
| Single-Service | 15 | 1/1/1 | 1.75/1/4 |
| Multi-Service | 11 | 2.81/2/7 | 5.72/3/9 |

| Metric | GPT 4.1 | Claude Sonnet 3.7 | Claude Sonnet 4 | DS V3 | DS R1 | Mistral Large |
|---|---|---|---|---|---|---|
| annotation precision | 95.3% | 81.2% | 91.2% | 95.3% | 97.2% | 89.9% |
| annotation recall | 95.3% | 81.2% | 91.2% | 95.3% | 95.3% | 89.9% |
| annotation F1 | 95.3% | 81.2% | 91.2% | 95.3% | 96.2% | 89.9% |
| relaxed annotation precision | 99.3% | 93.2% | 95.9% | 97.9% | 98.6% | 97.9% |
| relaxed annotation recall | 99.3% | 93.2% | 95.9% | 97.9% | 96.6% | 97.9% |
| relaxed annotation F1 | 99.3% | 93.2% | 95.9% | 97.9% | 97.6% | 97.9% |
| operation coverage | 100.0% | 100.0% | 100.0% | 100.0% | 97.9% | 100.0% |
| hallucination percentage | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| service coverage | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| syntax validity | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

| Metric | GPT 4.1 | Claude Sonnet 3.7 | Claude Sonnet 4 | DS V3 | DS R1 | Mistral Large |
|---|---|---|---|---|---|---|
| Action structuring precision | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Input structuring precision | 89.28% | 96.43% | 88.88% | 85.18% | 92.00% | 89.28% |
| Output structuring precision | 100.00% | 100.00% | 100.00% | 92.59% | 88.88% | 96.29% |
| Structuring precision | 96.34% | 98.78% | 96.29% | 92.59% | 93.67% | 95.12% |

| Metric | GPT 4.1 | Claude Sonnet 3.7 | Claude Sonnet 4 | DS V3 | DS R1 | Mistral Large |
|---|---|---|---|---|---|---|
| Action annotation precision | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Relaxed action annotation precision | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Action annotation recall | 100.00% | 100.00% | 100.00% | 92.59% | 100.00% | 100.00% |
| Input annotation precision | 95.83% | 85.18% | 96.15% | 82.61% | 95.65% | 88.46% |
| Relaxed input annotation precision | 95.83% | 92.59% | 100.00% | 86.95% | 100.00% | 100.00% |
| Input annotation recall | 85.18% | 85.18% | 96.15% | 79.16% | 95.65% | 88.46% |
| Output annotation precision | 93.75% | 79.16% | 83.33% | 80.00% | 85.71% | 82.61% |
| Relaxed output annotation precision | 93.75% | 87.50% | 91.66% | 90.00% | 95.24% | 100.00% |
| Output annotation recall | 62.50% | 79.16% | 83.33% | 76.19% | 85.71% | 79.16% |
| Annotation precision | 97.01% | 88.46% | 93.50% | 88.40% | 94.36% | 90.78% |
| Annotation recall | 80.24% | 85.18% | 90.00% | 78.20% | 87.01% | 86.25% |
| Annotation F1 | 81.76% | 86.79% | 91.72% | 81.33% | 90.54% | 87.89% |

| Metric | Core TFIDF | Enhanced Core TFIDF | Structured TFIDF | Enhanced Structured TFIDF | Ontology-based TFIDF | Enhanced Ontology-based TFIDF |
|---|---|---|---|---|---|---|
| Avg. operation precision | 0.31 | 0.25 | 0.41 | 0.47 | 0.44 | 0.32 |
| Avg. operation recall | 0.09 | 0.28 | 0.25 | 0.17 | 0.21 | 0.44 |
| Avg. operation F1 | 0.10 | 0.19 | 0.18 | 0.13 | 0.19 | 0.32 |

| Metric | Core LLM-based Embeddings | Enhanced Core LLM-based Embeddings | Onto LLM-based Embeddings | Enhanced Onto LLM-based Embeddings | Structured LLM-based Embeddings | Enhanced Structured LLM-based Embeddings |
|---|---|---|---|---|---|---|
| Avg. operation precision | 0.45 | 0.55 | 0.46 | 0.54 | 0.60 | 0.48 |
| Avg. operation recall | 0.13 | 0.50 | 0.14 | 0.43 | 0.35 | 0.45 |
| Avg. operation F1 | 0.13 | 0.40 | 0.13 | 0.34 | 0.25 | 0.33 |

| Metric | GPT 4.1 | Sonnet 3.7 | Sonnet 4 | DS V3 | DS R1 | Mistral Large |
|---|---|---|---|---|---|---|
| 1st EMBR Subset – Single-Service Requests | | | | | | |
| Avg. operation precision | 0.76 | 0.77 | 0.85 | 0.90 | 0.83 | 0.83 |
| Avg. operation recall | 1.00 | 1.00 | 1.00 | 0.93 | 0.86 | 0.93 |
| Avg. operation F1 | 0.84 | 0.85 | 0.89 | 0.91 | 0.84 | 0.86 |
| 2nd EMBR Subset – Multi-Service Requests | | | | | | |
| Avg. operation precision | 0.86 | 0.89 | 0.91 | 0.81 | 0.95 | 0.86 |
| Avg. operation recall | 0.68 | 0.83 | 0.91 | 0.68 | 0.60 | 0.70 |
| Avg. operation F1 | 0.67 | 0.83 | 0.89 | 0.71 | 0.69 | 0.72 |
| Overall EMBR Set | | | | | | |
| Avg. operation precision | 0.77 | 0.79 | 0.84 | 0.83 | 0.85 | 0.81 |
| Avg. operation recall | 0.86 | 0.93 | 0.96 | 0.83 | 0.75 | 0.83 |
| Avg. operation F1 | 0.77 | 0.84 | 0.89 | 0.82 | 0.78 | 0.80 |
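
In several rows above, the average operation F1 is lower than both the average precision and the average recall (for example 0.67 against 0.86 and 0.68 for GPT 4.1 on the multi-service subset). This is possible when the three values are averaged per request, which is what the "Avg." labels suggest but which this section does not state explicitly. The following minimal sketch illustrates the effect under that assumption; the function names and the two toy requests are hypothetical and are not taken from the evaluation data.

```python
from statistics import mean


def prf(retrieved: set, relevant: set) -> tuple:
    """Precision, recall and F1 for the operations returned for one request."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(results: list) -> tuple:
    """Average precision, recall and F1 over requests (assumed averaging scheme)."""
    scores = [prf(retrieved, relevant) for retrieved, relevant in results]
    return tuple(mean(column) for column in zip(*scores))


# Two hypothetical requests: (retrieved operations, relevant operations).
requests = [
    ({"a"}, {"a", "b", "c", "d"}),   # precise but incomplete answer
    ({"a", "b", "c", "d"}, {"a"}),   # complete but imprecise answer
]

avg_p, avg_r, avg_f1 = macro_average(requests)

# Harmonic mean of the averaged precision and recall, for comparison only.
f1_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)

print(f"avg precision={avg_p:.3f} avg recall={avg_r:.3f} avg F1={avg_f1:.3f}")
print(f"F1 computed from the averages={f1_of_averages:.3f}")
```

Here each request has one strong and one weak metric, so its F1 is low. The averaged precision and recall both look moderate, yet the averaged F1 stays below them. The reported F1 rows therefore should not be expected to equal the harmonic mean of the precision and recall rows.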
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).