Storage, Indexing, Query Processing, and Benchmarking in Centralized and Distributed RDF Engines: A Survey

The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is significant adoption of the Resource Description Framework (RDF) format for saving of web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ various mechanisms to implement critical components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is critical for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a crucial optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.


INTRODUCTION
Over recent years, the simple, decentralized, and linked architecture of Resource Description Framework (RDF) data has greatly attracted different data providers who store their data in the RDF format. This increase is evident in nearly every domain. For example, currently, there are approximately 150 billion triples available from 9960 datasets 1 . Some huge RDF datasets such as UniProt 2 , PubChemRDF 3 , Bio2RDF 4 and DBpedia 5 have billions of triples. The massive adoption of the RDF format requires effective solutions for storing and querying this massive amount of data. This motivation has paved the way the development of centralized and distributed RDF engines for storage and query processing.
RDF engines can be divided into two major categories: (1) centralized RDF engines that store the given RDF data as a single node and (2) distributed RDF engines that distribute the given RDF data among multiple cluster nodes. The complex and varying nature of Big RDF datasets has rendered centralized engines inefficient to meet the growing demand of complex SPARQL queries w.r.t. storage, computing capacity and processing [127,82,51,3]. To tackle this issue, various kinds of distributed RDF engines were proposed [41,34,50,86,51,104,105]. These distributed systems run on a set of cluster hardware containing several machines with dedicated memory and storage. Efficient data storage, indexing, language support, and optimized query plan generation are key components of RDF engines: Data Storage. Data storage is an integral component of every RDF engine. Data storage is dependent on factors like the format of storage, size of the storage and inference supported by the storage format [82]. A recent evaluation [4] shows that the storage of RDF data in different RDF graph partitioning techniques has a vital effect on the query runtime.
Indexing. Various indexes are used in RDF engines for fast data lookup and query execution. The more indexes can generally lead to better query runtime performance. However, maintaining these indexes can be costly in terms of space consumption and keeping them updated to reflect the variations in the underlying RDF datasets. An outdated index can lead to incomplete results.
Query Language. Various RDF engines store data in different formats, thus support various querying languages such as SQL [38], PigLatin [80] etc. Since SPARQL is the standard query language for RDF datasets, many of the RDF engines require SPARQL translation (e.g., SPARQL to SQL) for query execution. Such language support can have a significant impact on query runtimes. This is because the optimization techniques used in these querying languages can be different from each other.
Query Execution. For a given input SPARQL query, RDF engines generate the optimized query plan that subsequently guides the query execution. Choosing the best join execution order, and the selection of different join types (e.g., hash join, bind join, the nested loop join, etc.) is vital for fast query execution.
Various studies categorize, compare, and evaluate different RDF engines. For example, the query runtime evaluation of different RDF engines is shown in [82,3,97,22,6]. Studies like [32,71] are focused on the data storage mechanisms in RDF engines. Svoboda et al. [115] classify various indexing approaches used for linked data. The usage of relational data models for RDF data is presented in [92]. A survey of the RDF on the cloud is presented in [59]. A high-level illustration of the different centralized and distributed RDF engines and linked data query techniques are presented in [81]. Finally, empirical performance evaluation and a broader overview of the distributed RDF engines are presented in [3]. According to our analysis, there is no detailed study that provides a comprehensive overview of the techniques used to implement the different components of the centralized and distributed RDF engines.
Motivated by the lack of a comprehensive overview of the components-wise techniques used in existing RDF engines. We present a detailed overview of the techniques used in a total of 78 (the largest to the best of our knowledge) centralized and distributed RDF engines. Specifically, we classify these triples stores into different categories w.r.t storage, indexing, language and query planning. We provide simple running examples to understand the different types. We hope this survey will help readers to get a crisp idea of the different techniques used in RDF engines development.
Furthermore, we hope that this study will help users to choose the appropriate triple store for the given use-case.
The remaining of the paper is divided into different sections. Section 2 provides vital information on RDF, RDF engines and SPARQL. Section 3 is about related work. Section 4, 5, 6 and 7 reviews the storage, indexing, query language and query execution process. Section 8 explains the different graph partitioning techniques. Section 9 explains centralized and distributed RDF engines w.r.t storage, indexing, query language and query execution mechanism. Section 10 discusses the different SPARQL benchmarks. Section 11 illustrates research problems and future directions, and section 12 gives the conclusion.

BASIC CONCEPTS AND DEFINITIONS
This section contains a brief explanation of RDF and SPARQL. The main purpose of explanation is to establish a basic understanding of the terminologies used in the paper. For complete details, readers are encouraged to look at original W3C sources of RDF 6 and SPARQL 7 . This discussion is adapted from [81,99,51,95].

RDF
Before going to explain the RDF model, we first define the elements that constitute an RDF dataset: An example RDF dataset, representing information about a university student is shown in Table 1. RDF models data in the form of a directed labelled graph where resources are represented as nodes and the relationship between resources are represented as a link between two nodes. For a triple < s; p; o >, a node is created for subject s, another node is created for object o, and a directed link with labelled-predicate p is crated from s to o.
An RDF graph can be defined as below [81].
Definition 2 (RDF Graph). An RDF graph is a tuple of six things G = (V, LV, fV, E, LE, fE ), where, 1. V = Vc ∪ Ve ∪ Vl is a collection of vertices that correspond to all subjects and objects in RDF data, where Vc, Ve, and Vl are collections of class vertices, entity vertices, and literal vertices, respectively 2. LV is a collection of vertex labels.

A vertex labelling function fV: V ----> LV is a bijective
function that assigns to each vertex a label. The label of a vertex u ∈ Vl is its literal value, and the label of a vertex u ∈ Vc ∪ Ve is its corresponding URI. 4. E = u1, u2 is a collection of directed edges that connect the corresponding subjects and objects. 5. LE is a collection of edge labels.
6. An edge labelling function fE: E -→ LE is a bijective function that assigns to each edge a label. The label of an edge e ∈ E is its corresponding property. Figure 1 shows the corresponding RDF graph of the sample RDF dataset given in Table 1.  [48] is the standard querying language to query RDF data. The basic building units of the SPARQL queries are triple pattern and Basic Graph Pattern (BGP). A triple pattern is like an RDF triple except that each of the subjects, predicate and object may be a variable. A set of triple patterns constitute a GP 8 and is formally defined as follows. A sequence of triple patterns with optional filters is considered a single BGP. As per the specification of BGPs, any other graph pattern (e.g., UNION, MINUS, etc.) terminates a basic graph pattern.
Any BGP of a given SPARQL query can be represented as directed hypergraph (DH) [99], a general form of a directed graph in which a hyperedge can be used to join any number of vertices. This representation shows that every hyperedge can capture a triple pattern. This representation of hyperedge makes the subject of the triple to becomes the source vertex, and target vertices are shown as the predicate and object of the triple pattern. The hypergraph representation of the SPARQL query is shown in Figure 2. Unlike a common SPARQL representation where the subject and object of the triple pattern are connected by a predicate edge, the hypergraph-based representation contains nodes for all three components (i.e., subject, predicate, object) of the triple patterns. This representation allows to a capture a joins with predicates of triple patterns involved. The SPARQL representation of the hyperedge is defined as follows: Definition 4 (Directed hypergraph of a BGP). SPARQL query represented as DH is represented as a UNION of query BGPs.
Following features of SPARQL queries are defined on the basis of DH representation of SPARQL queries: 8   Definition 7 (Join Vertex Degree). Based on the DH representation of the queries the join vertex degree of a vertex v is JVD(v) = |Ein (v)| + |Eout (v)|, where Ein (v) resp. Eout (v) is the set of incoming resp outgoing edges of v.

LITERATURE REVIEW
The focus of this section is to present studies that discussed the storage, partitioning, indexing, and query processing techniques used in RDF engines.
For example, Sakr et al. [92] presented one of the first surveys on the usage of the relational model for RDF data. This survey is all about the use of the relational models for RDF data storage and query processing. A broader overview of the data storage and query processing techniques in centralized and distributed RDF engines is presented in [81]. A survey of the storage of RDF data in relational and NoSQL database stores is presented in [71]. Pan et al. [82] discussed the storage and query processing techniques in centralized and distributed RDF engines. They also compared various benchmarks datasets. A high-level illustration of both storage and query processing in centralized and distributed RDF engines is presented in Sakr et al. [93]. They also discussed various SPARQL benchmarks. Svoboda et al. [115] discuss different indexing schemes used in centralized and distributed RDF engines. In particular, three types of indexing are discussed, i.e., local, distributed and global. Faye et al. [32] also discussed the storage and indexing techniques in centralized RDF engines. They divided these techniques into nonnative and native storage solutions. The non-native solutions make use of the Database Management Systems (DBMS) or other related systems to store RDF data permanently.
On the other hand, the native storage solutions store the data close to the RDF model. Thus, such storage solutions avoid the use of DBMS. Rather, the data can be directly stored in different RDF syntaxes 9 such as N-Triples, RDFa, JSON-LD, TriG. A classification of RDF engines is presented in [127]. This study focuses on storage, indexing, query processing mechanisms among RDF engines.
Kaoudi et al. [59] present a survey of RDF systems de-signed specifically for a cloud-based environment. The focus of the paper to classify the cloud-based RDF engines accord-ing to capabilities and implementation techniques. Elzein et al. [29] presented another survey on the storage and query processing techniques used in the RDF engines on the cloud. Janke et al. [55,58] presented surveys on RDF graph partitioning, indexing, and query processing techniques used in distributed and cloud-based RDF engines. They also discussed some of the available SPARQL benchmarks for RDF engines. A survey and experimental performance evaluation of distributed RDF engines are presented in [3]. The study reviews 22 distributed RDF engines and presents an experimental comparison of 12 selected RDF engines. Reviewed systems belong to categories like graph-based, MapReduce based, and specialized systems.
Yasin et al. [129] discussed the limitations and discrepancies in distributed RDF engines. In particular, they discussed the SPARQL 1.1 support in these engines. Authors in [5] discussed the different categories of RDF engines, including centralized, memory-based, cloud-based, graph-based and binary stores etc. are discussed with their respective examples. In [101], there is a discussion of different mapping techniques to map the RDF data into the NoSQL databases. The respective paper describes the mapping process in differ-ent types of NoSQL databases, i.e., keyvalue and columnar databases etc. through their various examples.
An overview of the surveys, as mentioned earlier on RDF engines, is shown in Table 2. The summary indicates that most of the existing studies are focused on the specific components (storage, indexing, query processing) of the centralized or distributed RDF engines. According to our analysis, there exists no detailed study which provides a combined and comprehensive overview of the techniques used to implement the different components of both centralized and distributed RDF engines. Furthermore, previous studies only considered limited RDF engines. We fill this gap by presenting a detailed overview of the techniques used, pertaining to the storage, partitioning, indexing, language and query execution, both in centralized and distributed RDF engines. We included a complete list of the existing RDF engines as compared to previous studies. In addition, in Section section 10, we discuss different SPARQL benchmarks designed for the performance evaluation of RDF engines. We show the pro and 9 RDF Syntaxes https://www.w3.org/TR/rdf11-concepts/ #rdfdocuments Table 2: An overview of the existing surveys on RDF engines.

Study
Year Central Focus Janke et al. [58] 2020 Partitioning, indexing, and query evaluation in distributed RDF engines Santana et al. [101] 2020 Mapping strategies for storage of RDF based data into the NoSQL databases Alaoui [5] 2019 A general categorization of different RDF engines Janke et al. [55] 2018 Partitioning, indexing, and query evaluation in distributed and cloud RDF engines Sakr et al. [93] 2018 Storage and query evaluation in centralized and distributed RDF engines Wylot et al. [127] 2018 Storage and indexing in centralized and distributed RDF engines Pan et al. [82] 2018 Storage techniques in centralized and distributed RDF engines Yasin et al. [129] 2018 Suitability and query evaluation of distributed RDF engines Elzein et al. [29] 2018 Storage and query evaluation in RDF engines on cloud Abdelaziz et al. [3] 2017 Performance evaluation of distributed RDF engines Ma et al. [71] 2016 Storage techniques in centralized RDF engines Ozsu [81] 2016 A general overview of centralized and distributed RDF engines Kaoudi et al. [59] 2015 RDF engines on cloud Faye et al. [32] 2012 Storage and Indexing techniques in centralized RDF engines Svoboda et al. [115] 2011 Indexing in centralized, distributed, and global RDF engines Sakr et al. [92] 2010 Storage and query evaluation in SPARQL transnational RDF engines cons of these benchmarks, which will help the user to choose the best representative benchmark for the given use-case.

STORAGE
Data storage is an integral component of any Database Management System (DBMS). Efficient data storage is critical for disk space consumption, security, scalability, maintenance, and performance of the DBMS. This section reviews storage mechanisms commonly used in centralized and distributed RDF engines. We divide these mechanisms into five broader categories (ref. Figure 3) namely Triple Table, Property Table, Vertical Partitioning, Graph-based data storage solutions, and miscellaneous category comprising of Key-Value, Hbase Tables, In-memory, Bit Matrix, storage as index permutations and systems using another system as their the storage component. Table   The Triple Table (TT) is the most general approach to save RDF data in a relational style (ref. discussed in section 9). This style stores all its RDF data in one large table, which contains three columns, for Subject, Predicate, and Object of the RDF triple. Table 3 shows the TT representation of the RDF Pictorial representation of the sample RDF dataset shown in Table 1. To minimize the storage cost and increase query execution performance, the URIs and Strings used in the TT can be encoded as IDs or hash values, and separate dictionaries can be maintained. For example, using a very simple integer dictionary given in Table 4b, the  TT table given in Table 3 can be represented as integer TT shown in Table 4a. The use of a dictionary is particularly useful for RDF datasets having many repeating IRIs or literals. However, in RDF datasets, IRIs are more commonly repeated as compared to literals. As such, encoding each distinct literal and assigning a dictionary id may unnecessarily increase the dictionary size. Consequently, this can leads to performance downgrade due to extra dictionary lookups during the query execution. To tackle this problem, some RDF engines (e.g. Jena 2 [122]) only make use of the dictionary tables to store strings with lengths above a threshold. On the one hand, this design aids in applying filter operations directly on the TT. But this multiple storage of string values results in higher storage consumption.

Triple
In general, an RDF dataset is a collection of RDF graphs. Thus, an RDF dataset comprises exactly one default graph and zero or more named graphs 10 . Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset. The SPARQL query language lets the user specify the exact named graph to be considered for query execution, thus skipping all others named graphs data to be considered for query processing. Since every RDF triple either belongs to the default graph or specifically named graph, the TT storage can exploit this information and stored the corresponding named graph as the fourth element for each input triple. This specific representation of TT is also called Quad, where the table contains four columns; 3 for storing subject, predicate, and object of a triple and the fourth column stores the corresponding named graph of the given triple. According to RDF specification, named graphs are IRIs. For simplicity, let's assume all the triples gave in table Table 3 belongs to named Graph G1, then the Quad representation of the given TT is shown in Table 5. The quad representation has been used in many well-known RDF 10 RDF named graph: https://www.w3.org/TR/ rdf11concepts/#section-dataset   11 such as Virtuoso [31] and 4store [44]. Please note that as per SPARQL specification 12 , a SPARQL query may specify the RDF graph to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. Such queries can be efficiently executed by using Quad tables; as such queries should be executed over the specified named graph and hence skipping triples that belong to other named graphs.

Property Table
The Property Tables approach aims to lessen the number of joins needed evaluation of for SPARQL Basic Graph Pattern. In this approach, all properties (i.e. predicates) that are expected to be used in combination are stored in one table. A typical property table contains exactly one column to store the subject (i.e. a resource) and n number of columns to store the corresponding properties of the given subject. For example, in our sample RDF dataset shown in Table 1, the subject DB has two properties, namely type and fieldOf. Thus a typical property table would have three columns: one to store the subject and two to store the corresponding properties.
There are two ways to determine the set of properties grouped together in a property table: (1) make use of the type definitions (rdf: type as shown in Table 1) in the dataset itself, (2) use some clustering algorithm to determine the properties groups. Our sample RDF dataset shown in Table 1 explicitly mentions two rdf: types, namely Person and Course. Thus, we can group all properties that belong to Person in one table and all properties that belong to the Course in another table. Table 6 shows the corresponding property tables for the RDF dataset shown in Table 1. Table 6 revealed two clear disadvantages of using property tables: Multi-valued properties: Multi-valued predicates are common in databases. For example, a person can have more than one contact numbers. In our example, interestedIn is a multi-valued predicate: Bob is interested both in SemWeb and DB courses. A typical multi-valued predicate will introduce duplicated information in the columns. For example, in Table 6a, the type and the country ( belongsTo predicate) in-formation is duplicated for the subject Bob. One way to handle this problem to use the typical database normalization approach, i.e. create separate property tables for multi-valued predicates. Table 7 shows the corresponding properties tables after creating separate tables for multi-valued predicates.
Null values: It is very common in RDF datasets that certain resources have missing information for some particular predicates. For example, the country of the resource Alice is missing in Table  1. Such missing information is typically represented as null values. The low datasets structuredness values shown in [100] suggest that many real-world RDF datasets contain missing information for different resources.
Summary. PT performs very well for executing star joins (ref. Figure  2) in the query. This is because a star join node is based on subjectsubject joins. Thus a typical property table will act like a subject-based index for executing such joins. However, it suffers for executing other types of joins, e.g., path, hybrid, and sink (ref. ref. Figure 2) used in the SPARQL queries. A path join node refers to subject-object join, sink join node refers to object-object join, and hybrid join node refers combination of all. The real-world user's query statistics of the four datasets, presented in [94], show that 33% of the real-world queries contain star join, 8.79% contain path join, 6.62% contain sink join, 4.51% contains a hybrid join, and 66.51% contains no join at all. Thus, this approach can produce give efficient results for the majority of the queries containing joins between triple patterns. Furthermore, different tools tried to reduce the problems and performance deficiencies associated with PT. For example, in [122], PT was used together with a TT, where the PT aims to store the most commonly used predicates. Finally, this approach is sensitive to the underlying schema or data changes in the RDF datasets.

Vertical Partitioning
Vertical Partitioning (VP) was proposed in [2]. In contrast to PT and TT, a VP stores RDF data in two columns tables form. Subject and object named by the property. The number of tables equals the number of distinct predicates used in the RDF dataset. Since there are four distinct predicates in the sample RDF dataset shown in Table 1, the corresponding vertical tables are shown in Table 8. In contrast to VP, the VT does not suffer from the multi-valued predicates and the missing information that corresponds to null values. The approach is particularly useful for answering SPARQL triple patterns with bound predicates (e.g. ?s p ? o). This is because the predicates tables can be regarded as an index on predicates, hence only a single table needs to considered while answering triple patterns with bound predicates. However, this type of storage is not optimized for answering triple patterns containing unbounded predicates (e.g., s ? p ? o). This is due to the fact since the predicate is shown as a variable in the triple pattern, the triple pattern matching will consider all vertical tables. Consequently, maybe fetched a large portion of intermediate results which will be discarded afterwards. Summary. VP proves to be efficient in practice for large RDF datasets with many predicates, as it also offers indexing by predicates. This approach is particularly useful for columnoriented DBMS and is easy to manage in a distributed setup. However, the table sizes can significantly vary in terms of a number of rows. As such, some partitions can account for a large portion of the entire graph, leading to workload imbalance. Furthermore, it can cause a lot of I/O resources for answering SPARQL queries with unbound predicates.

Graph-Based Storage
A graph is a natural storage form of RDF data. Various (un)directed, (a) cyclic (multi) graph data structures can be used to store RDF graphs. For example, [134] store RDF graphs as directed signature graphs stored as a disk-based adjacency list table, [16] store as a balanced binary tree and store as multigraphs etc. A very simple labelled, the directed signature graph of Figure   1 is shown in Figure 4. In Graph-based approaches, the given SPARQL query is also represented as graph and sub-graph matching is performed to answer the query.  Figure 4. Signature graph representation using dictionary encoding shown in Table 4b Summary.
The main advantage of graph storage of RDF data is that it is the original representation of the RDF data hence and represents the original semantics of SPARQL. As graph homomorphism is NP-complete, the sub-graph matching can be costly.
In particular, graph storage can raise issues pertaining to the scalability of the RDF engines for large graphs, which can be addressed by indexing of database management techniques.

Miscellaneous Storage
This category includes multiple storage schemes which are yet not widely used in state of the art. Main storage schemes are Key-Value based [133], HBase tables based 13 [83], API based like Sesame SAIL Storage And Inference Layer [17], on disk or in memory-based [12], and Bit Matrix-based [130]. We encourage readers to refer to the corresponding papers for the details of these RDF storage schemes.

INDEXING
Indexing is one of the important components of database management systems. The right selection of indexing can significantly improve query runtime performances. How-ever, indexes need to be updated with underlying changes in the data sets. In particular, if the data changes on regular intervals, too many indexes may slow down the performance. In addition, indexes need extra disk space for storage. Major types of indexing that we found during the study are given in Figure 5.

Indexing By Predicate
This type of indexing is found in RDF engines where the data is stored as Vertically Partitioned (VP) tables. Each of the VP tables is naturally indexed by the predicate. Predicate-based indexing is particularly helpful for answering triple patterns with bound predicates. In such cases, only one VP table is consulted to answer the given triple pattern. However, a SPARQL triple pattern can have eight different combinations based on bounded or unbounded s, p, o, as shown in Figure 6. As such, using only a single predicate-based index will be less efficient for triple patterns with unbound predicates.
Summary. If data is stored in VP tables, it is auto-matically indexed by predicate as well. It is a space efficient solution and can be useful for datasets with frequent updates. Triple-Permutation Indexing. As mentioned before, an RDF triple comprises Subject, Predicate, and Object. Figure  7 shows the permutations of an RDF triple s,p. O. The RDF engines in this category create indexes pertaining to some or all permutations of the SPO. The goal is to efficiently execute the different types of SPARQL triple patterns shown in Figure 6. Consequently, the Triple-Permutation indexes can be particularly helpful in the efficient execution of triple types of SPARQL joins (ref. Figure 2). The subject, predicate, object permutations can be extended to the quad to include the fourth element of named graph or model. Figure 8 shows the SPO index for the sample RDF dataset given in Table 1 Summary. The triple-permutation indexes solve the aforementioned issue of unbound predicates, associated with predicate-based indexing. The triple-permutation indexes can directly answer all types of triple patterns shown in Figure 6, and hence avoiding expensive self joins on the large TT. However, they suffers from a severe storage overhead and are expensive to be maintained for RDF data with frequent changes.
Backend Database Indexing. Certain RDF engines, e.g. Jena1, Jena2, and Sesame, etc. make use of the existing DBMS as a backend for storing RDF data. Usually, there are multiple indexes available from that backend DBMS which are utilized as indexes. For example, in Oracle DBMS various index including b-tree, function-based reverse key etc. are available. Post-greSQL 14 provides several indexes, e.g., b-tree, hash, GiST, SP-GiST, GIN and BRIN.
Summary. Using existing DBMS systems avoid the extra work of creating indexes. The available indexes in such DBMS are already mature and optimized for the data these DBMS are designed. However, due to different semantics of RDF data, the available indexes or storage solutions in the existing DBMS may not be optimized for RDF datasets.     Table 1.
Miscellaneous Indexing. This type of indexing contains multiple types, i.e., Local predicate and global predicate indexing [132], Space-filling [133], array indexing [65] and Hbase tables 15 [83] indexing scheme etc. We encourage readers to refer to the corresponding papers for the details of these indexing schemes.

LANGUAGE SUPPORT
Language is an interface through which any data repository can be queried. RDF data can be directly queried via SPARQL. However, a translation of SPARQL to other querying language is required if the data is not stored in RDF format. Summary of query languages in centralized and distributed RDF engines is shown below in Figure 9.
The efficient translation of SPARQL into other querying languages is one of the major optimization steps in such engines. Some RDF engines use different API for querying purposes as well. Different types of the query languages found during the study are given below.

SPARQL: SPARQL is the standard query language for
RDF based data. Directly using SPARQL on top of RDF can be much faster, as the underlying semantics of RDF and SPARQL can be utilized towards better runtime performance. In addition, the aforementioned triplepermutation indexes can be leveraged to further optimize the query execution. The majority (ref. Figure 9) of the RDF engines make use of the direct SPARQL execution on top of RDF data. However, storing Big RDF datasets in a format that can be directly queried via SPARQL is still a challenging task.
2. SPARQL Translation: As mentioned before, certain RDF engines make use of the existing DBMS as a backend for storing RDF data. The relational DBMS systems (PostgreSQL, Oracle, etc.) are the most popular among them. Since SPARQL is not directly supported by these DBMS, a translation of SPARQL to DMBS-supportedlanguage is required for query execution. Most of the RDF engines in this category translate SPARQL to SQL using existing translation tools, e.g., Sparklify [111], Ontop [19] etc.
3. Other Languages: Some RDF engines also use their own query language other than SPARQL, i.e., RDF Data Query Language (RDQL) [73], Interactive Tucana Query Language (iTQL) etc. Some RDF engines are used as a library to be used with different applications and offer full or part of RDF functionality [12].

QUERY EXECUTION
Different RDF engines employ different query execution strategies. The overall goal is to devise an optimized query execution plan that leads to fast query execution. This is generally achieved by using different query optimization strategies, e.g., the selection of query planning trees (e.g. bushy tree vs left-dept tree), the lowest cardinality joins should be executed first, parallel execution of different joins, minimization of intermediate results, use of different joins (hash, bind, merge sort etc.), 15 https://hbase.apache.org/ filter operations are push down in the query plan etc. As such, categorizing the complete surveyed systems according to their query execution strategies is rather hard; a variety of optimization strategies are used in state-of-the-art RDF engines. Broader categories (ref. Figure 10) of the query execution strategies are explained below.
1. Statistics Based. This type of query plans is based on the cardinality (estimated or actual) of triple patterns and join between triple patterns. The cardinality estimations may be performed by using stored statistics. The actual cardinalities can be obtained at runtime by using different queries, executed on the underlying dataset. The goal is to execute the lowest cardinality join first, followed by the next lowest cardinality join and so on. The advantage of executing the lowest cardinality joins first is that the intermediate results for the next join could be significantly reduced, hence the join operations are performed quickly. The cost of the cardinality of a particular join is estimated by using the stored statistics as an index.
2. Translation Based. Those RDF engines that make use of the existing DBMS systems can take advantage of different optimization steps used, already implemented in these DBMS. For such systems, the SPARQL query language is translated into another language supported by the underlying DBMS. Thus, additional optimization can be applied during the translation, e.g., the ordering of triple patterns, the use of filters etc.
3. API Based. Some RDF engines are used as a library (e.g. Jena [73], Sesame [17] to store and query datasets. Query execution is dependent on the Application-specific procedure to generate efficient query plans.

Subgraph Matching
Based. RDF engines in this category exploit the graph nature both for RDF data and the corresponding SPARQL query. Once both data and query are represented as a graph, the query execution problem is reduced to the subgraph matching problem query execution.

Miscellaneous
In this category, different types of query execution models exist. For example, index lookups combined with different joins heuristic-based query processing and join ordering etc.

RDF GRAPH PARTITIONING
In distributed RDF engines, the data is distributed among cluster nodes. This partitioning of big data among multiple data nodes helps in improving systems availability, ease of maintenance, and overall query processing performances. Formally, the problem of RDF graph partitioning is defined as below.
RDF graph partitioning techniques can be divided into two major categories: Horizontal Partitioning. It is the row-wise distribution of data into different partitions. It is also called database sharding. In RDF each row of a TT represents a triple, the triple-wise distribution of complete dataset is regarded as horizontal partitioning.
Vertical Partitioning. It is the column-wise distribution of data into different partitions and involves creating tables with fewer columns and using additional tables to store the remaining columns. In the context of RDF dataset, each triple represents a row with three columns, namely subject, predicate and object.
Hence, distribution by any of these columns is regarded as vertical RDF partitioning. A famous example of vertical partitioning by predicate column is already discussed in 4.
A recent empirical evaluation [4] of the different RDF graph partitioning showed that the type of partitioning used in the RDF engines has a significant impact on the query runtime performance. They conclude that the data that is queried together in SPARQL queries should be kept in the same node, thus minimizing the network traffic among data nodes. Figure 11 shows the categories of the partitioning techniques found in distributed RDF engines.
Please note that all the engines under workload-based, hash-based and graph-based categories are the examples of horizontal partitioning. Now we define each of the category given in Figure 11. We explain commonly used [55,61,105,96] graph partitioning techniques by using a sample RDF graph shown in Figure 12 16 . In this example, we want to partition the 11 triples into three partitions, namely green, red, and blue partitions.
Range Partitioning: It distributes triples based on certain range values of the partitioning key. For example, create a separate partition of all RDF triples with Predicate age and object values between 30 and 40. In our motivating example, let the partition key is the triple number with partitions defined according to the following ranges: the first partition is created for all the triples in the range [1,4], a second partition is created for all the triples in the range [5,8], and the third partition is created for all the triples in the range [9,11].
Workload-Based Partitioning: The partitioning techniques in this category make use of the query workload to partition the given RDF dataset. Ideally, the query work-load contains realworld queries posted by the users of the RDF dataset, which can be collected from the query log of the running system. However, real user queries might not be available. In this case, the query workload can either be estimated from queries in applications accessing the RDF data or synthetically generated with the help of the domain experts of the given RDF dataset that needs to be partitioned.
Hash-Based Partitioning: There are three techniques used in this category: Subject-hashed. In this technique, the hash function is applied to the subject of the triple and based on the output value, and the subject is assigned to a partition [57]. This causes imbalance among partitions. This imbalance is shown in the example given in Figure 12. Using this technique, our example dataset is split such that, triples 3,10 and 11 are assigned into red partition, triple seven is assigned into blue partition, and the remaining triples are assigned into the green partition. It is a clear imbalance of partitioning.
Predicate-hashed. This technique applies function is applied on the predicate of the triple, and based on the output value, the predicate is assigned to a partition. This causes all the triples with the same predicate assigned to one partition. In the example given in Figure 12, there are four distinct predicates while the required number of partitions is 3. Thus by using the first-come for serve strategy, all the triples with predicate p1 are assigned to the first partition (red), p2 triples are assigned to a second partition (green), p3 triples are assigned to third partition (blue), and p4 triples are again assigned to the first partition. This technique can lead to significant performance improvement, provided that the predicates are intelligently grouped into partitions, such that communication load among data nodes is reduced [4]. URI Hierarchy-hashed: This partitioning technique is based on assumptions of that IRIs have path hierarchy, and those with the same hierarchy prefix are often queried together [57]. Same assumptions work in this technique, which extracts the path hierarchy of IRIs and those with same are assigned to the same partition. 16 We used different example to show a clear difference between the discussed RDF partitioning techniques.
For example in the case of "http://www.w3.org/1999/02/22rdf-syntax-ns#type", the path hierarchy is "org/w3/www/1999/02/22 rdf-syntax-ns/type". int the subsequent steps, for each level in the path hierarchy (e. g., "org", "org/w3", "org/w3/www", ...), this technique computes the percentage of triples with the same hierarchy prefix. In case of exceeding a percentage to experimentally de-fined threshold and the number of prefixes is equal to or greater than the number of required partitions at any level of the hierarchy, then these prefixes are used for the hash-based partitioning on prefixes. For the determination of IRI prefixes to apply hash function, this technique is computationally expensive. This technique is exhibited in the example given in Figure 12, shows all the triples having hierarchy1 in subjects are assigned to the green partition, triples having hierarchy2 in subjects are assigned to the red partition, and triples having hierarchy3 in subjects are assigned to the blue partition. This partitioning may not produce the best query runtimes as the underlying assumptions about IRIs might not be true in practice [4].
Graph-Based Partitioning: It makes use of the graph-based clustering techniques to split a given graph into the required pairwise disjoint sub-graphs. There are three techniques used in this category: Recursive-Bisection Partitioning. This technique is used to solve the k-way partitioning problem, as presented in [61]. It works in three steps, i.e. Coarsening, Partitioning and Uncoarsening. In Coarsening: smaller graphs are generated by the input graph. The partitioning phase computes the 2-way partition, which divides the vertices into two parts. In the last phase of Uncoarsening, partitions of a graph are projected back to an original graph. This technique is shown by the example given in Figure 12, triples 1, 2, 4, 7, and 8 are labelled green partition, triples 3, 5, 6, 9 and 10 are labelled with red partition, and triple 11 is labelled with blue partition.

TCV-Min Partitioning.
This technique is also used to solve the problem of k-way graph partitioning with the main objective is to minimize the total communication volume [18] of the partitioning. This technique also comes with phases. But the second phase, i.e. the partitioning, is about the minimization of communication costs. This is shown by the example given in Figure 12, triples 1, 2, 4, 5, 6, 8 and 9 are labelled with the green partition, triples 3, 7 and 10 are labelled with the red partition, and triple 11 is labelled with the blue partition.
Min-Edgecut Partitioning. The Min-Edgecut [61] is also used to solve the problem of k-way graph partitioning. However, unlike TCV-Min, the focus is to partition the vertices by minimizing the number of edges connected to them. This is shown by an example given in Figure 12, triples 1, 2, 4, 7 and 8 are labelled with green the partition, triples 3, 5, 6, 9 and 10 are labelled with the red partition, and only triple 11 is labelled with the blue partition. The graph-based partitioning techniques are computationally complex and may take very strong resources for splitting big RDF datasets. Vertical Partitioning: The vertical partitioning is already discussed in section 4. This technique generally divides the given RDF dataset based on predicates 17 . It creates n number of two columns tables, where n is the number of distinct predicates in the dataset. Please note that it breaks the triples and only store the subject and objects parts. The predicate becomes the caption of the table. Ideally, the number of distinct partitions would be equal to the number of distinct predicates used in the dataset. However, it is possible that the required number of partitions may be smaller than the number of predicates used in the dataset. In this case, the predicate tables are grouped into partitions, i.e., multiple predicate tables are stored in partitions. There can be multiple ways of grouping predicates into partitions: (1) first come, first serve, (2) by looking at the number of triples per predicate and thus group predicates such that maximum load balancing is achieved among partitions, (3) using some intelligence to determine which predicates will be queried together, and hence grouped their corresponding triples in one partition.

STATE-OF-THE-ART RDF ENGINES
Now we present state-of-the-art centralized and distributed RDF Engines. The goal is to provide a broader overview of these engines and classify them according to the previously discussed data storage and partitioning, indexing, language, and query processing techniques. Summary of these characteristics in centralized and distributed RDF engines is shown in table 9 and 10.

Centralized RDF Engines
Redland [12] is a set of RDF libraries for storing and querying RDF data that can be used by other RDF-based applications. It provides a TT like storage based on creating three hashes -SP2O, PO2S, SO2Pper RDF triple, where S, P, O stands for Subject, Predicate and Object respectively. 17 Division by subject and object is also possible but not common in RDF partitioning.
Each hash is a map of a key to a value with duplicates allowed. The first two characters represent the hash key, and the last character represents the value. For example, in SP2O, the key for this hash is subject, and predicate and value is the object. These hashes also serve as three triple-permutation indexes. The hashes can be stored either in memory or on persistent storage. It creates an RDF model which can be queried with SPARQL and RDQL using the Rasqal RDF query library 22 .
Jena1 [73] uses relational databases to store data as TT called statements tables. Jena1 statement table has entries for subject, predicate, objectURI, and objectliteral. URIs and Strings are encoded as IDs, and two separate dictionaries are created for literals and resources/URIs. This scheme is very efficient in terms of storage because of the one-time storage of multiple occurrences of URIs and literals. However, the query execution performance is greatly affected by the multiple self joins as well as dictionary lookups. The indexing, query processing depends upon the relational DBMS (e.g., Postgresql, MySQL, Oracle) it uses for storage. RDQL is used as a query language that is translated into SQL to be run against the underlying relational DBMS.   p4  p1  p2  p4  p1  p2  3   p4  p1  p2  3   p4  p1  p2  3  3  11  2  p2  11  2 p2  11  2 p2  11  2 p2  9  p2  5  9  p2  5  9  p2  5  9  p2  5  p3  p3  p3  p3  p2  p2  p2  p2  10  p1  10  8  p1  p1  8  p1  p1  10  8  p1  p1  10  8  p1   4  p4  13  4  p4  13  4  p4  13  4  p4  13 Basic RDF Graph Range-based Subject-Hash-based Predicate- Hash-based   1  p3  1  p3  1  p3  1  p3   p4  p1  p2  3   p4  p1  p2  3   p4  p1 p2  3   p4  p1  p2  3  11  2  p2  11  2  p2  11  2 p2  11  2 p2  9  p2  5  9  p2  5  9  p2  5  9  p2  5  p3  p3  p3  p3  p2  p2  p2  p2   p1  10  8  p1  p1  10  8  p1  p1  10  8  p1  p1  10  8 (T = Triple Table, P = Property Table,  Sesame [17] is an architecture that allows persistent storage and querying of RDF data. Sesame provides storage-independent solutions, and hence it can be deployed on top of a variety of storage devices such as relational DBMS and Object-oriented databases. The querying is based on RQL language. The storage, indexing, and query processing are based on the underlying DBMS used by the Sesame. Sesame has been renamed as Eclipse RDF4J 27 with improved functionality such as both in-memory and persistence data storage, SPARQL and SeRQL support etc. 3store [43]  The boolean value if a literal is used in the triple, and (3) and the boolean value if this triple is inferred. The models, resource, and literal are two columns tables to map hash IDs to models, resources, and literals, respectively. As 3store uses MySQL as its backend, it depends upon the MySQL built-in query optimizer to effectively use its native indexes. The query processing is based on the translating RDQL to SQL. Jena2 [122] is an improved version of the Jena1. It has the support both for statement and property tables. Unlike Jena1, the schema is denormalized, and URIs, simple literals are directly stored within the statement table. Jena2 also makes use of the dictionary tables only for strings whose lengths exceed a threshold. The advantage of directly storing URIs and literals into TT is that the operation of the filter can be directly performed on TT, thus avoiding extra dictionary lookups. The disadvantage is that it also results in higher storage consumption since string values are stored multiple times. The use of property tables for most frequently used predicates can avoid the multiple self joins in statement tables.  [70] was designed to store ontology information and instance data. The storage is based on multiple relations, stored in relational IBM DB2 DBMS. Five two-column tables were used to store Ontology-related data. Similarly, another five two-columns tables were used to store instance-related data. RStar Query Language (RSQL) was defined for resource retrieval. RStar performs translation of RSQL into the SQL of an underlying database. Query engine of RSQL pushes many tasks to the underlying database to take advantage of built-in query evaluation.
BRAHMS [54] main memory-based RDF engine. The RDF data is store in three hash tables s-op, o-sp, p-so. This RDF engine was designed to find semantic associations in large RDF datasets. This was done by leveraging the depth-first search and breathfirst search algorithms.
Yet Another RDF Store (YARS) [46] stores RDF data in the form of quads having four columns in the disk. In addition to subject, predicate, object, it also stores the context of the triple. YARS implements two kinds of indexes: (1) Lexicons indexes, which operate on the string representation of RDF graph nodes to enable fast retrieval of object identifiers, (2) Quad indexes, which are a set of permutation indexes applied on a quad (subject s, predicate p, object o and context c). They are six -SPOC, POC, OCS, CSP, CP, OSquad indexes used in the YARS. YARS uses Notation3 (N3) to query RDF data. The query processing is based on index lookups and joins operations. The join ordering (lowest cardinality join should be executed first) is based on getting the actual cardinalities using getCountQ query.
Kowari [123] is a metastore built to provide scalable, secure transactions, and storage infrastructure for RDF data. Kowari makes use of the persistent quad-storage by using XA Statement Store. Along with subject, predicate, object, it also stores additional meta node with each triple. The meta node illustrates in which model this statement occurs. Multiple quad-permutation indexes are implemented. Same like Mulgara, Kowari also implements six indexes which are stored as an AVL tree and B-Trees. This combination enables fast, simple searching and modification. The query processing is based on iTQL. Kowari query execution plan is also based on cardinality estimations. Once the initial execution plan is complete, the query engine starts to merge the results using join operations. During this process, the query engine may find a more efficient alternate execution plan from the sizes of intermediate results. If this happens, the query engine will further optimize the execution plan, and complete results are returned to a client after the final join operation.
RDF SW-Store [2] is the example of vertical partitioning: the data is divided into n two columns (subject, object) tables, where n is the number of distinct predicates in the dataset. In addition to natural indexing by a predicate, each of the n tables is indexed by subject so that particular subjects can be retrieved quickly from the tables. The implementation of SW-Store relies on a column-oriented database system C-store [112]. SW-Store uses Jena ARQ to translate SPARQL queries into SQL. For triple patterns with bound predicates, only one table is consulted to get the required results. Fast merge-join operations are exploited to collect information about multiple properties for subsets of subjects.
Hexastore [119] proposes an RDF storage scheme that uses the triple nature of RDF as an asset. In this approach, a single giant TT is indexed in six possible ways -SPO, sop, pso, pos, osp, ops. In this storage technique, two vectors are associated (ref. Figure 8) with each distinct instance of the subject, predicate or object. This format of extensive indexing allows fast query execution at the price of a worst-case five-fold increase in index space as compared to a single TT. Hexastore uses SPARQL as its query language. The query processing is based on using the appreciate index for the different types of triple patterns (ref. Figure 6). Since all vectors store elements in sorted order, a fast merge-joins can be used to integrate the results of the different triple patterns. RDFJoin 32 contains three types of tables, namely the URI conversion tables, TT, and join table. The URI conversion tables are just like dictionary encoded values of the URIs. RDFJoin stores triple in three TTs, namely PSTable, POTable, and SOTable. The PSTable consists of columns for PropertyID, SubjectID and ObjectBitVector. The PropertyID and SubjectID represent the property, respectively. The subject of a triple and ObjectBitVector is a bit vector of all objects that are associated with given property and subject instances. The same explanation goes for POTable and SOTable. The PSTable is naturally ordered by the property and secondary indexed by the subject. Both property and subject make the primary key of the table, which enables to find out Object by the primary key lookups. The same explanation goes for POTable that facilitates the merge joins in the case of Subject-Subject and Object-Object joins. The Join tables store the results of subject-subject joins, object-object joins, and subject-object joins in a bit vectors. Three separate join tables, namely SSJoinTable, SOJoinTable, and OOJoinTable, are used to store the subject-subject, object-object, and subject-object joins, respectively. Each of the three join tables has three columns for Property1, Property2 and a BitVector. The first two columns make the primary key upon which hash indexed is applied to produce a corresponding BitVector. The query execution is based on SPARQL query processing with index lookups, and different join implementations such as merge join.
Parliament [65] contains three types of tables, i.e. re-source DOGMA [16] presents a graph-based RDF data storage solution. In this model, the RDF graph is represented as a balanced binary tree and store it on disk. There is no specific index needed for query processing as subgraph matching; the tree itself can be regarded as an index. However, the author has proposed two additional indexes for fast subgraph matching. DOGMA develops algorithms to answer only graph match-ing queries expressible in SPARQL, and hence it was not supporting all SPARQL queries.
TurboHOM++ [64] is another graph-based solution for running SPARQL queries. However, unlike DOGMA, it is an in-memory solution in which both RDF graph and SPARQL query is represented as specialized graphs by using type-aware transformation, i.e., it makes use of the type information specified by the rdf: type predicate. The sub-graph matching is performed for SPARQL query execution. This approach also makes use of the predicate index where a key is a predicate, and a value is a pair of a list of subject IDs and a list of object IDs. RDF-3X [79] is an example of exhaustive indexing over a single giant TT by using optimized data structures. Triples are stored in a clustered B+ trees in lexicographic order. The values inside the B+ tree are delta encoded [79] to further reduce the space required for storing these indexes. A triple < S; P; O > is indexed in six possible waysspo, sop, pso, pos, osp, ops -, one for each possible ordering of the subject s, predicate p, and object o. These indexes are called compressed indexes in RDF-3X. In addition, RDF-3X makes use of the six aggregated indexes -(sp, ps, so, os, po, opeach of which stores only two out of the three components of a triple along with an aggregated count which is the number of occurrences of this pair in the full set of triples. The aggregated indexes (value1, value2, count) are helpful for answering SPARQL query of type "select?a?cwhere?a?b?c ". Finally, three one value indexes (value1, count) are also maintained for storing the count of the s, p, and o. Thus, altogether, 15 indexes are created in RDF-3X. SPARQL query processing and the selection of an optimized query execution plan is based on a cost model which calculates the cost of the different plans and select the plan with minimum cost.
RIQ [62] is based on the storage of RDF data as a new vector representation. This is achieved by applying different transformations related to a triple in RDF graph and a triple pattern in a BGP. RIQ also maps a graph with context c to pattern vector (PV). This is done by a hash function based on a Rabin fingerprinting technique [89]. New vector representation helps to groups same RDF graph. In spite of the whole RDF graph, these similar groups are indexed in RIQ. A novel filtering index called PV-Index is used for this purpose. This index is used to identify candidate RDF graph early. It is a combination of a Bloom filter and a Counting Bloom Filter for compact storage. RIQ employs a divide and conquers approach for SPARQL query processing. This approach works on the identification of early groups by PV-Index. After the identification, different optimizations are applied, and the SPARQL query is rewritten and executed on the SPARQL processor, which support quads to represent the final output.
Stardog 33 storage is based on the RocksDb 34 , a persistent keyvalue store for fast storage. It supports the triple-permutation indexes where triples are represented as quads; thus, the fourth element of the quad (i.e., the context) is also indexed. It supports SPARQL 1.1, full-text search through Lucene and ACID transactions. The query planning is based on index search with support for different types of joins such as hash join, bind join, merge join etc.
gStore [134] is a graph-based engine that stores RDF triples in the form of a directed, multi-edge graph. The subject and object of a triple are represented by graph nodes, and the corresponding predicate represents a directed link from a subject node to an object node. Multi edges between two nodes can be formed if there exists more than one property relation between the subject and object. The RDF graph is stored in an adjacency list table and encoded into a bitstring, also called vertex signature. The bitstring encoding of the graph is being done by using different hash functions [24]. gStore uses two trees for indexing: (1) a height-balanced S-tree [27], and (2) VS-tree (vertex signature tree). The S-tree index can be used to find attribute values in the adjacency list specified in a query. But it is not able to support multi-way joins over attribute values. To solve this problem, gStore makes use of VS-tree. Same like graph encoding, a SPARQL query is also represented as a signature graph called query signature. Consequently, query execution problem is reduced to the sub-graph matching problem where a query signature graph is matched against the graph signature.
Strabon [67] is used for storing and querying geospatial data, which exists in the form of stRDF, originally presented in [66]. Strabon is built upon Sesame, also known as RDF4J. Sesame is chosen because it is an open-source nature, consists of layered architecture, offers a wide range of functionalities and it has the ability to have PostGIS, which is a database management system with spatial features. Strabon has three components which consist of the storage manager, the query engine and PostGIS. The storage manager of Strabon makes use of a bulk loader to store stRDF triples. These triple are stored in a storage scheme of Sesame, "one table per predicate" and dictionary encoding. For each of predicate and dictionary tables, Strabon creates a two B+ tree two-column indexes and a B+ tree index on the id column 33 https://www.stardog.com/docs/7.0.0/ 34 RocksDB: https://rocksdb.org/ respectively. Strabon has a new extension of SPARQL query language, stSPARQL [66] to query stRDF based datasets. The query engine of Strabon performs the query execution. It consists of a parser, an optimizer, an evaluator and a trans-action manager. The parser and the transaction manager are the same as in a Sesame. An optimizer and an elevator are modified in Strabon. The parser generates an abstract syntax tree, which becomes a query tree by mapping into the Sesame's algebra. Then optimizer takes the query tree and applies different optimizations, and then an evaluator produces the corresponding SQL query. This SQL query will be executed on PostgreSQL.
BitMat [10] as an example of binary data storage which makes use of the three-dimensional (subject, predicate, object) bit matrix which is flattened to two dimensions for representing RDF triples. In this matrix, all the values used are either 0 or 1, representing the absence or presence of that triple. BitMat further compresses the data on each row level. In particular, BitMat creates three auxiliary tables to get mappings of all distinct subjects, predicates, and objects to the sequence-based identifiers. Bitwise AND/OR operators are used to process join queries.
Triplebit [130] stores RDF data in the form of a bit matrix storage structure [10] and applies an encoding mechanism to compress huge RDF graphs. The RDF triples are represented as a two-dimensional bit matrix. The columns of the matrix represent triples, with only bit value entries for subject and object of the triple. Each row is defined by a distinct entity value which represents a set of triples which contain this entity. TripleBit vertically partitions the matrix into multiple disjoint buckets, one bucket per predicate. Two indexes are used in TripleBit, namely ID-Chunk bit matrix and ID-predicate bit matrix. The former supports a fast search for finding the relevant chunks for a given subject or object. The later provides a mapping of a subject or an object to the list of corresponding predicates to which it relates. A dynamic query planning algorithm is used to generate the optimized SPARQL query execution plans, with the aim of minimizing the size of intermediate results as early as possible. Merge joins are used extensively in the generated optimized query execution plans. DB2RDF [15] uses a relational schema consisting of property tables to store RDF data. The storage is based on the encoding scheme which encodes a list of properties for each subject in a single row. DB2RDF does not create any specific index over triple < s; p; o >. However, the property tables can be naturally regarded as a subject-based index to quickly locate all the properties and the corresponding objects for the given subject instance. DB2RDF performs query optimization in two steps. In the first step SPARQL is optimized, and in the second step SPARQL to SQL translation is optimized.
RDFox [77] is an in-memory RDF engine that supports materialisation-based parallel datalog reasoning and SPARQL query processing. The RDF triples are stored in TT. Three different types of indexes are created over the TT, namely ThreeKeysIndex, TwoKeysIndex, and OneKeyIndex. These indexes are used to efficiently answer the different types of triple patterns (ref. Figure 6). For example, the ThreeKeysIndex can be used for answering triple patterns containing bound subjects, predicates, and objects (i.e., variable-free patterns). The TwoKeysIndex can be used for answering triple patterns containing one or two variable. Each TwoKeysIndex contains an OneKeyIndex that is used to locates the first triple in the relevant list with the given resource ID. RDFox answers SPARQL queries by using its querying package, which first parses the SPARQL query int a query object. Then, the SPARQL compiler converts the query object into a TupleIterator that provides iteration over the answers. RD-Fox contains many different TupleIterator variants such TableIterator supports iteration over SPARQL triple pat-terns, DistinctIterator implements the "DISTINCT" con-struct of SPARQL etc. AMBER (Attributed Multigraph Based Engine for RDF querying) [52] stores all of its RDF data in the form of multi-graph. It creates three different dictionaries of key-value pairs, namely a vertex dictionary, an edge-type dictionary and attribute dictionary. Same like three dictionaries, AMBER creates three indexes: first is an inverted list to store the set of data vertex for each attribute, second is a trie index for storing features of the data vertices, and third is also a trie index structure for storing information of neighbours of the data vertices. SPARQL query in AMBER is transformed into a query multigraph and then applied on the data multi-graph. Results are obtained by using the subgraph matching mechanism of the query and data multigraphs.
Tentris [13] is tensor-based, in-memory RDF engine that makes use of the graph data structure dubbed hypertrie for representing RDF data. This Trie is also used as an index; therefore, no additional indexing is maintained. The SPARQL query processing is performed by using tensor by mapping SPARQL operations to Einstein summation.

Distributed RDF Engines
The distributed RDF engines can be divided into four broader categories : (1) No-SQL-based, (2) Hadoop/Spark-based, (3) Distributed memory-based, and (4) others, e.g., MPI-based, Graph-based. Our discussion is focused towards the storage, indexing, query processing, and partitioning in state-of-the-art distributed RDF engines.
AllegroGraph 40 can be used to store and query both documents (e.g. JSON) and graph data (e.g. RDF). The data is horizontally distributed, also called shards. It uses efficient memory management in combination with disk-based storage, enabling it to scale up to billions of quads. For RDF data, it makes use of the triple-permutation indexes along with named graph and triple id. The default indexes are: are: spogi, posgi, ospgi, gspoi, gposi, gospi and i, where s is the subject, p is the predicate, o is the object of a triple pattern, g is the named graph, and i is the unique ID of the triple. Query processing is based on two components 41 . The first is a simple static query analyzer, which indicates the best indexes to be used during query processing. The second one is a dynamic query analyzer, which processes the query and determines which indexes are used. There is a trade-off in the use of the dynamic query analyzer: by providing better information, it takes a much longer time processing the query. Blazegraph 42 , formerly known as BigData, is an open-source graph database system which supports both RDF and SPARQL. Blazegraph supports both row and column data storage models, which can be saved both in memory or on a disk. Indexing is dependent on the triples or quads or triples  (T = Triple Table, P = Property Table,  with provenance. Based on the type of triple Bigdata creates three or six key-range partitioned B+ tree indexes. Indexing in Bigdata is like in YARS. Dictionary encoding with 64bit integers is used for a compressed representation of RDF triples. The query optimizer of Bigdata generates optimized query execution plans according to two different approaches: the default approach uses static analysis and fast cardinality estimation of access paths, the second approach uses runtime sampling of join graphs. In remove the scaling limit, the Blazegraph employs dynamically partitioned key-range shards. YARS2 (Yet Another RDF Store, Version 2) [47] data storage is based on RDF quads (subject, predicate, object, context). The main focus of YARS2 is on the distributed indexing. The index manager in YARS2 uses three indexes, namely keyword index, quad index and join index for evaluating queries.
The keyword index is used for keyword lookups. The quad indexes are the triple permutation indexes, applied on YARS2 RDF quads. The join index helps speeding up query execution, containing certain combinations of values, or paths in the graph. YARS2 also uses SPARQL as its query language. YARS2 performs the distributed query execution by using optimized methods for handling the network traffic and avoiding memory overheads, index loop joins, and dynamic programming for joins ordering. Finally, a hash-based data placement used for quads indexing. SHARD [91] is a Hadoop-based distributed RDF engine, which employs hash-based data distribution. It stores RDF data in flat files on HDFS in a way that each line presents all the triples associated with a subject resource of the RDF triple. The input dataset is hash partitioned so that every partition contains the distinct set of triples. There is no specific indexing employed in SHARD. Query execution is performed through MapReduce iterations: first, it collects the results for the sub-queries, following by the joins, and finally, filtering is performed for bounded variable and to remove the redundant results.
4store [44] stores data in a quad (model, subject, predicate, object), where model represents the graph in which triples are saved. The partitioning of data among n partitions is exactly the same as a subject-based-hashed partition. However, RIDs (Resource IDentifiers) [121] values of the triple sub-jects are used instead of using hashing function over the subject component of the triple. The allocation of triples to a specific partition is based on the formula defined as n = RID(subject) mod n. Three different types of indexes are used in 4Store, namely R, M, and P indexes. RDF Re-sources, i.e. URIs, Literals, and Blank Nodes are represented as triple (rid, attr, lexical value), and stored in a bucketed, power-of-two sized hash table, known as the R Index. The M index is created over named graphs or model. It is basically a hash table which maps named graphs to the corresponding triples in the named graph. The P Indexes (two for each predicate) corresponds to the predicates and consist of a set of radix tries [74], using a 4-bit radix. The query engine is largely based on Relational Algebra. The join order is based on the cardinality estimations.
Virtuoso [31] stores RDF data as a single table with four columns: Subject (s), Predicate (p), Object (o), and Graph (g), where s, p and g are IRIs, and o may be of any data type. The Virtuoso makes use of the dictionary encoding for IRI's and literals used in the triples. It creates a fewer number of indexes. The gspo is the default index used in the Virtuoso. In addition, it creates an auxiliary bitmap index opgs. Query execution is based on the translation of SPARQL queries into SQL to be executed on the underlying database backend. The optimized query plan generation is based on the cost estimations functions. The Virtuoso partition's data using subject-hash-based partitioning.
The Header Dictionary Triples HDT [33] is a compressed representation of RDF data, which consists of three components : (1) a Header which contains metadata about the dataset, (2) a Dictionary which encodes strings into integers, and (3) Triples which are encoded in a binary format. For each distinct subject in the dataset, a binary tree is created, which consists of three levels. The subject is the root of the tree, followed by the list of predicates and objects containing the given subject. Thus, the HDT data structure is naturally indexed by subject. The triple patterns with bound subjects can be directly answered from the binary tree corresponding to the given bound subjects. Additionally, an object-based index is created to facilitate executing triple patterns with bound objects. The join between triple patterns is performed by using merge and index joins. The HDT framework includes distributed query processing tools using Hadoop as well.
H-RDF-3X [51] is a Hadoop-based distributed RDF engine that uses RDF-3X on a cluster of machines. To reduce the communication cost and speed up the query processing, it also leverages n-hop guarantees data replication at workers nodes. A graph-based partitioning (based on METIS) is used to distribute triples among multiple worker nodes. The query execution is based on the RDF-3X and Hadoop, which utilizes the RDF-3X's internal statistics to generate efficient query plans. Please note that the presented architecture is flexible and can be replaced with any other centralized RDF engine.
GraphDB storage 43 is packaged as Storage and Inference Layer (SAIL) component for the RDF4J framework 44 . The Sail API is a set of Java interfaces that support RDF storing in different mechanism, e.g., using relational databases, file systems, in-memory storage etc. GraphDB contains many automatic indexes apply to implicit or explicit triples. Two main triple-permutation indexes, namely Predicate, Object, Subject (POS) and Predicate, Subject, Object (PSO) are created in GraphDB. Furthermore, it also creates an optional PSCO and POCS indexes, where C stands for the context. Finally, it also creates a literal index for efficient data storage and lookup. The SPARQL query execution is based on RDF4J framework. The GraphDB EE is a master-slave-based distributed RDF engine, where each cluster has at least one master node that manages one or more worker nodes, each hosting the full database copy.
CumulusRDF [45] works on top of the Apache Cassandra, an opensource nested key-value store. Two data storage layouts, namely hierarchical layout and flat layout, are implemented in CumulusRDF. The hierarchical layout is based on Cassandra's super columns. The flat layout is based on a standard key-value storage model. Three indexes, namely SPO, POS, OSP, are created to satisfy all six RDF triple patterns.CumulusRDF also supports a new index CSPO, for RDF named graphs by storing quads [1]. There are no dictionaries created in CumulusRDF. The distribution of data among multiple hash tables is based on hashing RDF triples. For query processing, Sesame processor translates the queries into index lookups of Cassandra.
Rapid+ [90] is a Pig-based system that uses a Vertical Partitioning (VP) strategy for storing RDF data. The SPLIT command of Pig is used to store data in VP tables. The main goal of the presented approach is to minimize the communication and I/O costs during the map-reduce operations while processing SPARQL queries on top of RDF in the Map-Reduce frameworks. To achieve this goal, an intermediate algebra called the Nested TripleGroup Algebra (NTGA) was introduced in combination with a predicate-based indexing structure. The NTGA is used for efficient translation of SPARQL queries into PigLatin operations.
Jena-HBase [63] is an HBase backed RDF engine that can be used with the Jena framework. The Jena-HBase supports A variety of custom-built RDF data storage layouts for HBase, namely Simple, Vertical Partitioned, Indexed, Hybrid and Hash over several HBase tables. The indexes used in Jena-HBase depend on the storage layout. Experimental evaluation shows the superiority of Hybrid layout, which leverages the advantages of multiple layouts. For querying, Jena-HBase has query runner block that provides APIs for different storage layouts.
Rya [87] employs Accumulo, a key-value and column-oriented store as its storage backend. However, it can also use any other NoSQL database as its storage component. Rya stores three index permutations, namely SPO, POS, and OSP. The query processing is based on the OpenRDF Sesame SAIL API. In particular, index nested loop joins are used for SPARQL query execution using MapReduce fashion. The count of the distinct subjects, predicates, and objects is maintained and used during joins reordering and query optimization. These counts are only updated in case of data distribution changes significantly.
AMADA [9] is implemented on Amazon Web Services (AWS) cloud infrastructure. This infrastructure supports storage, indexing and querying as software as a service (SAAS). The indexes are built using SimpleDB, and key-value storage solution which supports SQL queries. SimpleDB supports several indexing strategies. One in them is attribute indexing which creates three indexes for three elements of a triple. In AMADA, the query is submitted to a query processor module running on EC2. After that, indexes are lookup in SimpleDB to find the answers related to a query. Results are written in a file stored in S3, whose URI is sent back to the user to retrieve the query answers.
H2RDF [84] stores data as multiple indexes over HBase, in combination with the Hadoop framework. In H2RDF, three permutations of S, P, O (SPO, POS, and OSP) are created over the HBase tables in the form of key-value pairs. The query planning is based on the selectivity information and cost estimation of joins.
The Semantic Hash Partitioning-Enabled distributed RDF data management system (SHAPE) [68] uses RDF-3X to store RDF triples across nodes of a cluster. Distribution of triples is done by using semantic hash partitioning based on the URI hierarchy: triples with the same Subject or Object are identified and are placed in the same partition. Same like RDF-3X, it creates indexes covering all possible permutations of Subject, Predicate, and Object, which are stored as clustered B+ trees. Distributed query execution planner coordinates the intermediate results from different client machines, which are loaded into HDFS and joined using MapReduce Hadoop joins.
Workload-Aware Replication and Partitioning (WARP) [50] Also uses RDF-3X to store triples in partitions among a cluster of machines. The distribution of triples among clusters is based on a given query workload containing a set of SPARQL queries. To reduce the inter-cluster communication cost, WARP supports n-hop guarantee replication. WARP also uses the underlying RDF-3X indexes. Multi-pass queries (MPQ) consisting of many triple patterns are converted into one pass queries (OPQ). This conversion is based on the replication of triples at the border of partition. For each of one pass queries, query optimizer creates an execution plan made up of left deep joins tree. One pass query is executed in parallel by all the slave machines, and results are combined using merge joins.
PigSPARQL [103] uses VP for storing triples. Due to no indexing, PigSPARQL is well suited for "Extract, Transform, Load" like scenarios. The main component of the PigSPARQL is a translation of SPARQL queries into Pig Latin [80] scripts on Apache Pig. The translation is based on the algebraic representation of SPARQL query expressions. In the first step, an abstract syntax tree is generated from the SPARQL query using the Jena ARQ framework. The abstract syntax tree is then translated into the SPARQL algebra tree. At this point, based on selectivity information, optimization for filters and rearrangement of triple patterns are applied. Then this optimized algebra tree is traversed bot-tom up to generate PigLatin expressions for every SPARQL algebra operator. After this, these PigLatin expressions are mapped into MapReduce iterations.
Entity Aware Graph compREssion technique (EAGRE) [133] stores RDF data in HDFS in a (key, value) fashion that preserves both semantic and structural information of the RDF graphs. It employs a graph-based data partitioning using Metis [60]. EAGRE adopts a space-filling curve technique, an inmemory index structure which is used to index high dimensional data to efficiently accelerate the evaluation of range and ordersensitive queries. The SPARQL query processing in EAGRE is completed using worker nodes. The I/O cost is minimized by using efficient distributed scheduling approaches.
H2RDF+ [83] uses HBase 45 tables to store RDF data, dictionaries, and indexes. It makes use of the distributed MapReduce processing and HBase indexes. The cost of the different query joins estimated by using the stored statistics in the indexes. H2RDF+ adaptively chooses for either single-or multimachine execution based on the estimated join costs. It makes use of the distributed MapReduce based implementations of the multiway Merge and SortMerge join algorithms. Trinity.RDF [132] has been implemented on top of Trinity [107], a distributed memory-based key-value storage system. It stores RDF data in the native graph from which is partitioned across multiple clusters by applying hashing on nodes. The distributed SPARQL query processing is managed by using graph exploration, assisted by predicates and triple permutation indexes (which are like PSO or POS). D-SPARQ [76] is a distributed RDF engine implemented on top of MongoDB, a NoSQL document database. D-SPARQ takes RDF triples as an input and makes a graph. The graph is then partitioned among nodes of a cluster such that triples with the same subject are placed in the same partition. In addition, partial data replication is then applied where some of the triples are replicated across different partitions for parallel execution of queries. Grouping the triples based on the same subject enables D-SPARQ to efficiently execute star patterns queries (i.e., subject-subject joins). D-SPARQ indexes involve subject-predicate (sp) and predicateobject (po) permutations. For query execution, D-SPARQ employs a selectivity of each triple pattern to reorder individual triple pattern in star pattern queries.
CliqueSquare [35] is a Hadoop-based RDF engine used to store and process massive RDF graphs. It stores RDF data in VP tables fashion using semantic hash partitioning. The goal is to minimize the number of MapReduce jobs and the amount of data transfer. In CliqueSquare, the objective 45 https://hbase.apache.org/ 22 of the partitioning is to enable co-located or partitioned joins. In these joins, RDF data is placed in such a manner that the maximum number of joins are evaluated in a map phase of the map-reduce system. In addition, CliqueSquare also maintains three replicas for fast query processing and increased data locality. In order to apply SPARQL queries on RDF dataset, the CliqueSquare uses a clique-based algorithm, which divides a query variable graph into clique sub-graphs. This algorithm works in an iterative way to identify cliques and to collapse them by evaluating the joins on the common variables of each clique. The process ends when the variable graph consists of only one node.
TripleRush [114] is based on the Signal/Collect [113], a parallel and distributed graph processing framework. In TripleRush, three types of vertices exit namely the triple vertices, the index vertices, and the query vertices. The triple vertices correspond the RDF triples with each vertex contains subject, predicate, and object of the RDF triple. The index vertices correspond to the SPARQL triple patterns and used for matching triple patterns against the RDF triples. The query vertices coordinate the query execution. In TripleRush, index graph is formed by index and triple vertices. A query execution is initialized when a query vertex is added to a TripleRush graph. Then a query vertex emits a query particle which is routed by the Signal/Collect to index vertex for matching. There is no partitioning performed in this framework. chameleon-db [8] an example of workload-aware RDF graph partitioning and distributed RDF engine. It stores RDF data as graph and SPARQL query is represented as graph as well, thus the query execution is reduced to sub-graph matching problem. The query execution is optimized using workload-aware partitioning indexing technique, which an incremental indexing technique using a decision tree to keep track of the relevant segments of the queries. It also uses a vertex index which is a kind of hash table containing URIs of vertices in a subset of partitions. It also maintains a range index to keep track of the minimum and maximum literal values.
Partout [34] also uses RDF-3X for storing RDF triples which are partitioned by using Workload-aware partitioning technique. The goal of the partitioning to group all RDF triples in a same partition which are likely to be queried together in SPARQL queries. RDF-3X in Partout creates indexes covering all possible permutations of Subject, Pred-icate, and Object, which are stored as clustered B+ trees. The SPARQL query is issued at a server which generates a suitable query plan according to RDF-3X execution model by utilizes a global statistics file, which contains information about partitions definition, size, and mapping to a client. The centralized execution plan of RDF-3X is transformed into a distributed plan, which is then refined by a distributed cost model to resolve the triples locations in a cluster. The refined query plan is executed by client machines in parallel, and the final results are joined by Binary Merge Union (BMU) operation.
Sempala [104] uses Parquet 46 (a columnar storage format for Hadoop) for storing RDF triples. The Parquet is designed for supporting a single wide table, and thus Sempala uses a single Unified Property Table (UPT) for string complete RDF triples. Sempala does not maintain any additional index. The query execution is performed by translating SPARQL queries into SQL, which is executed on top of UPT by using Apache Impala.
Triple Asynchronous and Distributed (TriAD) [40] implements the main memory, master-slave, and shared-nothing architecture based on asynchronous Message Passing protocol. It is based on classical master-slave architecture in which query processing is performed by the slave nodes using autonomous and asynchronous messages exchange. It implements a graph summarization approach for data storage and to facilitate join-ahead pruning in a distributed environment. It creates indexes both at master and slave nodes. The summary graph is indexed in the adjacency list format in large inmemory vectors. A hash partitioning on PSO and POS permutations is applied to distribute data among slave nodes. The slave nodes create six in-memory vectors of triples received from the master node. These are divided into two parts, namely Subject-based and Object-based. Subject-based are SPO, SOP, PSO while object-based is OSP, OPS, POS. The SPARQL query execution is based on the stored indexes, dynamic programming, and distributed cost model. SparkRDF [20] is a Spark-based RDF engine which dis-tributes the graph into multi-layer elastic subgraphs (MESG) by extending vertical partitioning. MESGs are based on the classes and relations to reduce search space and memory overhead. SparkRDF creates five indexes, i.e., C, R, CR, RC and CRC, where C stands for Class subgraphs, and R denotes Relation subgraphs. These indexes are modelled as Resilient Discreted SubGraphs (RDSGs). For SPARQL query processing, SparkRDF employs Spark APIs and an iterative join operation to minimize the intermediate results during subgraph matching.
SemStore [124] follows a master-slave architecture having one master node and many slave nodes. It uses a centralized RDF engine Triplebit as its storage component. The partitioning of data in SemStore is based on creating Rooted Sub-Graphs (RSGs) to effectively localize SPARQL queries in particular shapes, e.g., star, chain, tree, or a cycle that captures the most frequent SPARQL queries. It implements a hash function to assign RSGs to slave nodes. In order to reduce redundancy and localize more query types, a k-mean partitioning algorithm is used to assign RSGs to cluster nodes. After partition, the data partitioner creates a global bitmap index over the vertices and collect the global statistics. The slave nodes build local indexes and statistics to be used during local join processing. The SPARQL query is submitted to a master node which parses the query and decides the query as local or distributed. In case of a distributed query, a query is distributed among slave nodes as subqueries and executed on the bases of local indexes and statistics. DREAM [41] uses RDF-3X single node RDF engine as its storage component. It replicates an entire dataset on every node with RDF-3X installed. This replication causes storage overhead but avoids inter partition communication among cluster nodes. DREAM creates indexes covering all possible permutations of Subject, Predicate, and Object, stored as clustered B+ trees. In the query execution phase, the given SPARQL query is first represented as a directed graph, which is divided into multiple sub-graphs to be distributed among the compute nodes. The results of the query sub-graphs are combined by using a hash-based algorithm.
DiploCloud [126] makes use of three data structures 23 namely the molecule clusters, template lists, and a molecule index. The molecule clusters create RDF subgraphs, co-locating semantically related data to minimize inter-node operations. The template lists are used to store RDF literals in lists. The molecule index is used to index URIs in the molecule clusters. The storage is based on a single node system called dipLODocus [128]. DiploCloud is based on master-slave architecture. The master node contains the key index encoding of URIs into IDs, a partition manager and distributed query executor. The worker node contains a type index, local molecule cluster and a molecule index. In SPARQL query execution, the master node distributes the subqueries among slave nodes, which run subqueries and return intermediate results to a master node. It makes use of the different partitioning strategies like Scope-k Molecules, Manual Partitioning and Adaptive Partitioning to distribute data among the cluster.
SPARQLGX [37] vertically partitioned RDF triples by predicates and stores in Hadoop Distributed File System (HDFS) [110]. A separate file is created for each unique predicate in the RDF; thus, each file contains only subject and object. It translates SPARQL queries into Scala code which is then directly executed by using Spark.
S2RDF [105] translates SPARQL queries into SQL, which is run on Spark [131]. The storage is based on an extended version of the vertical partitioning with special cares for tuples that do not find a joining partner when performing a join between SPARQL triple patterns. It makes use of the selectivity information for query planning and joins ordering.
AdPart [42] is a distributed RDF engine that follows a typical Master-Slave architecture and is deployed on a shared-nothing cluster of machines. The master initially partitioned the data by using by hashing on the subjects of triples. The slave stores the corresponding triples using an in-memory data structure. Within the worker node, AdPart primarily hash triples by a predicate, leading to a natural predicate-based index (P-index). In addition, each worker maintains predicate-subject index (PS-index) and predicate-object index (PO-index). The SPARQL query processing is done by using cost estimations using stored statistics.
S2X [102] processes SPARQL queries on the top of the Spark component called GraphX [36]. The triples stored among different worker machines after applying hash-based encoding on Subject and Objects through GraphX default partitioner. It converts the RDF graph into a property graph, which is the data model used by GraphX. S2X does not maintain any indexing. For SPARQL query processing, it combines matching of graph patterns with relational style operators to produce solution mappings. Relational operators of SPARQL are implemented through the Spark API.
gStoreD [86] is based on the strategy of partitioning the data graph but not decomposing the query. It stores triples by modifying the centralized RDF engine gStore [134], which stores RDF triples in an adjacency list table and encodes them into a bitstring also known as vertex signature. There is no indexing mechanism used in gStoreD. The SPARQL query processing is based on sub-graph matching.
Wukong [109] stores RDF data in a form a directed graph on a distributed key-value store. It maintains two kinds of indexes: a normal vertex index which refers to a subject or object of a triple; and a predicate index which store all subjects and objects labelled with the particular predicate. In order to distribute SPARQL queries among cluster machines, Wukong makes use of the Remote Direct Memory Access (RDMA). Each query is divided into multiple subqueries on the basis of the selectivity and complexity of queries.
Wukong employs a graph exploration mechanism with Full History Pruning to remove intermediate results during query execution.
Semantic Analytics Stack(Sansa) [69] is a Spark-based distributed RDF data management system. It makes use of the extended Vertical partitioning (VP) of RDF data stored in HDFS. Since the data is vertically partitioned by using predicates of the triples, SANSA stack maintains a natural predicate-based index. Sparklify [111] is used as a default query engine for SPARQL-to-SQL translation of SPARQL queries into Spark SQL to be executed on top of Apache Spark SQL engine.
Stylus [49] exploits a strongly-typed key-value storage scheme called Microsoft Trinity Graph Engine [108] 47 . It makes use of a data structure called xUDT for storing a group of predicates identified by a unique id. An index which maps a given predicate to the unique id is maintained within Stylus. Another index is used to retrieve entities within triples. A random hashing is used to distribute RDF dataset over a cluster of servers. In Stylus, the SPARQL query processing is reduced to sub-graph matching problem.
Koral [56] is a distributed system which offers the facility of alternate approaches for each component in distributed query processing. It is also based on classic master-slave architecture. It implements various RDF graph partitioning strategies (hash-based and graph-based) for distributing data among computing nodes. SPARQL is used as a default query language in Koral. It is an extension of asynchronous execution mechanisms such as realised in TriAD. The query execution mechanism is independent of the underlying graph cover and can easily be replaced with another mechanism.
PRoST [25] stores RDF data both as VP and PT and hence take the advantages of both storage techniques with the cost of additional storage overhead. It does not maintain any additional indexes. SPARQL queries are translated into Join Tree format in which every node represents the VP table or PT's subquery's patterns. It makes use of the statistics-based query optimizer. The data is distributed using vertical partitioning.
WORQ [72] uses a query workload driven approach to partition the given RDF data among multiple data nodes which stores data in a Vertical partitioning manner. It uses Bloom Filters as indexes over the Vertically partitioned tables. The online reductions of RDF joins are performed by using bloom filters. In addition, caching of RDF joins reductions is performed to boost the query performance.
Anzograph 48 stores data in a graph form and entirely in memory. It can be installed on single or multiple nodes. It does not require an explicit index created by the user. SPARQL 1.1 queries are sent and received to different nodes over HTTP. Query processing follows a master-slave architecture. The query is issued at a master node which parses the query and assigns to the query planner. The query planner decides the type of join or an aggregation needed for a query. Then the code generator unites the different steps into segments, after that all the segments are packaged for query into a stream. The master sends this stream to corresponding nodes in parallel. The results are then sent back to master and integrated into a final query result set. Neptune 49 works as a graph database service to work with highly connected datasets. Neptune stores RDF graph in the form of a quad in Amazon S3. It implements three index permutations, SPOG, POGS and GPSO. In addition to three index permutations, each index permutation has 16 access patterns. Neptune stores all of its indexes in Hash tables.
Datasets in Neptune can be queried through SPARQL, Apache TinkerPop and Gremlin 50 . Neptune makes use of cardinality estimations and static analysis to rewrite queries, which will be executed through the pipeline of physical operators. In Neptune, there is no partitioning of graphs; instead, it stores 15 replicas at its master node.
HF, VF, MF [85] (Adaptive Distributed RDF Graph Fragmentation and Allocation based on Query Workload) stores RDF data in three types of fragmentation: Horizontal Fragmentation (HF), Vertical Fragmentation (VF), and Mixed Fragmentation (MF). The data fragmentation is performed on the basis of frequently accessed patterns from the query workload. It implements no indexing mechanism. SPARQL query processing is based on the efficient fragmentation across different partitions. It implements a cost-based query planner.
DiStRDF [120] stores RDF data in an encoded form using unique integer identifiers. It employs a special purpose encoding scheme [118] for the encoding of RDF data stored in CSV or Parquet. DiStRDF can handle data stored both in TT or PT. It exploits range partitioning to distribute data. For indexing, DiStRDF makes use of the predicate pushdown mechanism offered by Spark with the combination of Parquet. This mechanism helps to select the required data by exploiting filters in the query. The query processing is based on Spark. Query processing exploits the encoding scheme to choose between different query execution plans. The physical implementation of logical plans is selected on The basis of a static set of rules (rule-based optimization).
Leon [39] stores RDF data in a heuristics-based partitioning based on the characteristics set [78]. After encoding, Leon evenly distributes the triples across different partitions. It implements a bi-directional dictionary between id's in characteristics set and subjects. This dictionary is used as an index. Leon follows a master-slave architecture for query processing. By exploiting the characteristics set based partitioning, Leon minimizes the communication cost during query evaluation. DISE [53] stores RDF data as 3D tensors, a multidimensional array of ordered columns. It performs the translation of SPARQL queries into Spark tensor operation through Spark-Scala compliant code.

SPARQL BENCHMARKS FOR RDF ENGINES
This discussion is adapted from [100], which analyzed different SPARQL benchmarks used to evaluate RDF engines. The discussion is divided into two sections: first, we explain the SPARQL benchmarks key design features, and then we present an analysis of the existing SPARQL benchmarks. The overall goal facilitates the development of benchmarks for high performance in the future, and help RDF engines developers/users to select the best SPARQL benchmark for the given scenario and RDF engine.

10.1
Benchmarks Design Features SPARQL query benchmarks are comprised of three elements, i.e., RDF datasets, SPARQL queries, and performance measures. We first discussed the key features related to these elements that are vital to take into account in the development of RDF engine benchmarks. Then a high-level analysis of the datasets and queries used in the latest RDF engines benchmarks is presented. Datasets. The datasets used in RDF engine benchmarks are of two types, namely synthetic and real-world RDF datasets [98]. Realworld RDF datasets are very useful as they contain real-world information [75]. On the other hand, synthetic datasets are helpful to test the scalability under the varying size of datasets. Synthetic datasets of varying sizes are generated by different generators, which are tuned to reflect the features of real datasets [28]. In order to select a dataset for benchmarking RDF engines, two measures are proposed by [100]. These two measures are (1) Dataset Structuredness, (2) Relationship Speciality. The structuredness or coherence measures how well an instance of different dataset covers classes, i.e., rdf: type. The value of structuredness of given dataset lies between 0, which stands for the lowest possible structure and 1 point to the highest possible structured dataset. In RDF datasets, some resources are more distinguishable from others, and more attributes are more commonly associated with these resources. The count of different predicates related to each and every resource present in the dataset provides valuable insights about the graph structure of an RDF dataset and makes some resources distinguishable from others [88]. This type of relationship speciality is normal in real datasets. The relationship speciality of RDF datasets can be any positive natural number: the higher the number, the higher the relationship speciality. The value of both structuredness and relationship speciality of different datasets directly affect the result size, the number of intermediate results, and the selectivity values of the triple patterns in the SPARQL query. The formal definitions of these datasets measures can be found in [100]. It is essential to mention that besides the dataset structuredness and relationship speciality, observations from the literature [95,28] suggest that different features of RDF datasets should be considered during selection of datasets for SPARQL benchmarking. These feature for consideration are different number of triples, number of resources, number of properties, number of objects etc. SPARQL Queries. Saleem et al. [100] suggest that a SPARQL querying benchmark should change queries in terms of different query characteristics, i.e., the number of projection variables, number of triple patterns, result set sizes, query execution time, number of Basic Graph Patterns, number of join vertices, mean join vertex degree, mean triple pattern selectivities, join vertex types, and highly used SPARQL clauses etc. All of these characteristics impact on the runtime performance of RDF engines. Saleem et al. [100] proposed a composite measure containing these query features called the Diversity Score of the benchmark queries. The higher the diversity score, the more diverse the queries of the benchmark. The Diversity score is defined as follows.
Query Diversity Score. Let ui the mean and σi is the standard deviation of a given distribution w.r.t. the ith feature of the given distribution. Then the overall diversity score DS of the queries is calculated as the average coefficient of variation of all the features k in a query analyzed in the queries of benchmark B: Performance Measures. Saleem et al. [100] divided the performance metrics for RDF engines comparison into four broader categories: • Query Processing Related: These measures are related to query processing of the RDF engines, for which query execution time is the most important. However, since a benchmark usually contains many queries, re-porting of the execution time for each query is not possible. For this problem, the combined results are presented in terms of measures like Query Mix per Hour (QMpH) and Queries per Second (QpS) [100]. Overheads like intermediate results, disk/memory swaps are important to measure during the query executions [106].
• Data Storage Related: This category includes important performance measures like data loading time, acquired storage space, and size of the index. [100].
• Result Set Related: A fair comparison between two RDF engines is only possible in case of production of the same results. In this regard, measures of correctness and completeness are considered [100].
•Parallelism with/without Updates: IGUANA, bench-marks execution framework [23] is used to measure triple stores capability of parallel query processing. We now present a broader analysis of the different existing RDF engines benchmarks containing the abovementioned features.

RDF Engines Benchmarks Analysis
Saleem et al. [100] considered the benchmarks according to the inclusion criteria. The criteria include things like the benchmark is used to evaluate query runtime performance evaluation of triple stores. The criteria also include that both RDF data and SPARQL queries of the different benchmarks are publicly available or can be generated (3) the queries must not require reasoning to retrieve the complete results. RDF engines benchmarks are divided into two main categories of synthetic and real-data benchmarks:

Synthetic Benchmarks.
Synthetic benchmarks make use of the data (and/or query) generators to generate datasets and/or queries for bench-marking. Synthetic benchmarks are useful in testing the scalability of RDF engines with varying dataset sizes and querying workloads. However, such benchmarks can fail to show the features of real datasets or queries. The Train Benchmark (TrainBench) [116] covers different railway networks in terms of increasing sizes, and serialization of them in different formats, including RDF. The Waterloo SPARQL Diversity Test Suite (WatDiv) [7] comes with a synthetic real-world settings which are derived specifically from the Bologna process. Bologna mostly covers user needs in terms of analytic queries. The LDBC Social Network Benchmark (SNB) defines the Interactive workload (SNB-INT) to measure the evaluation of different graph patterns in a localized manner. This happens when the graph is being continuously updated [30]. LDBC also defines the Business Intelligence workload (SNB-BI), this workload covers queries that mix aggregations with complex graph pattern matching with, covering a significant part of the graph [117], without any updates.

Real-Data Benchmarks.
Real-data benchmarks make use of real-world datasets and queries from real user query logs for benchmarking. Real-data benchmarks are useful in testing RDF engines more closely in real-world settings. However, such benchmarks may fail to test the scalability of RDF engines with varying dataset sizes and query workloads. FEASIBLE [98] is used as a cluster-based SPARQL benchmark generator and uses different SPARQL query logs. DBpedia SPARQL Benchmark (DBPSB) [75] is another clusterbased approach based on the generation of queries from logs of DBpedia, but it has a separate technique used for clustering. Another dataset FishMark [11] dataset is obtained from FishBase 51 and provided in a variant of both RDF and SQL. In FishBase, the log if web-based FishBase application provides the SPARQL queries. The performance of biological datasets is evaluated in BioBench [125]. It also evaluates the queries from five different real-world RDF datasets 52 , i.e., Cell, Allie, PDBJ, DDBJ, and UniProt.
Basic Statistics. Table 11 shows high-level statistics of the selected datasets and queries of the benchmarks. In the case of the synthetic benchmarks, data generators, the datasets were chosen were also used in the original paper. In the case of WatDiv, DBPSB, SNB, being template-based query generators, one query per available template was selected. For FEASIBLE (a benchmark generation framework from query logs), 50 queries from DBpedia log was selected, for comparison with a WatDiv benchmark which comes with 20 basic query templates and 30 extensions for testing. 53 Structuredness and Relationship Speciality. Figure 13a contains the structuredness values of the different benchmarks. Duan et al. [28] first proposed this measure and propose that the real-world datasets are low valued than synthetic benchmarks in structuredness. The results show that most of the synthetic benchmarks contain the structureness feature. Even there are data generators, e.g. WatDiv generator, allow the users to generate a benchmark dataset of according to their need of structuredness. Figure 13bshows the relationship speciality values of different benchmarks. According to [88], it is evident that the values of relationship speciality shown by real-world datasets are higher than that of synthetic ones, as shown in figure 13b. This is also presented by average figures of (11098 vs 744). The relationship speciality values of real-world datasets are much higher than those synthetic ones like Bowlogna (8.7), BSBM (2.2), and WatDiv (22.0).

Diversity Score and Coverage.
Figure 14shows the overall diversity counts of the benchmarks. In real-data benchmarks, FEASIBLE generated benchmarks are the most diverse. While in synthetic benchmarks, WatDiv is the most diverse benchmark. Table 12 shows each benchmark with its coverage percentage of SPARQL clauses and join vertices [94]. The table also highlights the benchmarks with missing SPARQL constructs and types of overused join vertices. A lot of vital SPARQL constructs are missing from TrainBench and WatDiv queries. Every query of FishMark contains at least one Star join node. Performance Measures Table 13presents different benchmarks with performance measures to compare triplestores.
Which RDF engine is the fastest?. One of the critical questions is to know which RDF engine is the fastest in terms of query runtimes? According to our analysis, no other study compares both centralized and distributed RDF engines for their query runtime performances. The performance evaluations presented in the two most diverse SPARQL benchmarks, i.e. FEASIBLE [98] and WatDiv [6], showed that Virtuoso version 7.X is the fastest RDF engine. However, recently there are more RDF engines developed, and even existing engines are more improved in their new versions. As such, it would be interesting to perform a combined performance evaluation of both centralized and RDF engines.

RESEARCH PROBLEMS
This section illustrates the different research problems found during the study. Both centralized and distributed RDF engines lack the support for the update operation. Although in some setups it is implemented in a batch-wise manner, overall, it is immature. Indexing in RDF engines also has a lot of research potential. A number of indexes and the storage overhead caused by them require dedicated research efforts. To execute complex queries containing multiple triple patterns results in massive intermediate results. These intermediate results cause network delay and I/O communications. This problem requires efficient storage and partitioning mechanism to distribute datasets in a way to avoid inter partition communication. It is also found during the study that upcoming distributed RDF engines use different prebuilt repositories with SQL interface to store RDF data. These systems perform translation of SPARQL or other domain query language into SQL for applying queries on the repositories. This translation is the main component in their operation, but SPARQL query optimization was not their core aspect. The translation process does not take into account the different features related to the distribution of datasets among a cluster of machines. The translation process also does not take into account the other Semantic Web concepts of Ontologies and Inference etc.

CONCLUSION
This paper reviews centralized and distributed RDF engines. We review both categories in terms of their storage, indexing, language, and query execution. Due to an ever-increasing size of RDF based data, centralized RDF engines are becoming ineffective for interactive query processing. For the problems of effective query processing, distributed RDF engines are in use. Rather than relying on the specialized storage mechanism, some distributed RDF engines rely on existing single-node systems to take advantage of the underlying infrastructure. SQL is a choice for query language in most of the repositories. Rather than making capable SQL for RDF data, RDF engines translate SPARQL into other languages like SQL, PigLatin etc. This translation is an integral step in distributed setups. generator, allow the users to generate a benchmark dataset of according to their need of structuredness. Figure 13b shows the relationship speciality values of different benchmarks. According to [88], it is evident that the values of relationship speciality shown by real world datasets are higher than that of synthetic ones as shown in Figure 13b. This is also presented by average figures of (11098 vs 744). The relationship speciality values of real-world datasets are much higher than that synthetic ones like Bowlogna (8.7), BSBM (2.2), and WatDiv (22.0). Diversity Score and Coverage. Figure 14 shows the overall diversity counts of the benchmarks. In real-data benchmarks, FEASIBLE generated benchmarks is the most diverse. While in synthetic benchmarks, WatDiv is the most diverse benchmark. Table 12 shows each benchmark with its coverage percent-age of SPARQL clauses and join vertices [94]. The table also highlights the benchmarks with missing SPARQL constructs and types of overused join vertices. A lot of vital SPARQL constructs are missing from TrainBench and WatDiv queries. Every query of FishMark contains at least one Star join node. Performance Measures Table 13 presents different bench-marks with performance measures to compare triplestores.
Which RDF engine is the fastest?. One of the critical questions is to know which RDF engine is the fastest in terms of query runtimes? According to our analysis, no other study compares both centralized and distributed RDF engines for their query runtime performances. The performance evalua-tions presented in the two most diverse SPARQL benchmarks, i.e. FEASIBLE [98] and WatDiv [6], showed that Virtuoso version 7.X is the fastest RDF engine. However, recently there are more RDF engines developed, and even existing engines are more improved in their new versions. As such, it would be interesting to perform a combined performance evaluation of both centralized and RDF engines. research efforts. To execute complex queries containing mul-tiple triple patterns results in large intermediate results. These intermediate results cause network delay and I/O communications. This problem requires efficient storage and partitioning mechanism to distribute datasets in a way to avoid inter partition communication. It is also found dur-ing the study that upcoming distributed RDF engines use different prebuilt repositories with SQL interface to store RDF data. These systems perform translation of SPARQL or other domain query language into SQL for applying queries on the repositories. This translation is the main component in their operation, but SPARQL query optimization was not their core aspect. The translation process does not take into account the different features related to the distribution of datasets among a cluster of machines. The translation process also does not take into account the other Semantic Web concepts of Ontologies and Inference etc.

CONCLUSION
This paper reviews centralized and distributed RDF en-gines. We review both categories in terms of their storage, indexing, language, and query execution. Due to an ever increasing size of RDF based data, centralized RDF engines are becoming ineffective for interactive query processing. For the problems of effective query processing, distributed RDF engines are in use. Rather than relying on the specialized storage mechanism, some distributed RDF engines rely on existing single-node systems to take advantage of the un-derlying infrastructure. SQL is a choice for query language in most of the repositories. Rather than making capable SQL for RDF data, RDF engines translate SPARQL into other languages like SQL, PigLatin etc. This translation is an integral step in distributed setups.

RESEARCH PROBLEMS
This section illustrates the different research problems found during the study. Both centralized and distributed RDF engines lack the support for the update operation. Al-though in some setups it is implemented in a batch-wise manner, overall, it is immature. Indexing in RDF engines also has a lot of research potential. A number of indexes and the storage overhead caused by them require dedicated 14. REFERENCES