A Survey of RDF Stores & SPARQL Engines for Querying Knowledge Graphs

RDF has seen increased adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques and systems for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of this work is on storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion of graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. This extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.


Introduction
The Resource Description Framework (RDF) is a graph-based data model where triples of the form (s, p, o) denote directed labeled edges s –p→ o in a graph. RDF has gained significant adoption in the past years, particularly on the Web. As of 2019, over 5 million websites publish RDF data embedded in their webpages [34]. RDF has also become a popular format for publishing knowledge graphs on the Web, the largest of which -including Bio2RDF, DBpedia, PubChemRDF, UniProt, and Wikidata -contain billions of triples. These developments have brought about the need for optimized techniques and engines for querying large RDF graphs. We refer to engines that allow for storing, indexing and processing joins over RDF as RDF stores.
While various query languages have historically been proposed for RDF, the SPARQL Protocol and RDF Query Language (SPARQL) has become the standard [92]. The first version of SPARQL was standardized in 2008, while SPARQL 1.1 was released in 2013 [92]. SPARQL is an expressive language that supports not only joins, but also variants of the broader relational algebra (projection, selection, union, difference, etc.). Various new features were added in SPARQL 1.1, such as property paths for matching arbitrary-length paths in the RDF graph. Hundreds of SPARQL query services, called endpoints, have emerged on the Web [43], with the most popular endpoints receiving millions of queries per day [197,148]. We refer to engines that support storing, indexing and processing SPARQL (1.1) queries over RDF as SPARQL engines. Since SPARQL supports joins, we consider any SPARQL engine to also be an RDF store.
Efficient data storage, indexing and join processing are key to RDF stores (and thus, to SPARQL engines):
- Storage. Different engines store RDF data using different structures (tables, graphs, etc.), encodings (integer IDs, string compression, etc.) and media (main memory, disk, etc.). Which storage to use may depend on the scale of the data, the types of query features supported, etc.
- Indexing. Indexes are used in RDF stores for fast lookups and query execution. Different index types can support different operations with varying time-space trade-offs.
- Join Processing. At the core of evaluating queries lie efficient methods for processing joins. Aside from traditional pairwise joins, recent years have seen the emergence of novel techniques, such as multiway and worst-case optimal joins, as well as GPU-based join processing. Optimizing the order of evaluation of joins can also be important to ensure efficient processing.
Beyond processing joins, SPARQL engines must offer efficient support for more expressive query features:
- Query Processing. SPARQL is an expressive language containing a variety of query features beyond joins that need to be supported efficiently, such as filter expressions, optionals, path queries, etc.
RDF stores can further be divided into two categories: (1) local stores (also called single-node stores) that manage RDF data on one machine and (2) distributed stores that partition RDF data over multiple machines. While local stores are more lightweight, the resources of one machine limit scalability [249,175,104]. Various kinds of distributed RDF stores have thus been proposed [88,104,203,204] that typically run on clusters of shared-nothing machines.
In this survey, we describe storage, indexing, join processing and query processing techniques employed by local RDF stores, as well as high-level strategies for partitioning RDF graphs as needed for distributed storage. An appendix in this extended version further compares 135 local and distributed RDF engines in terms of the techniques they use, as well as 12 benchmarks in terms of the types of data and queries they contain. The goal of this survey is to give a succinct introduction to the different techniques used by RDF query engines, and to help users choose the appropriate engine or benchmark for a given use case.
The rest of the paper is structured as follows. Section 2 discusses and contrasts this survey with related literature. Section 3 provides preliminaries for RDF and SPARQL. Sections 4, 5, 6 and 7 review techniques for storage, indexing, join processing and query processing, respectively. Section 8 explains different graph partitioning techniques for distributing storage over multiple machines. Section 9 introduces additional content available in the appendix of this extended version, which surveys 135 local and distributed RDF engines, along with 12 SPARQL benchmarks. Section 10 concludes the paper with subsections for current trends and research challenges regarding efficient RDF-based data management and query processing.

Literature Review
We first discuss related studies. More specifically, we summarize peer-reviewed tertiary literature (surveys in journals, short surveys in proceedings, book chapters, surveys with empirical comparisons, etc.) from the last 10 years collating techniques, engines and/or benchmarks for querying RDF. We summarize the topics covered by these works in Table 1. We use ✓, ∼ and blank cells to denote detailed, partial or little/no discussion, respectively, when compared with the current survey (the bottom row). We also present the number of engines and benchmarks included in the extended version of this survey. If the respective publication does not formally list all systems/benchmarks (e.g., as a table), we may write n+ as an estimate for the number discussed in the text.
Sakr et al. [196] present three schemes for storing RDF data in relational databases, surveying works that use the different schemes. Svoboda et al. [221] provide a brief survey on indexing schemes for RDF divided into three categories: local, distributed and global. Faye et al. [70] focus on both storage and indexing schemes for local RDF engines, divided into native and non-native storage schemes. Luo et al. [141] also focus on RDF storage and indexing schemes under the relational-, entity- and graph-based perspectives in local RDF engines. Compared with these works, we additionally cover join processing, query processing and partitioning techniques; furthermore, since these works predate the standardization of SPARQL 1.1, our discussion includes more recent storage and indexing techniques, as well as support for new features such as property paths.
Local RDF stores are those most commonly found in practice [43]. To the best of our knowledge, our survey provides the most comprehensive discussion thus far on storage, indexing, join processing and query processing techniques for SPARQL in a local setting, where, for example, we discuss novel techniques for established features -such as novel indexing techniques based on compact data structures, worst-case optimal and matrix-based join processing techniques, multi-query optimization, etc. -as well as techniques for novel features in SPARQL 1.1 -such as indexing and query processing techniques for evaluating property paths -that are not well-represented in the existing literature. To keep our survey self-contained, we also present partitioning techniques for RDF graphs, and include distributed stores and benchmarks in our survey. Per Table 1, the survey of engines and benchmarks found in the extended version is more comprehensive than seen in previous works [10]. Conversely, some of the aforementioned works are more detailed in certain aspects, particularly distributed stores; we refer to this literature for further details as appropriate.

Preliminaries
Before beginning the core of the survey, we first introduce some preliminaries regarding RDF and SPARQL.

RDF
The RDF data model [208] uses RDF terms from three pairwise disjoint sets: the set I of Internationalized Resource Identifiers (IRIs) [66] used to identify resources; the set L of literals used for (language-tagged or plain) strings and datatype values; and the set B of blank nodes, interpreted as existential variables. An RDF triple (s, p, o) ∈ IB×I×IBL contains a subject s, a predicate p and an object o (in this paper, we abbreviate the union of sets M1 ∪ . . . ∪ Mn as M1 . . . Mn; hence, IBL stands for I ∪ B ∪ L). A set of RDF triples is called an RDF graph G, where each triple (s, p, o) ∈ G represents a directed labeled edge s –p→ o. The sets s(G), p(G) and o(G) stand for the set of subjects, predicates and objects in G, respectively. We further denote the set of nodes in G by so(G) := s(G) ∪ o(G).
An example RDF graph, representing information about two university students, is shown in Figure 1. We include both a graphical representation and a triple-based representation. RDF terms such as :DB, foaf:age, etc., denote prefixed IRIs. For example, foaf:age stands for the full IRI http://xmlns.com/foaf/0.1/age if we define the prefix foaf as http://xmlns.com/foaf/0.1/ (we use the blank prefix, e.g., :DB, as an arbitrary example; other prefixes used can be retrieved at http://prefix.cc/). Terms such as "Motor RDF"@es denote strings with (optional) language tags, and terms such as "21"^^xsd:int denote datatype values. Finally, we denote blank nodes with the underscore prefix, where _:p refers to the existence of a project shared by Alice and Bob. Terms used in the predicate position (e.g., foaf:age, skos:broader) are known as properties. RDF defines the special property rdf:type, which indicates the class (e.g., foaf:Person, foaf:Project) of a resource.
The semantics of RDF can be defined using RDF Schema (RDFS) [37], covering class and property hierarchies, property domains and ranges, etc. Further semantics can be captured with the Web Ontology Language (OWL) [97], such as class and property equivalence; inverse, transitive, symmetric and reflexive properties; set-and restriction-based class definitions; and more besides. Since our focus is on querying RDF graphs, we do not discuss these standards in detail.

SPARQL
Various query languages for RDF have been proposed down through the years, such as RQL [118], SeRQL [218], etc. We focus our discussion on SPARQL [92], which is now the standard language for querying RDF, and refer to the work by Haase et al. [87] for information on its predecessors.
We define the core of SPARQL in terms of basic graph patterns that express the core pattern matched against an RDF graph; navigational graph patterns that match arbitrary-length paths; complex graph patterns that introduce various language features, such as OPTIONAL, UNION, MINUS, etc. [16]; and query types that specify what result to return.
Basic Graph Patterns (BGPs) At the core of SPARQL lie triple patterns, which are RDF triples that allow variables from the set V (disjoint with IBL) in any position. A basic graph pattern (BGP) is a set of triple patterns. Since blank nodes in BGPs act as variables, we assume they have been replaced with variables. We use vars(B) to denote the set of variables in the BGP B. Given an RDF graph G, the evaluation of a BGP B, denoted B(G), returns a set of solution mappings. A solution mapping µ is a partial mapping from the set V of variables to the set of RDF terms IBL. We write dm(µ) to denote the set of variables for which µ is defined. Given a triple pattern t, we use µ(t) to refer to the image of t under µ, i.e., the result of replacing any variable v ∈ dm(µ) appearing in t with µ(v). µ(B) stands for the image of the BGP B under µ; i.e., µ(B) := {µ(t) | t ∈ B}. The evaluation of a BGP B on an RDF graph G is then given as B(G) := {µ | µ(B) ⊆ G and dm(µ) = vars(B)}. In the case of a singleton BGP {t}, we may write {t}(G) as t(G).
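To make these definitions concrete, the following minimal sketch evaluates a BGP by natural-joining the solution mappings of its triple patterns. This is an illustrative toy, not a real engine: the graph is a plain set of triples, variables are strings starting with "?", and names such as eval_bgp are our own.

```python
# Minimal BGP evaluation over an RDF graph modelled as a set of triples.
# Variables are strings starting with "?"; all names here are illustrative.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def eval_triple_pattern(graph, pattern):
    """Return the solution mappings (dicts) of one triple pattern."""
    solutions = []
    for triple in graph:
        mu = {}
        for pat, term in zip(pattern, triple):
            if is_var(pat):
                if pat in mu and mu[pat] != term:
                    break          # repeated variable bound to two terms
                mu[pat] = term
            elif pat != term:
                break              # constant does not match
        else:
            solutions.append(mu)
    return solutions

def eval_bgp(graph, bgp):
    """Evaluate a BGP by joining the solutions of its triple patterns."""
    solutions = [{}]
    for pattern in bgp:
        solutions = [
            {**mu1, **mu2}
            for mu1 in solutions
            for mu2 in eval_triple_pattern(graph, pattern)
            # compatible mappings: shared variables agree (natural join)
            if all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())
        ]
    return solutions

G = {(":Alice", "foaf:knows", ":Bob"),
     (":Bob", "foaf:knows", ":Alice"),
     (":Alice", "foaf:topic_interest", ":SW")}

B = [("?x", "foaf:knows", "?y"), ("?x", "foaf:topic_interest", "?z")]
print(eval_bgp(G, B))  # [{'?x': ':Alice', '?y': ':Bob', '?z': ':SW'}]
```

The nested-loop join used here is the simplest strategy; Sections 4–6 discuss the storage, indexing and join algorithms that real engines use to avoid scanning the full graph per pattern.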
In Figure 2, we provide an example of a BGP along with its evaluation. Each row of the results refers to a solution mapping. Some solutions map different variables to the same term; each such solution is thus a homomorphism from the BGP to the RDF graph.

Navigational Graph Patterns (NGPs) A key feature of graph query languages is the ability to match paths of arbitrary length [16]. In SPARQL (1.1), this ability is captured by property paths [92], which are regular expressions E that paths should match, defined recursively as follows:
- if p is an IRI, then p is a path expression (property);
- if e is a path expression, then ^e (inverse), e* (zero-or-more, aka. Kleene star), e+ (one-or-more), and e? (zero-or-one) are path expressions;
- if e1, e2 are path expressions, then e1/e2 (concatenation) and e1|e2 (disjunction) are path expressions;
- if P is a set of IRIs, then !P and !^P are path expressions (negated property set).
The evaluation of path expressions on an RDF graph G returns pairs of nodes in G connected by paths that match the expression, as defined in Table 2. These path expressions are akin to 2-way regular path queries (2RPQs) extended with negated property sets [128,16].
We call a triple pattern (s, e, o) that further allows a path expression as the predicate (i.e., e ∈ EV) a path pattern. A navigational graph pattern (NGP) is then a set of path patterns. Given a navigational graph pattern N, let paths(N) := p(N) ∩ E denote the set of path expressions used in N. Given an RDF graph G and a set of path expressions E ⊆ E, we denote by G_E := G ∪ ⋃_{e∈E} {(s, e, o) | (s, o) ∈ e(G)} the result of materializing all paths matching E in G. The evaluation of the navigational graph pattern N on G is then N(G) := {µ | µ(N) ⊆ G_{paths(N)} and dm(µ) = vars(N)}.
We provide an example of a navigational graph pattern and its evaluation in Figure 3.
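As a concrete illustration of these semantics, the sketch below evaluates a small fragment of path expressions over a set of triples, computing the zero-or-more case by a fixed-point (transitive-closure) iteration. The tuple-based expression encoding and the name eval_path are our own illustrative choices, not taken from any engine.

```python
# Sketch of property-path evaluation returning pairs of nodes (illustrative).
def eval_path(graph, expr):
    kind = expr[0]
    if kind == "iri":                      # p
        return {(s, o) for (s, p, o) in graph if p == expr[1]}
    if kind == "inv":                      # ^e
        return {(o, s) for (s, o) in eval_path(graph, expr[1])}
    if kind == "seq":                      # e1/e2
        left, right = eval_path(graph, expr[1]), eval_path(graph, expr[2])
        return {(s, o) for (s, m1) in left for (m2, o) in right if m1 == m2}
    if kind == "alt":                      # e1|e2
        return eval_path(graph, expr[1]) | eval_path(graph, expr[2])
    if kind == "star":                     # e*: reflexive-transitive closure
        step = eval_path(graph, expr[1])
        nodes = {s for (s, p, o) in graph} | {o for (s, p, o) in graph}
        closure = {(n, n) for n in nodes}  # zero-length paths on all nodes
        frontier = step
        while not frontier <= closure:     # iterate until fixed point
            closure |= frontier
            frontier = {(s, o) for (s, m1) in frontier
                        for (m2, o) in step if m1 == m2}
        return closure
    raise ValueError(kind)

G = {(":DB", "skos:broader", ":CS"), (":SW", "skos:broader", ":Web"),
     (":Web", "skos:broader", ":CS"), (":CS", "skos:broader", ":Sci")}

# (skos:broader)+ expressed here as skos:broader/(skos:broader)*
plus = ("seq", ("iri", "skos:broader"), ("star", ("iri", "skos:broader")))
print(sorted(o for (s, o) in eval_path(G, plus) if s == ":SW"))
# [':CS', ':Sci', ':Web']
```

Real engines avoid materializing full closures where possible (see the discussion of property-path processing in Section 7).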
Complex Graph Patterns (CGPs) Complex graph patterns (CGPs) introduce additional language features that can combine and transform the results of one or more graph patterns. More specifically, evaluating BGPs and NGPs returns solution mappings that can be viewed as relations (i.e., tables), where variables are attributes (i.e., column names) and tuples (i.e., rows) contain the RDF terms bound by each solution mapping. CGPs support combining and transforming the results of BGPs/NGPs with language features that include FILTER (selection: σ), SELECT (projection: π), UNION (union: ∪), EXISTS (semi-join: ⋉), MINUS (anti-join: ▷, whose definition differs slightly from the anti-join in that mappings with no overlapping variables on the right are ignored) and OPTIONAL (left-join: ⟕). These language features correspond to the relational algebra defined in Table 3. The default operator is a natural inner join (⋈). Figure 4 provides an example of a CGP combining two BGPs and an NGP using union, join and projection.
Named graphs SPARQL allows for querying multiple RDF graphs through the notion of a SPARQL dataset, defined as D := {G, (n1, G1), . . . , (nk, Gk)}, where G, G1, . . . , Gk are RDF graphs; n1, . . . , nk are pairwise distinct IRIs; G is known as the default graph; and each pair (ni, Gi) (for 1 ≤ i ≤ k) is known as a named graph. Letting N′, N′′ denote sets of IRIs, n′, n′′ IRIs, and v a variable, SPARQL then provides a number of features for querying different graphs:
- FROM N′, FROM NAMED N′′: activates a dataset with a default graph composed of the merge of all graphs G′ such that (n′, G′) ∈ D and n′ ∈ N′, and the set of all named graphs (n′′, G′′) ∈ D such that n′′ ∈ N′′;
- GRAPH n′: evaluates a graph pattern on the graph G′ if the named graph (n′, G′) is active;
- GRAPH v: takes the union of the evaluation of a graph pattern over each G′ such that (n′, G′) is active, binding v to n′ for each solution generated from G′.
Without FROM or FROM NAMED, the active dataset is the indexed dataset D. Without GRAPH, graph patterns are evaluated on the active default graph. Quad stores disallow empty named graphs, such that D := {G, (n1, G1), . . . , (nk, Gk)} is viewed as D = (G × {⋆}) ∪ ⋃_{(ni,Gi)∈D} (Gi × {ni}), i.e., a set of quads using ⋆ ∉ IBL as a special symbol for the default graph. In this case, a quad (s, p, o, n) denotes a triple (s, p, o) in the default graph if n = ⋆, or a triple in the named graph G′ such that (n, G′) ∈ D if n ∈ I. We can define CGPs involving quad patterns analogously.
Other SPARQL features SPARQL supports features beyond CGPs, which include aggregation (group-by with count, sum, etc.), solution modifiers (ordering and slicing solutions), bag semantics (preserving result multiplicity), federation (fetching solutions from remote services), entailment and more besides. SPARQL also supports different query types, such as SELECT, which returns a sequence of solution mappings; CONSTRUCT, which returns an RDF graph based on the solution mappings; DESCRIBE, which returns an RDF graph describing indicated RDF terms; and ASK, which returns true if some solution mapping is found, or false otherwise.

Storage
Data storage refers to how data are represented in memory. Different storage mechanisms store different elements of data contiguously in memory, offering trade-offs in terms of compression and efficient data access. This section reviews various categories of RDF storage.

Triple table
A triple table stores an RDF graph G as a single ternary relation. Figure 1 shows an RDF graph with its triple table on the right-hand side. One complication when storing triple tables in relational databases is that such systems assume a column to have a single type, which may not be true for RDF objects in particular; a workaround is to store a string encoding of the terms, though this may complicate their ordering. Rather than storing full RDF terms in the triple table, stores may apply dictionary encoding, where RDF terms are mapped one-to-one with numeric object identifiers (OIDs), with OIDs being stored in the table and decoded using the dictionary as needed. Since OIDs consume less memory and are faster to process than strings, such an approach works better for queries that involve many intermediate results but generate few final results; on the other hand, such an approach suffers when queries are simple and return many results, or when selective filters are specified that require decoding the term before filtering. To find a better trade-off, some RDF engines (e.g., Jena 2 [241]) only use OIDs for strings with lengths above a threshold.
The most obvious physical storage is to store triples contiguously (row-wise). This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [69]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and for quickly reading many values from a single column.
Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets [69,91].

Vertical partitioning
The vertical partitioning approach [1] uses a binary relation for each property p ∈ p(G) whose tuples encode subject–object pairs for that property. In Figure 5 we exemplify two such binary relations. Physical storage can again use OIDs, row-based or column-based storage, etc.
When compared with triple tables, vertical partitioning generates relations with fewer rows and more specific domains for columns (e.g., the object column for foaf:age in Figure 1 can be defined as an integer type). However, triple patterns with variable predicates may require applying a union on all relations. Also, RDF graphs may have thousands of properties [233], which may lead to a schema with many relations. Vertical partitioning can be used to store quads by adding a Graph column to each table [69,91].
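A minimal sketch of the scheme, showing both the cheap constant-predicate lookup and the union needed for a variable predicate (the data values are illustrative):

```python
# Sketch of vertical partitioning: one subject-object table per property.
from collections import defaultdict

def partition(graph):
    tables = defaultdict(list)
    for s, p, o in graph:
        tables[p].append((s, o))
    return tables

G = [(":Alice", "foaf:age", 26), (":Bob", "foaf:age", 21),
     (":Alice", "foaf:knows", ":Bob")]
tables = partition(G)

# Constant-predicate pattern (?x, foaf:age, ?y): scan one small table.
print(tables["foaf:age"])                    # [(':Alice', 26), (':Bob', 21)]

# Variable-predicate pattern (?x, ?p, ?y): union over all tables.
print(sum(len(t) for t in tables.values()))  # 3
```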

Extended vertical partitioning
S2RDF [204] uses extended vertical partitioning based on semi-join reductions (we recall from Table 3 that a semi-join M1 ⋉ M2, aka. FILTER EXISTS, returns the tuples in M1 that are "joinable" with M2). Letting x, y, z denote variables and p, q denote RDF terms, then for each property pair (p, q) ∈ p(G) × p(G) such that p ≠ q, extended vertical partitioning stores three semi-join reductions: (x, p, y)(G) ⋉ (x, q, z)(G) (S-S), (x, p, y)(G) ⋉ (z, q, x)(G) (S-O), and (x, p, y)(G) ⋉ (y, q, z)(G) (O-S). The semi-join (x, p, y)(G) ⋉ (z, q, y)(G) (O-O) is not stored, as most O-O joins have the same predicate, and thus would occur in the same relation. In Figure 6 we give an example of a semi-join reduction for two predicates from the running example; empty semi-joins are omitted.
In comparison with vertical partitioning, queries can apply joins over the corresponding semi-join reductions, knowing that each tuple read from each side will contribute to the join, thus reducing I/O. The cost involves storing (and updating) each tuple in up to 3(|p(G)| − 1) additional relations; omitting empty semi-joins can help to mitigate this issue [204]. Extended vertical partitioning also presents complications for variable predicates, graphs with many properties, etc.
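The core operation can be illustrated with an O-S reduction over simple in-memory subject-object tables; the function name and example data are illustrative, not S2RDF's actual implementation.

```python
# Sketch of a semi-join reduction as used by extended vertical partitioning:
# for properties p, q, keep only the (s, o) pairs of p's table whose object
# joins with a subject of q's table (an O-S reduction).
def semi_join_os(table_p, table_q):
    subjects_q = {s for (s, o) in table_q}
    return [(s, o) for (s, o) in table_p if o in subjects_q]

broader = [(":DB", ":CS"), (":SW", ":Web")]   # skos:broader pairs
related = [(":Web", ":Net"), (":AI", ":ML")]  # skos:related pairs

# Only (:SW, :Web) survives: :Web also appears as a subject of related.
print(semi_join_os(broader, related))         # [(':SW', ':Web')]
```

An O-S join over the reduced table then reads no tuple that fails to contribute to a result, which is what saves I/O at the price of redundant storage.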

Property table
Property tables aim to emulate the n-ary relations typical of relational databases. A property table usually contains one subject column, and n further columns to store objects for the corresponding properties of the given subject. The subject column then forms a primary key for the table. The tables to define can be based on classes, clustering [184], coloring [36], etc., to group subjects with common properties. We provide an example in Figure 7 for the RDF graph of Figure 1. Property tables can store and retrieve multiple triples with a given subject as one tuple (e.g., to find people with age < 30 and interest = :SW) without needing joins. Property tables often store terms of the same type in the same column, enabling better compression. Complications arise for multi-valued (. . . -to-many) or optional (zero-to-. . . ) properties. In the example of Figure 1, Alice is also interested in SW, which does not fit in the cell. Furthermore, Alice has no past project, and Bob has no current project, leading to nulls. Changes to the graph may also require re-normalization; for example, even though each person currently has only one value for knows, adding that Alice knows another person would require re-normalizing the tables. Complications also arise when considering variable predicates, RDF graphs with many properties or classes, quads, etc.
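A tiny sketch of a property table illustrating the null problem; the values loosely follow the running example, but the exact ages and column choices are illustrative.

```python
# Sketch of a property table: one row per subject, one column per property.
# Missing values become nulls (None); multi-valued properties do not fit a
# single cell, which is the main complication discussed above.
columns = ["foaf:age", "foaf:currentProject", "foaf:pastProject"]
table = {
    ":Alice": {"foaf:age": 26, "foaf:currentProject": "_:p",
               "foaf:pastProject": None},   # no past project -> null
    ":Bob":   {"foaf:age": 21, "foaf:currentProject": None,
               "foaf:pastProject": "_:p"},  # no current project -> null
}

# A star query over one subject needs no joins: read a single row.
row = table[":Alice"]
print([row[c] for c in columns])            # [26, '_:p', None]
```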

Graph-based storage
While the previous three storage mechanisms rely on relational storage, graph-based storage is adapted specifically for the graph-based model of RDF. Key characteristics of such models that can be exploited for storage include the adjacency of nodes, the fixed arity of graphs, etc.
Graphs have bounded arity (3 for triples, 4 for quads), which can be exploited for specialized storage. Engines like 4store [91] and YARS2 [94] build native triple/quad tables, which differ from relational triple/quad tables in that they have fixed arity, fixed attributes (S,P,O(,G)), and more general domains (e.g., the O column can contain any RDF term).
Graphs often feature local repetitions that are compressible with adjacency lists (e.g., Hexastore [238], gStore [263], SpiderStore [32], Trinity.RDF [258], GRaSS [142]). These lists are akin to tries, where subject or subject-predicate prefixes are followed by the rest of the triple. Such tries can be stored row-wise in blocks of triples; or column-wise, where blocks of elements from one column point to blocks of elements from the next column. Index-free adjacency can enable efficient navigation, where elements directly point to the on-disk locations of their adjacent elements, rather than requiring index lookups. We refer to Figure 8 for an example. Such structures can also include inverse edges (e.g., Trinity.RDF [258], GRaSS [142]).
An alternative is to decompose an RDF graph into its constituent components for storage. AMBER [105] uses a multigraph representation where an RDF graph G is decomposed into a set of (non-literal) nodes V := so(G) ∩ IB, along with edges and attributes defined over these nodes. Another type of native graph storage uses tensors, viewing a dictionary-encoded RDF graph G with m = |so(G)| nodes and n = |p(G)| predicates as an m × n × m 3-order tensor T of bits such that T_{i,j,k} = 1 if the i-th node links to the k-th node with the j-th property, or T_{i,j,k} = 0 otherwise. A popular variant uses an adjacency matrix per property (e.g., BitMat [21], BMatrix [38], QDags [163]), akin to vertical partitioning, as seen in Figure 10. A third option (considered, e.g., by MAGiQ [109]) is to encode the full graph as an adjacency matrix where each cell indicates the property id connecting the two nodes; this matrix cannot directly represent pairs of nodes connected by more than one property. While abstract tensor-based representations may lead to highly-sparse matrices or tensors, compact data structures offer compressed representations that support efficient operations [21,109,38,163]. Often such matrices/tensors are stored in memory, or loaded into memory when needed. Such representations may also enable query processing techniques that leverage hardware acceleration, e.g., for processing joins on GPUs (as we will discuss in Section 6.4).
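The adjacency-matrix-per-property view can be sketched as follows, using dense Python lists rather than the compact or hardware-accelerated structures used in practice; the graph and all names are illustrative.

```python
# Sketch of tensor-style storage: one boolean adjacency matrix per property
# over dictionary-encoded nodes; an S-O path join then becomes a boolean
# matrix multiplication.
nodes = [":SW", ":Web", ":CS"]          # node i <-> matrix index i
N = len(nodes)

def matrix(edges):
    m = [[0] * N for _ in range(N)]
    for i, j in edges:
        m[i][j] = 1
    return m

broader = matrix([(0, 1), (1, 2)])      # :SW -> :Web -> :CS (skos:broader)

def bool_mult(a, b):
    """Boolean matrix product: (a*b)[i][j] = 1 iff some k links i->k->j."""
    return [[1 if any(a[i][k] and b[k][j] for k in range(N)) else 0
             for j in range(N)] for i in range(N)]

# Pairs connected by a broader/broader path: only (:SW, :CS).
print(bool_mult(broader, broader))      # [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
```

This correspondence between path joins and matrix products is what makes GPU and compact-data-structure techniques attractive for such representations.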

Miscellaneous storage
Aside from relational-based and graph-based storage, other engines have proposed to leverage other forms of storage as implemented by existing systems. A common example is the use of NoSQL key-value, tabular or document stores for distributed storage (see [111,257,249] for more details).

Discussion
Early works on storing RDF tended to rely on relational storage, which had been subject to decades of developments and optimizations before the advent of RDF (e.g., [241,1,69]). Though such an approach still has broad adoption [69], more recent storage techniques aim to exploit the graph-based characteristics of RDF -and SPARQL -in order to develop dedicated storage techniques (e.g., [21,238,263]), including those based on tensors/matrices [21,109,38,163]. A recent trend is to leverage NoSQL storage (e.g., [131,177,25]) in order to distribute the management of RDF data.

Indexing
Indexing enables efficient lookup operations on RDF graphs (i.e., O(1) or O(log |G|) time to return the first result or an empty result). The most common such operation is to find triples that match a given triple pattern. However, indexes can also be used to match non-singleton BGPs (with more than one triple pattern), to match path expressions, etc. We now discuss indexing techniques proposed for RDF graphs.

Triple indexes
The goal of triple indexes is to efficiently find triples matching a triple pattern. Unlike relational databases, where often only the primary key of a relation will be indexed by default and further indexes must be manually specified, most RDF stores aim to have a complete index by default, covering all eight possible triple patterns. However, depending on the type of storage chosen, this might not always be feasible.
When a storage scheme such as vertical partitioning is used [1], only the five patterns where the predicate is constant can be efficiently supported (by indexing the subject and object columns). If the RDF graph is stored as a (binary) adjacency matrix for each property [21,163], again only constant-predicate patterns can be efficiently supported. Specialized indexes can be used to quickly evaluate such patterns, where QDags [163] uses quadtrees: a hierarchical index structure that recursively divides the matrix into four sub-matrices. We provide an example quadtree in Figure 11: the root represents the full matrix, while children denote the four sub-matrices of their parent; a node is colored black if its sub-matrix contains only 1s, white if it contains only 0s, and gray if it contains both; only gray nodes require children. A similar structure, namely a k²-tree, is used by BMatrix [38].
Otherwise, in triple tables or similar forms of graph-based storage, all triple patterns can be efficiently supported with triple permutations. Figure 8 illustrates a single SPO permutation. A total of 3! = 6 permutations are possible and suffice to cover all eight abstract triple patterns if the index structure permits prefix lookups; for example, in an SPO permutation we can efficiently support the four abstract triple patterns (s, p, o), (s, p, ?), (s, ?, ?) and (?, ?, ?) (writing ? for a variable), as these only require the leftmost terms of the permutation to be filled. In fact, with only C(3, ⌊3/2⌋) = 3 permutations -e.g., SPO, POS and OSP -we can cover all eight abstract triple patterns. Such index permutations can be implemented using standard data structures such as ISAM files [94], B(+)Trees [168], AVL trees [243], as well as compact data structures, such as adjacency lists [238] (see Figure 8) and tries [185], etc.
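A minimal sketch of how three sorted permutations answer all eight abstract triple patterns via prefix lookups; here we binary-search sorted tuples with Python's bisect, whereas real systems use B+trees, ISAM files or compact structures.

```python
# Sketch of index permutations: sorted SPO, POS and OSP copies of the graph
# support any triple pattern whose bound terms form a prefix of some order.
import bisect

def make_index(graph, order):
    return sorted(tuple(t[i] for i in order) for t in graph)

def prefix_lookup(index, prefix):
    """Return all tuples starting with the given (partial) prefix."""
    lo = bisect.bisect_left(index, prefix)
    # Pad the prefix with a maximal code point to bound the range from above.
    hi = bisect.bisect_right(index, prefix + (chr(0x10FFFF),) * (3 - len(prefix)))
    return index[lo:hi]

G = {(":Alice", "foaf:knows", ":Bob"), (":Bob", "foaf:knows", ":Alice"),
     (":Alice", "foaf:age", "26")}

spo = make_index(G, (0, 1, 2))  # serves (s,?,?), (s,p,?), (s,p,o), (?,?,?)
pos = make_index(G, (1, 2, 0))  # serves (?,p,?), (?,p,o)
osp = make_index(G, (2, 0, 1))  # serves (?,?,o), (s,?,o)

print(prefix_lookup(spo, (":Alice",)))  # all triples with subject :Alice
```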
Recent works use compact data structures to reduce redundancy for index permutations, and thus the space required for triple indexing. Perego et al. [185] use tries to index multiple permutations, over which they apply cross-compression, whereby the order of the triples given by one permutation is used to compress another permutation. Other approaches remove the need for multiple permutations. RDFCSA [40] and Ring [19] use a compact suffix array (CSA) such that one permutation suffices to efficiently support all triple patterns. Intuitively speaking, triples can be indexed cyclically in a CSA, such that in an SPO permutation, one can continue from O back to S, thus covering the SPO, POS and OSP permutations in one CSA index [40]. The Ring indexing scheme is also bidirectional, where in an SPO permutation, one can move from O forwards to S or backwards to P.

Entity-based indexes
Entity-based indexes optimize graph patterns that "center on" a particular entity. BGPs can be reduced to joins over their triple patterns; for example, {(x, p, y), (y, q, z)}(G) = {(x, p, y)}(G) ⋈ {(y, q, z)}(G). Star joins are frequently found in BGPs, defined to be a join on a common subject, e.g., {(w, p, x), (w, q, y), (w, r, z)}. Star joins may sometimes also include S-O joins on the common variable, e.g., {(w, p, x), (w, q, y), (z, r, w)} [142]. Star joins retrieve data surrounding a particular entity (in this case w). Entity-based indexes permit efficient evaluation of such joins.
Property tables can enable efficient star joins so long as the relevant tables can be found efficiently and there are indexes on the relevant columns (e.g., for p, q and/or r).
The EAGRE system [261] uses an index for property tables where entities with n properties are encoded in n-dimensional space. A space-filling curve (e.g., a Z-order or Hilbert curve) is then used for indexing. Figure 12 illustrates the idea, where four entities are indexed (abbreviating :Alice, :Bob, :Carol, :Dave) with respect to two dimensions (say foaf:age for x and integer-encoded values of foaf:knows for y). We show the first-, second- and third-order Hilbert curves from left to right. Letting d denote the number of dimensions, the n-th-order Hilbert curve assigns an ordinal to 2^{dn} regions of the space based on the order in which it visits the region; e.g., starting with region 1 on the bottom left and following the curve, :A is in the region of ordinal 2, 7 and 26, respectively. The space-filling curve thus provides a one-dimensional ordering of the multidimensional entity data that preserves locality.

Property tables are complicated by multi-valued properties, missing values, etc. A more flexible approach is to index signatures of entities, which are bit vectors encoding the property-value pairs of the entity. One such example is the vertex signature tree of gStore [263], which encodes all outgoing (p, o) pairs for a given entity s into a bit vector akin to a Bloom filter, and indexes these bit vectors hierarchically, allowing for fast, approximate containment checks that quickly find candidate entities for a subset of such pairs. GRaSS [142] further optimizes for star subgraphs that include both outgoing and incoming edges on entities, where a custom FDD-index allows for efficient retrieval of the subgraphs containing a triple that matches a triple pattern.
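The signature idea can be sketched in the spirit of gStore's vertex signatures, though greatly simplified: a single 64-bit vector per entity and no hierarchical tree; all names and parameters are illustrative.

```python
# Sketch of entity signatures: each entity's outgoing (p, o) pairs are hashed
# into a bit vector, and a candidate check is a bitwise containment test.
# Like a Bloom filter, this admits false positives but no false negatives.
BITS = 64

def signature(pairs):
    sig = 0
    for pair in pairs:
        sig |= 1 << (hash(pair) % BITS)   # set one bit per (p, o) pair
    return sig

def may_contain(entity_sig, query_sig):
    # True if every bit of the query signature is set in the entity signature.
    return entity_sig & query_sig == query_sig

alice = signature([("foaf:knows", ":Bob"), ("foaf:topic_interest", ":SW"),
                   ("foaf:topic_interest", ":DB")])
query = signature([("foaf:topic_interest", ":SW")])
print(may_contain(alice, query))   # True: :Alice is a candidate match
```

Candidates surviving the bitwise test must still be verified against the actual data, since distinct pairs can hash to the same bits.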

Property-based indexes
Returning to the star join {(w, p, x), (w, q, y), (w, r, z)}, another way to quickly return candidate bindings for the variable w is to index nodes according to their adjacent properties; then we can find nodes that have at least the adjacent properties p, q, r. Such an approach is used by RDFBroker [212], which defines the signature of a node s as Σ(s) = {p | ∃o : (s, p, o) ∈ G}; for example, the signature of :SW in Figure 1 is Σ(:SW) = {skos:broader, skos:related} (analogous to characteristic sets proposed later [165]). A property table is then created for each signature. At query time, property tables whose signatures subsume {p, q, r} are found using a lattice of signatures. We provide an example in Figure 13 with respect to the RDF graph of Figure 1, where children subsume the signatures of their parent.
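A minimal sketch of this property-based scheme follows, using a linear scan over signatures in place of RDFBroker's lattice (the graph and names are illustrative):

```python
# Minimal sketch of property-based signature lookup in the spirit of
# RDFBroker: each subject is grouped under its signature (the set of its
# properties); a star join over properties {p, q, r} then scans for
# signatures that subsume it. (RDFBroker organizes signatures in a lattice;
# a linear scan suffices for illustration.)

def signatures(graph):
    """Map each subject to its signature: the set of its properties."""
    sigs = {}
    for s, p, o in graph:
        sigs.setdefault(s, set()).add(p)
    return sigs

def candidates(graph, star_props):
    """Subjects whose signature subsumes the star pattern's properties."""
    return {s for s, sig in signatures(graph).items() if star_props <= sig}

G = [(":SW", "skos:broader", ":CS"), (":SW", "skos:related", ":DB"),
     (":DB", "skos:broader", ":CS")]
assert candidates(G, {"skos:broader", "skos:related"}) == {":SW"}
```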
AxonDB [155] uses extended characteristic sets where each triple (s, p, o) in the RDF graph is indexed with the signatures (i.e., characteristic sets) of its subject and object; i.e., (Σ(s), Σ(o)). Thus the triple (:SW, skos:related, :DB) of Figure 1 would be indexed with the extended characteristic set ({skos:broader, skos:related} , {skos:broader}). The index then allows for efficiently identifying two star joins that are connected by a given property p.

Path indexes
A path join involves successive S-O joins between triple patterns; e.g., {(w, p, x), (x, q, y), (y, r, z)}, where the start and end nodes (w, z) may be variables or constants. While path joins have fixed length, navigational graph patterns may further match arbitrary length paths. A number of indexing approaches have been proposed to speed up querying paths.
A path can be seen as a string of arbitrary length; e.g., a path {(w, p, x), (x, q, y), (y, r, z)} can be seen as a string wpxqyrz$, where $ indicates the end of the string; alternatively, if intermediate nodes are not of importance, the path could be represented as the string wpqrz$. The Yaanii system [47] builds an index of paths of the form wpxqyrz$ that are clustered according to their template of the form wpqrz$. Paths are then indexed in B+trees, which are partitioned by template. Fletcher et al. [72] also index paths in B+trees, but rather than partition paths, they apply a maximum length of at most k for the paths included. Text indexing techniques can also be applied for paths (viewed as strings). Maharjan et al. [147] and the HPRD system [139] both leverage suffix arrays -a common indexing technique for text -to index paths. The downside of path indexing approaches is that they may index an exponential number of paths; in the case of HPRD, for example, users are thus expected to specify which paths to index [139].
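The string view of paths can be sketched as follows, with a sorted list standing in for the B+tree and an ad-hoc space-separated serialization of paths (both illustrative):

```python
import bisect

# Illustrative sketch of indexing paths as strings (in the spirit of Yaanii
# and Fletcher et al.): each path up to some length k is serialized as a
# string and kept in a sorted list, standing in for a B+tree; prefix search
# then finds all indexed paths extending a given path fragment.

paths = sorted([
    "w p x q y r z $",   # full path with intermediate nodes
    "w p x q y s u $",
])

def prefix_search(index, prefix):
    """Return all indexed path strings starting with the given prefix."""
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix + "\uffff")
    return index[lo:hi]

assert prefix_search(paths, "w p x q y") == paths  # both paths match
```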
Other path indexes are inspired by prior works for path queries over trees (e.g., for XPath). Bartoň [26] proposes a tree-based index based on preorder and postorder traversal. A preorder traversal starts at the root and traverses children in a depth-first manner from left to right. A postorder traversal starts at the leftmost leaf and traverses all children, from left to right, before moving to the parent. We provide an example preorder and postorder traversal in Figure 14. Given two nodes m and n in the tree, a key property is that m is a descendant of n if and only if m is greater than n for preorder and less than n for postorder. Bartoň [26] uses this property to generate an index on ascending preorder so as to linearize the tree and quickly find descendants based on postorder. To support graphs, Bartoň uses a decomposition of the graph into a forest of trees that are then indexed [26].
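The pre/postorder property that Bartoň exploits can be sketched as follows (the tree and labels are illustrative):

```python
# Sketch of pre/postorder labeling on a tree, as used by Bartoň's index:
# m is a descendant of n iff pre(m) > pre(n) and post(m) < post(n).

def label(tree, root):
    """Assign preorder and postorder numbers via one depth-first traversal."""
    pre, post, counters = {}, {}, [0, 0]
    def visit(n):
        pre[n] = counters[0]; counters[0] += 1
        for c in tree.get(n, []):
            visit(c)
        post[n] = counters[1]; counters[1] += 1
    visit(root)
    return pre, post

def is_descendant(m, n, pre, post):
    return pre[m] > pre[n] and post[m] < post[n]

tree = {"CS": ["AI", "Web"], "AI": ["ML"], "Web": ["SW"]}
pre, post = label(tree, "CS")
assert is_descendant("ML", "CS", pre, post)
assert not is_descendant("SW", "AI", pre, post)
```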
Another type of path index, called PLSD, is used in System Π [245] for indexing the transitivity of a single property, optimizing for path queries of the form (s, p+, o) or (s, p*, o). For a given property p, each incident (subject or object) node x is assigned a triple of numbers (i, j, k) ∈ ℕ³, where i is a unique prime number that identifies the node x, j is the least common multiple of the i-values of x's parents (i.e., nodes y such that (y, p, x) ∈ G), and k is the least common multiple of the k-values of x's parents and the i-value of x. We provide an example in Figure 15. PLSD can further handle cycles by multiplying the k-value of each node by the i-values of all nodes in its strongly connected component. Given the i-value of a node, the i-values of its parents and ancestors can be retrieved by factorizing j and k/i, respectively. However, multiplication may give rise to large numbers, and no polynomial-time algorithm is known for integer factorization.

Gubichev et al. [80] use a path index of directed graphs, called FERRARI [210], for each property in an RDF graph. First, a condensed graph is computed by merging nodes of strongly connected components into one "supernode"; adding an artificial root node (if one does not exist), the result is a directed acyclic graph (DAG) that preserves reachability. A spanning tree -a subgraph that includes all nodes and is a tree -of the DAG is computed and labeled with its postorder. All subtrees thus have contiguous identifiers, where the maximum identifies the root; e.g., in Figure 14, the subtree at :AI has the interval [1,3], where 3 identifies the root. Then there exists a (directed) path from x to y if and only if y is in the subtree interval for x. Nodes in a DAG may, however, be reachable through paths not in the spanning tree.
Hence each node is assigned a set of intervals for nodes that can be reached from it, where overlapping and adjacent intervals are merged; we must now check that y is in one of the intervals of x. To improve time and space at the cost of precision, approximate intervals are proposed that merge non-overlapping intervals; e.g., [4,6], [8,9] is merged to [4,9], which can still reject reachability for nodes with id less than 4 or greater than 9, but has a 1/6 chance of a false positive for nodes in [4,9], which must be verified separately.
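A sketch of the resulting interval-based reachability test follows, using exact intervals; the interval sets and node identifiers below are illustrative, not derived from a real FERRARI construction:

```python
import bisect

# Sketch of interval-based reachability in the style of FERRARI: each node
# stores a sorted list of postorder intervals of the nodes it reaches;
# y is reachable from x iff y's identifier falls into one of x's intervals.
# The intervals and ids below are illustrative.

intervals = {
    "CS": [(0, 4)],           # root reaches everything
    "AI": [(1, 3)],           # subtree interval [1, 3]
    "Web": [(0, 0), (2, 2)],
}
ids = {"Web": 0, "ML": 1, "SW": 2, "AI": 3, "CS": 4}

def reachable(x, y):
    node_id = ids[y]
    ivs = intervals.get(x, [])
    # find the last interval starting at or before node_id
    pos = bisect.bisect_right(ivs, (node_id, float("inf"))) - 1
    return pos >= 0 and ivs[pos][0] <= node_id <= ivs[pos][1]

assert reachable("AI", "SW")
assert not reachable("Web", "CS")
```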

Join indexes
The results of joins can also be indexed. Groppe et al. [78] proposed to construct 6 × 2^4 = 96 indexes for the 6 types of non-symmetric joins between two triple patterns (S-S, S-P, S-O, P-P, P-O, O-O), combined with the 2^4 = 16 ways in which the four remaining elements of the two triple patterns can be bound. The main drawbacks of join indexes are the space they require and the cost of maintaining them as the graph changes.

Structural indexes
Another family of indexes -known as structural indexes [141] -rely on a high-level summary of the RDF graph. Some structural indexes are based on distance measures. GRIN [228] divides the graph hierarchically into regions based on the distance of its nodes to selected centroids. These regions form a tree, where the non-leaf elements indicate a node x and a distance d referring to all nodes at most d steps from x. The root element chooses a node and distance such that all nodes of the graph are covered. Each non-leaf element has two children that capture all nodes of their parent. Each leaf node contains a set of nodes N , which induces a subgraph of triples between the nodes of N ; the leaves can then be seen as partitioning the RDF graph. We provide an example in Figure 16 for the RDF graph of Figure 1, where all nodes are within distance two of :Alice, which are then divided into two regions: one of distance at most two from _:p, and another of distance at most one from :CS. The index can continue dividing the graph into regions, and can then be used to find subgraphs within a particular distance from a given node (e.g., a node given in a BGP).
Another type of structural index relies on some notion of a quotient graph [48], where the nodes of a graph so(G) are partitioned into pairwise-disjoint sets {X₁, . . . , Xₙ} such that X₁ ∪ · · · ∪ Xₙ = so(G). Then edges of the form (Xᵢ, p, Xⱼ) are added if and only if there exists (xᵢ, p, xⱼ) ∈ G such that xᵢ ∈ Xᵢ and xⱼ ∈ Xⱼ. Intuitively, a quotient graph merges nodes from the input graph into "supernodes" while maintaining the input (labeled) edges between the supernodes. We provide an example of a quotient graph in Figure 17. Many different quotient graphs can be formed from a given input graph, ranging from a single supernode with all nodes so(G) and loops for all properties in p(G), to the graph itself, replacing each node x ∈ so(G) with the singleton {x}. If the input graph yields solutions for a BGP, then the quotient graph will also yield solutions (with variables now matching supernodes). For example, taking the BGP of Figure 2, matching foaf:Person to the supernode containing foaf:Person in Figure 17, the variables ?a and ?b will match the supernode containing :Alice and :Bob, while ?ia and ?ib will match the supernode containing :CS, :DB, :SW and :Web; while we do not know the exact solutions for the input graph, we know they must correspond to elements of the supernodes matched in the quotient graph. DOGMA [41] partitions an RDF graph into subgraphs, from which a balanced binary tree is computed, where each parent node contains a quotient-like graph of both its children. The (O)SQP approach [225] creates an in-memory index graph, which is a quotient graph whose partition is defined according to various notions of bisimulation.
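The construction of a quotient graph from a given partition can be sketched as follows (the graph and partition are illustrative):

```python
# Sketch of building a quotient graph: given a partition of the nodes into
# supernodes, add an edge (X_i, p, X_j) iff some (x_i, p, x_j) in G has
# x_i in X_i and x_j in X_j.

def quotient(graph, partition):
    """Return the quotient graph of `graph` under the node `partition`."""
    block = {x: frozenset(X) for X in partition for x in X}
    return {(block[s], p, block[o]) for s, p, o in graph}

G = [(":Alice", "foaf:knows", ":Bob"),
     (":Alice", "foaf:topic_interest", ":SW"),
     (":Bob", "foaf:topic_interest", ":DB")]
P = [{":Alice", ":Bob"}, {":SW", ":DB"}]
Q = quotient(G, P)
people, topics = frozenset(P[0]), frozenset(P[1])
assert Q == {(people, "foaf:knows", people),
             (people, "foaf:topic_interest", topics)}
```

Note how the two foaf:topic_interest edges of the input collapse into a single edge between supernodes, which is what makes the summary smaller than the graph it summarizes.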
SAINT-DB [186] adopts a similar approach, where supernodes are defined directly as a partition of the triples of the RDF graph, and edges between supernodes are labeled with the type of join (S-S, P-O, etc.) between them.

Quad indexes
Most quad indexes follow the triple index scheme [243,94,91,69], extending it to add another element. The number of permutations then grows to 2^4 = 16 abstract index patterns, 4! = 24 potential permutations, and C(4, ⌊4/2⌋) = 6 flat (ISAM/B+tree/AVL tree/trie) permutations or 2 circular (CSA) permutations to efficiently support all abstract quad patterns. A practical compromise is to maintain a selection of permutations that cover the most common patterns [69]; for example, an abstract pattern specifying a constant graph name g alongside variable triple terms may be uncommon in practice, and could be supported reasonably well by evaluating the corresponding pattern with a variable ?g in place of the graph name and filtering the solutions on ?g = g.
The RIQ system [120] proposes a custom index for quads called a PV-index for finding (named) graphs that match a BGP. Each graph is indexed by hashing all seven abstract patterns of each of its triples, generating seven pattern vectors for each graph. For example, a triple (s, p, o) in a graph named g will be hashed as (s, p, o), (s, p, ?), (s, ?, o), (?, p, o), (s, ?, ?), (?, p, ?) and (?, ?, o), where ? is an arbitrary fixed token, and each result will be added to the pattern vector of g for the corresponding abstract pattern. Basic graph patterns can be encoded likewise, where locality-sensitive hashing is then used to group and retrieve similar pattern vectors for a given basic graph pattern.

Miscellaneous Indexing
RDF stores may use legacy systems, such as NoSQL stores, for indexing. Since such approaches are not tailored to RDF, and often correspond conceptually to one of the indexing schemes already discussed, we refer to more dedicated surveys of such topics for further details [111,257,249]. Other stores provide specialized indexes for particular types of values such as spatial or temporal data [232,130]; we do not discuss such specialized indexes in detail.

Discussion
While indexing triples or quads is conceptually the most straightforward approach, a number of systems have shown positive results with entity-and property-based indexes that optimize the evaluation of star joins, path indexes that optimize the evaluation of path joins, or structural indexes that allow for identifying query-relevant regions of the graph. Different indexing schemes often have different time-space trade-offs: more comprehensive indexes enable faster queries at the cost of space and more costly updates.

Join Processing
RDF stores employ diverse query processing strategies, but all require translating the logical operators that represent the query into "physical operators" that implement algorithms for efficient evaluation of the operation. The most important such operators -as we now discuss -are natural joins.

Pairwise join algorithms
We recall that the evaluation of a BGP {t₁, . . . , tₙ}(G) can be rewritten as t₁(G) ⋈ . . . ⋈ tₙ(G), where the evaluation of each triple pattern tᵢ (1 ≤ i ≤ n) produces a relation of arity |vars(tᵢ)|. Thus the evaluation of a BGP B produces a relation of arity |vars(B)|. The relational algebra -including joins -can then be used to combine or transform the results of one or more BGPs, giving rise to CGPs. The core of evaluating graph patterns is thus analogous to processing relational joins. The simplest and most well-known such algorithms perform pairwise joins; for example, a pairwise strategy for computing {t₁, . . . , tₙ}(G) may evaluate ((t₁(G) ⋈ t₂(G)) ⋈ . . .) ⋈ tₙ(G).
Without loss of generality, we assume a join of two graph patterns P₁(G) ⋈ P₂(G), where the join variables are denoted by V = {v₁, . . . , vₙ} = vars(P₁) ∩ vars(P₂). Well-known algorithms for performing pairwise joins include (index) nested-loop joins, where P₁(G) ⋈ P₂(G) is reduced to evaluating ⋃_{µ ∈ P₁(G)} ({µ} ⋈ µ(P₂)(G)); hash joins, where each solution µ ∈ P₁(G) is indexed by hashing on the key (µ(v₁), . . . , µ(vₙ)) and thereafter a key is computed likewise for each solution in P₂(G) to probe the index with; and (sort-)merge joins, where P₁(G) and P₂(G) are (sorted if necessary and) read in the same order with respect to V, allowing the join to be reduced to a merge sort. Index nested-loop joins tend to perform well when |P₁(G)| ≪ |P₂(G)| (assuming that µ(P₂)(G) can use indexes) since they do not require reading all of P₂(G). Otherwise hash or merge joins can perform well [168]. Pairwise join algorithms are then used in many RDF stores (e.g., [93,69,168]).
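As a concrete sketch, a hash join over solution mappings (represented here as Python dicts) can be written as follows:

```python
# Sketch of a hash join between two lists of solution mappings, joining on
# their shared variables: build a hash table on one side, then probe it with
# keys computed from the other side.

def hash_join(left, right):
    if not left or not right:
        return []
    join_vars = sorted(set(left[0]) & set(right[0]))
    key = lambda mu: tuple(mu[v] for v in join_vars)
    table = {}
    for mu in left:                       # build phase
        table.setdefault(key(mu), []).append(mu)
    out = []
    for nu in right:                      # probe phase
        for mu in table.get(key(nu), []):
            out.append({**mu, **nu})      # compatible mappings: merge them
    return out

P1 = [{"x": ":Alice", "y": ":Bob"}, {"x": ":Alice", "y": ":Carol"}]
P2 = [{"y": ":Bob", "z": ":SW"}]
assert hash_join(P1, P2) == [{"x": ":Alice", "y": ":Bob", "z": ":SW"}]
```

In practice the hash table would be built on the smaller input; the sketch omits this and other refinements (spilling to disk, handling solutions with differing domains).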
Techniques to optimize pairwise join algorithms include sideways information passing [29], which passes data across different parts of the query, often to filter intermediate results. Neumann and Weikum [167] propose ubiquitous sideways information passing (U-SIP) for computing joins over RDF, which shares global ranges of values for a given query variable. U-SIP is implemented differently for different join types. For merge joins, where data are read in order, a maximum value for a variable can be shared across pairwise joins, allowing individual operators to skip ahead to the current maximum. For hash joins, a global domain filter is employed -consisting of a maximum value, a minimum value, and Bloom filters -for filtering the results of each variable.
Some of the previous storage and indexing schemes we have seen lend themselves naturally to processing certain types of multiway joins in an efficient manner. Entity-based indexes allow for processing star joins efficiently, while path indexes allow for processing path joins efficiently (see Section 5). A BGP can be decomposed into sub-BGPs that can be evaluated per the corresponding multiway join, with pairwise joins being applied across the sub-BGPs; for example: {(w, p, x), (w, q, y), (w, r, z), (x, q, y), (x, r, z)} may be divided into the sub-BGPs {(w, p, x),(w, q, y),(w, r, z)} and {(x, q, y), (x, r, z)}, which are evaluated separately as multiway joins before being themselves joined. Even in the case of (sorted) triple/quad tables, multiway joins can be applied taking advantage of the locality of processing, where, for example, in an SPO index permutation, triples with the same subject will be grouped together. Similar locality can be exploited in distributed settings (see, e.g., SMJoin [74]).

Worst case optimal joins
A new family of join algorithms has arisen due to the AGM bound [22], which puts an upper bound on the number of solutions that can be returned from a relational join query. The result can be adapted straightforwardly to the case of BGPs. Let B = {t₁, . . . , tₙ} denote a BGP with vars(B) = V. Now define a fractional edge cover as a mapping λ : B → [0, 1] such that for all v ∈ V it holds that Σ_{t ∈ B_v} λ(t) ≥ 1, where B_v denotes the set of triple patterns of B in which v appears. The AGM bound tells us that if B has the fractional edge cover λ, then for any RDF graph G it holds that |B(G)| ≤ Π_{i=1}^{n} |tᵢ(G)|^{λ(tᵢ)}; this bound is "tight". To illustrate the AGM bound, consider the BGP B = {t₁, t₂, t₃} from Figure 18. There exists a fractional edge cover λ of B such that λ(t₁) = λ(t₂) = λ(t₃) = 1/2; taking ?a, we have that B_{?a} = {t₁, t₃} and λ(t₁) + λ(t₃) = 1, and thus ?a is "covered"; we can verify the same for ?b and ?c. For G the graph in Figure 18, |t₁(G)| = |t₂(G)| = |t₃(G)| = 5, and hence |B(G)| ≤ 5^{3/2} ≈ 11.18. In reality, for this graph, |B(G)| = 5, thus satisfying the inequality, but there exist graphs for which the bound is reached.

Recently, join algorithms have been proposed that can enumerate the results for a BGP B over a graph G in time O(agm(B, G)), where agm(B, G) denotes the AGM bound of B over G. Since such an algorithm must at least spend O(agm(B, G)) time writing the results in the worst case, such algorithms are deemed worst-case optimal (wco) [169]. Though such algorithms were initially proposed in a relational setting [169,230], they have recently been adapted for processing joins over RDF graphs [115,99,163,19]. Note that traditional pairwise join algorithms are not wco. If we try to evaluate {t₁, t₂}(G) by pairwise join, for example, in order to later join it with t₃(G), the corresponding bound becomes quadratic since λ(t₁) = λ(t₂) = 1, giving the bound |t₁(G)|·|t₂(G)|, which exceeds the AGM bound for B.
Wco join algorithms typically evaluate a BGP variable by variable: having fixed an order on vars(B), partial solutions over the first variables are extended one variable v at a time with each candidate binding for v that is compatible with every triple pattern mentioning v. To be wco-compliant, the algorithm must always be able to efficiently compute M_{v}, i.e., the solutions µ with dm(µ) = {v} such that µ(B_v)(G) ≠ ∅. To compute M_{?a} in the running example, we need to efficiently intersect all nodes with an outgoing s:b edge and an incoming s:r edge. This is typically addressed by being able to read the results of a triple pattern, in sorted order, for any variable, which enables efficient intersection by allowing to seek ahead to the maximum current value over all triple patterns involving the given variable. Jena-LTJ [99], which implements an LTJ-style join algorithm for SPARQL, enables this by maintaining all six index permutations over triples, while Ring [19] requires only one permutation. Wco algorithms often outperform traditional join algorithms for complex BGPs [115,99].
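The sorted intersection underlying such algorithms can be sketched as follows, with candidate bindings represented as sorted lists of dictionary-encoded nodes (a simplified stand-in for leapfrog triejoin's iterators):

```python
# Sketch of the sorted intersection at the heart of LTJ-style wco joins:
# each triple pattern exposes its candidate bindings for a variable in
# sorted order, and the intersection advances every list to the current
# maximum value until all lists agree.

def leapfrog_intersect(sorted_lists):
    iters = [iter(lst) for lst in sorted_lists]
    try:
        vals = [next(it) for it in iters]
    except StopIteration:
        return []
    out = []
    while True:
        lo, hi = min(vals), max(vals)
        if lo == hi:
            out.append(lo)
            hi += 1  # force every iterator past the match
        try:
            for i in range(len(vals)):
                while vals[i] < hi:      # leap ahead to the current maximum
                    vals[i] = next(iters[i])
        except StopIteration:
            return out

# e.g., nodes with an outgoing s:b edge, and nodes with an incoming s:r edge
# (dictionary-encoded, sorted):
assert leapfrog_intersect([[1, 3, 4, 7], [2, 3, 7, 9]]) == [3, 7]
```

A real implementation advances by seeking within an index (e.g., galloping over a B+tree) rather than stepping an iterator one value at a time.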

Translations to linear algebra
Per Section 4.6, dictionary-encoded RDF graphs are sometimes represented as a bit tensor, or as a bit matrix for each property (see Figure 10), etc. Viewed in this light, some of the query algebra can be reduced to linear algebra [156]; for example, joins become matrix/tensor multiplication. To illustrate, we can multiply the bit (adjacency) matrix from Figure 10 for skos:broader by itself: the result is the analogous bit matrix for an O-S join on skos:broader, with :SW (on row 3) connected to :CS (on column 1), which we would expect per Figure 1.
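The following sketch illustrates the reduction in pure Python, with an assumed node order and illustrative skos:broader edges (:DB → :CS, :SW → :Web, :Web → :CS):

```python
# Sketch of a join as boolean matrix multiplication: squaring the adjacency
# matrix of a property yields the matrix of its O-S self-join, i.e., the
# two-step paths for that property. Node order and edges are illustrative.

def bool_matmul(A, B):
    n = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

# rows/cols: :CS, :DB, :SW, :Web (an assumed node order)
broader = [
    [0, 0, 0, 0],  # :CS
    [1, 0, 0, 0],  # :DB  broader :CS
    [0, 0, 0, 1],  # :SW  broader :Web
    [1, 0, 0, 0],  # :Web broader :CS
]
two_step = bool_matmul(broader, broader)
assert two_step[2][0] == 1  # :SW reaches :CS in exactly two broader steps
```

Production systems use sparse matrix representations and GPU kernels rather than dense nested loops, but the algebraic reduction is the same.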
Translating joins into linear algebra enables hardware acceleration, particularly involving GPUs and HPC architectures, which can process tensors with high levels of parallelism. Such an approach is followed by MAGiQ [109], which represents an RDF graph as a single n × n matrix M, where n is the number of nodes (n = |so(G)|) and M i,j encodes the id of the property connecting the i th node to the j th node (or 0 if no such property exists). One issue with this representation is that it does not support two nodes being connected by multiple edges with different labels, and thus a coordinate list representation can rather be used. Basic graph patterns with projection are translated into matrix multiplication, scalar multiplication, transposition, etc., which can be executed on a variety of hardware, including GPUs.
Other engines that translate SPARQL query features into linear algebra (or other operations within GPUs) include Wukong(+G) [211,235], TripleID-Q [49], and gSmart [52]. Wukong+G [235] proposes a number of caching, pipelining, swapping and prefetching techniques in order to reduce the GPU memory required when processing large graphs while maintaining efficiency, and also proposes a partitioning technique to distribute computation over multiple CPUs and GPUs. TripleID-Q [49] represents an RDF graph as a dictionary-encoded triple table that can be loaded into the GPU in order to search for solutions to individual triple patterns without indexing, but with high degrees of parallelism. On top of this GPU-based search, join and union operators are implemented using GPU libraries. gSmart [52] proposes a variety of optimizations for evaluating basic graph patterns in such settings, including a multi-way join optimization for computing star-like joins more efficiently on GPUs, compact representations for sparse matrices, data partitioning to enable higher degrees of parallelism, and more besides.

Join reordering
The order of join processing can have a dramatic effect on computational costs. For Figure 18, if we apply pairwise joins in the order (t₁(G) ⋈ t₂(G)) ⋈ t₃(G), the first join (t₁(G) ⋈ t₂(G)) yields 25 intermediate results, with 5 final results produced with the second join. If we rather begin with a pair of patterns whose join yields only 5 intermediate results, the second join then directly produces the 5 final results. The second plan should thus be more efficient than the first; if considering a graph at larger scale, the differences may reach orders of magnitude.
A good plan depends not only on the query, but also the graph. Selecting a good plan thus typically requires some assumptions or statistics over the graph. As in relational settings, the most important information relates to cardinalities: how many (distinct) solutions a given pattern returns; and/or selectivity: what percentage of solutions are kept when restricting variables with constants or filters. Statistics can be used not only to select an ordering for joins, but also to decide which join algorithm to apply. For example, given an arbitrary (sub-)BGP {t₁, t₂}, if we estimate that |t₂(G)| ≪ |t₁(G)|, we may prefer to evaluate t₂(G) ⋈ t₁(G) as an index nested-loop join, rather than a hash or merge join, to avoid reading t₁(G) in full.
While cardinality and selectivity estimates can be managed in a similar way to relational database optimizers, a number of approaches have proposed custom statistics for RDF. Stocker et al. [215] collect statistics relating to the number of triples, the number of unique subjects, and for each predicate, the number of triples and a histogram of associated objects. RDF-3X [168] uses a set of aggregated indexes, which store the cardinality of all triple patterns with one or two constants. RDF-3X [168] further stores the exact cardinality of frequently encountered joins, while characteristic sets [165] and extended characteristic sets [155] (discussed in Section 5.3) capture the cardinality of star joins.
Computing and maintaining such statistics incur costs in terms of space and updates. An alternative is to apply sampling while evaluating the query. Vidal et al. [231] estimate the cardinality of star joins by evaluating all solutions for the first pattern of the join, thereafter computing the full solutions of the star pattern for a sample of the initial solutions; the full cardinality of the star pattern is then estimated from the samples. Another alternative is to use syntactic heuristics for reordering. Stocker et al. [215] propose heuristics such as assuming that triple patterns with fewer variables have lower cardinality, that subject constants are more selective than objects and predicates, etc. Tsialiamanis et al. [227] further propose to prioritize rarer joins (such as P-S and P-O joins), and to consider literals as more selective than IRIs.
Taking into account such heuristics and statistics, the simplest strategy to try to find a good join ordering is to apply a greedy metaheuristic [215,155], starting with the triple pattern t₁ estimated to have the lowest cardinality, and joining it with the triple pattern t₂ with the next lowest cardinality; typically a constraint is added such that tₙ (n > 1) should have a variable in common with some triple pattern in {t₁, . . . , tₙ₋₁} to avoid costly Cartesian products. Aside from considering the cardinality of triple patterns, Meimaris and Papastefanatos [154] propose distance-based planning, where pairs of triple patterns with more overlapping nodes and more similar cardinality estimates have a lesser distance between them; the query planner then tries to group and join triple patterns with the smallest distances first in a greedy manner. Greedy strategies will not, however, always provide the best ordering corresponding to an optimal plan.
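A greedy ordering of this kind can be sketched as follows, abstracting triple patterns as their variable sets and assuming cardinality estimates are given:

```python
# Sketch of greedy join ordering: repeatedly pick the cheapest (estimated)
# triple pattern that shares a variable with those already chosen, falling
# back to any remaining pattern only when a Cartesian product is unavoidable.

def greedy_order(patterns, card):
    """patterns: list of frozensets of variables; card: estimate function."""
    remaining = sorted(patterns, key=card)   # cheapest first
    order = [remaining.pop(0)]
    seen = set(order[0])
    while remaining:
        connected = [t for t in remaining if seen & t] or remaining
        nxt = min(connected, key=card)       # cheapest connected pattern
        remaining.remove(nxt)
        order.append(nxt)
        seen |= nxt
    return order

t1, t2, t3 = frozenset("wx"), frozenset("xy"), frozenset("yz")
est = {t1: 1000, t2: 10, t3: 100}.get       # assumed cardinality estimates
assert greedy_order([t1, t2, t3], est) == [t2, t3, t1]
```

Note how t3 is chosen before the cheaper-to-scan t1 would allow: t1 shares a variable with t2, but t3's estimate (100) beats t1's (1000) among the connected candidates.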
More generally, reordering joins is an optimization problem, where classical methods from the relational literature can be leveraged likewise for BGPs, including dynamic programming [209] (used, e.g., by [94,168,82]) and simulated annealing [106] (used, e.g., by [231]). Other metaheuristics that have been applied for join reordering in BGPs include genetic algorithms [102] and ant colony systems [101,114].

Caching
Another possible route for optimization -based on the observation that queries in practice may feature overlapping or similar patterns -is to reuse work done previously for other queries. Specifically, we can consider caching the results of queries. In order to increase cache hit rates, we can further try to reuse the results of subqueries, possibly generalizing them to increase usability. Ideally the cache should store solutions for subqueries that (a) have a high potential to reduce the cost of future queries; (b) can reduce costs for many future queries; (c) do not have a high space overhead; and (d) will remain valid for a long time. Some of these aims can be antagonistic; for example, caching solutions for triple patterns satisfies (b) and (c) but not (a), while caching solutions for complex BGPs satisfies (a) but not (b), (c) or (d).
Lampo et al. [132] propose caching of solutions for star joins, which may strike a good balance in terms of reducing costs, being reusable, and not having a high space overhead (as they share a common variable). Other caching techniques try to increase cache hit rates by detecting similar (sub)queries. Stuckenschmidt [217] uses a similarity measure for caching -based on the edit distance between BGPs -that estimates the amount of computational effort needed to compute the solutions for one query given the solutions to the other. Lorey and Naumann [140] propose a technique for grouping similar queries, which enables a pre-fetching strategy based on predicting what a user might be interested in based on their initial queries. Another direction is to normalize (sub)queries to increase cache hit rates. Wu et al. [246] propose various algebraic normalizations in order to identify common subqueries [140], while Papailiou et al. [179] generalize subqueries by replacing selective constants with variables and thereafter canonically labeling variables (modulo isomorphism) to increase cache hit rates. Addressing dynamic data, Martin et al. [150] propose a cache where results for queries are stored in a relational database but are invalidated when a triple matching a query pattern changes. Williams and Weaver [242] add last-updated times to their RDF index to help invalidate cached data.
Given that an arbitrary BGP can produce an exponential number of results, Zhang et al. [260] propose to cache frequently accessed "hot triples" from the RDF graph in memory, rather than caching (sub-)query results. This approach limits the space overhead at the cost of recomputing joins.

Discussion
Techniques for processing BGPs are often based on techniques for processing relational joins. Beyond standard pairwise joins, multiway joins can help to emulate some of the benefits of property table storage by evaluating star joins more efficiently. Another recent and promising approach is to apply wco join algorithms whose runtime is bounded theoretically by the number of results that the BGP could generate. More and more attention has also been dedicated to computing joins in GPUs by translating relational algebra (e.g., joins) into linear algebra (e.g., matrix multiplication). Aside from specific algorithms, the order in which joins are processed can have a dramatic effect on runtimes. Statistics about the RDF graph help to find a good ordering at the cost of computing and maintaining those statistics; more lightweight alternatives include runtime sampling, or syntactic heuristics that consider only the query. To decide the ordering, options range from simple greedy strategies to complex metaheuristics; while simpler strategies have lower planning times, more complex strategies may find more efficient plans. Another optimization is to cache results across BGPs, for which a time-space trade-off must be considered.

Query Processing
While we have defined RDF stores as engines capable of storing, indexing and processing joins over RDF graphs, SPARQL engines support various features beyond joins. We describe techniques for efficiently evaluating such features, including the relational algebra (beyond joins) and property paths. We further include some general extensions proposed for SPARQL to support recursion and analytics.

Relational algebra (beyond joins)
Complex (navigational) graph patterns (CGPs) introduce additional relational operators beyond joins.
Like in relational databases, algebraic rewriting rules can be applied over CGPs in SPARQL to derive equivalent but more efficient plans. Schmidt et al. [207] present a set of such rules for SPARQL under set semantics, such as:

σ_{R₁∧R₂}(M) ≡ σ_{R₁}(σ_{R₂}(M))
σ_{R₁∨R₂}(M) ≡ σ_{R₁}(M) ∪ σ_{R₂}(M)
σ_{R₁}(σ_{R₂}(M)) ≡ σ_{R₂}(σ_{R₁}(M))
σ_R(M₁ ∪ M₂) ≡ σ_R(M₁) ∪ σ_R(M₂)
σ_R(M₁ ⋈ M₂) ≡ σ_R(M₁) ⋈ M₂
σ_R(M₁ ⟕ M₂) ≡ σ_R(M₁) ⟕ M₂
σ_R(M₁ ▷ M₂) ≡ σ_R(M₁) ▷ M₂

where the latter four rules require that for each µ ∈ M₁, it holds that vars(R) ⊆ dm(µ). The first two rules split filters, meaning that they can be pushed further down in a query in order to reduce intermediary results. The third rule allows the order in which filters are applied to be swapped. Finally, the latter four rules describe how filters can be pushed "down" inside various operators.
Another feature of importance for querying RDF graphs are optionals (⟕), as they facilitate returning partial solutions over incomplete data. Given that an optional can be used to emulate a form of negation (in Table 3 it is defined using an anti-join), it can lead to jumps in computational complexity [183]. Works have thus studied a fragment called well-designed patterns, which forbid using a variable on the right of an optional that does not appear on the left but does appear elsewhere in the query; taking an example, the pattern ({(?x, :p, ?y)} OPTIONAL {(?y, :q, ?z)}) joined with {(?x, :r, ?z)} is not well designed, as the variable ?z appears on the right of an OPTIONAL and not on the left, but does appear elsewhere in the query. Such variables may or may not be left unbound after the left outer join is evaluated, which leads to complications if they are used outside the optional clause. Most SPARQL queries using optionals in practice are indeed well-designed, where rewriting rules have been proposed specifically to optimize such queries [183,137].

Property paths
Navigational graph patterns (NGPs) extend BGPs with property paths, which are extensions of (2)RPQs that allow for matching paths of arbitrary length in the graph.
Some approaches evaluate property paths using graph search algorithms. Though not part of SPARQL, Gubichev and Neumann [81] implement single-source shortest paths by applying Dijkstra's search algorithm over B-trees. Baier et al. [23] propose to use the A* search algorithm, where search is guided by a heuristic that measures the minimum distance from the current node to completing a path.
Extending RDF-3X, Gubichev et al. [80] build a FERRARI index [210] (see Section 5.4) for each property :p in the graph that forms a directed path of length at least 2. The indexes are used to evaluate paths of the form :p* or :p+. Paths of the form (:p/:q)*, (:p|:q)*, etc., are not directly supported.
Koschmieder and Leser [127], and Nguyen and Kim [170] optimize property paths by splitting them according to "rare labels": given a property path :p * /:q/:r * , if :q has few triples in the graph, the path can be split into :p * /:q (evaluated right-to-left) and :q/:r * (evaluated left-to-right), subsequently joining the results. Splitting paths can enable parallelism: Miura et al. [158] evaluate such splits on field programmable gate arrays (FPGAs), enabling hardware acceleration. Wadhwa et al. [234] rather use bidirectional random walks from candidate endpoints on both sides of the path, returning solutions when walks from each side coincide.
Another way to support property paths is to use recursive queries. Stuckenschmidt et al. [218] evaluate property paths such as :p+ using recursive nested-loop and hash joins. Dey et al. [63], Yakovets et al. [252] and Jachiet et al. [108] propose translations of more general property paths (or RPQs) to extensions of the relational algebra with recursive or transitive operators. Paths can be evaluated by SQL engines using WITH RECURSIVE; however, Yakovets et al. [252] note that highly nested SQL queries may result, and that popular relational database engines cannot (efficiently) detect cycles. Dey et al. [63] alternatively explore the evaluation of RPQs via translations to recursive Datalog.
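For instance, the semantics of :p+ can be captured with SQL's WITH RECURSIVE, sketched here over an in-memory SQLite triple table (the schema and constants are illustrative, not those of any cited system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)",
                 [(":a", ":p", ":b"), (":b", ":p", ":c"), (":x", ":q", ":y")])

# reach(s, o) holds all pairs connected by one or more :p edges;
# UNION (rather than UNION ALL) deduplicates, which also terminates
# the recursion on cyclic data.
rows = conn.execute("""
    WITH RECURSIVE reach(s, o) AS (
        SELECT s, o FROM triples WHERE p = ':p'
        UNION
        SELECT r.s, t.o FROM reach r
        JOIN triples t ON r.o = t.s AND t.p = ':p'
    )
    SELECT s, o FROM reach ORDER BY s, o
""").fetchall()
assert rows == [(":a", ":b"), (":a", ":c"), (":b", ":c")]
```

This also hints at the cycle issue noted by Yakovets et al.: termination here relies on the deduplicating UNION, which not all engines evaluate efficiently.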
In later work, Yakovets et al. [253] propose Waveguide, which first converts the property path into a parse tree, from which plans can be built based on finite automata (FA), or relational algebra with transitive closure (α-RA, where α denotes transitive closure). Figure 19 gives an example of a parse tree and both types of plans. Although there is overlap, FA can express physical plans that α-RA cannot, and vice versa. For example, in FA we can express non-deterministic transitions (see q0 in Figure 19), while in α-RA we can materialize (cache) a particular relation in order to apply transitive closure over it. Waveguide then uses hybrid waveplans, where breadth-first search is guided in a similar manner to FA, but where the results of an FA can be memoized (cached) and reused multiple times as in α-RA.
Evaluating complex property paths can be costly, but property paths in practice are often quite simple. Martens and Trautner [149] propose a class of RPQs called simple transitive expressions (STEs) that are found to cover 99.99% of the queries found in Wikidata SPARQL logs, and have desirable theoretical properties. Specifically, they define atomic expressions of the form p1|...|pn, where p1, ..., pn are IRIs and n ≥ 0; and also bounded expressions of the form a1/.../ak or a1?/.../ak?, where a1, ..., ak are atomic expressions and k ≥ 0. Then an expression of the form b1/a*/b2 is a simple transitive expression (STE), where b1 and b2 are bounded expressions, and a is an atomic expression. They then show that simple paths for STEs can be enumerated more efficiently than arbitrary RPQs.

Recursion
Property paths offer a limited form of recursion. While extended forms of property paths have been proposed to include (for example) path intersection and difference [71], more general extensions of SPARQL have also been proposed to support graph-based and relation-based recursion.
Reutter et al. [194] propose to extend SPARQL with graph-based recursion, where a temporary RDF graph is built by recursively adding triples produced through CONSTRUCT queries over the base graph and the temporary graph up to a fixpoint; a SELECT query can then be evaluated over both graphs. The authors discuss how key features (including property paths) can then be supported through linear recursion, meaning that each new triple only needs to be joined with the base graph, not the temporary graph, to produce further triples, leading to better performance. Corby et al. [59] propose LD-Script: a SPARQL-based scripting language supporting various features, including for-loops that can iterate over the triples returned by a CONSTRUCT query.
Hogan et al. [98] propose SPARQAL: a lightweight language that supports relation-based (i.e., SELECT-based) recursion over SPARQL. The results of a SELECT query can be stored as a variable, and injected into a future query. Do-until loops can be run until a particular condition is met, thus enabling recursion over SELECT queries.
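The following plain-Python sketch (our own; SPARQAL's actual syntax and engine differ) mimics this style of relation-based recursion: a "query" step maps the current solution set to a new one, and a do-until loop repeats it until a fixpoint is reached:

```python
def do_until_fixpoint(step, solutions):
    """Repeat a query step over the current solution set until the
    result stops changing (the do-until termination condition)."""
    while True:
        new = step(solutions)
        if new == solutions:
            return solutions
        solutions = new

# Example: transitive closure of :p edges, computed recursively
edges = {(":a", ":b"), (":b", ":c")}

def step(sols):
    # join current solutions with the base edges, keeping prior ones
    return sols | {(x, z) for (x, y) in sols
                          for (y2, z) in edges if y == y2}

closure = do_until_fixpoint(step, set(edges))
assert closure == {(":a", ":b"), (":b", ":c"), (":a", ":c")}
```

In SPARQAL, the step would be a SELECT query over the graph and the stored variable, with the loop condition checking that the variable's solutions have stabilized.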

Analytics
SPARQL engines often focus on transactional (OLTP) workloads involving selective queries that are efficiently solved through lookups on indexes. Recently, however, a number of approaches have looked at addressing analytical (OLAP) workloads for computing slices, aggregations, etc. [45].
One can also translate from analytical languages to SPARQL queries, allowing for in-database analytics, where analytical workloads are translated into queries run by the SPARQL engine/database. Papadaki et al. [176] propose the high-level functional query language HIFUN for applying analytics over RDF data. Rules for translating analytical HIFUN queries to SPARQL are then presented.

There has also been growing interest in combining graph analytics -such as centrality measures, shortest paths, graph clustering, etc. -with SPARQL. In this way, SPARQL can be used as a declarative language to construct sub-graphs over which analytics are applied, and can further express queries involving the results of analytics. Unlike OLAP-style analytics, graph analytics often require recursion. One approach is to extend SPARQL to include imperative functions for invoking common graph algorithms. Abdelaziz et al. [4] propose Spartex: an extension of SPARQL that allows for invoking common graph algorithms -such as PageRank, shortest paths, etc. -as well as user-defined procedures (UDPs) written in a custom procedural language. An alternative approach is to support graph analytics through a more general recursive language based on SPARQL (as discussed in Section 7.3). Hogan et al. [98] show how the recursive language SPARQAL allows for expressing and evaluating in-database graph analytics, including breadth-first search, PageRank, local clustering coefficient, etc.
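To illustrate the kind of analytic such languages express, the following is a minimal PageRank over a directed edge list (plain Python, our own sketch; it is not the implementation used by Spartex or SPARQAL):

```python
def pagerank(edges, d=0.85, iters=50):
    """Iterative PageRank over a directed edge list: a stand-in for the
    graph analytics that SPARQL extensions invoke over sub-graphs
    selected declaratively by a query."""
    nodes = {n for e in edges for n in e}
    out = {n: [o for (s, o) in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = d * rank[n] / len(out[n])
                for o in out[n]:
                    nxt[o] += share
            else:  # dangling node: redistribute its rank uniformly
                for m in nodes:
                    nxt[m] += d * rank[n] / len(nodes)
        rank = nxt
    return rank

r = pagerank([(":a", ":b"), (":b", ":c"), (":c", ":a")])
assert abs(sum(r.values()) - 1.0) < 1e-9   # ranks form a distribution
assert abs(r[":a"] - r[":b"]) < 1e-9       # symmetric cycle: equal ranks
```

Note the recursion (iteration to a fixpoint) that distinguishes such analytics from OLAP-style aggregation.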

Graph query rewriting
We have seen approaches that rewrite SPARQL queries into languages such as SQL [69,252], PigLatin [202,193], etc. Other works rewrite SPARQL into the query languages of (other) graph databases. SPARQL-Gremlin [223] rewrites SPARQL to Gremlin, allowing SPARQL queries to be evaluated on graph database engines that support Gremlin, while Semantic Property Graph [190] describes how reified RDF graphs can be projected into the property graph model supported by many graph database engines.

Multi-query optimization
While the techniques discussed thus far optimize queries individually, multi-query optimization evaluates batches of queries efficiently by exploiting their commonalities. Le et al. [133] propose to first cluster a set of queries into groups with maximal common edge subgraphs; for example, the BGP {(w1, p, x1), (w1, q, y1), (w1, r, z1), (y1, s, z1)} and the BGP {(w2, p, x2), (w2, q, y2), (z2, r, w2)} may form a cluster. A query is then constructed for each cluster by extending its maximal common sub-BGP with optional patterns needed by a proper subset of the queries; for example, ({(w, p, x), (w, q, y)} ⊲⊳ {(w, r, z), (y, s, z)}) ⊲⊳ {(z, r, w)} would be used for the previous cluster. Individual query results are then computed from the cluster-level results. Optimizing for multiple property paths, Abul-Basher [7] proposes to find a maximum common sub-automaton that can be evaluated and reused across multiple queries. More recent works further address multi-query optimization in specific settings, including federated systems [180], and continuous querying over streaming RDF data [259].

Discussion
SPARQL supports various features beyond joins that ideally should be implemented in an efficient manner. One option is to rewrite SPARQL queries into a target language and evaluate them using an existing engine for that language. However, it is unlikely that an existing language/engine will support all features of SPARQL in an efficient manner. Better performance for a wider range of features can be achieved with custom implementations and optimizations, where property paths have been the focus of many works. Other features that have been targeted for optimization are filters and optionals, noting that optionals are quite frequently used in the context of querying incomplete RDF data. Multi-query optimization can further help to evaluate multiple queries at once. More recent works have addressed recursion and analytics for SPARQL in order to support additional RDF data management scenarios and knowledge graph use-cases.

Partitioning
In distributed RDF stores and SPARQL engines, the data are partitioned over a cluster of machines in order to enable horizontal scale, where additional machines can be allocated to the cluster to handle larger volumes of data. However, horizontal scaling comes at the cost of network communication. Thus a key optimization is to choose a partitioning scheme that reduces communication costs by enforcing various forms of locality, principally allowing certain types of (intermediate) joins to be processed on each individual machine [8]. Formally, given an RDF graph G and n machines, an n-partition of G is a tuple of subgraphs (G1, ..., Gn) such that G = G1 ∪ ... ∪ Gn, with the idea that each subgraph Gi will be stored on machine i.⁵ We now discuss different high-level alternatives for partitioning.

Triple/Quad-based Partitioning
A first option is to partition based on individual triples or quads without considering the rest of the graph. For simplicity we will speak about triples as the discussion generalizes straightforwardly to quads. The simplest option is to use round robin or random partitioning, which effectively places triples on an arbitrary machine. This ensures even load balancing, but does not support any locality of processing, and does not allow for finding the particular machine storing triples that match a given pattern.
An alternative is to partition according to a deterministic function over a given key; for example, a partition key of S considers only the subject, while a partition key of PO considers both the predicate and object. Later, given a triple pattern that covers the partition key (e.g., with a constant subject if the key is S), we can find the machine(s) storing all triples that match that pattern. We show some examples using different functions and partition keys in Figure 20. Range-based partitioning allows for range-based queries to be pushed to one machine, but requires maintaining a mapping of ranges to machines, and can be complicated to keep balanced. An alternative is hash-based partitioning, where we compute the hash of the partition key modulo the number of machines; the second example of Figure 20 splits P by hash. This does not require storing any mapping, and techniques such as consistent hashing can be used to rebalance load when a machine enters or leaves; however, if partition keys are skewed (e.g., one predicate is very common), it may lead to an unbalanced partition. A third option is to apply a hierarchical partition based on prefixes, where the third example of Figure 20 partitions O by namespace. This may lead to increased locality of data with the same prefix [113], where different levels of prefix can be chosen to enable balancing, but choosing prefixes that offer balanced partitions is non-trivial. Any such partitioning function will send any triple with the same partition key to the same machine, which ensures that (equi-)joins on partition keys can be pushed to individual machines. Hash-based partitioning is perhaps the most popular among distributed RDF stores (e.g., YARS2 [94], SHARD [195], etc.).

⁵ We relax the typical requirement for a set partition that Gi ∩ Gj = ∅ for all 1 ≤ i < j ≤ n in order to allow for the possibility of replication or other forms of redundancy.
Often triples will be hashed according to multiple partition keys in order to support different index permutations, triple patterns, and joins (e.g., with S and O as two partition keys, we can push S-S, O-O and S-O joins to each machine). Care must be taken to avoid imbalances caused by frequent terms, such as the rdf:type predicate, or frequent objects such as classes, countries, etc. Omitting partitioning on highly-skewed partition keys may be advantageous for balancing purposes [94].
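The hash- and range-based schemes above can be sketched as follows (our own illustrative functions and toy data; real systems use their own hash functions and catalog structures):

```python
import zlib

def hash_partition(key, n):
    """Hash-based partitioning: a deterministic hash (CRC32 here) of
    the partition key, modulo the number of machines n, so every
    machine computes the same placement without a stored mapping."""
    return zlib.crc32(key.encode()) % n

def range_partition(key, boundaries):
    """Range-based partitioning: boundaries is a sorted list of upper
    bounds; unlike hashing, this mapping must be stored and kept
    balanced as data are added."""
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)

n = 4
t1 = (":alice", ":knows", ":bob")
t2 = (":alice", ":age", "42")
# With partition key S, triples sharing a subject land on one machine,
# so S-S joins can be pushed to individual machines.
assert hash_partition(t1[0], n) == hash_partition(t2[0], n)
assert range_partition(":alice", [":m", ":z"]) == 0
```

A deterministic hash is essential here: a salted or per-process hash would send the same key to different machines on different nodes.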

Graph-based Partitioning
Graph-based partitioning takes the entire graph into consideration when computing a partition. A common strategy is to apply a k-way partition of the RDF graph G [119]. Formally, letting V = so(G) denote the nodes of G, the goal is to partition V into k parts of roughly equal size such that the number of triples (s, p, o) ∈ G where s and o are in different node partitions is minimized. In Figure 21, we show the optimal 4-way partitioning of the graph seen previously, where each partition has 3 nodes, there are 10 edges between partitions (shown dashed), and no other such partition leads to fewer edges (<10) between partitions. Edges between partitions may be replicated in the partitions they connect. Another alternative is to k-way partition the line graph of the RDF graph: an undirected graph where each triple is a node, and triples sharing a subject or object have an edge between them.
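Computing a good k-way partition is typically delegated to dedicated partitioners (e.g., METIS-style tools); the sketch below (ours) shows only the objective being minimized, i.e., the number of cut edges for a given node assignment:

```python
def edge_cut(triples, assignment):
    """Count triples whose subject and object nodes are assigned to
    different parts: the objective a k-way partitioner minimizes."""
    return sum(1 for (s, p, o) in triples
               if assignment[s] != assignment[o])

G = [(":a", ":p", ":b"), (":b", ":p", ":c"), (":c", ":p", ":a")]
together = {":a": 0, ":b": 0, ":c": 0}  # all nodes in one part
split    = {":a": 0, ":b": 1, ":c": 0}  # :b separated from the cycle
assert edge_cut(G, together) == 0
assert edge_cut(G, split) == 2
```

A balance constraint (parts of roughly equal size) is imposed alongside this objective; without it, placing all nodes in one part trivially yields zero cut edges.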

Replication
Rather than partitioning data, data can also be replicated across partitions. This may vary from replicating the full graph on each machine, such that queries can be answered in full by any machine to increase query throughput (used, e.g., by DREAM [88]), to replicating partitions that are in high demand (e.g., containing schema data, central nodes, etc.) so that more queries can be evaluated on individual machines and/or machines have equal workloads that avoid hot-spots (used, e.g., by Blazegraph [224] and Virtuoso [69]).

Discussion
Triple/quad-based partitioning is the simplest to compute and maintain, being dependent only on the data present in an individual tuple, allowing joins on the same partition key to be pushed to individual machines. Graph-based partitions allow for evaluating more complex graph patterns on individual machines, but are more costly to compute and maintain (considering, e.g., dynamic data). Information about queries, where available, can be used for the purposes of workload-based partitioning, which partitions or replicates data in order to enable locality for common sub-patterns. Replication can further improve load balancing, locality and fault-tolerance at the cost of redundant storage.

Systems and Benchmarks
In Appendix A we present a comprehensive survey of 135 individual RDF stores and SPARQL query engines -both distributed and local -in terms of the techniques discussed herein that they use. In Appendix B, we further present the synthetic and real-world benchmarks available for evaluating these systems under a variety of criteria.

Summary
In order to conclude this survey paper, we first summarize some of the current high-level trends that we have observed while preparing this survey, and then summarize the open research challenges that are left to address.

Current trends
While RDF stores and SPARQL engines have traditionally relied on relational databases and relational-style optimizations to ensure scalability and efficiency, we see a growing trend towards (1) native graph-based storage, indexing and query processing techniques, along with (2) exploiting modern hardware and data management/processing. Native storage techniques for graphs move away from relational-style schemata for RDF, and rather focus on optimizing for the compression and navigation of RDF as a graph, with techniques such as index-free adjacency, tensor-based storage, and other graph-based representations. Indexing likewise has evolved to consider entity-based (i.e., node-based) schemes, path indexes, and structural indexes based on summarizing the graph structure of RDF data. While join processing over RDF is still largely inspired by techniques for relational databases, algorithms based on sideways information passing, multi-way joins, worst-case optimal joins, etc., have been shown to work particularly well on RDF graphs (e.g., given their fixed arity). In terms of query processing, features such as property paths and graph-based recursion go beyond what is considered in typical relational database management, with increased attention being paid to supporting graph analytics in the RDF/SPARQL setting.
Regarding modern hardware, following broader trends, many works now leverage NoSQL systems and distributed processing frameworks in order to scale RDF stores across multiple machines and handle new types of workloads. A similar trend is to better exploit modern hardware, where a variety of compact data structures have been proposed for storing RDF graphs in main memory, possibly across multiple machines, following a general trend of exploiting the growing RAM capacity of modern hardware. Recent techniques for processing graphs -represented as matrices/tensors -further enable hardware acceleration by leveraging GPUs and HPC architectures, mirroring trends in machine learning.
Such trends seem set to continue, where we expect to see further proposals of "native" techniques for RDF/SPARQL, further works that bridge from the RDF/SPARQL setting to related data management and processing settings in order to better support other types of workloads, as well as techniques that better leverage modern hardware, including increased RAM capacity, solid-state disks, GPUs, clusters of machines, and HPC architectures.

Research Challenges and Future Directions
Though major advances have been made in terms of the scale and efficiency of RDF stores in recent years, these will remain central challenges as the scale of RDF graphs and demand for querying them in more complex ways increases.
Other challenges have only been occasionally or partially addressed by the literature, where we highlight:

Dynamics: Many of the surveyed works assume static data, and do not handle updates gracefully. Thus, more work is needed on efficiently querying dynamic RDF graphs with SPARQL, including storage that efficiently supports reads and writes, incremental indexing, caching, etc.
Query optimizations (beyond joins): Most works focus on optimizing joins and basic graph patterns. We found relatively few works optimizing features of SPARQL 1.1, such as property paths, negation, etc., where more work is needed. The expressivity of the SPARQL language is sure to grow (e.g., in the context of SPARQL 1.2), where these new features will likewise call for new techniques.
Query volume: Leading SPARQL endpoints process millions of queries per day. This challenge motivates further research on workload-aware or caching strategies that leverage frequent sub-queries. Another research challenge is on how to ensure effective policies for serving many clients while avoiding server overload, where methods such as preemption [157], which allows for pausing and resuming costly query requests, are promising ideas for further development.
Evaluation: Various benchmarks are now available for comparing different RDF stores, but they tend to focus on system-level comparisons, thus conflating techniques. More fine-grained evaluation at the level of individual techniques in the RDF/SPARQL setting would be very useful to understand the different trade-offs that exist. Also many benchmarks were proposed for SPARQL 1.0, where there is a lack of benchmarks including features such as property paths.
Integration: RDF and SPARQL are widely adopted on the Web, and for managing and querying knowledge graphs. However, in such settings, additional types of tasks are often considered, including federated querying, reasoning, enrichment, refinement, learning, analytics, etc. More work is needed on supporting or integrating features for these tasks in SPARQL. Interesting questions relate to efficiently supporting RDFS/OWL/Datalog reasoning, graph algorithms, knowledge graph embeddings, graph neural networks, etc., for RDF graphs within SPARQL engines.

A Survey of RDF Stores
We now present a survey of local and distributed RDF stores, and how they use the aforementioned techniques. At the end of this section, we will discuss some general trends for RDF stores. We include here systems for which we could find technical details regarding (at least) the storage, indexing and processing of joins over RDF graphs.⁷ In the case of distributed RDF stores, we expect similar technical details, along with the type of partitioning and/or replication used. We include systems with associated publications, as well as systems that are unpublished but widely known in practice. Both local and distributed systems are presented in approximate chronological order, based on the year of publication, or an approximate year in which the system was released. For unpublished local stores, we include the year where RDF was first supported. For unpublished distributed stores, we include the approximate year when distributed features were added. Some stores that are often deployed in local environments also support distribution; they are included in both sections. Some systems are unnamed; if it is a distributed store that extends an existing local store, we append the suffix "-D" or "-D2" to the local store's name; otherwise we use an abbreviation based on authors and year. Where systems change name, we prefer the more modern name. The papers sometimes use different terminology to refer to similar concepts; we often map the original terminology to that used in the body of the survey in order to increase coherency and improve readability.

A.1 Local RDF Stores
The local RDF stores we include, and the techniques they use, are summarized in Table 4.
Redland [28] (2001) is a set of RDF libraries for native RDF storage that has seen various developments over the years. The original paper describes triple-table-like storage based on creating three hash maps -SP→O, PO→S, SO→P -which, given two elements of an RDF triple, allow for finding the third element; for example, using PO→S, we can find the subjects of triples with a given predicate and object. The hash maps can be stored either in-memory or on persistent storage. Support for the RDQL and SPARQL query languages was later added with the Rasqal query library.

⁷ We thus exclude systems -such as TPF-based systems, and stores such as Fabric, Fluree, and TriplePlace [18] -that do not (yet) describe direct support for joins or basic graph patterns. We also exclude systems -such as Attean, Kineo, KiWi, librdf.sqlite, NitroBase, Oxigraph, Pointrel, Profium Sense, RedStore, RDF::Trine, TerminusDB [229] and TriplyDB -for which we could not find key technical details (e.g., indexes or join algorithms supported) at the time of writing.
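The three hash maps described for Redland can be sketched as follows (our own illustrative code, not Redland's actual implementation):

```python
def build_maps(triples):
    """Build the SP→O, PO→S and SO→P hash maps: given two elements of
    a triple, each map finds the set of matching third elements."""
    sp_o, po_s, so_p = {}, {}, {}
    for s, p, o in triples:
        sp_o.setdefault((s, p), set()).add(o)
        po_s.setdefault((p, o), set()).add(s)
        so_p.setdefault((s, o), set()).add(p)
    return sp_o, po_s, so_p

G = [(":alice", ":knows", ":bob"), (":carol", ":knows", ":bob")]
sp_o, po_s, so_p = build_maps(G)
# via PO→S: subjects of triples with predicate :knows and object :bob
assert po_s[(":knows", ":bob")] == {":alice", ":carol"}
assert sp_o[(":alice", ":knows")] == {":bob"}
```

Triple patterns with one variable can thus be answered by a single lookup, at the cost of storing each triple three times.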
Jena [151] (2002) uses relational databases to store RDF graphs as triple tables, with entries for subject, predicate, object IRIs, and object literals. IRIs and literals are encoded as IDs, and two separate dictionaries are created for them. Indexing is delegated to an underlying relational DBMS (e.g., PostgreSQL, MySQL, Oracle, etc.). RDQL is used as a query language and is translated into SQL and run against the underlying relational DBMS. The Jena store would later be extended in various directions, with SDB referring to the use of relational-style storage (per the original system), and TDB referring to the use of native storage.

RSSDB [117,118] (2002) stores an RDF graph using a vertical partitioning approach with Postgres as the underlying database. Two variants are considered for class instances: creating a unary table per class (named after the class, with rows indicating instances), or creating one binary table called instances (with rows containing both the instance and the class) in order to reduce the number of tables. Four tables are also added to model RDFS definitions (classes, properties with their domain and range, sub-classes and sub-properties). The system supports queries in RQL (proposed in the same paper [118]), which are translated to SQL by an RQL interpreter and evaluated over Postgres.
3store [90] (2003) uses MySQL as a back-end, storing RDF graphs in four tables, namely a triple table, a models table, a resource table, and a literal table. The triple table stores RDF triples (one per row) with additional information: (1) the model the triple belongs to, (2) a boolean value to indicate if the object is a literal, and (3) a boolean value to indicate if the triple is inferred. The models, resource, and literal tables are two-column tables that dictionary encode models, resources, and literals, respectively. Queries expressed in RDQL are rewritten to SQL for execution over MySQL.
Jena2 [241] (2003) is a revised version of the original Jena database schema, with support for both triple and property tables. Unlike the original version, IRIs and literals are stored directly in the tables, unless they exceed a certain length, in which case they are dictionary encoded by two separate tables; this allows filter operations to be directly performed on the triple and property tables, thus reducing dictionary lookups, but increasing storage sizes as string values are stored multiple times. Indexing is handled by an underlying relational database, and graph queries in RDQL are rewritten to SQL queries evaluated over the database.
CORESE [57,58] (2004) began as a search engine with pathfinding functionality and inference [57], but was extended to support SPARQL query features [58]. CORESE models RDF graphs as conceptual graphs; for simplicity we discuss their methods in terms of the RDF model. RDF graphs are indexed according to the terms, enabling the efficient evaluation of triple patterns. Given a basic graph pattern, the triple patterns are reordered based on heuristics -such as the number of constants or filters associated with the triple pattern, or the number of variables bound by previous triple patterns in the order -as well as cardinality estimates. A nested-loop style algorithm is then applied to perform joins. Filters are evaluated as soon as possible to reduce intermediate results.
Jena TDB⁹ (2004) is a native RDF store that has seen continuous development in the past decades. A TDB instance consists of three tables: a node table (a dictionary, allowing to encode/decode RDF terms to/from 8-byte identifiers), a triple/quad table (with dictionary-encoded terms), and a prefixes table (used to store common prefixes used for abbreviations). Storage is based on custom B+trees used to build indexes for various triple/quad permutations. Join processing uses pairwise (nested-loop) joins, with a variety of statistic- and heuristic-based methods available for join reordering. SPARQL 1.1 query processing is implemented in the custom Jena ARQ query processor. Jena TDB has become the recommended RDF store for Jena, with older relational-based storage (later named Jena SDB) having been deprecated.
⁹ https://jena.apache.org/documentation/tdb/

RStar [143] (2004) stores (RDFS-style) ontology information and instance data using multiple relations in the IBM DB2 RDBMS. Five two-column tables are used to store ontological data (property dictionary, sub-property relations, class dictionary, sub-class relations, and domain and range relations). Another five two-column tables are used to store instance-related data (literal dictionary, IRI dictionary, triples, class instances, namespace dictionary). RStar pushes indexing and other tasks to the underlying database. The RStar Query Language (RSQL) is used and translated into SQL.
BRAHMS [110] (2005) is an in-memory RDF store. The RDF graph is indexed in three hash tables -S→PO, O→SP, P→SO -which allow for finding triples that use a particular constant. The motivating use-case of BRAHMS is to find semantic associations -i.e., paths between two subject/object nodes -in large RDF graphs. This path-search functionality was implemented in BRAHMS using depth-first search and breadth-first search algorithms.
In the most recent version, indexes are built for two triple permutations (POS and PSO) as well as a quad permutation (GPSO). Predicate lists (SP and OP) can also be indexed in order to quickly find the predicates associated with a given subject or object. Terms are dictionary encoded. Joins are reordered according to cardinality estimations. SPARQL 1.1 is supported, along with a wide range of other features, including spatial features, full-text indexing, inference, semantic similarity, integration with MongoDB, and more besides.
Mulgara [243,161] (2005), a fork of an earlier RDF store known as Kowari, implements native RDF storage in the form of quads tables using AVL trees. Dictionary encoding based on 64-bit longs is used. Support for transactions is provided using immutable arrays that store quads on disk in compressed form, with skiplists enabling fast search; insertions and deletions lead to a new immutable array being generated on disk. Indexing is based on six permutations of quads (which is sufficient to efficiently evaluate all sixteen possible quad patterns). Joins are evaluated pairwise and reordered (possibly on-the-fly) based on cardinality estimations. Queries are expressed in the iTQL language, where SPARQL support was added later.

RAP [172] (2005) is a general-purpose PHP-based API for RDF that includes an RDF store. Two forms of storage are provided. An in-memory store collects triples in an array, with three indexes provided on S, P and O to find triples by a given term and position. Alternatively, persistent storage is supported through database backends, where triples are stored in a triple table.

RDFBroker [212] (2006) is an RDF store that follows a property table approach. For each subject in the graph, its signature (equivalent to the notion of characteristic sets that would come later) is extracted, with a property table defined for each signature, including a column for the subject, and a column for each property in the signature. Support for RDFS reasoning is also described. An index over signatures is proposed based on a lattice that models set containment between signatures. Given a signature extracted from the query, the lattice can be used to find tables corresponding to signatures that subsume that of the query. A prototype based on in-memory storage is described, implementing typical relational query optimizations such as join reordering.

Another store (2007) is based on a structural index (see Section 5.6). This index is a binary tree, where the root refers to all the nodes of the graph, and both children divide the nodes of their parent based on a given distance from a given node in the graph. The leaves can then be seen as forming a partition of the triples in the graph induced by the nodes in their division. The structural index is used to find small subgraphs that may generate results for a query, over which an existing subgraph matching algorithm is applied.

SW-Store (2007) is an RDF store based on vertical partitioning. SW-Store relies on a column-oriented DBMS called C-Store [216], which is shown to pair well with vertical partitioning in terms of performance (e.g., the object column of a foaf:age table will have integers in an interval [0, 150], which are highly compressible). Each table is indexed by both subject and object. An "overflow" triple table is used for inserts alongside the compressed, vertically partitioned tables. Jena ARQ is used to translate SPARQL queries into SQL for evaluation over C-Store. Pairwise joins are used, preferring merge joins when data are sorted appropriately, otherwise using index nested-loop joins. Materialization of S-O joins is also discussed.
Blazegraph [224] (2008), formerly known as BigData, is a native RDF store supporting SPARQL 1.1. Blazegraph allows for either indexing triples to store RDF graphs, or quads to store SPARQL datasets. Three index permutations are generated for triples, and six permutations are generated for quads; indexes are based on B+trees. Both row and column data storage models are supported, and data can be kept in memory or on disk. Dictionary encoding with 64-bit integers is used for compressed representation of RDF triples. Two query optimization strategies are available: the default approach uses static analysis and cardinality estimation; the second approach uses runtime sampling of join graphs. Supported joins include hash joins, index nested-loop joins, merge joins, and multiway star joins.
Hexastore [238] (2008) is an in-memory RDF store based on adjacency lists similar to Figure 8. Six indexes are built for all 3! = 6 permutations of the elements of a triple. For example, in the SPO index, each subject s is associated with an ordered vector of predicates, wherein each p in turn points to an ordered vector of objects. In the PSO index, each p points to a vector of subjects, wherein each s points to the same vector of objects as used for sp in the SPO index. Terms are dictionary encoded. Having all six index orders allows for (pairwise) merge joins to be used extensively.
BitMat [20] (2009) stores a dictionary-encoded RDF graph as a 3-dimensional bit matrix, which is sliced into 2-dimensional bit matrices (BitMats): SO and OS BitMats for each predicate, a PO BitMat for each subject, and a PS BitMat for each object. Though OP and SP BitMats could be indexed for subjects and objects, resp., the authors argue they would be rarely used. BitMats also store the count of 1's (triples) they contain, a row vector indicating which columns contain a 1, and a column vector indicating which rows contain a 1. In total, 2|P| + |S| + |O| BitMats are generated, gap compressed, stored on disk, and loaded in memory as needed. Bitwise AND/OR/NOT operators are used for multiway joins.
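To give a flavor of this bitwise style of join processing, the following Python toy (a simplified sketch under assumed encodings, not BitMat's actual layout) packs each predicate's subject-object matrix into a single integer and answers an S-S join between two triple patterns with bitwise ANDs:

```python
# Toy bit-matrix join sketch (hypothetical encoding, not BitMat's code).
S = ["s1", "s2", "s3"]   # subject dictionary
O = ["o1", "o2"]         # object dictionary

def bit(s_idx, o_idx):
    # one bit per (subject, object) cell of an |S| x |O| matrix
    return 1 << (s_idx * len(O) + o_idx)

def bitmat(triples, pred):
    m = 0
    for s, p, o in triples:
        if p == pred:
            m |= bit(S.index(s), O.index(o))
    return m

def subject_mask(m):
    row = (1 << len(O)) - 1            # mask for one subject's row
    mask = 0
    for i in range(len(S)):
        if (m >> (i * len(O))) & row:  # row i contains at least one 1
            mask |= 1 << i
    return mask

g = [("s1", "p", "o1"), ("s2", "p", "o2"), ("s1", "q", "o2")]
# Subjects matching both (?s, p, ?x) and (?s, q, ?y), via a bitwise AND:
joined_mask = subject_mask(bitmat(g, "p")) & subject_mask(bitmat(g, "q"))
joined = {S[i] for i in range(len(S)) if joined_mask >> i & 1}
```

The real system operates on gap-compressed BitMats and supports multiway AND/OR/NOT combinations, but the reduction of a join to bitwise operations over fixed-width vectors is the same.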
DOGMA [41] (2009) is a graph-based RDF store, where an RDF graph is first decomposed into subgraphs using a graph partitioning algorithm. These subgraphs are indexed as the leaves of a balanced binary tree stored on disk. Each non-leaf node in this tree encodes the k-merge of its two children, which is a graph with k nodes that is isomorphic to a quotient graph (see Section 5.6) of both children. DOGMA proposes a variety of algorithms for evaluating basic graph patterns with constant predicates. The basic algorithm generates a set of candidate results for each individual variable node based on its incoming and outgoing edges; starting with the node with the fewest candidates, the algorithm then proceeds to check the edges between them in a depth-first manner (similar to wco joins). Further algorithms prune the candidate sets using the distance between the candidates of different query nodes, computed over the subgraphs in the leaves of the binary tree.
RDFJoin [152] (2009) stores RDF graphs using three types of tables. Two dictionary tables are used to encode and decode subjects/objects and predicates. Three triple tables are used, where each has two positions of the triple as its primary key, and the third position is encoded as a bit vector; for example, in the PO table, the predicate and object form the primary key, and for each predicate-object pair, a bit vector of length |so(G)| is given, with a 1 at index k encoding a triple with subject identifier k for the given predicate and object. Join tables store the results of S-S, O-O, and S-O joins, encoded with the predicates of the two triples as primary key (joins using the same predicate twice are excluded) and a bit vector encoding the join terms for that predicate pair (the subjects/objects that match the join variable). MonetDB and LucidDB are used as underlying databases. SPARQL is supported, where joins are evaluated using the join indexes and pairwise algorithms. Inference would later be added in the extended RDFKB system [153].
dipLODocus [250] (2011) is an RDF store based on the notion of a "molecule", which is a subgraph surrounding a particular "root" node. The root nodes are defined based on matching triple patterns provided by the administrator. The molecule of a root node is then the subgraph formed by expanding outward in the graph until another root node is encountered. Dictionary encoding is used. Indexes are further built that map nodes and the values of properties indicated by the administrator to individual molecules. SPARQL is supported through the Rasqal query library, with joins pushed within individual molecules where possible; otherwise hash joins are used. Aggregate queries are pushed to the indexes on values of individual properties (which offers benefits similar to column-wise storage).
gStore [263] (2011) is a graph-based RDF store. The RDF graph is stored using adjacency lists (see Section 4.5) where each node is associated with a bit vector -which serves as a vertex signature (see Section 5.2) -that encodes the triples where the given node is the subject. gStore then indexes these signatures in a vertex signature tree (VS-tree) that enables multi-way joins. The leaves of the VS-tree encode signatures of nodes, and non-leaf nodes encode the bitwise OR of their children; the leaves are further connected with labeled edges corresponding to edges between their corresponding nodes in the graph. Basic graph patterns can then be encoded in a similar manner to the graph, where gStore then evaluates the pattern by matching its signature with that of the indexed graph.
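The signature-based candidate filtering that gStore applies can be sketched as follows (a minimal Python illustration with a hypothetical hash-based encoding; the actual system uses structured signatures and the VS-tree to prune candidates hierarchically):

```python
# Sketch of bit-vector signature filtering (assumed encoding, not gStore's).
WIDTH = 16  # signature width in bits

def signature(edges):
    """Hash each outgoing (predicate, object) pair into a bit position."""
    sig = 0
    for p, o in edges:
        sig |= 1 << (hash((p, o)) % WIDTH)
    return sig

data = {
    "alice": [("knows", "bob"), ("worksAt", "acme")],
    "carol": [("knows", "bob")],
}
sigs = {node: signature(edges) for node, edges in data.items()}

# A query node requiring both edges: a data node remains a candidate only
# if its signature contains every bit of the query signature.
query_sig = signature([("knows", "bob"), ("worksAt", "acme")])
candidates = [n for n, s in sigs.items() if query_sig & s == query_sig]
```

Because signatures are lossy (hash collisions may admit false positives), surviving candidates must still be verified against the graph, as in the system's final matching phase.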
SpiderStore [164] (2011) is an in-memory graph store based on adjacency lists. Specifically, for each node in the RDF graph, an adjacency list for incoming and outgoing edges is stored. Likewise, for each predicate, a list of subject nodes is stored. Rather than storing the constants directly in these lists, pointers are stored to the location of the term (within the adjacency lists for the node, or the subject list of the predicate). Alongside these pointers, cardinality metadata are stored.
(Though SPARQL queries with basic graph patterns and filters are evaluated in the experiments, the types of join algorithms used are not described.)
SAINT-DB [186] (2012) is an RDF store with a structural index that organizes triples in the graph according to the type of join that exists between them (S-S, P-O, etc.). The index itself is then a directed edge-labeled graph whose nodes represent sets of triples from the graph, whose edges indicate that some pair of triples in both nodes are joinable, and whose edge labels indicate the type of join that exists (which makes the graph directed, as S-O differs from O-S). The nodes of the index then form a partition of the graph: no triple appears in more than one node, and their union yields the graph. This index can range from a single node with all triples in the graph (with loops for each type of join present), to singleton nodes each with one triple of the graph. A condition based on semi-joins is used to strike a balance, minimizing the intermediate results generated for individual triple patterns. Given a basic graph pattern, each triple pattern is then mapped to nodes in the structural index, where the triple patterns it joins with must match some triple in a neighbor on an edge whose label corresponds to the type of join.
Strabon [130] (2012) is an RDF store that supports custom features for indexing and querying geospatial data (specifically in the form of stRDF [129] data). Strabon is built upon Sesame/RDF4J, which is chosen as an open-source solution that can easily integrate with PostGIS: a DBMS with spatial features. Strabon then stores RDF using a vertical partitioning scheme with dictionary encoding; an identifier for each triple is also included. B+tree indexes are built for the three columns of each table (subject, predicate, identifier). Strabon supports an extension of SPARQL, called stSPARQL [129], for querying stRDF based datasets, with spatial features supported through PostGIS.
BrightstarDB 12 (2013) is a persistent RDF store that indexes dictionary-encoded RDF datasets using B-trees and/or B+trees. Two types of persistence are supported: in appendonly mode, writes are made to pages at the end of the index files, while in rewritable mode, writes are made to copies of index pages that are made active upon a commit. The system further supports querying over multiple named graphs. SPARQL 1.1 queries are processed over BrightstarDB's storage using dotNetRDF's Leviathan library, which supports hash joins and uses a heuristic-based join reordering based on which elements of the triple patterns are constant.
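The constant-counting heuristic mentioned above can be sketched in a few lines of Python (an assumed simplification in the spirit of that optimizer, not dotNetRDF's actual implementation): triple patterns with more constant positions are assumed to be more selective and are evaluated first.

```python
# Heuristic join reordering sketch; None marks a variable position.
def constants(pattern):
    """Count the constant (non-variable) positions of a triple pattern."""
    return sum(1 for term in pattern if term is not None)

bgp = [
    (None, None, None),        # (?s, ?p, ?o)
    (None, "type", "Person"),  # (?s, type, Person)
    ("alice", "knows", None),  # (alice, knows, ?x)
]
# Most-constant patterns first; the fully variable pattern goes last.
ordered = sorted(bgp, key=constants, reverse=True)
```

Cardinality-based optimizers refine this idea by replacing the raw constant count with estimated selectivities per position.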
DB2RDF [36] (2013) uses a relational schema similar to property tables to store RDF data. However, rather than having a column for each property/predicate associated with a given subject, DB2RDF uses a "primary hash" table with columns S, P 1 , O 1 , . . . , P k , O k , where each P i , O i pair of columns indicates the i th predicate-object pair associated with the subject listed in the S column. A binary "spill" column is added, with a 1 indicating that a subject has more than k triples, in which case it will occupy more than one row of the table.
OntoQuad [187] (2013) is an RDF store that extends the triple-based representation of Hexastore to additionally support quads. A structure similar to a trie is used, where the top layer is a vector of values for S, P, O, G; the second level encodes SP, . . . , GO, etc., with three children for each parent in the top layer (e.g., SP, SO, SG for S); the third layer has two children for each parent in the second layer encoding SPO, . . . , GOP; the fourth layer has one child for each parent in the third layer, completing the quad permutation. B-trees are then used for indexing. Both pairwise and multiway joins are supported using zig-zag joins that seek forward to the maximum compatible join value across the triple patterns. Cardinality estimates and query rewriting rules are used to optimize SPARQL query plans.
OSQP [225] (2013) is an RDF store based on a structural index using various notions of bisimulation, where two nodes in the graph are bisimilar if they cannot be distinguished by their paths. The nodes of the graph are then partitioned into sets of pairwise bisimilar nodes. The index is then based on a quotient graph, where supernodes correspond to a set of bisimilar nodes in the input graph. In order to reduce index sizes, a parameter corresponding to path lengths is added, such that bisimulation only considers paths within a bounded region of the graph rather than the entire graph. A basic graph pattern is then matched over the quotient graph (kept in-memory), where the triples corresponding to each matched node are retrieved (from the disk) and used to compute the final results. Custom optimizations are considered for triples with unprojected variables, whose triple patterns can be definitively "satisfied" and thus pruned based on the index; and selective triple patterns, which are evaluated directly over the RDF graph.
TripleBit [256] (2013) represents a dictionary-encoded RDF graph as a compressed 2-dimensional bit matrix. Each column of the matrix represents a triple, and each row represents a subject/object node. The subject and object rows are assigned 1 for the corresponding column of the triple. Columns are sorted by predicate, where a range of columns corresponds to the triples for that predicate. The columns for triples are sparse (at most two 1's), and thus the two identifiers for subjects and objects are stored rather than 1's; two orders are maintained for SO and OS (thus effectively covering PSO and POS orders). Two auxiliary indexes are used in TripleBit. Given a subject or object node and a predicate node, the first index (called ID-Chunk) supports lookups for finding the range for the unspecified object or subject. Given a subject or object node alone, the second index (called ID-predicate) finds predicates associated with that subject or object. Basic graph patterns are evaluated using multiway merge joins for star joins, with semi-joins used to reduce the number of intermediate results across star joins. Join ordering uses a greedy strategy based on selectivity.
R3F [123,124] (2014) is an extension of RDF-3X with path-based indexes and novel join processing techniques. The first addition is the "RP-index", which indexes all nodes with a given incoming path expression up to a certain length; for example, the incoming path expression pqr (of length 3) indexes all nodes z such that there exists w, x, y, such that (w, p, x), (x, q, y), (y, r, z) are all triples of the graph. The RP-index is structured as a trie indexing the prefixes of the incoming path expressions, whose leaves are the list of nodes (which are dictionary encoded, sorted and delta encoded). Virtual inverse predicates are added to the RDF graph to support paths in both directions. The second extension is a modification to the sideways information passing strategy of RDF-3X to incorporate information about paths for filtering additional intermediate results.
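The kind of entry an RP-index holds for a given incoming path expression can be illustrated with a small Python sketch (a naive materialization by traversal; the actual index is a trie over path prefixes with compressed node lists):

```python
# Materialize the end nodes of an incoming path expression (p, q): all z
# such that (w, p, x) and (x, q, z) are triples of the graph.
graph = [("w", "p", "x"), ("x", "q", "z1"), ("x", "q", "z2"),
         ("w", "p", "y"), ("y", "r", "z3")]

def nodes_with_incoming_path(path):
    # Start from all potential path sources (only subjects can start edges).
    frontier = {s for s, _, _ in graph}
    for pred in path:
        # Follow one predicate step: keep objects reachable from frontier.
        frontier = {o for s, p, o in graph if p == pred and s in frontier}
    return sorted(frontier)

pq_nodes = nodes_with_incoming_path(("p", "q"))
```

A triple pattern whose variable must lie at the end of such a path can then be filtered against this precomputed, sorted node list, which is the role the RP-index plays during sideways information passing.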
RQ-RDF-3X [135] (2014) is an extension of RDF-3X towards support for quads. The extension follows the same principles and techniques for RDF-3X, but the extension to quads requires covering additional permutations. Indexes are built for all 4! = 24 quad permutations, similar to how RDF-3X indexes all 3! = 6 triple permutations; having all permutations enables reading the results for any variable of any triple pattern in sorted order, which in turn enables merge joins. The delta encoding used by RDF-3X is extended to the fourth element. Like in RDF-3X, counts are indexed for all quad patterns with 1, 2, or 3 constants, requiring 4, 12 and 24 indexes, respectively (40 in total). Join and query processing use RDF-3X's techniques. RQ-RDF-3X then offers optimized support for reification using named graphs/triple identifiers.
SQBC [262] (2014) is a graph store -with support for RDF graphs -inspired by existing subgraph matching techniques for efficiently finding subgraph isomorphisms. 13 In order to index the graph, codes are extracted for each node that capture structural information about it, including its label, the largest clique containing it, the degrees of its neighbours, etc. Given a basic graph pattern, candidates are identified and filtered for variable nodes. If the basic graph pattern has no cliques, degree information is used; otherwise clique sizes can be used to filter candidate matches.
WaterFowl [61] (2014) is a compact RDF store based on succinct data structures. The RDF graph is dictionary encoded and sorted in SPO order, and represented as a trie: the first layer denotes subjects, connected to their predicates in a second layer, connected to their objects in the third layer. This trie structure is encoded in a compact representation using a combination of bit strings that indicate the number of children for a parent (e.g., for predicates, 100101 . . . tells us that the first subject has three children (unique predicates) and the second has two); and wavelet trees that encode the sequence of terms themselves (e.g., the sequence of predicates). Pairwise joins are evaluated in terms of left-deep plans, with further support for SPARQL (1.0) features. RDFS inference is also supported.
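The child-count bit strings described above can be decoded with a few lines of Python (a sketch of the encoding as described, not WaterFowl's succinct implementation, which answers such queries with rank/select operations rather than a scan):

```python
# Decode a child-count bit string: a 1 marks the first child of the next
# parent, and each following 0 is a further child of the same parent, so
# "10010" means the first parent has 3 children and the second has 2.
def child_counts(bits):
    counts = []
    for b in bits:
        if b == "1":
            counts.append(1)   # first child of a new parent
        else:
            counts[-1] += 1    # further child of the current parent
    return counts
```

In the succinct setting, the children of parent i are located in constant time via select(i) on the bit string, avoiding any linear decoding.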
GraSS [142] (2015) is an RDF store that is based on decomposing basic graph patterns into subgraph patterns forming star joins (considering S-S, S-O, or O-O joins). An "FFDindex" for star joins is proposed, where for each node, a bitstring signature is computed that encodes its incoming and outgoing edges, i.e., the triples in which it appears as subject or object. A neighbourhood table is constructed: each row denotes a node, which is associated with its signature and edges. Five triple permutations are further indexed (covering SP*, OP*, S*, P*, O*), where in the SP* permutation, for example, (s, p) pairs are mapped to a list of objects and their degrees. A basic graph pattern is then decomposed into sub-patterns forming star joins, which are evaluated using the available indexes.
k²-triples [13] (2015) is a compact in-memory RDF store based on k²-trees. The RDF graph is first dictionary encoded. For each predicate, a k²-tree is used to index its subjects and objects. In order to support variable predicates in triple patterns, SP and OP indexes are used to map subjects and objects, respectively, to their associated predicates; these indexes are encoded using compressed predicate lists.
RDFCSA [40,39] (2015) is a compact in-memory RDF store based on text indexes. Specifically, triples of the RDF graph are dictionary encoded and considered to be strings of length 3. The graph is thus sorted and encoded as a string of length 3n, where n is the number of triples. This string is indexed in a compressed suffix array (CSA): a compact data structure commonly used for indexing text. The CSA is modified by shifting elements so that instead of indexing a string of 3n elements, triples cycle back on themselves, giving n circular strings of length 3. Thus in an SPO permutation, after reading the object of a triple, the next integer will refer to the subject of the same triple rather than the next one in the order. With cyclical strings, one triple permutation is sufficient to support all triple patterns; SPO is in fact equivalent to POS and OSP. Merge joins, sort-merge joins and a variant of index nested-loop joins (called "chain joins") are supported.
RDFox [164] (2015) is an in-memory RDF engine that supports Datalog reasoning. The RDF graph is stored as a triple table implemented as a linked list, which stores identifiers for subject, predicate and object, as well as three pointers in the list to the next triple with the same subject, predicate and object (similar to Parliament [126]). Four indexes are built: a hash table for three constants, and three for individual constants; the indexes for individual constants offer pointers to the first triple in the list with that constant, where patterns with two constants can be implemented by filtering over this list, or (optionally) by using various orderings of the triple list to avoid filtering (e.g., a triple list ordered by SPO can be used to evaluate patterns with constant subject and predicate without filtering). These in-memory indexes support efficient parallel updates, which are key for fast materialization. According to the implementation, (index) nested-loop joins are supported; optionally join plans can be generated based on tree decompositions. SPARQL 1.1 is further supported over the engine.
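The linked triple table described above can be sketched in Python (a simplified, non-concurrent illustration of the idea; the actual system uses lock-free structures for parallel updates):

```python
# Triple table with per-position "next" pointers: each row stores a triple
# plus the index of the next row sharing the same subject, predicate, or
# object; a first-row index per subject allows all triples of a subject to
# be walked without scanning the table.
def build(triples):
    table = []     # rows: [s, p, o, next_s, next_p, next_o]
    last = {}      # (position, term) -> index of last row using term there
    first_s = {}   # subject -> index of its first row
    for i, (s, p, o) in enumerate(triples):
        table.append([s, p, o, None, None, None])
        for pos, term in enumerate((s, p, o)):
            key = (pos, term)
            if key in last:
                table[last[key]][3 + pos] = i  # chain previous row to this one
            last[key] = i
        first_s.setdefault(s, i)
    return table, first_s

def triples_of_subject(table, first_s, s):
    out, i = [], first_s.get(s)
    while i is not None:
        row = table[i]
        out.append((row[0], row[1], row[2]))
        i = row[3]  # follow the next-same-subject pointer
    return out

table, first_s = build([("s1", "p1", "o1"), ("s2", "p1", "o2"),
                        ("s1", "p2", "o3")])
```

Patterns with two constants are then answered by walking one such chain and filtering, or, as noted above, by keeping the list in a suitable order to avoid filtering.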
Turbo HOM++ [122] (2015) is an in-memory, graph-based RDF store. The RDF graph is stored as the combination of adjacency lists for incoming and outgoing triples (see Section 4.5), and an index that allows for finding nodes of a particular type (based on rdf:type). Evaluation of basic graph patterns is then conducted by generating candidates for an initial node of the query graph based on local information (intersecting adjacency lists and type information in order to match all triple patterns that the node appears in), where the neighbors of the candidates are explored recursively in the graph guided by the graph pattern, generating candidates for further query nodes (in a manner akin to DOGMA [41]).
RIQ [120] (2016) provides a layer on top of an existing RDF store that indexes similar named graphs in a SPARQL dataset. A bit vector -called a "pattern vector" -is computed for each named graph in the dataset. The pattern vector consists of seven vectors for S, P, O, SP, SO, PO and SPO, where, e.g., the SP vector hashes all subject-predicate pairs in the named graph. An index over the pattern vectors (PV-index) is constructed by connecting similar pattern vectors (based on locality-sensitive hashing) into a graph; each connected component of the graph forms a group of similar graphs. The union of the graphs in each group is further encoded into Bloom filters. In order to evaluate a basic graph pattern, a pattern vector is computed combining the triple patterns (e.g., a triple pattern (s, p, o) will generate a single SP sub-vector). The PV-index is then used to optimize an input query by narrowing down the candidate (named) graphs that match particular basic graph patterns before evaluating the optimized query over the underlying SPARQL store.
axonDB [154] (2017) uses two dictionary-encoded triple tables to store RDF graphs. In the first table, each triple is additionally associated with the characteristic set (CS) of its subject (see Section 5.3). The CS is assigned a unique identifier and one-hot encoded, i.e., represented by a bit vector with an index for each property that carries a 1 if the property is part of the CS, or a 0 otherwise. Triples are then sorted by their CS, grouping subjects with the same CS together. A second triple table stores each triple, along with the corresponding extended characteristic set (ECS; again see Section 5.3). The ECS is encoded with a unique identifier, and the identifiers for the subject and object CSs. The triple table is sorted by ECS. When evaluating a basic graph pattern, its analogous CSs and ECSs are extracted, along with the paths that connect them. The CSs and ECSs are matched with those of the graph, enabling multiway joins; binary hash joins are used to join the results of multiple CSs/ECSs.
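Characteristic-set extraction and the one-hot encoding described above can be sketched as follows (a minimal Python illustration of the encoding, not axonDB's storage layout):

```python
# Extract each subject's characteristic set (its set of properties) and
# one-hot encode it over the global property vocabulary.
triples = [("s1", "p1", "o1"), ("s1", "p2", "o2"),
           ("s2", "p1", "o3"), ("s3", "p1", "o4"), ("s3", "p2", "o5")]

props = sorted({p for _, p, _ in triples})  # global property order
cs = {}
for s, p, _ in triples:
    cs.setdefault(s, set()).add(p)

def one_hot(prop_set):
    return tuple(1 if p in prop_set else 0 for p in props)

encoded = {s: one_hot(ps) for s, ps in cs.items()}
# s1 and s3 share the characteristic set {p1, p2}; s2 has {p1} alone.
```

Sorting the triple table by these encoded characteristic sets groups subjects with the same structure together, which is what enables the multiway star joins mentioned above.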
HTStore [138] (2017) uses hash-based indexes to build an RDF store. The RDF graph is indexed in a hash tree whose top layer forms a hash table over the nodes of the graph. The hash tree is based on a sequence of prime numbers. When hashing a node, the first prime number is used, and if no collision is detected, the node is inserted in the first layer. Otherwise the second prime number is used, and if no collision is detected, it is inserted in that layer as a child of the bucket of the first layer that caused the collision. Otherwise the third prime number is used, and so forth. Nodes in the hash tree then point to their adjacency lists in the graph. To evaluate queries, constant nodes in the query are hashed in the same manner in order to retrieve the data for the node. SPARQL queries are supported, though details about join and query processing are omitted.
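The prime-sequence insertion scheme can be made concrete with a small Python sketch (a toy with integer keys standing in for hashed node identifiers; the prime sequence and bucket layout are assumptions for illustration):

```python
# Hash tree insertion: each layer uses the next prime as modulus, and a key
# descends a layer only when its bucket in the current layer is taken.
PRIMES = [7, 11, 13]

def insert(tree, key):
    children = tree
    for prime in PRIMES:
        bucket = key % prime               # toy: keys hash to themselves
        if bucket not in children:
            children[bucket] = [key, {}]   # no collision: store here
            return
        children = children[bucket][1]     # collision: descend a layer
    raise RuntimeError("hash tree depth exhausted")

def lookup(tree, key):
    children = tree
    for prime in PRIMES:
        entry = children.get(key % prime)
        if entry is None:
            return False
        if entry[0] == key:
            return True
        children = entry[1]
    return False

tree = {}
for node_id in (3, 10, 17):   # 10 and 17 collide with 3 modulo 7
    insert(tree, node_id)
```

In the store itself, each stored entry additionally points to the node's adjacency lists, so a successful lookup yields the data needed to continue query evaluation.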
Ontop [46] (2017) is an open-source Ontology-Based Data Access (OBDA) system based on relational (and potentially decentralised) storage. The underlying data are mapped to RDF graphs and/or ontologies using languages such as the R2RML standard. SPARQL queries are rewritten to SQL queries following such mappings, which are evaluated over the underlying database; a more recent version rather translates SPARQL into an intermediate algebraic query that is subsequently optimised and translated into SQL [251]. Entailment for RDFS and OWL 2 QL are additionally supported through query rewriting techniques that expand the given query to capture solutions over entailments.
Quadstore 14 (2017) is a client-side RDF store that can be used with Node.js for in-browser management of RDF quads.
The system also supports a variety of underlying storage options through the Level-down interface, such as LevelDB and RocksDB for persistent storage, and MemDown for in-memory storage. By default, indexes are generated for six quad permutations, namely SPOG, OGSP, GSPO, OSPG, POGS and GPOS, though these indexes are configurable by the user. SPARQL 1.1 queries and updates are supported.
AMBER [105] (2018) stores RDF graphs in a "multigraph" representation, where IRIs form nodes, whereas predicate-literal pairs form "attributes" on nodes. All nodes, predicates and attributes are dictionary encoded. AMBER then generates three indexes: the first stores the set of nodes for each attribute, the second stores vertex signatures that encode metadata about the triples where a given node is subject or object, and the third stores adjacency lists. Basic graph patterns are evaluated by classifying query nodes with degree greater than one as core nodes, and other nodes as satellite nodes. Core nodes are processed first, where candidates are produced for each query node based on the available indexes, recursively producing candidates for neighbors; the algorithm starts with the core query node with the most satellite nodes attached, or the highest degree. For each solution over the core nodes, each satellite node is then evaluated separately, as they become disconnected once the core nodes are bound to constants.
TripleID-Q [49] (2018) is an RDF store that uses a compact representation called TripleID for RDF graphs, such that query processing can be conducted on GPUs. The TripleID representation is based on a dictionary-encoded triple table. Rather than indexing the triple table, chunks of the table can be loaded into GPUs, which, given a particular triple pattern, will scan the triple table in parallel looking for matching triples in the RDF graph. Other operators such as union, join, filter, distinct, etc., are then implemented on top of this GPU search; specifically, these operators are translated into functions that are executed in the GPU over the results of the search. RDFS entailment is further supported.
14 https://github.com/beautifulinteractions/nodequadstore
Jena-LTJ [99] (2019) extends the Jena TDB RDF store with the ability to perform worst-case optimal (wco) joins (see Section 6.3). Specifically, Jena TDB is extended with an algorithm similar to Leapfrog TrieJoin (LTJ), which is adapted from a relational setting for the RDF/SPARQL settings. The algorithm evaluates basic graph patterns variable-by-variable in a manner that ensures that the overall cost of enumerating all of the results is proportional to the number of results that it can return in the worst case. In order to reach wco guarantees, the three-order index of Jena TDB -based on B+trees -is extended to include all six orders. This ensures that for any triple pattern, the results for any individual variable can be read in sorted order directly from the index, which in turn enables efficient intersection of the results for individual variables across triple patterns. Thus Jena-LTJ uses twice the space of Jena TDB, but offers better query performance, particularly for basic graph patterns with cycles.
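The sorted-list intersection at the heart of such leapfrog-style joins can be sketched in Python (a simplified single-variable illustration using plain lists; the actual algorithm interleaves this intersection across the variables of a basic graph pattern, reading each list from a B+tree index):

```python
# Leapfrog-style intersection: each iterator seeks forward to the current
# maximum value; a value is emitted only when all iterators agree.
from bisect import bisect_left

def leapfrog_intersect(lists):
    out = []
    if any(not l for l in lists):
        return out
    positions = [0] * len(lists)
    while True:
        values = []
        for k, l in enumerate(lists):
            if positions[k] >= len(l):   # some iterator is exhausted
                return out
            values.append(l[positions[k]])
        hi = max(values)
        if all(v == hi for v in values):
            out.append(hi)               # all iterators agree: emit
            positions = [p + 1 for p in positions]
        else:
            for k, l in enumerate(lists):
                # seek each iterator forward to the first value >= hi
                positions[k] = bisect_left(l, hi, positions[k])

result = leapfrog_intersect([[1, 3, 5, 7], [3, 4, 5], [0, 3, 5, 9]])
```

Because each iterator only moves forward, the total work is proportional to the smallest list times a logarithmic seek factor, which underpins the worst-case optimality guarantee discussed above.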
MAGiQ [109] (2019) is an RDF store that can use a variety of compressed sparse matrix/tensor representations for RDF graphs in order to translate basic graph patterns into linear algebra operations. These representations include compressed sparse column, doubly compressed sparse column, and coordinate list encodings of the graph as a matrix/tensor. Basic graph patterns are then translated into operations such as matrix multiplication, scalar multiplication, transposition, etc., over the associated matrices/tensor, which can be expressed in the languages provided by libraries such as GraphBLAS, Matlab, CombBLAS, and ultimately evaluated on CPUs and GPUs for hardware acceleration.
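The translation of a path-shaped pattern into linear algebra can be illustrated with a pure-Python boolean sparse matrix product (a sketch of the principle only; MAGiQ delegates such products to GraphBLAS-style libraries over compressed representations):

```python
# Each predicate's edges form a sparse boolean adjacency matrix, stored as
# a dict of rows; the two-hop pattern (?x, p, ?y), (?y, q, ?z) becomes a
# boolean matrix multiplication.
def adjacency(triples, pred):
    m = {}
    for s, p, o in triples:
        if p == pred:
            m.setdefault(s, set()).add(o)
    return m

def bool_matmul(a, b):
    out = {}
    for x, ys in a.items():
        reach = set()
        for y in ys:
            reach |= b.get(y, set())   # OR together the rows of b
        if reach:
            out[x] = reach
    return out

g = [("a", "p", "b"), ("b", "q", "c"), ("b", "q", "d"), ("e", "p", "b")]
pq = bool_matmul(adjacency(g, "p"), adjacency(g, "q"))
```

Star- and cycle-shaped patterns require further operations (transposition, element-wise products), but the same correspondence between joins and matrix algebra applies.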
BMatrix [38] (2020) is a compact in-memory RDF store, where the RDF graph is first dictionary encoded and sorted by predicate. Two binary matrices are created: an s × n matrix called ST and an o × n matrix called OT, where s, o and n are the number of unique subjects, objects and triples respectively. The ST/OT matrix contains a 1 at index (i, j) if the subject/object of the j th triple corresponds to the i th term (or a 0 otherwise). Both matrices are indexed with k²-trees, while a bit string of length n encodes predicate boundaries with a 1, i.e., in which columns of the matrix (denoting triples sorted or grouped by predicate) the predicate changes. These indexes are sufficient to cover all eight possible triple patterns. Further compression can be applied to the leaf matrices of the k²-tree in order to trade space for time. The authors mention that joins can be supported in a similar fashion as used for RDFCSA and k²-triples.
Tentris [31] (2020) is an in-memory RDF store wherein an RDF graph is viewed as a one-hot encoded 3-order tensor (equivalent to the 3-dimensional array used in BitMat [20]), which in turn is viewed as a trie of three levels for S, P and O. However, rather than storing tries for all permutations, a hypertrie is used with three levels. The leaves in the third level correspond to all possible combinations of two constants. Finally, the top level -the root, representing zero constants -maps to all the second-level elements. Basic graph patterns (with projection) are translated into tensor operations that can be evaluated on the hypertrie using a worst-case optimal join algorithm.
Ring [19] (2021) is an in-memory RDF store that uses FM-indexes (a text-indexing technique) in order to represent and index RDF graphs in a structure called a "ring". Specifically, a dictionary-encoded RDF graph is sorted lexicographically by subject-predicate-object; then the triples are concatenated to form a string s 1 p 1 o 1 . . . s n p n o n , where (s i , p i , o i ) indicates the i th (dictionary-encoded) triple in the order and n = |G|. A variant of a Burrows-Wheeler Transform is applied over this string, which allows for finding triples given any constant and position (or sequence of constants and positions), and for traversing to other elements of a triple in any direction. The result is a bidirectional circular index that covers all triple permutations with one index that encodes the graph and requires sub-linear space additional to the graph. For basic graph pattern queries, a variant of Leapfrog TrieJoin is implemented, offering worst-case optimal joins.

A.2 Distributed RDF Engines
We now survey distributed RDF stores. Table 5 summarizes the surveyed systems and the techniques they use. We further indicate the type of underlying storage used, where italicized entries refer to local stores. Some systems that appear in the following may have appeared before in the local discussion if they are commonly deployed in both settings.
YARS2 [94] (2007) is an RDF store based on similar principles to YARS (see local stores) but for a distributed environment. The index manager in YARS2 uses three types of indexes, namely a quad index, a keyword index, and join indexes, for evaluating queries. The quad indexes cover six permutations of quads. The keyword index is used for keyword lookups.
The join indexes help speed up query execution for common joins. The core index on quads is based on hashing the first element of the permutation, except in the case of predicates (e.g., for a POGS permutation), where hashing creates skew and leads to imbalance, and where random distribution is thus used. Index nested-loop joins are used, with triple patterns being evaluated on one machine where possible (based on hashing), or otherwise on all machines in parallel (e.g., for constant predicates or keyword searches). Dynamic programming is used for join reordering in order to optimize the query.
Clustered TDB [173] (2008) is a distributed RDF store based on Jena TDB storage (a local system). The system is based on a master-slave architecture where the master receives and processes queries, and slaves index parts of the graph and can perform joins. Hash-based partitioning is used to allocate dictionary-encoded triples to individual slaves based on each position of the triple; more specifically, distributed SPO, POS and OSP index permutations are partitioned based on S, P and O, respectively. An exception list is used for very frequent predicates, which are partitioned by PO instead of P. Index-nested loop joins are supported and used to evaluate SPARQL basic graph patterns.
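Position-based hash partitioning of this kind can be sketched in Python (a simplified illustration of the routing idea; worker counts, the hash function, and the exception-list handling are assumptions for the example):

```python
# Route each index permutation by a different triple position: SPO by
# subject, POS by predicate, OSP by object, so any triple pattern with at
# least one constant can be sent to a single worker.
import zlib

NUM_WORKERS = 4

def worker_for(term):
    # Use a stable hash: Python's built-in hash() is salted per process.
    return zlib.crc32(term.encode()) % NUM_WORKERS

def place(triple):
    s, p, o = triple
    return {"spo": worker_for(s), "pos": worker_for(p), "osp": worker_for(o)}

placement = place(("alice", "knows", "bob"))
```

A pattern such as (alice, ?p, ?o) is then routed only to worker `placement["spo"]`, while very frequent predicates would, per the exception list above, be partitioned by PO to avoid overloading a single worker.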
Virtuoso EE [69] (2008) is a local RDF store whose enterprise edition also offers support for indexing over a cluster of machines. Recalling that Virtuoso stores RDF graphs as a quads table in a custom relational database, the most recent version of Virtuoso offers three options for each table: partitioned, replicated or local. Partitioning is based on partition columns specified by the administrator, which are used for hash-based partitioning; partitions can also be replicated, if specified. Replication copies the full table to each machine, which can be used for query-based partitioning, or to store a global schema that is frequently accessed by queries. Local tables are only accessible to the individual machine, and are typically used for local configuration.
4store [91] (2009) stores quads over a cluster of machines, where subject-based hash partitioning is used. Three types of indexes are used in 4store, namely R, M and P indexes. The R index is a hash table that dictionary encodes and stores metadata about individual RDF terms (called "resources"). The M index is a hash table that maps graph names (called "models") to the corresponding triples in the named graph. The P indexes consist of radix tries, with two for each predicate (similar to vertical partitioning): one in SOG order and another in OSG order. Joins are pushed, where possible, to individual machines. Join reordering uses cardinality estimations. SPARQL queries are supported.
Blazegraph [224] (2009), discussed previously as a local store, also features partitioning in the form of key-range shards that allow for partitioning B+tree indexes, potentially across multiple machines. An alternative replication cluster is supported that indexes the full RDF graph or SPARQL dataset on each machine, allowing queries to be evaluated entirely on each machine without network communication.
SHARD [195] (2009) is a distributed, Hadoop-based RDF store. It stores an RDF graph in flat files on HDFS such that each line represents a given subject resource together with all the triples in which it appears as subject, which can be seen as an adjacency list. The graph is hash partitioned, so that every partition contains a distinct set of triples. As the focus is on batch processing of joins, rather than evaluating queries in real-time, no specific indexing is employed in SHARD. Query execution is performed through MapReduce iterations: the results for individual subqueries are first collected, then joined, and finally filtered according to bound variables and to remove redundant (duplicate) results.
AllegroGraph (2010), discussed previously as a local store, offers a distributed version where data are horizontally partitioned into shards, which are indexed locally on each machine as in the local version. Alongside these shards, "knowledge bases" can be stored, consisting of triples that are often accessed by all shards (e.g., schema or other high-level data), such that queries can be evaluated (in a federated manner) over one shard and potentially several knowledge bases.
GraphDB [125,33] (2010), also a local store, offers an enterprise edition that can store RDF graphs on a cluster of machines using a master-slave architecture. Each cluster has at least one master node that manages one or more worker nodes, each of which replicates the full database, thus allowing any query to be evaluated in full on any machine. Updates are coordinated through the master.
AnzoGraph15 (2011) is an in-memory, massively parallel processing (MPP) RDF store based on a master-slave architecture. The system indexes named graphs, where partitioning and replication are also organized by named graphs. By default, all triples involving a particular term are added into a named graph for that term. A dictionary is provided to map terms to named graphs. Queries are issued at a master node, which features a query planner that decides the type of join (hash or merge joins are supported) or aggregation needed. Individual operations are then processed over the slaves in parallel, generating a stream of intermediate results that are combined on the master.

15 https://docs.cambridgesemantics.com/anzograph/userdoc/features.htm

CumulusRDF [131] (2011) works on top of Apache Cassandra: a distributed key-value store with support for tabular data. Three triple permutations - SPO, POS, OSP - and one quad permutation - GSPO - are considered. A natural idea would be to index the first element as the row key (e.g., S for SPO), the second (e.g., P) as the column key, and the third (e.g., O) as the cell value, but this would not work in multi-valued cases as columns are unique per row. Two other data storage layouts are thus proposed. Taking SPO, the "hierarchical layout" stores S as the row key (hashed and used for partitioning), P as the supercolumn key (sorted), and O as the column key (sorted), with the cell left blank. An alternative that outperforms the hierarchical layout is the "flat layout", where for SPO, S remains the row key, but PO is concatenated as the column key, and the cell is left blank. In the POS permutation, the P row key may create a massive row; hence PO is rather used as the row key, with P being indexed separately. Join and query processing is enabled through Sesame.
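CumulusRDF's flat layout can be sketched over a simplified row/column model (plain dicts standing in for Cassandra rows; names and the tuple-based column key are illustrative):

```python
# Sketch of CumulusRDF's "flat layout" for the SPO permutation: the subject
# is the row key (hash-partitioned), predicate and object are concatenated
# into the column key, and the cell value stays blank.

from collections import defaultdict

rows = defaultdict(dict)  # row key -> {column key: cell value}

def insert_spo(s, p, o):
    rows[s][(p, o)] = b""  # blank cell: the column key itself carries the data

def lookup(s, p=None):
    """Triple patterns (s, ?, ?) and (s, p, ?) become row/column-range scans."""
    for (cp, co) in sorted(rows.get(s, {})):
        if p is None or cp == p:
            yield (s, cp, co)

insert_spo("ex:Alice", "foaf:knows", "ex:Bob")
insert_spo("ex:Alice", "foaf:knows", "ex:Carol")
insert_spo("ex:Alice", "foaf:name", '"Alice"')
print(list(lookup("ex:Alice", "foaf:knows")))  # two triples, one row scan
```

Because columns within a row are sorted, the (s, p, ?) pattern becomes a contiguous column-range scan over a single row, even when the predicate is multi-valued.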
H-RDF-3X [104] (2011) is a Hadoop-based RDF store that uses RDF-3X on a cluster of machines. A graph-based partitioning (using the METIS software package) is used to distribute triples among multiple worker nodes. It also employs a k-hop guarantee, which involves replicating nodes and edges that are k hops away from a given partition, thus increasing the amount of processing that can be performed locally and reducing communication costs. Local joins are optimized and evaluated on individual machines by RDF-3X, while joins across machines are evaluated using Hadoop. The use of Hadoop - which involves expensive coordination across machines and heavy use of the disk - is minimized by leveraging the k-hop guarantee and other heuristics.
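The k-hop replication can be sketched as follows (an illustrative sketch; H-RDF-3X actually operates over METIS partitions of dictionary-encoded triples):

```python
# Sketch of the k-hop guarantee used by H-RDF-3X: starting from the nodes
# assigned to a partition, replicate every triple reachable within k hops,
# so any k-hop query rooted in the partition runs without communication.

def k_hop_expand(triples, core_nodes, k):
    """Return the triples a partition must hold for a k-hop guarantee."""
    frontier, held = set(core_nodes), set()
    for _ in range(k):
        step = {(s, p, o) for (s, p, o) in triples
                if s in frontier or o in frontier}
        held |= step
        frontier = {x for (s, p, o) in step for x in (s, o)}
    return held

g = [("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d")]
print(k_hop_expand(g, {"a"}, 1))  # only the edge touching "a"
print(k_hop_expand(g, {"a"}, 2))  # additionally the edge from "b" to "c"
```

Larger k means fewer Hadoop-coordinated cross-machine joins, at the cost of replicating exponentially more triples per partition.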
PigSPARQL [202] (2011) is a Hadoop-based RDF store that uses a vertical partitioning strategy. Data are stored on HDFS without indexes, and thus the focus is on batch processing. SPARQL queries are translated into PigLatin: an SQL-inspired scripting language that can be compiled into Hadoop tasks by the Pig framework. The Jena ARQ library is used to parse SPARQL queries into an algebra tree, where optimizations for filters and selectivity-based join reordering are applied. The tree is traversed in a bottom-up manner to generate PigLatin expressions for every SPARQL algebra operator. The resulting PigLatin script is then translated to - and run as - MapReduce jobs on Hadoop.
Rapid+ [193] (2011) is a Hadoop-based system that uses a vertical partitioning strategy for storing RDF data. Without indexing, the system targets batch processing. Specifically, Pig is used to generate and access tables under a vertical partitioning strategy. In order to translate SPARQL queries into PigLatin scripts, user-defined functions are implemented that allow for optimizing common operations, such as loading and filtering in one step. Other optimizations include support for star joins using grouping, and a look-ahead heuristic that reduces and prepares intermediate results for operations that follow; both aim to reduce the number of Hadoop tasks needed to evaluate a query.
AMADA [17] (2012) is an RDF store based on the Amazon Web Services (AWS) cloud infrastructure. Indexes for the RDF graph are built using Amazon SimpleDB: a key-value storage solution that supports a subset of SQL. SimpleDB offers several indexing strategies, where "attribute indexing" can be used to create three indexes for the three elements of a triple. In AMADA, a query is submitted to a query processing module running on EC2, which in turn evaluates triple patterns using the SimpleDB-based indexes.
H2RDF(+) [178,177] (2012) stores RDF graphs using the HBase distributed tabular NoSQL store. Three triple permutations (SPO, POS, and OSP) are created over HBase tables in the form of key-value pairs. A join executor module creates the query plan, which decides between the execution of joins in a centralized (local) and distributed (Hadoop-based) manner. It further reorders joins according to selectivity statistics. Multiway (sort-)merge joins are run in Hadoop.
Jena-HBase [121] (2012) (also known as HBase-RDF16) is a distributed RDF store using HBase as its back-end. Jena-HBase supports three basic storage layouts for RDF graphs in HBase, namely "simple": three triple tables, the first indexed and partitioned by S, the second by P, the third by O; "vertical partitioning": two tables for each predicate, one indexed by S, the other by O; and "indexed": six triple tables covering all permutations of a triple. Hybrid layouts are also proposed that combine the basic layouts, and are shown to offer better query times at the cost of additional space. Jena is used to process joins and queries.
Rya [189] (2012) is a distributed RDF store that employs Accumulo - a key-value and tabular store - as its back-end, though it can also use other NoSQL stores as its storage component. Rya stores three index permutations, namely SPO, POS, and OSP. Query processing is based on RDF4J, with index-nested loop joins being evaluated in a MapReduce fashion. Counts of distinct subjects, predicates, and objects are maintained and used during join reordering and query optimization.
Sedge [254] (2012) is an RDF store based on Pregel: a distributed (vertex-centric) graph processing framework. Pregel typically assumes a strict partition of the nodes in a graph, where Sedge relaxes this assumption to permit nodes to coexist in multiple partitions. A complementary graph partitioning approach is proposed involving two graph partitionings, where the cross-partition edges of one are contained within a partition of the other, reducing cross-partition joins. Workload-aware query-based partitioning is also proposed, where commonly accessed partitions and frequently-queried cross-partition "hotspots" are replicated. The store is implemented over Pregel, where indexes are built to map partitions to their workloads and to their replicas, and to map nodes to their primary partitions.

16 https://github.com/castagna/hbase-rdf
chameleon-db [11] (2013) is a distributed RDF store using custom graph-based storage. Partitioning is graph-based and is informed by the queries processed, which may lead to dynamic repartitioning to optimize for the workload being observed. An incremental indexing technique - using a decision tree - is used to keep track of partitions relevant to queries. It also uses a hash table to index the nodes in each partition, and a range index to keep track of the minimum and maximum values for literals of each distinct predicate in each partition. The evaluation of basic graph patterns is delegated to a subgraph matching algorithm over individual partitions, whose results are then combined in a query processor per the standard relational algebra. Optimizations involve rewriting rules that preserve the equivalence of the query but reduce intermediate results.

Trinity.RDF [258] (2013) is an RDF store implemented on top of Trinity: a distributed memory-based key-value storage system. A graph-based storage scheme is used, where an inward and outward adjacency list is indexed for each node. Hash-based partitioning is then applied on each node such that the adjacency lists for a given node can be retrieved from a single machine; however, nodes with a number of triples/edges exceeding a threshold may have their adjacency lists further partitioned. Aside from sorting adjacency lists, a global predicate index is also generated, covering the POS and PSO triple permutations. Queries are processed through graph exploration, with dynamic programming over cardinality estimates used to choose a query plan.
TripleRush [220] (2013) is based on the Signal/Collect distributed graph processing framework [219]. In this framework, TripleRush considers an in-memory graph with three types of nodes. Triple nodes embed an RDF triple with its subject, predicate, and object. Index nodes embed a triple pattern. Query nodes coordinate the query execution. The index graph is formed by index and triple nodes, which are linked based on matches. Query execution is initiated when a query node is added to the TripleRush graph. The query node emits a query particle (a message), which is routed by the Signal/Collect framework to index nodes for matching. Partitioning of triples and triple patterns is based on the order S, O, P, where the first constant in this order is used for hash-based partitioning. Later extensions explored workload-aware query-based partitioning methods [226].
WARP [103] (2013) uses RDF-3X to store triples in partitions among a cluster of machines. Like H-RDF-3X, graph-based partitioning is applied along with a replication strategy for k-hop guarantees. Unlike H-RDF-3X, WARP proposes a query-based, workload-aware partitioning, whereby the value of k is kept low, and selective replication is used to provide guarantees specifically with respect to the queries of the workload, reducing storage overheads. Sub-queries that can be evaluated on one node are identified and evaluated locally, with custom merge joins (rather than Hadoop, as in the case of H-RDF-3X) used across nodes. Joins are reordered to minimize the number of single-node subqueries.

Partout [73] (2014) is a distributed store that uses RDF-3X for underlying storage on each machine. The RDF graph is partitioned using a workload-aware query-based partitioning technique, aiming to group together triples that are likely to be queried together. Each partition is indexed using standard RDF-3X indexing. A SPARQL query is issued to a query processing master, which uses RDF-3X to generate a suitable query plan according to a global statistics file. The local execution plan of RDF-3X is transformed into a distributed plan, which is then refined by a distributed cost model that assigns subqueries to partitions. This query plan is executed by slave machines in parallel, whose results are combined in the master.

[79] (2014) is a distributed RDF store built on HBase. Triples are distributed according to six triple permutations - partitioning on S, P, O, SP, SO, PO - enabling lookups for any triple pattern. In order to reduce network communication, Bloom filters are pre-computed for each individual variable of each triple pattern with at least one constant and one variable that produces some result; e.g., for SP, a Bloom filter is generated encoding the objects of each subject-predicate pair; for S, a Bloom filter is generated for each subject encoding its predicates, and optionally, another Bloom filter is generated for its objects. These Bloom filters are sent over the network in order to compute approximate semi-join reductions, i.e., to filter incompatible results before they are sent over the network. SPARQL (1.0) queries are evaluated by translating them to PigLatin scripts, which are compiled into Hadoop jobs by Pig.
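The approximate semi-join reduction can be sketched with a toy Bloom filter (filter size, hash scheme, and names are illustrative assumptions, not the system's actual implementation):

```python
# Sketch of a Bloom-filter-based semi-join reduction: one machine summarizes
# its join keys in a compact bit array; the other drops tuples whose key
# cannot be in the filter before shipping them over the network.

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# objects of ex:knows on machine A, summarized and sent to machine B
bf = BloomFilter()
for obj in ["ex:Bob", "ex:Carol"]:
    bf.add(obj)

# machine B filters its candidates before the network shuffle;
# false positives are possible, missed matches are not
candidates = ["ex:Bob", "ex:Dave", "ex:Carol", "ex:Eve"]
shipped = [c for c in candidates if bf.might_contain(c)]
print(shipped)  # at least ex:Bob and ex:Carol survive
```

The filter is far smaller than the key set it summarizes, so the cost of shipping it is easily repaid by the tuples it prunes before the join.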

RDF-3X-MPI [54] (2014) is a distributed RDF store built on top of RDF-3X and the Message Passing Interface (MPI). After dictionary-encoding the triples, they are initially partitioned based on hashes of graph nodes, where the partitions are extended to ensure an n-hop guarantee: i.e., that any node reachable in n hops from a node assigned to a partition will also be available in the same partition. The partitions are stored in RDF-3X on each machine, and basic graph patterns are evaluated independently on each partition (it is assumed that the value of n is sufficient to enable this, with other queries left for future work).
Sempala [203] (2014) stores RDF triples in a distributed setting, using the columnar Parquet format for HDFS that supports queries for specific columns of a given row (without having to read the full row). In this sense, Parquet is designed for supporting a single, wide (potentially sparse) table and thus Sempala uses a single "unified property table" for storing RDF triples with their original string values; multi-valued properties are stored using additional rows that correspond to a Cartesian product of all values for the properties of the entity. SPARQL queries are translated into SQL, which is executed over the unified property table using Apache Impala: a massively parallel processing (MPP) SQL engine that runs over data stored in HDFS.
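The Cartesian-product expansion of multi-valued properties can be sketched as follows (illustrative names; a simplification of Sempala's unified property table, not its actual code):

```python
# Sketch: one wide row per subject in a unified property table, with
# multi-valued properties expanded into the Cartesian product of rows.

from itertools import product

def property_table_rows(subject, prop_values):
    """prop_values: property -> list of values; yields one dict per row."""
    props = sorted(prop_values)
    for combo in product(*(prop_values[p] for p in props)):
        yield {"s": subject, **dict(zip(props, combo))}

rows = list(property_table_rows("ex:Alice",
                                {"foaf:mbox": ["a@x.org", "a@y.org"],
                                 "foaf:name": ['"Alice"']}))
print(len(rows))  # 2 rows: one per foaf:mbox value
```

Star-shaped patterns over one subject then become single-row lookups in the wide table, at the cost of row duplication when several properties are multi-valued.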
SemStore [244] (2014) is a distributed RDF store with a master-slave architecture. A custom form of graph partitioning is used to localize the evaluation of subqueries of particular patterns -star, chain, tree, or cycle -that form the most frequent elements of basic graph patterns. A k-means partitioning algorithm is used to assign related instances of patterns to a particular machine, further increasing locality. The master creates a global bitmap index over the partitions and collects global cardinality-based statistics. Slave nodes use the TripleBit local RDF engine for storage, indexing and query processing. The master node then generates the query plan using dynamic programming and global cardinality statistics, pushing joins (subqueries) to individual slave nodes where possible.
SparkRDF [51] (2014) is a Spark-based RDF engine that distributes the graph into subgraphs using vertical partitioning, adding tables for classes as well as properties. SparkRDF then creates indexes over the class and property tables, and further indexes class-property, property-class, and class-property-class joins. These indexes are loaded into an in-memory data structure in Spark (a specialized RDD) that implements query processing functionalities such as joins, filters, etc. Class information is used to filter possible results for individual variables, where a greedy selectivity-based strategy is used for reordering joins. Joins themselves are evaluated in a MapReduce fashion.
TrIAD [86] (2014) is an in-memory distributed RDF store based on a master-slave architecture. The master maintains a dictionary of terms, a graph summary that allows for pruning intermediate results, as well as global cardinality-based statistics that allow for query planning. The graph summary is a quotient graph computed using METIS' graph partitioning: each partition forms a supernode, and labeled edges between supernodes denote triples that connect nodes in different partitions; the graph summary is indexed in two permutations: PSO and POS. The triples for each partition are stored on a slave; triples connecting two partitions are stored on both slaves. Each slave indexes its subgraph in all six triple permutations. Given a basic graph pattern, the graph summary is used to identify relevant partitions, which are shared with the slaves and used to prune results; dynamic programming uses the global statistics to optimize the query plan. Alongside distributed hash and merge joins, an asynchronous join algorithm using message passing is implemented.
CK15 [53] (2015) is a distributed in-memory RDF store that combines two types of partitioning: triple-based partitioning and query-based partitioning. The graph is initially divided over the machines into equal-size chunks and dictionary-encoded in a distributed manner (using hash-based partitioning of terms). The encoded triples on each machine are then stored using a vertical partitioning scheme, where each table is indexed by hashing on subject, and on object, providing P → SO, PS → O and PO → S lookups. Parallel hash joins are proposed. Secondary indexes are then used to cache intermediate results received from other machines while processing queries, such that they can be reused for future queries. These secondary indexes can also be used for computing semi-join reductions on individual machines, thus reducing network traffic.
CliqueSquare [75] (2015) is a Hadoop-based RDF engine for storing and processing massive RDF graphs. It stores RDF data in a vertical partitioning scheme using semantic hash partitioning, with the objective of enabling co-located or partitioned joins that can be evaluated in the map phase of the MapReduce paradigm. CliqueSquare also maintains three replicas for fast query processing and increased data locality. In order to evaluate SPARQL queries, CliqueSquare uses a clique-based algorithm, which works iteratively to identify cliques in a query-variable graph and to collapse them by evaluating joins on the common variables of each clique. The process terminates when the query-variable graph consists of only one node.
DREAM [88] (2015) is a distributed store using RDF-3X for its underlying storage and indexing. The entire RDF graph is replicated on every machine, with standard RDF-3X indexing and query processing being applied locally. To reduce communication, dictionary-encoded terms are shared within the cluster. In the query execution phase, the SPARQL query is initially represented as a directed graph, which is divided into multiple subqueries to be evaluated by different machines. The results of subqueries are combined using hash joins and eventually dictionary-decoded.
AdPart [89] (2016) is a distributed in-memory RDF store following a master-slave architecture. The master initially performs a hash-based partitioning based on the subjects of triples. Each slave stores the corresponding triples using an in-memory data structure. Within each slave, AdPart indexes triples by predicate, predicate-subject, and predicate-object. Each slave machine also maintains a replica index that incrementally replicates data accessed by many queries; details of this replication are further indexed by the master machine. Query planning then tries to push joins locally to slaves (hash joins are used locally), falling back to distributed semi-joins when not possible. Join reordering then takes communication costs and cardinalities into account.
DiploCloud [248] (2016) is a distributed version of the local RDF store dipLODocus. The store follows a master-slave architecture, where slaves store "molecules" (see the previous discussion on dipLODocus). The master provides indexes for a dictionary, for the class hierarchy (used for inference), as well as an index that maps the individual values of properties selected by the administrator to their molecule. Each slave stores the molecule subgraphs, along with an index mapping nodes to molecules, and classes to nodes. Query processing pushes joins where possible to individual slaves; if intermediate results are few, the master combines results, or otherwise a distributed hash join is employed. Molecules can be defined as a k-hop subgraph around the root node, based on input from an administrator, or based on a given workload of queries.
Dydra [15,14] (2016) is an RDF store that can leverage both local and remote storage, and provides support for versioned RDF graphs. In terms of local storage, RDF data are dictionary encoded and indexed in six permutations of quad tables - namely GSPO, GPOS, GOSP, SPOG, POSG, OPSG - using on-disk B+trees. These B+trees offer support for both static and streaming data, and further capture information about revisions, enabling versioned queries and other RDF archival features. Support for replication through convergent replicated data types (CvRDTs) is also proposed [14]. A SPARQL query processor is layered on top of storage, providing support for SPARQL 1.1 queries and updates. Dydra further offers a multi-tenant cloud-based storage service.

JARS [192] (2016) is a distributed RDF store that combines triple-based and query-based partitioning. The graph is partitioned by hashing on subject, and hashing on object, constructing two distributed triple tables. The subject-hashed table is indexed on the POS, PSO, OSP and SPO permutations, while the object-hashed table is indexed on POS, PSO, SOP and OPS. Specifically, by hashing each triple on subject and object, the data for S-S, O-O and S-O joins reside on one machine; the permutations then allow for such joins to be supported as merge joins on each machine. Basic graph patterns are then decomposed into subqueries answerable on a single machine, with a distributed hash join applied over the results. Jena ARQ is used to support SPARQL.
S2RDF [204] (2016) is a distributed RDF store based on HDFS (with Parquet). The storage scheme is based on an extended version of vertical partitioning with semi-join reductions (see Section 4.3). This scheme has a high space overhead, but ensures that only data useful for a particular (pairwise) join will be communicated over the network. In order to reduce the overhead, semi-join tables are not stored in cases where the selectivity of the join is high; in other words, semi-join tables are stored only when many triples are filtered by the semi-join (the authors propose a threshold of 0.25, meaning that at least 75% of the triples must be filtered by the semi-join for the table to be included). SPARQL queries are optimized with cardinality-based join reordering, and then translated into SQL and evaluated using Spark.
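The selectivity threshold can be illustrated with a toy subject-subject semi-join (a sketch under assumed table layouts, not S2RDF's implementation):

```python
# Sketch of S2RDF's extended vertical partitioning (ExtVP): for a pair of
# predicates, precompute the semi-join reduction of VP[p1] by VP[p2] on the
# subject column, and materialize it only if it filters enough triples
# (selectivity at or below the 0.25 threshold proposed by the authors).

THRESHOLD = 0.25

def ext_vp_ss(vp1, vp2):
    """Subject-subject semi-join of two vertical-partition tables [(s, o)]."""
    subjects2 = {s for (s, _) in vp2}
    reduced = [(s, o) for (s, o) in vp1 if s in subjects2]
    selectivity = len(reduced) / len(vp1) if vp1 else 0.0
    # store the table only when most triples are filtered out
    return reduced if selectivity <= THRESHOLD else None

follows = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u5")]
likes = [("u1", "post9")]
print(ext_vp_ss(follows, likes))  # kept: only 1/4 of `follows` survives
```

At query time, a join between the two predicates can read the precomputed reduced table instead of the full vertical partition, shipping only rows that are guaranteed to have a join partner.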
S2X [201] (2016) runs SPARQL queries over RDF graphs using GraphX: a distributed graph processing framework built on top of Spark. The triples are stored in-memory on different slave machines with Spark (as RDDs), applying hash-based partitioning on subjects and objects (per GraphX's default partitioner). S2X does not maintain any custom indexing. For SPARQL query processing, graph pattern matching is combined with relational operators (implemented in the Spark API) to produce solution mappings.
SPARQLGX [76] (2016) stores RDF data on HDFS per a vertical partitioning scheme. A separate file is created for each unique predicate in the RDF graph, with each file containing the subjects and objects of the triples with that predicate. No indexes are provided, and thus the system is intended for running joins in batch mode. SPARQL queries are first optimized by applying a greedy join reordering based on cardinality and selectivity statistics; the query plan is then translated into Scala code, which is directly executed by Spark.
Wukong [211] (2016) stores RDF graphs in DrTM-KV: a distributed key-value store using "remote direct memory access" (RDMA), which enables machines to access the main memory of another machine in the same cluster while bypassing the remote CPU and OS kernel. Within this store, Wukong maintains three kinds of indexes: a node index that maps subjects or (non-class) objects to their corresponding triples; a predicate index, which returns all subjects and objects of triples with a given predicate; and a type index, which returns the class(es) to which a node belongs. Hash-based partitioning is used for the node index, while predicate and type indexes are split and replicated to improve balancing. A graph-traversal mechanism is used to evaluate basic graph patterns, where solutions are incrementally extended or pruned. For queries involving fewer data, the data are fetched from each machine on the cluster and joined centrally; for queries involving more data, subqueries are pushed in parallel to individual machines. A work-stealing mechanism is employed to provide better load balancing while processing queries.

Koral [112] (2017) is a distributed RDF store based on a modular master-slave architecture that supports various options for each component of the system. Among these alternatives, various triple-based and graph-based partitioning schemes are supported. In order to evaluate basic graph patterns, joins are processed in an analogous way to TrIAD, using asynchronous execution, which makes the join processing strategy independent of the partitioning chosen. The overall focus of the system is to be able to quickly evaluate different alternatives for individual components - particularly partitioning strategies - in a distributed RDF store.

Stylus [96] (2017) is a distributed RDF store using Trinity: a graph engine based on an in-memory key-value store. Terms of the RDF graph are dictionary encoded. Each subject and object node is associated with a dictionary identifier and its characteristic set. A sorted adjacency list (for inward and outward edges) is then stored for each node that also encodes an identifier for the characteristic set of the node. Schema-level indexes for characteristic sets are replicated on each machine. Hash-based partitioning is employed on the data level. Indexes are used to efficiently find characteristic sets that contain a given set of properties, as well as to evaluate common triple patterns.

Wukong+G [235] (2018) extends the distributed RDF store Wukong [211] in order to additionally exploit GPUs (as well as CPUs) for processing queries in a distributed environment. One of the main design emphases of the system is to ensure that large RDF graphs can be processed efficiently on GPUs by ensuring effective use of the memory available, noting in particular that the local memory of GPUs has a much higher bandwidth for reading data into the GPU's cores, but a much lower capacity than is typical for CPU RAM. Wukong+G thus employs a range of memory-oriented optimizations involving prefetching, pipelining, swapping, etc., to ensure efficient memory access when processing queries on the GPU. A graph partitioning algorithm is further employed to distribute storage, where lower-cost queries are processed on CPU (as per Wukong), but heavier loads are delegated to GPUs, where (like Wukong) efficient communication is implemented using RDMA primitives, allowing more direct access to remote CPU and GPU memory.

CM-Well
Akutan 18 (2019) (formerly known as Beam) is a distributed RDF store developed by eBay. Triple storage is implemented on top of RocksDB, with indexes provided on SP → O and OP → S. Triples are additionally associated with triple identifiers. Transactional logging is implemented using Apache Kafka, which coordinates read and write requests across machines. A SPARQL(-like) query processor is then layered on top of the underlying storage layer, which includes an optimizer that leverages statistics about the data to reorder joins. Hash joins and nested-loop joins are supported, and selected, as appropriate, by the query planner. Queries are then processed in streams and/or batches. A limited form of inference based on transitive closure is also supported.
DiStRDF [239] (2019) is a massively parallel processing (MPP) RDF store based on Spark with support for spatiotemporal queries. A special dictionary-encoding mechanism is used where the identifier concatenates a bit-string for spatial information, a bit-string for temporal information, and a final bit-string to ensure that the overall identifier is unique. Thus spatial and temporal processing can be applied directly over the identifiers. Storage based on both a triple table and property tables is supported, where range-based partitioning is applied to the triples (based on the spatio-temporal information). Data is stored on HDFS in CSV or Parquet formats. Query processing is implemented in Spark. Distributed hash joins and sort-merge joins are supported; selections and projections are also supported. Three types of query plans are proposed that apply RDF-based selections, spatio-temporal selections and joins in different orders.
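The bit-string concatenation can be sketched as follows (the field widths and names are assumptions for illustration, not DiStRDF's actual layout):

```python
# Sketch of a DiStRDF-style identifier: a spatial cell, a temporal bucket
# and a unique counter are concatenated into one integer, so spatial and
# temporal range predicates can be evaluated on the identifier alone.

SPATIAL_BITS, TEMPORAL_BITS, UNIQUE_BITS = 16, 16, 32  # assumed widths

def encode(spatial_cell, time_bucket, counter):
    return (spatial_cell << (TEMPORAL_BITS + UNIQUE_BITS)) \
         | (time_bucket << UNIQUE_BITS) | counter

def spatial_part(ident):
    return ident >> (TEMPORAL_BITS + UNIQUE_BITS)

def temporal_part(ident):
    return (ident >> UNIQUE_BITS) & ((1 << TEMPORAL_BITS) - 1)

ident = encode(spatial_cell=7, time_bucket=42, counter=1001)
print(spatial_part(ident), temporal_part(ident))  # 7 42
```

A spatio-temporal selection then reduces to masking and comparing integers, with no dictionary lookup needed before filtering.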
gStore-D2 [181] (2019) is a distributed RDF store using workload-aware graph partitioning methods. Frequently accessed (subgraph) patterns are mined from the workload, where all subjects and objects are mapped to variables. Subgraphs that instantiate these patterns are assigned DFS codes that are indexed as a tree, and associated with various metadata, including identifiers for queries that use the pattern, cardinality estimations, partition identifiers, etc. Three partitioning methods are based on these patterns, with partitions stored locally in gStore. "Vertical partitioning" indexes all instances of a given pattern on the same machine. "Horizontal partitioning" distributes instances of the same pattern across various machines based on its constants. "Mixed partitioning" combines the two. Basic graph patterns are decomposed into frequent sub-patterns, where the join order and algorithms are selected to reduce communication costs.
Leon [83] (2019) is an in-memory distributed RDF store based on a master-slave architecture. Triples are partitioned based on the characteristic set of their subject; the characteristic sets are ordered in terms of the number of triples they induce, and assigned to machines with the goal of keeping a good balance. Indexes (similar to those of Stylus [96]) are built, including a bidirectional index between subjects and their characteristic sets, an index to find characteristic sets that contain a given set of properties, and indexes to evaluate certain triple patterns. A multi-query optimization technique is implemented where, given a workload (a set) of queries, the method searches for an effective way to evaluate and share the results for common subqueries - in this case, based on characteristic sets - across queries.

StarMR [236] (2019) is a distributed RDF store that centers around optimizations for star joins. A graph-based storage scheme is employed, where for each node in the graph, its outward edges are represented in an adjacency list; this then supports efficient evaluation for S-S star joins. No indexing is provided, where the system targets batch-based (e.g., analytical) processing. A basic graph pattern is then decomposed into (star-shaped) sub-patterns, which are evaluated and joined. Hadoop is then used to join the results of these individual sub-patterns. Optimizations include the use of characteristic sets to help filter results, and the postponement of Cartesian products, which are used to produce the partial solutions for star joins including the non-join variables; these partial solutions are not needed if the corresponding join value is filtered elsewhere.
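Characteristic sets, on which both Leon and StarMR rely, are simply the sets of outgoing predicates per subject; a minimal sketch of computing and grouping them:

```python
# Sketch: computing characteristic sets (the set of predicates on outgoing
# edges of each subject) and grouping subjects that share one, as used for
# partitioning (Leon) and result filtering (StarMR).

from collections import defaultdict

def characteristic_sets(triples):
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    # group subjects sharing the same characteristic set
    groups = defaultdict(list)
    for s, ps in preds.items():
        groups[frozenset(ps)].append(s)
    return groups

g = [("a", "type", "Person"), ("a", "name", '"A"'),
     ("b", "type", "Person"), ("b", "name", '"B"'),
     ("c", "type", "City")]
cs = characteristic_sets(g)
print(len(cs))  # 2 characteristic sets: {type, name} and {type}
```

A star pattern requiring predicates {type, name} on one subject can then skip every subject whose characteristic set lacks either predicate, before any join is evaluated.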

SPT+VP
DISE [107] (2020) is an in-memory, distributed RDF store that conceptualizes an RDF graph as a 3-dimensional binary tensor, similar to local approaches such as BitMat; however, the physical representation and storage are based on dictionary-encoded triples. Partitioning is based on slicing the tensor, which is equivalent to a triple-based partitioning. Joins are evaluated starting with the triple pattern with the fewest variables. SPARQL queries are supported through the Jena (ARQ) query library and evaluated using Spark.
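As a rough single-machine illustration of this tensor view (a sketch under assumed encodings, not DISE's code), dictionary-encoded triples can be seen as the non-zero cells of a 3-dimensional binary tensor, where matching a triple pattern corresponds to slicing along the bound dimensions:

```python
def dict_encode(triples):
    """Dictionary-encode RDF terms as integers; the resulting id
    triples are the non-zero cells of a 3-D binary tensor."""
    ids = {}
    def enc(term):
        return ids.setdefault(term, len(ids))
    cells = {(enc(s), enc(p), enc(o)) for s, p, o in triples}
    return cells, ids

def slice_cells(cells, s=None, p=None, o=None):
    """Evaluate a triple pattern by slicing the tensor: keep cells
    that agree with the bound coordinates (None marks a variable)."""
    return {(cs, cp, co) for cs, cp, co in cells
            if (s is None or cs == s)
            and (p is None or cp == p)
            and (o is None or co == o)}

def order_patterns(patterns):
    """DISE-style join-ordering heuristic: start from the pattern
    with the fewest variables (i.e., the most bound coordinates)."""
    return sorted(patterns, key=lambda pat: sum(x is None for x in pat))
```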
DP2RPQ [237] (2020) is an RDF store built on a distributed graph processing framework with support for regular path queries (RPQs), which form the core of SPARQL's property paths. Unlike the standard RPQ semantics, the evaluation returns the "provenance" of the path, defined to be the subgraph induced by matching paths. Automata are used to represent the states and the potential transitions of paths while evaluating the RPQ, and are thus used to guide a navigation-based evaluation of the RPQ implemented by passing messages between nodes in the framework. Optimizations include methods to filter nodes and edges that cannot participate in the solutions to the RPQ, compression techniques on messages, as well as techniques to combine multiple messages into one. DP2RPQ is implemented on Spark's GraphX.
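The automaton-guided navigation can be sketched as a breadth-first search over the product of graph nodes and automaton states (a single-machine simplification of the message-passing scheme; the NFA encoding and function names are assumptions of ours):

```python
from collections import defaultdict, deque

def eval_rpq(edges, start, nfa, accept):
    """Navigation-based RPQ evaluation: BFS over (node, state) pairs.
    `nfa` maps (state, predicate) to a set of successor states;
    returns the nodes reachable from `start` along a path whose
    label sequence is accepted by the automaton (start state 0)."""
    adj = defaultdict(list)
    for s, p, o in edges:
        adj[s].append((p, o))
    seen = {(start, 0)}
    queue = deque(seen)
    results = set()
    while queue:
        node, state = queue.popleft()
        if state in accept:
            results.add(node)
        for p, nxt in adj[node]:
            for state2 in nfa.get((state, p), ()):
                if (nxt, state2) not in seen:
                    seen.add((nxt, state2))
                    queue.append((nxt, state2))
    return results
```

In the distributed setting, each `(node, state)` expansion step would instead be realized by a message sent to the neighboring node's vertex program.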
Triag [162] (2020) is a distributed RDF store that optimizes for triangle-based (sub)-patterns in queries. Two types of triangular RDF subgraphs are extracted using Spark: cyclic ones (e.g., (a, p, b), (b, q, c), (c, r, a)) and (directed) acyclic ones (e.g., (a, p, b), (b, q, c), (a, r, c)). The predicates of such subgraphs are extracted, ordered, hashed, and indexed in a distributed hash table using the predicate-based hash as key and the three nodes (e.g., a, b, c) as value. An encoding is used to ensure that the ordering of predicates is canonical for the pattern (assuming that nodes are variables) and that the subgraph can be reconstructed from the node ordering. Parallel versions of hash joins and nested loop joins are supported, where triangular subqueries can be pushed to the custom index. Queries are executed over Spark. Support for inferencing is also described.
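A simplified sketch of the triangle index follows (using a plain sorted-predicate hash in place of Triag's canonical encoding, which additionally distinguishes cyclic from acyclic orientations; all function names are illustrative):

```python
import hashlib
from collections import defaultdict

def triangle_key(p, q, r):
    """Canonical key for a triangular pattern: sorting the three
    predicates makes every rotation of the triangle hash to the
    same distributed-hash-table key."""
    canonical = "|".join(sorted([p, q, r]))
    return hashlib.sha1(canonical.encode()).hexdigest()

def find_cyclic_triangles(triples):
    """Extract cyclic triangles (a -p-> b, b -q-> c, c -r-> a) and
    emit (key, (a, b, c)) entries for the distributed index."""
    out = defaultdict(set)
    for s, p, o in triples:
        out[s].add((p, o))
    entries = []
    for a in out:
        for p, b in out[a]:
            for q, c in out.get(b, ()):
                for r, a2 in out.get(c, ()):
                    if a2 == a and a != b and b != c and c != a:
                        entries.append((triangle_key(p, q, r), (a, b, c)))
    return entries
```

A triangular subquery can then be answered by hashing its (ordered) predicates and probing the distributed hash table, rather than performing three joins.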
WISE [84] (2020) is a distributed RDF store using workload-aware query-based partitioning. The system follows a master-slave architecture. Queries processed by the master are also analyzed in terms of workload: common sub-patterns are extracted from a generalized version of the queries where constant subject and object nodes are first converted to variables. Query-based partitioning is applied so that common sub-patterns can be pushed to individual machines. Partitioning is dynamic, and may change as queries are received. A cost model is thus defined for the dynamic partitioning, taking into account the benefits of the change in partitioning, the cost of migrating data, and potential load imbalances caused by partition sizes; a greedy algorithm is then used to decide on which migrations to apply. The system uses Leon -an in-memory distributed RDF store discussed previously -for underlying storage and indexing.
gSmart [52] (2021) is a distributed RDF store that is capable of leveraging both GPUs and CPUs in a distributed setting. In order to take advantage of faster access for GPU memory despite its limited capacity, the LSpM storage system is used, which allows for loading compressed matrices for particular predicates and edge directions, as relevant for the query; matrices are encoded row-wise and column-wise, representing edge direction, in a compressed format, and can be partitioned for parallel computation. "Heavy queries" involving triple patterns with variable subjects and objects are then delegated to GPU computation, while "light queries" are run on CPU, where intermediate results are then combined to produce the final results on the CPU. Basic graph patterns are compiled into linear algebra operations that are efficiently computable on GPUs, with additional optimizations applied to process multi-way star joins.
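To illustrate the idea of compiling basic graph patterns into linear algebra (a pure-Python sketch on CPU; gSmart's actual kernels run over compressed per-predicate matrices on GPU):

```python
def predicate_matrix(triples, pred, ids):
    """Boolean adjacency matrix (list of lists) for one predicate:
    M[s][o] is True iff (s, pred, o) is in the graph."""
    n = len(ids)
    M = [[False] * n for _ in range(n)]
    for s, p, o in triples:
        if p == pred:
            M[ids[s]][ids[o]] = True
    return M

def bool_matmul(P, Q):
    """Boolean matrix product over the (OR, AND) semiring: the
    pattern (?x p ?y)(?y q ?z) evaluates to a matrix that is True
    at [x][z] iff some ?y connects x and z."""
    n = len(P)
    return [[any(P[i][k] and Q[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]
```

On a GPU, the same product is a dense or sparse matrix multiplication, which is exactly the kind of operation such systems delegate to hardware acceleration.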

A.3 Trends
We remark on some general trends based on the previous survey of local and distributed systems. In terms of local systems, earlier approaches were based on underlying relational stores, given that their implementations were already mature when interest began to coalesce around developing RDF stores. Thus, many of these earlier stores could be differentiated in terms of the relational schema (triple table, vertical partitioning, property tables, etc.) used to represent and encode RDF graphs. Systems that came later tended instead to build custom native storage solutions, optimizing for specific characteristics of RDF in terms of its graph structure, its fixed arity, etc.; exploiting the fixed arity, for example, native stores began to build complete indexes by default, allowing efficient lookups for any possible triple pattern. Also, many engines began to optimize for star joins, which are often used to reconstruct n-ary relations from RDF graphs. Engines would soon start to explore graph-inspired storage and indexing techniques, including structural indexes, compressed adjacency lists, etc. A more recent trend -likely following developments in terms of hardware -has been an increased focus on in-memory stores using compact representations, as well as compressed tensor-based representations of graphs that enable GPU-based hardware acceleration. Another recent development has been the application of worst-case optimal join algorithms for evaluating basic graph patterns, as well as techniques for translating queries into operations from linear algebra that can be efficiently evaluated on GPUs.
With respect to distributed RDF stores, in line with an increased demand for managing RDF graphs at very large scale, proposals began to emerge around 2007 regarding effective ways to store, index and query RDF over a cluster of machines.^19 Initial proposals were based on existing native stores, which were extended with triple/quad-based partitioning and distributed join processing techniques to exploit a cluster of machines. A second trend leveraged the maturation and popularity of "Big Data" platforms -including distributed processing frameworks like Hadoop and later Spark, and distributed NoSQL stores like Cassandra, HBase, MongoDB, etc. -in order to build distributed RDF stores. During this time, graph-based and later query-based partitioning methods began to emerge. As in the local case, more and more in-memory distributed RDF stores also appeared. Another trend was to explore the use of distributed graph processing frameworks -which offer a vertex-based computation and messaging paradigm -for evaluating queries over RDF. A very recent trend is towards using both CPUs and GPUs in a distributed environment in order to enable hardware acceleration on multiple machines.

^19 We highlight that decentralized proposals for managing RDF graphs existed before this, including federated systems, P2P systems, etc., but these are not considered in-scope here.
While proposed solutions have clearly been maturing down through the years, and much attention has been given to evaluating basic graph patterns over RDF, some aspects of SPARQL query processing have not gained much attention. Most stores surveyed manage triples rather than quads, meaning that named graphs are often overlooked. A key feature of SPARQL -and of graph query languages in general -is the ability to query paths of arbitrary length, yet optimizing property paths has not received much attention, particularly in the distributed setting. Many works also focus on a WORM (write once, read many) scenario, with relatively little attention paid (with some exceptions) to managing dynamic RDF graphs.
A final aspect that is perhaps not well-understood is the trade-off that exists between different proposals, what precisely are their differences on a technical level (e.g., between relational-and graph-based conceptualizations), and which techniques perform better or worse in which types of settings. In this regard, a number of benchmarks have emerged to try to compare RDF stores in terms of performance; we will discuss these in the following section.

B SPARQL Benchmarks for RDF Stores
We now discuss a variety of SPARQL benchmarks for RDF stores. We speak specifically of SPARQL benchmarks since benchmarks for querying RDF either came after the standardization of SPARQL (and thus were formulated in terms of SPARQL), or they were later converted to SPARQL for modern use. The discussion herein follows that of Saleem et al. [200], who analyze different benchmarks from different perspectives. We first discuss the general design principles for benchmarks, and then survey specific benchmarks.

B.1 SPARQL Benchmark Design
SPARQL query benchmarks consist of three elements: RDF graphs (or datasets), SPARQL queries, and performance measures. We first discuss some design considerations regarding each of these elements.

Datasets
The RDF graphs and datasets proposed for use in SPARQL benchmarks are of two types: real-world and synthetic. Both have strengths and weaknesses.
Real-world graphs reflect the types of graphs that one wishes to query in practice. Graphs such as DBpedia, Wikidata, YAGO, etc., tend to be highly complex and diverse; for example, they can contain hundreds, thousands or tens of thousands of properties and classes. Presenting query performance over real-world graphs is thus a relevant test of how a store will perform over RDF graphs found in practice. Certain benchmarks may also include a number of real-world graphs for the purposes of distributed, federated or even decentralized (web-based) querying [205].
Synthetic graphs are produced using specific generators that are typically parameterized, such that graphs can be produced at different scales, or with different graph-theoretic properties. Thus synthetic graphs can be used to test performance at scales exceeding real-world graphs, or to understand how particular graph-theoretic properties (e.g., number of properties, distributions of degrees, cyclicity, etc.) affect performance. Synthetic graphs can also be constructed to emulate certain properties of real-world graphs [65].
A number of measures have been proposed in order to understand different properties of benchmark graphs. Obvious ones include basic statistics, such as number of nodes, number of triples, number of properties and classes, node degrees, etc. [198,65]. Other (less obvious) proposals of measures include structuredness [65], which measures the degree to which entities of the same class tend to have similar characteristic sets; relationship specialty [191], which indicates the degree to which the multiplicity of individual properties varies for different nodes, etc. Observations indicate that the real-world and synthetic graphs that have been used in benchmarks tend to vary on such measures, with more uniformity seen in synthetic graphs [65,191,200]. This may affect performance in different ways; e.g., property tables will work better over graphs with higher structuredness and (arguably) lower relationship specialty.
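As a toy illustration of such measures (a much-simplified stand-in for the structuredness measure of Duan et al. [65]; the input format is an assumption of ours), one can score a class by how completely its instances fill the table of all properties used by any instance of that class:

```python
from collections import defaultdict

def structuredness(entities):
    """Toy structuredness score per class: the fraction of filled
    cells in the entity-by-property table whose columns are all
    properties used by any instance of the class. A score of 1.0
    means every instance sets every property (fully structured)."""
    by_class = defaultdict(list)
    for cls, props in entities:
        by_class[cls].append(set(props))
    scores = {}
    for cls, rows in by_class.items():
        all_props = set().union(*rows)
        filled = sum(len(r) for r in rows)
        scores[cls] = filled / (len(rows) * len(all_props))
    return scores
```

Synthetic generators tend to produce classes scoring close to 1.0 on measures of this kind, while real-world graphs score considerably lower.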

SPARQL Queries
The second key element of the benchmark is the queries proposed. There are three ways in which the queries for a benchmark may be defined:
- Manually-generated: The benchmark designer may manually craft queries against the RDF graph, trying to balance certain criteria such as query features, complexity, diversity, number of results, etc.
- Induced from the graph: The queries may be induced from the RDF graph by extracting sub-graphs (e.g., using some variation on random walks), with constants in the sub-graphs replaced by variables to generate basic graph patterns.
- Extracted from logs: The queries to be used may be extracted from real-world SPARQL logs reflecting realistic workloads; since logs may contain millions of queries, a selection process is often needed to identify an interesting subset of queries in the log.
Aside from concrete queries, benchmarks may also define query templates, which are queries where a subset of variables are marked as placeholders. These placeholders are replaced by constants in the data, typically so that the resulting partially-evaluated query still returns results over the RDF graph. In this way, each template may yield multiple concrete queries for use in the benchmark, thus smoothing variance in performance that may occur for individual queries.

Queries can vary in terms of the language considered (SPARQL 1.0 vs. SPARQL 1.1) and the algebraic features used (e.g., projection, filters, paths, distinct, etc.), but also in terms of various measures of the complexity and diversity of the queries -and in particular, the basic graph patterns -considered. Some basic measures to characterize the complexity and diversity of queries in a benchmark include the number of queries using different features, and measures for the complexity of the graph patterns considered (e.g., number of triple patterns, number of variables, number of join variables, number of cyclic queries, mean degree of variables, etc.). Calculating such measures across the queries of the benchmark, a high-level diversity score can be computed for a set of queries [200], based on the average coefficient of variation (dividing the standard deviation by the mean) across the measures.
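A rough sketch of such a diversity score follows (assuming a simple per-query feature matrix; the exact measure set of [200] is not reproduced here):

```python
import statistics

def diversity_score(query_measures):
    """Average coefficient of variation (standard deviation divided
    by the mean) across per-query measures, as a rough proxy for a
    benchmark diversity score. `query_measures` is a list of
    per-query feature vectors, e.g. [triple patterns, variables]."""
    cols = list(zip(*query_measures))  # one column per measure
    cvs = [statistics.pstdev(col) / statistics.mean(col)
           for col in cols if statistics.mean(col) > 0]
    return sum(cvs) / len(cvs)
```

A benchmark whose queries are all structurally alike scores near zero, while heterogeneous workloads score higher.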

Performance Measures
The third key element of a benchmark is the performance measures used. Some benchmarks may be provided without a recommended set of measures, but at the moment in which a benchmark is run, the measures to be used must be selected. Such measures can be divided into four categories [200]:
- Query Processing Related: The most important dimension relating to query processing is runtimes. A benchmark usually contains many queries, and thus reporting the runtime for each and every query is often too fine-grained. Combined results can rather be presented with measures like Query Mixes per Hour (QMpH), Queries per Second (QpS), or measures over the distributions of runtimes (max, mean, percentile values, standard deviation, etc.). Other statistics like the number of intermediate results generated, disk/memory reads, resource usage, etc., can be used to understand lower-level performance issues during query processing [206].
- Storage Related: Often there is a space-time trade-off inherent in different approaches, where more aggressive indexing can help to improve query runtimes but at the cost of space and more expensive updates. Measures relating to storage and indexing thus help to contextualize query-processing related measures.
- Result Related: Some systems may produce partial results for a query based on fixed thresholds or timeouts. An important consideration for a fair comparison between two RDF engines relates to the results produced in terms of correctness and completeness. This can often be approximately captured in terms of the number of results returned, the number of queries returning empty results (due to timeouts), the recall of queries, etc.
- Update Related: In real-world scenarios, queries are often executed while the underlying data are being updated in parallel. While the previous categories consider a read-only scenario, benchmarks may also record measures relating to updates [68,56]. Measures may include the number of insertions or deletions per second, the number of read/write transactions processed, etc.
Often a mix of complementary measures will be presented in order to summarize different aspects of the performance of the tested systems.
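For instance, the two most common aggregate runtime measures can be computed as follows (a trivial sketch; runtimes are assumed to be in seconds):

```python
def qps(runtimes):
    """Queries per Second: completed queries divided by the total
    evaluation time, for a list of per-query runtimes in seconds."""
    return len(runtimes) / sum(runtimes)

def qmph(mix_runtime_seconds, mixes_run):
    """Query Mixes per Hour: how many full query mixes the engine
    processes in one hour, extrapolated from the measured runs."""
    return mixes_run * 3600 / mix_runtime_seconds
```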

B.2 Synthetic Benchmarks
We now briefly survey the SPARQL benchmarks that have been proposed and used in the literature, and that are available for download and use. We start with benchmarks based on synthetic data.
LUBM (Lehigh) [85] (2005) creates synthetic RDF graphs that describe universities, including students, courses, professors, etc. The number of universities described by the graph is a parameter that can be changed to increase scale. The benchmark includes 14 hand-crafted queries. LUBM further includes an OWL ontology to benchmark reasoning, though often the benchmark is run without reasoning.
BSBM (Berlin) [35] (2009) is based on an e-commerce use-case describing entities in eight classes relating to products. The number of products can be varied to produce RDF graphs of different scales. A total of 12 query templates are defined with a mix of SPARQL features. The benchmark is also given in SQL format, allowing RDF stores to be compared with RDBMS engines.

SP2Bench [206] (2009) creates synthetic RDF graphs that emulate an RDF version of the DBLP bibliographic database. Various distributions and parameters from the DBLP data are extracted and defined in the generator. A total of 17 queries are then defined for the benchmark in both SPARQL and SQL formats.
BowlognaBench [62] (2012) creates synthetic RDF graphs inspired by the Bologna process of reform for European universities. The dataset describes entities such as students, professors, theses, degrees, etc. A total of 13 queries are defined that are useful to derive analytics for the reform process.
WatDiv [12] (2014) provides a data generator that produces synthetic RDF graphs with an adjustable value of structuredness, and a query template generator that generates a specified number of query templates according to specified constraints. The overall goal is to be able to generate diverse graphs and queries.
LDBC-SNB [68] (2015) is a benchmark based on synthetically generated social networking graphs. Three workloads are defined: interactive considers both queries and updates in parallel; business intelligence considers analytics that may touch a large percentage of the graph; algorithms considers the application of graph algorithms.
TrainBench [222] (2018) is a synthetic benchmark inspired by the use-case of validating a railway network model. The graph describes entities such as trains, switches, routes, sensors, and their relations. Six queries are defined that reflect validation constraints. TrainBench is expressed in a number of data models and query languages, including RDF/SPARQL and SQL.

B.3 Real-World Benchmarks
Next we survey benchmarks that are based on real-world datasets and/or queries from real-world logs.
DBPSB (DBpedia) [159] (2011) clusters queries from the DBpedia logs, generating 25 query templates representative of common queries found. These queries can then be evaluated over DBpedia, where a dataset of 153 million triples is used for testing, though smaller samples are also provided.
FishMark [24] (2012) is based on the FishBase dataset and is provided in RDF and SQL formats. The full RDF graph uses 1.38 billion triples, but a smaller graph of 20 million triples is used for testing. In total, 22 queries from a log of real-world (SQL) queries are converted to SPARQL.
BioBenchmark [247] (2014) is based on queries over five real-world RDF graphs relating to bioinformatics -Allie, Cell, DDBJ, PDBJ and UniProt -with the largest dataset (DDBJ) containing 8 billion triples. A total of 48 queries are defined for the five datasets based on queries generated by real-world applications.
FEASIBLE [199] (2015) generates SPARQL benchmarks from real-world query logs based on clustering and feature selection techniques. The framework is applied to DBpedia and Semantic Web Dog Food (SWDF) query logs and used to extract 15-175 benchmark queries from each log. The DBpedia and SWDF datasets used contain 232 million and 295 thousand triples, respectively.
WGPB [100] (2019) is a benchmark of basic graph patterns over Wikidata. The queries are based on 17 abstract patterns, corresponding to binary joins, paths, stars, triangles, squares, etc. The benchmark contains 850 queries, with 50 instances of each abstract pattern mined from Wikidata using guided random walks. Two Wikidata graphs are given: a smaller one with 81 million triples, and a larger one with 958 million triples.

B.4 Benchmark Comparison and Results
For a quantitative comparison of (most of) the benchmarks mentioned here, we refer to the work by Saleem et al. [200], which provides a detailed comparison of various measures for SPARQL benchmarks. For a performance comparison of eleven distributed RDF stores (SHARD, H2RDF+, CliqueSquare, S2X, S2RDF, AdPart, TriAD, H-RDF-3x, SHAPE, gStore-D and DREAM) and two local RDF stores (gStore and RDF-3X) over various benchmarks (including LUBM and WatDiv), we refer to the experimental comparison by Abdelaziz et al. [3].