1 Introduction
Nowadays, we can use the Resource Description Framework (RDF) (Klyne, Carroll, & McBride, 2004), which is recommended by the World Wide Web Consortium (W3C) as the foundation of the Semantic Web, to store knowledge. An RDF statement is a triple presented as 〈subject, predicate, object〉, which describes a property value of a subject or the relation between two entities, the subject and the object. In practice, a huge number of entities and statements contain spatial and temporal information, e.g., a city is always located in a specific location, and a transient event happens at a specific time point. Therefore, the schema of the RDF data needs to be extended to express the spatiotemporal semantics. For example, 〈Ulm Coordinates 48.39841/9.99155〉 describes the latitude and longitude of a spatial entity “Ulm”. 〈Albert_Einstein WonPrize Nobel_Prize (59.35,18.0667) (1921####,1921####)〉 denotes the event that Einstein won the Nobel Prize in 1921 in a location with the coordinates 59.35°N,18.0667°E.
Based on the spatiotemporal RDF data, users can ask more meaningful queries. For example, it is useful to count the fast food restaurants near someone’s workplace, or to find the spouses in Hollywood whose age difference is more than 10 years. More practically, for meetings, incentives, conventions, and events (MICE) tourists, it is important to find newly renovated motels that are near the places of interest. In order to answer these spatiotemporal queries efficiently and effectively, it is important to build an RDF query engine for spatiotemporal RDF data.
Although spatial and temporal data can be managed using traditional spatiotemporal databases, the “pay-as-you-go” nature of RDF poses new challenges for existing solutions. Firstly, the RDF data have diverse graph structures for different entities, which does not fit the traditional entity–relationship (ER) model. Though the column-based relational database partially solves the problem, it also suffers from the multiple values and null values in RDF data. Secondly, the incomplete spatiotemporal information in RDF data makes it inefficient to retrieve the spatiotemporal RDF data using a “join” operator, i.e., too many intermediate results might be generated. The entities and the statements without the necessary spatiotemporal information should be filtered out early. In summary, the traditional spatiotemporal databases are not suitable for spatiotemporal RDF data management.
A spatiotemporal RDF data set can be linked to other RDF repositories to provide structural queries with both semantic and spatiotemporal features. In this case, the spatiotemporal information-integrated RDF data are more suitable for providing location-based and time-based semantic search for users. Though queries are often related to spatiotemporal information, it is hard to find a short query that includes all four kinds of spatiotemporal constraints. As a result, we artificially build an example: a user wants to find a physicist who was born in a circular area, with the center located at coordinates (49°N, 10°E) and having a radius of 300 km (this area covers southern Germany), and who won some academic award in some place where the distance between the place and his birth place is <1500 km. Additionally, he was born before the year 1900, and he won the prize before his 50th birthday. The query can be represented as a Simple Protocol and RDF Query Language (SPARQL)-like query as follows. Section 3.2 gives the formalized definition of the query.
Example 1. SELECT ?x WHERE{ 
?x BornIn ?y ?l1 ?t1. 
?x WonPrize ?z ?l2 ?t2. 
?y type Populated_place. 
?z type Academic_Awards.} 
FILTER {distance(place(?y), place(49,10))<300 
AND distance(place(?l2), place(?y))<1500 
AND time(?t1)<date(1900.01.01) 
AND time(?t2)<time(?t1) 
+ (50year.01 month.01 day)} 
In this paper, we extend the semantics of the SPARQL language by integrating spatial and temporal feature assertions (the extended SPARQL query is called the ST query, short for spatial-temporal SPARQL query). The spatial and the temporal constraints assert the location of an entity/event and the event’s valid time, for instance, distance(place(?y),place(49, 10)) < 300 and time(?t1) < date(1900.01.01) in Example 1.
In order to answer ST queries in a uniform manner, we propose a tree-style index structure (called the STtree). The STtree index is a height-balanced n-ary tree. The semantic features and the spatiotemporal features are integrated within the STtree, and the STtree combines the advantages of the Rtree (Guttman, 1984) and the VStree (Zou, Mo, Chen, Özsu, & Zhao, 2011) in two steps. First, we encode the entities and RDF triples into bit strings (called “signatures”) to form a signature graph. The STtree is constructed over the signature graph, and a list of pruning rules that consider both spatiotemporal and semantic constraints in the query is proposed to reduce the search space during ST query processing. Second, we introduce a cost model to guide the STtree construction.
To summarize, we make the following contributions in this paper.
 We formalize the spatiotemporal queries by extending the semantics of SPARQL queries, and the spatiotemporal queries are used to retrieve information over the RDF data integrating the spatial and temporal information. Flexible spatiotemporal and semantic constraints are used in the spatiotemporal queries.
 We build a novel tree-style index integrating the spatiotemporal features and the semantic features, and we design a cost-model-based approach to build the STtree.
 Based on the STtree, we design a novel ST query-processing algorithm that includes both semantic and spatiotemporal pruning rules to reduce the search space.
 We evaluate our approach on a large real-world data set, and the result shows that our approach outperforms the baselines.
The remainder of this paper is organized as follows. Section 2 reviews the existing spatiotemporal RDF data management systems and some related works. Section 3 describes the basic idea of our work and gives a list of formal definitions. Section 4 gives the whole framework of our g^{st}store. Then, Sections 5 and 6 show the technical details of our index and the query algorithm. The experimental results are shown in Section 7. Finally, we conclude this paper in Section 8.
2 Related Work
Recently, researchers have begun to pay attention to spatiotemporal RDF data. Some real-world RDF data sets that integrate spatial and temporal information are available, such as YAGO2^{1} (Hoffart, Suchanek, Berberich, & Weikum, 2013), OpenStreetMap^{2} (Haklay & Weber, 2008), and GovTrack^{3}. YAGO2 (Hoffart, Suchanek, Berberich, & Weikum, 2013) is an RDF data set based on Wikipedia and WordNet. Additionally, YAGO2 integrates GeoNames^{4}, which is a geographical database that contains more than 10 million geographical names, to express the spatial information of the entities. At the same time, some statements have temporal information, e.g., the objects of the predicates “bornOnDate”, “wasCreatedOnDate” and so on denote the time at which the subjects were born or created. Based on the spatiotemporal information and some simple inference rules, YAGO2 generates a list of spatial entities and a list of spatiotemporal statements (Hoffart, Suchanek, Berberich, & Weikum, 2013).
Many RDF management systems (Abadi, Marcus, Madden, & Hollenbach, 2009, 2007; Broekstra, Kampman, & Van Harmelen, 2002; Neumann & Weikum, 2009; Weiss, Karras, & Bernstein, 2008; Wilkinson, 2009; Wilkinson, Sayers, Kuno, & Reynolds, 2003) have been proposed in the past years. RDF3x (Neumann & Weikum, 2010), Hexastore (Weiss, Karras, & Bernstein, 2008) and gStore (Zou, Mo, Chen, Özsu, & Zhao, 2011) are the state-of-the-art RDF management systems. In these management systems, the RDF data are well organized and indexes are used to efficiently and effectively answer the RDF queries. Unfortunately, since their indexes are tightly designed and none of the systems takes spatial or temporal features into consideration, none of them is suitable for spatiotemporal RDF data management without substantial modification.
To the best of our knowledge, few SPARQL query engines consider spatial and temporal queries over RDF data in a uniform manner except for YAGO2 Demo (Hoffart et al., 2011) and SPARQLST (Perry, Jain, & Sheth, 2011). However, YAGO2 Demo (Hoffart et al., 2011) uses hard-coded spatial/temporal predicates to define the spatiotemporal queries. Six (hard-coded) spatial predicates (“northOf”, “eastOf”, “southOf”, “westOf”, “nearby”, and “locatedIn”) and four (hard-coded) temporal predicates (“before”, “after”, “during”, and “overlaps”) over statements are used in the YAGO2 Demo. Users can construct queries as a list of triple patterns with the spatial and temporal predicates. Other spatiotemporal queries are not supported. Since all spatiotemporal predicates are predetermined, YAGO2 Demo does not allow flexible spatiotemporal range queries or join queries. The spatiotemporal semantics for the statements is limited, and the spatial semantics of the entities is missing.
Perry et al. (Perry, Jain, & Sheth, 2011) propose SPARQLST based on the work of Gutierrez et al. (Gutierrez, Hurtado, & Vaisman, 2007) for integrating the spatial information of entities and the temporal information of statements. In their study, Perry et al. (Perry, Jain, & Sheth, 2011) formalize the storage schema for the spatial entities and the temporal statements, in addition to formalizing the spatiotemporal graph pattern to construct SPARQLST. Their work implements a query engine by extending a commercial relational database that supports spatial objects, i.e., by dividing the spatiotemporal RDF data into three main tables (namely, triple table, spatial table, and temporal table) to store the data and by utilizing the literal and the spatiotemporal indexes of the relational database to evaluate the SPARQLST queries. In contrast to our framework, the spatial semantics of statements is missing, and the storage schema is not suitable for large real-world RDF data (e.g., >100 million statements), of which only a part has spatiotemporal information. The time cost of the self-joins is unacceptable.
Furthermore, Batsakis et al. (Batsakis & Petrakis, 2010) and Lyell et al. (Lyell, Voyadgis, Song, Ketha, & Dibner, 2011) try to build spatiotemporal ontologies to organize the spatiotemporal RDF data. The corresponding ontology-based query languages are introduced to retrieve the spatiotemporal RDF data. These works introduce well-designed ontologies, and the query capability has been widely extended. However, these two reports have little discussion on how to answer the spatiotemporal RDF queries efficiently, and the query performance is not evaluated. Additionally, since the statements cannot be seen as vertices, the ontology-based model is not suitable for organizing the spatiotemporal information of the statements.
Besides, several other proposals take either spatial features or temporal features of RDF data into consideration. Brodt et al. (Brodt, Nicklas, & Mitschang, 2010) and Erling and Mikhailov (Erling & Mikhailov, 2009) utilize RDF query engines and a spatial index to manage spatial RDF data. Brodt et al. (Brodt, Nicklas, & Mitschang, 2010) use RDF3x as the base RDF query engine, and add a spatial index for filtering entities before or after RDF3x join operations. These two approaches only support range queries (and spatial joins (Erling & Mikhailov, 2009)) on entities, and the spatial entities follow the GeoRSS GML (Singh, Turner, Maron, & Doyle, 2008) model. Our early work on Sstore (Wang et al., 2013) integrates spatial information into the RDF data. In Sstore, a tree index SStree is used. First, an Rtree based on the spatial entities and a VStree based on the nonspatial entities are built separately, and then the two trees are combined to form the SStree. The Rtree and the VStree pruning rules are used to generate the candidates for the queries. The brute-force combining method disregards that an entity integrates the spatial features and the semantic features at the same time. In contrast, we propose a cost-model-based method to take both spatiotemporal features and semantic features into consideration while constructing the tree index.
Gutierrez et al. (Gutierrez, Hurtado, & Vaisman, 2007; Gutiérrez, Hurtado, & Vaisman, 2005) give formal definitions of the temporal RDF graph, and prefer to use time interval labeling on an RDF graph to integrate temporal information into RDF data. Furthermore, their first work (Gutierrez, Hurtado, & Vaisman, 2007) introduces a simple query language for temporal RDF data. Based on the work of Gutierrez et al., several query languages have been proposed, such as TSPARQL (Grandi, 2010), SPARQLST (Perry, Jain, & Sheth, 2011) and τSPARQL (Tappolet & Bernstein, 2009). Tappolet et al. (Tappolet & Bernstein, 2009) propose a temporal RDF data management framework. The named graph is used to manage the statements with different time intervals, and a tree-style index keyTree is introduced to efficiently retrieve the valid time interval and the involved triples at a certain time point. In contrast, Pugliese et al. (Pugliese, Udrea, & Subrahmanian, 2008) extend the work of (Gutierrez, Hurtado, & Vaisman, 2007), and introduce a novel tree-style index to efficiently and effectively answer the temporal RDF queries. Firstly, they combine the graph distance metric and the temporal distance metric to build a metric called the tGRIN distance metric. Then, based on the tGRIN distance metric, the entries are clustered. The clusters with different granularity constitute the tGRIN tree-style index. Based on the tGRIN index, two pruning rules are introduced to efficiently answer the temporal queries. However, most of the statements in real data sets (e.g., YAGO2) lack temporal information. Therefore, the tGRIN metric fails since it is hard to compute the temporal distance between temporal statements and nontemporal statements^{5}. Thus, the pruning rules are inefficient. Besides, the pruning rules are based on mapping the constants in the query to the data set. If the constants in the query are high-degree nodes (e.g., the type “city”) or if there is no constant in the query, the pruning rules are also inefficient.
3 Problem Definition
3.1 SPARQL vs. Subgraph Match
An RDF data set is a list of RDF triples. Here, we have a sample RDF data set (shown in Figure 1(a)), which consists of 25 triples. We call each triple a statement. The answer of a SPARQL query is a list of statements that satisfy the SPARQL constraints. We also regard an RDF data set as a graph (called RDF graph G). Figure 1(b) shows the corresponding RDF graph of the sample data set. Furthermore, a SPARQL query can also be modeled as a graph structure Q. Therefore, answering a SPARQL query is equivalent to finding subgraph matches of query graph Q over RDF graph G. The formal definitions are given as follows.
Definition 1. A statement is a triple 〈s, p, o〉, where s, p, and o represent subject, predicate, and object, respectively.
Definition 2. The RDF data graph is denoted as G = 〈V, E, L_{V}, L_{E}〉, where
 V = V_{l} ∪ V_{e} ∪ V_{c} ∪ V_{b} denotes all RDF vertices where V_{l}, V_{e}, V_{c} and V_{b} are the sets of literal vertices, entity vertices, class vertices and blank nodes respectively.
 E is a collection of the edges between vertices.
 L_{V} = {URI} ∪ {Literal Value} ∪ {null} is the collection of all vertex labels (i.e., label(v)), where v ∈ {V_{e} ∪ V_{c}} ⇔ label(v) ∈ {URI} and v ∈ V_{l} ⇔ label(v) ∈ {Literal Value}. For v ∈ V_{b}, label(v) is null.
 L_{E} is the collection of edge labels, i.e., all possible predicates.
Definition 3. The SPARQL query graph is denoted as Q = 〈V, E, L_{V}, L_{E}〉, where
 V = V_{l} ∪ V_{e} ∪ V_{c} ∪ V_{b} ∪ V_{p}, where V_{p} denotes the parameter vertices, and V_{l}, V_{e}, V_{c}, and V_{b} are the same as in Definition 2.
 E and L_{E} are the same as in Definition 2.
 L_{V} is the same as in Definition 2, except that v ∈ V_{p} ⇔ label(v) is null.
Definition 4. Consider an RDF graph G and a query graph Q with n vertices {v_{1},…, v_{n}}. A list of n corresponding vertices {u_{1},…, u_{n}} in G is said to be a match of Q if and only if the following conditions hold:
 If v_{i} ∈ {V_{l} ∪ V_{c} ∪ V_{e}}, then u_{i} ∈ {V_{l} ∪ V_{c} ∪ V_{e}} and label(v_{i}) = label(u_{i});
 If v_{i} ∈ {V_{b} ∪ V_{p}}, label(u_{i}) is unrestricted;
 If there is an edge v_{i}v_{j} from v_{i} to v_{j} in Q, there is also an edge u_{i}u_{j} from u_{i} to u_{j} in G. If v_{i}v_{j} has predicate p, u_{i}u_{j} must have the same predicate p.
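The three conditions above can be checked mechanically for a given candidate assignment. The following Python sketch is illustrative only and is not part of g^{st}store: the function name `is_match`, the dictionary-based graph encoding, and the toy graph are assumptions made for this example, and parameter and blank vertices are modeled by a `None` label.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source vertex, predicate, target vertex)


def is_match(
    q_labels: Dict[str, Optional[str]],
    q_edges: Set[Edge],
    g_labels: Dict[str, Optional[str]],
    g_edges: Set[Edge],
    mapping: Dict[str, str],
) -> bool:
    """Check whether `mapping` (query vertex -> data vertex) is a match of Q in G."""
    for v, u in mapping.items():
        label = q_labels[v]
        # Conditions 1 and 2: a labeled query vertex needs an identical label;
        # blank and parameter vertices (label None) are unrestricted.
        if label is not None and g_labels.get(u) != label:
            return False
    # Condition 3: every query edge needs a data edge with the same predicate.
    return all((mapping[s], p, mapping[o]) in g_edges for s, p, o in q_edges)


# Toy data graph (assumed for illustration).
g_labels = {"Albert_Einstein": "Albert_Einstein", "Ulm": "Ulm",
            "Populated_place": "Populated_place"}
g_edges = {("Albert_Einstein", "BornIn", "Ulm"),
           ("Ulm", "type", "Populated_place")}

# Query: ?x BornIn ?y . ?y type Populated_place .
q_labels = {"?x": None, "?y": None, "Populated_place": "Populated_place"}
q_edges = {("?x", "BornIn", "?y"), ("?y", "type", "Populated_place")}

good = {"?x": "Albert_Einstein", "?y": "Ulm", "Populated_place": "Populated_place"}
bad = {"?x": "Ulm", "?y": "Albert_Einstein", "Populated_place": "Populated_place"}
```

Here `good` satisfies all three conditions, whereas `bad` fails the edge condition because the data graph has no BornIn edge from Ulm to Albert_Einstein.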
Here, an RDF data set is seen as a list of statements. A statement is also regarded as an edge in the RDF graph connecting the subject vertex and the object vertex with the edge label (the predicate). The subjects and the objects constitute the vertex set of the RDF graph. If a vertex is an entity or a class, the vertex label is a uniform resource identifier (URI). If the subject or the object is a string, the vertex label is the corresponding literal value. Note that the label of a vertex can be null, i.e., the vertex is a blank node.
A SPARQL query is a small graph similar to the RDF graph. In contrast to the RDF graph, the SPARQL query graph contains a special kind of vertex, i.e., the parameter vertices. The identifier of a parameter vertex starts with a “?”, and the label of the parameter vertex is seen as null. The graph matches of the SPARQL query graph in the RDF graph are the result of the SPARQL query.
3.2 Spatiotemporal RDF
In this section, we formally define the spatiotemporal RDF data and the spatiotemporal SPARQL query as follows.
An entity e is called a spatial entity if it has an explicit location labeled with the coordinates x and y (for the twodimensional situation). The other entities are called nonspatial entities.
An ST statement is a fivetuple 〈s, p, o, L, T〉, where s, p, o, L and T represent subject, predicate, object, location, and time interval, respectively. The ST statement is an extension of the original RDF statement, where s, p, and o are the original elements. L denotes the spatial feature (the coordinates) of a statement, and T has the start time T_{s}, and the end time T_{e} to denote the valid time interval of a statement, i.e., the statement is considered to be credible in this time interval: specifically, T_{s} = T_{e} if and only if the statement happens at a time point.
If the L in a statement is not null, the statement is called a spatial statement. Otherwise, it is called a nonspatial statement.
If the T in a statement is not null, the statement is called a temporal statement. Otherwise, it is called a nontemporal statement.
An ST triple pattern is a fivetuple 〈s, p, o, L, T〉, where s, p, o, L, and T represent subject, predicate, object, location, and time interval respectively. In contrast to the ST statement, each item of an ST triple pattern can be replaced by a variable. An ST statement S is called a match of an ST triple pattern P if the nonvariable items are the same in S and P. The variable items in P are mapped to the corresponding items in S.
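As an illustration of the definitions above, the following Python sketch models an ST statement as a five-tuple and matches it against an ST triple pattern. It is a minimal sketch under stated assumptions: the names `STStatement`, `is_variable`, and `match_pattern` are hypothetical, variables are written as strings starting with “?”, and the requirement that a repeated variable binds to a single value is an added assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class STStatement:
    s: str
    p: str
    o: str
    L: Optional[Tuple[float, float]] = None  # location (coordinates), None if nonspatial
    T: Optional[Tuple[str, str]] = None      # valid time interval (T_s, T_e), None if nontemporal

    @property
    def is_spatial(self) -> bool:
        return self.L is not None

    @property
    def is_temporal(self) -> bool:
        return self.T is not None


def is_variable(item: object) -> bool:
    return isinstance(item, str) and item.startswith("?")


def match_pattern(pattern: tuple, stmt: STStatement) -> Optional[Dict[str, object]]:
    """Return the variable bindings if `stmt` matches `pattern`, otherwise None."""
    bindings: Dict[str, object] = {}
    for want, have in zip(pattern, (stmt.s, stmt.p, stmt.o, stmt.L, stmt.T)):
        if is_variable(want):
            if want in bindings and bindings[want] != have:
                return None  # assumption: one variable maps to one value
            bindings[want] = have
        elif want != have:
            return None      # nonvariable items must be identical
    return bindings


# The statement from the introduction, and a nonspatial, nontemporal statement.
won = STStatement("Albert_Einstein", "WonPrize", "Nobel_Prize",
                  (59.35, 18.0667), ("1921####", "1921####"))
name = STStatement("Ulm", "isCalled", "Ulm")

pattern = ("?x", "WonPrize", "?z", "?l2", "?t2")
```

With these values, `match_pattern(pattern, won)` binds `?x` to `Albert_Einstein` and `?l2` to the coordinates, while `match_pattern(pattern, name)` returns `None` because the predicates differ.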
An ST query is a list of ST triple patterns with some spatial and temporal filtering conditions. If there is neither spatial nor temporal filtering condition, the ST query is degraded to a traditional SPARQL query.
The spatiotemporal filtering conditions are represented as spatiotemporal assertions in this paper.
Given an ST query Q, the spatial assertions are expressed in the form distance(place(a), place(b)) < d, where “distance” and “place” are reserved words in g^{st}store, a and b are each a variable in Q or a specific geometric point, and d is a constant given by the user.
The temporal assertions are expressed as time(a) < time(b) ± XyearYmonthZday or time(c) < ymd. Here, X, Y, Z and y, m, d are parameters given by the user to denote the values of year, month and day; a, b, and c are variables in Q; and “time” and “yearmonthday” are reserved words in g^{st}store. Note that the comparison operator “<” can be replaced by “=” or “>”.
Figure 2 shows a subset of a spatiotemporal RDF data set. Ulm, Baden-Württemberg, and Gdańsk are spatial entities. Some statements are spatial statements, such as #1, #2, and #6, and some statements are temporal statements, such as #10, #17, and #20. Besides, there are a lot of nonspatial entities, as well as nonspatial and nontemporal statements. For example, people have no spatial information since we cannot locate a person on the map. Similarly, statements such as 〈People hasName Name〉 are nonspatial and nontemporal statements. In g^{st}store, we use “ST assertion” to represent the spatiotemporal constraints in ST queries. For example, in Example 1, the filtering conditions list four kinds of spatiotemporal constraints, where distance(place(?y), place(49, 10)) < 300 is a spatial range constraint, distance(place(?l2), place(?y)) < 1500 is a spatial join constraint, time(?t1) < date(1900.01.01) is a temporal range constraint, and time(?t2) < time(?t1) + (50year.01 month.01 day) is a temporal join constraint.
In this stage, we support (i) the spatial range query and the spatial join semantics for spatial entities and statements, and (ii) the temporal range query and the temporal join semantics for temporal statements.
In practice, we use place(?x) to denote the spatial label of variable ?x. Also, distance(a, b)^{6}< r denotes that the distance between a and b should be below the threshold r, where a and b should be a specific location or a variable. If either a or b is a constant, the constraint is called a spatial range assertion. If both a and b are variables, the constraint is called a spatial join assertion. Note that a spatial query can have range assertions and spatial join assertions at the same time.
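To make the two kinds of spatial assertion concrete, the sketch below classifies an assertion as a range or a join assertion and evaluates the two spatial constraints of Example 1. It is only a sketch: the distance function used by g^{st}store is not specified in this section, so the great-circle (haversine) distance with an Earth radius of 6371 km is an assumption, as are the names `distance_km` and `assertion_kind`.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def distance_km(a: Point, b: Point) -> float:
    """Great-circle (haversine) distance in kilometres (assumed metric)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def is_variable(x: object) -> bool:
    return isinstance(x, str) and x.startswith("?")


def assertion_kind(a: object, b: object) -> str:
    """'join' if both arguments are variables, otherwise 'range'."""
    return "join" if is_variable(a) and is_variable(b) else "range"


ulm = (48.39841, 9.99155)      # coordinates of Ulm from the introduction
centre = (49.0, 10.0)          # centre of the query circle in Example 1
prize_place = (59.35, 18.0667) # location of the WonPrize statement

range_ok = distance_km(ulm, centre) < 300        # spatial range assertion
join_ok = distance_km(prize_place, ulm) < 1500   # spatial join assertion
```

Under the haversine assumption, both constraints hold for the binding ?y = Ulm and ?l2 = (59.35, 18.0667).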
Similarly, we use time_{start}(?x) and time_{end}(?x) to denote the T_{s} and T_{e} features of variable ?x, respectively. Note that time(?x) denotes that both T_{s} and T_{e} should satisfy the constraints. In the temporal assertions, we use “a < b”, “a = b” and “a > b” to denote the time order of a and b, where a and b are either a temporal feature or a time point. If either a or b is a time point, the constraint is called a temporal range assertion. If both a and b are expressions that include variables, the constraint is called a temporal join assertion.
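The temporal assertions can be sketched in the same way. In the Python example below, time(?x) is taken to mean that both endpoints of the interval must satisfy the comparison, as stated above. Everything else is an assumption made for illustration: the helper names, the endpoint-by-endpoint reading of the join assertion, the restriction of the offset to whole years, and the concrete dates, which are illustrative values rather than values taken from the data set.

```python
from datetime import date
from typing import Tuple

Interval = Tuple[date, date]  # (T_s, T_e)


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years (29 February falls back to 28 February)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def before_point(t: Interval, point: date) -> bool:
    """Temporal range assertion time(?x) < point: both T_s and T_e precede the point."""
    return t[0] < point and t[1] < point


def before_shifted(t_a: Interval, t_b: Interval, years: int) -> bool:
    """Temporal join assertion time(?a) < time(?b) + years, endpoint by endpoint."""
    return (t_a[0] < add_years(t_b[0], years)
            and t_a[1] < add_years(t_b[1], years))


# Illustrative values (assumptions): a birth time point in 1879 and an
# award interval covering the year 1921.
born = (date(1879, 3, 14), date(1879, 3, 14))
won = (date(1921, 1, 1), date(1921, 12, 31))

range_ok = before_point(born, date(1900, 1, 1))  # time(?t1) < date(1900.01.01)
join_ok = before_shifted(won, born, 50)          # time(?t2) < time(?t1) + 50 years
```

A birth statement is a time point, so its interval has T_{s} = T_{e}; the award interval shows the case where the two endpoints differ.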
For instance, Example 1 is an ST query that includes the spatial range assertion, the spatial join assertion, the temporal range assertion and the temporal join assertion at the same time.
The ST RDF data set and the ST query can also be modeled as graphs (Definitions 12 and 13). The query processing is to find the matches (Definition 14) of an ST query graph Q in an ST RDF data graph G. Figure 3 shows the graph corresponding to the ST RDF data set in Figure 2, where the spatial entities and the spatial statements are all surrounded by red rectangles, and the temporal statements are surrounded by blue rectangles. Note that if a temporal statement is already surrounded by a red rectangle, we only surround the temporal feature of the statement with a blue rectangle.
Definition 12. The ST RDF data graph is denoted as G = 〈V, E, L_{V}, L_{E}, S_{V}, S_{E}, T_{V}, T_{E}〉, where
 V, E, L_{V}, and L_{E} are the same as in Definition 2.
 S_{V} and S_{E} represent the spatial labels of V and E respectively, where the spatial labels denote the position of the entity (the event), i.e., the latitude and longitude (only valid for spatial entities and spatial statements).
 T_{V} and T_{E} represent the temporal labels of V and E respectively, where the temporal labels denote the time interval when the entity (the event) occurs, i.e., the start time and the end time.
Definition 13. The ST SPARQL query graph is denoted as Q = 〈V, E, L_{V}, L_{E}, SC_{V}, SC_{E}, TC_{V}, TC_{E}〉, where
 V, E, L_{V}, and L_{E} are the same as in Definition 3.
 SC_{V} and SC_{E} represent the spatial assertions of V and E respectively, where the spatial assertions can be an absolute area or the relative position for some parameter.
 TC_{V} and TC_{E} represent the temporal assertions of V and E respectively, where the temporal assertions can be an absolute time interval or the relative relation for some parameter expressions, such as “>” “=” and “<”.
Definition 14. Consider an ST RDF graph G and an ST query graph Q with n vertices {v_{1},…,v_{n}}. A list of n vertices {u_{1},…, u_{n}} in G is said to be a match of Q if and only if the conditions in Definition 4 and the following conditions hold:
 If v_{i} ∈ V_{p}, the spatial label S(u_{i}) must satisfy the spatial assertion SC(v_{i}), and the temporal label T(u_{i}) must satisfy the temporal assertion TC(v_{i});
 If there is an edge v_{i}v_{j} from v_{i} to v_{j} in Q, there is also an edge u_{i}u_{j} from u_{i} to u_{j} in G. If v_{i}v_{j} has spatial (temporal) assertions, u_{i}u_{j} must have the corresponding spatial (temporal) label that satisfies the spatial (temporal) assertions.
We show the graph view of Q in Figure 4. We can find that there is a match of Q in the ST RDF data graph satisfying all the constraints of Q, where the result of ?x, ?y and ?z is “Albert_Einstein”, “Ulm”, and “Nobel_Prize”, respectively.
3.3 ST Signature Graph
In g^{st}store, we use a bit string^{7}, a minimum bounding rectangle (MBR) of a spatial feature (the coordinates), and a segment^{8} of a temporal feature (the time interval) to denote an entity. The bit string is called a signature. The original ST RDF graph is converted to an ST signature graph in g^{st}store.
The signature sig of each subject s depends on all the edges {e_{1}, e_{2},…, e_{n}} adjacent to s. For each e_{i}, a list of hash functions is used to generate a signature sig.e_{i}, where the first N bits denote the predicate, and the following M bits denote the object. The valid bits (i.e., the bits with value “1”) depend on the hash codes of the corresponding textual information. For instance, suppose that we use two hash functions for the predicates and two hash functions for the URI/literals, and N and M are both set to be 5. Here, for the edge (statement) Ulm isCalled “Ulm”, the hash codes of the predicate isCalled are 1 and 5 and the hash codes of the literal value “Ulm” are 2 and 4 based on the hash functions. Therefore, the edge is represented as 10001 01010 in Figure 5, where the first 5 bits represent the predicate isCalled, and the last 5 bits represent the literal value “Ulm”. The signature sig of s is sig = sig.e_{1} | sig.e_{2} | … | sig.e_{n}, where “|” is the bitwise OR and e_{1}, e_{2},…, e_{n} are the out-edges of s.
For example, in Figure 2, there are four edges starting from Ulm (#8, #9, #10, and #11). If we set the first five bits for the predicate and the following five bits for the object, we get four signatures 0001101000, 1000101010, 1001000010 and 0001100011 corresponding to the four edges. Thus, Ulm can be represented as 1001101011. Figure 5 shows the encoding process for “Ulm”. Note that only the entity and class vertices in the RDF graph are encoded.
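The encoding of “Ulm” can be reproduced with a few lines of Python. The sketch below is illustrative: the hash functions themselves are not modeled, so the 1-based bit positions they produce are passed in directly, and the names `encode_edge` and `vertex_signature` are hypothetical. The vertex signature is computed as the bitwise OR of the edge signatures, which is consistent with the four edge signatures and the result given above.

```python
from functools import reduce
from typing import Iterable


def encode_edge(pred_bits: Iterable[int], obj_bits: Iterable[int],
                n: int = 5, m: int = 5) -> str:
    """Build an edge signature: n predicate bits followed by m object bits.

    `pred_bits` and `obj_bits` are the 1-based bit positions returned by the
    hash functions (the hash functions are not modeled here).
    """
    pred = ["0"] * n
    obj = ["0"] * m
    for i in pred_bits:
        pred[i - 1] = "1"
    for i in obj_bits:
        obj[i - 1] = "1"
    return "".join(pred + obj)


def vertex_signature(edge_sigs: Iterable[str]) -> str:
    """Bitwise OR of the signatures of all out-edges of a vertex."""
    sigs = list(edge_sigs)
    combined = reduce(lambda a, b: a | b, (int(s, 2) for s in sigs))
    return format(combined, f"0{len(sigs[0])}b")


# Edge 〈Ulm isCalled "Ulm"〉: predicate hash codes 1 and 5, literal hash codes 2 and 4.
is_called = encode_edge([1, 5], [2, 4])

# The four out-edge signatures of Ulm given in the text.
ulm = vertex_signature(["0001101000", "1000101010", "1001000010", "0001100011"])
```

Running the sketch gives `is_called == "1000101010"` and `ulm == "1001101011"`, matching the values in the text.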
Then, for each vertex v_{i} and each edge e_{j}, we set the MBR(v_{i}) and the MBR(e_{j}) of the entities and the statements, where MBR(x) denotes the MBR of the spatial feature of x. Next, we set the segments seg(v_{i}) and seg(e_{j}) in the time axis to denote the time features. Note that the seg(v) of all the entities is null in this stage. Subsequently, for each node v, all the segments seg(e_{i}) of the edges e_{i} starting from node v are combined as a union segment to denote the temporal feature of the node v, i.e., seg(v) = ∪seg(e_{i}).
Given an ST query Q, Q can also be easily transformed into an ST signature query Q^{*} based on the above conversion method. We define the match of Q^{*} in the ST signature RDF graph as follows. It can be easily derived that each match (Definition 14) of Q in G corresponds to a match (Definition 15) of Q^{*} in G^{*}.
Definition 15. Given an ST signature graph G^{*} and an ST signature query graph Q^{*} with n signature vertices {q_{1},…, q_{n}}, a set of distinct signature vertices {sig_{1},…, sig_{n}} in G^{*} is a match of Q^{*} if and only if the following conditions hold:
 ∀q_{i}, sig_{i}.signature&q_{i}.signature = q_{i}.signature;
 ∀q_{i}, the spatiotemporal labels of sig_{i} must satisfy the spatiotemporal assertions;
 If there is an edge q_{i}q_{j} from q_{i} to q_{j} in Q^{*}, there is also an edge sig_{i}sig_{j} from sig_{i} to sig_{j} in G^{*}, and q_{i}q_{j}.signature & sig_{i}sig_{j}.signature = q_{i}q_{j}.signature. If q_{i}q_{j} has a spatial (temporal) assertion, sig_{i}sig_{j} must have the spatial (temporal) label that satisfies the assertion.
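The first condition is a bitwise containment test. The following Python sketch applies it to the signatures that appear in this paper; the function name `covers` is hypothetical, and the query signature is assumed to be that of a query vertex with a single out-edge isCalled “Ulm”. The spatiotemporal and edge conditions are not covered by this sketch.

```python
from typing import Dict, List


def covers(data_sig: str, query_sig: str) -> bool:
    """True if every bit set in the query signature is also set in the data signature."""
    d, q = int(data_sig, 2), int(query_sig, 2)
    return (d & q) == q


def candidates(query_sig: str, data_sigs: Dict[str, str]) -> List[str]:
    """Data vertices that pass the signature test for one query vertex."""
    return [name for name, sig in data_sigs.items() if covers(sig, query_sig)]


data_sigs = {
    "Ulm": "1001101011",
    "Baden-Württemberg": "1101000010",
}
query_sig = "1000101010"  # assumed query vertex with one out-edge: isCalled "Ulm"

result = candidates(query_sig, data_sigs)
```

Here `result` contains only `Ulm`. Since different labels may hash to the same bits, a vertex that passes the test is only a candidate and still has to be verified against the original data.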
4 Overview of g^{st}store
g^{st}store uses a hybrid index that integrates both Rtree (Guttman, 1984) and VStree (Zou, Mo, Chen, Özsu, & Zhao, 2011). Therefore, the pruning strategies of Rtree and VStree are also integrated as the searching strategy of g^{st}store. Our framework consists of the preprocessing, the index construction and the query processing stages.
In the preprocessing stage, we first encode each vertex and edge as a bit string (we call it a signature). Subsequently, we build the ST signature graph G^{*}. Figure 6 shows a running example. In Figure 6, the entities or the statements surrounded by the dotted rectangles have the spatial feature or the temporal feature. The spatial features are represented as red “MBR(·,·)”, and the temporal features are represented as blue “########”. Since the nodes that have no out-edges are not encoded, they are not taken into consideration in the ST signature graph.
In the index construction stage, we construct a tree-style index based on the ST signature graph to effectively reduce the search space. The index is called STtree. Figure 7 shows a running example. The nodes on the same level of the STtree form an ST signature graph. If there is a match of a query Q in a lower ST signature graph, there must be a corresponding match in each higher ST signature graph. Therefore, we need to guarantee that the STtree is a height-balanced tree.
In the query processing stage, given a query graph Q, we first convert Q into the ST signature query graph Q^{*}. Figure 8 shows the ST signature query graph of the example Q in Section 3.2. In Figure 8, the edges and the nodes are encoded, and the spatiotemporal constraints are added to the edges and nodes. Note that if there is a set of vertices in G that matches a query graph Q, there must be a corresponding match in G^{*} of Q^{*}. Subsequently, we implement a top-down searching algorithm over the STtree to find the matches of Q^{*} in G^{*}. Finally, we retrieve the corresponding textual result and return it to the user.
5 Index Construction
In this section, we introduce our ST RDF index STtree. The index is presented in a tree style. Generally speaking, we build the STtree based on the VStree and the Rtree. The STtree is used to generate the candidates for the variables.
5.1 The STtree Structure
The STtree is a hybrid tree-style index combining the VStree (Zou, Mo, Chen, Özsu, & Zhao, 2011) and the Rtree (Guttman, 1984). The VStree is an extension of the Stree (Deppisch, 1986). As shown in Figure 7, the STtree is a height-balanced n-ary tree, and each level of the STtree comprises an ST signature graph. The leaves of the STtree and the corresponding edges between the leaves comprise the ST signature RDF data graph, and the inner nodes of the STtree obey the STtree rule.
 STtree Rule: Consider two ST signature nodes v_{1} and v_{2} and their father nodes n_{1} and n_{2}. The following conditions hold:
 n_{1}.sig&v_{1}.sig = v_{1}.sig, n_{2}.sig&v_{2}.sig = v_{2}.sig;
 v_{1}.MBR ⊆ n_{1}.MBR, v_{2}.MBR ⊆ n_{2}.MBR;
 v_{1}.seg ⊆ n_{1}.seg, v_{2}.seg ⊆ n_{2}.seg;
 If there is an edge v_{1}v_{2} between v_{1} and v_{2}, there must be an edge n_{1}n_{2} between n_{1} and n_{2}, where n_{1}n_{2}.sig & v_{1}v_{2}.sig = v_{1}v_{2}.sig, v_{1}v_{2}.MBR ⊆ n_{1}n_{2}.MBR and v_{1}v_{2}.seg ⊆ n_{1}n_{2}.seg, even if n_{1} = n_{2}.
 The STtree rule ensures that the upper-level ST signature graph is a summary graph of the lower-level ST signature graph, i.e., each node/edge in the upper level is the union of its descendants. For example, the node d_{3}^{3} is the father of the nodes Ulm and BadenWürttemberg. Thus, the signature of d_{3}^{3} is 1101101011, which is the union (bitwise OR) of the signatures of Ulm and BadenWürttemberg, 1001101011 and 1101000010. The spatial MBR and the temporal interval of d_{3}^{3} are also the unions of the corresponding features of the nodes Ulm and BadenWürttemberg. In the STtree, given levels i and i + 1,^{9} we call the ST signature graph G_{i}^{*} in level i the summary graph of the ST signature graph G_{i+1}^{*} in level i + 1, and G_{i+1}^{*} is the expanded graph of G_{i}^{*}.
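The summary relation can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the two signatures are the ones quoted in the running example (which one belongs to which node is assumed from the order in the text), while the MBR and segment values, the `Features` type, and the helper names are hypothetical.

```python
from typing import NamedTuple, Tuple

class Features(NamedTuple):
    sig: int                                 # bit-string signature
    mbr: Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)
    seg: Tuple[int, int]                     # (start, end)

def summarize(a: Features, b: Features) -> Features:
    """Union of two children's features, as stored in their father node."""
    return Features(
        sig=a.sig | b.sig,
        mbr=(min(a.mbr[0], b.mbr[0]), min(a.mbr[1], b.mbr[1]),
             max(a.mbr[2], b.mbr[2]), max(a.mbr[3], b.mbr[3])),
        seg=(min(a.seg[0], b.seg[0]), max(a.seg[1], b.seg[1])),
    )

def covers(father: Features, child: Features) -> bool:
    """Node conditions of the STtree rule: signature, MBR and segment containment."""
    sig_ok = (father.sig & child.sig) == child.sig
    mbr_ok = (father.mbr[0] <= child.mbr[0] and father.mbr[1] <= child.mbr[1]
              and father.mbr[2] >= child.mbr[2] and father.mbr[3] >= child.mbr[3])
    seg_ok = father.seg[0] <= child.seg[0] and father.seg[1] >= child.seg[1]
    return sig_ok and mbr_ok and seg_ok

# Signatures are taken from the text; MBRs and segments are made-up placeholders.
ulm = Features(0b1001101011, (9.9, 48.3, 10.0, 48.4), (1900, 2000))
bw = Features(0b1101000010, (7.5, 47.5, 10.5, 49.8), (1952, 2000))
d33 = summarize(ulm, bw)
print(format(d33.sig, "010b"))
```

Because the father is built as a union, `covers(d33, ulm)` and `covers(d33, bw)` both hold by construction, which is the property that the pruning in the query stage relies on.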
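Theorem 1. Given an ST signature query Q^{*} and level i, if there is a match of Q^{*} in the ST signature graph G_{i+1}^{*}, there must be a corresponding match of Q^{*} in its summary graph G_{i}^{*}.
(Sketch) Suppose the match of Q^{*} = {q_{1},…, q_{n}} in G_{i+1}^{*} is {v_{1},…, v_{n}}, and let n_{j} be the father of v_{j} in G_{i}^{*}. By the STtree rule, each n_{j} covers the signature, the MBR, and the segment of v_{j}, and every edge between two matched nodes is covered by an edge between their fathers. Hence, every constraint of Q^{*} satisfied by {v_{1},…, v_{n}} is also satisfied by {n_{1},…, n_{n}}.
□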
Theorem 1 gives the correctness guarantee. If the STtree can be separated into several layers, i.e., the STtree is a heightbalanced tree, the tree nodes in the upper layer can be safely pruned if the signature, the MBR, or the segment is unsatisfied.
5.2 STtree Construction
The STtree is constructed over the ST signature graph. In Section 3.3, we have described the generation of the ST signature graph. Each node in the ST signature graph has three features: the signature, the spatial MBR and the temporal segment. Based on these three features, we can build an Stree, an Rtree of spatial information and an Rtree of temporal information respectively. In the STtree, we integrate the three trees with different features.
We use the “insert” operation to build the STtree. Given a list of ST signature nodes, we insert the nodes one by one into the STtree. Since the STtree is a height-balanced n-ary tree, we implement an “insert and split” strategy similar to that of other height-balanced n-ary trees, such as the B^{+}tree, the Rtree, and the Stree. When a node n arrives, the strategy works as follows.
 Iteratively choose, in a top-down manner, the node with the lowest cost for inserting n. If the chosen node v is a leaf, insert n into v.
 If v is full, split v into two separate nodes v_{1} and v_{2} such that the costs of v_{1} and v_{2} are minimized. If a splitting operation makes the father of the split node full, split the father node iteratively.
 If the root is split into r_{1} and r_{2}, create a new root r and make r the father of r_{1} and r_{2}.
Algorithm 1 STtree Construction
Require: STtree T, max node size N, root r of T, entity set E = {e_{1}, e_{2},…, e_{n}}.
Ensure: STtree T.
1: T.clear()
2: for e_{i} ∈ E do
3:   Set e = a new empty entry
4:   e.id = e_{i}.name
5:   e.sig = e_{i}.signature
6:   e.MBR = MBR(Point(e_{i}.x, e_{i}.y), Point(e_{i}.x, e_{i}.y))
7:   e.seg = segment(T_{s}, T_{e})
8:   insert(T, N, r, e)
9: end for
Algorithm 2 insert(T, N, r, e)
Require: STtree T, max node size N, root r of T, entry e.
Ensure: STtree T.
1: Set node n = r
2: while n.isLeaf() = false do
3:   n.sig = n.sig | e.sig
4:   n.MBR = n.MBR ⊙ e.MBR
5:   n.seg = n.seg ⊗ e.seg
6:   Set node next = null
7:   Set minCost = ∞
8:   for each child n_{i} of n do
9:     if cost(n_{i}, e) < minCost then
10:      minCost = cost(n_{i}, e)
11:      next = n_{i}
12:    end if
13:  end for
14:  n = next
15: end while
16: Add e to n
17: n.sig = n.sig | e.sig
18: n.MBR = n.MBR ⊙ e.MBR
19: n.seg = n.seg ⊗ e.seg
20: if n.isFull() then
21:   split(T, N, r, n)
22: end if
Since the Rtree and the VStree have similar cost models, we adopt a cost model for constructing the STtree that integrates the independent cost models of the Rtree and the VStree.
The first cost in our model is the signature (bit string) cost. The signature cost represents the dissimilarity of two signatures. Equation 1 shows how to compute the signature cost when given signatures of tree nodes sig_{1} and sig_{2}, where Cost_{sig} denotes the signature cost, bit_{count}(sig) counts the number of the valid bits, and ⊕ means the xor operation.
The second cost in our model is the spatial cost. When two entries are combined, the spatial cost is the increase in the area of the MBR. In order to avoid ineffective insertion or splitting, we use the area of the rectangle’s circumcircle^{10} instead of the area of the original rectangle. Equation 2 shows how to compute the spatial cost while combining the tree nodes n_{1} and n_{2}, where Cost_{spa} denotes the spatial cost, Area(S_{i}) means the area of the rectangle’s circumcircle, and d(MBR_{i}) denotes the diameter of the tree node n_{i}’s MBR_{i}. Note that R_{i} ⊙ R_{j} denotes a binary operator that generates a rectangle surrounding the rectangles R_{i} and R_{j}.
The third cost in our model is the temporal cost. The temporal cost of two entities e_{1} and e_{2} is the increased length while combining e_{1} and e_{2}. Equation 3 shows how to compute the temporal cost while combining the tree nodes n_{1} and n_{2}, where seg_{1} ⊗ seg_{2} denotes the time interval surrounding seg_{1} and seg_{2}.
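The three cost components can be illustrated with a short Python sketch. Equations 1 to 3 are not reproduced in this excerpt, so the formulas below are assumptions read off the prose: the signature cost is the number of differing bits, and the spatial and temporal costs are taken as the growth of the first operand when the second is merged into it. All function names are hypothetical.

```python
import math

def signature_cost(sig1: int, sig2: int) -> int:
    """Dissimilarity of two signatures: number of set bits in sig1 XOR sig2."""
    return bin(sig1 ^ sig2).count("1")

def union_mbr(r1, r2):
    """Smallest rectangle (min_x, min_y, max_x, max_y) surrounding r1 and r2."""
    return (min(r1[0], r2[0]), min(r1[1], r2[1]),
            max(r1[2], r2[2]), max(r1[3], r2[3]))

def circumcircle_area(mbr) -> float:
    """Area of the circle whose diameter is the rectangle's diagonal."""
    diameter = math.hypot(mbr[2] - mbr[0], mbr[3] - mbr[1])
    return math.pi * (diameter / 2.0) ** 2

def spatial_cost(mbr1, mbr2) -> float:
    """Assumed form: growth of mbr1's circumcircle area after absorbing mbr2."""
    return circumcircle_area(union_mbr(mbr1, mbr2)) - circumcircle_area(mbr1)

def union_seg(s1, s2):
    """Smallest time interval surrounding s1 and s2."""
    return (min(s1[0], s2[0]), max(s1[1], s2[1]))

def temporal_cost(seg1, seg2) -> int:
    """Assumed form: growth of seg1's length after absorbing seg2."""
    merged = union_seg(seg1, seg2)
    return (merged[1] - merged[0]) - (seg1[1] - seg1[0])

print(signature_cost(0b1001101011, 0b1101000010))
print(temporal_cost((1900, 1950), (1940, 1960)))
```

One plausible reason for the circumcircle is visible in the sketch: two point entities on the same latitude span a rectangle of zero area, whereas the circumcircle of that degenerate rectangle still grows with the distance between the points.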
Algorithm 3 split(T, N, r, n)
Require: STtree T, max node size N, root r of T, full node n.
Ensure: STtree T.
1: Set S = n’s children set.
2: Find n_{1}, n_{2} ∈ S s.t. ∀n_{i}, n_{j} ∈ S, cost(n_{i}, n_{j}) ≤ cost(n_{1}, n_{2}).
3: Set seed_{1} = n_{1}, seed_{2} = n_{2} and S_{1} = ∅, S_{2} = ∅.
4: while S_{1}.size < N/2 and S_{2}.size < N/2 do
5:   Let diff(n_{i}) denote cost(n_{i}, n_{1}) – cost(n_{i}, n_{2}).
6:   Find n_{i} ∈ S s.t. ∀n_{j} ∈ S, diff(n_{j}) ≤ diff(n_{i}).
7:   Move n_{i} to S_{1} if diff(n_{i}) < 0, and to S_{2} otherwise.
8: end while
9: if S.size > 0 then
10:  Put all n_{i} ∈ S into the smaller S_{i}, S.clear().
11: end if
12: Allocate two new nodes, one for each group.
13: Set their children to seed_{1} ∪ S_{1} and seed_{2} ∪ S_{2}, respectively.
14: if n == r then
15:   Allocate new node n_{r} as root.
16:   Set the two new nodes as the children of n_{r}.
17: else
18:   Set n_{p} = n.parent and delete n.
19:   n_{p}.setAsChild(the two new nodes)
20:   if n_{p}.isFull() then
21:     split(T, N, r, n_{p})
22:   end if
23: end if
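The partitioning step of the split can be sketched in Python as follows. This is an illustration in the spirit of Algorithm 3 rather than a transcription: the seeds are placed into their groups up front, the next child is picked by the largest absolute preference (as in the quadratic split of the Rtree), and the `cost` function is passed in by the caller. The function name and parameters are hypothetical.

```python
from itertools import combinations

def split_children(children, cost, capacity):
    """Partition the children of a full node into two groups around two seeds."""
    n = len(children)
    # Seeds: the pair of children that is most expensive to keep together.
    s1, s2 = max(combinations(range(n), 2),
                 key=lambda p: cost(children[p[0]], children[p[1]]))
    pool = [i for i in range(n) if i not in (s1, s2)]
    group1, group2 = [s1], [s2]
    half = capacity // 2

    def diff(i):
        # Negative: child i is cheaper to combine with seed 1 than with seed 2.
        return cost(children[i], children[s1]) - cost(children[i], children[s2])

    while pool and len(group1) < half and len(group2) < half:
        pick = max(pool, key=lambda i: abs(diff(i)))
        pool.remove(pick)
        (group1 if diff(pick) < 0 else group2).append(pick)

    # Whatever is left goes to the smaller group to keep the halves balanced.
    if pool:
        smaller = group1 if len(group1) <= len(group2) else group2
        smaller.extend(pool)
    return [children[i] for i in group1], [children[i] for i in group2]

# Toy usage with one-dimensional "features" and absolute difference as cost.
left, right = split_children([1, 2, 3, 10, 11, 12], lambda a, b: abs(a - b), 6)
print(sorted(left), sorted(right))
```

In the STtree the `cost` argument would be the combined cost of the cost model; the toy absolute-difference cost only makes the grouping easy to follow by eye.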
Since each tree node owns both spatiotemporal and signature features, we take both the spatiotemporal cost and the signature cost into account when inserting an entity or splitting a full node. The cost of combining two nodes is shown in Equation 4, where 0 < α < 1, 0 < β < 1, and 0 < α + β < 1. Note that Z_{spa} = ∑ Cost_{spa}, Z_{tem} = ∑ Cost_{tem}, and Z_{sig} = ∑ Cost_{sig} are the normalization parameters that balance the scales of the spatial cost, the temporal cost, and the signature cost, respectively. In the experiments section, we design a specific experiment to determine the values of the parameters α and β.
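Equation 4 itself is not reproduced in this excerpt. One form consistent with the description above, with the later use of 1 − α − β as the weight of the signature cost, and with the spatial-only setting obtained by setting β to zero, would be the following weighted sum. The assignment of α to the spatial term and of β to the temporal term is inferred from the experiments rather than quoted.

```latex
\mathrm{Cost}(n_1, n_2) =
  \alpha \cdot \frac{\mathrm{Cost}_{spa}(n_1, n_2)}{Z_{spa}}
  + \beta \cdot \frac{\mathrm{Cost}_{tem}(n_1, n_2)}{Z_{tem}}
  + (1 - \alpha - \beta) \cdot \frac{\mathrm{Cost}_{sig}(n_1, n_2)}{Z_{sig}}
```

Dividing each component by its normalization parameter puts areas, interval lengths, and bit counts on a comparable scale before the weights are applied.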
Based on the cost model, we propose a method to construct an STtree. Algorithm 1 shows the procedure to build an STtree given a set of entities. At the very beginning, the STtree only has an empty root, and we set the maximum node size of the STtree. Given a set of entities, we iteratively insert the entities one by one into the STtree. While inserting an entity, we generate an entry representing the entity and insert the entry into the node with the lowest cost in a top-down manner. If the insertion produces a full node, we split the full node into two half-full nodes. Note that the splitting operation may produce a new full node. If the root needs to be split, we generate a new node, and then we set the new node as the root of the STtree and set the two split nodes as the new node’s children.
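The top-down descent of an insertion can be sketched as follows. To keep the example short, this Python sketch tracks only the signature feature and uses the bit-difference cost; the MBR and the segment would be widened on the way down in the same manner. The `Node` class and the returned overflow flag are hypothetical simplifications, and the split itself is left to the caller.

```python
class Node:
    def __init__(self, leaf=True, sig=0):
        self.leaf = leaf
        self.sig = sig        # union of the signatures stored below this node
        self.children = []    # child Nodes for inner nodes, entry signatures for leaves

def sig_cost(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def insert(root: Node, entry_sig: int, capacity: int):
    """Descend to the cheapest leaf, widening every node on the path.

    Returns the leaf that received the entry and a flag telling the caller
    whether that leaf now exceeds the capacity and must be split.
    """
    node = root
    while not node.leaf:
        node.sig |= entry_sig
        node = min(node.children, key=lambda c: sig_cost(c.sig, entry_sig))
    node.sig |= entry_sig
    node.children.append(entry_sig)
    return node, len(node.children) > capacity

# Toy tree: one inner root with two leaves.
leaf_a, leaf_b = Node(sig=0b1100), Node(sig=0b0011)
leaf_a.children, leaf_b.children = [0b1100], [0b0011]
root = Node(leaf=False, sig=0b1111)
root.children = [leaf_a, leaf_b]
target, overflow = insert(root, 0b1000, capacity=2)
print(target is leaf_a, overflow)
```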
5.3 Analysis of the STtree
5.3.0.1 Guarantee of Balance
Based on Theorem 1, the STtree should be a heightbalanced tree. Since the STtree is built based on the “insert” and “split” operations, it can be proven that the STtree is a heightbalanced tree.
Lemma 1. Given a height-balanced tree T, T is also balanced after splitting a node n into n_{1} and n_{2} using the “split” operation.
(Sketch) Since T is a height-balanced tree, the subtree T_{n} rooted at n is also a balanced tree, and the subtrees rooted at n’s children are balanced trees too. Based on the definition of the height-balanced tree, the new trees T_{1} and T_{2} rooted at n_{1} and n_{2}, respectively, are both height-balanced trees since the children of n_{1} (n_{2}) are subsets of n’s children.
Clearly, the height of T_{n} is the bigger one of T_{1}’s height and T_{2}’s height, and the difference between T_{1}’s height and T_{2}’s height is at most 1. Therefore, the new tree is also height balanced.
Since the remaining part of T is unchanged, if n is not the root of T, T remains height balanced after the splitting.
If n is the root of T, a new root r is set to be the father of n_{1} and n_{2}. Since T_{1} and T_{2} are height balanced and the height difference between T_{1} and T_{2} is no more than 1, the new tree is also height balanced. □
Lemma 2. Given a height-balanced tree T and a new node n, T is also a height-balanced tree after inserting n into T.
(Sketch) n is added to the lowest layer of T via the “insert” operation. Then, there are two cases:
 (1) The father node of n is not full, i.e., the insertion does not cause a splitting procedure. Clearly, T is also a height-balanced tree since the height of T is unchanged and the depth of each node in T is unchanged except for n.
 (2) The father node of n is full, i.e., a splitting procedure is triggered. Based on Lemma 1, T is also a height-balanced tree after the necessary splitting steps.
In summary, T remains height balanced after inserting n into T. □
Theorem 2. An STtree T is a height-balanced tree.
 (1) Clearly, an empty STtree is a height-balanced tree.
 (2) Based on Lemma 2, if an STtree is height balanced, it remains height balanced after an insertion.
In summary, T is a height-balanced tree since the construction of T is a series of insertions. □
In this stage, we ignore the “update” and the “delete” operations of the STtree because (1) in contrast to occasionally removing/editing statements, the real RDF data sets prefer to increase their scales, i.e., insertion is more important, and (2) we can use time stamps to manage the changed statements. In fact, it is easy to design the “update” and the “delete” operations of the STtree by referring to the same operations of the Rtree (or B^{+}tree, Stree, and so on).
5.3.0.2 Time Complexity
The STtree construction is a series of “insert” and “split” operations. Therefore, the time complexity of the tree construction depends on the time complexity of the two operations and the number of times that the operations are triggered. Suppose that the node capacity is set to be k, i.e., the tree nodes in the STtree have no more than k children without splitting. In other words, if a tree node has k + 1 children after an “insert” or a “split” operation, the node should be split.
Based on Algorithm 2, given an STtree T with height h, the time cost of an “insert” operation is O(h × k), where an insertion needs h times of comparison to find the lowest cost path, and O(k) times of cost computation with all the children of the chosen node in each comparison. Suppose that an RDF data set has n entities, the time cost of the insertion is O(n) × O(h × k) = O(n) × O(lg n × k) = O(nk lg n). Note that an insertion introduces a new node to the STtree.
Based on Algorithm 3, the “split” operations contribute O(nk) time in total over the whole construction.
As a result, the time complexity of the tree construction is O(nk lg n) + O(nk) = O(nk lg n), where n is the number of entities in the data set, and k is the node capacity of the STtree.
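To get a feeling for the height h that appears in the bound, the following Python sketch computes the smallest number of levels a tree with a given node capacity needs to hold n entries. It assumes that every node is completely full, which is an idealization; a real STtree with partially filled nodes is at least this high. The function name is hypothetical, and the sample numbers are the node capacity and the ST signature node count reported in the experiments.

```python
def min_levels(n_entries: int, capacity: int) -> int:
    """Smallest number of levels such that capacity ** levels >= n_entries."""
    levels, reach = 1, capacity
    while reach < n_entries:
        reach *= capacity
        levels += 1
    return levels

print(min_levels(10_557_223, 100))
```

With a capacity of 100, three levels reach at most one million entries, so about ten million entries need at least four levels, and each insertion then compares against at most 100 children on each of those levels.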
6 Query Processing
Given an ST query Q, we first convert Q into an ST signature graph Q*. The conversion process consists of three steps.
 Encode the triple patterns as described in Section 3.3.
 For each spatiotemporal range assertion, we add the corresponding absolute MBR or segment on the specific variables.
 For each spatiotemporal join assertion, we add the relevant MBRs or segments on the variables.
The Q* corresponding to Q is shown in Figure 8. The signatures are generated in the same way as when converting G to G*, except that the variables contribute no valid bit. The range assertions of Q are converted to the absolute MBRs binding ?y in Q*, and the join assertions of Q are converted to the relevant MBRs in Q*. Specifically, if an out-edge from node n has temporal range assertions, we also add the assertions to n; we call this “infection”.
Algorithm 4 Query Processing
Require: Q* = 〈v_{1}, …, v_{n}〉, STtree T, root r of T, signature data graph G*.
Ensure: The node candidate sets {NodeSet} of nodes of Q* in G*.
1: Set each NodeSet_{i} = {r} //initialize the node candidate sets.
2: while true do
3:   if ∀NodeSet_{i} ⊆ G* then
4:     return {NodeSet} //the sets contain real data points.
5:   end if
6:   for all NodeSet_{i} do
7:     NodeSet_{i} = the children of each node n_{i} ∈ NodeSet_{i}
8:     Set MBR_{i} = ∪{n.MBR | n ∈ NodeSet_{i}}
9:   end for
10:  for all nodes n_{i} ∈ NodeSet_{i} do
11:    if v_{i}.sig & n_{i}.sig ≠ v_{i}.sig then
12:      remove n_{i} from NodeSet_{i} //pruning rule 4.
13:    end if
14:    if v_{i} is bound by spatial range assertions then
15:      if intersection(v_{i}.mbr, n_{i}.mbr) = ∅ then
16:        remove n_{i} from NodeSet_{i} //pruning rule 1.
17:      end if
18:    end if
19:    if the temporal assertion of v_{i} is not null then
20:      if intersection(v_{i}.seg, n_{i}.seg) = ∅ then
21:        remove n_{i} from NodeSet_{i} //pruning rule 3.
22:      end if
23:    end if
24:    if ∃ an edge e between v_{i} and v_{j} in Q* then
25:      if n_{i}.neighbour ∩ NodeSet_{j} = ∅ then
26:        remove n_{i} from NodeSet_{i} //pruning rule 5.
27:      end if
28:    end if
29:    if Q* asserts dist(v_{i}, v_{j}) <= l then
30:      if dist(n_{i}, MBR_{j}) > l then
31:        remove n_{i} from NodeSet_{i} //pruning rule 2.
32:      end if
33:    end if
34:  end for
35: end while
After the corresponding Q* is generated, we next search for the matches of Q* in G* by exploiting the STtree. Considering an ST signature query graph Q* = {q_{1},…, q_{n}}, we first generate the node candidate set NodeSet_{i} for each variable q_{i}, and then verify each candidate in the query candidate set QSet = NodeSet_{1} × … × NodeSet_{n} to generate the matches of Q in G.
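The role of the query candidate set QSet can be made concrete with a brute-force Python sketch that enumerates the Cartesian product of the node candidate sets and keeps the combinations whose nodes are connected as the query requires. This is only an illustration of why small candidate sets matter; it is not the BFS-based verification of Algorithm 5, and the function name, the edge representation, and the sample entities are hypothetical.

```python
from itertools import product

def naive_matches(node_sets, query_edges, data_edges):
    """Enumerate QSet and keep the tuples that respect every query edge.

    node_sets   : one list of candidate data nodes per query variable
    query_edges : pairs (i, j) of variable positions joined by a query edge
    data_edges  : set of directed data edges (source, target)
    """
    matches = []
    for combo in product(*node_sets):
        if all((combo[i], combo[j]) in data_edges for i, j in query_edges):
            matches.append(combo)
    return matches

node_sets = [["Ulm", "Bonn"], ["BadenWürttemberg"]]
data_edges = {("Ulm", "BadenWürttemberg")}
print(naive_matches(node_sets, [(0, 1)], data_edges))
```

The number of enumerated tuples is the product of the candidate set sizes, so every candidate removed by a pruning rule shrinks the verification work multiplicatively.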
6.1 Pruning Rules
For efficiently generating the node candidate sets, we have the following five pruning rules. Pruning rules 1 and 2 are based on the spatial range and spatial join assertions respectively. Pruning rule 3 is based on the temporal range assertions. Pruning rule 4 is based on the signature, and Pruning rule 5 considers the edge features. Based on Theorem 1 (in Section 5), when node n is unsatisfied, the subtree rooted at n can be safely pruned.
6.1.0.1 Pruning Rule 1
Consider a variable v bound by a range assertion. If there is a tree node n where v.mbr has no intersection with n.mbr, the subtree rooted at n can be pruned safely.
For example, ?y in Q* has a range assertion. Thus, the subtrees rooted at
6.1.0.2 Pruning Rule 2
Consider two variables v_{i} and v_{j} bound by a spatial join assertion, where NodeSet_{i} is the candidate set of v_{i} and NodeSet_{j} is the candidate set of v_{j}. Suppose the maximum distance is set to be MaxDist. Let n_{i} ∈ NodeSet_{i}; if the distance from the MBR of n_{i} to every node n_{j} ∈ NodeSet_{j} is larger than MaxDist, n_{i} can be safely pruned.
In practice, we combine all the MBRs of the candidates of one variable into one MBR, and the minimal distance between two combined MBRs is considered as the lower bound of each candidate pair. Thus, the time complexity is reduced from O(m × n) to O(m + n), where m and n are the sizes of two candidate sets respectively.
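The combined-MBR shortcut can be sketched in Python as follows. The sketch assumes planar coordinates and the Euclidean distance between rectangles, which is a simplification for geographic data, and it follows the per-candidate test against the combined MBR of the other variable. All function names are hypothetical.

```python
import math

def union_mbr(mbrs):
    """Smallest rectangle (min_x, min_y, max_x, max_y) surrounding all given MBRs."""
    return (min(m[0] for m in mbrs), min(m[1] for m in mbrs),
            max(m[2] for m in mbrs), max(m[3] for m in mbrs))

def min_dist(a, b) -> float:
    """Minimal Euclidean distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def prune_by_join(cands_i, cands_j, max_dist):
    """Keep the candidates of v_i that may lie within max_dist of some candidate of v_j."""
    combined_j = union_mbr(cands_j)          # one pass over the n candidates
    return [m for m in cands_i               # one pass over the m candidates
            if min_dist(m, combined_j) <= max_dist]

cands_i = [(0, 0, 1, 1), (100, 100, 101, 101)]
cands_j = [(2, 0, 3, 1), (4, 4, 5, 5)]
print(prune_by_join(cands_i, cands_j, max_dist=2))
```

The test is conservative: the distance to the combined MBR never exceeds the distance to any individual candidate, so a pruned node cannot be part of a valid pair, while some surviving nodes may still fail the exact check later.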
For example, if the distance between ?x and ?y is set to be less than 50km, when only node
6.1.0.3 Pruning Rule 3
Consider a variable v. If its temporal assertion is not null and there is a tree node n where v.seg ∩ n.seg = ∅, the subtree rooted at n can be pruned safely.
For example, ?x in Q* has a temporal assertion. Thus, the subtree rooted at
6.1.0.4 Pruning Rule 4
Consider a variable v. If there is a tree node n where v.sig & n.sig ≠ v.sig, i.e., n does not cover all the valid bits of v, the subtree rooted at n can be pruned safely.
In Q*,
6.1.0.5 Pruning Rule 5
Consider two linked variables v_{i} and v_{j} with an edge e between them, where NodeSet_{i} and NodeSet_{j} are their candidate sets. Let n_{i} ∈ NodeSet_{i}; if there is no edge satisfying e from n_{i} to any node n_{j} ∈ NodeSet_{j}, the subtree rooted at n_{i} can be pruned safely.
The pruning rule is based on the fact that if there is no satisfied edge from n_{i} to any node n_{j} ∈ NodeSet_{j}, there is no satisfied edge from the descendants of n_{i} to any descendants of the n_{j} ∈ NodeSet_{j}. In practice, given a node n, all the features of the edges starting from n are integrated into one signature, one MBR, and one segment to reduce the time complexity.
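A minimal Python sketch of the edge-based pruning is given below. It assumes that each tree node keeps the set of its neighbours on the same level and one aggregated signature of all its outgoing edges; the dictionaries, the function name, and the sample node names are hypothetical, and the aggregated MBR and segment of the edges are omitted for brevity.

```python
def prune_by_edges(cands_i, cands_j, neighbours, out_sig, query_edge_sig):
    """Keep candidates of v_i that may have a satisfying edge into the candidates of v_j.

    neighbours     : node -> set of nodes reachable by one outgoing edge
    out_sig        : node -> OR of the signatures of all its outgoing edges
    query_edge_sig : signature of the query edge between v_i and v_j
    """
    targets = set(cands_j)
    kept = []
    for n in cands_i:
        # The aggregated edge signature must cover the query edge's bits.
        if (query_edge_sig & out_sig[n]) != query_edge_sig:
            continue
        # At least one neighbour must still be a candidate of v_j.
        if targets.isdisjoint(neighbours[n]):
            continue
        kept.append(n)
    return kept

neighbours = {"a": {"x"}, "b": {"y"}, "c": {"x"}}
out_sig = {"a": 0b0110, "b": 0b0110, "c": 0b0001}
print(prune_by_edges(["a", "b", "c"], ["x"], neighbours, out_sig, 0b0100))
```

Checking one aggregated signature per node costs a single bitwise operation, which is the saving the aggregation is meant to deliver; the price is that the test can only rule candidates out, never confirm them.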
Algorithm 4 describes the top-down process for generating the node candidate sets. The use of the pruning rules is shown in Lines 10–33.
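The interplay of the unary pruning rules (Rules 1, 3, and 4) with the top-down traversal can be sketched in Python as follows. The sketch handles one query variable at a time with a simple stack instead of advancing all variables level by level, and it writes the signature test as "the node covers every valid bit of the variable", mirroring the containment in the STtree rule. The `Item` type and the function names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

class Item(NamedTuple):
    name: str
    sig: int
    mbr: Optional[Tuple[float, float, float, float]] = None
    seg: Optional[Tuple[int, int]] = None
    children: tuple = ()

def mbr_intersects(a, b) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def seg_intersects(a, b) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def satisfies(q: Item, n: Item) -> bool:
    if (q.sig & n.sig) != q.sig:                                   # Rule 4
        return False
    if q.mbr is not None and (n.mbr is None or not mbr_intersects(q.mbr, n.mbr)):
        return False                                               # Rule 1
    if q.seg is not None and (n.seg is None or not seg_intersects(q.seg, n.seg)):
        return False                                               # Rule 3
    return True

def candidates(q: Item, root: Item):
    """Collect the leaves that survive the unary rules; failed subtrees are skipped."""
    found, stack = [], [root]
    while stack:
        n = stack.pop()
        if not satisfies(q, n):
            continue
        if n.children:
            stack.extend(n.children)
        else:
            found.append(n.name)
    return found

a = Item("A", 0b1010, (0, 0, 1, 1), (1900, 1950))
b = Item("B", 0b0110, (5, 5, 6, 6), (1960, 2000))
root = Item("root", 0b1110, (0, 0, 6, 6), (1900, 2000), (a, b))
print(candidates(Item("?y", 0b0010, mbr=(0, 0, 2, 2)), root))
```

Skipping a failed node without visiting its children is safe only because every father covers its descendants, which is what Theorem 1 guarantees.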
6.2 Verification
Algorithm 5 Verification
Require: node candidates {NodeSet}, Q* = 〈v_{1}, …, v_{n}〉, Q, G.
Ensure: the matches {M} of Q.
1: Set the match candidate list of Q*, L = ∅.
2: for each connected region Q_{i} ⊆ Q* do
3:   Select the NodeSet_{j} with the smallest size in Q_{i}.
4:   Set Q_{i}’s match candidate set to ∅.
5:   for each node n_{k} ∈ NodeSet_{j} do
6:     Run the BFS process from n_{k}.
7:     if ∃ a match candidate of Q_{i} reached from n_{k} then
8:       Add it to Q_{i}’s match candidate set.
9:     end if
10:  end for
11: end for
12: Set L to the combinations of the match candidates of all connected regions.
13: Set M* = ∅. //The matches of Q*.
14: for each match candidate in L do
15:   if all ST assertions are valid on the match candidate then
16:     M*.add(the match candidate)
17:   end if
18: end for
19: Set M = ∅. //The matches of Q.
20: for each m* ∈ M* do
21:   Get the subgraph m ⊆ G corresponding to m*.
22:   if all literal constraints are valid on m then
23:     M.add(m).
24:   end if
25: end for
26: return M.
For the node candidate set {NodeSet}, we generate a list of nodes 〈v_{1},…, v_{n}〉 by taking one node from each item of {NodeSet}, and verify whether 〈v_{1},…, v_{n}〉 forms connected regions that correspond to the connected regions in Q*. If it does, we consider it a match candidate of Q*; otherwise, we discard it. The generating process can be accomplished by using a breadth-first search (BFS) algorithm starting from the smallest node candidate set in each connected region. If there is an edge e =
Given a match candidate
7 Experiments
To the best of our knowledge, only YAGO2 Demo (Hoffart et al., 2011) and SPARQLST (Perry, Jain, & Sheth, 2011) are available spatiotemporal RDF data management systems. Since the technical details of YAGO2 Demo are not reported, SPARQLST is chosen to make a comparison with g^{st}store. In addition, we also make a comparison between g^{st}store, a postprocessing method, Sstore (Wang et al., 2013) and an enterprise system Virtuoso.
Our demo is available at http://59.108.48.17:8080/GStoreWangDong/query.jsp.
7.1 Data Set & Setup
7.1.0.1 Data Set
YAGO2 is a real data set based on Wikipedia, WordNet, and GeoNames. The latest version of YAGO2 has >10 million entities and 440 million statements. We obtain a spatiotemporal RDF data set from YAGO2 by removing the statements that describe the date when another statement was extracted or the uniform resource locator (URL) from which another statement was extracted. The condensed data set has >10 million entities/classes and >180 million statements. More than 7 million entities are spatial entities, >90 million statements are spatial statements, and >28 million statements are temporal statements. Based on YAGO2, we generate 10,557,223 ST signature nodes, of which 7,394,075 have non-null spatial features and 1,266,865 have non-null temporal features.
7.1.0.2 Queries and Setup
In order to evaluate our approach, we manually generate 20 sample ST SPARQL queries that have different features. The sample queries are divided into 10 classes, i.e., Ssimple, SRE, SJE, SS, SC, Tsimple, TR, TJ, TC and ST. We run all queries on a personal computer (PC) server with an Intel Xeon CPU E5645 running at 2.40 GHz and 16 GB main memory. The node capacity is set to be 100, i.e., a node in the STtree has no more than 100 children. Our previous work (Wang et al., 2013) shows that the node capacity has little effect on the performance.
 Ssimple: Simple queries with Spatial range assertions of entities.
 SRE: Queries with Spatial Range assertions of Entities.
 SJE: Queries with Spatial Join assertions of Entities.
 SS: Queries with Spatial assertions of Statements.
 SC: Complex queries with all kinds of Spatial assertions.
 Tsimple: Simple queries with Temporal range assertions.
 TR: Queries with Temporal Range assertions.
 TJ: Queries with Temporal Join assertions.
 TC: Complex queries with all kinds of Temporal assertions.
 ST: Queries with all kinds of SpatioTemporal assertions.
Table 1 shows the result set size of each query. In order to illustrate the reason why the postprocessing method (i.e., finding SPARQL query results by ignoring the spatiotemporal assertions and then verifying the candidates by the spatiotemporal assertions) is not efficient, we report the result sizes of all queries discarding the spatiotemporal assertions and the final ST query result sizes. From Table 1, we observe that the result sizes discarding the spatiotemporal assertions are very large even though the final ST queries have <10 results, such as Ssimple1. It means that the postprocessing method needs a lot of effort during the verification process.
Table 1. The Result Set Size of Queries
Query  ST Queries  SPARQL Queries ignoring ST assertions
Ssimple1  3  10,137,491
Ssimple2  1,177  8,567
SRE1  1  36
SRE2  10  50
SJE1  18  36
SJE2  25  50
SS1  2  36
SS2  23  50
SC1  7  40
SC2  12  50
Tsimple1  5,513  90,647
Tsimple2  37  8,567
TR1  13  36
TR2  4  50
TJ1  25  36
TJ2  7  50
TC1  25  36
TC2  23  50
ST1  8  36
ST2  3  50
7.2 Evaluating the Parameters of Cost Model
In this section, we evaluate how variations of α and β affect the query performance. Since the number of combinations of α and β is large, we adjust α and β separately and then combine the respective optimal ratios to build the STtree. For convenience, we use γ to denote 1 − α − β.
First of all, we set β to zero, i.e., we focus only on the spatial information. In order to obtain the optimal α, we vary α from 0 to 1 with step size 0.1. The query sets Ssimple, SRE, SJE, SS, and SC are used for adjusting α. We report the average time cost of different α values in Figure 9. Based on the performance curve, we set the ratio of γ to α to be 5:5. Second, we set α to zero and vary β to choose the best ratio of γ to β. The performance of different β values is shown in Figure 10. Note that in this experiment, we use the query sets Tsimple, TR, TJ, TC, and ST. Based on the result, the ratio of γ to β is set to be 7:3. Combining the two ratios gives γ : α : β = 7 : 7 : 3; therefore, we use α = 0.41, β = 0.18, and γ = 0.41 as the optimal cost ratio to build the STtree.
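The final weights follow from the two pairwise ratios by simple normalization. The following Python sketch reproduces the arithmetic with exact fractions; the function name and its argument layout are hypothetical, and the third returned value is the weight 1 − α − β of the signature cost.

```python
from fractions import Fraction

def combine_ratios(sig_to_alpha, sig_to_beta):
    """Turn the ratios (signature weight : α) and (signature weight : β) into weights summing to 1."""
    sig_a, a = sig_to_alpha
    sig_b, b = sig_to_beta
    sig_w = Fraction(1)                 # fix the signature weight, then rescale
    alpha = sig_w * Fraction(a, sig_a)
    beta = sig_w * Fraction(b, sig_b)
    total = alpha + beta + sig_w
    return alpha / total, beta / total, sig_w / total

alpha, beta, sig_weight = combine_ratios((5, 5), (7, 3))
print(alpha, beta, sig_weight)
print(round(float(alpha), 2), round(float(beta), 2), round(float(sig_weight), 2))
```

The exact values are 7/17, 3/17, and 7/17, which round to the reported 0.41, 0.18, and 0.41.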
7.3 Evaluating Entity Organization
In this section, we evaluate whether different entity organization styles affect the offline and online performances. There are four different tree construction methods, which are STtree, VStree (Zou, Mo, Chen, Özsu, & Zhao, 2011), Rtree based on the spatial MBR, and Rtree based on the temporal segment. After adding the necessary features and building the ST signature graphs, all the four kinds of trees can answer ST queries. In the following, we use VStree^{+} to denote the tree based on VStree, RtreeS^{+} to denote the tree based on spatial Rtree, and RtreeT^{+} to denote the tree based on temporal Rtree.
Table 2 shows the offline cost. The STtree demands lower storage space than the other three tree styles. Since all the four kinds of tree construction methods can be modeled as cost modelbased methods, the result shows that the cost model of the STtree is more effective than the others. The last row of Table 2 shows the time cost of the tree construction. Clearly, the more complex the cost model is, the more time cost is incurred to build the tree structure. The STtree requires the most time cost. However, only less than half an hour is needed to build the STtree. Note that we only consider the tree construction and ignore the ST signature graph construction.^{12}
Table 2. Offline Cost of Tree Construction
Index Style  STtree  VStree^{+}  RtreeS^{+}  RtreeT^{+} 
Index Size(MB)  1,971  2,066  1,975  2,263 
Node Count  177,882  186,388  178,178  204,107 
Time Cost(s)  1,726  855  773  947 
Table 3 shows the online time cost of the queries based on different tree style indexes. Obviously, the STtree based on our cost model outperforms the other tree styles. In other words, the cost model of the STtree takes both semantic feature and spatiotemporal feature into consideration, which improves the performance.
Table 3. Online Cost of Different Tree Styles
Index Style  STtree  VStree^{+}  RtreeS^{+}  RtreeT^{+} 
Average Time(ms)  1,439  2,158  3,503  3,039 
7.4 Evaluating Performance
To evaluate the efficiency of our approach, we choose four baseline approaches, which are denoted as gStore^{+}, Virtuoso, Sstore^{+} and SPARQLST respectively.
The gStore^{+} method adopts the postprocessing solution, which runs the SPARQL queries on an RDF query engine by ignoring the spatiotemporal assertions and then refining the candidates by considering the spatiotemporal assertions. In practice, this approach exploits gStore (Zou, Mo, Chen, Özsu, & Zhao, 2011) as the RDF management system, and the node capacity is set to be 100. Besides, MySQL is used to retrieve the spatiotemporal information of the entities and the statements.
The Virtuoso approach uses the enterprise system Virtuoso, whose documentation states that it can organize spatial RDF data.
The Sstore is our early work based on spatial RDF data. The index SStree of Sstore can be separated into an Rtree based on the spatial entities and a VStree based on the nonspatial entities. After simply adding the temporal features in the nodes and edges of the SStree, the Sstore can be extended to answer ST queries. Here, we use Sstore^{+} to denote this baseline method.
We implement the method of SPARQLST (Perry, Jain, & Sheth, 2011) as the fourth baseline. MySQL is used as the data management system. The B^{+} tree index and the spatial index of MySQL are used. In this section, we make a comparison between g^{st}store and the four baselines. The query response times are shown in Table 4.
Table 4. The Performance Comparison (Time Cost in Seconds)
Query  g^{st}store  gStore^{+}  Virtuoso  Sstore^{+}  SPARQLST
Ssimple1  1.15  >30min  2.79  0.19  0.53
Ssimple2  0.11  112.40  19.84  0.16  >30min
SRE1  0.41  5.89  128.03  0.45  109.33
SRE2  0.69  9.56  >30min  0.78  113.38
SJE1  2.09  4.48  178.31  2.70  157.44
SJE2  2.14  4.13  1727.65  2.74  152.47
SS1  1.98  3.62  -  2.60  -
SS2  2.11  6.75  -  2.70  -
SC1  0.84  5.84  -  1.16  -
SC2  1.39  3.78  -  1.73  -
Tsimple1  1.56  >30min  -  1.92  204.11
Tsimple2  0.07  130.12  -  0.10  156.77
TR1  1.96  4.08  -  2.61  65.29
TR2  2.08  5.14  -  2.70  55.18
TJ1  1.96  4.07  -  2.61  150.44
TJ2  2.11  3.16  -  2.73  160.47
TC1  1.96  4.08  -  2.60  185.97
TC2  2.10  3.46  -  2.69  107.88
ST1  1.16  4.57  -  1.43  -
ST2  1.41  3.70  -  1.75  -
Since the query sets Ssimple and Tsimple have many candidate results (Table 1), the time cost of gStore^{+} is unacceptable. gStore^{+} cannot get the results of Ssimple1 or Tsimple1 in reasonable time (half an hour), and the time costs for Ssimple2 and Tsimple2 are >100 seconds. However, our approach (g^{st}store) can answer these queries efficiently. Although the other queries have just a few candidate results without spatiotemporal assertions, g^{st}store still outperforms the gStore^{+} approach.
Only some of the queries (the query sets Ssimple, SRE, and SJE) can be answered using Virtuoso. On these queries, g^{st}store outperforms Virtuoso by up to three orders of magnitude. Here, the mark “-” denotes that the query cannot be answered.
Compared with Sstore^{+}, g^{st}store performs better in all queries except for Ssimple1. In Table 1, we can find that Ssimple1 has low selectivity on the semantic constraint (>90% of the entities are selected) and high selectivity on the spatial constraint (only three entities are returned as the result). As a result, Sstore^{+} behaves like an Rtree over the spatial entities (because of its composite construction method), whereas g^{st}store is hindered by its combined cost model. In the other cases, the cost model performs well, and g^{st}store outperforms Sstore^{+}.
The SPARQLST approach can answer most of the queries, except query sets SS, SC and ST. This is because these three query sets involve the spatial features of the statements, which are out of the SPARQLST data model. For the same reason as the competition with Sstore^{+}, the SPARQLST approach performs better than g^{st}store for Ssimple1. In other cases, SPARQLST costs several minutes to answer the queries, much slower than g^{st}store. In summary, g^{st}store outperforms its competitors in most scenarios.
8 Conclusions
In this paper, we introduce ST queries, a variant of SPARQL language, to query RDF data with spatiotemporal features. In order to answer ST queries efficiently, we build a hybrid index, called STtree, in our g^{st}store system, an engine for large RDF graphs integrating spatial and temporal information. Several pruning rules are introduced in the query algorithm to reduce the search space. The experiment results on a real large RDF graph show the effectiveness and the efficiency of our approach.
A The Sample Queries
In this appendix, the 20 queries used in the experiments are shown in Figure 11.
References
Abadi D. J. Marcus A. Madden S. R. & Hollenbach K. (2007 September). Scalable semantic web data management using vertical partitioning. Paper presented at the Proceedings of the 33rd international conference on Very large data bases (pp. 411422). VLDB Endowment.
Abadi D. J. Marcus A. Madden S. R. & Hollenbach K. (2009). SWStore: a vertically partitioned DBMS for Semantic Web data management. The VLDB Journal 18(2) 385406. http://dx.doi.org/10.1007/s007780080125y
Batsakis S. & Petrakis E. G. (2010 September). SOWL: spatiotemporal representation reasoning and querying over the semantic web. In Proceedings of the 6th International Conference on Semantic Systems (p. 15). ACM. http://dx.doi.org/10.1145/1839707.1839726
Brodt A. Nicklas D. & Mitschang B. (2010 Novenmber). Deep integration of spatial query processing into native RDF triple stores. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 3342). ACM. http://dx.doi.org/10.1145/1869790.1869799
Broekstra J. Kampman A. & Van Harmelen F. (2002 June). Sesame: A generic architecture for storing and querying rdf and rdf schema. Paper presented at the International semantic web conference (pp. 5468). Springer Berlin Heidelberg.
Deppisch U. (1986 September). Stree: a dynamic balanced signature index for oflce retrieval. In Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 7787). ACM. http://dx.doi.org/10.1145/253168.253189
Erling O. & Mikhailov I. (2009). RDF Support in the Virtuoso DBMS Networked KnowledgeNetworked Media (pp. 724): Springer Berlin Heidelberg. http://dx.doi.org/10.1007/9783642021848_2
Grandi F. (2010 Septermber). TSPARQL: A TSQL2like Temporal Query Language for RDF. Paper presented at the ADBIS (Local Proceedings).
Gutierrez C. Hurtado C. A. & Vaisman A. (2007). Introducing Time into RDF. IEEE Transactions on Knowledge and Data Engineering 19(2) 207218. http://dx.doi.org/10.1109/TKDE.2007.34
Gutiérrez C. Hurtado C. A. & Vaisman A. A. (2005). Temporal RDF. Paper presented at the The Semantic Web: Research and Applications Second European Semantic Web Conference ESWC 2005 Heraklion Crete Greece May 29  June 1 2005 Proceedings.
Guttman A. (1984). Rtrees: a dynamic index structure for spatial searching. In ACM SIGMOD International Conference on Management of Data (Vol.14 pp.4757). ACM. http://dx.doi.org/10.1145/602259.602266
Haklay M. & Weber P. (2008). Openstreetmap: Usergenerated street maps. IEEE Pervasive Computing 7(4) 1218. http://dx.doi.org/10.1109/MPRV.2008.80
Hoffart J. Suchanek F. M. Berberich K. LewisKelham E. De Melo G. & Weikum G. (2011 March). YAGO2: exploring and querying world knowledge in time space context and many languages. In Proceedings of the 20th international conference companion on World wide web (pp. 229232). ACM. http://dx.doi.org/10.1145/1963192.1963296
Hoffart, J., Suchanek, F. M., Berberich, K., & Weikum, G. (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194, 28–61. http://dx.doi.org/10.1016/j.artint.2012.06.001
Klyne, G., Carroll, J. J., & McBride, B. (2004, February). Resource Description Framework (RDF): Concepts and abstract syntax. World Wide Web Consortium Recommendation.
Lyell, M., Voyadgis, D., Song, M., Ketha, P., & Dibner, P. (2011, May). An ontology-based spatiotemporal data model and query language for use in GIS-type applications. In Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications (p. 15). ACM. http://dx.doi.org/10.1145/1999320.1999335
Neumann, T., & Weikum, G. (2009). RDF-3X: A RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647–659. http://dx.doi.org/10.14778/1453856.1453927
Neumann, T., & Weikum, G. (2010). x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment, 3(1–2), 256–263. http://dx.doi.org/10.14778/1920841.1920877
Perry, M., Jain, P., & Sheth, A. P. (2011). SPARQL-ST: Extending SPARQL to support spatiotemporal queries. Semantic Web & Beyond, 12, 61–86. http://dx.doi.org/10.1007/978-1-4419-9446-2_3
Pugliese, A., Udrea, O., & Subrahmanian, V. (2008, April). Scaling RDF with time. In Proceedings of the 17th International Conference on World Wide Web (pp. 605–614). ACM. http://dx.doi.org/10.1145/1367497.1367579
Singh, R., Turner, A., Maron, M., & Doyle, A. (2008). GeoRSS: Geographically encoded objects for RSS feeds. http://georss.org/gml
Tappolet, J., & Bernstein, A. (2009, May). Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL. In European Semantic Web Conference (pp. 308–322). Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-02121-3_25
Wang, D., Zou, L., Feng, Y., Shen, X., Tian, J., & Zhao, D. (2013, April). S-store: An engine for large RDF graph integrating spatial information. In International Conference on Database Systems for Advanced Applications (pp. 31–47). Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-37450-0_3
Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1), 1008–1019. http://dx.doi.org/10.14778/1453856.1453965
Wilkinson, K. (2009). Jena property table implementation. SSWS.
Wilkinson, K., Sayers, C., Kuno, H., & Reynolds, D. (2003, September). Efficient RDF storage and retrieval in Jena2. Paper presented at the International Conference on Semantic Web and Databases (pp. 120–139). CEUR-WS.org.
Zou, L., Mo, J., Chen, L., Özsu, M. T., & Zhao, D. (2011). gStore: Answering SPARQL queries via subgraph matching. Proceedings of the VLDB Endowment, 4(8), 482–493. http://dx.doi.org/10.14778/2002974.2002976