Introduction
In the task of predicting a user’s next location, one of the key challenges lies in effectively integrating the spatial characteristics of locations with the temporal patterns of user behavior to improve prediction accuracy. To address this challenge, recent research has emphasized training models on the trajectory data of all users, thereby capturing global behavioral patterns and spatiotemporal preferences. Leveraging these learned patterns, the model can infer the most probable next location for a user based on their recent trajectory, regardless of whether the user is new or existing.
The related definitions for predicting a user’s next location are as follows:
Let the user set be denoted as , where each user u has an associated trajectory record. The location set is defined as , where each location represents an encoded place. For example, A12 may indicate a specific convenience store (e.g., FamilyMart Guishan Branch), B07 may indicate a specific library (e.g., Taoyuan Library or Guishan Library), and C10 may indicate a specific restaurant (e.g., McDonald’s or KFC). Furthermore, let the location category set be defined as . Each location has a corresponding category , which denotes its categorical attribute. For instance, McDonald’s and KFC correspond to the “restaurant” category, FamilyMart and 7-Eleven correspond to the “convenience store” category, while Taoyuan Library and Guishan Library correspond to the “library” category.
Subsequently, a user’s trajectory data S is defined as a location sequence segmented by day, denoted as , in which represents the sequence of locations visited on the i-th day in chronological order. Each trajectory can also be mapped to a corresponding location category sequence , Furthermore, this study divides trajectories into two parts: the historical trajectory sequence and the current trajectory sequence .The prediction target is defined as the next potential location to be visited. In other words, given the historical trajectory and the current trajectory, the objective is to predict the location that the user is most likely to visit next.
In other words, the research problem can be described as follows: after collecting all users’ historical and current trajectory data (including location coordinates, location categories, and temporal information), how can we construct a predictive model that effectively learns group behavior patterns, and then, for any given user—whether a new user or one with existing records—infer the most likely next location based on their recent behavior. For example, if during the training phase the model observes that most users tend to visit a particular shopping district for lunch around noon, then when a user’s current trajectory indicates movement toward that area, the model should be able to predict, based on spatiotemporal behavioral features, that the user’s next step is likely to be visiting a restaurant in that district. This not only reflects current behavioral trends but also enables personalized recommendations aligned with individual preferences.
The aforementioned research problem highlights that modeling and integrating spatial and temporal information is one of the key challenges in predicting users’ next locations. In particular, locations not only contain geographic coordinates but also semantic categorical features, while users’ behavioral patterns are often influenced by temporal cycles and individual preferences. These complex factors must be effectively incorporated into predictive models. To design a system that truly reflects real-world user behavior, it is essential to explore how to process and represent these high-dimensional and dynamic spatiotemporal features. Based on this perspective, this study further elaborates its research motivation and objectives, examines the limitations of existing methods, and outlines the proposed model’s improvements and expected benefits.
In recent years, features such as geographic location, timestamps, and user behavior patterns have been widely utilized as crucial information for predicting users’ future movements in industries including tourism, smart cities, transportation management, and any domain where users’ dynamics can be observed through spatial and temporal data. Among these approaches, representing locations as embedding vectors has emerged as a particularly popular and innovative data processing method. Embedding vectors refer to techniques that transform high-dimensional sparse data (e.g., text, locations, images) into low-dimensional dense representations [
4]. In machine learning, embeddings are commonly applied to map discrete data that may appear unrelated on the surface into continuous vector spaces, thereby enabling the discovery of latent semantic information and relationships among these features.
Sassi and Brahimi et al. [
5] proposed LOC2vec, which models locations as contextual entities with sequential dependencies, employing Skip-gram [
6] to learn inter-location associations and a CNN-based framework [
2] for next-location prediction. However, relying solely on location embeddings introduces several limitations: (1) locations with similar environments may cause confusion in distinguishing spatial differences; (2) data imbalance arises when popular regions dominate the training process while sparse locations are underrepresented, leading to biased predictions; and (3) structural information, such as temporal and sequential patterns in user behavior, is overlooked when only spatial features are considered.
To overcome the limitations of purely sequential models in capturing spatial structures, this study proposes a hybrid architecture that integrates Graph Convolutional Networks (GCN) [
7] and Long Short-Term Memory (LSTM) networks [
1]. GCN, as a specific method within Graph Neural Networks (GNN) [
3], is combined with LSTM to simultaneously capture spatial dependencies among locations and temporal dynamics of user behavioral patterns. The overall framework consists of two major phases: Feature Preprocessing and Model Construction and Prediction.
In Feature Preprocessing phase, raw mobility data are retrieved from the database and subjected to two categories of preprocessing. First, spatial feature preprocessing extracts and encodes location attributes such as geographic coordinates and location categories to generate enriched spatial representations. Second, temporal feature preprocessing processes time-related attributes, including visit sequences, timestamps, and periodic patterns, to capture the dynamic characteristics of user mobility.
In Model Construction and Prediction phase, the architecture incorporates two primary modules: LSTM and GCN. The LSTM module models users’ historical and current trajectories to extract sequential behavioral features. To improve temporal modeling precision, a time-decay mechanism is embedded into the LSTM. By comparing temporal slots (e.g., morning vs. afternoon, weekday vs. weekend), the similarity between historical behaviors and the current time period is measured, allowing the model to adjust the influence of historical data and enhance its ability to perceive users’ periodic preferences. Meanwhile, the GCN module treats locations as graph nodes, with edges constructed based on Euclidean distances between locations. Through graph convolution, spatially informed node embeddings are obtained, which capture both topological structures and functional similarities among locations, serving as robust representations of users’ spatial preferences.
Finally, the outputs from both the LSTM and GCN branches are fused in the Prediction module to generate the next-location prediction. This integration compensates for the limitations of single-model approaches that only consider temporal or spatial aspects, thereby enhancing the model’s overall capability to understand user behavior and improving prediction accuracy. By jointly incorporating spatial dependencies and temporal dynamics, the proposed framework achieves more effective and reliable forecasting of users’ future mobility.
Related Work
In numerous scientific and engineering disciplines, data is inherently characterized by graph structures. Such representations not only delineate the explicit relationships among entities but also encapsulate rich semantic and topological information. For example, in the domain of chemistry, molecules can be naturally modeled as graphs, where atoms correspond to nodes and chemical bonds are represented as edges. In natural language processing, syntactic structures are frequently formalized as directed or undirected graphs to capture linguistic dependencies. Similarly, in domains such as social networks, recommender systems, and information retrieval, graphs serve as an intuitive and powerful formalism for representing complex interactions and relational patterns.
2.1. Graph Neural Network
Traditional neural network architectures, such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs) [
2], and recurrent neural networks (RNNs) [
1] are inherently designed to process data structured as vectors, matrices, or sequences. However, these models lack the ability to directly accommodate irregular, unstructured, or topologically dynamic data. To address this limitation, early research often relied on preprocessing strategies that transformed graph data into fixed-dimensional vectors. Such transformations were typically achieved through manual feature engineering or graph embedding methods, which encode structural characteristics into vector representations for subsequent processing by conventional models.
Although early approaches succeeded in utilizing existing techniques, they exhibited several intrinsic limitations when applied in practice. First, the process of vectorization often resulted in the loss of crucial structural information, as topological dependencies among nodes could not be adequately preserved. Second, these methods relied heavily on labor-intensive manual feature engineering, requiring the design of task-specific graph descriptors for each domain, and thus lacked true end-to-end learning capability. Third, their capacity for generalization remained limited, as conventional graph embedding methods frequently failed to capture both local and global semantic characteristics, leading to unstable performance across diverse datasets.
Franco et al. [
3] introduced the Graph Neural Network (GNN) architecture to address the inherent limitations of earlier methods for learning from graph-structured data. Unlike conventional approaches that convert graphs into fixed-dimensional vectors, GNNs are explicitly designed to operate directly on arbitrary graph structures, thereby preserving their intrinsic topology. The architecture leverages localized message passing and global convergence mechanisms, enabling each node’s representation to iteratively incorporate information from its neighbors while maintaining the broader structural context. This mechanism allows GNNs to capture both local interactions and global dependencies, ensuring that the learned representations remain faithful to the original graph.
Typically, a GNN comprises nodes, edges, node features, edge features, and graph-level attributes, all of which interact to enhance representational capacity. The objective is to jointly model node relationships and overall topological structure, thereby facilitating a broad range of tasks such as node classification, edge prediction, and graph classification. By combining structural fidelity with learning efficiency, GNNs provide a flexible and robust framework for analyzing unstructured, graph-structured data. Their versatility has enabled widespread adoption across domains including social network analysis, natural language processing, recommendation systems, and bioinformatics, demonstrating both theoretical significance and practical applicability.
2.2. Graph Convolutional Network (GCN)
Jiang et al. [
7] proposed the Graph Convolutional Network (GCN), an extension of the Graph Neural Network (GNN) framework. The key distinction between GNN and GCN lies in their operational focus: while GNN primarily provides a general framework for modeling graph structures, GCN explicitly exploits this structural information through iterative aggregation and update mechanisms. After numerically encoding node and edge features, GCN employs a message-passing scheme in which each node embedding is iteratively updated by aggregating information from its neighboring nodes, followed by a transformation or update function to refine its representation. This message-passing mechanism forms the core of GCN, allowing the model to effectively capture both local topological patterns and semantic dependencies within graph data.
In the standard GCN framework [
7], a single message-passing step typically consists of the following three stages: The first stage is Message Generation: Each node
receives information from its neighboring nodes
. This information is usually determined jointly by the feature vector of the neighbor node
, the edge features
, and the feature of the current node
. The message generation equation is given in Equation (1), where the function
is a learnable message-generation function. The second stage is Message Aggregation: For each node
, all incoming messages from its neighbors are aggregated, as shown in Equation (2). The aggregation can be performed in three common ways: summation, mean, or maximum. None of these approaches is universally superior; the choice depends on the data characteristics and the specific requirements of the task. The third stage is Node Update: The node
updates its representation vector by combining the aggregated neighbor messages
with its current state
. The update equation is defined in Equation (3), where the
denotes the update function. This can also be implemented using a learnable neural network module.
In addition to message passing between nodes, the message-passing mechanism of GCN can also be extended to interactions between edges and nodes. Specifically, not only can the feature information of a node be propagated to its adjacent edges, but the edge features can also be fed back to the nodes. Furthermore, all node and edge information can be aggregated and transmitted to a global node (Super Node), which serves as the holistic semantic representation of the entire graph. The introduction of a global node aims to compensate for the insufficiency of local information, particularly for nodes and edges that are located at the periphery of the graph, have a limited number of neighbors, or appear infrequently. Through the aggregation and feedback mechanism of the global node, the model can effectively integrate structural and semantic information from across the entire graph. This enables boundary nodes to also receive supplementary information from other parts of the graph, thereby enhancing the overall learning performance and generalization capability.
Graph Convolutional Networks (GCNs) [
7] have emerged as powerful tools for processing non-Euclidean structured data, and in recent years they have been extensively applied across multiple domains, demonstrating strong capabilities in feature representation and relational modeling. In the domain of recommender systems, Ying et al. [
12] introduced the PinSage model, which integrates graph convolution with efficient neighbor sampling strategies. This approach was successfully deployed in Pinterest’s large-scale image recommendation system, illustrating that GCN can support accurate and efficient recommendation tasks at Web scale. For human action recognition, Yan et al. proposed ST-GCN [
13], which constructs skeleton-based human joint graphs and leverages spatio-temporal convolutions to model human motion patterns. This method substantially enhances recognition accuracy compared with prior approaches. In traffic forecasting, Huang et al. developed the Diffusion Convolutional Recurrent Neural Network (DCRNN) [
14], which models traffic flow as a diffusion process on road networks while incorporating time-series modeling for multi-step prediction. Their results demonstrate significant improvements over traditional methods in complex traffic environments.
2.2. Long- and Short-Term Preference Modeling (LSTPM)
The LSTPM framework [
9] was proposed to address the challenge of jointly modeling temporal dependencies and spatial relationships in user trajectories, which earlier approaches based solely on LSTM or spatial embeddings struggled to capture effectively. In this framework, user trajectories are divided into historical and current sequences, each processed independently through an LSTM [
1] to preserve temporal order and behavioral context. Points of interest (POIs) are represented by their geographic coordinates (latitude and longitude), following the formulation outlined in Section 1. The overall architecture consists of two hierarchical layers: the non-local operation and the geo-nonlocal operation. The non-local operation is designed to capture long-range dependencies within temporal and behavioral contexts, whereas the geo-nonlocal operation further incorporates spatial distance as a weighting factor. This integration enables the model to refine and enhance information by accounting for the spatial proximity between locations.
Specifically, the historical trajectory is first fed into the LSTM, transforming the raw coordinate-based into continuous spatial vectors of dimension d, denoted as . Once the hidden representations of all locations are obtained, temporal decay weights are computed for each location. The temporal dimension is discretized into 48 intervals (24 for weekdays and 24 for weekends/holidays). The similarity between the current trajectory and the trajectories observed in the corresponding time intervals is then evaluated.
After computing the similarity of each temporal interval, the model aggregates the hidden representations of all locations within each historical trajectory by weighting them with the corresponding temporal interval weights. Specifically, the hidden state of each location is multiplied by the normalized weight of its temporal interval, summed across the trajectory, and divided by the trajectory length, as formulated in Equation (4). Given the temporal intervals
, the weight
is derived from the similarity between the current temporal trajectory
and the historical temporal interval trajectory
. Multiplying
with the hidden state
and averaging across the trajectory yields the long-term trajectory representation
.
For the current trajectory
, the model takes the location coordinates as input and processes them through an LSTM to obtain hidden vectors at each time step. The hidden outputs are then averaged to generate a short-term preference vector
, as described in Equation (5), where
denotes the hidden state of the current location.
The non-local operation aims to measure the similarity between the current trajectory output
and each historical trajectory
, while applying an exponential transformation to the similarity scores. As shown in Equation (6),
, where the similarity scores are normalized to ensure that the total weight sums to one. The similarity function
captures the spatial correlation between the current trajectory
and a historical trajectory
, and
represents a learnable projection of the historical trajectory using the projection matrix
. This operation effectively computes the spatial similarity between
and
and redistributes the weights to produce the updated representation
.
In the geo non-local operation, the centroid of each historical trajectory is first computed by averaging the latitude and longitude of its locations. The Euclidean distance
between this centroid and the last coordinate of the current trajectory is then calculated. Equation (7) extends the non-local operation by incorporating spatial proximity:
, where
is combined with the final hidden state of the current trajectory from the LSTM. The inner product with the historical trajectory
is then weighted by
, such that trajectories with centroids closer to the current location contribute larger weights, while more distant trajectories contribute smaller weights. Summing across all adjusted historical trajectories yields
, which encodes spatial similarity, geographic distance, and temporal interval information.
In addition to trajectory-based training, LSTPM introduces the Geo-dilated LSTM, which aims to simulate realistic user mobility patterns. Unlike the conventional LSTM that strictly follows the sequential order of trajectories, the Geo-dilated LSTM reconstructs the input sequence by starting from the initial location and selecting the geographically nearest next location, thereby capturing distance-aware movement dynamics. For example, a standard LSTM processes the input sequence , while the Geo-dilated LSTM reconstructs the sequence based on nearest neighbors, generating samples such as . This mechanism reflects the natural tendency of users to visit geographically proximate locations and enables the model to capture more realistic spatial movement patterns. The final hidden representation is computed as in Equation (16), where denotes the input coordinate of the final location and denotes the hidden state of the previous layer.
The geo-dilated hidden state is then averaged with to form . Finally, the long-term spatiotemporal preference and the short term preference are concatenated and used for next-location prediction. This modeling strategy integrates both long- and short-term spatiotemporal preferences while incorporating a geo-dilated mechanism to enhance predictive performance. Although improvements are observed, several limitations remain. First, the absence of explicit location features reduces the ability of raw coordinates to capture semantic characteristics of places. Second, the Geo-dilated LSTM may distort temporal order when simulating mobility patterns. Third, the non-local operation compares trajectory similarity in a relatively simplistic manner without leveraging learnable mechanisms or multimodal semantics, thereby limiting the model’s representational power. To address these issues, this study proposes further enhancements aimed at improving the model’s adaptability to real-world contexts and its predictive accuracy.
Nevertheless, several limitations remain. First, the geo-nonlocal operation relies exclusively on geographic distance, which may not fully capture semantic or contextual distinctions between locations. Second, the framework does not explicitly incorporate location categories, user intent, or temporal semantic factors that could provide additional behavioral insights. Finally, the reliance on pairwise distance weighting may overemphasize nearby POIs while underestimating long-term patterns or cross-regional mobility. These limitations suggest that future extensions of LSTPM could benefit from integrating semantic embeddings, context-aware attention mechanisms, and heterogeneous graph structures to more comprehensively model spatiotemporal user behavior.
2.3. Graph Long-Term and Short-Term Preference (GLSP)
Jinbo and Yunliang et al. [
10] proposed the Graph Long- and Short-Term Preference (GLSP) model to address the limitations of existing approaches in jointly capturing spatial dependencies and temporal dynamics in user trajectory data. GLSP is a hybrid framework that integrates Graph Convolutional Networks (GCNs) [
7] with Long Short-Term Memory networks (LSTMs) [
1]. The GCN module is responsible for learning spatial and structural relationships among locations, thereby enhancing the model’s ability to capture graph-based contextual information. In parallel, the LSTM module focuses on modeling sequential dependencies, effectively representing both long- and short-term user preferences.
For LSTM module, user trajectories comprising both historical and current sequences are first encoded using discrete location identifiers. These identifiers are then transformed into continuous spatial vectors through an embedding layer, enabling the model to capture semantic information in a dense representation. The embedded sequences are input into the LSTM, where hidden states are modulated by a temporal decay mechanism designed to emphasize recent behaviors. Specifically, an exponential decay function assigns lower weights to distant historical events and higher weights to temporally proximate ones, thereby forming the long-term preference representation. For current trajectories, embeddings are processed in a similar manner, with the final hidden state representing the user’s short-term preferences. In parallel, the GCN module constructs a user-specific graph based on spatial relationships among locations. Node updates are performed using the adjacency matrix, degree matrix, and node features, where adjacency encodes spatial connectivity and a learnable projection matrix refines feature representations. Finally, flattening and pooling operations are applied to generate the graph-based preference representation.
The outputs from the two modules are subsequently integrated. Specifically, the graph-based representation is first combined with the long-term preference vector, after which both long- and short-term preferences are fused using a weighting factor to obtain the final user preference representation. This representation is then passed through a fully connected layer and a Softmax classifier to generate the prediction results. GLSP [
10] marks a significant advancement by jointly modeling spatial dependencies via GCN and temporal dynamics through LSTM, further enhanced by the incorporation of location embeddings and a temporal decay mechanism. By combining these two complementary components, GLSP leverages the strengths of graph-based representation learning and temporal sequence modeling, leading to improved performance in trajectory-based recommendation and prediction tasks compared with methods that rely solely on either spatial or temporal modeling.
Despite these strengths, certain limitations persist. The exponential decay function, while effective in emphasizing recent interactions, does not account for semantic distinctions across temporal contexts (e.g., “morning vs. evening” or “weekday vs. weekend”), may overly suppress the influence of long-term behaviors, and cannot fully capture habitual patterns solely on the basis of temporal proximity. Furthermore, the current framework omits location category features, which could enrich the graph representation if both node and edge attributes were incorporated into the GCN. These considerations highlight the need for further refinement to fully harness the potential of spatiotemporal user modeling.
3. Our Approach
Building on previous approaches to location prediction and geographic information modeling [
5,
8,
9,
10,
11], recent studies have demonstrated the effectiveness of embedding static geographic data and modeling sequential patterns, thereby confirming the feasibility of location prediction within sequential frameworks. Despite these advances, several challenges persist. One major limitation is the insufficient ability of existing methods to address data sparsity. In practical scenarios, users’ historical trajectories are often unevenly distributed and highly imbalanced, making it difficult for models to capture the distinctive characteristics of long-tail locations. Furthermore, current methods show limited capacity for integrating heterogeneous data sources, as they rarely incorporate location categories, spatial coordinates, and temporal dynamics in a unified manner. This lack of joint modeling reduces their representational power and ultimately constrains both prediction generalization and accuracy.
To address the aforementioned limitations, we propose a dual-branch architecture, termed STLGNet, which jointly leverages Graph Convolutional Networks (GCN) [
7] and Long Short-Term Memory networks (LSTM) [
1]. The overall framework, depicted in
Figure 1, is organized into four principal stages: (1) data preprocessing, (2) graph construction and feature design, (3) the GCN–LSTM integration module, and (4) the final prediction layer. With respect to data inputs, three primary sources are incorporated: users’ historical trajectories (History Trajectory), category encodings of historical locations (Category ID), and current trajectories (Current Trajectory). To capture both the uniqueness and semantic characteristics of locations, location and category information are transformed via dedicated embedding layers. Specifically, venue identifiers (Venue IDs) are projected into low-dimensional embedding vectors to preserve identification features, while location categories are converted into semantic vectors to reinforce categorical semantics.
Subsequently, users’ historical trajectories are formalized as personalized trajectory graphs. In these graphs, nodes represent the locations visited within individual historical trajectories, whereas edges capture the co-occurrence of locations within the same trajectory. For node feature representation, location embeddings are combined with category embeddings, enabling each node to simultaneously preserve identification characteristics and semantic attributes. This design is particularly advantageous for addressing cold-start scenarios and mitigating issues arising from sparse samples. Edge features are defined as the Euclidean distance between adjacent locations, thereby encoding spatial proximity. Such modeling not only enhances semantic associations but also reinforces behavioral dependencies between frequently co-occurring and geographically proximate locations.
To further capture temporal variability in user behavior, we incorporate a trajectory similarity based temporal decay mechanism. This mechanism computes both the temporal distance and the semantic similarity between historical locations and the current context, and subsequently applies dynamic weights to historical records. In doing so, it emphasizes the influence of recent and semantically relevant trajectories, thereby enabling the model to more effectively characterize evolving user preferences. For graph convolutional processing, we adopt a GCN framework that employs a message-passing strategy to aggregate feature information from nodes and their neighboring nodes. Furthermore, we extend the conventional update function by integrating edge features—specifically, geographic distances—as weighting factors. This enhancement ensures that geographically proximate locations exert stronger influence during the aggregation process, thereby improving the model’s capacity to capture spatial dependencies and user mobility patterns.
In parallel, the LSTM module is employed to capture temporal sequence information. Historical and current trajectories are separately input into the LSTM to model long-term behavioral dependencies and short-term activity patterns, thereby preserving sequential structures and reflecting latent dynamics of user interests. The graph embeddings generated by the GCN are subsequently fused with the sequential preference vectors derived from the LSTM. This joint representation is then passed through a fully connected layer to generate the final prediction of the user’s next location.
Overall, the proposed STLGNet integrates the spatial–semantic representational power of graph structures with the temporal modeling capacity of recurrent networks. By incorporating multiple sources of information—including location categories, geographic distances, and temporal decay—alongside carefully designed node and edge features and an enhanced graph update function, STLGNet effectively mitigates the challenges of data sparsity and semantic degradation inherent in existing methods. As a result, the model achieves improved predictive accuracy and enhanced generalizability in location prediction tasks.
3.1. Geographic Data Processing
To ensure that the model can effectively process and learn the semantic information embedded in locations and their associated categories, this article introduces an embedding transformation mechanism during the data preprocessing stage. The purpose of this mechanism is to convert the original discrete encodings into continuous representations suitable for neural network learning. In the raw trajectory data, each visited location is represented by a unique Venue ID and is linked to one or more predefined Category IDs. These attributes are typical discrete features. However, directly inputting such discrete values (e.g., a Venue ID of 1347 or a Category ID of 56) into a deep learning model prevents the model from capturing semantic relationships. This is because discrete encodings lack inherent numerical continuity and semantic order; the distances and angles between such encodings are mathematically meaningless and cannot be optimized effectively through gradient descent.
To overcome this limitation, the proposed model incorporates two embedding layers: a Venue Embedding Layer and a Category Embedding Layer. The Venue Embedding Layer maps each Venue ID into a fixed-dimensional continuous vector, thereby capturing the latent semantic features of the location. Similarly, the Category Embedding Layer maps each category into a continuous vector, which is iteratively updated during training to reflect both semantic distinctions and functional similarities among categories. For example, if “restaurant” and “fast-food outlet” frequently co-occur in user trajectories, their embeddings will likely be located closer to each other in the latent vector space, whereas their distance from categories such as “airport” or “hospital” will be relatively large.
This embedding transformation offers several advantages. First, it enhances the model’s ability to capture latent semantic structures, allowing it to identify clustering patterns and functional similarities among locations. Second, the trainability of the model is improved, since embeddings are continuous parameters that can be directly optimized through backpropagation, thereby facilitating the integration of location and category information into the overall learning process. Third, the approach strengthens the model’s capability to handle cold-start problems; when dealing with rarely visited locations or trajectories of new users, category embeddings can provide semantic compensation and mitigate data sparsity issues.
For embedding layer, the final feature vector of each node is defined as the sum of its Venue Embedding and Category Embedding. This design not only enhances the semantic representational capacity of individual nodes but also ensures that message passing between nodes is semantically informed. Consequently, the performance of GCN can be improved when processing sparse graph structures. The embedded feature vectors obtained through this transformation is subsequently used as inputs for GCN and LSTM, enabling more effective graph-structured learning and behavioral preference modeling.
3.2. User Graph Embedding Vector Generation
In the above section, we obtained both the location embedding vectors and the corresponding category embeddings. To fully integrate the semantic information of each location with that of its category, we directly sum the two vectors to produce a composite embedding that simultaneously represents location-specific and category-related characteristics. This design enhances the representational power of each node, enabling the model to better capture the relationship between locations and their associated categories.
Next, a graph structure is constructed based on the connections among locations within user trajectory data. If two locations appear as adjacent points in a user’s historical trajectory, an edge is established between the corresponding nodes in the graph. To more precisely capture spatial relationships, the Euclidean distance between connected nodes is calculated and used as an edge feature, reflecting their geographic proximity. This calculation is defined in Equation (8), where
indicates whether nodes
i and
j are connected.
The updated process of the graph is shown in Equation (9). Specifically, the feature representation of node
i at layer
, denoted
, is updated as the weighted average of its neighboring nodes’ features. Here,
represents the feature of node
i at layer
l,
is the distance between nodes,
is a small constant introduced to prevent division by zero and ensure computational stability, and
denotes the set of neighbors of node
i. The graph convolution operation assigns higher weights to closer neighbors by using the reciprocal of the distance as the weighting coefficient, thereby aligning with the intuition of spatial proximity in geographic contexts. Through multiple layers of graph convolution updates, node features gradually integrate information from their neighbors, allowing the model to capture semantic characteristics inherent in the overall graph structure.
For illustrative purposes,
Table 1 presents an example. Assume that the initial node features and adjacency relationships are given, with
set to 0.1. Taking node 0 as an example, its initial feature vector is
= [−1.1, 0.3, −0.1]. Node 0 has two neighbors: node 1 and node 2, and the distances with node 0 are 3 and 2, respectively. First, we compute the weight coefficients for the neighbors using
, yielding 0.3226 for node 1 and 0.4762 for node 2. These coefficients are then multiplied by node 0’s feature vector, producing vectors v1 = [−0.355,0.097,−0.032] and v2 = [−0.524,0.143,−0.048]. Summing and averaging these results yields node 0’s updated feature representation at the next layer:
= [−0.439,0.119,−0.040]. Using this updated formula, the next-layer representation of every node can be systematically calculated. These node features, processed through message passing and weighted integration, encompass rich spatial relationships and categorical semantics, providing a solid foundation for subsequent user representation construction.
After updating the graph structures for all users, we generate the overall user graph embedding vector . To achieve this, node features are further updated using both the conventional GCN operations and the novel graph update equation (9) proposed in this article. Following node-level updates, convolutional and pooling layers are applied, after which a fully connected layer integrates the entire graph information into a fixed-dimensional user graph embedding vector . This vector comprehensively reflects the spatial behavioral patterns and location relationships formed by the user and serves as a critical component for integrating multi-source information within the model.
3.3. User Preference Generation
In the model design, the user’s historical trajectory data is processed using a basic LSTM [
1] architecture, augmented with a time-decay mechanism adapted from LSTPM [
9]. The goal of this design is to enhance the model’s ability to identify and adjust the importance of different temporal segments within the historical trajectories. By emphasizing recent behaviors while attenuating the influence of older trajectories, the model can more effectively capture the evolution of user preferences. Specifically, the input to the LSTM consists of user trajectory sequences represented by location encodings. Since these encodings are inherently discrete variables, an embedding layer is first applied to map the discrete location IDs into a continuous vector space. This mapping ensures that semantically similar or related locations are positioned closer to each other in the embedding space, which facilitates the LSTM’s ability to learn temporal dependencies.
To capture temporal variations in user behavior, one week is divided into 48 equally sized time slots: 24 slots for weekdays (Monday–Friday) and 24 slots for weekends (Saturday–Sunday). Each historical trajectory entry is then aligned with its corresponding time slot based on the timestamp. This segmentation enables the model to capture the temporal distribution of user activities. To integrate temporal relevance between past and current behaviors, the model computes the similarity between each historical time slot and the current trajectory, yielding a weight for each time interval. These time-decay weights dynamically adjust the influence of different segments of historical trajectories on the user’s overall preference representation.
For each historical trajectory, the hidden states of all locations generated by the LSTM are multiplied by their corresponding temporal weights, summed together, and then normalized by the trajectory length to obtain an aggregated semantic representation
, as shown in Equation (4). This representation method simultaneously considers both the sequential information of locations and the temporal weights, thereby facilitating the extraction of the most representative semantic features from historical trajectories. Building on this, the semantic representations
of all historical trajectories belonging to the same user are averaged to generate the user’s long-term preference vector
, as defined in Equation (10). This long-term preference vector effectively reflects the user’s stable and representative location preferences and spatial behavior patterns from past activities, offering greater stability and interpretability.
For the current trajectory
, the model applies a similar LSTM processing procedure. The sequence of locations in the current trajectory is first mapped into the embedding space, then fed into the LSTM. The hidden outputs
of all locations in the sequence are summed and normalized by the trajectory length, resulting in the user’s short-term preference representation
, as shown in Equation (11). This short-term preference effectively reflects the user’s immediate interests and behavioral dynamics, thereby providing highly time-sensitive and influential cues for the prediction of the next location.
3.4. STLGNet Model Prediction
We employed a Graph Convolutional Network (GCN) to obtain the user’s graph-structured representation vector, denoted as
. This vector incorporates structural information regarding locations and their categories, thereby reflecting both the spatial features and the relationships between users and their visited locations. In
Section 3.3, we utilized LSTM to separately generate the user’s long-term preference vector
and short-term preference vector
. The long-term preference captures the user’s stable behavioral patterns from the past, whereas the short-term preference emphasizes the user’s current dynamic interests.
To fully integrate these multi-source signals, we first combine the graph structure vector
with the long-term preference vector
, using a simple averaging strategy to derive a comprehensive long-term preference vector
, as shown in Equation (12). This vector encapsulates both the long-term behavioral preferences and the spatial correlations revealed by the graph structure, thus offering a more complete and stable representation of user preferences.
Next, in consideration of the immediate impact of short-term behaviors on next-location prediction, we combine the comprehensive long-term preference vector
with the short-term preference vector
using a weighted scheme parameterized by
, as formulated in Equation (13), in which
is a hyperparameter controlling the relative importance of short-term and long-term preferences in the final representation. A larger value of
indicates that the model places greater emphasis on the user’s current behavior, while a smaller value prioritizes the user’s stable long-term preferences.
The resulting user representation vector thus integrates graph-structural information, long-term stable preferences, and short-term dynamic preferences, making it the most comprehensive and expressive semantic vector of the user in the given context. Finally, this vector is fed into a fully connected layer, where a Softmax activation function is applied to transform the output into a probability distribution over all candidate locations, thereby completing the prediction task of identifying the most likely next visited location.
For training the model, we adopt the Categorical Crossentropy loss function, which is commonly used in multi-class classification tasks. Its primary role is to measure the discrepancy between the predicted probability distribution and the actual label distribution. A smaller loss value indicates that the model’s prediction is closer to the true label, thus yielding more accurate results. The mathematical representation of categorical cross-entropy is given in Equation (14). Let
denote the number of classes (i.e., the total number of candidate locations),
represent the true encoding of the
i-th location in one-hot form, and
denote the probability predicted by the model for that class. For each sample, the formula calculates the loss by taking the negative logarithm of the predicted probability
corresponding to the actual class. This implies that the higher the model’s confidence in the correct location, the smaller the loss. Conversely, if the model assigns excessive confidence to an incorrect class, the loss increases significantly.
Table 2 provides an illustrative example. Suppose there are three locations: A, B, and C, with one-hot encodings A = [1, 0, 0], B = [0, 1, 0], and C = [0, 0, 1]. For Sample 1, the predicted probabilities are A = 0.8, B = 0.1, and C = 0.1, resulting in a loss of approximately 0.223. For Sample 2, the loss is about 0.3577, and for Sample 3, the loss is about 0.5108.
To calculate the overall loss, the losses of all N samples are summed and then averaged, as shown in Equation (15). This yields the mean model loss after computing the loss for each individual sample.