LaneLM uses neither complex tricks nor a sophisticated structure. First, we tokenize the float keypoints and convert them into word embeddings. Then, we use a visual encoder to transform the visual information into visual embeddings. Next, a language decoder (shown in Figure 4) is designed for prompting. We simply use cross-attention layers to condition the language model [35] on visual inputs.
3.1. Lane Representation
A lane is a sequence of xy coordinates represented as tuples of float numbers. First, we quantize the float numbers in these tuples into integer tokens. Then, we integrate the x- and y-coordinate information into a single embedding, transforming it into a contextual representation that can be processed by a transformer decoder.
Lane tokenization. Following [6, 8], we represent lanes with equally spaced 2D points. Each lane L is expressed as a sequence of keypoints, i.e.
$$L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\} \quad (1)$$
in which the y coordinate is sampled at equal vertical intervals over the image height H, one sample per regression time step t. Inspired by the row-wise methods mentioned above, we convert the continuous keypoints of a lane into discrete counterparts. We rasterize the H pixel rows into an integer space of size T, which is also the vocabulary size for the y tokens. Specifically, the keypoint at time step t is composed of two tokens, one for $x_t$ and one for $y_t$. For each column, $x_t$ is quantized to an integer token drawn from a separate x vocabulary.
Figure 2 shows the row-wise location for a lane. We perform a classification over the x vocabulary in each row and a T-class classification at time step t in each column.
Note that 0 and T are the padding tokens for the x and y coordinates, respectively (i.e. if there is no lane at time step t, the x token is set to 0 and the y token to T). Here, T also refers to the number of rollouts. We then use the quantized values to index a learnable vocabulary and obtain their token counterparts, which allows the model to depict the location of keypoints.
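The quantization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the bin layout (x tokens 1 to `x_vocab - 1` for column bins, y tokens 0 to `num_steps - 1` for rows) and the bin-centre de-quantization are assumptions. Only the padding convention (0 for x, T for y) comes from the text.

```python
def quantize_lane(points, img_w, img_h, x_vocab, num_steps):
    """Map float (x, y) keypoints to integer (x_token, y_token) tuples.

    Assumed layout: y tokens 0..num_steps-1 index equally spaced rows,
    x tokens 1..x_vocab-1 index column bins; (0, num_steps) is padding.
    """
    pad_x, pad_y = 0, num_steps
    tokens = [(pad_x, pad_y)] * num_steps      # start fully padded
    row_h = img_h / num_steps
    for x, y in points:
        if not (0 <= x < img_w and 0 <= y < img_h):
            continue                           # outside the image: stay padded
        t = min(int(y // row_h), num_steps - 1)
        x_tok = min(1 + int(x / img_w * (x_vocab - 1)), x_vocab - 1)
        tokens[t] = (x_tok, t)
    return tokens


def dequantize_lane(tokens, img_w, img_h, x_vocab, num_steps):
    """Map tokens back to bin-centre float coordinates, skipping padding."""
    row_h = img_h / num_steps
    col_w = img_w / (x_vocab - 1)
    return [((x_tok - 0.5) * col_w, (y_tok + 0.5) * row_h)
            for x_tok, y_tok in tokens
            if x_tok != 0 and y_tok != num_steps]


lane_tokens = quantize_lane([(100.0, 50.0)], img_w=800, img_h=320,
                            x_vocab=201, num_steps=40)
```

The round trip loses at most half a bin in each direction, which is the price of turning regression into classification.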
Keypoint token encoding. Given the token tuple of the i-th lane at time step t, we use two simple embedding layers, one for the x token and one for the y token, to obtain the token embeddings, where D is the size of each vector. Each integer tuple is therefore mapped to a high-dimensional vector.
$$e_t = \mathrm{Emb}_x(x_t) + \mathrm{Emb}_y(y_t) + \mathrm{PE}(t) \quad (2)$$
where $\mathrm{Emb}_x$ and $\mathrm{Emb}_y$ are the two embedding layers and $\mathrm{PE}$ is the positional encoding used in language modeling [35], which provides the transformer with information about the position of each token in the sequence. Eq. 2 encodes the x-coordinate and y-coordinate of the tuple $(x_t, y_t)$ into embeddings, which are fused through addition, followed by the incorporation of positional information. The sequence composed of these embeddings, conditioned on visual features, is subsequently processed by a language decoder. In the output layer, each embedding goes through a linear head that predicts the next token by classification:
Therefore, we embed an integer-tuple lane sequence L (see Eq. 1) into its contextual representation.
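As a rough illustration of the fusion described for Eq. 2 (x embedding plus y embedding plus positional information), the sketch below uses plain Python lists as embedding tables. The table sizes, the random initialisation and the sinusoidal positional encoding are assumptions made for brevity; the text only says that the positional encoding follows [35].

```python
import math
import random


def make_table(vocab_size, dim, rng):
    """A stand-in for a learnable embedding layer: one D-dim row per token id."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def positional_encoding(t, dim):
    """Sinusoidal position vector (an assumed choice, used here for simplicity)."""
    pe = []
    for k in range(dim):
        angle = t / (10000 ** (2 * (k // 2) / dim))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe


def embed_lane(tokens, x_table, y_table):
    """Fuse x and y token embeddings by addition, then add position info."""
    dim = len(x_table[0])
    sequence = []
    for t, (x_tok, y_tok) in enumerate(tokens):
        pe = positional_encoding(t, dim)
        sequence.append([a + b + c for a, b, c
                         in zip(x_table[x_tok], y_table[y_tok], pe)])
    return sequence


rng = random.Random(0)
x_table = make_table(201, 8, rng)   # hypothetical x vocabulary of 201 tokens
y_table = make_table(41, 8, rng)    # T = 40 rows plus the padding token T
embeddings = embed_lane([(26, 0), (27, 1), (0, 40)], x_table, y_table)
```

Because the two embeddings are summed rather than concatenated, every position carries both coordinates in a single D-dimensional vector, which is what later allows two heads to read from the same hidden state.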
3.2. Probabilistic Rollouts
Probability factorization. Following [25], let
represent the target keypoints for the
n-th lane at time
t. We cast LD as a next-token prediction task, formulated as a conditional probability
, where
represents the set of target keypoints for all lanes at time
t and
is the visual context, i.e. the RGB image input from the camera. Thus, we factorize the distribution over future sequences as a product of conditionals:
Similar to [25], Eq. 6 represents the fact that we treat each lane as conditionally independent for its keypoint at time t, given the previous keypoints and visual context.
n-lane parallelism at time step t. In this formulation, we duplicate the visual context n times, once for each lane in an image. In our implementation, the duplication is performed along the batch dimension so that the per-lane conditionals are computed in parallel; this is valid because they are assumed to be conditionally independent, as described in Eq. 6.
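The factorization and the batch-wise duplication can be illustrated with a small sketch. It assumes that each lane's per-step conditional probability is already available as a number; `joint_log_prob` and `duplicate_context` are hypothetical helper names, not part of the paper.

```python
import math


def joint_log_prob(per_lane_step_probs):
    """Log of the product over time steps and lanes of the conditionals.

    per_lane_step_probs[n][t] is the probability assigned to the keypoint
    of lane n at step t, given the previous keypoints and the image.
    Conditional independence across lanes turns the joint into a product,
    hence a sum of logs.
    """
    return sum(math.log(p) for lane in per_lane_step_probs for p in lane)


def duplicate_context(visual_tokens, num_lanes):
    """Repeat the same visual sequence once per lane (the batch dimension)."""
    return [visual_tokens for _ in range(num_lanes)]


visual = [[0.1, 0.2], [0.3, 0.4]]            # toy visual token sequence
batch = duplicate_context(visual, num_lanes=3)
log_p = joint_log_prob([[0.5, 0.5], [0.25]])  # two lanes, toy probabilities
```

Each row of `batch` can then be decoded independently, so n lanes cost one batched forward pass per time step rather than n sequential ones.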
3.3. Network Architecture
In Section 3.1, we obtained the embedding sequence for the keypoint modality. Similarly, we first encode the image into an embedding sequence using a visual encoder. Then, the information from both the visual modality and the keypoint modality is fused within the autoregressive (AR) language decoder. Finally, each output embedding is used by the classification head to predict the next token id. We use an encoder-decoder structure for learning and inference. Such a network architecture is widely used in vision-language models [28, 29, 36] and language models [37].
Visual encoder. Our visual encoder extracts visual features from images and then transforms them into embedding sequences. The pyramid feature extractor f shown in Figure 3 adopts the standard ResNet [38] and DLA [39] as visual feature extractors. We add an FPN [40] neck to provide integrated multi-scale features:
Each level feature extracted from the CNN and FPN [40] is described by its number of channels, its height and its width at level i. This structure leverages the ConvNet's pyramidal feature hierarchy and has demonstrated its efficiency in [6, 13, 16, 41]. We then split each level feature into fixed-size patches, linearly embed each of them, and add position embeddings:
where the level embedding encodes the level information into vectors, the patch embedding is the one used in ViT [42], and the standard positional encoding layer retains the positional information of each patch. The result is the token sequence extracted from the i-th level feature, whose length is the number of patches; each patch is linearly embedded into a D-dimensional visual embedding to align with the keypoint embeddings.
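The patch-and-embed step can be sketched as follows for a single feature level. This is an illustrative toy under stated assumptions: the feature map is a nested C x H x W list, the projection weight, level vector and position table are hypothetical stand-ins for learned parameters, and the patch size is chosen to divide the feature map exactly.

```python
def patchify(feature, patch):
    """Split a C x H x W feature map into flattened, row-major patches."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([feature[c][r][col]
                            for c in range(channels)
                            for r in range(top, top + patch)
                            for col in range(left, left + patch)])
    return patches


def embed_patches(patches, weight, level_vec, pos_table):
    """Linearly embed each patch to D dims, then add level and position vectors."""
    tokens = []
    for n, flat in enumerate(patches):
        projected = [sum(w * v for w, v in zip(row, flat)) for row in weight]
        tokens.append([a + b + c for a, b, c
                       in zip(projected, level_vec, pos_table[n])])
    return tokens


# Toy 1 x 4 x 4 feature map, 2 x 2 patches, D = 2.
feature = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
patches = patchify(feature, patch=2)
visual_tokens = embed_patches(patches,
                              weight=[[1, 0, 0, 0], [0, 0, 0, 1]],
                              level_vec=[10, 20],
                              pos_table=[[0, 0]] * len(patches))
```

The number of visual tokens per level is (H / patch) x (W / patch), and every token has the same width D as a keypoint embedding, so the decoder can attend to them directly.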
ViTAdapter [43] proposes a pretraining-free adapter and achieves SOTA results in dense prediction, and a growing number of vision-language models [28, 29] use a frozen pretrained visual encoder. Given this, and given how well spatial features have been exploited in LD, there is good reason to believe that we can directly adopt a frozen pretrained backbone and reach comparable performance.
Language decoder. Our language decoder causally models the keypoint sequence and conditions each embedding e on the visual sequences. As shown in Figure 4, we use a transformer decoder for target sequence generation. This decoder consists of 3 layers of LaneLM blocks. In every LaneLM block, we insert new cross-attention layers, trained from scratch, between the OPT [35] causal attention blocks. The visual feature sequence, a sequence of flattened 2D patch embeddings, serves as keys and values in the cross-attention layer of block i, while the queries are derived from keypoint prompts:
Here, the hidden state output from block i serves as the input of the next block; in particular, the input to the first block is the original keypoint embedding sequence. As in OPT [35], the keypoint input goes through the word embedding layer and has position embeddings added before it is fed into block i.
Our LaneLM layer avoids information leakage. Cross-attention ensures that the output at a given position in a sequence is based only on the known query tokens at previous positions and not on future positions. In addition, each query embedding is conditioned on the whole visual information. Theorem A1 demonstrates the feasibility of the network architecture.
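The no-leak property can be checked on a toy version of the block. The sketch below is a deliberately stripped-down assumption: single-head attention with no learned projections, residual connections, normalisation or feed-forward layers. It only shows that causal self-attention followed by unmasked cross-attention keeps position i independent of later keypoints.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, causal):
    """Scaled dot-product attention; with causal=True, query i sees keys[:i+1]."""
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys[:limit]]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def toy_lanelm_block(keypoint_emb, visual_emb):
    """Causal self-attention over keypoints, then cross-attention to the image."""
    hidden = attend(keypoint_emb, keypoint_emb, keypoint_emb, causal=True)
    return attend(hidden, visual_emb, visual_emb, causal=False)


visual = [[1.0, 2.0], [3.0, 4.0]]
lane_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lane_b = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]   # only the last keypoint differs
out_a = toy_lanelm_block(lane_a, visual)
out_b = toy_lanelm_block(lane_b, visual)
```

Changing the third keypoint leaves the first two outputs untouched, while every output still sees all visual tokens, which matches the behaviour described above.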
Decoupled head. In the training phase, every embedding output by the language decoder (except the last embedding of each sequence) goes through the x head and the y head to predict the next token ids of x and y, as illustrated in Eq. 3. Considering that predicting the ordered y sequence is a simple task, we use two decoupled classification heads, the x head and the y head, to predict the next token of x and y, respectively, instead of predicting a single joint token for each keypoint as in [25]. Such a decoupled prediction strategy greatly reduces the complexity of classification, from the order of the product of the two vocabulary sizes to the order of their sum. To further reduce computational overhead, the two heads share the same hidden-state input, because each embedding contains the information of both the x-coordinate and the y-coordinate (see Eq. 2).
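To make the saving concrete, the sketch below compares the number of output classes for a joint head and for two decoupled heads, and shows both heads reading the same hidden state. The vocabulary sizes (201 and 41) and the weight matrices are hypothetical values chosen for illustration only.

```python
def joint_head_classes(x_vocab, y_vocab):
    """One class per (x, y) pair."""
    return x_vocab * y_vocab


def decoupled_head_classes(x_vocab, y_vocab):
    """One x head plus one y head."""
    return x_vocab + y_vocab


def linear(hidden, weight):
    """Logits for one hidden state; weight has one row per class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def predict_next(hidden, x_weight, y_weight):
    """Both heads share the same hidden state and take an argmax each."""
    x_logits = linear(hidden, x_weight)
    y_logits = linear(hidden, y_weight)
    return x_logits.index(max(x_logits)), y_logits.index(max(y_logits))


n_joint = joint_head_classes(201, 41)          # 8241 classes
n_decoupled = decoupled_head_classes(201, 41)  # 242 classes
next_xy = predict_next([1.0, 2.0],
                       x_weight=[[1, 0], [0, 1], [-1, -1]],
                       y_weight=[[0, 0], [1, 1]])
```

With these toy sizes the decoupled heads need roughly 34 times fewer output classes than a joint head.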
Figure 4. To condition the LM on visual inputs, we insert cross-attention layers between causal attention layers. The keys and values are obtained from the visual features, while the queries are derived from keypoint prompts.
3.4. Training and Inference
Visual question answering. If the detector already knows where the lanes are, we only need to teach it to read them out, so that it can refine the keypoint locations in this context. For instance, during training, each image is pre-annotated by other lane detection models. The output labels are converted into pseudo labels using the method described in Section 3.1. These pseudo labels are then used as lane priors and fed into the language decoder as queries to extract spatial information from the keys and values in its cross-attention modules. In this way, the model gains an approximate understanding of the lane locations. Our goal is to refine these locations, or simply to read them out if the pseudo labels are accurate enough. Moreover, these pseudo labels can be regarded as questions in VQA, while their corresponding ground truth serves as the answers. In this way, the LD task is transformed into a VQA task.
Self-supervised labels. Following [28, 29], we design a multi-turn conversation between LaneLM and a pretrained teacher. For an input image, we use the pretrained detector [6], which provides pseudo keypoint labels as initial queries for LaneLM. We then generate multi-turn conversation data
, where
N, the maximum number of turns, is also the maximum number of lanes in each image, and
is the first lane of ground truth in the image. We organize them as a sequence by treating ground truth as the answer of pseudo label. Therefore, the multi-modal self-supervised label
S for image
can be expressed as follows:
where ∘ denotes the concatenation of two sequences. We adopt bipartite matching to find the assignment that minimizes the distance between the start points of the query sequence
and the answer
, and then take the matched pair
as the self-supervised label for each lane.
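The start-point matching can be sketched with a brute-force search over assignments, which is adequate for the handful of lanes in one image. This is an assumption-laden illustration: the Euclidean distance, the helper name `match_by_start_point` and the exhaustive search are choices made here; a real implementation would more likely use an assignment solver such as SciPy's `linear_sum_assignment`.

```python
import math
from itertools import permutations


def match_by_start_point(queries, answers):
    """Pair each pseudo-label lane with a ground-truth lane.

    queries, answers: lists of lanes, each lane a list of (x, y) points
    whose first element is the start point. Returns (query_idx, answer_idx)
    pairs that minimise the total start-point distance.
    """
    assert len(queries) <= len(answers)

    def cost(q_idx, a_idx):
        return math.dist(queries[q_idx][0], answers[a_idx][0])

    best = min(permutations(range(len(answers)), len(queries)),
               key=lambda perm: sum(cost(q, a) for q, a in enumerate(perm)))
    return list(enumerate(best))


pseudo = [[(100, 300), (110, 280)], [(400, 300), (390, 280)]]
truth = [[(402, 300), (391, 280)], [(98, 300), (109, 280)]]
pairs = match_by_start_point(pseudo, truth)   # [(0, 1), (1, 0)]
```

Each matched pair then supplies one question (the pseudo label) and its answer (the ground truth) for the concatenated training sequence.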
Loss. We adopt only the standard loss used in decoder-only language models. We train our model by minimizing the negative log-likelihood of the keypoint tokens conditioned on the visual inputs, with a cross-entropy loss at each time step t:
where
T is the length of the sequence.
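A minimal sketch of this objective is given below. It assumes that the x and y cross-entropy terms are simply added and that the total is averaged over the T steps; the text states only that a standard cross-entropy loss is applied at each time step, so the weighting and the reduction are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (stable form)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def lane_nll(x_logits, y_logits, x_targets, y_targets):
    """Mean over time steps of the x and y next-token cross-entropies."""
    steps = len(x_targets)
    total = 0.0
    for t in range(steps):
        total += cross_entropy(x_logits[t], x_targets[t])
        total += cross_entropy(y_logits[t], y_targets[t])
    return total / steps


# One time step, uniform predictions over 2 x classes and 4 y classes.
loss = lane_nll([[0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]], [0], [1])
```

With uniform logits the loss equals log 2 + log 4 = log 8, the entropy of guessing both tokens at random, which is a useful sanity check when wiring up training.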
Inference. We sample tokens from the model likelihood
and
using the argmax sampling as illustrated in Eq.
3. As in language models, we apply standard greedy search with a fixed maximum length and an EOS (End of Sequence token) stop criterion to generate
tokens at the same time (i.e. we stop prediction when EOS token is predicted or the current sequence reaches the max length). After obtaining the discrete tokens, we de-quantize them to get continuous coordinates.
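The decoding loop can be sketched as follows. `step_fn` is a hypothetical stand-in for one forward pass of the decoder and heads, and letting the x head emit the EOS id is an assumption; the text does not say which head carries EOS.

```python
def greedy_decode(step_fn, prompt, max_len, eos):
    """Roll out (x, y) tokens greedily until EOS or max_len keypoints.

    step_fn(tokens) -> (x_logits, y_logits) for the next keypoint.
    """
    tokens = list(prompt)
    while len(tokens) < max_len:
        x_logits, y_logits = step_fn(tokens)
        x_tok = max(range(len(x_logits)), key=x_logits.__getitem__)  # argmax
        y_tok = max(range(len(y_logits)), key=y_logits.__getitem__)
        if x_tok == eos:
            break
        tokens.append((x_tok, y_tok))   # x and y are produced together
    return tokens


def toy_step(tokens):
    """Toy model: move one bin right and one row down, stop after 4 keypoints."""
    x_prev, y_prev = tokens[-1]
    x_logits = [0.0] * 10
    y_logits = [0.0] * 10
    if len(tokens) >= 4:
        x_logits[9] = 1.0               # id 9 plays the role of EOS
    else:
        x_logits[x_prev + 1] = 1.0
    y_logits[y_prev + 1] = 1.0
    return x_logits, y_logits


lane = greedy_decode(toy_step, prompt=[(1, 0), (2, 1)], max_len=8, eos=9)
```

The resulting token list would then be de-quantized back to pixel coordinates as the final step.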
To speed up inference, in addition to adopting the parallel strategy outlined in Section 3.2, we cache each level of the visual sequence in the decoder's corresponding cross-attention layer. Furthermore, we adopt a KV-cache strategy in each causal self-attention layer, at inference time only.
Prompting strategy. We use the following four prompting strategies. (1) A regression network provides the two initial keypoints for each lane, and LaneLM completes the remaining keypoints. The regression network (we use CLRNet [6]) only gives the start points of each lane rather than the whole lane, which is an easier task than LD. Keypoint-based methods [13, 14, 15] use a similar start-point regression; however, they struggle to design a keypoint decoding strategy, whereas we leverage the contextual representation of the lane and simply roll out the keypoints. (2) The pseudo labels from Section 3.4 are given as questions; our language decoder leverages its few-shot capability to generate the answers, refining the input or simply reading it out. (3) We simulate the annotation process performed by annotators by adding a certain level of noise to the ground truth (randomly shifting the x-coordinates by -5 to 5 pixels) to demonstrate LaneLM's capability for interaction. (4) We give some simple instruction tokens to LaneLM to accomplish some post-processing tasks.
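Strategy (3) can be mimicked with a few lines of code. The sketch assumes a uniform shift drawn independently for each keypoint and an optional clamp to the image width; the text gives only the -5 to 5 pixel range, so the distribution and the clamping are assumptions.

```python
import random


def jitter_prompt(lane, max_shift=5.0, img_w=None, rng=random):
    """Shift each x-coordinate by a random offset in [-max_shift, max_shift].

    lane: list of (x, y) float keypoints; y values are left untouched.
    """
    noisy = []
    for x, y in lane:
        x_new = x + rng.uniform(-max_shift, max_shift)
        if img_w is not None:
            x_new = min(max(x_new, 0.0), img_w - 1)   # keep inside the image
        noisy.append((x_new, y))
    return noisy


ground_truth = [(100.0, 300.0), (110.0, 280.0), (121.0, 260.0)]
noisy_prompt = jitter_prompt(ground_truth, rng=random.Random(0))
```

The noisy lane would then be tokenized as in Section 3.1 and passed to the decoder as the question, with the clean ground truth as the expected answer.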