3.3. Task 3: Assigning a Level of Certainty to the Prediction
The third task, which is the most important and the least addressed in the literature, is to assign a level of certainty to each prediction. For each input and phase, the 12 best models each provide a prediction. Because individual model errors arise from differing bias and variance, the models are unlikely to converge on the same incorrect prediction unless there is an inherent error in the data.
A certainty metric is defined based on model consensus. Given a set of predictions P for the same input, with median prediction m and an adjustable tolerance level t (typically at most 20%), the certainty level is the number of predictions p in P satisfying |p − m| ≤ t·m. Certainty can be expressed as either the number or the percentage of agreeing models; the count is preferred, as a percentage could be misinterpreted as the probability of a correct prediction.
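The consensus count described above can be sketched in a few lines. This is an illustrative implementation under the stated definition, not the authors' code; the function and argument names are assumptions.

```python
import statistics

def consensus_certainty(predictions, tolerance=0.05):
    """Count the models whose prediction lies within a relative
    tolerance of the median prediction (time values are positive,
    so the agreement band is |p - m| <= tolerance * m)."""
    m = statistics.median(predictions)
    return sum(1 for p in predictions if abs(p - m) <= tolerance * abs(m))

# 12 model predictions (seconds) for one input; the median is 20 s,
# so the 5% band is 20 +/- 1 s and 8 of the 12 models agree.
preds = [18, 19, 19, 20, 20, 20, 20, 21, 21, 22, 25, 30]
level = consensus_certainty(preds)  # -> 8
```

With 12 models the returned level ranges from 1 (only the median itself lies in the band) to 12 (unanimous agreement), matching the confidence scale used in the text.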
To test this formulation, a tolerance level of 5% is chosen and the count of models within this tolerance is evaluated. This yields 12 levels of confidence, from 1 to 12, where 1 indicates the least confidence (a minimum of 1 model agrees with the median within tolerance) and 12 indicates the highest confidence (all 12 models agree on the median within tolerance). The MAPE for each level of consensus is shown in Figure 8. Note that for each consensus level, the reported error covers all predictions with that level of consensus or higher; a consensus level of 1 therefore shows the overall MAPE, since every prediction in the dataset has a minimum consensus of 1 model.
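The cumulative aggregation behind Figure 8 ("this level of consensus or higher") can be mirrored as follows. This is a sketch, not the paper's implementation; the record layout and names are assumptions, and ground truths are taken as nonzero since MAPE divides by them.

```python
def mape_at_or_above(records, max_level=12):
    """For each consensus level L in 1..max_level, compute the MAPE
    over all records whose consensus level is >= L. Each record is a
    (median_prediction, ground_truth, consensus_level) tuple."""
    result = {}
    for level in range(1, max_level + 1):
        subset = [(m, y) for m, y, c in records if c >= level]
        if subset:
            # MAPE in percent over the cumulative subset
            result[level] = 100.0 * sum(abs(m - y) / y for m, y in subset) / len(subset)
    return result

# Two samples: a high-consensus exact hit and a low-consensus miss.
curve = mape_at_or_above([(10, 10, 12), (15, 10, 3)])
# curve[1] == 25.0 (both samples); curve[12] == 0.0 (only the hit remains)
```

As the level threshold rises, low-consensus (and typically higher-error) samples drop out of the subset, which is why the curves in Figure 8 fall monotonically toward the high-consensus MAPE.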
Figure 8 highlights the effectiveness of the consensus metric in distinguishing high-confidence from low-confidence predictions. Intersection 650058 has an overall MAPE of 16.04%, which drops to 3.59% when all 12 models reach consensus. At intersection 650060, the MAPE decreases by roughly 80%, from 16.83% to 3.2%, at the highest consensus score. For intersection 650063, the MAPE falls from 16.89% to 5.33% at a consensus score of 12, confirming hypothesis 2 from the methodology section. The consensus score thus quantifies prediction uncertainty before the true value is known. The approach is model-agnostic, applying to statistical and machine learning models alike, and can be combined with other certainty measures such as probabilistic predictions.
Figure 9 shows the error probability distribution for each consensus level from 1 to 12. The error is computed using the equation:

error = m − g

where m is the median of the model predictions and g is the ground truth. A positive error therefore indicates an overestimation and a negative error an underestimation of the traffic signal time to change. The distributions in Figure 9 cover errors between −10 seconds and +10 seconds for the 20-second prediction window. Errors greater than 10 seconds are distant outliers; their percentages for each level of consensus are shown in Table 8 for completeness. Errors are aggregated across all six intersections. All predictions are rounded to the nearest second and the ground truth is in seconds, so each bar in the histograms of Figure 9 corresponds to an exact integer error value. For example, the middle bar represents the percentage of the time the error is zero, i.e., the rounded prediction is exactly equal to the ground truth.
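The signed-error histogram described above can be reproduced with a short sketch. It is illustrative rather than the paper's code; positive errors are overestimates, per the sign convention in the text.

```python
from collections import Counter

def error_histogram(median_preds, ground_truths):
    """Signed-error distribution: error = round(median prediction)
    minus ground truth (both in seconds), so each error is an exact
    integer. Returns the fraction of samples at each error value."""
    errors = [round(m) - y for m, y in zip(median_preds, ground_truths)]
    n = len(errors)
    return {e: c / n for e, c in sorted(Counter(errors).items())}

# Four samples with errors 0, -1, +3, 0: the zero bar holds 50%.
hist = error_histogram([10.4, 9.0, 12.6, 10.0], [10, 10, 10, 10])
# -> {-1: 0.25, 0: 0.5, 3: 0.25}
```

The key of each entry is an integer error in seconds (the bar position) and the value is its relative frequency (the bar height).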
Table 4.
CNN-LSTM best model configurations
| Intersection | Model Rank | lr | batch size | neurons per layer | n layers lstm | n layers dense |
|---|---|---|---|---|---|---|
| 650058 | 1 | 0.00040 | 16 | 300 | 2 | 1 |
| | 2 | 0.00101 | 32 | 240 | 2 | 1 |
| | 3 | 0.00109 | 64 | 240 | 2 | 1 |
| 650060 | 1 | 0.00080 | 32 | 300 | 4 | 1 |
| | 2 | 0.00021 | 16 | 300 | 4 | 1 |
| | 3 | 0.00040 | 16 | 300 | 2 | 1 |
| 650063 | 1 | 0.00109 | 64 | 240 | 2 | 1 |
| | 2 | 0.00061 | 16 | 120 | 4 | 1 |
| | 3 | 0.00048 | 32 | 300 | 2 | 1 |
| 650064 | 1 | 0.00253 | 128 | 240 | 3 | 1 |
| | 2 | 0.00050 | 16 | 120 | 3 | 1 |
| | 3 | 0.00230 | 128 | 210 | 4 | 1 |
| 650065 | 1 | 0.00041 | 16 | 240 | 4 | 1 |
| | 2 | 0.00121 | 128 | 150 | 4 | 1 |
| | 3 | 0.00050 | 16 | 120 | 3 | 1 |
| 650075 | 1 | 0.00041 | 16 | 240 | 4 | 1 |
| | 2 | 0.00080 | 32 | 300 | 4 | 1 |
| | 3 | 0.00168 | 64 | 120 | 2 | 1 |
Table 5.
Transformer best model configurations
| Intersection | Model Rank | lr | batch size | embed dim (N) | n heads | n encoder layers | n layers dense |
|---|---|---|---|---|---|---|---|
| 650058 | 1 | 0.00020 | 32 | 300 | 1 | 2 | 2 |
| | 2 | 0.00016 | 32 | 240 | 3 | 1 | 4 |
| | 3 | 0.00014 | 32 | 210 | 2 | 2 | 2 |
| 650060 | 1 | 0.00018 | 16 | 60 | 3 | 2 | 2 |
| | 2 | 0.00046 | 64 | 90 | 3 | 2 | 4 |
| | 3 | 0.00042 | 32 | 60 | 5 | 2 | 2 |
| 650063 | 1 | 0.00067 | 64 | 30 | 5 | 4 | 2 |
| | 2 | 0.00080 | 128 | 60 | 5 | 1 | 2 |
| | 3 | 0.00018 | 16 | 60 | 3 | 2 | 2 |
| 650064 | 1 | 0.00014 | 32 | 210 | 2 | 2 | 2 |
| | 2 | 0.00020 | 32 | 300 | 1 | 2 | 2 |
| | 3 | 0.00046 | 64 | 120 | 1 | 1 | 2 |
| 650065 | 1 | 0.00075 | 128 | 150 | 1 | 1 | 2 |
| | 2 | 0.00041 | 128 | 120 | 3 | 1 | 2 |
| | 3 | 0.00024 | 32 | 210 | 1 | 1 | 4 |
| 650075 | 1 | 0.00029 | 32 | 120 | 1 | 1 | 4 |
| | 2 | 0.00020 | 32 | 300 | 1 | 2 | 2 |
| | 3 | 0.00018 | 16 | 60 | 3 | 2 | 2 |
Figure 8.
MAPE as a function of the level of consensus with a 5% tolerance level for all 6 intersections.
Figure 9 and Figure 10 illustrate the error distributions for different consensus levels at a 5% tolerance. Lower consensus levels have flatter error curves. At a consensus level of 1, the median prediction has zero error 36% of the time, while at level 12 it has zero error 61% of the time. The error is within 1 second 90.2% of the time at level 12 but only 68.4% of the time at level 1. The ensemble's error exceeds 10 seconds 1.782% of the time at level 1, compared with just 0.026% at level 12; at level 1, the ensemble is therefore 68.5 times more likely to be off by more than 10 seconds than at level 12. This strongly supports hypothesis 2: the consensus level is a reliable indicator of prediction accuracy.
Table 6.
Classification accuracy of different models distinguishing whether a signal phase will change 20 seconds in the future
| Intersection | Model Rank | LSTM | MLP | CNN-LSTM | Transformer | Mean | Median | Vote |
|---|---|---|---|---|---|---|---|---|
| 650058 | 1 | 94.84% | 94.97% | 94.37% | 95.31% | 95.24% | 95.12% | 95.08% |
| | 2 | 94.59% | 94.71% | 94.63% | 95.25% | | | |
| | 3 | 94.68% | 94.82% | 94.64% | 95.14% | | | |
| 650060 | 1 | 95.20% | 95.23% | 93.47% | 95.04% | 95.53% | 95.03% | 95.17% |
| | 2 | 94.40% | 95.11% | 93.22% | 95.09% | | | |
| | 3 | 94.20% | 95.05% | 93.58% | 95.32% | | | |
| 650063 | 1 | 95.95% | 95.90% | 95.88% | 96.07% | 96.17% | 96.10% | 96.08% |
| | 2 | 95.70% | 96.01% | 95.54% | 95.96% | | | |
| | 3 | 95.87% | 95.84% | 95.91% | 96.04% | | | |
| 650064 | 1 | 94.98% | 95.84% | 94.20% | 96.59% | 96.51% | 96.04% | 96.17% |
| | 2 | 95.06% | 95.81% | 93.77% | 96.24% | | | |
| | 3 | 95.67% | 95.97% | 94.36% | 96.45% | | | |
| 650065 | 1 | 95.63% | 96.36% | 94.33% | 96.39% | 96.43% | 96.08% | 96.07% |
| | 2 | 95.88% | 95.75% | 94.18% | 96.35% | | | |
| | 3 | 95.76% | 95.92% | 94.39% | 96.08% | | | |
| 650075 | 1 | 96.01% | 96.63% | 96.01% | 96.07% | 96.73% | 96.07% | 95.86% |
| | 2 | 96.08% | 96.19% | 95.54% | 96.18% | | | |
| | 3 | 95.77% | 96.05% | 95.59% | 96.04% | | | |
Figure 9.
Error distributions for (a) consensus levels 1 to 3, (b) levels 4 to 6, (c) levels 7 to 9, and (d) levels 10 to 12.
Table 7.
Mean Absolute Percentage Error of Each Model for Values under 20 seconds
| Intersection | Model Rank | LSTM | MLP | CNN-LSTM | Transformer | Mean | Median |
|---|---|---|---|---|---|---|---|
| 650058 | 1 | 20.45 | 16.65 | 19.40 | 15.39 | 15.58 | 15.03 |
| | 2 | 19.70 | 16.33 | 20.43 | 15.95 | | |
| | 3 | 21.15 | 16.68 | 20.90 | 15.29 | | |
| 650060 | 1 | 22.64 | 19.49 | 17.80 | 18.07 | 16.72 | 15.89 |
| | 2 | 20.90 | 19.12 | 17.84 | 19.01 | | |
| | 3 | 21.79 | 19.29 | 20.27 | 19.03 | | |
| 650063 | 1 | 22.47 | 19.65 | 20.62 | 16.29 | 16.45 | 15.61 |
| | 2 | 23.14 | 19.20 | 18.67 | 16.90 | | |
| | 3 | 22.47 | 19.19 | 21.23 | 17.56 | | |
| 650064 | 1 | 22.29 | 21.28 | 21.92 | 19.88 | 18.31 | 17.60 |
| | 2 | 23.08 | 21.18 | 22.43 | 18.75 | | |
| | 3 | 25.07 | 22.58 | 23.09 | 19.38 | | |
| 650065 | 1 | 22.74 | 19.97 | 19.65 | 17.56 | 16.70 | 16.32 |
| | 2 | 22.21 | 18.97 | 19.82 | 17.90 | | |
| | 3 | 22.54 | 19.19 | 20.33 | 17.34 | | |
| 650075 | 1 | 16.38 | 15.01 | 12.95 | 11.37 | 11.94 | 10.99 |
| | 2 | 17.64 | 13.68 | 14.96 | 13.25 | | |
| | 3 | 17.62 | 13.64 | 16.25 | 12.94 | | |
Figure 10.
Percentage of outlier predictions with an error greater than 10 seconds