State Estimation for Autonomous Navigation
Outcome:
Developed a hybrid trajectory prediction framework that integrates supervised learning with uncertainty-aware filtering. Our approach employs a neural network to predict both the next state and the process noise covariance of each agent, using the 10 previous trajectory points as input. These learned components are incorporated into an Unscented Kalman Filter (UKF) for real-time state estimation without requiring an explicit dynamics model of the agents. We evaluate our method on the Stanford Drone Dataset and demonstrate strong predictive performance for moving agents.
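To make the filtering side concrete, below is a minimal numpy sketch of a single UKF predict/update step in which the transition function `fx` and the process noise covariance `Q` are supplied externally per step, as a learned network would supply them. The function names, the state layout, and the Merwe sigma-point parameters are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def sigma_points(x, P, alpha=1e-3, beta=2.0, kappa=0.0):
    """Merwe scaled sigma points for state x (n,) and covariance P (n, n)."""
    n = x.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)
    pts = np.vstack([x, x + S.T, x - S.T])               # (2n+1, n)
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    return pts, wm, wc

def ukf_step(x, P, z, fx, Q, hx, R):
    """One UKF predict/update.
    fx : transition applied to each sigma point (in the hybrid setup,
         this is where the network's motion prediction plugs in)
    Q  : process noise covariance (network-predicted, per step)
    hx : measurement function; R : measurement noise covariance."""
    # --- predict ---
    pts, wm, wc = sigma_points(x, P)
    fpts = np.array([fx(p) for p in pts])
    x_pred = wm @ fpts
    P_pred = Q + sum(w * np.outer(d, d)
                     for w, d in zip(wc, fpts - x_pred))
    # --- update ---
    pts2, wm2, wc2 = sigma_points(x_pred, P_pred)
    hpts = np.array([hx(p) for p in pts2])
    z_pred = wm2 @ hpts
    S = R + sum(w * np.outer(d, d)
                for w, d in zip(wc2, hpts - z_pred))
    Pxz = sum(w * np.outer(dx, dz)
              for w, dx, dz in zip(wc2, pts2 - x_pred, hpts - z_pred))
    K = Pxz @ np.linalg.inv(S)               # Kalman gain
    x_new = x_pred + K @ (z - z_pred)
    P_new = P_pred - K @ S @ K.T
    return x_new, P_new
```

Because the sigma-point transform only ever evaluates `fx` pointwise, the learned model never needs to expose Jacobians, which is what lets a black-box network stand in for an explicit dynamics model.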

Motivation:
Multi-object tracking (MOT) in dynamic, uncertain environments is essential for autonomous navigation tasks such as guiding a cyclist through pedestrian-dense areas. The ability to reliably estimate and forecast the trajectories of multiple agents is critical for ensuring safety, enabling collision avoidance, and supporting real-time decision-making. Understanding and forecasting the movement of crowds has long been a focus of research, with applications ranging from evacuation planning to autonomous navigation.
Neural networks, particularly recurrent architectures, have proven highly effective for this task. Long short-term memory (LSTM) networks have been shown to outperform traditional methods in predicting human motion in crowds, thanks to their ability to capture temporal dependencies and nonlinear dynamics. Recent work has also explored hybrid models that combine LSTM-based neural networks with UKF filtering. In these approaches, the LSTM learns motion patterns or uncertainty representations, from which the UKF performs state estimation. Given enough data, this integration enables robust trajectory prediction even in the absence of continuous observations or an explicit dynamics model, effectively combining the strengths of data-driven learning and probabilistic filtering.
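One practical detail of such hybrids is that a network's raw outputs are unconstrained, while a process noise covariance must be positive semi-definite. A common trick, and only a hypothetical sketch here since the report does not specify its parameterization, is to have the network head predict log-variances and exponentiate them into a diagonal covariance:

```python
import numpy as np

def head_to_covariance(raw_logvars):
    """Map unconstrained network outputs (predicted log-variances) to a
    valid diagonal, positive semi-definite process noise covariance Q.
    Hypothetical parameterization, for illustration only."""
    return np.diag(np.exp(np.asarray(raw_logvars, dtype=float)))
```

Exponentiation guarantees strictly positive variances for any real-valued network output, so the downstream filter never receives an invalid covariance.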
Results:
Our trajectory prediction model shows strong performance for moving agents, with low RMSE for both walking pedestrians (4.94 px) and moving bikers (18.19 px). The model’s low inference time (< 0.006 s per agent) also makes it suitable for real-time applications. These results suggest the model can generalize across agent types with different dynamics. Despite these strong results, we found a few edge cases (such as modelling agents at rest) where the model fails, and we outline potential research directions for improving it.
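For reference, the pixel-space RMSE figures above correspond to a metric of roughly the following form; this is the standard definition, assumed here since the report does not spell out its exact evaluation code:

```python
import numpy as np

def rmse_px(pred, truth):
    """Root-mean-square error in pixels between predicted and ground-truth
    2-D positions over all timesteps of a trajectory."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Euclidean error per timestep, then RMS over the trajectory.
    return float(np.sqrt(np.mean(np.sum((pred - truth) ** 2, axis=1))))
```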
This experience highlighted the importance of traditional state estimation and filtering techniques. Despite training our LSTM on a large dataset to accurately predict agent behaviour, the UKF still had to correct the LSTM's output significantly to achieve our strong results. I gained a deep appreciation for incorporating these traditional methods, even in today’s more foundation model-driven world.
Project Report: