Source: Deep Learning on Medium
A Prediction System for Pedestrian Behaviors and Trajectories
Abstract: This article proposes to use pedestrians’ body pose and head orientation/gaze direction, in addition to speed and location, for movement trajectory prediction.
To increase the safety of autonomous driving, predicting the behavior of surrounding vehicles and pedestrians is one of the most important problems. In planning and control, detection, tracking and collision avoidance of obstacles are critical. In behavior modeling of pedestrians on the walkway, their interaction with other traffic participants can provide clues that increase the accuracy of prediction systems. This kind of interaction is usually regarded as pedestrians’ sociality, reported in early work on repulsion, i.e. the social force model [1]. Recent work has applied deep learning, such as Social LSTM [2], Social GAN [3], SoPhie [4], Social Attention [5] and Social Ways [6]. It is worth mentioning that the categorization of pedestrians (children, teenagers, adults, the elderly) is an important label for their behavior. Inferring human intentions and activities is also helpful for understanding their sociality.
Interaction modeling needs perceptual clues of interaction, i.e. human signals captured by sensors (LiDAR, camera, V2X, etc.). Unfortunately, most work in this area is limited to pedestrians’ speed, direction and trajectory. Though these clues can reflect interaction behavior, more direct clues such as activity style, eye gaze direction, body posture, hand gestures, facial expression and speech signals are seldom discussed.
Ma et al. [13] modeled interaction for prediction using game theory, treating walking as a Markov decision process (MDP) and learning the behavior model with inverse optimal control (IOC). Camera video is captured to estimate pedestrian trajectories with computer vision methods and to classify pedestrians as young or old, women or men. The shortcomings lie in the lack of detail about body pose, hand gestures and head orientation/eye gaze direction. Besides, the work does not analyze the effect of static obstacles or roads on human behavior.
Liang et al. [14] applied computer vision to extract more visual features of pedestrians and the surrounding environment, and provided an end-to-end learning system for pedestrian activity prediction with four modules: a pedestrian behavior module, a pedestrian interaction module, a trajectory generation module and an activity prediction module. The first two modules provide feature extraction; the latter two predict trajectory and activity, respectively. The behavior module extracts body features, such as detected key points. The interaction module considers relationship features with other targets (vehicles and humans) and the neighboring environment (road, walkway or grass). The limitation of this method is that it uses no information about face movement, such as expression, eye gaze and mouth opening/closing; activity classification relies on body-part movement and trajectory, without facial expression as a clue.
In this article, we explicitly add the pedestrian’s pose, head orientation and eye gaze direction as interaction clues, on top of state-of-the-art behavior and trajectory prediction systems.
2. Pedestrian Behavior and Trajectory Prediction
First, let’s introduce the input signals. As shown in Figure 1, the traffic environment is an intersection controlled by traffic lights. Vehicles (solid blue rectangles) drive along horizontal and vertical bi-directional roads, and lanes of different directions are separated by curbs (yellow lines). The traffic signal is currently green for the vertical direction; pedestrians (small solid blue rectangles) are on the walkway, and some of them are crossing the intersection (empty purple rectangles). One car on the vertical road is about to turn right at the intersection (signaled by the red point at its right rear); it must yield to pedestrians but may squeeze its way through (yellow area). If the traffic light turns from green to red, the right turn resembles a lane change (the yellow area would change to pink). The car on the right is braking as it approaches the intersection (red line at the rear), warning the car behind to avoid a rear-end collision (pink region); however, it may run the red light (light yellow region). Another car on the left is waiting to turn left (red point at the left rear); we still estimate the possibility of it running the red light (light yellow area as well). One more thing should be explained: when the traffic light turns green, the car in this lane could go straight or turn left (note: at this intersection there is no left-turn-only lane), so the left turn is also regarded as a lane change (light yellow area), but its priority is still lower than the pedestrians’ in this situation. Pedestrians are safe on the walkway, and some of them are on the grass (green area). The black regions depict buildings. Note: if there is no traffic light but instead a four-way stop (red lines at the intersection, similar to vehicle braking), the traffic rule is “first come, first pass”.
Based on this scene, the system input signals, shown in Figure 2, include a road map (grass and buildings are rendered), a traffic light map, a speed limit map, a pedestrian way map (illumination is inversely proportional to accessibility), a vehicle signal map (braking and lane-change signals emitted from the vehicle, either by front/rear lights or by driver/passenger hand gestures), an obstacle location map, a history trajectory map and a head/gaze direction map (if eyes are not detected, head only; if the face is not detected, pose only).
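As a rough illustration, the eight rendered maps can be stacked channel-wise into a single tensor for the CNN encoder. The resolution and the map names below are assumptions for the sketch, not the system’s actual configuration:

```python
import numpy as np

H, W = 128, 128  # assumed bird's-eye-view rendering resolution

# The eight rendered input maps from Figure 2, each a single-channel
# top-down image aligned to the same spatial grid (names are illustrative).
input_maps = {
    "road_map": np.zeros((H, W), dtype=np.float32),
    "traffic_light_map": np.zeros((H, W), dtype=np.float32),
    "speed_limit_map": np.zeros((H, W), dtype=np.float32),
    "pedestrian_way_map": np.zeros((H, W), dtype=np.float32),
    "vehicle_signal_map": np.zeros((H, W), dtype=np.float32),
    "obstacle_location_map": np.zeros((H, W), dtype=np.float32),
    "history_trajectory_map": np.zeros((H, W), dtype=np.float32),
    "head_gaze_direction_map": np.zeros((H, W), dtype=np.float32),
}

# Stack into a (channels, height, width) tensor for the CNN encoder.
encoder_input = np.stack(list(input_maps.values()), axis=0)
print(encoder_input.shape)  # (8, 128, 128)
```

Keeping every map on the same spatial grid lets the encoder fuse all clues with ordinary convolutions.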
The output is the future trajectories of all obstacles, shown in Figure 3.
Based on these input and output signals, the prediction model’s system diagram is illustrated in Figure 4. The “Encoder” is a CNN model for feature extraction, including pedestrians’ pose (arms, legs) and gaze direction as the new interaction clues. The “Vehicle LSTM” predicts the vehicles’ directions, speeds, way points and location heatmaps, where the LSTM is one kind of RNN that captures temporal features. The “Pedestrian LSTM” likewise predicts humans’ directions, speeds, way points and location heatmaps over time. The “Road Decoder” is a CNN model that outputs the drivable area, and the final “Fully Connected Layers” output the rendered future trajectory map of pedestrians and vehicles.
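Since both prediction branches rely on the LSTM’s temporal state, a minimal NumPy sketch of a single LSTM step may help illustrate how the recurrent state (h, c) carries history between frames. Dimensions and weights below are purely illustrative:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: x is the encoder feature for the current frame;
    (h_prev, c_prev) carry the temporal state forward between frames."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = W @ np.concatenate([x, h_prev]) + b  # all four gates in one matmul
    d = h_prev.shape[0]
    i = sigmoid(z[0 * d:1 * d])              # input gate
    f = sigmoid(z[1 * d:2 * d])              # forget gate
    o = sigmoid(z[2 * d:3 * d])              # output gate
    g = np.tanh(z[3 * d:4 * d])              # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
x_dim, h_dim = 16, 8                          # illustrative feature/state sizes
W = rng.standard_normal((4 * h_dim, x_dim + h_dim)) * 0.1
b = np.zeros(4 * h_dim)
h, c = np.zeros(h_dim), np.zeros(h_dim)
for t in range(5):                            # unroll over 5 history frames
    h, c = lstm_step(rng.standard_normal(x_dim), h, c, W, b)
print(h.shape)  # (8,)
```

In the actual system, the final hidden state would feed the way-point and location-heatmap heads for each agent.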
In model training, the loss function includes the imitation losses (over the rendered bird’s-eye-view images shown in Figure 2), a vehicle collision term, a vehicle drivable-region term, a vehicle on-road term, a vehicle geometric loss (way points), a pedestrian-vehicle collision term and a pedestrian on-walkway term.
The imitation losses are the same as in ChauffeurNet [15]. The other terms are defined below.
Assume the vehicle traffic signal is S_vehicle (the region caused by the vehicle signal), and the vehicle’s predicted location heatmap and true location are Obj_vehicle and Obj_vehicle_GT; then the vehicle collision term is defined as:
where λ is the vehicle traffic signal weight, 0 < λ < 1 (λ = 0.3 is suggested), and H() is the cross-entropy function.
Assume the predicted drivable region and its true region are R_vehicle and R_vehicle_GT respectively; then the drivable-region term is defined as:
The vehicle on-road term is:
Besides, the vehicle geometric loss comes from the predicted vehicle trajectory region; assume the true region (a binary map) is G_vehicle_GT, then the vehicle geometric loss is:
Assume the pedestrians’ predicted location heatmap is Obj_pedestrian, and the predicted pedestrian accessible map and its true map are T_pedestrian and T_pedestrian_GT respectively; then the vehicle-pedestrian collision term is:
Finally, the pedestrian on-walkway term is defined as:
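Since the exact equations appear only as figures in the original post, the sketch below shows just one plausible reading of the vehicle-side definitions above. The heatmap shapes, the map contents, and the way S_vehicle is mixed in at weight λ are all assumptions:

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise binary cross entropy H(target, pred) over a heatmap."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

# Illustrative 2D heatmaps with values in [0, 1]; shapes are assumed.
Hh, Ww = 64, 64
rng = np.random.default_rng(1)
obj_vehicle    = rng.uniform(size=(Hh, Ww))                           # predicted heatmap
obj_vehicle_gt = (rng.uniform(size=(Hh, Ww)) > 0.95).astype(float)    # true location
s_vehicle      = (rng.uniform(size=(Hh, Ww)) > 0.90).astype(float)    # signal region
lam = 0.3  # suggested vehicle traffic-signal weight, 0 < lambda < 1

# One plausible collision term: cross entropy against the true location,
# with the vehicle-signal region mixed in at weight lambda (an assumption).
collision_target = np.clip(obj_vehicle_gt + lam * s_vehicle, 0.0, 1.0)
loss_collision = cross_entropy(obj_vehicle, collision_target)

r_vehicle, r_vehicle_gt = rng.uniform(size=(Hh, Ww)), obj_vehicle_gt
loss_drivable = cross_entropy(r_vehicle, r_vehicle_gt)    # drivable-region term
loss_on_road  = cross_entropy(obj_vehicle, r_vehicle_gt)  # on-road term

total = loss_collision + loss_drivable + loss_on_road
print(total > 0)  # True
```

The pedestrian-side terms would follow the same cross-entropy pattern with Obj_pedestrian and T_pedestrian in place of the vehicle maps.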
To improve the system’s learning capability in behavior prediction, we propose another, GAN-based system, shown in Figure 5.
A GAN needs a generator (G) to learn the data distribution and a discriminator (D) to estimate whether a sample comes from the training data or from the generator [8]. The original GAN’s generator takes a noise input (e.g. from a Gaussian distribution) for data generation, while a conditional GAN [9] feeds an additional conditioning signal to both G and D. In Figure 5, the eight rendered maps go into the encoder, whose output feature maps are fed into G. The three modules in G are the same as those in Figure 4, and G’s output is given to D to decide whether it is real or fake. In D, the “Classifier LSTM” is an LSTM-based temporal sequence classification model, and the “Fully Connected Layer” outputs the decision.
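To make the conditional-GAN objective concrete, here is a minimal sketch with stub affine G and D. In the actual system, G would be the encoder/LSTM/decoder stack of Figure 4 and D the Classifier LSTM, so the dimensions and modules below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

cond_dim, noise_dim, out_dim = 32, 8, 16  # assumed feature/noise/output sizes

# Stub affine generator and discriminator standing in for the real modules.
W_g = rng.standard_normal((out_dim, cond_dim + noise_dim)) * 0.1
W_d = rng.standard_normal((1, cond_dim + out_dim)) * 0.1

def G(cond, z):
    """Generator: maps (conditioning features, noise) to a trajectory code."""
    return np.tanh(W_g @ np.concatenate([cond, z]))

def D(cond, traj):
    """Discriminator: probability that traj is real, given the conditioning."""
    return float(sigmoid((W_d @ np.concatenate([cond, traj]))[0]))

cond = rng.standard_normal(cond_dim)      # encoder feature maps (flattened)
real = rng.standard_normal(out_dim)       # ground-truth future trajectory code
fake = G(cond, rng.standard_normal(noise_dim))

# Conditional GAN objectives: both G and D see the conditioning signal.
loss_d = -np.log(D(cond, real)) - np.log(1.0 - D(cond, fake))
loss_g = -np.log(D(cond, fake))
print(loss_d > 0 and loss_g > 0)  # True
```

The key point is that the conditioning vector enters both G and D, so the discriminator judges trajectories relative to the scene, not in isolation.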
Next, we explain how to estimate the pedestrian’s pose, head orientation and gaze direction, shown in Figure 6. Body pose estimation can be done with OpenPose [10], gaze direction estimation with a deep learning model [11], and head orientation with a keypoint-free deep learning model [12]. If eyes are not detected, we fall back to face detection and use the head orientation instead. If face detection also fails, the head status from the body pose estimation result is taken.
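The fallback order described above can be sketched as a simple cascade (the function and label names are hypothetical):

```python
def interaction_clue(eyes_detected, face_detected, pose_detected):
    """Pick the best available head/gaze clue, falling back as in the text:
    gaze from eyes -> head orientation from face -> head status from body pose."""
    if eyes_detected:
        return "gaze_direction"    # eye-region landmark model
    if face_detected:
        return "head_orientation"  # keypoint-free head-pose model
    if pose_detected:
        return "pose_head_status"  # head joint from body-pose estimation
    return None                    # pedestrian not visible enough for a clue

print(interaction_clue(True, True, True))    # gaze_direction
print(interaction_clue(False, True, True))   # head_orientation
print(interaction_clue(False, False, True))  # pose_head_status
```

Whichever clue survives the cascade is what gets rendered into the head/gaze direction input map.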
For vehicle signal detection/recognition (either from vehicle lights or hand gestures), refer to [16–18] for more details.
This article puts vehicle and pedestrian trajectory prediction together in one deep learning framework, with the focus on pedestrians. In interaction feature extraction, we add head/gaze direction to explicitly take facial movement into account in the behavior model.
1. D. Helbing, P. Molnar. “Social force model for pedestrian dynamics”. Physical review E, 51(5):4282, 1995
2. A Alahi et al.,“Social LSTM: Human Trajectory Prediction in Crowded Spaces”, IEEE CVPR 2016
3. A Gupta et al. “Social GAN: Socially acceptable trajectories with generative adversarial networks”, IEEE CVPR, 2018
4. A Sadeghian et al. “Sophie: An attentive GAN for predicting paths compliant to social and physical constraints”, arXiv:1806.01482, 2018.
5. A Vemula, K. Muelling, J. Oh. “Social attention: Modeling attention in human crowds”. IEEE ICRA 2018.
6. J Amirian et al, “Social ways: Learning multi-modal distributions of pedestrian trajectories with GANs”, IEEE CVPR Workshop, 2019
7. I Goodfellow et al., “Deep Learning”, MIT Press, 2016.
8. I Goodfellow et al.,“Generative Adversarial Nets”, NIPS, 2014
9. M Mirza, S Osindero,“Conditional Generative Adversarial Nets”, arXiv:1411.1784, 2014
10. Z Cao et al., “OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields”, arXiv:1812.08008, 2018
11. S Park et al., “Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings”, arXiv:1805.04771, 2018
12. N Ruiz, E Chong, J M Rehg, “Fine-grained head pose estimation without keypoints”, arXiv:1710.00925, 2017
13. W Ma et al.,“Forecasting Interactive Dynamics of Pedestrians with Fictitious Play”, IEEE CVPR 2017
14. J Liang et al., “Peeking into the Future: Predicting Future Person Activities and Locations in Videos”, IEEE CVPR 2019
15. M Bansal et al., “ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst”, arXiv:1812.03079, 2018
16. D Frossard, E Kee, R Urtasun, “DeepSignals: Predicting Intent of Drivers Through Visual Signals”, IEEE ICRA, 2019
17. W Luo, B Yang, R Urtasun, “Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net”, IEEE CVPR 2018
18. H Kretzschmar, J Zhu, “Cyclist hand signal detection by an autonomous vehicle”, Google patent, US 9,014,905 B1, April 2015