Kalman Filter vs Deep Learning for Position Estimation

Source: Deep Learning on Medium


An inertial measurement unit (IMU) is an electronic device that measures and reports a body’s specific force, angular rate, and sometimes the orientation of the body, using a combination of accelerometers, gyroscopes, and sometimes magnetometers.


The Kalman Filter is a well-known algorithm for position estimation and sensor fusion. However, with recent advances in Deep Learning, could we instead learn a function approximator that maps the sensor readings to the real trajectory? In this post, we attempt to estimate the trajectory of an object from a 6-DOF IMU (gyroscope and accelerometer), first with a Kalman Filter and then with a Deep Learning model trained end-to-end. There’s a bit of Fourier Transform involved as well.

The real application of this task is estimating the position of a toothbrush. According to the National Institute of Dental and Craniofacial Research, 42% of children aged 2 to 11 develop a cavity in their baby (primary) teeth. This project, with its accompanying app, aims to help parents monitor their children’s brushing habits and prevent cavities.

http://www.cirbd.cn/product.html Yes, this is a Chinese company based in China’s Big Data Valley.

See that green dinosaur right there? That is one of our products called Xrush (a toothbrush IoT Device!). It is an IoT device that contains an IMU and a transmitter.

Kalman Filter

The Kalman Filter was first introduced and used for trajectory estimation in the Apollo Program. Remember sending this guy to the moon? 🙂

The Kalman Filter is actually pretty neat, but tricky. Although there are plenty of tutorials online about the Kalman Filter, they are mostly task-specific (sensor smoothing, for example). In this case, we would like to model the position of a toothbrush given accelerometer and gyroscope values. How should we go about integrating them into our Kalman Filter? Note that this post assumes an understanding of the Kalman Filter, as it is not a tutorial. It is merely documentation and a presentation of the task. I only talk about the main tweaks needed to model the filter as a position estimation problem.


The Kalman Filter consists of two main steps: prediction and update. The primary tweak required is, of course, in the prediction step.


The process noise is modeled as a Gaussian-distributed random variable. We could use the variation specified in the sensor’s datasheet as its standard deviation. Since we are dealing with a 6-DOF IMU, which consists of an accelerometer and a gyroscope, Runge-Kutta (a numerical integration method) is first used to improve the accuracy of the accelerometer readings by fusing the readings of both sensors. The updated accelerometer reading is then fed into the prediction step of the Kalman Filter.

Using the kinematic equation, the position (displacement) is calculated by

s = ut + ½at²

where u is the initial velocity, a is the acceleration, and t is the elapsed time.

Hence, the state is predicted using the following equation by substituting acceleration values in the x, y, and z-axis.
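As a rough illustration of this prediction step, here is a minimal sketch that assumes a standard constant-acceleration model. The state layout `[px, py, pz, vx, vy, vz]`, the function name `predict`, and the way the process noise covariance is built from a single accelerometer standard deviation `accel_std` are all assumptions made for the example, not the exact filter used in the product.

```python
import numpy as np

def predict(x, P, accel_world, dt, accel_std):
    """One Kalman prediction step for the assumed state [px, py, pz, vx, vy, vz]."""
    I3 = np.eye(3)
    # State transition: p <- p + v*dt, v <- v
    F = np.block([[I3, dt * I3],
                  [np.zeros((3, 3)), I3]])
    # Control input: acceleration adds 0.5*a*dt^2 to position and a*dt to velocity
    B = np.vstack([0.5 * dt**2 * I3, dt * I3])
    # Process noise covariance induced by accelerometer noise (assumed model)
    Q = B @ B.T * accel_std**2
    x_new = F @ x + B @ accel_world
    P_new = F @ P @ F.T + Q
    return x_new, P_new

# Start at rest and apply 1 m/s^2 along x for one 10 ms step
x, P = predict(np.zeros(6), np.eye(6) * 1e-3,
               np.array([1.0, 0.0, 0.0]), dt=0.01, accel_std=0.05)
```

After this single step the x position is 0.5 · 1 · 0.01² and the x velocity is 1 · 0.01, which is just the kinematic equation applied per axis.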

The process noise can be sampled using Markov chain Monte Carlo (MCMC), and more specifically the Metropolis algorithm.
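To make the sampling idea concrete, here is a minimal sketch of a Metropolis sampler targeting a zero-mean Gaussian. The function name `metropolis_gaussian`, the proposal width `step`, and the value of `sigma` are hypothetical choices for the example; in practice `sigma` would come from the sensor datasheet.

```python
import math
import random

def metropolis_gaussian(n_samples, sigma, step=1.0, seed=0):
    """Draw correlated samples from N(0, sigma^2) with the Metropolis algorithm."""
    rng = random.Random(seed)

    def log_target(w):
        # Unnormalised log-density of a zero-mean Gaussian
        return -0.5 * (w / sigma) ** 2

    samples = []
    w = 0.0
    for _ in range(n_samples):
        # Symmetric Gaussian random-walk proposal around the current value
        proposal = w + rng.gauss(0.0, step * sigma)
        log_ratio = log_target(proposal) - log_target(w)
        # Accept with probability min(1, target(proposal) / target(w))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            w = proposal
        samples.append(w)
    return samples

noise = metropolis_gaussian(20000, sigma=0.05)
```

For a plain Gaussian, a direct draw such as `random.gauss` gives the same distribution with less work. Metropolis becomes useful when the noise density is known only up to a normalising constant.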

Of course, quaternions are used for rotation into the world frame before integration. The result of the filter will be presented towards the end of the post, compared side by side with the Deep Learning model.
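The rotation into the world frame can be sketched as follows. This is an illustrative example only: it assumes quaternions in `(w, x, y, z)` order, a world frame with z pointing up, and an accelerometer that reads +g on the z-axis at rest. The function names and the gravity-removal step are assumptions for the example.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix for a quaternion q = (w, x, y, z); q is normalised first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def body_to_world(accel_body, q_body_to_world, gravity=9.81):
    """Rotate a body-frame accelerometer reading into the world frame."""
    a_world = quat_to_rotmat(q_body_to_world) @ accel_body
    # Assumed convention: world z is up, so gravity shows up as +g on z at rest
    return a_world - np.array([0.0, 0.0, gravity])

# Device rotated 90 degrees about z, accelerating along its own x-axis
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
a = body_to_world(np.array([1.0, 0.0, 9.81]), q)
```

Here the body-frame x acceleration ends up along the world y-axis, which is the value that would then be integrated.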

Deep Learning for Position Estimation

This is based on the paper End-to-End Learning Framework for IMU-Based 6-DOF Odometry. The paper proposes a model that takes a sequence of gyroscope and accelerometer readings as input and outputs the relative pose between two sequential moments. It is a simple 1D-CNN that acts as a dimensionality reduction and feature extraction stage, whose output is then fed into LSTM layers for prediction. The input is a window of 200 samples recorded at 100 Hz, i.e. 2 seconds of data.
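To show what the model input looks like, here is a small sketch that slices a continuous 6-channel IMU recording into windows of 200 samples. The function name `make_windows` and the stride of 10 samples are assumptions for the example; only the window size and the 100 Hz rate come from the description above.

```python
import numpy as np

def make_windows(imu, window=200, stride=10):
    """Slice an (N, 6) IMU stream into overlapping (window, 6) segments."""
    starts = range(0, len(imu) - window + 1, stride)
    return np.stack([imu[s:s + window] for s in starts])

# 10 seconds of gyroscope + accelerometer data at 100 Hz (zeros as a stand-in)
imu = np.zeros((1000, 6))
windows = make_windows(imu)
print(windows.shape)  # (81, 200, 6)
```

Each `(200, 6)` window is one training example, and the target would be the relative pose change over that window.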

The position and the orientation are calculated with the following equation, as outlined in the paper.

There are several approaches to represent a 6-DOF relative pose. One approach is to use a 3D translation vector ∆p and a unit quaternion ∆q. This representation correctly handles the orientation when dealing with motions in any direction. From a previous position p_{t−1} and orientation q_{t−1}, the current position p_t and orientation q_t after applying a pose change (∆p, ∆q) is given by

p_t = p_{t−1} + R(q_{t−1}) ∆p
q_t = q_{t−1} ⊗ ∆q

where R(q) is the rotation matrix for q, and ⊗ is the Hamilton product. The quaternions predicted by the neural network need to be normalized in order to ensure that they have unit length. In our experiments, we noted that the predicted quaternions before normalization have an average norm of 4.91, justifying their explicit correction.
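The pose update described above can be sketched in a few lines. This example assumes the update p_t = p_{t−1} + R(q_{t−1}) ∆p and q_t = q_{t−1} ⊗ ∆q, with quaternions stored as `(w, x, y, z)` tuples. The function names are hypothetical, and R(q) is applied through the equivalent product q ⊗ (0, v) ⊗ q* rather than an explicit matrix.

```python
import math

def hamilton(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def normalize(q):
    """Scale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def rotate(q, v):
    """Apply R(q) to a 3-vector v, computed as q * (0, v) * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    return hamilton(hamilton(q, (0.0, *v)), conj)[1:]

def apply_pose_change(p_prev, q_prev, dp, dq):
    """Advance the pose by a predicted relative translation dp and rotation dq."""
    dq = normalize(dq)  # network outputs are not guaranteed to be unit length
    step = rotate(q_prev, dp)
    p = tuple(pi + si for pi, si in zip(p_prev, step))
    q = normalize(hamilton(q_prev, dq))
    return p, q

# Facing 90 degrees about z, move one unit "forward" with an unnormalised identity dq
q_prev = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
p, q = apply_pose_change((0.0, 0.0, 0.0), q_prev, (1.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0))
```

Chaining `apply_pose_change` over successive windows turns the predicted relative poses into a full trajectory.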

The OxIOD dataset from Oxford is collected using smartphones and aerial vehicles, with sequences such as walking, swinging and so forth.

Model Comparison

The Deep Learning model works as the paper claimed.

Output Generated by the Deep Learning model

In contrast, the position estimated by the Kalman Filter is error-prone and highly variable due to drift: because the acceleration is integrated twice to obtain position, small sensor errors accumulate over time.

The output of the Kalman Filter

Toothbrush Trajectory

Recall that we are trying to outline the trajectory of a toothbrush. We face the following problems.

  1. Brushing teeth involves much higher-frequency motion than regular tasks such as walking. The movements are much more vigorous. Since the model is trained on data that was collected while walking, this may pose a problem.
  2. The sensors in the smartphones used to collect the training data are of much better quality. We aim to cut down on costs and attempt to use the cheapest sensor available on the market.

While a reduction in cost is desirable, we would like to ensure the quality of the product by using sophisticated algorithms and models to make up for the reduction in sensor quality.

To address problem (1), the signals are upsampled by factors of 1, 2, 5, 10, and 100, and the model is tested at each upsampled frequency. For problem (2), cheaper sensors are more prone to noise. Hence, before feeding the signal into the Deep Learning model, we use the Fourier Transform to filter out its high-frequency components.
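One simple way to upsample the signal is linear interpolation between neighbouring samples. The sketch below assumes that approach; the function name `upsample` and the interpolation method are illustrative choices, not necessarily what was used for the product.

```python
import numpy as np

def upsample(signal, factor):
    """Linearly interpolate an (N, C) signal so each gap holds `factor` steps."""
    n = len(signal)
    t_old = np.arange(n)
    # Keep the original endpoints and place factor - 1 new samples in each gap
    t_new = np.linspace(0, n - 1, (n - 1) * factor + 1)
    return np.column_stack([np.interp(t_new, t_old, signal[:, c])
                            for c in range(signal.shape[1])])

imu = np.zeros((200, 6))  # stand-in for 2 s of 6-channel data at 100 Hz
for factor in (1, 2, 5, 10, 100):
    print(factor, upsample(imu, factor).shape)
```

A factor of 1 returns the signal unchanged, which gives a baseline to compare the other factors against.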

Without going into too much detail, the Fourier Transform breaks a signal down into a sum of sines and cosines, so that it can be visualized in the frequency domain. Unwanted high-frequency components (noise) can then be filtered out.
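Here is a minimal sketch of this kind of filtering with NumPy's FFT routines. The cutoff of 10 Hz and the synthetic test signal are assumptions for the example; a real cutoff would have to be tuned for brushing motion.

```python
import numpy as np

def fft_lowpass(signal, fs, cutoff_hz):
    """Zero out frequency components above cutoff_hz in a 1-D real signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    # Transform back to the time domain, keeping the original length
    return np.fft.irfft(spectrum, n=len(signal))

fs = 100.0
t = np.arange(200) / fs                             # 2 s at 100 Hz
clean = np.sin(2 * np.pi * 2.0 * t)                 # slow 2 Hz motion
noisy = clean + 0.5 * np.sin(2 * np.pi * 40.0 * t)  # plus 40 Hz "noise"
filtered = fft_lowpass(noisy, fs, cutoff_hz=10.0)
print(np.max(np.abs(filtered - clean)))
```

Zeroing bins outright is the bluntest possible filter. It works cleanly here because both tones fall exactly on FFT bins, whereas real sensor data would usually call for a smoother roll-off.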

All of this fuss is due to the fact that we do not have ground truth labels: it is simply not possible (at the moment) to collect samples in the mouth with the exact brushing location known without more sophisticated equipment, such as anchors in a lab. We can only rely on the labeled dataset, OxIOD, provided by Oxford University.

So in the end, how does the device behave? Recall that this is the device. It’s called the Xrush.

Xrush, an IoT device for a toothbrush.

If I brush my teeth at 4 different corners of my mouth, this is the trajectory produced by the Deep Learning model after the pre-processing steps outlined for problems (1) and (2).

Brushing Trajectory at Four Different Corners In The Mouth

It is evident that the model is able to recognize the change in position as I brush a different location. The sharp turn gives a good indication of the model performance.

Having a real trajectory of the entire brushing session is, of course, desirable, but it is very difficult. For example, we all have different brushing habits and may start at any position, so initializing the orientation and position remains a huge challenge. For the time being, position estimation is still in R&D. The product can actually do an amazing job by modeling the task as a classification problem instead, predicting the brushing location with a trained Deep Learning model. In this case, generalization remains a huge challenge. However, as the product matures, more data will come in, creating a positive feedback loop that will increase the classification accuracy.


Is Deep Learning signifying the end of Signal Processing? 🙂