Training taking too long? Use Exponentially Weighted Averages!

Original article was published by Danyal Jamil on Artificial Intelligence on Medium



As ML engineers and data scientists, we all love data. Hearing that we are getting more data to work with sounds like heaven, but

Not everything is as it seems.

“What is the drawback here?” you might ask. Most of us train on CPUs, and the lucky ones among us have GPUs, but either way computing power is limited. The main drawback of more data is the obvious one: training takes too long.

Exponentially Weighted Averages

Let me explain this the way Andrew Ng explains it. Suppose you have daily temperature data for London. It might look something like:

╔═══════════╦════════════════════╗
║ Day       ║ Temperature        ║
╠═══════════╬════════════════════╣
║ X1        ║ 40 ˚F              ║
║ X2        ║ 40 ˚F              ║
║ X3        ║ 49 ˚F              ║
║ ..        ║ ...                ║
║ ..        ║ ...                ║
║ ..        ║ ...                ║
║ X179      ║ 63 ˚F              ║
║ X180      ║ 61 ˚F              ║
╚═══════════╩════════════════════╝

Here ‘X1’ is the temperature on day 1 and ‘X180’ is the temperature on day 180. If you plot the data, you’ll see something like this:

Picture by Author

So, what is the algorithm?

The formula is: Vt = ß * Vt-1 + (1 - ß) * øt

Where:

  • ‘øt’ is the current day’s temperature.
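  • ‘Vt-1’ is the weighted average computed up to the previous day (not the previous day’s raw temperature).
  • And ‘ß’ is a hyperparameter between 0 and 1. It controls how much weight the past gets, and therefore how the graph looks.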

The larger ß is, the smoother the curve, because each Vt averages over roughly 1 / (1 - ß) days; the price is that the curve reacts more slowly to changes. The smaller ß is, the noisier the curve.

Picture by Author

Here, the lines are:

  • Green: ß = 0.98
  • Red: ß = 0.9
  • Yellow: ß = 0.5

In other words:

  • ß close to 1: green line.
  • ß somewhere in between: red line.
  • ß closer to 0: yellow line.
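The effect of ß on smoothness can also be checked numerically. The sketch below is not from the original article: the data is made up (180 days that alternate 5 ˚F above and below 50 ˚F), and `ewa_series` and `roughness` are hypothetical helper names. It applies the update formula with the three ß values listed above and measures how much each curve moves from one day to the next.

```python
def ewa_series(temps, beta):
    """Return the exponentially weighted average after each day, starting from V = 0."""
    v = 0.0
    out = []
    for theta in temps:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out


def roughness(series):
    """Mean absolute day-to-day change: a smaller value means a smoother curve."""
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(steps) / len(steps)


# Made-up data (an assumption for illustration): 180 noisy days around 50 F.
temps = [50 + (5 if day % 2 == 0 else -5) for day in range(180)]

for beta in (0.5, 0.9, 0.98):
    smoothed = ewa_series(temps, beta)
    print(f"beta={beta}: roughness={roughness(smoothed):.2f}")
```

On this data the roughness should shrink as ß grows, which matches the green, red, and yellow lines in the picture.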

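What we have is a set of exponentially decaying weights. Each Vt is a weighted sum of all the temperatures so far: today’s temperature gets weight (1 - ß), yesterday’s gets (1 - ß) * ß, the day before that gets (1 - ß) * ß^2, and so on. These weights add up to 1, roughly. Hence, we can say that V100 is a weighted average of the first 100 days’ temperatures in which the most recent days count the most.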
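To make the weighting explicit, the recursion can be unrolled by hand. The derivation below is a sketch, not part of the original article: it writes ß as \beta and øt as \theta_t, and it assumes the starting value V_0 = 0.

```latex
\begin{aligned}
V_{100} &= (1-\beta)\,\theta_{100} + \beta\,V_{99} \\
        &= (1-\beta)\,\theta_{100} + (1-\beta)\,\beta\,\theta_{99} + \beta^{2}\,V_{98} \\
        &= (1-\beta)\sum_{k=0}^{99} \beta^{k}\,\theta_{100-k} + \beta^{100}\,V_{0},
           \qquad V_{0} = 0 \\[4pt]
\text{sum of weights} &= (1-\beta)\sum_{k=0}^{99} \beta^{k} = 1 - \beta^{100}
\end{aligned}
```

For ß = 0.9 the term \beta^{100} is tiny, so the weights sum to almost exactly 1. For small t the sum 1 - \beta^{t} is noticeably below 1, which is why the early values of Vt come out too low.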

Implementing the algorithm

# Pseudocode
Vø = 0
For loop {
    Get next øt
    Vø = ß * Vø + (1 - ß) * øt
}

That’s it! The biggest advantage of this algorithm is that it uses very little memory: we initialize a single value and then keep updating it.

If you want to compute running averages over many values, this is useful precisely because it takes up so little space.
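The memory argument can be made concrete with a minimal Python sketch of the pseudocode. The class name `RunningEWA` and the generator `temperature_stream` are hypothetical, and the readings are made up; the point is only that one float of state is kept, however many values pass through.

```python
class RunningEWA:
    """Keeps a single float of state, no matter how many values are seen."""

    def __init__(self, beta):
        self.beta = beta
        self.value = 0.0  # V starts at zero, as in the pseudocode

    def update(self, theta):
        """Fold one new observation into the average and return the new V."""
        self.value = self.beta * self.value + (1 - self.beta) * theta
        return self.value


def temperature_stream():
    """Hypothetical data source that yields one made-up reading at a time."""
    for day in range(100_000):
        yield 40 + day % 25


avg = RunningEWA(beta=0.9)
for reading in temperature_stream():
    avg.update(reading)  # earlier readings are never stored

print(avg.value)
```

A plain average over a sliding window would need to keep the last several readings around; here each reading is used once and then discarded.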

What is Bias Correction?

Picture by Author.

The green line is what we want, but the purple line is what we obtain using the equation.

Because we initialize V to zero, the first few values of Vt are much smaller than the actual temperatures, and hence the graph starts pretty low and is not what we expect.

So, in order to cope with this, instead of using Vt, we use Vt / (1 - ß^t), where ß^t is ß raised to the power t, and this pretty much solves the problem. Also, when t is large enough, ß^t is close to 0, so there is almost no correction, which is exactly what the equation shows.
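Here is a small sketch of the correction, using the first three temperatures from the table (40, 40, 49) and ß = 0.9. The function name `ewa_with_bias_correction` is a hypothetical choice, and the choice of ß is an assumption for illustration.

```python
def ewa_with_bias_correction(temps, beta):
    """Return (raw, corrected) lists, where corrected uses V_t / (1 - beta**t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(temps, start=1):
        v = beta * v + (1 - beta) * theta      # the usual update, starting from V = 0
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # bias correction for day t
    return raw, corrected


temps = [40, 40, 49]  # first three readings from the table
raw, corrected = ewa_with_bias_correction(temps, beta=0.9)
for day, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"day {day}: raw={r:.2f}  corrected={c:.2f}")
```

The raw values start near 4 and climb slowly, far below the real temperatures, while the corrected values start at 40 on day 1. As t grows, 1 - ß^t approaches 1 and the two curves merge.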