Fighting Fraud with Anomaly Detection

Original article was published on Artificial Intelligence on Medium

What could you do with 800 million dollars? What could the government do with 800 million dollars? That is the financial loss caused by credit card fraud in Canada every year.

When a fraud occurs, information about the victim can be used against them. Passwords, personal identification numbers, and sometimes even the physical credit card can be stolen. However, the one thing that cannot be stolen is behavior.

A thief will often use the stolen credit card for their own purposes, producing purchasing patterns that differ from the owner's. To demonstrate this, imagine that we have collected data on a credit card for the past few years. Suddenly, a new example, denoted with a red circle, appears.

We can tell that the new example is probably fraudulent, based on the difference in behavior.

Gaussian Anomaly Detection

How do computers detect when a data point is different from the rest? The answer is anomaly detection. Gaussian anomaly detection, which falls in the category of semi-supervised learning, fits a Gaussian distribution to a previous data set of non-fraudulent (non-anomalous) data, and then finds the probability of a new, possibly fraudulent (anomalous), data point under that distribution.

The probability density of a Gaussian-distributed variable with mean μ and standard deviation 1. Source: Wikipedia

To perform Gaussian anomaly detection with a single variable:

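  • Find the mean of all the data points. This is denoted with the Greek letter μ:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
  • Calculate the variance of the data set, denoted with σ²:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
Where m is the number of data points, and x is the data set for a single variable, with $x^{(i)}$ its ith value. The standard deviation, denoted with σ, is the square root of the variance.
  • To calculate the probability of a new value x under the Gaussian distribution, given the parameters μ and σ²:
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Where P(x) denotes the probability (strictly speaking, the probability density) of the new value.
  • If P(x) is less than some constant ε that we choose, we classify the new value as an anomaly.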
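
The single-variable steps above can be sketched in a few lines of Python. This is a minimal illustration and not a production fraud detector: the purchase amounts, the function names, and the threshold `EPSILON` are all made up for the example, and the variance is assumed to be the 1/m (population) estimate.

```python
import math

def fit_gaussian(values):
    """Return (mu, sigma2) for one variable, using the 1/m variance estimate."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((x - mu) ** 2 for x in values) / m
    return mu, sigma2

def gaussian_density(x, mu, sigma2):
    """Gaussian density P(x) for mean mu and variance sigma2 (sigma2 must be > 0)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as an anomaly when its density falls below the threshold epsilon."""
    return gaussian_density(x, mu, sigma2) < epsilon

# Hypothetical purchase amounts in dollars (invented data, for illustration only).
history = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 26.0]
mu, sigma2 = fit_gaussian(history)

EPSILON = 1e-3  # assumed threshold; in practice it has to be tuned

typical_flag = is_anomaly(25.0, mu, sigma2, EPSILON)   # a purchase near the mean
unusual_flag = is_anomaly(500.0, mu, sigma2, EPSILON)  # a purchase far from the mean
print(mu, sigma2, typical_flag, unusual_flag)
```

With this toy data, a purchase close to the mean stays above the threshold, while a purchase far outside the usual range gets a density that is effectively zero and is flagged.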

The formula for P(x) describes a Gaussian curve centered at μ and stretched by σ. We do not want to be limited to only one variable, and there are a few ways to extend the method.

If we assume the variables are independent of each other, we can extend P(x) to more variables by fitting μ and σ² for every variable and taking the product of the probabilities of each variable.

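  • Compute μ and σ² with the formulas above for each individual variable.
  • To calculate the probability of a new data point x, given a vector μ and a vector σ²:
$$P(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
Where μj and σ²j represent the mean and variance of the jth variable, $x_j$ is the jth component of the data point, and n is the dimension of a data point.
  • If P(x) is less than some constant ε that we choose, we classify the data point as an anomaly.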
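
As a rough sketch of the multi-variable version under the independence assumption, the per-variable densities can simply be multiplied together. The two features (purchase amount and hour of day), the data, and the threshold below are hypothetical and only serve to show the mechanics; the variance again assumes the 1/m estimate.

```python
import math

def fit_per_feature(rows):
    """Return per-feature means and 1/m variances for a list of equal-length rows."""
    m = len(rows)
    n = len(rows[0])
    mus = [sum(row[j] for row in rows) / m for j in range(n)]
    sigma2s = [sum((row[j] - mus[j]) ** 2 for row in rows) / m for j in range(n)]
    return mus, sigma2s

def joint_density(x, mus, sigma2s):
    """Product of one-dimensional Gaussian densities (features assumed independent)."""
    p = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        # Each variance s2 must be > 0, otherwise this divides by zero.
        p *= math.exp(-((xj - mu) ** 2) / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

# Hypothetical history: (purchase amount in dollars, hour of day). Invented data.
history = [(20.0, 12.0), (25.0, 13.0), (22.0, 11.0), (30.0, 14.0), (28.0, 10.0)]
mus, sigma2s = fit_per_feature(history)

EPSILON = 1e-4  # assumed threshold; in practice it has to be tuned

p_typical = joint_density((25.0, 12.0), mus, sigma2s)  # close to past behavior
p_unusual = joint_density((400.0, 3.0), mus, sigma2s)  # large purchase at 3 a.m.
print(p_typical < EPSILON, p_unusual < EPSILON)
```

Because the densities are multiplied, one very unlikely feature is enough to push the whole product below ε. With many features the product can become extremely small even for normal points, so summing log-densities is a common alternative.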

When we fit μ and σ² to our previous example, we get a contour graph that looks like this:

After we fit a Gaussian distribution model to our data set based on μ and σ², most of the probability is concentrated in an ellipse around the data. The warmer colors in this example represent a higher probability. The new example has a probability very close to zero, so we would classify it as an anomaly.