# [Paper review] On Choosing and Bounding Probability Metrics (part 1)

Source: Deep Learning on Medium

This post is a paper review for educational purpose.

In the past post, while talking about GAN’s techniques, I mentioned a little about f-divergence. In this post, I will look at how divergence and metric are distinguished. This time it will be divided into two parts because it is large. In part 1, we will explain the various metrics. In part 2, we will look at how the various metrics are related to each other and how convergence trends differ.

Therefore, the article to be introduced this time is a paper written quite a long time ago (2002 is about 15 years ago, but it is a very long time considering the development speed of technology these days). It is a paper published in 2002 that does not cover the latest technology such as CNN.

You can see the above table. Table 1 lists the various metrics covered in this paper. I am curious here.

`"How can we distinguish between the words DIVERGENCE and METRIC?"`

We can find the answer to this question in the definition of metric.

We can quote the Wiki’s information to get a definition of the above metric. The metric satisfies four properties.

1) Non-negativity

2) Identity of indiscernibles

3) Symmetry

4) Triangle inequality

Based on these basic properties, it is related to Hilbert space or Banach space, and the concept used for various probability model analysis is metric.

“So what’s different from divergence?”

The most important feature of the line I know is that divergence is asymmetric. Therefore, this violates the metric’s third property. Perhaps asymmetry is a key feature that distinguishes metric from divergence.

Therefore, it would not be awkward for table 1 to include the word divergence in the item metric.

Now, let’s analyze various metrics in figure 2. Figure 2 is an illustration to help you understand the various metrics. From the bottom are D (discrepancy), P (prokhorov), and W (wasserstein), which are defined in the metric space.
In the center, there is TV(total variation). As you can see from the picture, TV has directionality with all other metrics. (Having directionality implies that TV can be appropriately modified to be represented by other metrics.)

Also, as we move upwards, the f-divergence family mentioned in the last post appears. Typically, the relative entropy (Kullback-Leiblur divergence) and Chi2 can be checked. This generally represents a non-metric distance.
Intuitively, figure 2 can be interpreted as giving a strong penalty to the convergence of the distance as it goes to the upper right. (Let’s discuss it in part 2).

Now let’s look at the concept of each metric.

1. Discrepancy metric

Defined space is any metric space, definition is as follows.

This represents the difference between the two probability measures for a value in a closed ball. Of course this difference is limited by supremum. Also it assumes values ​​in [0,1].

This concept is used to analyze the WGAN and feature space with the concept of Maximum Mean Discrepancy (MMD).

2. Wasserstein (or Kantorovich) metric

Define space is real space or any metric space, definition is as follows.

It is also equivalent to the following equation under the condition of any separable metric space.

This can be calculated through supremum based on all functions h that satisfy the Lipschitz condition of the RHS. The biggest difference from the previous discrepancy is the area of ​​the value where the metric is defined. For Wasserstein, [0, diam(Ω)], where diam(Ω) is the diameter of the metric space (Ω, d). Another difference is that you should look for yourself :)

3. Relative entropy (or Kullback-Leibler divergence)

The next metric is relative entropy. This is defined in any measurable space, and the definition is as follows.

Also, if we assume countable space, it can be represented below.

Relative entropy is also called information distance because it was first introduced in information theory. For example, let’s say there is an answer p distribution that you do not know. To predict this, we define the q distribution and calculate the difference between the p distribution and the q distribution. This is called relative entropy because it uses the entropy concept of shannon to calculate the difference in distribution. Shannon’s entropy equation is like this.

Therefore, the KL divergence is minimized when the ratio of the values ​​in the log is equal in the KL divergence. Also, because KLD has a fractional part of the log, it can have an infinite value. Therefore, the relative entropy assumes values in [0,∞] (One of the reasons for the mode collapse in the vanilla GAN learning process is that the variance of the loss function is very large. GAN uses JSD-based loss function.)

But KL divergence is not a metric. Because it can not satisfy symmetry property, one of the properties of the metric. However, KL divergence has important properties including additivity over marginals of product measures. (Cover and Thomas 1991, Reiss 1989, p. 100)

4. Separation distance

The next metric is the most widely used relative entropy. This is defined in the countable space, and the definition is as follows.

This is also not a metric because this distance also does not satisfy symmetry. And it has a range of [0, 1]. This is typically used for analysis of Markov chains in uniform time series.

I will finish part 1 so far. In Part 2, we will look at the relationship between the other probability distance.

#### Reference

A. L. Gibbs and F. Edward SU. (2002). On choosing and bounding probability metrics. https://www.math.hmc.edu/~su/papers.dir/metrics.pdf