Source: Deep Learning on Medium

## Machine Learning & Data Science — ambrosia for noobs.

This article was supposed to be short. But there is no way to talk about “machine learning” without knowing the vocabulary used in the discussion. We will start in the traditional way: we will go through a list of the most commonly used data mining techniques and then come to real-life examples. The subject of “data mining” itself is so broad that one could write at least a few pages about it. For our needs, let us assume that it simply means working with data. And in this work, we usually use the techniques listed below in the “Data Mining Techniques” section.

Having a general idea of what stands behind these terms, we will move on to applying this knowledge in practice. In the “machine learning on examples” part, there will be specific cases of using algorithms to solve real-life problems and to answer questions frequently asked in business.

### Data Mining Techniques

### 1. Prediction

Classification and regression are data mining techniques used to solve similar problems. Both are used in predictive analysis, but regression predicts a numeric or continuous value, while classification uses labels to assign data to separate categories (classes).

**Classification**

Classification is the assignment of an object to a particular class based on its similarity to previous examples of other objects. Usually the classes are mutually exclusive. An example of a classification question would be “*Which of our clients will respond to our offer?*”, with two classes: “will react to the offer” and “will reject the offer”.

Another example of a classification model is credit risk. It could be developed on the basis of data observed for loan applicants over a period of time. We can track employment history, house ownership or renting, length of residence, type of investment and so on. The target classes would be credit ratings, e.g. “low” and “high”.

We will wisely call the attributes (e.g. employment history) the “predictors” (or “independent variables”) and the target variables the “dependent variables” or simply “classes”. Classification belongs to the “supervised” methods; to learn what supervised methods are, see “*Supervised methods and unsupervised methods*” below.

**Regression**

Regression is “value estimation”: it determines the relation between different quantities and, on this basis, attempts to estimate (“predict”) unknown values. For example: if we know the company’s turnover from the previous year, month after month, and we know the advertising expenses in each of those months, we can estimate the revenue that would follow from spending a given amount on advertising in the next year.

Another question we can answer using regression is “*How often will a given customer use the service?*”

### 2. Co-occurrence and associations

Discovering associations means finding links between objects based on the transactions those objects appear in. Why do we want to find such co-occurrences? Association rules will tell us things like “*Customers who bought a new eWatch also bought a Bluetooth speaker*”. Sequence pattern search algorithms can suggest how a service action or customer service process should be organized.

**Association**

Association is the method of discovering interesting dependencies or correlations, commonly referred to as associations, between items in data sets. Relations between elements are expressed as association rules. Association rules are often used to analyze sales transactions. For example, you may discover that “*Customers who buy cereal in a grocery store often buy milk at the same time*” (eureka!).

For example, an association rule can be used in e-commerce to personalize websites. An associative model can discover that “a person who visited pages A and B will also visit page C in the same session with a probability of 70%”. Based on this rule we can create a dynamic link for users who might be interested in page C.

**Searching for patterns**

Searching for patterns, more precisely sequential patterns, means searching for ordered sequences: the order of the elements matters. Found patterns are presented in order of “support”, i.e. the frequency of occurrence of a given pattern relative to the number of considered transactions.

### 3. Clustering

Clustering is the grouping of objects with similar properties. As a result of this operation a cluster (class) is created. Clustering can answer the question “*do our clients form groups or segments?*” and, consequently, “*what should our customer service teams (or sales teams) look like to adapt to them?*”.

Clustering, like classification, is used to segment the data. In contrast to classification, however, clustering divides the data into groups that were not previously defined.

Clustering belongs to the “unsupervised” methods; to learn what unsupervised methods are, see “Supervised methods and unsupervised methods” below.

### The examples of Machine Learning

With the theory and example algorithms in mind, let’s move on and apply them in practice. We have specific problems to solve, and we will use the techniques described in the previous section.

### Classification

*“Which customers can leave us in the near future?”*

The algorithm we can use to try to find the answer to this question is J48, a successor of C4.5 (itself a successor of ID3). The algorithm uses the concept of entropy; in short: “how many questions are necessary to get to the information?”. It belongs to the family of “decision tree” algorithms. There are at least 835463248965 articles on the Internet about these algorithms and decision trees… of which 2/4 are written by people who do not know how they work and 1/4 by scientists who have devoted themselves to encrypting this information.
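The entropy idea behind such trees can be sketched in a few lines of Python (a toy illustration, not Weka’s actual implementation): the purer a set of class labels, the fewer bits of information, i.e. the fewer “questions”, are needed to describe it, and a decision tree picks the attribute whose split reduces entropy the most.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

# A 50/50 mix of "left"/"stayed" clients needs a full bit of information...
print(entropy(["left", "stayed", "left", "stayed"]))  # 1.0
# ...while a perfectly pure set needs none.
print(entropy(["stayed"] * 4) == 0.0)  # True
```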

To try to answer the question of which clients may leave, we will need historical data as a behavioral pattern (where we know which clients have left us). Analyzing this information, attribute by attribute, we will create a model. This model will then be used to analyze the data of customers who are still with us (current data). The last column in the table below is the class (the answer we will look for in the data).

Historical data:

Current data:

Which software will do this task for us? Unfortunately, we will not do it in Excel (although there are some experimental solutions for this program). If we could do it there, the whole spell of “machine learning” would evaporate in one moment. Among the few others, there are two programs that are great for this. The first is Weka. Created by the academic community, it offers an adequate interface. It is particularly suitable for experimenting with the selection of attributes that will, in the end, assign our clients to the appropriate class; by experimenting we can work out a model that we then apply to the current data. We can use the built model in the second program, Pentaho Data Integration. The advantage of Pentaho (PDI) is its logical interface and its ability to process large amounts of data. PDI uses Weka libraries and is also able to apply the model to production data.

### Regression

*“What turnover will we achieve by investing amount X in advertising?”*

We need two variables to calculate a regression: a predictor (independent) variable and a target (dependent) variable.

Historical data:

After the calculation we have a ready-to-use formula:

*y = 10.788 * x + 368.848*

where x is the planned advertising spend and y is the expected revenue. The green line in the chart is the regression line. The linear model assumes that the relation between the variables can be summarized with a straight line; the line indicates the relation between x and y.

Regression can be calculated in Excel or with a simple online calculator. Thanks to Pentaho, we can automate the regression calculation and make it part of a data analysis process.

The example shows a simple linear regression. Real cases often use multiple regression with multiple predictor variables or non-linear regression.
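For the curious, simple linear regression can also be computed by hand with the ordinary least squares formulas. A minimal Python sketch, using invented monthly figures (not the data behind the formula above):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx  # the line passes through the point of means
    return a, b

# Hypothetical figures: monthly ad spend vs. revenue (in thousands)
ad_spend = [10, 20, 30, 40]
revenue = [480, 590, 700, 810]
a, b = linear_fit(ad_spend, revenue)
print(f"y = {a:.3f} * x + {b:.3f}")  # y = 11.000 * x + 370.000
print(a * 50 + b)  # forecast for spending 50: 920.0
```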

### Association

*“What is the most common product basket?”*

We can also ask a similar question: “*What products are usually bought with hamburgers?*” We want to learn the correlations that occur in customers’ purchases and get to know their shopping patterns. Let’s assume that we have a list of customer transactions:

*Best rules found:*

*1. burgers=y 4 ==> potatos=y 4 <conf:(1)> lift:(1.17) lev:(0.08) [0] conv:(0.57)*

*2. onion=y 3 ==> potatos=y 3 <conf:(1)> lift:(1.17) lev:(0.06) [0] conv:(0.43)*

*3. onion=y burgers=y 2 ==> potatos=y 2 <conf:(1)> lift:(1.17) lev:(0.04) [0] conv:(0.29)*

*4. burgers=y milk=y 2 ==> potatos=y 2 <conf:(1)> lift:(1.17) lev:(0.04) [0] conv:(0.29)*

*5. burgers=y beer=y 2 ==> potatos=y 2 <conf:(1)> lift:(1.17) lev:(0.04) [0] conv:(0.29)*

*6. potatos=y beer=y 2 ==> burgers=y 2 <conf:(1)> lift:(1.75) lev:(0.12) [0] conv:(0.86)*

*7. onion=y milk=y 1 ==> potatos=y 1 <conf:(1)> lift:(1.17) lev:(0.02) [0] conv:(0.14)*

*8. onion=y beer=y 1 ==> potatos=y 1 <conf:(1)> lift:(1.17) lev:(0.02) [0] conv:(0.14)*

*9. onion=y beer=y 1 ==> burgers=y 1 <conf:(1)> lift:(1.75) lev:(0.06) [0] conv:(0.43)*

*10. onion=y burgers=y beer=y 1 ==> potatos=y 1 <conf:(1)> lift:(1.17) lev:(0.02) [0] conv:(0.14)*

How do we interpret the results? Ten rules were found in the data (the program’s default settings). Let’s explain the first line, the pair “burgers ==> potatos”. Burgers were found in 4 customer transactions (4 customer baskets); this number is called the “support”. The 4 next to potatos means that there were also 4 transactions containing the whole pair “burgers ==> potatos”; this is again called “support” or “coverage” (the coverage of the whole rule).

The number in brackets after “conf:” is the “confidence” (trust). Confidence expresses the probability that the rule holds: here we are 100% sure that a customer who bought burgers also bought potatos.

Confidence comes from the formula:

*confidence = support of the whole rule (the second number, the count of rule occurrences) / support of the antecedent (the first number)*

The Apriori algorithm provides us with unordered sets of elements (without a specific sequence).
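Support and confidence are easy to compute by hand. A toy Python sketch, with baskets invented to loosely mirror the rules above (not the actual data set behind the Weka output):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule: antecedent ==> consequent."""
    ant = set(antecedent)
    both = ant | set(consequent)
    n_ant = sum(1 for t in transactions if ant <= t)    # baskets with antecedent
    n_both = sum(1 for t in transactions if both <= t)  # baskets with the whole rule
    return n_both, n_both / n_ant  # (rule support, confidence)

baskets = [
    {"burgers", "potatos", "milk"},
    {"burgers", "potatos", "beer"},
    {"burgers", "potatos", "onion"},
    {"burgers", "potatos"},
    {"onion", "potatos"},
]
support, conf = rule_metrics(baskets, {"burgers"}, {"potatos"})
print(support, conf)  # 4 1.0 -- every basket with burgers also has potatos
```

Note that the reversed rule “potatos ==> burgers” has the same rule support (4) but a lower confidence (4/5 = 0.8), because potatos appear in one basket without burgers.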

### Searching for patterns

*“What product will a customer who bought product X buy next?”*

The discovery of sequence patterns involves analyzing a database of events that occurred over a given period of time in order to find relationships between the occurrences of specific events over time. An example of a sequence pattern is customer purchases. The purchases included in a sequence pattern do not have to occur directly one after another; they can be separated by other purchases. This means that a customer may purchase other products between product X and product Y, yet the sequence still describes the typical behavior of most customers.

Let’s assume that the purchases of our customers look as follows:

Using the Weka program and applying the GeneralizedSequentialPatterns algorithm, we get the result:

*Frequent Sequences Details (filtered):*

*– 1-sequences*

*[1] <{coffee}> (3)*

*[2] <{milk}> (2)*

*[3] <{sugar}> (2)*

*[4] <{pasta}> (2)*

*– 2-sequences*

*[1] <{coffee}{coffee}> (2)*

*[2] <{coffee}{pasta}> (2)*

*[3] <{milk}{coffee}> (2)*

*[4] <{milk}{sugar}> (2)*

*[5] <{sugar}{coffee}> (2)*

*– 3-sequences*

*[1] <{milk}{sugar}{coffee}> (2)*

How do we interpret the program’s output? The “X-sequences” are groups of sequences that meet the calculation criteria (the minimum support setting, “minSupport”, that results must satisfy): single sequences, pairs, triples…

The found patterns are presented in the order of “support”, i.e. the frequency of the occurrence of a given pattern in the set of elements in relation to the number of considered transactions.
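Counting the support of a sequential pattern by hand is straightforward; a small Python sketch with invented purchase histories (the key point, as described above, is that gaps between matching items are allowed):

```python
def contains(sequence, pattern):
    """True if pattern's items occur in sequence in the same order, gaps allowed."""
    it = iter(sequence)
    # Membership tests on an iterator consume it, so each item must
    # appear *after* the previous match.
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Number of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences)

# Hypothetical purchase histories, one list per customer, oldest first
purchases = [
    ["milk", "bread", "sugar", "coffee"],  # "bread" is a gap; still a match
    ["milk", "sugar", "coffee", "pasta"],
    ["coffee", "pasta"],
]
print(support(["milk", "sugar", "coffee"], purchases))  # 2
```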

### Clustering

*“What segments do our clients create?”*

Let’s assume that our clients have the following attributes and features:

We would like to divide them into two groups (two clusters) to target them more accurately with our offer or to adjust the sales team to handle their specific needs. We will do it in Weka using the k-means algorithm. How do we interpret the results?

| Attribute | Full Data (7.0) | Cluster 0 (5.0) | Cluster 1 (2.0) |
| --- | --- | --- | --- |
| age | 29.8571 | 33.8 | 20 |
| Marital status | married | married | single |
| Property | house | flat | house |
| Education | elementary | high | elementary |

We obtained two clusters (per the program settings; we can request more), labeled 0 and 1. In cluster 0 the average age is about 34 years, the typical client is married, lives in a flat and has “high” education. There are five records in this cluster. In cluster 1 the average age is 20, the typical client is single, owns a house and has elementary education. The “Full Data” column shows the averages over all instances.

The clustering used here is based on an algorithm that uses the arithmetic mean to calculate the distances of individual features in clusters.
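This assign-to-nearest-centroid, recompute-the-mean loop can be sketched in plain Python (a toy illustration for numeric attributes only, not Weka’s SimpleKMeans, and the client data below is invented):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the arithmetic mean of its cluster; repeat."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Hypothetical clients as (age, income in thousands)
clients = [(34, 70), (33, 68), (35, 75), (20, 20), (21, 22)]
clusters = kmeans(clients, k=2)
print(sorted(len(c) for c in clusters))  # [2, 3] -- young vs. older clients
```

Categorical attributes such as marital status cannot be averaged this way; tools like Weka handle them separately (e.g. by modes), which is why the cluster table above shows labels rather than means for those rows.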

The example has been deliberately limited to several records to better illustrate clustering. Actual grouping is made on thousands and more records. With such a wide sample, the data can be visualized to give a clear picture of our groups.

### Supervised methods and unsupervised methods

These are also known as “supervised learning” and “unsupervised learning”. In supervised learning, we set a specific goal; we expect a certain kind of result. Example:

*“Can we find groups of customers who have a particularly high probability of canceling their services shortly after the expiration of their contracts? “*

Or:

*“Let’s divide the clients by the risk of insolvency: small, medium, large.”*

Examples of supervised methods are classification and regression. The algorithms used here are often decision trees, logistic regression, random forests, support vector machines and k-nearest neighbors.

In unsupervised learning, we do not set a specific goal; we do not expect a specific target result. The questions asked here are, for example:

*“Do our clients create different groups?”*

Examples of unsupervised methods are clustering and association.

Metaphorically, a teacher “supervises” the learner by carefully providing target information along with a set of examples. An unsupervised learning task might involve the same set of examples but would not include the target information. The learner would be given no information about the purpose of the learning, but would be left to form its own conclusions about what the examples have in common.

### Algorithms Baby!

The whole mystery of “Machine Learning” was born from the lack of easy access to the functions and algorithms that perform the work described above. The tools have been available on the market for years. What’s more, they are often free! However, to use them you need basic skills in databases, programming, SQL and file parsing, because the data usually needs to be transformed into an appropriate form before you can use it.

All these calculations are possible thanks to appropriate algorithms. Most of them could have been performed a decade or even much earlier (!). The regression algorithm is more than two centuries old (its beginnings date to 1805). The J48 algorithm used for classification has its roots in information entropy, Claude Shannon’s work from 1948. We have even older ideas: k-means groups objects using the Euclidean distance, which derives from ancient Greek geometry.

If somebody were doing the “learning” here, it was certainly not the machines but humans. The computer, as a perfect computing machine, does in a second what a human would do for weeks. There was no new revolution in science; we simply gained access to high-speed computing machines. And if “Machine Learning” is the basis of today’s “artificial intelligence”, what does the latter really look like?

Data mining is a craft. It involves a significant amount of science and technology, but proper application still includes art. No machine can pick the attributes as well as a human does. For example, in retail the attribute “frequency of purchases” may be more reliable than in B2B relations. In the United States there are data mining competitions (GE-NFL Head Health Challenge, GEQuest) and the rewards for solving specific problems of humanity are very high (e.g. 10 million dollars in the GE-NFL Head Health Challenge).