Original article was published on Artificial Intelligence on Medium
Essentials Of Machine learning — with zero programming
We’ve said a few times that machine learning sits at the intersection of mathematics, statistics and computer science. For some, though, those three areas are kryptonite enough to keep them well away from the whole thing, and that’s a pity because ML can be successfully applied to many different fields. If you’ve decided coding isn’t your thing, it doesn’t have to mean giving machine learning a wide berth. This month, we look at a programming-free way to implement machine learning techniques and how it can be used in email spam detection.
WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS
While much of machine learning does rely on having good coding skills (particularly Python or R), there is an alternative developed by the University of Waikato in New Zealand called ‘Waikato Environment for Knowledge Analysis’ or ‘Weka’ for short. It’s a graphical user interface (GUI)-based app that runs on Windows, macOS and Linux and allows you to execute machine-learning algorithms on data without coding. However, before we jump in, it’ll help to explain how and when you use machine-learning.
AI, ML OR DM?
Most people have a handle on ‘artificial intelligence’; if not, they’ve most likely at least heard of the concept. Mention ‘machine learning’ and the number of blank stares begins to rise. Drop the term ‘data mining’ and it skyrockets. Unfortunately, ‘AI’ has become the media-fuelled umbrella term for everything involving computational techniques, when in fact, the terms ‘machine learning’ and ‘data mining’ are arguably more accurate in describing how and where much of this stuff is applied.
Just briefly, machine learning is the nuts-and-bolts method of using computational algorithms to find patterns within a set of data or ‘dataset’. Data mining, however, is the broader field that seeks information and knowledge from data. In fact, data mining is also described as ‘knowledge discovery in databases’ (KDD). In other words, the way you mine data for knowledge is to use machine learning.
NO MACHINE LEARNING IS PERFECT
However, the important thing to remember is that, despite the hype, machine learning isn’t perfect. There is no one machine learning algorithm that can perfectly identify every pattern with 100% accuracy in every dataset. It hasn’t debuted here yet, but there’s been plenty of debate in the U.K. during the last couple of years over start-up Babylon Health’s ‘GP at Hand’ service using ‘artificial intelligence’ to perform medical triage on patients, releasing ‘live’ doctors to see more acutely-ill patients. The machine-learning chatbot behind the ‘GP at Hand’ app was reportedly tested by Babylon Health on questions similar to those used in the Royal College of General Practitioners membership exam. While human doctors are said to score around 72% on average, the chatbot claimed a score of 81% on its first attempt (tinyurl.com/y36bxnht). Depending on which side of the ‘AI doctor’ divide you stand, that 81% either excites or scares you. In our context, the key point is the score wasn’t 100%. Because life rarely is.
A WORKING EXAMPLE
The University of California, Irvine (UCI) houses arguably the world’s most well-known dataset archive, used in countless academic research papers, from machine learning to health research. You’ll find it at https://archive.ics.uci.edu/ml/datasets.php. One of the 450-plus datasets in the archive is the ‘spambase’ dataset. This dataset, created by Hewlett-Packard back in 1999, contains 4,601 records of emails, each with 57 features or ‘attributes’ defining some aspect of the email and an overall ‘class’ attribute of whether the email was spam (1) or not (0). The first 48 attributes look at different word frequencies (how often a particular word occurs in the email, expressed as a percentage of all the words in it), the next six attributes measure specific character frequencies in the same way, while the last three look at the average, maximum and total sequence length of capitalised letters.
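For the curious, the column layout just described can be sketched in a few lines of Python. This is only an illustration of the 48 + 6 + 3 + 1 structure: the function name `split_record` is hypothetical, and the example record is made up rather than taken from the dataset.

```python
# Column layout of one Spambase record, as described in the text:
# 48 word frequencies, 6 character frequencies, 3 capital-run statistics, 1 class.
N_WORD, N_CHAR, N_CAPS = 48, 6, 3

def split_record(line):
    """Split one comma-separated record into its four groups of values."""
    values = [float(v) for v in line.strip().split(",")]
    expected = N_WORD + N_CHAR + N_CAPS + 1
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    word_freqs = values[:N_WORD]
    char_freqs = values[N_WORD:N_WORD + N_CHAR]
    capital_runs = values[N_WORD + N_CHAR:N_WORD + N_CHAR + N_CAPS]
    is_spam = int(values[-1]) == 1  # class attribute: 1 = spam, 0 = not spam
    return word_freqs, char_freqs, capital_runs, is_spam

# Made-up record for illustration only (not a real row from the dataset).
example = ",".join(["0"] * 48 + ["0.1"] * 6 + ["3.5", "40", "120", "1"])
words, chars, caps, spam = split_record(example)
print(len(words), len(chars), len(caps), spam)
```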
What we’ll do is use machine learning to see if we can find patterns within those 57 attributes that help determine whether an email is spam or not.
RUNNING YOUR OWN ML
Now you don’t need a fancy system in order to implement machine learning. Any PC or laptop running Windows, macOS or Linux will do. The only thing is the older the system, the slower it runs (no surprises there).
Start by heading to www.cs.waikato.ac.nz/ml/weka/downloading.html and downloading the latest ‘stable version’ (at time of writing, this was version 3.8.3) for your operating system. We’ll work with the Windows version, but the others should be similar. Weka requires Java, so either Java 8 or 9 is recommended. If you don’t have Java, just choose a Weka download that includes the Java 1.8 virtual machine.
Once you’ve installed it, head to the UCI Machine Learning Repository website page for the Spambase dataset at https://archive.ics.uci.edu/ml/datasets/Spambase. Click on the ‘Data Folder’ link and download the files ‘spambase.data’ and ‘spambase.names’.
Now that we’ve got the data, we need to knock it into shape, a step often called ‘data preprocessing’. Weka handles data in ARFF and CSV (comma-separated values) format. However, while the ‘spambase.data’ file is CSV-ready, it lacks a header row containing the attribute descriptors or ‘names’. That’s where the ‘spambase.names’ file comes in. What we’d normally do now is create the header row by copying the attribute names from the ‘spambase.names’ file into the ‘spambase.data’ file, separating each attribute name with a comma (,). However, for the sake of brevity, we’ll take a short-cut: head over to the OpenML website at https://www.openml.org/d/44 and you can download a ready-to-go Attribute-Relation File Format (ARFF) version instead.
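If you did want to take the manual route, building the header row could look something like the Python sketch below. It assumes the attribute lines in ‘spambase.names’ take the form `name: type.` and that anything after a `|` is a comment; check your copy of the file before relying on it. The function names and the abbreviated sample text are hypothetical, for illustration only.

```python
def header_from_names(names_text, class_name="class"):
    """Build a comma-separated header row from the text of a .names file.

    Assumption: attribute lines look like 'word_freq_make:   continuous.'
    and everything after a '|' on a line is a comment.
    """
    names = []
    for line in names_text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank lines, class list and so on
        names.append(line.split(":", 1)[0].strip())
    names.append(class_name)  # the data file's last column is the class
    return ",".join(names)

def add_header(names_text, data_text):
    """Return the data with the header row placed on top."""
    return header_from_names(names_text) + "\n" + data_text

# Abbreviated, made-up sample in the assumed format (the real file has 57 attributes).
sample_names = """| example comment line
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
char_freq_!:            continuous.
capital_run_length_total:   continuous.
"""
sample_data = "0.1,0.5,120,1\n0,0,9,0\n"

header = header_from_names(sample_names)
csv_text = add_header(sample_names, sample_data)
print(csv_text)
```

With the real files, you would read both into strings with `open(...).read()` and write the combined result to a new `.csv` file for Weka to open.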
Once you’ve downloaded the ARFF dataset, fire up Weka and you’ll get the Weka GUI Chooser panel. Click on the Explorer button to launch the Weka Explorer. Next, click the Open File button just under the Preprocess tab and load the ‘dataset_44_spambase.arff’ file. Shortly after, you’ll see the list of attributes on the left. Click on one and you’ll get brief details on the right-side panel. Scroll down to the last attribute, ‘class’ — this is the attribute that describes whether or not each email record was classified as ‘spam’ (1, red) or ‘not’ (0, blue). Of the 4,601 records, 1,813 were classified as spam and the remaining 2,788 were not.
Because each record already has a class value, what we’re doing is called ‘supervised learning’ — we’re looking for the patterns in the attributes that match which email records are spam and which are not. The next step is to select the ‘Classify’ tab at the top of the screen, then press the ‘Choose’ button. Weka comes with a wide array of classification, clustering and association rule mining algorithms built-in. The new subpanel should open up with the ‘Rules’ sub-list showing. Select ‘JRip’. JRip is a Java language version of the ‘Repeated Incremental Pruning to Produce Error Reduction’ or RIPPER algorithm and we’ll use it to discover a list of ‘decision rules’ by which we can classify the 4,601 email records as ‘spam’ or ‘not spam’. When you’re ready, press the ‘Start’ button and Weka will get cracking. The first step — discovering the rules — should take just a couple of seconds. The second step — performing what’s called ‘ten-fold cross-validation’ — will take a little longer.
When it finishes, you’ll get a results summary that shows a percentage for ‘correctly classified instances’ of 92.393%. What this is saying is that, under cross-validation, the rules produced by the JRip algorithm correctly classify an average of just over 92 email records out of every 100, which is actually pretty darn good.
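To give a feel for what ‘ten-fold cross-validation’ involves, here is a minimal Python sketch of the general idea: split the records into ten groups, train on nine, test on the one left out, and repeat until every group has been the test set once. It is not Weka’s implementation (which may differ in details such as how records are shuffled and balanced across folds), and all the function names are hypothetical. As a stand-in learner it uses the simplest possible ‘model’, one that always predicts the most common class, fed with the class counts quoted earlier (1,813 spam, 2,788 not).

```python
import random

def k_fold_indices(n_records, k=10, seed=1):
    """Shuffle the record indices and deal them into k roughly equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(records, labels, train, predict, k=10, seed=1):
    """Return the percentage of records classified correctly across all folds."""
    correct = 0
    for test_idx in k_fold_indices(len(records), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        # Train only on the records that are NOT in the current test fold.
        model = train([records[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(predict(model, records[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(records)

# Stand-in learner: always predict the majority class seen in training.
def train_majority(train_records, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, record):
    return model

# Class counts from the article; the attribute values are irrelevant to this learner.
labels = [1] * 1813 + [0] * 2788
records = [None] * len(labels)
accuracy = cross_validate(records, labels, train_majority, predict_majority)
print(f"{accuracy:.1f}%")
```

The majority-class baseline only gets the 2,788 non-spam records right, or about 60.6%, which puts JRip’s 92.393% in perspective.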
WHAT ARE ‘DECISION RULES’?
Ever played around with the web service ‘If This Then That’ (ifttt.com)? It allows you to create conditions or ‘rules’ by which you can set activities to occur, such as automatically lighting the path for the pizza guy when he comes to drop off your pizza. It uses the simple ‘IF <condition> THEN <response>’ rule scheme. Decision Rules work the same way, except that the <condition> can be multiple attributes having certain values or ranges of values and the <response> is the appropriate class value, which, in our case, is spam (1) or not spam (0).
Scroll back up the Weka output screen until you see the ‘JRIP rules:’ header. JRip found 17 rules for determining the classification of the 4,601 email records. The first rule is:
(char_freq_%21 >= 0.079) and (char_freq_%24 >= 0.013) and (capital_run_length_longest >= 43) and (char_freq_%23 >= 0.008) => class=1 (337.0/0.0)
This rule says that if the frequency of character ‘%21’ (exclamation mark, !) is greater than or equal to 0.079, the frequency of character ‘%24’ (dollar sign, $) is greater than or equal to 0.013, the longest run of capitalised letters is 43 or more and the frequency of character ‘%23’ (hash, #) is greater than or equal to 0.008, then the email record is considered ‘spam’ (class=1). The (337.0/0.0) at the end indicates there were 337 records that had this combination of attributes and values, with zero cases where the class was not ‘spam’. We don’t have the space to follow up all remaining 16 rules, but each one can be applied to each record in the same way and in 92.393% of records, these rules will get you the right answer.
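Written out as code, that first rule is nothing more than four comparisons joined by ‘and’. The Python sketch below uses the thresholds from the rule itself; the function name and the two example records are made up for illustration.

```python
def first_jrip_rule(record):
    """Return True if the first JRip rule fires for this record (class=1, spam)."""
    return (
        record["char_freq_%21"] >= 0.079                 # exclamation mark, !
        and record["char_freq_%24"] >= 0.013             # dollar sign, $
        and record["capital_run_length_longest"] >= 43   # longest run of capitals
        and record["char_freq_%23"] >= 0.008             # hash, #
    )

# Two made-up records for illustration only.
shouty = {"char_freq_%21": 0.5, "char_freq_%24": 0.2,
          "capital_run_length_longest": 60, "char_freq_%23": 0.01}
quiet = {"char_freq_%21": 0.0, "char_freq_%24": 0.0,
         "capital_run_length_longest": 3, "char_freq_%23": 0.0}

print(first_jrip_rule(shouty), first_jrip_rule(quiet))
```

A `False` here only means this particular rule did not fire, not that the email is clean; the record would then be checked against the remaining rules in turn.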
From here, if you were into coding, you could use these decision rules as the basis of a (very basic) spam filter using a simple form of ‘natural language processing’ (NLP). You process each email, counting up the particular words and characters, then feed the results into the rules. The class value the rules suggest then gives you the answer to whether the email was spam or not.
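The counting step might be sketched like this in Python. It covers only the handful of attributes used in the first rule and assumes that a character frequency is the percentage of all characters in the email that match, and that a ‘capital run’ is an unbroken sequence of the letters A to Z; check these definitions against the ‘spambase.names’ file before relying on them. The function name is hypothetical.

```python
import re

def extract_features(email_text):
    """Compute a few Spambase-style attributes from raw email text.

    Assumptions: character frequencies are percentages of all characters,
    and a capital run is an unbroken sequence of the letters A-Z.
    """
    total_chars = len(email_text)

    def char_freq(ch):
        if total_chars == 0:
            return 0.0
        return 100.0 * email_text.count(ch) / total_chars

    runs = [len(run) for run in re.findall(r"[A-Z]+", email_text)]
    return {
        "char_freq_%21": char_freq("!"),
        "char_freq_%24": char_freq("$"),
        "char_freq_%23": char_freq("#"),
        "capital_run_length_longest": max(runs, default=0),
        "capital_run_length_total": sum(runs),
    }

features = extract_features("HELLO!! $a")
print(features)
```

The resulting dictionary uses the same attribute names as the Weka output, so it could be passed to code that checks the decision rules one by one.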
Now all that said, this dataset dates back to 1999, so it’s pretty long in the tooth. However, it does show, albeit in a fairly simplistic way, that machine learning can be applied to almost any field, including the identification of spam email.