Original article was published on Artificial Intelligence on Medium

# Part I: The Basics

## Goal

Our goal is to classify a text into one of 46 classes. Since this is not a binary classification problem, the last layer will be a densely connected layer of 46 nodes with a `softmax` activation function.

## Input & Output

Our input is a text stored in a string. We will encode it into a one-hot vector. Our output will be a vector of 46 values, where each value represents a probability. Naturally, they all add up to 1, and the highest probability is the network's guess.

We will import the Reuters newswire data set that's part of the `keras` datasets.
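A minimal sketch of loading the data set, assuming TensorFlow's bundled Keras. `num_words=10000` (an illustrative choice, not mandated by the article) keeps only the 10,000 most frequent words:

```python
from tensorflow.keras.datasets import reuters

# restrict the vocabulary to the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    num_words=10000)

print(len(train_data))    # 8982 training samples
print(len(test_data))     # 2246 test samples
print(type(train_data[0]))  # each sample is a list of word-rank integers
```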

## Encoding & Decoding

Encoding the training data means turning our input string into an array of numbers. Each number signifies the rank of a word when all words are ranked by frequency of occurrence in the entire data set.
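One common way to turn those lists of word ranks into fixed-size input vectors is multi-hot encoding: a vector of vocabulary length with a 1 at every rank that appears in the text. A sketch in plain numpy (the function name and `dimension` default are illustrative):

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    """Multi-hot encode: set index i to 1.0 for every word rank i in a sequence."""
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0  # fancy indexing sets all listed positions at once
    return results

# toy example with a vocabulary of 10 words
x = vectorize_sequences([[1, 5, 5, 9]], dimension=10)
# x[0] has 1.0 at positions 1, 5 and 9, zeros elsewhere
```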

We will also encode our training labels into one-hot vectors. The keras data set doesn't provide any label strings, but a very similar data set of Reuters newswires has labels such as:

- `wheat`
- `corn`
- `coffee`
- `nat-gas`
- etc.
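One-hot encoding the integer labels can be sketched in plain numpy (Keras also ships a ready-made `to_categorical` utility that does the same thing):

```python
import numpy as np

def to_one_hot(labels, dimension=46):
    """Turn a list of integer class labels into one-hot row vectors."""
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.0  # a single 1.0 at the class index
    return results

y = to_one_hot([3], dimension=46)
# y[0] is all zeros except a 1.0 at index 3
```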

Decoding the result is simple: the index of the highest probability value in the 46-value vector is the network's best guess.
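In code, decoding is a single `argmax` over the output vector. A sketch with a hypothetical softmax output:

```python
import numpy as np

# hypothetical softmax output over 46 classes (probabilities sum to 1)
prediction = np.zeros(46)
prediction[3] = 0.7
prediction[10] = 0.3

# the predicted class is the index with the highest probability
predicted_class = int(np.argmax(prediction))
```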

## Architecture

We will use a 3-layer, densely connected neural network with no skip connections. The last layer, the output layer, has 46 nodes. This brings up an important point about neural network architectures: generally, the hidden layers leading up to the output layer need at least as many nodes as the output layer, otherwise they act as an information bottleneck.
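The architecture above can be sketched in Keras as follows. The hidden-layer width of 64, the `relu` activations, and the 10,000-dimensional input are illustrative assumptions; the 46-node `softmax` output is from the article:

```python
from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Input(shape=(10000,)),          # multi-hot encoded text vector
    layers.Dense(64, activation='relu'),   # hidden layer, wider than the output
    layers.Dense(64, activation='relu'),   # second hidden layer
    layers.Dense(46, activation='softmax') # one probability per class
])

# the model maps a 10,000-dim input to a 46-dim probability vector
```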

Our optimizer is `rmsprop`. Our loss function is `categorical_crossentropy`, which means our neural network is always trying to minimize the cross-entropy between the actual label data and the network's current best guess.
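Compiling with that optimizer and loss is one call in Keras. A self-contained sketch (the layer sizes here are illustrative; the optimizer and loss are from the article):

```python
from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Input(shape=(10000,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(46, activation='softmax')
])

# rmsprop optimizer and categorical cross-entropy loss, as described above
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```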

## Regularization

Our primary method of regularization so far is early stopping. We do that by plotting the accuracy and loss on the training and validation sets and finding the epoch where the validation loss is lowest.
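Picking that epoch can be sketched as follows. In Keras, `model.fit(...)` returns a `History` object whose `.history` dict holds per-epoch `loss` and `val_loss`; the numbers below are fabricated for illustration:

```python
# hypothetical per-epoch losses, as model.fit(...).history would report them
history = {
    'loss':     [1.8, 1.1, 0.80, 0.60, 0.50, 0.40],  # keeps falling
    'val_loss': [1.5, 1.1, 0.95, 0.90, 0.93, 1.00],  # bottoms out, then rises
}

# epoch (1-indexed) with the lowest validation loss = where to stop training
best_epoch = min(range(len(history['val_loss'])),
                 key=lambda i: history['val_loss'][i]) + 1
```

Keras can also automate this with the `keras.callbacks.EarlyStopping` callback, which monitors `val_loss` and halts training when it stops improving.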