Source: Deep Learning on Medium

# Breast Cancer diagnosis using Deep Learning & The use of ln(Natural Logarithm) in selecting the depth and the width of the hidden layers for baseline models

**Introduction:**

I came across a bone marrow microscopy sample report in PDF format. It contains many categorical and numerical data items from which a diagnosis is made. To get hands-on experience with a deep learning project on similar data, I chose the Breast Cancer Wisconsin (Diagnostic) Data Set [1] from the University of California, Irvine, Machine Learning Repository.

A single row of the breast cancer data in wdbc.data [2] looks like this:

842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189

The first column (842302) in the row above is the patient ID. The second (M) is the diagnosis, which is either M for malignant or B for benign. The remaining 30 values are features. The file wdbc.names [3] has further information.

**Observations:**

**Procedure:**

The following steps are done to get to a baseline model.

Step 1:

- Read the data file, which is in CSV format
- Treat each line of the CSV as one patient's record and each comma-separated value in it as a column
- Label the columns appropriately
- Convert the diagnosis to a numeric value so that all the data is numeric
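The steps above could be sketched with pandas roughly as follows. The column names (`radius_mean`, `area_worst`, and so on) and the helper `load_wdbc` are assumptions made for illustration; wdbc.names only describes ten measurements, each reported as a mean, a standard error, and a worst value. The mapping M to 1 and B to 0 is likewise an assumed encoding. The only data used is the sample row quoted earlier.

```python
import io

import pandas as pd

# Ten measurements described in wdbc.names; the file lists all ten means,
# then all ten standard errors, then all ten "worst" values.
MEASURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Assumed naming scheme: <measure>_<statistic>
FEATURE_COLUMNS = [f"{m}_{stat}" for stat in ("mean", "se", "worst") for m in MEASURES]
COLUMNS = ["id", "diagnosis"] + FEATURE_COLUMNS


def load_wdbc(source):
    """Read wdbc.data (no header row) and encode the diagnosis numerically."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # Assumed encoding: malignant -> 1, benign -> 0
    df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
    return df


# The single row shown in the article, used here instead of the full file.
SAMPLE_ROW = (
    "842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,"
    "0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,"
    "0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,"
    "0.4601,0.1189\n"
)

df = load_wdbc(io.StringIO(SAMPLE_ROW))
print(df[["id", "diagnosis", "radius_mean", "area_worst"]])
```

Passing a path to a local copy of wdbc.data instead of the `StringIO` object would load the full data set in the same way.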

Step 2:

- Separate features and output(diagnosis)
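A minimal sketch of this separation, assuming the data sits in a pandas DataFrame with an `id` column and a numeric `diagnosis` column (both names are assumptions, as is the helper `split_features_target`). The toy values below are made up and only show the shapes involved.

```python
import pandas as pd


def split_features_target(df, id_col="id", target_col="diagnosis"):
    """Return (X, y): the feature matrix without ID/diagnosis, and the diagnosis vector."""
    X = df.drop(columns=[id_col, target_col]).to_numpy(dtype=float)
    y = df[target_col].to_numpy(dtype=int)
    return X, y


# Made-up toy frame: three patients, two features
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": [1, 0, 0],
    "feature_a": [17.9, 11.2, 12.5],
    "feature_b": [10.4, 14.1, 15.0],
})

X, y = split_features_target(toy)
print(X.shape, y.shape)
```

The patient ID is dropped along with the diagnosis because it identifies a record rather than describing the tumour.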

Step 3:

- Split the data into training and test sets
- Standardize the data sets so that the model can learn properly from the features
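One way to carry out this step with scikit-learn is sketched below. The 80/20 split, the stratification, and the fixed `random_state` are assumptions, since the article does not state them. To avoid a download, the sketch uses `load_breast_cancer`, the copy of the same diagnostic data set bundled with scikit-learn; note that it encodes malignant as 0 and benign as 1. Fitting the scaler on the training set only is a common precaution so that no test-set statistics reach the model; the article does not say which variant was used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 569 samples, 30 features
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 20% held out for testing, class proportions preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Learn mean and standard deviation from the training set only,
# then apply the same transformation to both sets.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```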

Step 4:

- An input layer with NUM_FEATURES=30 units and an output layer with 1 unit are used.
- Many configurations with different numbers of units and layers are tried. The number of layers and the number of units per layer for a given total number of units are calculated by analogy with hierarchical routing in computer networks, where an N-router network achieves optimality with *ln N* levels (ln is the natural logarithm) [4]. This choice has no other theoretical or empirical justification in deep learning.

So, for a given total number of units, *total_units*:

*num_layers = ln total_units*

*num_units_per_layer = total_units / num_layers*
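The two formulas can be written as a small helper. How the article rounds the results is not stated, so truncating the layer count and using integer division for the width are assumptions, as is the name `ln_architecture`. Under these assumptions, 1288 total units would give 7 layers of 184 units, the shape the article later reports for its chosen model, though the article does not state the total it used.

```python
import math


def ln_architecture(total_units):
    """Return (num_layers, num_units_per_layer) for a budget of hidden units.

    num_layers          = ln(total_units), truncated to an integer (assumption)
    num_units_per_layer = total_units / num_layers, integer division (assumption)
    """
    if total_units < 3:
        # ln(total_units) < 1 would truncate to zero layers
        raise ValueError("total_units must be at least 3")
    num_layers = int(math.log(total_units))
    num_units_per_layer = total_units // num_layers
    return num_layers, num_units_per_layer


for total in (100, 500, 1288):
    layers, width = ln_architecture(total)
    print(f"total_units={total}: {layers} layers x {width} units")
```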

- A few configurations are shown below:

- The average accuracy in 20 runs is calculated for unit sequences generated by the code:

and the top performing nine from a sequence of 100 units are shown below:

- All the layers use ‘relu’ activation function except the output layer, which uses ‘sigmoid’ activation function
- The model uses ‘binary_crossentropy’ as the loss function and ‘adam’ as optimizer
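The article builds its models with Keras. The sketch below swaps in scikit-learn's `MLPClassifier` purely to keep the example short: with `activation="relu"` and `solver="adam"` it uses ReLU hidden layers, and for a two-class problem it applies a logistic (sigmoid) output and minimizes log loss, which is binary cross-entropy. It is therefore a stand-in for, not a copy of, the author's model. The function name `average_accuracy`, the fresh split per run, `max_iter=500`, and the 4 x 25 configuration (roughly what the ln rule gives for 100 total units if the layer count is rounded down) are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def average_accuracy(num_layers, units_per_layer, n_runs=20, test_size=0.2):
    """Mean test accuracy over n_runs, each with its own split and initialization."""
    X, y = load_breast_cancer(return_X_y=True)
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        scaler = StandardScaler().fit(X_train)
        model = MLPClassifier(
            hidden_layer_sizes=(units_per_layer,) * num_layers,
            activation="relu",   # hidden layers
            solver="adam",       # optimizer
            max_iter=500,        # assumed training budget
            random_state=seed,
        )
        model.fit(scaler.transform(X_train), y_train)
        scores.append(model.score(scaler.transform(X_test), y_test))
    return float(np.mean(scores))


# Three runs keep the demo quick; the article averages over 20.
mean_acc = average_accuracy(num_layers=4, units_per_layer=25, n_runs=3)
print(f"mean accuracy over 3 runs: {mean_acc:.3f}")
```

Averaging over several runs matters because a single run's accuracy depends on the random split and the random weight initialization.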

Step 5:

- Three models are built, and the ROC (and magnified ROC) curves are drawn with three models per graph, with the corresponding AUC values shown for comparison:

- The model with code 34, which has 7 layers and 184 units per layer, is chosen as the best model based on its average accuracy.
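For readers who want to reproduce this kind of comparison, the ROC points and the AUC for several models can be computed with scikit-learn as sketched below. The labels and the three sets of predicted scores are made-up toy values, not results from the article, and the model names and the helper `compare_models` are hypothetical. Plotting is left out; a magnified ROC is simply the same curve with the axes restricted to the top-left corner.

```python
from sklearn.metrics import roc_auc_score, roc_curve


def compare_models(y_true, scores_by_model):
    """Return {model name: (fpr, tpr, auc)} for each model's predicted scores."""
    results = {}
    for name, scores in scores_by_model.items():
        fpr, tpr, _thresholds = roc_curve(y_true, scores)
        results[name] = (fpr, tpr, roc_auc_score(y_true, scores))
    return results


# Made-up toy data: 1 = positive class, scores are predicted probabilities
y_true = [0, 0, 1, 1]
scores_by_model = {
    "model_a": [0.1, 0.4, 0.35, 0.8],
    "model_b": [0.2, 0.3, 0.6, 0.9],
    "model_c": [0.6, 0.7, 0.5, 0.65],
}

results = compare_models(y_true, scores_by_model)
for name, (fpr, tpr, auc) in results.items():
    print(f"{name}: AUC = {auc:.2f}")
```

AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, so a model that ranks every positive above every negative reaches 1.0.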

**Conclusion:**

This exercise gave experience in creating a baseline model from non-image data related to a disease. Because the data came in a format that pandas consumes easily, the experiment took little time. If laboratory reports were instead obtained directly from a hospital as PDF or other file formats, extra effort would be needed to identify the features, to write code that extracts them from those files, and to encode the non-numerical data among them. A reasonable baseline was also obtained with the width and depth calculated by the ln rule described above.

**BIBLIOGRAPHICAL NOTES**

“*Much of machine learning, from the most basic techniques to the state-of-the-art algorithms presented at research conferences, is statistical in flavor.*” - Roger Grosse [5]

To get an understanding of Statistical Learning (SL), [6] is a wonderful book that is free to download. The material for Roger Grosse’s course [7], especially the lecture notes and slides, is lucid and very useful for learning Deep Learning (DL). For quick exposure to solving SL and DL problems, [8], [9], and [10] were found to be good. For Python programming, [11] was used as a reference. The sites [12], [13], [14], and [15] document Python libraries useful for machine learning and can be consulted while writing or reading programs.

**REFERENCES**

1. UCI Machine Learning Repository, Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
2. UCI Machine Learning Repository, Data from the Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
3. UCI Machine Learning Repository, Information related to the Breast Cancer Wisconsin (Diagnostic) Data Set: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
4. Hierarchical routing for large networks: Performance evaluation and optimization: doi=10.1.1.6.4852
5. Course Notes: Introduction, CSC 421/2516 Winter 2019, Neural Networks and Deep Learning: http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L01%20Introduction.pdf
6. An Introduction to Statistical Learning with Applications in R: http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
7. CSC 421/2516 Winter 2019, Neural Networks and Deep Learning: http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/
8. Your First Machine Learning Project in Python Step-By-Step: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
9. How to develop a CNN for MNIST handwritten digit classification: https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-mnist-handwritten-digit-classification/
10. How to develop a CNN from scratch for CIFAR-10 photo classification: https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/
11. Fundamentals of Python programming: https://python.cs.southern.edu/pythonbook/pythonbook.pdf
12. pandas, Python Data Analysis Library: https://pandas.pydata.org/
13. numpy, NumPy is the fundamental package for scientific computing with Python: https://numpy.org/
14. scikit-learn, Machine Learning in Python: https://scikit-learn.org/stable/
15. Keras: The Python Deep Learning Library: https://keras.io/

**APPENDIX: A**

System Configuration:

Hardware & OS: