Towards Explicating The Blackbox Connectionist-AI

Original article can be found here (source): Artificial Intelligence on Medium

Prologue

From disseminating targeted advertising to detecting pandemics, Artificial Intelligence (or AI) is truly transforming the world. Deep learning, an instance of the AI sub-field of Connectionist-AI, is at the heart of this transformative renaissance. This article examines its foundational theory, which manifests through the neural networks used for solving learning problems, in order to understand its blackbox-operability nature.

The Foundational Theory

For understanding the state-of-the-art AI and its sub-field of Connectionist-AI in general, and Deep learning in particular, it is important to consider the inceptive historical context and the incumbent interpretation of the foundational theory before the proposed alternate interpretation may be appreciated. The threads explicating the inceptive historical context of Connectionist-AI, its incumbent interpretation, and the proposed alternate interpretation are briefly described below.

The Inceptive Historical Context. Through the synergistic efforts of several forward-looking scientists, including (alphabetically) Claude Shannon, John McCarthy, Marvin Minsky, and Nathaniel Rochester, the field of AI was born in the mid-1950s. Since its inception, the field has been broadly subdivided into four distinct sub-fields, namely, Connectionist, Bayesian, Evolutionary, and Symbolist AI. These sub-fields fundamentally differ in the way they theorize, formulate, and implement the underlying ideas governing the field of AI.

Notably, throughout its history, owing to a variety of constraints, including the progress made in its theoretical underpinnings as well as hardware advancements or the lack thereof, the field of AI has witnessed alternating periods of optimism and pessimism. This article focuses on the sub-field of Connectionist-AI in general, and on Deep learning, an instance of Connectionist-AI which utilizes the neural networks, in particular.

Deep learning, also known as differentiable programming, utilizes the neural networks to formulate and solve learning problems. The term was formally introduced in the mid-1980s by Rina Dechter, a distinguished Professor of Computer Science at the University of California, Irvine and a former doctoral student of Judea Pearl, the 2011 Turing Award winner. The sub-field instance has made tremendous strides since its inception, with several fundamental contributions made by (alphabetically) Geoffrey Hinton, Yann LeCun, and Yoshua Bengio, the 2018 Turing Award winners, among other notable scientists.

The Incumbent Interpretation. The neural networks, through which Deep learning instruments the learning process, are a connectionist construct loosely modeled on the biological neural networks. A neural network is a collection of individual artificial neuron nodes connected with each other over a mesh-type network topology. Generally, a neural network has an input layer that accepts the domain-specific input, such as images, texts, and videos, and an output layer that assigns prediction scores based on the requirements of the underlying learning problem. Between the input and output layers, there are one or more hidden layers, with each hidden layer providing a varying-order feature output to the subsequent layer. Typically, the neural networks used in Deep learning have several tens of hidden layers, if not more.
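As a minimal illustrative sketch, not any production implementation, the layered structure described above can be expressed in a few lines of NumPy; the layer sizes, the ReLU activation, and the random inputs are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Non-linear activation applied element-wise at each hidden layer.
    return np.maximum(0.0, z)

def init_layers(sizes):
    # One (weight, bias) pair per connection between consecutive layers.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    # Input layer -> hidden layers (ReLU) -> output layer (linear scores).
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

# A network with a 4-unit input layer, two hidden layers, and 3 output scores.
layers = init_layers([4, 16, 16, 3])
scores = forward(rng.standard_normal((2, 4)), layers)
print(scores.shape)  # (2, 3): one score vector per input row
```

Each row of the input batch flows through every layer in turn; only the hidden layers apply the non-linearity, while the output layer emits raw prediction scores.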

Separately, by the universal approximation theorem, it is known that, given the pertinent parameters, the neural networks can approximate any continuous function defined on a compact (closed and bounded) subset of R^n, the n-dimensional Euclidean real space. However, the theorem does not explicitly elaborate on the computability-nature of such approximated continuous functions.
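The constructive intuition behind the theorem can be demonstrated with a one-hidden-layer sigmoid network: pairs of steep sigmoids form localized bumps, and a weighted sum of such bumps approximates a continuous target on a bounded interval. The bump count, the steepness k, and the sine target below are assumptions chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0.0, 1.0 / (1.0 + e), e / (1.0 + e))

def bump(x, a, b, k=2000.0):
    # A pair of steep sigmoids approximates the indicator of the interval [a, b).
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

def approximate(f, x, n_bumps=200):
    # A weighted sum of bumps: a one-hidden-layer network with 2*n_bumps units.
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    y = np.zeros_like(x)
    for a, b in zip(edges[:-1], edges[1:]):
        y += f((a + b) / 2.0) * bump(x, a, b)
    return y

x = np.linspace(0.05, 0.95, 500)
target = np.sin(2.0 * np.pi * x)
error = np.max(np.abs(target - approximate(lambda t: np.sin(2.0 * np.pi * t), x)))
print(error < 0.1)  # True: the error shrinks further as n_bumps grows
```

Increasing the number of bumps (hidden units) drives the approximation error towards zero, which is exactly the guarantee the theorem formalizes.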

Then, a recent result, precipitating the computability-learnability equivalence principle, states that the neural networks can only learn those computable functions (i.e., the aforementioned approximated continuous functions) which the Turing machines are able to compute, operating over the classical and quantum computing paradigms. Thus, this result simultaneously bounds the computability-nature of the approximated continuous functions to the underlying model of computation (e.g., a Turing machine with random-access memory) while establishing the learnability-equivalence between the Connectionist and Symbolist AIs, among other manifestable connections.

For example, by the aforementioned principle, it is understood that the neural networks can learn to play games with large albeit finite combinatorial spaces, such as Chess and Go, because the functions underlying such games are computable by the Turing machines. However, the neural networks cannot learn uncomputable functions, such as the Busy Beaver function, which no Turing machine can compute.

Operationally, to learn the latent computable functions in the multi-dimensional input, a neural network, in its forward propagation phase, transforms the input by applying the input-dimension-adjusted weights at the input layer. The transformed input is then operated upon by the successive application of the non-linear activation functions, whose purpose is to provide selective output from the converging transformed inputs of the previous layer's neuron nodes. This process is repeated till the output layer, where the predicted output, along with the cumulative error, is produced.

Then, in the backward propagation phase, the aforementioned cumulative error, derived from the predicted output, is minimized. The error minimization is framed as an optimization problem, whose objective is to shape the predicted output towards minimizing the cumulative error. For solving this optimization problem, the stochastic gradient descent algorithm, or a computationally-efficient variant thereof, is used; it repeatedly applies the differentiation operation, via the chain rule, to attribute the cumulative error to the individual weights and adjust them accordingly. Thus, through the repeated application of the forward and backward propagation phases, the cumulative error between the real and predicted outputs is minimized.
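The two phases can be sketched end-to-end in NumPy, with the forward pass, the chain-rule backward pass, and the gradient-descent update written out by hand. For brevity this sketch uses full-batch rather than stochastic gradient descent, a single tanh hidden layer, and an invented noisy linear target.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a noisy linear target (purely illustrative).
X = rng.uniform(-1.0, 1.0, (256, 1))
y = 2.0 * X - 1.0 + 0.05 * rng.standard_normal((256, 1))

# One hidden layer with a tanh activation.
W1, b1 = rng.standard_normal((1, 16)) * 0.5, np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)) * 0.5, np.zeros(1)
lr = 0.1

for step in range(2000):
    # Forward propagation: weights, then the non-linear activation.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    loss = float(np.mean(err ** 2))  # the cumulative error

    # Backward propagation: chain-rule differentiation of the loss.
    g_pred = 2.0 * err / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_z = g_h * (1.0 - h ** 2)  # derivative of tanh
    g_W1, g_b1 = X.T @ g_z, g_z.sum(axis=0)

    # Gradient-descent update: shape the weights to minimize the error.
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(loss < 0.05)  # True: the cumulative error approaches the noise floor
```

The loop alternates the two phases exactly as described above: each forward pass produces predictions and a cumulative error, and each backward pass differentiates that error with respect to every weight before the update.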

The Proposed Alternate Interpretation. The aforementioned repeated application of the forward and backward propagation phases is performed in the training and testing phase of the neural networks, in which they learn and select, respectively, the latent computable functions present in the multi-dimensional input. Essentially, the training phase extracts the aforementioned latent computable functions and represents them within the neural networks. Subsequently, in the testing phase, the latent computable functions in the new multi-dimensional input are extracted and the decision is made on whether they are equivalent to the previously represented computable functions.

Mathematically, the forward propagation process of successively applying the weights and activation functions, to the multi-dimensional input, while converging and diverging the intermediate hidden layer outputs till the output layer, can be abstracted through the successive application of the integration operation over a given multi-dimensional space. Then, at the output layer, the predicted output is the resulting computable functions, along with the cumulative error.

Subsequently, in the backward propagation phase, the differentiation operation is successively applied to shape the output computable functions in the manner where the aforementioned cumulative error is minimized. Therein, it is important to note, for efficiency purposes, the numerical variants for the integration and differentiation operations may be utilized in the aforementioned forward and backward propagation phases, respectively. Then, the only distinction between the training and testing phase of the neural networks is that before the training phase, the neural networks have not learned any latent computable functions from the multi-dimensional input, whereas during the testing phase, the neural networks decide whether the newly learned latent computable functions are equivalent to the prior-learned latent computable functions.
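The claim that the numerical variants of integration and differentiation act as a round trip can be illustrated directly: a cumulative trapezoidal integral followed by a finite-difference derivative recovers the original function up to discretisation error. This is a generic numerical sketch of that round trip, not specific to any network.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 2001)
f = np.cos(x)

# Numerical integration (forward direction): cumulative trapezoidal rule.
F = np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) / 2.0 * np.diff(x))])

# Numerical differentiation (backward direction): central finite differences.
f_recovered = np.gradient(F, x)

# The round trip recovers cos(x) up to discretisation error.
print(np.max(np.abs(f - f_recovered)) < 1e-3)  # True
```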

If the multi-dimensional inputs provided to the neural networks in the training and testing phases share the same underlying input data distribution, then the latent computable functions extracted in the two phases would be approximately equivalent, and any difference between them would be bounded. Then, the aforementioned alternate interpretation of the neural networks' operability is formally defined as the equation below.

D( (+)_{f[i] ∈ F} d/dx[1,…n] int[n] f[i](x[1,…n]), (+)_{g[i] ∈ G} d/dx[1,…n] int[n] g[i](x[1,…n]) ) ≤ O(T) ≈ ε, where x[1,…n] ~ X

Where the aforementioned symbols have the following meaning.

f[i], g[i] : An instance of the computable function for the training and testing phases, respectively

F, G : The family of computable functions for the training and testing phases, respectively

x[1,…n] : The multi-dimensional input

X : The underlying input data distribution

d/dx[1,…n] : The successive application of the differentiation operation

int[n] : The successive application of the integration operation over the multi-dimensional space n

(+) : The statistical average structure operation over the family of computable functions

D : The n-dimensional Minkowski distance

O(T) : The upper bound on the threshold for the average structural difference between the training-phase and testing-phase families of computable functions

ε : A non-negative negligible real scalar value

From the aforementioned equation, it is evident that D — the Minkowski distance, takes two inputs: The first input is the average structure of the family of computable functions F learned, from the multi-dimensional input x[1,…n], in the training phase while the second input is the average structure of the family of computable functions G learned, from the multi-dimensional input x[1,…n], in the testing phase. Then, the Minkowski distance D computes the average structural difference between the family of computable functions extracted in the training and testing phases, respectively.

Subsequently, the aforementioned average difference is compared against the set threshold O(T). Thus, if the latent computable functions learned in the training and testing phases, respectively, are approximately equivalent, then the average difference would be small enough to be bounded by a non-negative negligible real scalar value ε. Notably, while the multi-dimensional inputs provided in the training and testing phases may or may not be the same, they are sourced from the same underlying input data distribution.
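This threshold comparison can be sketched numerically: average a "family" of feature vectors from each phase, take the Minkowski distance between the averages, and compare it against ε. The Gaussian data, the dimension, and the choice ε = 0.5 are assumptions invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def minkowski(u, v, p=2):
    # The n-dimensional Minkowski distance D between two averaged structures.
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

def average_structure(samples):
    # A stand-in for the (+) operation: the mean feature vector over a family.
    return samples.mean(axis=0)

# Training-phase and testing-phase samples drawn from the same distribution X,
# plus a testing-phase sample drawn from a shifted distribution.
train = rng.normal(0.0, 1.0, (1000, 8))
test_same = rng.normal(0.0, 1.0, (1000, 8))
test_shifted = rng.normal(3.0, 1.0, (1000, 8))

avg_f = average_structure(train)
eps = 0.5  # the negligible threshold, chosen for illustration

d_same = minkowski(avg_f, average_structure(test_same))
d_shift = minkowski(avg_f, average_structure(test_shifted))
print(d_same <= eps, d_shift <= eps)  # True False
```

When both phases draw from the same distribution the distance stays below ε, and when the testing-phase distribution shifts the distance becomes non-negligibly large, which is the discerning behavior described above.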

For example, if the set of multi-dimensional inputs X — also known as the underlying input data distribution, of which x[1,…n] is an input instance, is a collection of the cat images then the latent family of computable functions F and G for the training and testing phases, respectively, are the learned geometries of the cat images based on the underlying input data distribution. Then, the Minkowski distance D computes the average structural difference between the learned geometries of the collection of cat images over the training and testing phases, respectively.

Therefore, if the multi-dimensional input for the training and testing phases is sourced from the same underlying input data distribution X, then the aforementioned average difference would be negligibly small. Alternatively, if the underlying input data distribution differs between the two phases, then the average difference would be non-negligibly large. Thus, the average difference helps discern whether the training-phase and testing-phase multi-dimensional inputs share the same distribution.

Additionally, as previously noted, the neural networks perform complementary tasks during the training and testing phases. In the training phase, the average structure of the family of computable functions F is learned. In the testing phase, the newly learned average structure of the family of computable functions G is compared with the prior-learned average structure of the family of computable functions F, to test whether a given testing-phase computable function instance g[i] is a member of the training-phase family of computable functions F. This set-membership decision is based on the aforementioned average-difference output of the Minkowski distance D. Then, a given testing-phase computable function instance g[i] would pass the set-membership test if its structure is comparable to the structures of the individual training-phase computable function instances f[i] of the family F.

Thus, through the principled use of the aforementioned set-membership test, based on the aforementioned alternate interpretation of the foundational theory, the operability of the different learning paradigms can be understood.

The Reinterpreted Learning Paradigms

Broadly, there are five learning paradigms in the field of AI, which fundamentally differ in the degree of human supervision needed during the training phase, among other notable differences. Then, the learning paradigms, along with their aforementioned alternate interpretations of the foundational theory, are briefly described below.

The Supervised Learning Paradigm. In this learning paradigm, during the training phase, the humans provide the inputs as well as the outputs, and the neural networks extract the latent computable functions by mapping the inputs to the outputs. Subsequently, in the testing phase, the learned latent computable functions from the newly provided input-output pairs are compared with the prior-learned latent computable functions, to solve the problems of the classification and regression variety. Additionally, the human-readable labels may be manually provided along with the aforementioned input-output pairs for helping classify the newly provided inputs into the set of pre-defined categories.

Notably, in the classification problem setting, during the training phase, the human-readable labels are manually provided along with the input-output pairs to the neural networks. Subsequently, during the testing phase, the new input-output pairs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to classify them into the set of pre-defined categories. Then, the classification problems in the supervised learning paradigm, utilize the aforementioned set-membership test — manifesting from the aforementioned alternate interpretation to the foundational theory, to place the new input-output pairs into the pre-defined category-sets.
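A toy rendering of this set-membership classification, with a nearest-category-mean rule standing in for the Minkowski-distance test; the two Gaussian "cat" and "dog" feature clouds are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training phase: labelled inputs for two pre-defined categories.
cat_feats = rng.normal(0.0, 0.5, (100, 3))
dog_feats = rng.normal(2.0, 0.5, (100, 3))
means = {"cat": cat_feats.mean(axis=0), "dog": dog_feats.mean(axis=0)}

def classify(x):
    # Testing phase: place the new input into the pre-defined category
    # whose learned average structure is nearest.
    return min(means, key=lambda label: np.linalg.norm(x - means[label]))

print(classify(np.array([0.1, -0.2, 0.3])))  # cat
print(classify(np.array([1.9, 2.2, 1.8])))   # dog
```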

Separately, in the regression problem setting, during the training phase, the real-valued inputs to the neural networks are manually provided. Consequently, in the testing phase, the next real-valued outputs need to be predicted using the previously provided real-valued inputs. Then, the regression problems in the supervised learning paradigm utilize the aforementioned set-membership test to interpolate the next real-valued outputs by fitting them to the pre-defined curve-sets.
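The regression case can be sketched as fitting a pre-defined curve family, here a straight line via least squares, to the training-phase values and extrapolating the next real-valued output; the noiseless linear trend is an assumption of the example.

```python
import numpy as np

# Training phase: manually provided real-valued inputs following a
# hypothetical linear trend.
t = np.arange(10, dtype=float)
values = 3.0 * t + 2.0

# Fit the pre-defined curve family (a straight line) by least squares.
slope, intercept = np.polyfit(t, values, 1)

# Testing phase: predict the next real-valued output from the fitted curve.
next_value = slope * 10.0 + intercept
print(round(float(next_value), 6))  # 32.0
```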

The Self-Supervised Learning Paradigm. In this learning paradigm, during the training phase, the humans provide the input-output pairs, and the neural networks learn the latent computable functions by mapping the inputs to the outputs. Subsequently, in the testing phase, the learned latent computable functions from the newly provided input-output pairs are compared with the prior-learned latent computable functions, to solve the problems of the classification, clustering, and regression variety. However, in comparison to the supervised learning paradigm, the algorithmically-generated labels are automatically inferred from the input-output pairs, based on their clusterable properties, for helping classify the newly provided input-output pairs into the set of pre-defined categories.

Then, in the classification problem setting, during the training phase, the algorithmically-generated labels are automatically inferred from the input-output pairs on their clusterable properties. Then, such labels along with the input-output pairs are provided to the neural networks to learn their latent computable functions. Subsequently, during the testing phase, the new input-output pairs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to classify them into the set of pre-defined categories. Then, the classification problems in the self-supervised learning paradigm, utilize the aforementioned set-membership test to place the new input-output pairs into the pre-defined category-sets.

Furthermore, in the clustering problem setting, during the training phase, the input-output pairs are clustered, by the neural networks, on their proximity to each other. Subsequently, during the testing phase, the new input-output pairs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to cluster them in the pre-defined clusters. Then, the clustering problems in the self-supervised learning paradigm utilize the aforementioned set-membership test to place the new input-output pairs into the pre-defined cluster-sets.
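A compact sketch of the clustering case, using Lloyd's k-means algorithm with a greedy farthest-point initialisation; the two well-separated synthetic groups are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def init_centroids(points, k):
    # Greedy farthest-point initialisation spreads the starting centroids out.
    centroids = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids],
                       axis=0)
        centroids.append(points[dists.argmax()])
    return np.array(centroids)

def kmeans(points, k, iters=20):
    # Lloyd's algorithm: assign each point to its nearest centroid, then
    # recompute each centroid as the mean of its assigned points.
    centroids = init_centroids(points, k)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(k)])
    return centroids, labels

# Training phase: two well-separated synthetic groups of inputs.
group_a = rng.normal(0.0, 0.3, (50, 2))
group_b = rng.normal(5.0, 0.3, (50, 2))
centroids, labels = kmeans(np.vstack([group_a, group_b]), k=2)

# Testing phase: a new input is placed into the nearest pre-defined cluster.
new_point = np.array([4.8, 5.1])
cluster = int(np.linalg.norm(centroids - new_point, axis=1).argmin())
print(len(np.unique(labels[:50])) == 1 and len(np.unique(labels[50:])) == 1)
```

The training phase forms the clusters; the testing phase then performs the set-membership step by assigning the new point to the nearest learned centroid.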

Moreover, in the regression problem setting, during the training phase, the real-valued inputs to the neural networks are manually provided. Consequently, in the testing phase, the next real-valued outputs need to be predicted using the previously provided real-valued inputs. Then, the regression problems in the self-supervised learning paradigm utilize the aforementioned set-membership test to interpolate the next real-valued outputs by fitting them to the pre-defined curve-sets.

The Semi-Supervised Learning Paradigm. In this learning paradigm, during the training phase, the humans provide the input-output pairs, and the neural networks learn the latent computable functions by mapping the inputs to the outputs. Subsequently, in the testing phase, the learned latent computable functions from the newly provided input-output pairs are compared with the prior-learned latent computable functions, to solve the problems of the classification, clustering, and regression variety. However, in comparison to the supervised and self-supervised learning paradigms, a small set of human-generated labels is provided, while the labels for the remaining input-output pairs are automatically inferred from the human-labeled input-output pairs, based on their clusterable properties, for helping classify the newly provided input-output pairs into the set of pre-defined categories.

Then, in the classification problem setting, during the training phase, a small set of human-generated labels are provided to the neural networks, which help infer the labels for the remaining input-output pairs on their clusterable properties. Then, such labels along with the input-output pairs are provided to the neural networks to learn their latent computable functions. Subsequently, during the testing phase, the new input-output pairs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to classify them into the set of pre-defined categories. Then, the classification problems in the semi-supervised learning paradigm, utilize the aforementioned set-membership test to place the new input-output pairs into the pre-defined category-sets.

Furthermore, in the clustering problem setting, during the training phase, the input-output pairs are clustered, by the neural networks, on their proximity to each other. Subsequently, during the testing phase, the new input-output pairs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to cluster them into the pre-defined clusters. Then, the clustering problems in the semi-supervised learning paradigm utilize the aforementioned set-membership test to place the new input-output pairs into the pre-defined cluster-sets.

Moreover, in the regression problem setting, during the training phase, the real-valued inputs to the neural networks are manually provided. Consequently, in the testing phase, the next real-valued outputs need to be predicted using the previously provided real-valued inputs. Then, the regression problems in the semi-supervised learning paradigm utilize the aforementioned set-membership test to interpolate the next real-valued outputs by fitting them to the pre-defined curve-sets.

The Reinforcement Learning Paradigm. In this learning paradigm, the actions taken by the software agents in a given environment are modulated based on the pre-defined reward-penalty policies. This modulation carefully balances the software agents' need for exploration and exploitation of the aforementioned environment, expressed as a Markov decision process. However, in comparison to the supervised, self-supervised, and semi-supervised learning paradigms, the reinforcement learning paradigm does not need human-generated labels along with the input-output pairs. Instead, the actions taken by the software agents are utilized towards solving the real-world problems by simulating them in the aforementioned environment.

Notably, the real-world problems, which are simulated in the aforementioned environment as the Markov decision process, are addressed by the software agents that operate under stochastic rules governed by the aforementioned carefully modulated exploration-exploitation policies. Therein, during the exploration and exploitation phase, the principled reward-penalty policies establish the basis through which the actions of the software agents are permitted or denied, thereby observing the aforementioned set-membership test — manifesting from the aforementioned alternate interpretation to the foundational theory. Thus, through the aforementioned set-membership test, the software agents help simulate the real-world problems, in the aforementioned environment, in order to help solve the learning problems.
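These reward-penalty dynamics can be sketched with tabular Q-learning on a toy five-state corridor; the environment, the rewards, and the hyper-parameters are all invented for the example and stand in for the far richer settings discussed above.

```python
import random

random.seed(0)

# A five-state corridor: the agent moves left or right; reaching the
# rightmost state yields a reward of 1.
N_STATES, ACTIONS = 5, (-1, +1)
GOAL = N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != GOAL:
        # Exploration-exploitation balance: act randomly with probability epsilon.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), GOAL)
        r = 1.0 if s2 == GOAL else 0.0  # the pre-defined reward policy
        # Q-learning update: move Q towards reward plus discounted future value.
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy in every non-goal state is "move right" (+1).
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]
```

The epsilon-greedy action choice is the exploration-exploitation modulation, and the reward-driven Q update is the reward-penalty policy that shapes which actions the agent ultimately takes.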

The Unsupervised Learning Paradigm. In this learning paradigm, during the training phase, the humans provide only the inputs, without any accompanying outputs or labels, and the neural networks learn the latent computable functions present in those inputs. Subsequently, in the testing phase, the learned latent computable functions from the newly provided inputs are compared with the prior-learned latent computable functions, to solve the problems of the clustering variety. However, in comparison to the supervised learning paradigm, no labels are provided at all; the inputs are grouped solely on their clusterable properties, for helping cluster the newly provided inputs into the set of pre-defined clusters.

Then, in the clustering problem setting, during the training phase, the inputs are clustered, by the neural networks, on their proximity to each other. Subsequently, during the testing phase, the new inputs, which share the underlying input data distribution with the training phase, are provided to the aforementioned neural networks in order to cluster them into the pre-defined clusters. Then, the clustering problems in the unsupervised learning paradigm utilize the aforementioned set-membership test to place the new inputs into the pre-defined cluster-sets.

Epilogue

This article examines the inceptive historical context and the incumbent interpretation of the foundational theory of Deep learning, an instance of the AI sub-field of Connectionist-AI, which uses the neural networks for solving learning problems. Subsequently, an alternate interpretation of the aforementioned foundational theory is proposed, to explicate its blackbox-operability nature.