“You Can Only Learn What You Can Compute”​ — On The Compressibility, Computability, Decidability…


Prologue

Artificial Intelligence (or AI) has astonished the world with its ability to impact nearly every aspect of human life. Numerous emerging applications aim to shape the human life experience, thereby presenting the illusion that the expressive power of AI is limitless.

This article examines the limits of modern AI by drawing connections among the fundamental notions of compressibility, computability, decidability, and learnability, to show that such notions are equivalent, and that the resulting computable-learnability requires only finite computational resources towards AI-learnability.

Such a line of inquiry enables charting a path towards a more general intelligence, namely Artificial General Intelligence (or AGI).

While throughout this article the nature of emerging AI is examined through its neural-network-based manifestation, the fundamental ideas described here are applicable to all computability-based learning constructs, mechanisms, and paradigms.

Section 1: On The Connections Between Compressibility And Learnability

From the physics standpoint, in a stochastic system, entropy measures the degree of randomness in the system. For example, super-cooled gas particles have minimal motion, hence minimal kinetic-energy-driven randomness, and therefore minimal entropy.

This physical notion of entropy translates to information-theoretic entropy via information content: a random variable in such a system carries more information content the more random it is. Accordingly, the super-cooled gas particles, with minimal kinetic randomness, also have minimal information-theoretic entropy.

Subsequently, a discretized random variable in such a system with high entropy has little to no compressible information content, since the cardinality of its set of unique elements equals the cardinality of the full set of observations.

Then, a larger amount of learnable information content is available on such a random variable, since its set of unique elements has greater cardinality than that of a random variable whose information content is highly compressible.

Conversely, if the information content of the random variable is compressible by a certain amount, then its learnable information content reduces by a corresponding amount.
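The entropy-compressibility relationship above can be sketched computationally. The following illustrative Python sketch (the byte-strings and the SHA-256-based pseudo-random source are assumptions chosen for reproducibility, not part of the exposition) contrasts a low-entropy repetitive string with a high-entropy one, showing that the former compresses well while the latter does not.

```python
# Sketch: relating information-theoretic entropy to compressibility.
# A high-entropy (pseudo-random) byte string resists compression, while a
# low-entropy (repetitive) string compresses well.
import hashlib
import math
import zlib
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller = more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

# Low-entropy source: a short motif repeated many times.
repetitive = b"abcd" * 256

# High-entropy source: deterministic pseudo-random bytes built from SHA-256,
# so the demo is reproducible without os.urandom.
pseudo_random = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))

print(shannon_entropy(repetitive), compression_ratio(repetitive))
print(shannon_entropy(pseudo_random), compression_ratio(pseudo_random))
```

The repetitive string compresses to a small fraction of its size, while the pseudo-random string remains essentially incompressible, mirroring the claim that high entropy implies little compressible content.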

Furthermore, a function on the aforementioned discretized random variable with variable information content and compressibility is computable by a Turing machine if and only if:

  1. The input domain of the function has at most countably infinite cardinality.
  2. There exists a finite algorithm that computes the function.

Then, by the proposed computability-learnability equivalence principle, such a function can be learned by a neural network.

Conversely, if no finite computing algorithm for the aforementioned function exists then such a function is uncomputable by any Turing machine. Then, by the proposed computability-learnability equivalence principle, such uncomputable functions may not be learned by the neural networks.

Thus, learnability is determined by compressibility. That is, unbounded incompressibility precludes it, while bounded incompressibility modulates it.

Reference: A statistical definition of entropy, MIT Lecture Notes on Thermodynamics and Propulsion, 2002.

Link: https://web.mit.edu/16.unified/www/FALL/thermodynamics/notes/node56.html

Section 2: On The Connections Between Computability And Learnability

There is a large body of research on the empirical performance of the neural networks but limited understanding of their learnability limits.

Mathematically, the neural networks learn the implicit function embedded in the input data. Such input encodes a mapping between a domain and a range, and that mapping manifests as the implicit function within the network through its configurable weights.

If the aforementioned function operates over the input with a countably infinite cardinality and a finite algorithm exists that computes it then it is computable. Naturally, it follows that the neural networks can learn such computable functions. Then, their learnability is at least equivalent to the computability of the Turing machines, indicating a computability-learnability equivalence principle.

However, if the aforementioned function does not have a finite computing algorithm, then it is uncomputable; examples include the Busy Beaver function and the characteristic function of the Halting Problem, neither of which any Turing machine can compute. Then, the neural networks cannot learn such functions, due to the non-existence of a finite computing algorithm.

For, if the neural networks could learn the aforementioned uncomputable functions, implying that a finite algorithm exists that computes them in polynomial time, then they could be structured as Oracles, to be invoked by the Turing machines to compute those uncomputable functions in polynomial time.

Then, through a polynomial-time reduction from any NP-Complete problem to a function computed by the aforementioned Oracle-enabled Turing machine, any NP-Complete problem could be solved in polynomial time, thereby proving P = NP. This longstanding unresolved problem in computer science asks whether the class of problems that deterministic Turing machines can solve in polynomial time (the class P, for Polynomial time) equals the class of problems whose solutions deterministic Turing machines can verify in polynomial time (the class NP, for Non-deterministic Polynomial time, whose subset NP-Complete contains the hardest NP-class problems).
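The solve-versus-verify asymmetry underlying the P versus NP question can be sketched concretely. The following illustrative Python sketch (the CNF formula and its encoding are assumptions for the example) verifies a SAT certificate in polynomial time, yet finds one only by exponential brute force.

```python
# Sketch: the solve/verify asymmetry behind P vs NP, using CNF-SAT.
# A formula is a list of clauses; a clause is a list of signed ints
# (i means variable i is True, -i means variable i is False).
from itertools import product

def verify(formula, assignment):
    """Polynomial-time check that an assignment satisfies every clause."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in formula
    )

def solve(formula, num_vars):
    """Exponential brute force: try all 2^n assignments."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(formula, assignment):
            return assignment
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
witness = solve(formula, 3)
print(witness)
```

Verification touches each literal once, while the search space doubles with every added variable, which is the gap the P versus NP question formalizes.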

However, since it is likely that P ≠ NP, then, by the contradiction, it is unlikely that the aforementioned uncomputable functions can be learned by the neural networks. This independently suggests the non-existence of a finite computing algorithm for such uncomputable functions.

Alternatively, if P = NP then the scope of computability would expand to include the aforementioned uncomputable functions. Then, such previously uncomputable but newly computable functions would be learnable by the neural networks.

Thus, based on the resolution of the P versus NP question, the computability and learnability of functions would likely have a tight bound on their equivalency. That is, the neural networks may only learn those functions which the Turing machines are able to compute, whether these are the computable functions (i.e. if P ≠ NP) or also the presently uncomputable ones (i.e. if P = NP).

Reference: P Vs NP Problem, Clay Mathematics Institute, 2019.

Link: https://www.claymath.org/millennium-problems/p-vs-np-problem

Section 3: On The Connections Between Decidability And Learnability

In computability theory, a set L — also known as a language, is decidable by a Turing machine M such that on input x, M accepts if x ∈ L and rejects otherwise. Such Turing-decidable languages are called recursive (decidable) languages; they form a proper subset of the recursively enumerable languages, which a Turing machine is only required to recognize, accepting members without necessarily halting on non-members.

This means that the transition function rules of the aforementioned Turing machine recognize exactly the elements that are members of the language L, of which x is an element. Decidability is thus premised on testing the membership of elements in a select discrete set.
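Such a membership-testing machine can be sketched directly. The following illustrative Python sketch (the language L = { 0^n 1^n : n ≥ 0 }, the state names, and the transition table are assumptions for the example) simulates a Turing-machine-style decider that marks matching 0s and 1s and accepts exactly the members of the language.

```python
# Sketch: a Turing-machine-style decider for L = { 0^n 1^n : n >= 0 }.
# The machine marks a leading 0 as X, the matching 1 as Y, and repeats.
ACCEPT, BLANK = "accept", "_"

# Transition table: (state, symbol) -> (new_state, write, head_move)
rules = {
    ("q0", "0"): ("q1", "X", +1),   # mark a 0, go find its 1
    ("q0", "Y"): ("q3", "Y", +1),   # all 0s marked; verify only Ys remain
    ("q0", BLANK): (ACCEPT, BLANK, 0),
    ("q1", "0"): ("q1", "0", +1),
    ("q1", "Y"): ("q1", "Y", +1),
    ("q1", "1"): ("q2", "Y", -1),   # mark the matching 1
    ("q2", "0"): ("q2", "0", -1),
    ("q2", "Y"): ("q2", "Y", -1),
    ("q2", "X"): ("q0", "X", +1),   # return to the leftmost unmarked cell
    ("q3", "Y"): ("q3", "Y", +1),
    ("q3", BLANK): (ACCEPT, BLANK, 0),
}

def decides(word: str, max_steps: int = 10_000) -> bool:
    """Run the machine; a missing rule means the machine halts and rejects."""
    tape = dict(enumerate(word))
    state, head = "q0", 0
    for _ in range(max_steps):
        if state == ACCEPT:
            return True
        symbol = tape.get(head, BLANK)
        if (state, symbol) not in rules:
            return False
        state, write, move = rules[(state, symbol)]
        tape[head] = write
        head += move
    return False
```

The machine halts on every input, which is precisely what distinguishes deciding a language from merely recognizing it.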

On closer examination, the contemporary learning problems spanning over the supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning paradigms, respectively, closely intertwine with this notion of the decidability.

For example, the classification problem in the supervised learning setting is a set-membership test. Consider a set of dog images D on which a neural network is trained. Then, a cat image c ∉ D would be labeled not a dog: it is not a member of the aforementioned set and therefore will not be recognized as a dog by the neural network.

Generally, by the computability-learnability equivalence principle, for any Turing-decidable language L, the grammar to which it is reducible is the generative computable function that is learnable by the neural networks. Furthermore, uncomputable problems such as the Busy Beaver function and the Halting Problem are Turing-undecidable. Then, by the computability-learnability equivalence principle, such Turing-undecidable languages cannot be learned by the neural networks.

Moreover, based on the Chomsky hierarchy, the grammars generating such recursively enumerable languages are called unrestricted grammars (or Type-0 grammars). Then, by the computability-learnability equivalence principle, strings generated by such grammars are the only inputs on which the neural networks would classify using the pre-defined labels or form the unlabeled clusters.
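The generative role of such grammars can be sketched as follows. The sketch below (the grammar S → 0S1 | ε is an assumption chosen for brevity; it is context-free, hence also Type-0, whereas a fully unrestricted grammar would allow arbitrary left-hand sides) expands production rules breadth-first to enumerate the strings a learner would receive as input.

```python
# Sketch: enumerating the strings generated by a small grammar via
# breadth-first expansion of its production rules.
from collections import deque

def generate(rules, start="S", max_len=6, max_expansions=10_000):
    """Expand the leftmost non-terminal repeatedly; return terminal strings."""
    results, queue, seen = set(), deque([start]), {start}
    for _ in range(max_expansions):
        if not queue:
            break
        form = queue.popleft()
        if all(sym not in rules for sym in form):   # all symbols terminal
            if len(form) <= max_len:
                results.add(form)
            continue
        i = next(i for i, sym in enumerate(form) if sym in rules)
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            if len(new) <= max_len + 1 and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

# S -> 0S1 | empty, generating the language 0^n 1^n from the decider example.
grammar = {"S": ["0S1", ""]}
print(sorted(generate(grammar), key=len))
```

Every generated string belongs to 0^n 1^n, tying the grammar view of the language to the set-membership view described above.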

Conversely, if the aforementioned neural networks take a set of inputs and cluster them by their underlying characteristics, then each such cluster, and the overall cluster-group, is reducible to an unrestricted grammar and decidable by a corresponding Turing machine. Notably, such clusters may also be pre-labeled.

Then, the Turing machines and neural networks are equivalent in their expressive powers, with the notable distinction that the Turing machines come pre-configured with their transition functions, while the neural networks, with configurable weights, build the learnable computable functions from the input data.

Partial-Credit: Robert F. Dickerson

Link: https://www.linkedin.com/posts/rfdickerson_im-pretty-surprised-i-havent-encountered-activity-6562824855538978816-vjAr/

Section 4: On The Manifesting Insight From The Connections

From the aforementioned connections, it is evident that there is a convergence in the notions of compressibility, computability, decidability, and learnability, to the extent that equivalences manifest amongst them. For example, there is an equivalence between the notions of computability and learnability.

A. On The Equivalence Between Computing And Learning Machines

Recall that, in computability theory, P is the class of decidable problems that a deterministic Turing machine can solve using an amount of computational time that is only polynomial in the input size.

Then, by the computability-learnability equivalence principle, COMPUTE-LEARN — a class of learning problems based on computable functions, generative unrestricted grammars, and recursively enumerable languages, spread over learning paradigms such as supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning, that can be solved by the aforementioned Turing machine equivalent neural network — is in P.

Interpretatively, such clusters, functions, grammars, languages, and strings may contain juxtaposed bit-level encoded information that could be viewed as a collection of characters, digits, musical notes, pixels, among other interpretations.

Essentially, this means that for each aforementioned Turing machine there exists an equivalent neural network, either of which can decide or solve, respectively, the learning problems in the class COMPUTE-LEARN by taking only a polynomial amount of time, proportional to the input size. Notably, uncomputable functions such as the Busy Beaver function and the Halting Problem are not members of the aforementioned class.
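The learning half of this equivalence can be sketched with a minimal neural learner. The following illustrative Python sketch (a single perceptron; the learning rate and epoch count are illustrative choices) learns the trivially computable Boolean AND function from labeled examples in time polynomial in their number.

```python
# Sketch: a minimal neural learner (one perceptron) acquiring a computable
# function, Boolean AND, from labeled examples via the classic update rule.
def train_perceptron(samples, epochs=20, lr=0.1):
    """Perceptron rule on (inputs, label) pairs; returns (weights, bias)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred                    # 0 when the prediction is right
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

# The full truth table of AND serves as the training data.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
```

Since AND is linearly separable, the perceptron convergence theorem guarantees the rule settles on a correct weight configuration after finitely many updates.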

Thus, the computable-learnability requires only finite, and indeed polynomial, amounts of space and time towards the AI-learnability.

B. On The Probabilistic Learnability

Thus far, the Turing machines and neural networks under consideration have been assumed to be deterministic in nature. It is then worth considering the enhancement of their computability and learnability rates, at the cost of their computing and learning fidelities, respectively, achieved by introducing the notion of probability.

In particular, in complexity theory there exists a probability-based proof system — also known as a Probabilistically Checkable Proof (or PCP), which verifies the correctness of the pertinent proof using a fixed amount of randomness and a fixed number of proof queries.

In such a proof system, there are two complementary components — a prover and a verifier. The prover, by examining the solution of a given problem, generates a proof certificate, which the verifier verifies using the aforementioned amount of randomness and number of queries to the proof.

More precisely, PCP[R(n), Q(n)] can be defined as the class of decision problems that are decided by the randomized Oracle-enabled Turing machines through the utilization of R(n) bits of randomness and at most Q(n) proof queries, as functions of the input size n. Such a Turing machine satisfies the definitional requirements of the aforementioned verifier.

Then, any language L ∈ PCP[R(n), Q(n)] is decided by the corresponding aforementioned verifier Turing machine. This means that the prover for the aforementioned class examines an input x ∈ L to generate a proof certificate π, which the verifier reads to accept if the input x ∈ L, by using R(n) amount of randomness and at most Q(n) number of proof queries, proportional to all inputs x of size n.

For example, PCP[0, O(Log[n])] is the class of decision problems whose verifier Turing machine uses no randomness (i.e. R(n) = 0) and at most O(Log[n]) proof queries (i.e. Q(n) = O(Log[n])) to accept an input x ∈ L, for any language L in the aforementioned class.
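The spirit of such query-limited verification can be sketched, albeit loosely. The following illustrative Python sketch (the sortedness claim, the array-as-proof encoding, and the query budget are assumptions; this is randomized spot-checking, not a genuine PCP construction) accepts a valid proof and rejects a maximally corrupted one after only a logarithmic number of queries per repetition.

```python
# Sketch: a verifier that reads only a few randomly chosen positions of a
# proof. The claim is "this array is sorted"; the proof is the array itself;
# the verifier samples adjacent pairs rather than scanning everything.
import math
import random

def verifier(proof, rng, repetitions=3):
    """Accept iff every sampled adjacent pair is in order."""
    n = len(proof)
    queries = max(1, math.ceil(math.log2(n))) * repetitions
    for _ in range(queries):
        i = rng.randrange(n - 1)
        if proof[i] > proof[i + 1]:
            return False
    return True

rng = random.Random(0)                 # seeded for reproducibility
good_proof = list(range(64))
bad_proof = list(range(64))[::-1]      # every adjacent pair is out of order

print(verifier(good_proof, rng), verifier(bad_proof, rng))
```

A proof with many violations is caught with high probability, while the verifier's work grows only logarithmically in the proof size; the fidelity-for-speed trade is exactly the one discussed below.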

Where the aforementioned symbols have the following meaning.

PCP : The probabilistic checkable proof system.

n : The input size.

R(n) : The number of random bits used by the verifier, as a function of the input size n.

Q(n) : The number of queries made to the proof certificate, as a function of the input size n.

O(Log[n]) : The asymptotic notation denoting a quantity logarithmic in n.

π : The proof certificate symbol.

Notably, such a PCP class is in P — the class of decision problems that deterministic Turing machines can decide in time polynomial in the input size. This is seen by observing that the aforementioned verifier Turing machine requires no randomness to operate, and the at most O(Log[n]) queried proof bits admit only polynomially many possible values, all of which can be tried in a polynomial amount of time, proportional to the size of the input proof certificates.

Separately, one of the numerous ways to construct the proof certificates is for the prover to utilize the error-correction codes such as the Hamming code.

Consider, the Hamming code function as follows.

H : {0, 1}^k → {0, 1}^m,

s.t. m = k + Log[k],

k ≥ 2,

k < m ≤ n.

Where the aforementioned symbols have the following meaning.

H : The Hamming code function.

{0, 1}^k : The binary bit-string of size k.

{0, 1}^m : The binary bit-string of size m.

→ : The directional correspondence symbol.

s.t. : The such that abbreviation.

Then, by using a parity check matrix, the Hamming code function H takes the domain of k-bit bit-strings and converts them to the range of m-bit bit-strings by adding the Log[k] check-bits to them at their 2^p index-locations where p ∈ [0, Log[k]]. While the definition of the parity check matrix is not provided here, the scope of the exposition remains complete.
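The check-bit placement just described can be sketched with the classic Hamming(7, 4) code. In the following illustrative Python sketch (the bit layout follows the standard construction, with parity bits at the 1-indexed power-of-two positions 1, 2, and 4), a single flipped bit is located by the syndrome and corrected.

```python
# Sketch: Hamming(7,4) encoding and single-error correction, with parity
# bits at the power-of-two positions 1, 2, 4 (1-indexed).

def hamming74_encode(data):
    """data: 4 bits -> 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code):
    """Return (corrected codeword, 1-indexed error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # parity over positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]     # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4    # reads off the error position
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the offending bit back
    return c, syndrome

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[5] ^= 1                      # flip one bit (1-indexed position 6)
fixed, pos = hamming74_correct(corrupted)
```

Because the three syndrome bits spell out the error's position in binary, a verifier examining only the check positions can detect and localize a corrupted certificate bit.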

Then, the prover can construct an m-bit proof certificate π of the input x, which the verifier Turing machine can examine at the aforementioned 2^p index-locations to accept if x ∈ L, for a language L in the class PCP[0, O(Log[n])].

Notably, by examining at most O(Log[n]) check-bits of the m-bit proof certificate π over a few iterations, the verifier Turing machine probabilistically accepts if x ∈ L in the polynomial amount of time, proportional to the size of input x.

Then, by the computability-learnability equivalence principle, the aforementioned verifier Turing machine equivalent neural network — also known as the Probabilistic Neural Network, can be constructed that can learn the aforementioned language L to probabilistically classify a given input.

While the Hamming code is used in the construction of the aforementioned proof certificate, it should be noted that any finite discrete computable function that allows selection of a subset of the proof certificate bits would work. The aforementioned neural network would then require only that subset of the proof certificate bits to classify the input over a few iterations, with a certain probability, in a polynomial amount of time proportional to the size of the input proof certificates.

For example, such a neural network would learn a set of cat images C to probabilistically classify a new cat image c by examining a subset of its image-pixels over a few iterations.

For applications where such a probabilistic classification is acceptable, a considerable amount of decision speedup may be available albeit with an understanding that a portion of the decisions may not be correct.

The proportion of incorrect decisions in the decision mixture would depend on the number of subset bits used by the aforementioned neural network to probabilistically classify the inputs.
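Such subset-based probabilistic classification can be sketched as follows. The illustrative Python sketch below (the 32-bit inputs, the two class prototypes, and the subset size are assumptions for the example) classifies an input by its Hamming distance to each prototype, measured only at a small random sample of bit positions.

```python
# Sketch: probabilistic classification that inspects only a random subset of
# bit positions, trading fidelity for speed, with Hamming distance as the
# metric between an input and each class prototype.
import random

def hamming(a, b, positions):
    """Hamming distance restricted to the sampled positions."""
    return sum(a[i] != b[i] for i in positions)

def classify(x, prototypes, rng, subset_size=8):
    """Nearest prototype using only `subset_size` sampled bit positions."""
    positions = rng.sample(range(len(x)), subset_size)
    return min(prototypes,
               key=lambda label: hamming(x, prototypes[label], positions))

rng = random.Random(42)                      # seeded for reproducibility
prototypes = {"zeros": [0] * 32, "ones": [1] * 32}

noisy_ones = [1] * 32
noisy_ones[3] = 0                            # one corrupted bit
label = classify(noisy_ones, prototypes, rng)
```

Examining 8 of 32 bits already separates the two classes reliably here; shrinking the subset would speed up each decision while raising the share of misclassifications, as discussed above.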

Separately, thus far, the aforementioned neural networks have been used in the supervised learning manner. That is, in the training phase, the neural network probabilistically learns the language L, wherein a label A is provided for each input x that is probabilistically admitted to the language L. Then, in the test phase, when a new input is probabilistically admitted to language L, the label A is generated as the output along with the admittance probability of input to the language L.

However, the aforementioned neural network can also be applied in the unsupervised learning manner, by observing that the labels are not required for the probabilistically admissible inputs to the language L.

Then, all such admitted inputs to the language L would form a cluster on their proximities to each other using a bit based distance metric (e.g. the Hamming distance).

Naturally, such clusters, individually and collectively, would reduce to their corresponding probabilistic grammars in the hierarchical manner, wherein the top-level probabilistic grammar would have production rules for the lower-level probabilistic grammars, all the way down to the leaf-level probabilistic grammars which would help generate the aforementioned individual clusters.

Thus, by relaxing the constraint on the computing and learning fidelities of the Turing machines and neural networks, respectively, their computability and learnability rates can be improved.

C. On The Quantum Learnability

There is a key subtlety regarding the nature of the Turing machines that must be considered, which subsumes the aforementioned connections and their convergence. In particular, in the computability theory, a Turing machine is an instance of the mathematical model of computation as defined by the abstract computing machine.

Notably, the Turing machines, described here, have been implicitly assumed to be operating under the classical computing constraints. However, such machines can also operate in the quantum computing realm.

Therein, by considering the state space to be a Hilbert space, the transition function to be a collection of unitary matrices that are automorphisms of that state space, and by introducing superposition (and, more generally, mixed) states as described in quantum mechanics, the quantum Turing machines (or the equivalent quantum circuits) can be constructed. Naturally, the analogous Quantum Neural Networks can also be constructed.
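The linear-algebra core of such a quantum machine's step can be sketched in a few lines. The following illustrative Python sketch (standard library only; a practical simulator would use numpy or a framework such as qiskit) applies the Hadamard unitary to the basis state |0⟩, producing an equal superposition, and recovers |0⟩ upon a second application.

```python
# Sketch: one step of a quantum machine — a unitary matrix acting on a state
# vector in a (here 2-dimensional) Hilbert space, with Born-rule probabilities.
import math

def apply(unitary, state):
    """Multiply a 2x2 matrix by a 2-component state vector."""
    return [
        unitary[0][0] * state[0] + unitary[0][1] * state[1],
        unitary[1][0] * state[0] + unitary[1][1] * state[1],
    ]

# Hadamard gate: sends |0> to the equal superposition (|0> + |1>)/sqrt(2).
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

ket0 = [1.0, 0.0]
superposed = apply(H, ket0)
probabilities = [a * a for a in superposed]   # Born rule: |amplitude|^2
```

That applying H twice restores |0⟩ exactly illustrates the unitarity (reversibility) that distinguishes quantum transition functions from classical ones.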

Then, by noting that the aforementioned computability-learnability equivalence principle is implicitly an equivalence between the abstract computing and learning machines, respectively, it is discernible that such an equivalence would also hold true in the quantum computing realm.

More importantly, the computable functions, generative unrestricted grammars, and recursively enumerable languages manifesting in the quantum computing realm, would take advantage of the aforementioned mixed states. Then, by the computability-learnability equivalence principle such functions, grammars, and languages that are computable, recognizable, and decidable by the quantum Turing machines (or the equivalent quantum circuits) are the only functions, grammars, and languages that would be learnable by the quantum neural networks.

Thus, by switching the underlying abstract computing and learning machines from the classical Turing machines and neural networks to quantum Turing machines (or equivalent quantum circuits) and quantum neural networks, respectively, the classical limits of the learnability can be transcended. Notably, such a learnability enhancement is due to the introduction of aforementioned mixed states, the quantum transition function, the Hilbert state space, along with the mixed state enabled computable functions, generative unrestricted grammars, and recursively enumerable languages, respectively.

D. On The Divergence Between Data Semantics and Structures

Beyond the nature of the abstract computing and learning machines, is the issue of divergence in data semantics and structures that such computing machines compute over and learning machines learn on, respectively. In particular, it is notable that the neural networks focus on the structure of the input data and do not understand its semantics.

While it is conceivable that the data semantics and structures may converge, in general, they are not equivalent. Then, the learning problems spread over the supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning paradigms, respectively, could manifest in the manner where the aforementioned data semantics-structures convergence takes place. However, in general, they would tend to diverge.

To further elucidate the notion of the potential data semantics-structures divergence, consider the following example of the graph isomorphism.

In graph theory, the two graphs A and B are considered isomorphic if there exists a one-to-one correspondence between their vertex sets that preserves adjacency. Mathematically, it is defined as follows.

F : V(A) ↔ V(B),

∀ a, ã ∈ V(A) Λ F(a), F(ã) ∈ V(B),

s.t. a — ã iff F(a) — F(ã).

Where the aforementioned symbols have the following meaning.

F : The one-to-one correspondence.

A, B : The two graphs under consideration for being isomorphic.

V(A), V(B) : The vertex sets for the graphs A and B, respectively.

a, ã : The vertices belonging to the graph A.

F(a), F(ã) : The vertices belonging to the graph B.

↔ : The one-to-one correspondence symbol.

∈ : The belongs to symbol.

— : The adjacent vertices symbol.

Λ : The and symbol.

s.t. : The such that abbreviation.

iff : The if and only if abbreviation.

Essentially, the aforementioned mathematical proposition states that the two graphs A and B are isomorphic if and only if, whenever two vertices a and ã belonging to the vertex set V(A) are adjacent, the corresponding vertices F(a) and F(ã) belonging to the vertex set V(B) are also adjacent, and vice versa.
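The adjacency-preservation test above can be sketched by brute force for small graphs. The following illustrative Python sketch (the example graphs are assumptions) tries every vertex bijection and checks that the edge sets correspond; this approach is exponential in general, which is precisely why graph isomorphism is computationally interesting.

```python
# Sketch: brute-force graph isomorphism — try every vertex bijection and
# check that it maps the edge set of A exactly onto the edge set of B.
from itertools import permutations

def isomorphic(edges_a, edges_b, n):
    """Graphs on vertices 0..n-1, given as sets of frozenset edges."""
    if len(edges_a) != len(edges_b):
        return False
    for perm in permutations(range(n)):
        mapped = {frozenset({perm[u], perm[v]}) for u, v in edges_a}
        if mapped == edges_b:
            return True                 # an adjacency-preserving bijection
    return False

# A 4-cycle 0-1-2-3-0, the same cycle with vertices 1 and 2 relabeled,
# and a 4-vertex path for contrast.
cycle = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
relabeled = {frozenset(e) for e in [(0, 2), (2, 1), (1, 3), (3, 0)]}
path = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
```

The cycle and its relabeling are structurally equivalent despite different edge lists, while the path is not, mirroring the structural side of the semantics-structures distinction.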

Naturally, based on the nature of the aforementioned one-to-one correspondence, the aforementioned two graphs can be semantically equivalent but structurally different (i.e. they are semantically-isomorphic), semantically different but structurally equivalent (i.e. they are structurally-isomorphic), and semantically and structurally equivalent (i.e. they are semantically-structurally auto-isomorphic).

Then, graph isomorphism shows that the data semantics and structures would tend to diverge. That is, they may converge but are not equivalent.

Notably, such isomorphisms are subsumed within the notions of the abstract computing and learning machines, both classical and quantum, that operate over and learn on such data.

References:

Graph isomorphism, Wikipedia, 2019.

Link: https://en.wikipedia.org/wiki/Graph_isomorphism

Probabilistic checking of proofs and hardness of approximation problems, Sanjeev Arora, University of California, Berkeley Ph.D. Thesis, 1994.

Link: https://www.cs.princeton.edu/~arora/pubs/thesis.pdf

Quantum theory, the Church-Turing principle and the universal quantum computer, David Deutsch, Royal Society London, 1985.

Link: https://people.eecs.berkeley.edu/~christos/classics/Deutsch_quantum_theory.pdf

Section 5: On The Undecidability Of Learnability

In computability theory, the notions of computability and decidability are interchangeable.

That is, for a given computable function and decidable language, a single Turing machine can be constructed that computes and decides the aforementioned function and language, respectively.

Conversely, in computability theory, the notions of uncomputability and undecidability are interchangeable. For example, the Busy Beaver function and the Halting Problem are uncomputable and undecidable by any Turing machine, respectively.
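The asymmetry behind such undecidability can be sketched with a step-bounded simulator, which can confirm halting but never confirm non-halting. In the following illustrative Python sketch (the Collatz iteration and the step bound are assumptions for the example), exceeding the bound yields only "unknown", never "runs forever".

```python
# Sketch: why halting is only semi-decidable. A step-bounded simulator can
# verify that a computation halts, but when the bound is exhausted it cannot
# distinguish "halts later" from "never halts".

def collatz_halts_within(n: int, max_steps: int) -> str:
    """Simulate the Collatz iteration; report "halts" or "unknown"."""
    for _ in range(max_steps):
        if n == 1:
            return "halts"              # positively confirmed
        n = 3 * n + 1 if n % 2 else n // 2
    return "unknown"                    # could halt later, or never
```

No matter how large the bound, "unknown" never hardens into "runs forever"; turning it into a definite answer for all programs is exactly what the undecidability of the Halting Problem forbids.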

More importantly, if P ≠ NP and the chosen functions, grammars, and languages are uncomputable and undecidable by the Turing machines, then, by the aforementioned computability-learnability equivalence principle, the neural networks cannot learn such functions, grammars, and languages.

Then, nothing prevents the neural networks from failing to learn such functions, grammars, and languages along a certain dimension. Thus, learnability itself can be uncomputable and undecidable in that particular sense.

Reference: Learning can be undecidable, Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff, Nature Machine Intelligence, 2019.

Link: https://www.nature.com/articles/s42256-018-0002-3

Section 6: On The Generalizability Of Learnability

Contemporary neural networks operate on the input over the Euclidean space. However, this assumption, of space being Euclidean in nature, is only true in a proximate-distance based sensing-modality setup.

Generally, observing the space as a sequence of two-dimensional Euclidean data frames fails to incorporate remote-distance based kinematic and spatial curvature properties of the measured objects within that space.

For example, in a proximate-distance based Euclidean setup, straight-line distances faithfully approximate the true separations between objects, which is no longer the case in the remote-distance setup, where geodesic distances diverge from their straight-line counterparts. Consequently, a neural network trained over the Euclidean space would have its neural-weight configuration imbue the deliverable predictions with a Euclidean-space-induced spatial bias.
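The proximate-versus-remote divergence can be sketched on a sphere. The following illustrative Python sketch (the unit sphere and the chosen central angles are assumptions for the example) compares the Euclidean chord through the interior with the on-surface geodesic for nearby and far-apart point pairs.

```python
# Sketch: Euclidean (chord) vs non-Euclidean (geodesic) distance on the unit
# sphere. For nearby points the two nearly agree; for remote points they
# diverge, illustrating the spatial bias of a purely Euclidean view.
import math

def chord(theta: float) -> float:
    """Straight-line distance between two unit-sphere points separated by
    central angle theta (the distance through the sphere's interior)."""
    return 2 * math.sin(theta / 2)

def geodesic(theta: float) -> float:
    """Great-circle (on-surface) distance for the same pair of points."""
    return theta

near, far = math.radians(1), math.radians(179)
near_gap = geodesic(near) - chord(near)    # negligible at proximate range
far_gap = geodesic(far) - chord(far)       # substantial at remote range
```

The gap grows from roughly 10^-7 at one degree of separation to more than the sphere's radius near the antipode, which is the kind of domain structure a Euclidean-only learner never sees.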

Then, state-of-the-art neural networks, learning protein folding structures, road artifacts, and more, miss the opportunity to take advantage of the additional domain-specific characteristics that are available in the non-Euclidean space.

Then, one of the key goals in the generalizability of AI would be its learnability over the non-Euclidean space.

Reference: “To Artificially Learn Is To Artificially Live” — On Examining The Human Intelligence For Defining The Artificial Intelligence, Kirti Chawla, LinkedIn Publishing, 2019.

Link: https://www.linkedin.com/pulse/artificially-learn-live-examining-human-intelligence-chawla-ph-d–1c/

Section 7: On The Physical Limits Of Learnability

In any given universe, the manifesting laws of physics define the physical limits to the computation through several universal constants. For example, in this universe such constants include the speed of light, quantum scale, and gravitational constant.

Then, based on the aforementioned computability-learnability equivalence principle, the aforementioned laws also determine the physical limits to AI-learnability in this universe, as any Turing machine and neural network must adhere to such limits.
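The flavor of such a physical limit can be sketched numerically. The following illustrative Python sketch computes Seth Lloyd's "ultimate laptop" estimate — the Margolus-Levitin bound of 2E/(πħ) operations per second for one kilogram of matter, with E = mc² (the constants are rounded CODATA values).

```python
# Sketch: Seth Lloyd's "ultimate laptop" — the Margolus-Levitin bound caps
# the operations per second of a system of energy E at 2E / (pi * hbar).
import math

C = 2.99792458e8         # speed of light, m/s (exact by definition)
HBAR = 1.054571817e-34   # reduced Planck constant, J*s

def max_ops_per_second(mass_kg: float) -> float:
    """Upper bound on logical operations per second for a given mass."""
    energy = mass_kg * C ** 2            # rest-mass energy, E = m c^2, in J
    return 2 * energy / (math.pi * HBAR)

# One kilogram of matter: roughly 5.4e50 operations per second.
ultimate_laptop = max_ops_per_second(1.0)
print(f"{ultimate_laptop:.3e} ops/s")
```

Any Turing machine or neural network realized in this universe, classical or quantum, operates under this ceiling, which is the sense in which physics bounds AI-learnability.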

Reference: Ultimate physical limits to computation, Seth Lloyd, Nature, 1999.

Link: https://www.nature.com/articles/35023282

Epilogue

This article, through the established foundations of the information theory, computer science, philosophy, and theoretical physics, presents a view on the limits of the state-of-the-art AI, and the manner in which it may be further developed towards a more general AI.