Original article can be found here (source): Artificial Intelligence on Medium

# Prologue

*Artificial Intelligence* (or *AI*) has amazed the world with its ability to impact all aspects of human life. There are numerous emerging applications that aim to shape human life experiences, thereby presenting the illusion of the limitless expressive power of *AI*.

In this article, the limits of modern *AI* are examined by drawing connections among the fundamental notions of compressibility, computability, decidability, and learnability — to show that such notions are equivalent, and that the resulting computable-learnability takes finite computational resources towards *AI*-learnability.

Such a line of inquiry enables charting a path towards a more general intelligence called *Artificial General Intelligence* (or *AGI*).

While throughout this article the nature of emerging *AI* is examined through its neural-network-based manifestation, the fundamental ideas described here are applicable to all computability-based learning constructs, mechanisms, and paradigms.

# Section 1: On The Connections Between Compressibility And Learnability

From the physics standpoint, in a stochastic system, entropy is a measure of the degree of randomness in the system. For example, super-cooled gas particles have minimal motion. Such particles therefore have minimal kinetic-energy-driven randomness, and thereby minimal entropy.

This notion of entropy translates to information-theoretic entropy through the information content in the aforementioned system. That is, a random variable in the system has more information content if it is more random. Accordingly, the super-cooled gas particles, with minimal kinetic-energy-driven information content, have minimal information-theoretic entropy.

Subsequently, a discretized random variable in the aforementioned system with high entropy has little to no compressible information content, as the cardinality of its set of unique elements equals the cardinality of its full set of elements.

Such a random variable then offers a larger amount of learnable information content, observing that its unique-element set cardinality is greater than that of a random variable in the same system whose information content is highly compressible.

Conversely, if the information content in the aforementioned random variable is compressible by a certain amount, then the learnable information content reduces by that amount correspondingly.
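A minimal sketch of this compressibility-entropy relationship, using Python's standard-library `zlib` compressor (the 4096-byte sample size is an illustrative choice): the high-entropy byte string barely compresses, while the low-entropy one collapses to a small fraction of its size.

```python
import math
import random
import zlib

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy, in bits per symbol, of a byte string."""
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
# High-entropy source: bytes drawn uniformly at random.
high = bytes(random.randrange(256) for _ in range(4096))
# Low-entropy source: a single repeated byte.
low = b"a" * 4096

for name, data in [("high", high), ("low", low)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(name, round(shannon_entropy(data), 2), round(ratio, 3))
```

The random string shows near-maximal entropy (close to 8 bits per byte) and a compression ratio near 1, while the repeated string shows zero entropy and compresses almost entirely away — compressibility and learnable information content shrink together.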

Furthermore, a function on the aforementioned discretized random variable with variable information content and compressibility is computable by a *Turing* machine if and only if:

- The input to the function has countably infinite cardinality.
- There is a finite computing algorithm for the aforementioned function.

Then, by the proposed computability-learnability equivalence principle, such a function can be learned by a neural network.

Conversely, if no finite computing algorithm for the aforementioned function exists then such a function is uncomputable by any *Turing* machine. Then, by the proposed computability-learnability equivalence principle, such uncomputable functions may not be learned by the neural networks.

Thus, learnability is determined by compressibility. That is, unbounded incompressibility precludes learnability, while bounded incompressibility modulates it.

Reference: *A statistical definition of entropy*, MIT Lecture Notes on Thermodynamics and Propulsion, 2002.

Link: https://web.mit.edu/16.unified/www/FALL/thermodynamics/notes/node56.html

# Section 2: On The Connections Between Computability And Learnability

There is a large body of research on the empirical performance of the neural networks but limited understanding of their learnability limits.

Mathematically, the neural networks learn the implicit function embedded in the input data. Such input contains the mapping between domain and range, which manifests as the implicit function within the network through its configurable weights.

If the aforementioned function operates over the input with a countably infinite cardinality and a finite algorithm exists that computes it then it is computable. Naturally, it follows that the neural networks can learn such computable functions. Then, their learnability is at least equivalent to the computability of the *Turing* machines, indicating a computability-learnability equivalence principle.
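To make this direction concrete, here is a minimal sketch assuming the simplest possible network: a single-neuron perceptron that learns the computable Boolean *AND* function purely from domain-range pairs (the learning rate and epoch count are illustrative choices).

```python
# A single-neuron network learns the (computable) Boolean AND function
# from input-output pairs alone, via the perceptron update rule.
def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print([predict(x1, x2) for (x1, x2), _ in AND])  # → [0, 0, 0, 1]
```

Because *AND* is computable (indeed, linearly separable), the perceptron convergence theorem guarantees the weights settle on the correct truth table after finitely many updates.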

However, if the aforementioned function does not have a finite computing algorithm, then it is uncomputable. Functions such as the *Busy Beaver* function and the *Halting Problem* are not computable by any *Turing* machine. Then, the neural networks cannot learn such functions, due to the non-existence of a finite computing algorithm.
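The *Busy Beaver* function is uncomputable in general, yet any fixed small machine can still be simulated. The sketch below, assuming a dictionary-based tape and the known 2-state, 2-symbol champion machine, shows it halting after 6 steps with 4 ones written; what no finite algorithm can do is decide halting, or the *Busy Beaver* values, for all machines.

```python
# A minimal Turing-machine simulator running the 2-state Busy Beaver
# champion, which halts after 6 steps leaving 4 ones on the tape.
def run(delta, state="A", halt="H", max_steps=10_000):
    tape, pos, steps = {}, 0, 0
    while state != halt and steps < max_steps:
        symbol = tape.get(pos, 0)
        write, move, state = delta[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
        steps += 1
    return steps, sum(tape.values())

BB2 = {
    ("A", 0): (1, "R", "B"), ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"), ("B", 1): (1, "R", "H"),
}
print(run(BB2))  # → (6, 4)
```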

For, if the neural networks could learn the aforementioned uncomputable functions — implying that a finite computing algorithm exists that computes them in polynomial time — then they could be structured as *Oracles*, to be invoked by the *Turing* machines to compute the aforementioned uncomputable functions in polynomial time.

Then, through a polynomial-time reduction from any *NP*-Complete problem to a function computed by the aforementioned *Oracle*-enabled *Turing* machine, any *NP*-Complete problem could be solved in polynomial time, thereby proving *P = NP* — a longstanding unresolved problem in computer science. It asks whether the class of problems that can be solved in polynomial time by the deterministic *Turing* machines (the class *P*, or Polynomial time) equals the class of problems whose solutions can be verified in polynomial time by the deterministic *Turing* machines (the class *NP*, or Non-deterministic Polynomial time, of which the subset of *NP*-Complete problems contains the hardest *NP*-class problems).

However, since it is likely that *P* ≠ *NP*, it is, by contradiction, unlikely that the aforementioned uncomputable functions can be learned by the neural networks. This independently suggests the non-existence of a finite computing algorithm for such uncomputable functions.

Alternatively, if *P = NP* then the scope of computability would expand to include the aforementioned uncomputable functions. Then, such previously uncomputable but newly computable functions would be learnable by the neural networks.

Thus, based on the nature of the *P* and *NP* equivalency, the computability and learnability of functions would likely have a tight bound on their equivalency. That is, the neural networks may only learn those functions which the *Turing* machines are able to compute — whether that class excludes the aforementioned uncomputable functions (i.e. if *P ≠ NP*) or includes them (i.e. if *P = NP*).

Reference: *P Vs NP Problem*, Clay Mathematics Institute, 2019.

Link: https://www.claymath.org/millennium-problems/p-vs-np-problem

# Section 3: On The Connections Between Decidability And Learnability

In computability theory, a set *L* — also known as a language — is decidable by a *Turing* machine *M* if, on input *x*, *M* accepts when *x ∈ L* and rejects otherwise. Such *Turing*-decidable languages are recursive languages and, in particular, recursively enumerable languages.

This means that the transition function rules of the aforementioned *Turing* machine are such that they recognize only the elements that are members of the language *L*, of which *x* is an element. Then, decidability is premised on testing the membership of elements in a select discrete set.
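A minimal sketch of such a set-membership decider, assuming an illustrative language: the set of binary strings containing an even number of 1s, decided by a two-state finite automaton (a restricted *Turing* machine) that accepts members and rejects everything else.

```python
# A decider for the language of binary strings with an even number of 1s:
# the automaton toggles between "even" and "odd" on each 1 it reads,
# accepting exactly when it finishes in the "even" state.
def decides(x: str) -> bool:
    state = "even"
    for ch in x:
        if ch not in "01":
            return False  # reject strings outside the alphabet
        if ch == "1":
            state = "odd" if state == "even" else "even"
    return state == "even"

print(decides("1001"))  # membership test: accepted (two 1s)
print(decides("111"))   # membership test: rejected (three 1s)
```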

On closer examination, the contemporary learning problems spanning over the supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning paradigms, respectively, closely intertwine with this notion of the decidability.

For example, the classification problem in the supervised learning setting is a set-membership test. Consider a set of dog images *D* on which a neural network is trained. Then, a cat image *c ∉ D* would be labeled *not a dog*: as it is not a member of the aforementioned set, it will not be recognized as a dog by the neural network.

Generally, by the computability-learnability equivalence principle, for any recursively enumerable language *L* that is *Turing*-decidable, its reducible grammar is the only generative computable function that is learnable by the neural networks. Furthermore, uncomputable functions such as the *Busy Beaver* function and the *Halting Problem* are *Turing*-undecidable. Then, by the computability-learnability equivalence principle, such *Turing*-undecidable languages cannot be learned by the neural networks.

Moreover, based on the *Chomsky* hierarchy, the grammars for such recursively enumerable languages are called unrestricted grammars (or *Type-0* grammars). Then, by the computability-learnability equivalence principle, strings generated by such grammars are the only input on which the neural networks would classify using the pre-defined labels or form the unlabeled clusters.

Conversely, if the aforementioned neural networks take a set of inputs and cluster them on their underlying characteristics, then for each such cluster, and for the overall cluster-group, there exists an unrestricted grammar to which it is reducible and a *Turing* machine by which it is decidable, respectively. Notably, such clusters may be pre-labeled.

Then, the *Turing* machines and neural networks are equivalent in their expressive powers, with the notable exception that the *Turing* machines come pre-configured with their transition functions while the neural networks, with configurable weights, build the learnable computable functions using the input data.

Partial-Credit: Robert F. Dickerson

# Section 4: On The Manifesting Insight From The Connections

From the aforementioned connections, it is evident that there is a convergence in the notions of compressibility, computability, decidability, and learnability, to the extent that equivalences manifest amongst them. For example, there is an equivalence between the notions of computability and learnability.

# A. On The Equivalence Between Computing And Learning Machines

Recall that, in computability theory, *P* is the class of decidable problems that can be solved by a deterministic *Turing* machine using only a polynomial amount of computational time as a function of the input size.

Then, by the computability-learnability equivalence principle, *COMPUTE-LEARN* — a class of learning problems based on computable functions, generative unrestricted grammars, and recursively enumerable languages, spread over the different learning paradigms such as supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning — that can be solved by the aforementioned *Turing*-machine-equivalent neural network, is in *P*.

Interpretatively, such clusters, functions, grammars, languages, and strings may contain juxtaposed bit-level encoded information that could be viewed as a collection of characters, digits, musical notes, pixels, among other interpretations.

Essentially, this means that for each aforementioned *Turing* machine there exists an equivalent neural network, either of which can decide or solve, respectively, the learning problems in the class *COMPUTE-LEARN* by taking only a polynomial amount of time, proportional to the input size. Notably, uncomputable functions such as the *Busy Beaver* function and the *Halting Problem* are not members of the aforementioned class.

Thus, the computable-learnability takes, at most, a countably infinite amount of space and time towards the *AI*-learnability.

# B. On The Probabilistic Learnability

Thus far, the *Turing* machines and neural networks under consideration have been assumed to be deterministic in nature. It is then important to consider the enhancement in their computability and learnability rates — at the cost of their computing and learning fidelities, respectively — by introducing the notion of probability.

In particular, in complexity theory, there exists a probability-based proof system — known as the *Probabilistically Checkable Proof* (or *PCP*) system — which verifies the correctness of the pertinent proof using a fixed amount of randomness and a fixed number of proof queries.

In such a proof system, there are two complementary components — a prover and a verifier. The prover, by examining the solution of a given problem, generates a proof certificate, which the verifier verifies using the aforementioned amount of randomness and number of queries to the proof.

More precisely, the *PCP[R(n), Q(n)]* can be defined as a class of decision problems that are decided, by the randomized *Oracle*-enabled *Turing* machines, through the utilization of *R(n)* amount of randomness and at most *Q(n)* number of proof queries, proportional to the input of size *n*. Such a *Turing* machine satisfies the definitional requirements for the aforementioned verifier.

Then, any language *L ∈ PCP[R(n), Q(n)]* is decided by the corresponding aforementioned verifier *Turing* machine. This means that the prover for the aforementioned class examines an input *x ∈ L* to generate a proof certificate *π*, which the verifier reads to accept if the input *x ∈ L*, by using *R(n)* amount of randomness and at most *Q(n)* number of proof queries, proportional to all inputs *x* of size *n*.

For example, *PCP[0, O(Log[n])]* is a class of decision problems based on the verifier *Turing* machine that uses no amount of randomness (i.e. *R(n)* = 0) and proof queries of at most *O(Log[n])* size (i.e. *Q(n)* = *O(Log[n])*), to accept if an input *x ∈ L*, for any language *L* that is in the aforementioned class.

Where the aforementioned symbols have the following meaning.

*PCP* : The probabilistic checkable proof system.

*n* : The input size.

*R(n)* : The function of *n* defined by selecting random bits from the proof certificate.

*Q(n)* : The function of *n* defined by performing number of queries on the proof certificate.

*O(Log[n])* : The order of at most logarithmically many bits in the input size *n*.

*π* : The proof certificate symbol.

Notably, such a *PCP* class is in *P* — the class of decision problems that can be decided by the deterministic *Turing* machines taking only a polynomial amount of time proportional to the input size. This is noted by observing that the aforementioned verifier *Turing* machine does not require randomness to operate and can try all possible proof certificates of at most *O(Log[n])* queried size to make a decision in a polynomial amount of time, proportional to the size of the input proof certificates.

Separately, one of the numerous ways to construct the proof certificates is for the prover to utilize the error-correction codes such as the *Hamming* code.

Consider, the *Hamming* code function as follows.

*H : {0, 1}^k → {0, 1}^m,*

*s.t. m = k + Log[k],*

*k ≥ 2,*

*k < m ≤ n.*

Where the aforementioned symbols have the following meaning.

*H* : The *Hamming* code function.

*{0, 1}^k* : The binary bit-string of size *k*.

*{0, 1}^m* : The binary bit-string of size *m*.

*→* : The function mapping symbol.

*s.t.* : The such that abbreviation.

Then, by using a parity check matrix, the *Hamming* code function *H* takes the domain of *k*-bit bit-strings and converts them to the range of *m*-bit bit-strings by adding the *Log[k]* check-bits to them at their *2^p* index-locations where *p ∈ [0, Log[k]]*. While the definition of the parity check matrix is not provided here, the scope of the exposition remains complete.

Then, the prover can construct an *m*-bit proof certificate *π* of the input *x*, which the verifier *Turing* machine can examine, at the aforementioned *2^p* index-locations, to accept if *x ∈ L*, a language that is in the *PCP[0, O(Log[n])]* class.

Notably, by examining at most *O(Log[n])* check-bits of the *m*-bit proof certificate *π* over a few iterations, the verifier *Turing* machine probabilistically accepts if *x ∈ L* in the polynomial amount of time, proportional to the size of input *x*.
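A minimal prover/verifier sketch, assuming the standard *Hamming(7,4)* code (which uses three check-bits for four data bits — slightly more than the *Log[k]* approximation above): the prover encodes the data bits into a 7-bit certificate, and the verifier checks only the three parity constraints anchored at the *2^p* index-locations (positions 1, 2, 4) rather than re-deriving the whole input.

```python
# Prover: encode 4 data bits into a 7-bit Hamming(7,4) certificate, with
# parity bits at codeword positions 1, 2, 4 (list indices 0, 1, 3).
def hamming74_encode(d):  # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

# Verifier: recompute each parity from the data bits; accept iff all match.
def verifier(c):
    return (c[0] == c[2] ^ c[4] ^ c[6] and
            c[1] == c[2] ^ c[5] ^ c[6] and
            c[3] == c[4] ^ c[5] ^ c[6])

cert = hamming74_encode([1, 0, 1, 1])
print(verifier(cert))       # valid certificate: accepted
corrupted = cert[:]
corrupted[4] ^= 1           # flip one bit of the proof
print(verifier(corrupted))  # corrupted certificate: rejected
```

Any single-bit corruption violates at least one parity constraint, so spot-checking the check-bit positions suffices to reject it — the spirit of querying only a few certificate bits.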

Then, by the computability-learnability equivalence principle, the aforementioned verifier *Turing* machine equivalent neural network — also known as the *Probabilistic Neural Network*, can be constructed that can learn the aforementioned language *L* to probabilistically classify a given input.

While the *Hamming* code is used in the construction of the aforementioned proof certificate, it should be noted that any finite discrete computable function that allows selection of a subset of the proof certificate bits would work. Then, the aforementioned neural network would require only that subset of the proof certificate bits to classify the input over a few iterations, with a certain probability, in a polynomial amount of time proportional to the size of the input proof certificates.

For example, such a neural network would learn a set of cat images *C* to probabilistically classify a new cat image *c* by examining a subset of its image-pixels over a few iterations.

For applications where such a probabilistic classification is acceptable, a considerable amount of decision speedup may be available albeit with an understanding that a portion of the decisions may not be correct.

The proportion of incorrect decisions in the decision mixture would depend on the number of subset bits used by the aforementioned neural network to probabilistically classify the inputs.

Separately, thus far, the aforementioned neural networks have been used in the supervised learning manner. That is, in the training phase, the neural network probabilistically learns the language *L*, wherein a label *A* is provided for each input *x* that is probabilistically admitted to the language *L*. Then, in the test phase, when a new input *x̄* is probabilistically admitted to language *L*, the label *A* is generated as the output along with the admittance probability of input *x̄* to the language *L*.

However, the aforementioned neural network can also be applied in the unsupervised learning manner, by observing that the labels are not required for the probabilistically admissible inputs to the language *L*.

Then, all such admitted inputs to the language *L* would form a cluster on their proximities to each other using a bit based distance metric (e.g. the *Hamming* distance).
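A minimal sketch of such *Hamming*-distance clustering, assuming a greedy single-pass scheme in which each bit-string joins the first cluster whose representative lies within a fixed distance threshold (here 2, an illustrative choice), and otherwise founds a new cluster.

```python
# Cluster bit-strings by Hamming distance to each cluster's representative
# (its first member); no labels are required.
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def cluster(strings, threshold=2):
    clusters = []
    for s in strings:
        for c in clusters:
            if hamming(s, c[0]) <= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])  # no near representative: new cluster
    return clusters

data = ["0000", "0001", "1111", "1110", "0011"]
print(cluster(data))  # → [['0000', '0001', '0011'], ['1111', '1110']]
```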

Naturally, such clusters, individually and collectively, would reduce to their corresponding probabilistic grammars in the hierarchical manner, wherein the top-level probabilistic grammar would have production rules for the lower-level probabilistic grammars, all the way down to the leaf-level probabilistic grammars which would help generate the aforementioned individual clusters.

Thus, by relaxing the constraint on the computing and learning fidelities of the *Turing* machines and neural networks, respectively, their computability and learnability rates can be improved.

# C. On The Quantum Learnability

There is a key subtlety regarding the nature of the *Turing* machines that must be considered, which subsumes the aforementioned connections and their convergence. In particular, in the computability theory, a *Turing* machine is an instance of the mathematical model of computation as defined by the abstract computing machine.

Notably, the *Turing* machines, described here, have been implicitly assumed to be operating under the classical computing constraints. However, such machines can also operate in the quantum computing realm.

Therein, by considering the state space to be the *Hilbert* space, the transition function to be a collection of unitary matrices that are auto-isomorphisms of the aforementioned state space, and by introducing the mixed states to describe the superposition states in quantum mechanics, the quantum *Turing* machines (or the equivalent quantum circuits) can be constructed. Naturally, the analogous *Quantum Neural Networks* can also be constructed.

Then, by noting that the aforementioned computability-learnability equivalence principle is implicitly an equivalence between the abstract computing and learning machines, respectively, it is discernible that such an equivalence would also hold true in the quantum computing realm.

More importantly, the computable functions, generative unrestricted grammars, and recursively enumerable languages manifesting in the quantum computing realm, would take advantage of the aforementioned mixed states. Then, by the computability-learnability equivalence principle such functions, grammars, and languages that are computable, recognizable, and decidable by the quantum *Turing* machines (or the equivalent quantum circuits) are the only functions, grammars, and languages that would be learnable by the quantum neural networks.

Thus, by switching the underlying abstract computing and learning machines from the classical *Turing* machines and neural networks to quantum *Turing* machines (or equivalent quantum circuits) and quantum neural networks, respectively, the classical limits of the learnability can be transcended. Notably, such a learnability enhancement is due to the introduction of aforementioned mixed states, the quantum transition function, the *Hilbert* state space, along with the mixed state enabled computable functions, generative unrestricted grammars, and recursively enumerable languages, respectively.

# D. On The Divergence Between Data Semantics and Structures

Beyond the nature of the abstract computing and learning machines, is the issue of divergence in data semantics and structures that such computing machines compute over and learning machines learn on, respectively. In particular, it is notable that the neural networks focus on the structure of the input data and do not understand its semantics.

While it is conceivable that the data semantics and structures may converge, in general, they are not equivalent. Then, the learning problems spread over the supervised, self-supervised, semi-supervised, reinforcement, and unsupervised learning paradigms, respectively, could manifest in the manner where the aforementioned data semantics-structures convergence takes place. However, in general, they would tend to diverge.

To further elucidate the notion of the potential data semantics-structures divergence, consider the following example of the graph isomorphism.

In graph theory, the two graphs *A* and *B* are considered isomorphic if there exists an adjacency-preserving one-to-one correspondence between their vertex sets. Mathematically, it is defined as follows.

*F : V(A) ↔ V(B),*

*∀ a, ã ∈ V(A) Λ F(a), F(ã) ∈ V(B),*

*s.t. a — ã iff F(a) — F(ã).*

Where the aforementioned symbols have the following meaning.

*F *: The one-to-one correspondence.

*A*, *B *: The two graphs under consideration for being isomorphic.

*V(A), V(B) *: The vertex sets for the graphs *A* and *B*, respectively.

*a*, *ã *: The vertices belonging to the graph *A.*

*F(a)*, *F(ã) *: The vertices belonging to the graph *B.*

*↔* : The one-to-one correspondence symbol.

*∈* : The belongs to symbol.

*— *: The adjacent vertices symbol.

*Λ* : The and symbol.

*s.t. *: The such that abbreviation.

*iff* : The if and only if abbreviation.

Essentially, the aforementioned mathematical proposition states that the two graphs *A* and *B* are isomorphic if and only if, whenever the two vertices *a* and *ã* belonging to the vertex set *V(A)* are adjacent, the two vertices *F(a)* and *F(ã)* belonging to the vertex set *V(B)* are also adjacent, and vice versa.
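A minimal sketch of the definition above, assuming small graphs: a brute-force search over all one-to-one correspondences *F*, accepting when one preserves adjacency. (The search is exponential in the vertex count; it illustrates the definition, not an efficient algorithm.)

```python
from itertools import permutations

# Test whether an adjacency-preserving one-to-one correspondence F exists
# between the vertex sets of two (simple, undirected) graphs.
def isomorphic(vertices_a, edges_a, vertices_b, edges_b):
    if len(vertices_a) != len(vertices_b) or len(edges_a) != len(edges_b):
        return False
    ea = {frozenset(e) for e in edges_a}
    eb = {frozenset(e) for e in edges_b}
    for perm in permutations(vertices_b):
        F = dict(zip(vertices_a, perm))  # candidate correspondence
        if all(frozenset((F[u], F[v])) in eb for u, v in ea):
            return True
    return False

A = ([1, 2, 3], [(1, 2), (2, 3)])                      # a path
B = (["a", "b", "c"], [("b", "c"), ("a", "b")])        # the same path, relabeled
C = (["x", "y", "z"], [("x", "y"), ("y", "z"), ("z", "x")])  # a triangle
print(isomorphic(*A, *B))  # → True
print(isomorphic(*A, *C))  # → False
```

The path and its relabeling are structurally equivalent even though their vertex labels (their nominal semantics) differ — the divergence this section describes.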

Naturally, based on the nature of the aforementioned one-to-one correspondence, the aforementioned two graphs can be semantically equivalent but structurally different (i.e. semantically-isomorphic), semantically different but structurally equivalent (i.e. structurally-isomorphic), or semantically and structurally equivalent (i.e. semantically-structurally auto-isomorphic).

Then, graph isomorphism shows that the data semantics and structures would tend to diverge. That is, they may converge but are not equivalent.

Notably, such isomorphisms are subsumed within the notions of the abstract computing and learning machines — over which they are computed and on which they are learned, respectively — both of the classical and quantum variety.

References:

*Graph isomorphism*, Wikipedia, 2019.

Link: https://en.wikipedia.org/wiki/Graph_isomorphism

*Probabilistic checking of proofs and hardness of approximation problems*, Sanjeev Arora, University of California, Berkeley Ph.D. Thesis, 1994.

Link: https://www.cs.princeton.edu/~arora/pubs/thesis.pdf

*Quantum theory, the Church-Turing principle and the universal quantum computer*, David Deutsch, Royal Society London, 1985.

Link: https://people.eecs.berkeley.edu/~christos/classics/Deutsch_quantum_theory.pdf

# Section 5: On The Undecidability Of Learnability

In computability theory, the notions of computability and decidability are interchangeable.

That is, for a given computable function and decidable language, a single *Turing* machine can be constructed that computes and decides the aforementioned function and language, respectively.

Conversely, in computability theory, the notions of uncomputability and undecidability are interchangeable. For example, the *Busy Beaver* function and the *Halting Problem* are uncomputable and undecidable by any *Turing* machine, respectively.

More importantly, if *P* ≠ *NP* and the chosen functions, grammars, and languages are uncomputable and undecidable by the *Turing* machines then due to the aforementioned computability-learnability equivalence principle, the neural networks cannot learn such functions, grammars, and languages.

Then, nothing prevents the neural networks from failing to learn such functions, grammars, and languages along a certain dimension. Thus, learnability can be uncomputable and undecidable in that particular sense.

Reference: *Learning can be undecidable*, Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff, Nature Machine Intelligence, 2019.

Link: https://www.nature.com/articles/s42256-018-0002-3

# Section 6: On The Generalizability Of Learnability

Contemporary neural networks operate on the input over the *Euclidean* space. However, this assumption, of space being *Euclidean* in nature, is only true in a proximate-distance based sensing-modality setup.

Generally, observing the space as a sequence of two-dimensional *Euclidean* data frames fails to incorporate remote-distance based kinematic and spatial curvature properties of the measured objects within that space.

For example, in a proximate-distance based *Euclidean* setup, the triangle inequality holds true, which isn't true in the remote-distance setup. Consequently, a neural network trained over the *Euclidean* space would have its neural-weight configuration imbue the deliverable predictions with the *Euclidean*-space-induced spatial bias.
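A minimal sketch of this divergence, assuming points on a sphere of Earth-like radius: for remote points, the intrinsic great-circle distance exceeds the straight-line *Euclidean* chord through space, so a flat-space assumption systematically under-measures distances in the curved space.

```python
import math

R = 6371.0  # sphere radius in km (Earth-like, an illustrative assumption)

def to_xyz(lat, lon):
    """Embed a (latitude, longitude) point, in degrees, into 3-D space."""
    lat, lon = math.radians(lat), math.radians(lon)
    return (R * math.cos(lat) * math.cos(lon),
            R * math.cos(lat) * math.sin(lon),
            R * math.sin(lat))

def euclidean_chord(p, q):
    """Straight-line (flat-space) distance between the embedded points."""
    return math.dist(to_xyz(*p), to_xyz(*q))

def great_circle(p, q):
    """Intrinsic distance along the sphere's surface."""
    a, b = to_xyz(*p), to_xyz(*q)
    cos_angle = sum(x * y for x, y in zip(a, b)) / R**2
    return R * math.acos(max(-1.0, min(1.0, cos_angle)))

p, q = (0.0, 0.0), (0.0, 90.0)  # two remote points on the equator
print(round(euclidean_chord(p, q)), round(great_circle(p, q)))  # → 9010 10008
```

For nearby points the two measures nearly coincide, which is why the *Euclidean* assumption only holds in the proximate-distance setup.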

Then, state-of-the-art neural networks, learning protein folding structures, road artifacts, and more, miss the opportunity of taking advantage of the additional domain-specific characteristics that are available in the non-*Euclidean* space.

Then, one of the key goals in the generalizability of *AI* would be its learnability over the non-*Euclidean* space.

Reference: *“To Artificially Learn Is To Artificially Live” — On Examining The Human Intelligence For Defining The Artificial Intelligence*, Kirti Chawla, LinkedIn Publishing, 2019.

Link: https://www.linkedin.com/pulse/artificially-learn-live-examining-human-intelligence-chawla-ph-d–1c/

# Section 7: On The Physical Limits Of Learnability

In any given universe, the manifesting laws of physics define the physical limits to the computation through several universal constants. For example, in this universe such constants include the speed of light, quantum scale, and gravitational constant.

Then, based on the aforementioned computability-learnability equivalence principle, the aforementioned laws also determine the physical limits to *AI*-learnability in this universe, as any *Turing* machine and neural network must adhere to such limits.

Reference: *Ultimate physical limits to computation*, Seth Lloyd, Nature, 1999.

Link: https://www.nature.com/articles/35023282

# Epilogue

This article, through the established foundations of the information theory, computer science, philosophy, and theoretical physics, presents a view on the limits of the state-of-the-art *AI*, and the manner in which it may be further developed towards a more general *AI*.