Genetic Programming & GPlearn

Original article was published on Artificial Intelligence on Medium


Genetic programming is a type of evolutionary algorithm (which was explained in my last article) and a subset of machine learning. It is a technique for evolving programs: it takes random functions and makes them fit a particular dataset using operations such as crossover and mutation. Genetic programming is capable of taking a series of totally random programs, untrained and unaware of any target function you might have in mind, and making them breed, mutate and evolve their way towards the truth. The basic approach is to let the machine automatically test many simple random programs and then “breed” the most successful ones into new generations.

Genetic programming is used to find how data is functionally related (symbolic regression) and to categorize it (classification). It can assist in designing electric circuits, antennae, and quantum algorithms, and the field remains an area of active research. One of the biggest obstacles to widespread adoption of this genetic machine learning approach is quantifying the fitness function, i.e., the degree to which each new program contributes to reaching the desired goal.

Symbolic regression searches the space of mathematical expressions to find the expression that best fits a given dataset. Mathematical building blocks (functions) are fed into the GP based on the requirements; the genetic program then combines these into various candidate expressions and returns the best-fit one. There are related models as well: a symbolic transformer, which generates new non-linear features automatically, and a symbolic classifier for classification tasks.

Types of Genetic Programs:

  • Tree-based GP
  • Stack-based GP
  • Linear GP (LGP)
  • Grammatical evolution
  • Extended Compact Genetic Programming (ECGP)
  • Cartesian GP (CGP)
  • Probabilistic Incremental Program Evolution (PIPE)
  • Strongly Typed Genetic Programming (STGP)
  • Genetic Improvement of Software for Multiple Objectives (GISMO)


GPlearn implements genetic programming in Python with a scikit-learn-inspired API. By tuning its various parameters we can arrive at an equation that fits the given dataset. GPlearn uses a representation that is a combination of variables, constants, and functions.

· Representation: GPlearn has a set of predefined functions, and we can use any of them to capture the relationship in the data. Note, however, that the choice of functions affects computational speed and run time.

Let us understand this with the following example:

Expression: y = X0² − 3 × X1 + 0.5 ……(1)

We can represent the above expression (1) in other ways, for example as the LISP-style S-expression:

(+ (− (× X0 X0) (× 3 X1)) 0.5)

Here, we have a mixture of variables, constants, and functions. We have the functions addition, subtraction, and multiplication. We also have the variables X0 and X1 and the constants 3 and 0.5. The same expression can also be drawn as a tree, with the root function at the top and the terminals (variables and constants) at the leaves.

In this example, the formula can be interpreted in a recursive manner. If we remove the parentheses, the S-expression flattens to:

+ − × X0 X0 × 3 X1 0.5

Read left to right, each function consumes the results of the sub-expressions that follow it.
GPlearn’s representation is similar to this: it uses flat Python lists to store the functions and terminals. Function names are passed as strings in the “function_set” argument of the estimator. Arithmetic operators (addition, subtraction, multiplication, division), square root, logarithm, absolute value, negation, inverse, min/max, and trigonometric functions are all built in.

One can add their own functions to the function set by using “gplearn.functions.make_function()” in this framework.

· Fitness: As in many other machine learning techniques, we evaluate a score or error for each program. In GPlearn’s symbolic regressor the built-in metrics are ‘mean absolute error’ (MAE, the default), ‘mse’ (mean squared error), and ‘rmse’ (root mean squared error). For the symbolic transformer, there are ‘pearson’, for Pearson’s product-moment correlation coefficient (the default), and ‘spearman’ for Spearman’s rank-order correlation coefficient. The symbolic classifier currently uses ‘log loss’, aka binary cross-entropy loss, as its default metric to optimize.

The fitness calculation is considered the most expensive part of genetic programming.

· Initialization: When a GP run starts, the first generation is unaware that there is any fitness function to be maximized. These programs are a random mix of the available functions and variables and will generally perform poorly. The user can, however, “strengthen” the initial population by providing good initialization parameters. While these parameters may be of some help, note that the most significant factor impacting performance is the number of features in your dataset. Following [5], let us look at some of these parameters.

init_depth: Specifies the range of initial depths that the first generation of programs can have, i.e. the height of the initial trees.

init_method: This can be one of ‘grow’ (nodes are chosen at random from both functions and terminals, allowing for smaller trees than init_depth specifies), ‘full’ (nodes are chosen from the function set until the maximum depth is reached, and then terminals are chosen; this tends to grow “bushy”, symmetrical trees), or ‘half and half’ (program trees are grown through a 50/50 mix of ‘full’ and ‘grow’). For all options, the root node must be a function.

population_size: This controls the number of programs competing in the first generation and every generation thereafter.

Note: the larger the population size, the higher the probability of finding the optimum expression (at the cost of computation time).

· Selection: The symbolic regressor uses tournament selection (explained in the genetic algorithm article). We therefore provide the tournament size as one of the parameters of the symbolic regressor to control the selection pressure.

· Evolution: Crossover and various mutation operations are applied here (for more information refer to the genetic algorithm article). These operators act on the nodes of the fittest programs (trees/expressions) that come out of the selection process, and ultimately yield the best-fit expression as the output.

· Stopping criteria: The stopping criteria define exactly when the run terminates. We provide a stopping criteria parameter to the symbolic regressor: when the best program’s fitness meets it, the GP returns that expression as the output. The other way execution stops is the number of generations, a hard limit on the run. Execution stops whenever either criterion is met.

· Bloat: As explained in [5], an interesting phenomenon often encountered in GP is that program sizes grow larger and larger with no significant improvement in fitness. This is known as bloat and leads to longer and longer computation times with little benefit to the solution.

We can handle bloat in GP by passing parameters such as init_depth, parsimony_coefficient, verbose, and max_samples (each parameter is explained below).

init_depth: Limiting the range of initial depths (as described above) keeps the first generation’s trees small.

parsimony_coefficient: Penalizes large programs so that the smallest adequate expression is preferred, even if it is not a perfect fit for the given dataset (the error rate may be slightly higher, but the resulting programs execute faster). In practice the penalty often causes no noticeable change in the error rate.

verbose: Controls the log information printed by the program. Logging maintains activity records (including the average program length per generation, which is useful for spotting bloat) but may sometimes slow down execution; it can be turned off.

max_samples: The fraction of samples drawn to evaluate each program, i.e. sub-sampling; values below 1.0 speed up fitness evaluation.


References:

  1. Wikipedia contributors. (2020, May 18). Genetic programming. In Wikipedia, The Free Encyclopedia. Retrieved 17:57, June 3, 2020.
  2. About genetic programming: