Computational Chaos : Optimization, Exploration and Exploitation
“In the long history of humankind, those who learned to collaborate and improvise most effectively have prevailed” — in commemoration of Charles Darwin on the ‘Darwin Day’.
In the ‘ Computational Intelligence ‘ domain, we have to consider a situation in which we learn and characterize the uncertainty for decision making. The problem is how to balance between exploring and exploiting while taking decisions and making computations.
Splitting your computational time into ‘ Exploration’ (learning) and ‘Exploitation’ (optimal decisions) is a challenging task, which is the same as to find the most delicious ice-cream you can go to every ice-cream parlor in your city (exploration) and try every ice-cream flavor at every parlor(exploitation). The equilibrium position in the tradeoff also depends on the optimization strategy for the problem and data distribution. Sometimes in problems like ‘Multi-Arm Bandit,’ it not possible to make decisions unless you play and understand the underlying distribution. Consider the scenario where 4 bandit machines are mounted and you want to figure out the machine with a high predicted win value. Upon playing you’ve got D1, D2, D3 and D4 distributions with values that are independent and identically distributed with respect to machines. While playing we will find expected values and variance of each distribution that will enable us to make further decisions, but in some problems, you don’t need to always try it like ‘ Stock Investing Problem, ‘ where you can see other stocks when investing in one that will be useful to you in making further decisions. This also depends upon the type of computational system you choose i.e Single Agent or Multi-agent and depending upon the type of system you can decide your optimization strategy. Also, you can decide your optimization algorithm based on error-function of your model like whether the function is convex (Non-wavy function with single minima ex: x²-1=0) or non-convex (wavy function with multiple local minima ex: x⁴+x³-2x²-2x=0) here we are interested in finding parameters where error function is minimum.
How do I decide which method of optimization fits my problem?
(1) Gradient Descent (Deterministic Hill Climbing) — The most commonly used method of optimization is ‘ Gradient Descent, ‘ which is valid if you have convex error function. Gradient Descent has a problem getting stuck at local minima if the error function is non-convex, but by running Gradient Descent for ‘ exponential time ‘ you can get global minimum which costs more time. Before selecting an optimization approach, you can ask a question ‘’what is important is that you just want to minimize an error function or that it really matters good results?”. If you’re looking for good results, ‘Simulated Annealing’ is a decent choice but it takes some time compared to Gradient Descent. Problem with Neural Net: For neural networks, ‘local minima’ is not a much problem. (i) You get some of the local minima because by permuting hidden layer units or by negating input and output weights in the network, also you get a functionally identical layout that causes problems. When local minima are slightly non-optimal then it doesn’t really matter because the neural network has other major problems such as ‘ overfitting. ‘ The issue is that if we aggressively search for global minima then the outcome will be more overfitting, hence model may perform poorly. (ii) Overfitting typically happens in neural networks as we use too many parameters so it’s not that much depending on local minima. But with a small neural network with few parameters, the local minima problem will come up. We use Gaussian Process Model or Radial Basis Functions in a neural net which sometimes alleviates the problem. (iii) Certain versions, such as Non-Linear Conjugate Gradient-used in Regression with O(N), L-BFGS-are very competitive in classification issues and use’ Hassan Approximation,’ Levenberg Marquardt Algorithm-which is perfect for training small neural networks, and so on. But all these optimizers do have an interesting alternative.
(2) Simulated Annealing- Simulated annealing takes a population and applies each member of the population a reduced random variation. It is a technique for the globally optimal approximation of a given function. Specifically, approximating global optimization in a large search space to an optimization problem is a metaheuristic function of Simulated Annealing. The rate, quantity, and kind of random variation is part of the design process. It makes use of the principle of annealing(temperature) to trade-off exploration and exploitation. This introduces new measurement points at each iteration, but as a number of iterations increase, the temperature drops and the algorithm is less and less likely to ‘ converge ‘ the space towards its current best candidate.
(3) What if we apply Simulated Annealing before Gradient Descent to get a global minimum? (i) Use Simulated Annealing to widely explore the optimization landscape. (before climbing, look at the set of hills and find out the tallest hill to climb) (ii) Get a sense of problem structure, choose a hill to be climbed and perform gradient descent. PROBLEM with this approach — Often, as in neural networks, there are millions of parameters so no point in doing Simulated Annealing. We apply Gradient Descent directly from whatever initial random state we find ourselves in.
(4) There are some population-based training algorithms that are used for optimization, such as Evolutionary Algorithm (EA) and Particle Swarm Optimization (PSO). Iteratively explored search space (which is a collection of many sub-populations) while sharing information and ultimately converging to minima is the basic idea behind population-based learning. Because of the use of multiple starting points (candidate solutions), the chances of converging to global minima are considerably increased. EA and PSO both perform competitively and outperforms the gradient descent in many population-based systems.
(5) Genetic Algorithms — If there are many independent individuals and all other algorithms fail on large datasets to find an optimal solution, then use Genetic Algorithms. In issues such as Travelling Salesman, the solution set contains multiple random paths (gene = orders of cities visited). Our goal is to choose a minimum cost path i.e finding fitness in polynomial time.
The genetic algorithm takes a population to produce a new member and repeatedly takes two members of the population and “mates” them. The new member may be put in a new population, they may eliminate the “parents” from the original population. They could always produce two children. How they are selected, where they “mate,” how many parents are allowed to engage in mating, and many other variables can be altered. Sometimes, Genetic Algorithms converge to local minima but there exist other solutions for this like (i) Mutation: randomly flip bits (ii) cataclysm in genes (iii) Annealing of mating pair.
Two important things we should keep in mind: (i) If crossovers are ‘too strict’ it may cause a problem of ‘local minima’, but if cross overs are too lax then it will lead to slow convergence. (ii) If mutations are ‘too little’ it may cause a problem of ‘local minima’, but too much mutation cause wrong noisy data.
(6) How Genetic Algorithms are different that Simulated Annealing?
In Simulated Annealing, we are gradually reducing random variance to each member of the population where we take two members of the population for mutation repeatedly with Genetic Algorithms.
A Genetic Algorithm maintains a population of possible solutions and selects pairs at each stage of a possible solution, mixes them (crossover) and applies some random (mutation) changes. The algorithm is based on the idea of “survival of the fittest” where the selection process is done according to fitness parameters (usually it is simply the value of the objective function measured using the current solution in optimization problems). The fusion is done in the hope that, when combined, two good solutions will give an even better solution. On the other hand, Simulated Annealing tracks only one solution in the space of possible solutions and decides at each iteration whether to switch to a neighboring solution or to remain in the current one according to certain probabilities (which decay over time). This is different from a heuristic search (say greedy search) because it doesn’t suffer from local optimum problems as it can get out of cases where all neighboring solutions are worse than the current one.
(7) Particle Swarm Optimization (PSO)- Swarm intelligence is the observation that smart-seeming behavior can emerge from complex systems made up of simple “non-intelligent” agents. It’s like a digital computer with millions of “feeble-minded” transistors being able to perform high-speed advanced, precise maths while “working together.” Swarm intelligence becomes more important when the pieces seem to be operating independently of each other, as with a swarm of honeybees finding a new home for the hive, or a fish swimming school, or molecules in a life-generating cell.
Particle Swarm Optimization is an algorithm for metaheuristic optimization that can be applied to a large class of optimization issues. It does not have rigid premises, such as cost function differentiability. It is commonly employed with a single global cost function on cooperative optimization problems. PSO uses stochastic perturbations to find the equilibrium point of a cost function. A number of randomly distributed particles are generated and these particles move randomly until convergence. When traveling, they share their personal best with each other. The particles migrate towards the direction of a combination of their former velocity, personal best and global best. Particles move toward a combination of their former pace, personal best, and global best. Previous velocity integration ensures dependence on preceding measurements, moving toward the global best provides optimum convergence, and the best personal aspect provides search diversity. When optimization continues, there are personal bests and the best global approach to each other and integration.
In short, Gradient Descent and Simulated Annealing are single solution based algorithms, while the Genetic Algorithm and Swarm Analysis are population-based. This means that single solution-based algorithms have only one solution and seek to improve it, whereas population-based algorithms have several solutions-or perhaps hundreds based on population size solutions. In other words, a single solution based algorithm has only (Exploitation), while the population-based algorithm has both ( Exploration and Exploitation ). Therefore, we called a single solution based algorithm as an exploitative algorithm.
Both of these algorithms depend on the nature of the problem, from the definition above. If there is only one optimal solution (Unimodal) to the problem then SA and GA can be used. If there is more than one optimal solution to this problem, GA can find better solutions than SA. Because of multi-model problems, we have to look for both searching capabilities (Exploration and Exploitation ). As described above, Simulated Annealing, Particle Swarm Optimization, and Genetic Algorithms are good global optimization algorithms that navigate well through vast search spaces and, unlike Gradient Descent, do not need any knowledge about the gradient and could be used effectively with objective BlackBox functions and problems that require simulations to run.
Go on and explore more on the ‘Exploration-Exploitation dilemma’.