Optimization Algorithms Explained: Finding the Best Solution Computationally
What Optimization Means in Computing
An optimization problem consists of three components: an objective function that assigns a numerical score to each candidate solution, a set of decision variables that define the solution, and constraints that limit which solutions are feasible. The goal is to find the values of the decision variables that produce the best score while satisfying all constraints. "Best" might mean the lowest cost, the highest efficiency, the minimum error, or the maximum likelihood, depending on the application.
Optimization problems are classified by the properties of the objective function and constraints. Convex problems, where any local minimum is also a global minimum, are the easiest to solve reliably. Nonconvex problems have multiple local minima, making it difficult to guarantee that the algorithm has found the best possible solution. Constrained problems restrict the feasible region, while unconstrained problems allow any values for the decision variables. Continuous problems have real-valued variables, while combinatorial problems have discrete or integer variables.
Gradient-Based Methods
Gradient descent is the most fundamental optimization algorithm. It starts from an initial guess and repeatedly moves in the direction of steepest descent, the negative gradient, taking steps proportional to a learning rate parameter. For smooth, convex functions, gradient descent converges to the global minimum. Its simplicity and low memory requirements make it the foundation of machine learning training, where it optimizes models with millions or billions of parameters.
Newton method for optimization uses second-derivative information (the Hessian matrix) to determine both the direction and the step size, converging quadratically near the solution. The cost is computing and inverting the Hessian, which is impractical for very large problems. Quasi-Newton methods like BFGS approximate the Hessian using gradient information accumulated over successive iterations, achieving superlinear convergence without the full cost of computing second derivatives. L-BFGS, a limited-memory variant, stores only a few vectors instead of the full approximate Hessian, making it suitable for problems with thousands or millions of variables.
Conjugate gradient methods were originally developed for solving linear systems but extend naturally to optimization. They generate search directions that are conjugate with respect to the Hessian, ensuring that progress in one direction does not undo progress in previous directions. For quadratic objective functions, conjugate gradient converges in at most n steps for n variables. For general nonlinear functions, it provides an effective compromise between the simplicity of gradient descent and the cost of Newton methods.
Constrained Optimization
Linear programming (LP) optimizes a linear objective function subject to linear constraints. Despite the name, linear programming is not about computer programming but about planning and optimization. The simplex method, developed by George Dantzig in 1947, navigates from vertex to vertex of the feasible polytope, improving the objective at each step. Interior-point methods take a fundamentally different approach, traveling through the interior of the feasible region along a central path. Both methods solve large-scale linear programs efficiently and are used in supply chain optimization, scheduling, resource allocation, and network flow problems.
Quadratic programming (QP) handles quadratic objectives with linear constraints. It is central to support vector machines in machine learning, portfolio optimization in finance, and model predictive control in engineering. Active-set methods and interior-point methods are the standard solvers.
Nonlinear constrained optimization uses techniques like Lagrange multipliers, penalty methods, and sequential quadratic programming (SQP). Lagrange multipliers convert the constrained problem into an unconstrained one by introducing dual variables that penalize constraint violations. SQP solves a sequence of quadratic programming subproblems, each approximating the original nonlinear problem near the current iterate. Interior-point methods generalize from linear programming to handle nonlinear objectives and constraints.
Global Optimization and Metaheuristics
For nonconvex problems with many local minima, gradient-based methods can get trapped at suboptimal solutions. Global optimization methods aim to find the best solution over the entire feasible region, though they typically cannot guarantee optimality for general problems.
Simulated annealing draws inspiration from the physical process of cooling a metal. The algorithm accepts moves that worsen the objective function with a probability that decreases over time, allowing it to escape local minima early in the search and gradually settle into a good solution. The cooling schedule controls the trade-off between exploration and exploitation.
Genetic algorithms maintain a population of candidate solutions that evolve through selection, crossover, and mutation, mimicking biological evolution. Solutions with better objective values are more likely to survive and reproduce, and crossover combines features from two parent solutions to create offspring. Genetic algorithms are effective for problems where the objective function is noisy, discontinuous, or lacks gradient information.
Particle swarm optimization models a swarm of particles moving through the solution space, each influenced by its own best-known position and the best position found by the entire swarm. Bayesian optimization builds a probabilistic model of the objective function and uses it to decide where to evaluate next, making it efficient for expensive-to-evaluate functions where each function evaluation might take hours of simulation time.
Optimization in Scientific Applications
In structural engineering, topology optimization determines the optimal distribution of material within a design domain to minimize weight while maintaining structural integrity. This technique has produced organic-looking structures for aerospace components, bridges, and medical implants that are lighter and stronger than traditionally designed alternatives.
In computational chemistry, geometry optimization finds the molecular configuration with the lowest energy, determining the equilibrium structure of molecules and the transition states of chemical reactions. These calculations use gradient-based methods applied to the potential energy surface computed from quantum mechanics.
In machine learning, training a neural network is an optimization problem: minimize the difference between the network predictions and the training data by adjusting millions of weight parameters. Stochastic gradient descent and its variants (Adam, AdaGrad, RMSProp) are the standard optimizers, using random mini-batches of training data to estimate gradients efficiently.
In inverse problems, optimization finds model parameters that best match observed data. Seismologists use optimization to infer Earth internal structure from earthquake recordings. Medical imaging uses optimization to reconstruct images from indirect measurements. Weather forecasting uses data assimilation, a form of constrained optimization, to combine model predictions with observations.
Optimization algorithms are the bridge between mathematical models and practical solutions, and choosing the right algorithm depends on whether the problem is convex or nonconvex, smooth or discontinuous, small or large-scale.