Numerical Error Analysis: Understanding and Controlling Computational Errors
Types of Numerical Error
Round-off error originates from the finite precision of floating-point arithmetic. Computers represent real numbers using the IEEE 754 standard, which allocates 64 bits for double-precision numbers: 1 sign bit, 11 exponent bits, and 52 mantissa bits. This provides roughly 15 to 16 significant decimal digits. Any real number that cannot be represented exactly in this format is rounded to the nearest representable value, introducing an error of about 1 part in 10 to the 16th for each operation.
Individually, round-off errors are tiny. But they accumulate over billions of arithmetic operations in a large simulation, and certain operations amplify them dramatically. Catastrophic cancellation occurs when subtracting two nearly equal numbers: the leading digits cancel, leaving only the trailing digits that are contaminated by round-off. For example, computing the difference between 1.000000000000001 and 1.000000000000000 leaves only the last digit, which may be dominated by accumulated round-off errors from earlier calculations.
Truncation error arises from mathematical approximations used in numerical methods. When a derivative is approximated by a finite difference, the Taylor series terms that are dropped constitute the truncation error. When an integral is approximated by a quadrature rule, the difference between the exact integral and the approximation is truncation error. Unlike round-off error, truncation error can be made arbitrarily small by refining the discretization, using a finer grid or smaller time step.
Modeling error comes from simplifications in the mathematical model itself. A rigid-body assumption when the body actually deforms, a laminar flow assumption when the flow is turbulent, or a constant-property assumption when properties vary with temperature are all sources of modeling error. These errors cannot be reduced by numerical refinement because the equations being solved are not a perfect representation of the physical system. Only comparison with experimental data can assess modeling error.
Floating-Point Arithmetic in Detail
Understanding IEEE 754 arithmetic is essential for numerical error analysis. A double-precision number is stored as a sign times a mantissa (between 1 and 2) times 2 raised to an exponent. The smallest positive double-precision number is about 5 times 10 to the minus 324. The largest is about 1.8 times 10 to the 308. The machine epsilon, the smallest number such that 1 plus epsilon is distinguishable from 1 in floating-point arithmetic, is about 2.2 times 10 to the minus 16 for double precision.
Floating-point addition is not associative: (a + b) + c may differ from a + (b + c) due to rounding. This means that the order in which numbers are added affects the result. When summing a large number of terms, adding from smallest to largest typically produces a more accurate result than adding in arbitrary order. Compensated summation algorithms like Kahan summation track the accumulated round-off error and correct for it, achieving nearly full precision for long sums.
Special values in IEEE 754 include positive and negative infinity (produced by overflow or division by zero), negative zero (which compares equal to positive zero but can produce different results in some operations), and NaN (Not a Number, produced by operations like 0/0 or the square root of a negative number). NaN has the unusual property that it compares unequal to everything, including itself. Encountering NaN in a computation usually indicates a bug or a degenerate case that requires special handling.
Condition Numbers and Stability
The condition number of a problem measures how sensitive the output is to small changes in the input. A well-conditioned problem has a small condition number: small input perturbations produce small output changes. An ill-conditioned problem has a large condition number: small input perturbations can produce large output changes. The condition number is a property of the mathematical problem, not of the algorithm used to solve it. Even a perfect algorithm cannot produce accurate results for an ill-conditioned problem with uncertain inputs.
For solving the linear system Ax = b, the condition number of the matrix A determines how sensitive the solution x is to perturbations in A or b. If the condition number is 10 to the 6th, then a perturbation of 1 part in 10 to the 16th (machine epsilon) in the input can produce an error of 1 part in 10 to the 10th in the output. A condition number near 10 to the 16th means that no digits in the solution are reliable. Condition estimation (computing or bounding the condition number) is an important diagnostic in numerical linear algebra.
An algorithm is numerically stable if it produces results that are the exact solution to a slightly perturbed problem. In other words, a stable algorithm introduces perturbations no larger than round-off error. The backward error of a computation measures the size of the perturbation needed to make the computed result exact. Backward error analysis, developed by James Wilkinson in the 1960s, is the standard framework for analyzing the stability of numerical algorithms.
Convergence and Order of Accuracy
A numerical method converges if its solution approaches the exact solution as the discretization is refined (grid spacing approaching zero, time step approaching zero). The order of accuracy describes how fast this convergence occurs. A first-order method has errors proportional to h (the grid spacing or time step). A second-order method has errors proportional to h squared. A fourth-order method has errors proportional to h to the fourth power.
The order of accuracy has profound practical implications. Halving the grid spacing with a second-order method reduces the error by a factor of 4. With a fourth-order method, the error drops by a factor of 16. This means that a fourth-order method on a coarse grid can be more accurate than a second-order method on a fine grid, while using far fewer computational resources. This trade-off between method order and grid resolution is one of the central design decisions in scientific computing.
Richardson extrapolation uses solutions at two or more grid resolutions to estimate the discretization error and produce a more accurate answer. If a second-order method gives answers A1 on a coarse grid and A2 on a grid with half the spacing, then (4 times A2 minus A1) divided by 3 cancels the leading error term and produces a fourth-order accurate result. This technique is also used to verify the order of accuracy of a numerical implementation by checking that the observed convergence rate matches the theoretical prediction.
Practical Error Control
Convergence studies are the most important practical tool for assessing numerical error. Run the computation at several grid resolutions or time step sizes and compare the results. If the results are not changing significantly with refinement, the discretization error is likely small. If they are still changing, finer resolution is needed. Plot the error (or the difference between successive resolutions) against the grid spacing on a log-log plot. The slope of this plot gives the observed order of accuracy, which should match the theoretical order.
Error bounds and estimates provide rigorous or heuristic assessments of accuracy without computing the exact solution. A posteriori error estimates, computed from the numerical solution itself, can guide adaptive mesh refinement by identifying where the error is largest. In finite element analysis, the Zienkiewicz-Zhu error estimator and the residual-based error estimator are widely used to drive mesh adaptation.
Sensitivity analysis determines how much the results change when input parameters are varied within their uncertainty ranges. If the results are insensitive to uncertain parameters, the predictions are robust. If they are highly sensitive, then the input uncertainties must be reduced or the results must be reported with appropriate uncertainty bounds.
Every computational result is an approximation, and error analysis provides the tools to quantify that approximation, distinguish meaningful predictions from numerical noise, and design computations that achieve the required accuracy with minimum computational cost.