Ethics in Machine Learning
How Bias Enters ML Systems
Historical bias is one of the most common sources. ML models learn patterns from historical data, and if that history reflects discrimination, the model reproduces it. Amazon's experimental hiring model, trained on 10 years of resume data from a male-dominated tech industry, learned to penalize resumes that contained the word "women's" (as in "women's chess club captain"). The model was not programmed to discriminate; it learned discrimination from the data.
Representation bias occurs when training data does not represent the full population the model will serve. Facial analysis systems trained predominantly on lighter-skinned faces tend to perform significantly worse on darker-skinned faces. A 2018 MIT study (the Gender Shades project) found that commercial gender classification systems from major companies had error rates of at most 0.8% for lighter-skinned males but as high as 34.7% for darker-skinned females. The algorithms worked as designed; the data behind them was unrepresentative.
Measurement bias occurs when the features or labels in the data are imperfect proxies for what you actually want to measure. Using arrest records as a proxy for criminal behavior encodes policing patterns: neighborhoods with more police presence generate more arrests regardless of actual crime rates. A model trained on arrest data will predict higher crime in over-policed neighborhoods, reinforcing the pattern.
Aggregation bias occurs when a single model is used for groups that should be modeled separately. A medical model trained on the general population may perform poorly on specific subgroups (elderly patients, pregnant women, children) whose biology differs from the majority of the training data.
Fairness in Machine Learning
Defining "fairness" for an ML system is harder than it sounds, because several widely used definitions of fairness are mathematically incompatible. Except in special cases, you cannot satisfy all of them simultaneously.
Demographic parity requires that the model's positive prediction rate is equal across groups. If 20% of male applicants are predicted to be hired, then 20% of female applicants should be as well. This definition ignores whether the groups differ in qualification rates.
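As a minimal sketch of how this definition can be checked, the following computes the positive prediction rate per group and the largest gap between groups. The function names and the toy data are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Return the fraction of positive predictions for each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: 1 = predicted hire, 0 = predicted reject.
preds = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "m", "f", "f", "f", "f", "f"]

rates = selection_rates(preds, groups)
gap = demographic_parity_difference(preds, groups)
print(rates, gap)
```

Note that the check never looks at the true labels, which is exactly why demographic parity is insensitive to differences in qualification rates.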
Equal opportunity requires that the model's true positive rate is equal across groups. Among actually qualified candidates, the same proportion should be correctly identified regardless of group. This allows different overall prediction rates if the qualification rates differ.
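A rough sketch of the corresponding check: restrict attention to the actually positive cases and compare the share that the model identified in each group. The function name and the toy labels below are assumptions made for illustration.

```python
def true_positive_rates(y_true, y_pred, groups):
    """TPR per group: of the actual positives, the share predicted positive."""
    hits, actual_pos = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:  # only actually qualified candidates count
            actual_pos[g] = actual_pos.get(g, 0) + 1
            hits[g] = hits.get(g, 0) + int(p == 1)
    return {g: hits[g] / actual_pos[g] for g in actual_pos}

# Hypothetical toy data: group "a" has 4 qualified candidates, group "b" has 2.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
groups = ["a"] * 6 + ["b"] * 6

tpr = true_positive_rates(y_true, y_pred, groups)
tpr_gap = max(tpr.values()) - min(tpr.values())
print(tpr, tpr_gap)
```

In this toy data the model finds 3 of 4 qualified candidates in group "a" but only 1 of 2 in group "b", so equal opportunity is violated even though no unqualified candidate is selected in either group.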
Calibration requires that when the model says "80% probability," the actual positive rate is 80% for all groups. A model is well-calibrated if its confidence scores mean the same thing regardless of which group the individual belongs to.
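One simple way to inspect this, sketched here under the assumption that scores lie in [0, 1] and labels are 0 or 1, is to bin the scores and compare the mean score in each bin with the observed positive rate, separately per group. The binning scheme and names are illustrative choices, not a standard API.

```python
def calibration_by_group(scores, y_true, groups, n_bins=5):
    """For each (group, bin): (mean score, observed positive rate, count)."""
    buckets = {}
    for s, t, g in zip(scores, y_true, groups):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        buckets.setdefault((g, b), []).append((s, t))
    table = {}
    for key, items in buckets.items():
        mean_score = sum(s for s, _ in items) / len(items)
        pos_rate = sum(t for _, t in items) / len(items)
        table[key] = (mean_score, pos_rate, len(items))
    return table

# Hypothetical toy data: everyone receives a score of 0.8.
scores = [0.8] * 10
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
groups = ["a"] * 5 + ["b"] * 5

table = calibration_by_group(scores, y_true, groups)
for (group, bin_index), (mean_score, pos_rate, n) in sorted(table.items()):
    print(group, bin_index, round(mean_score, 2), pos_rate, n)
```

Here a score of 0.8 corresponds to an 80% positive rate in group "a" but only 40% in group "b", so the score does not mean the same thing for both groups. With real data, bins need enough examples per group for the observed rates to be meaningful.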
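Chouldechova and, independently, Kleinberg, Mullainathan, and Raghavan showed in 2016 that, except in trivial cases (equal base rates across groups or a perfect predictor), calibration cannot be satisfied together with equal error rates across groups, that is, equal opportunity extended to false positives as well. Demographic parity adds further tension whenever groups differ in base rates. This means every fairness-aware ML system involves a choice about which type of fairness to prioritize, and that choice is a value judgment, not a technical decision.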
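One way to see where the conflict comes from is an identity that holds within any group for any binary classifier. The symbols are introduced here for illustration and do not appear elsewhere in this text: p is the group's base rate of true positives, PPV is the positive predictive value (a calibration-like quantity for thresholded predictions), and FPR and FNR are the false positive and false negative rates.

```latex
\begin{align*}
\mathrm{PPV} &= \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}} \\[4pt]
\Longrightarrow\quad
\mathrm{FPR} &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\end{align*}
```

If two groups have different base rates p but the classifier has the same PPV and the same FNR in both, the right-hand side differs between the groups, so their false positive rates must differ too. Equalizing all three is only possible when the base rates match or the predictor makes no errors.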
Real-World Consequences
Criminal justice: The COMPAS recidivism prediction tool, used in US courts to inform sentencing and bail decisions, was found by ProPublica to have significantly different false positive rates for Black and white defendants. Black defendants who did not reoffend were roughly twice as likely to be incorrectly classified as high-risk. The tool's developer argued the model was calibrated (the risk scores meant the same thing across races), a dispute that illustrates the conflict between fairness definitions.
Healthcare: A widely used hospital algorithm for allocating healthcare resources was found to systematically assign lower risk scores to Black patients than white patients with the same level of illness. The root cause was using healthcare spending as a proxy for health needs. Because Black patients had historically received less healthcare spending (due to access barriers), the model learned that they needed less care, perpetuating the disparity.
Finance: Apple Card's credit limit algorithm was investigated by New York's financial regulator after reports that it offered significantly lower credit limits to women than to their husbands, even when the women had higher credit scores. While the investigation found no intentional or unlawful discrimination, the episode highlighted how ML systems can produce disparate outcomes that no engineer intended and that are difficult to explain after the fact.
Practical Steps for Responsible ML
Audit training data. Before training, examine the demographic composition of your data. Are all relevant groups represented? Are labels consistent across groups? Does the data reflect historical discrimination that the model should not perpetuate? Document these findings and the decisions made to address them.
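A first pass at such an audit can be as simple as tabulating, for each group, its share of the dataset and its rate of positive labels. The sketch below assumes records are dictionaries; the field names "group" and "label" and the toy records are hypothetical.

```python
from collections import Counter

def audit_groups(records, group_key, label_key):
    """Summarize each group's size, share of the data, and positive-label rate."""
    counts = Counter(r[group_key] for r in records)
    positives = Counter(r[group_key] for r in records if r[label_key] == 1)
    total = len(records)
    return {
        g: {"n": n, "share": n / total, "positive_rate": positives[g] / n}
        for g, n in counts.items()
    }

# Hypothetical toy dataset: group "b" is both rare and never labeled positive.
records = (
    [{"group": "a", "label": 1}] * 3
    + [{"group": "a", "label": 0}] * 3
    + [{"group": "b", "label": 0}] * 2
)

report = audit_groups(records, "group", "label")
for group, stats in sorted(report.items()):
    print(group, stats)
```

A small share or an extreme positive rate does not prove the data is biased, but it flags where to look more closely and what to document.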
Choose fairness metrics upfront. Decide which definition of fairness applies to your application before building the model. For a hiring tool, equal opportunity (qualified candidates are equally likely to be identified regardless of group) is often a reasonable choice. For a lending tool, calibration (risk scores should be equally accurate across groups) may be more appropriate. Document the choice and the reasoning.
Test across subgroups. Evaluate model performance separately for each demographic group, not just overall. A model with 95% overall accuracy might have 98% accuracy for one group and 80% for another. Per-group metrics reveal disparities that aggregate metrics hide.
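The figures above can be reproduced with synthetic data. In this sketch the group sizes (500 and 100) are an assumption chosen so that 98% and 80% per-group accuracy average out to 95% overall; the function name is illustrative.

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall accuracy, {group: accuracy})."""
    correct, total = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(t == p)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Synthetic data: group A has 490 of 500 correct, group B has 80 of 100 correct.
y_true = [1] * 600
y_pred = [1] * 490 + [0] * 10 + [1] * 80 + [0] * 20
groups = ["A"] * 500 + ["B"] * 100

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall, per_group)
```

Because group A is five times larger, its high accuracy dominates the aggregate and masks the 20% error rate in group B.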
Use fairness-aware algorithms. Libraries like Fairlearn (Microsoft), AIF360 (IBM), and What-If Tool (Google) provide tools for measuring and mitigating bias. Techniques include preprocessing (rebalancing training data), in-processing (adding fairness constraints to the optimization), and post-processing (adjusting model outputs to equalize metrics across groups).
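To make the preprocessing idea concrete, here is a minimal sketch of reweighing in the style of Kamiran and Calders, written in plain Python rather than with the API of any of the libraries named above. Each training example gets the weight P(group) * P(label) / P(group, label), which makes group and label statistically independent in the weighted data. The function names and toy data are assumptions for illustration.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight each example by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    pos = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    return pos / total

# Hypothetical toy data: group "a" is labeled positive far more often than "b".
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(rate_a, rate_b)
```

The resulting weights would typically be passed to a learner that accepts per-example weights. Reweighing equalizes label rates in the training data; whether the trained model's predictions end up fair still has to be measured on held-out data.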
Maintain human oversight. ML models should inform human decisions, not replace them, particularly for high-stakes applications. A judge should consider a risk score alongside other evidence, not defer to it. A doctor should use a diagnostic model as a second opinion, not a final answer. The model's role is to augment human judgment, not eliminate it.
Environmental Impact
Training large ML models consumes significant energy. Training GPT-3 consumed an estimated 1,287 MWh of electricity and produced approximately 552 tons of CO2 equivalent. One 2019 estimate found that training a large Transformer model with neural architecture search could emit about as much CO2 as five cars over their entire lifetimes. As models grow larger, this environmental cost tends to grow with them.
Responsible ML includes considering whether the compute budget is justified by the expected benefit. Do you need to train a 175-billion-parameter model from scratch, or would fine-tuning an existing model achieve similar results at a fraction of the cost? Energy-efficient architectures, smaller models trained on higher-quality data, and carbon-offset programs are all part of the ethical calculus.
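A back-of-the-envelope comparison can make this trade-off tangible. The sketch below derives the carbon intensity implied by the GPT-3 figures quoted above (552 tons over 1,287 MWh) and applies it to a smaller run. The 10 MWh fine-tuning figure is purely hypothetical, and real emissions depend heavily on the data center and the local grid.

```python
def implied_intensity(emissions_t, energy_mwh):
    """Emissions per unit of energy, in tons of CO2e per MWh."""
    return emissions_t / energy_mwh

def estimate_emissions(energy_mwh, intensity_t_per_mwh):
    """Estimated emissions in tons of CO2e for a given energy use."""
    return energy_mwh * intensity_t_per_mwh

# Figures quoted in the text for training GPT-3.
GPT3_ENERGY_MWH = 1287
GPT3_EMISSIONS_T = 552

# Hypothetical energy budget for a fine-tuning run (an assumption, not a measurement).
HYPOTHETICAL_FINE_TUNE_MWH = 10

intensity = implied_intensity(GPT3_EMISSIONS_T, GPT3_ENERGY_MWH)
fine_tune_t = estimate_emissions(HYPOTHETICAL_FINE_TUNE_MWH, intensity)
print(round(intensity, 3), round(fine_tune_t, 2))
```

Under these assumptions the implied intensity is roughly 0.43 tons of CO2e per MWh, so a run that needs 10 MWh would emit on the order of 4 tons rather than hundreds.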
ML systems inherit and amplify biases from their training data, producing discriminatory outcomes at scale. Building ethical ML requires auditing data for bias, choosing appropriate fairness metrics upfront, testing across subgroups, using fairness-aware tools, and maintaining human oversight. These are engineering practices, not optional extras.