How to Assess AI Risks
Risk assessment for AI systems draws on established risk management frameworks from other domains, including cybersecurity, financial services, and medical device regulation, but requires adaptations for the unique properties of machine learning systems. Unlike traditional software, AI systems can produce different outputs for similar inputs, their behavior changes when retrained on new data, their decision logic resists complete specification, and their performance degrades in ways that are difficult to predict when operating conditions shift from training conditions. These properties mean that risk assessment must be more continuous, more empirical, and more attuned to distributional effects than traditional software risk management.
Define the Risk Scope and Stakeholders
Begin by mapping the full scope of who and what the AI system affects. Direct users interact with the system intentionally: a physician using a diagnostic tool, a recruiter using a screening algorithm, a consumer using a recommendation engine. Affected subjects are people whose outcomes are influenced by the system without directly using it: job applicants screened by AI, patients whose treatment is shaped by algorithmic recommendations, individuals scored by credit algorithms. Indirect stakeholders include communities affected by aggregate patterns (neighborhoods subjected to predictive policing) and society broadly (when AI systems shape information access, economic opportunity, or political discourse).
Classify the decision stakes. The NIST AI Risk Management Framework treats risk as a combination of the likelihood of an adverse event and the magnitude of its consequences. The EU AI Act uses a risk tier system: unacceptable, high, limited, and minimal risk. For practical assessment, ask: what happens when this system makes an error? If the worst case is a mildly irrelevant product recommendation, the risk is low. If the worst case is a wrongful arrest, denied medical treatment, or financial ruin, the risk is high. The assessment rigor should scale with the stakes. A recommendation engine for movie suggestions does not need the same depth of risk assessment as a sentencing recommendation algorithm.
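One way to make "rigor scales with the stakes" operational is a small scoring rule. The sketch below is illustrative only: the 1-to-5 scales, the cutoffs, and the names RiskScenario and assessment_depth are assumptions made for this example, not values taken from NIST or the EU AI Act.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskScenario:
    """One error scenario, scored on hypothetical 1-5 scales."""
    description: str
    likelihood: int  # 1 = rare, 5 = almost certain
    severity: int    # 1 = negligible, 5 = catastrophic or irreversible


def assessment_depth(scenario: RiskScenario) -> str:
    """Map a scenario to an assessment depth: 'light', 'standard', or 'full'."""
    for value in (scenario.likelihood, scenario.severity):
        if not 1 <= value <= 5:
            raise ValueError("likelihood and severity must be between 1 and 5")
    # Assumed policy: worst-case harms get a full review even if unlikely.
    if scenario.severity == 5:
        return "full"
    score = scenario.likelihood * scenario.severity
    if score >= 12:   # illustrative cutoff
        return "full"
    if score >= 5:    # illustrative cutoff
        return "standard"
    return "light"


movie = RiskScenario("irrelevant movie suggestion", likelihood=4, severity=1)
sentencing = RiskScenario("unjust sentencing recommendation", likelihood=2, severity=5)
print(assessment_depth(movie), assessment_depth(sentencing))
```

The severity override reflects the point above that a rare but ruinous error still warrants deep assessment; an organization would set its own scales and cutoffs.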
Identify the relevant risk categories for this specific application. Technical risks include accuracy failures, bias, robustness to adversarial inputs, and performance degradation over time. Operational risks include deployment errors, inadequate human oversight, insufficient monitoring, and unclear accountability. Ethical risks include fairness violations, privacy infringement, lack of transparency, and consent issues. Societal risks include displacement effects, concentration of power, environmental impact, and systemic risks from widespread adoption. Not every category applies to every system, but explicitly considering each category prevents blind spots.
Evaluate Capabilities and Failure Modes
Test the system's performance under the full range of conditions it will encounter in deployment, not just the curated conditions of benchmark datasets. This means testing with data from different geographic regions, demographic groups, time periods, hardware configurations, and edge cases that are rare but consequential. A medical imaging AI must work with images from different scanner manufacturers, different imaging protocols, different patient populations, and different disease prevalences than its training set. A hiring algorithm must perform fairly across applicants from different educational backgrounds, career paths, and demographic groups.
Identify the system's failure modes systematically. Where does accuracy degrade? Under what conditions does the system produce confidently wrong outputs? What inputs cause unpredictable behavior? Failure mode analysis borrows from engineering reliability practices: for each component and each operating condition, ask "what could go wrong, how likely is it, and how severe would the consequences be?" Document known failure modes explicitly, because deployers need this information to implement appropriate safeguards and users need it to calibrate their trust appropriately.
Evaluate calibration: does the system's confidence score accurately reflect its probability of being correct? A system that reports 90% confidence should be right approximately 90% of the time. Poorly calibrated systems are dangerous because they provide false signals about reliability. A medical AI that expresses high confidence in incorrect diagnoses is more dangerous than one that correctly signals uncertainty, because clinicians use confidence scores to decide when to trust the system and when to seek additional information. Calibration should be evaluated separately across subgroups, because systems can be well-calibrated on average while being poorly calibrated for specific populations.
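A common way to quantify calibration is expected calibration error (ECE): bin predictions by confidence, then take the weighted average gap between each bin's accuracy and its mean confidence. The following is a minimal sketch using equal-width bins; the function names and the choice of ten bins are assumptions for illustration, and the per-group helper shows the subgroup evaluation described above.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def ece_by_group(confidences, correct, groups, n_bins=10):
    """Compute ECE separately for each subgroup label."""
    split = defaultdict(lambda: ([], []))
    for conf, ok, group in zip(confidences, correct, groups):
        split[group][0].append(conf)
        split[group][1].append(ok)
    return {g: expected_calibration_error(c, k, n_bins) for g, (c, k) in split.items()}
```

A model can show a low overall ECE while one subgroup's ECE is large, which is the pattern the per-group breakdown is meant to expose.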
Audit for Bias and Fairness
Disaggregate performance metrics across all relevant demographic subgroups. At minimum, evaluate accuracy, false positive rate, false negative rate, precision, and recall separately for each group. Use established fairness toolkits (Fairlearn, AI Fairness 360, Aequitas) to compute standard fairness metrics including demographic parity difference, equalized odds difference, and predictive parity. Compare these metrics against thresholds defined by applicable regulations (the four-fifths rule for employment, under which a group's selection rate below 80% of the highest group's rate is generally treated as evidence of adverse impact; disparate impact standards for lending) and organizational fairness standards.
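To show what these disaggregated metrics compute, here is a hand-rolled sketch for binary labels and predictions. It is a stand-in for illustration, not the API of Fairlearn, AI Fairness 360, or Aequitas; the function names are hypothetical. The selection rate gap corresponds to demographic parity difference, and the impact ratio is the quantity compared against 0.8 under the four-fifths rule.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, false positive rate, and false negative rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for group, c in counts.items():
        n = sum(c.values())
        rates[group] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),
        }
    return rates


def selection_rate_gap(rates):
    """Largest minus smallest selection rate (demographic parity difference)."""
    sel = [r["selection_rate"] for r in rates.values()]
    return max(sel) - min(sel)


def impact_ratio(rates):
    """Smallest selection rate divided by the largest (four-fifths rule input)."""
    sel = [r["selection_rate"] for r in rates.values()]
    return min(sel) / max(sel) if max(sel) > 0 else float("nan")
```

For production audits, the established toolkits are preferable because they also handle multi-class outcomes, confidence intervals, and intersectional groups.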
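Test for proxy discrimination by measuring the association between model predictions and protected attributes, even when those attributes are not direct model inputs. If approval decisions correlate noticeably with race (for example, r > 0.3, an illustrative threshold rather than a regulatory one) even though race is not an input feature, proxy variables may be encoding racial information and the model warrants closer investigation. Complement this with counterfactual analysis: change a single protected attribute (e.g., flip gender) while holding all other features constant, and measure whether the prediction changes. Significant counterfactual sensitivity indicates that the model depends directly on the protected attribute. Because this test holds every other feature fixed, it applies only when the attribute is a model input and cannot detect proxy effects on its own; a model that passes it can still discriminate through proxies, which is why the correlation check is also needed.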
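Both checks can be sketched in a few lines. In this example, predict is any callable that maps a feature dictionary to a decision, and the names counterfactual_flip_rate and pearson are hypothetical helpers written for illustration. Note that a flip rate of zero only rules out direct use of the attribute as an input, so the correlation between predictions and the attribute still needs to be checked separately.

```python
import math


def counterfactual_flip_rate(predict, rows, attribute, swap):
    """Fraction of rows whose prediction changes when only `attribute` is swapped."""
    if not rows:
        raise ValueError("rows must be non-empty")
    changed = 0
    for row in rows:
        flipped = dict(row)  # copy so all other features stay identical
        flipped[attribute] = swap[row[attribute]]
        if predict(row) != predict(flipped):
            changed += 1
    return changed / len(rows)


def pearson(xs, ys):
    """Pearson correlation, e.g. between 0/1 predictions and a 0/1 attribute."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


# Toy applicants and a toy model that (improperly) reads the gender field.
applicants = [
    {"income": 60, "gender": "f"},
    {"income": 40, "gender": "f"},
    {"income": 40, "gender": "m"},
    {"income": 70, "gender": "m"},
]
biased_model = lambda row: row["income"] >= 50 or row["gender"] == "m"
rate = counterfactual_flip_rate(biased_model, applicants, "gender", {"f": "m", "m": "f"})
```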
Evaluate the training data for representation gaps, historical bias, and label quality. Are all relevant populations adequately represented? Do historical labels reflect discriminatory patterns that the model should not reproduce? Were labels created by annotators with potential biases? Document the findings and the decisions made in response: if a bias was identified, what mitigation was applied, how effective was it, and what residual bias remains?
Conduct Adversarial and Security Testing
Test the system's robustness to adversarial inputs designed to cause misclassification or undesirable behavior. For image models, use established adversarial attack methods (FGSM, PGD, C&W) to generate adversarial examples and measure the model's accuracy on perturbed inputs. For language models, test for prompt injection (inputs designed to override system instructions), jailbreaking (inputs designed to bypass safety training), and data extraction (inputs designed to elicit memorized training data). The severity of adversarial vulnerabilities depends on the deployment context: an adversarial vulnerability in a content recommendation system is less dangerous than one in a security-critical system.
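To illustrate how a gradient-based attack such as FGSM works, the sketch below applies it to a toy logistic regression classifier in plain Python. FGSM perturbs each input by epsilon in the direction of the sign of the loss gradient; for logistic regression with cross-entropy loss, that gradient with respect to the input is (p - y) * w. The model, weights, and helper names here are assumptions for illustration; real evaluations of image models would use a deep learning framework and an established attack library.

```python
import math


def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights, bias, x, y, epsilon):
    """Return x + epsilon * sign(dL/dx), where dL/dx = (p - y) * w."""
    p = predict_proba(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def accuracy(weights, bias, xs, ys):
    """Fraction of inputs classified correctly at a 0.5 threshold."""
    hits = sum(int(predict_proba(weights, bias, x) >= 0.5) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def robust_accuracy(weights, bias, xs, ys, epsilon):
    """Accuracy after every input is perturbed by FGSM with budget epsilon."""
    adversarial = [fgsm(weights, bias, x, y, epsilon) for x, y in zip(xs, ys)]
    return accuracy(weights, bias, adversarial, ys)
```

Reporting clean accuracy alongside robust accuracy at several epsilon values shows how quickly performance collapses as the perturbation budget grows.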
Evaluate misuse potential: how could this system be deliberately misused by bad actors? A text generation model could be used to generate disinformation at scale. A voice cloning system could be used for impersonation fraud. A facial recognition system could be used for stalking. The assessment should identify the most plausible and most harmful misuse scenarios, evaluate the technical barriers to misuse, and assess whether the system provides meaningful incremental capability beyond what is already available through other means.
Test for data poisoning susceptibility if the system will be retrained on user-generated data. Can a malicious actor influence the model's behavior by injecting specially crafted data into the training pipeline? How much poisoned data would be required to produce meaningful behavior change? What monitoring would detect a poisoning attempt? Systems that learn continuously from user feedback are particularly susceptible because the feedback mechanism provides a direct channel for influencing model behavior.
Assess Societal and Systemic Impacts
Evaluate second-order effects that extend beyond the system's direct outputs. If this system is widely adopted, what are the aggregate consequences? A single AI hiring tool may produce modest efficiency gains. If every employer uses similar AI hiring tools, the aggregate effect could be systematic exclusion of candidates who do not match the patterns these tools favor, narrowing the talent pipeline in ways that individual deployment decisions do not reveal. Systemic risk assessment asks: what happens when this technology is no longer one system among many but shared infrastructure?
Consider displacement effects: which human roles, relationships, or capabilities does this system replace or diminish? A customer service AI displaces call center workers. A diagnostic AI may reduce the development of diagnostic reasoning skills in young physicians who rely on it. An AI writing tool may reduce the reading and writing skills that develop through sustained practice. These effects may be justified by the benefits, but they should be explicitly acknowledged and weighed rather than ignored.
Evaluate environmental costs for systems that require significant compute resources. Estimate training energy, inference energy at projected scale, and the hardware lifecycle costs including manufacturing and e-waste. For systems deployed at large scale, inference energy costs may exceed training costs within months of deployment. Consider whether efficiency optimizations (distillation, quantization, caching) could reduce the environmental footprint without significantly degrading performance.
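The comparison between training and inference energy is simple arithmetic once the inputs are estimated. The sketch below computes how many days of serving it takes for cumulative inference energy to match the one-time training energy. Every figure in the usage example is a made-up placeholder, not a measurement of any real system, and the function name is hypothetical.

```python
def inference_breakeven_days(training_kwh, kwh_per_1k_requests, requests_per_day):
    """Days until cumulative inference energy equals one-time training energy."""
    daily_inference_kwh = kwh_per_1k_requests * requests_per_day / 1000
    if daily_inference_kwh <= 0:
        raise ValueError("inference energy per day must be positive")
    return training_kwh / daily_inference_kwh


# Placeholder inputs: 500,000 kWh to train, 0.5 kWh per 1,000 requests,
# and 10 million requests per day (5,000 kWh of inference per day).
days = inference_breakeven_days(
    training_kwh=500_000,
    kwh_per_1k_requests=0.5,
    requests_per_day=10_000_000,
)
print(days)  # 100.0
```

Rerunning the calculation with the per-request energy of a distilled or quantized variant shows how much an optimization moves the break-even point.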
Establish Ongoing Monitoring and Review
Risk assessment is not a one-time pre-launch activity. Deploy monitoring systems that continuously track the metrics evaluated during pre-launch assessment: accuracy, fairness metrics, calibration, and confidence distributions. Set alert thresholds that trigger investigation when metrics drift beyond acceptable ranges. Data drift monitoring compares the distribution of incoming data to the training distribution, flagging when the system is operating on data significantly different from what it was trained on. Performance monitoring tracks accuracy and error rates over time, detecting gradual degradation that may not be apparent from day-to-day observation.
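One widely used drift statistic for a single numeric feature is the population stability index (PSI), which compares the binned distribution of incoming data against a reference sample. The sketch below is a minimal version under stated assumptions: equal-width bins derived from the reference sample, out-of-range values clamped to the edge bins, and a small floor to avoid taking the log of zero. The alert levels often quoted for PSI (roughly 0.1 and 0.25) are rules of thumb, not standards, and should be tuned per feature.

```python
import math


def population_stability_index(reference, current, n_bins=10, floor=1e-6):
    """PSI between two samples of one numeric feature (0 means identical bins)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(idx, n_bins - 1))  # clamp out-of-range values
            counts[idx] += 1
        # Floor empty bins so the logarithm below stays defined.
        return [max(c / len(values), floor) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_alert(reference, current, threshold=0.25):
    """True when PSI exceeds the (assumed) investigation threshold."""
    return population_stability_index(reference, current) > threshold
```

In a monitoring job, the reference sample would typically be drawn from the training data and the current sample from a recent window of production inputs.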
Schedule regular reassessments: quarterly for high-risk systems, annually for lower-risk systems. Each reassessment should re-evaluate the original risk findings using current production data, assess whether new risk categories have emerged that were not considered in the original assessment, and evaluate whether mitigations implemented during the original assessment remain effective. The regulatory landscape also evolves, and periodic review ensures ongoing compliance with requirements that may have changed since the original assessment.
Maintain incident response procedures for when the system produces harmful outputs or when monitoring detects significant risk. Define who has authority to modify or suspend the system, what information must be preserved for investigation, how affected individuals will be notified, and what corrective actions will be taken. Test these procedures periodically, because incident response that has never been practiced will be slow and disorganized when a real incident occurs.
AI risk assessment requires a systematic approach spanning technical evaluation, fairness auditing, adversarial testing, societal impact analysis, and ongoing monitoring. The depth of assessment should match the stakes of the application, and the process must continue after deployment because real-world conditions reveal risks that pre-launch testing cannot fully anticipate.