How to Clean Data in Python

Updated May 2026
Data cleaning is the process of detecting and correcting errors, inconsistencies, and missing values in datasets so that analysis produces accurate results. In practice, data cleaning consumes 50 to 80 percent of the time in any data analysis project, because real-world data is never collected perfectly: instruments malfunction, humans make entry errors, formats vary between sources, and values go missing for reasons both random and systematic. pandas provides the tools to handle all of this programmatically, making cleaning reproducible rather than manual.

The most dangerous data quality problems are the ones you do not notice. An analysis that runs without errors on dirty data produces plausible but wrong results. A temperature column that reads as text because one row contains "N/A" instead of a number will silently compute a string concatenation instead of a mean. A date column with inconsistent formats will sort incorrectly. Duplicate rows will inflate sample sizes and deflate p-values. The systematic approach of diagnosing, fixing, and validating prevents these invisible errors from corrupting results.

Step 1: Diagnose Data Quality Issues

Start every cleaning session with a systematic audit. df.info() reveals column types and null counts. Any numeric column showing as 'object' type likely contains non-numeric entries that prevented automatic type conversion. df.isnull().sum() counts missing values per column. df.isnull().mean() shows the fraction missing, which determines strategy: columns missing less than 5% can usually be imputed, columns missing 5 to 30% need careful investigation of the missingness pattern, and columns missing more than 50% may need to be dropped entirely unless the missingness itself is informative.

df.describe() reveals statistical anomalies. A minimum of -999 or a maximum of 99999 suggests sentinel values used to represent missing data (common in legacy systems and government datasets). A mean that is wildly different from the median indicates skewness or outliers. A standard deviation of zero means the column is constant and uninformative. For categorical columns, df['column'].value_counts() reveals inconsistent entries: "United States", "US", "U.S.", "usa", and "United States of America" might all represent the same category.

df.duplicated().sum() counts exact duplicate rows. df.duplicated(subset=['id']).sum() counts rows with duplicate IDs specifically, which indicates either data entry errors or unintended merges. df[df.duplicated(subset=['id'], keep=False)].sort_values('id') shows all duplicate-ID rows for manual inspection. For time series data, check for irregular spacing: df['timestamp'].diff().describe() reveals gaps and jumps in the time index that may indicate missing observations.

Cross-column consistency checks catch errors that single-column inspection misses. Verify that end dates are after start dates: assert (df['end_date'] >= df['start_date']).all(). Verify that percentages sum to 100: assert np.isclose(df[['part1', 'part2', 'part3']].sum(axis=1), 100).all(). Verify that children's ages are less than adults' ages in family data. These domain-specific checks require knowing what the data represents, which is why data cleaning is partly a domain expertise task, not purely a technical one.

Step 2: Handle Missing Data

Understand why data is missing before choosing a strategy. Missing Completely at Random (MCAR) means the probability of being missing is unrelated to any variable. Missing at Random (MAR) means the missingness depends on observed variables (e.g., older patients are less likely to complete follow-up surveys). Missing Not at Random (MNAR) means the missingness depends on the missing value itself (e.g., patients with the worst outcomes drop out of a study). MCAR and MAR can be handled with imputation. MNAR requires modeling the missingness mechanism explicitly, which is substantially more complex.

Dropping missing values is appropriate when missingness is rare and MCAR. df.dropna() drops all rows with any missing value. df.dropna(subset=['critical_column']) drops rows only where a specific column is missing. df.dropna(thresh=5) keeps rows that have at least 5 non-null values. For columns, df.drop(columns=df.columns[df.isnull().mean() > 0.5]) removes columns where more than 50% of values are missing. Always check that dropping does not introduce bias: if the remaining data has different demographics or conditions than the full dataset, the analysis results may not generalize.

Imputation replaces missing values with estimated values. df['column'].fillna(df['column'].median()) fills with the column median (robust to outliers). df.groupby('group')['value'].transform(lambda x: x.fillna(x.median())) fills with the group-specific median, preserving between-group differences. For time series, df['column'].interpolate(method='linear') fills gaps with linear interpolation between neighbors. df['column'].fillna(method='ffill') carries the last observed value forward, appropriate when the true value is likely unchanged (e.g., a setting that was not re-measured).

Creating missing indicators preserves information about the pattern of missingness. df['column_missing'] = df['column'].isnull().astype(int) creates a binary flag before imputation. Including this flag as a predictor in models allows the model to learn whether the value being missing is itself predictive of the outcome. This is especially valuable when missingness is MAR or MNAR, because the pattern of missingness carries information that imputation alone discards.

Step 3: Fix Types and Formats

Type conversion resolves the most common data quality issue. df['count'] = pd.to_numeric(df['count'], errors='coerce') converts to numeric, replacing non-numeric entries with NaN. The errors='coerce' parameter is critical: without it, a single non-numeric value in a column of thousands prevents the entire conversion. After coercion, df[df['count'].isnull()] shows which rows had problematic values, enabling you to investigate and fix the root cause.

Date parsing standardizes temporal data. pd.to_datetime(df['date']) automatically parses common date formats. For ambiguous formats (is "01/02/03" January 2, February 1, or 2001-02-03?), specify the format explicitly: pd.to_datetime(df['date'], format='%m/%d/%Y'). For mixed formats within a single column, try pd.to_datetime(df['date'], format='mixed', dayfirst=False). After parsing, extract components: df['year'] = df['date'].dt.year, df['month'] = df['date'].dt.month, df['day_of_week'] = df['date'].dt.dayofweek.

String standardization eliminates invisible inconsistencies. df['name'] = df['name'].str.strip() removes leading and trailing whitespace (the most common invisible data quality problem). df['name'] = df['name'].str.lower() normalizes case. df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True) collapses multiple spaces into one. df['code'] = df['code'].str.replace(r'[^A-Za-z0-9]', '', regex=True) removes all non-alphanumeric characters from code fields. For categorical standardization, create a mapping dictionary: mapping = {'US': 'United States', 'USA': 'United States', 'U.S.': 'United States'}, then df['country'] = df['country'].map(mapping).fillna(df['country']).

Unit normalization ensures consistent measurement scales. If some temperature values are in Celsius and others in Fahrenheit (identified by implausibly high values for Celsius), convert: df.loc[df['temp'] > 50, 'temp'] = (df.loc[df['temp'] > 50, 'temp'] - 32) * 5/9. If weight is recorded in both kilograms and pounds, standardize to one unit. If distances mix miles and kilometers, convert. Document every unit conversion in a comment or Markdown cell so that the transformation is auditable.

Step 4: Detect and Handle Outliers

Statistical detection identifies values that are unusually far from the bulk of the data. The Z-score method flags values more than 3 standard deviations from the mean: outliers = np.abs(stats.zscore(df['value'])) > 3. The IQR method flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR: Q1, Q3 = df['value'].quantile([0.25, 0.75]), IQR = Q3 - Q1, then outliers = (df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR). The IQR method is more robust because it is not affected by the outliers it is trying to detect.

Domain knowledge determines whether outliers are errors or genuine extreme values. A body temperature of 45 C is almost certainly a measurement error or unit confusion. A stock return of 50% in a single day is extreme but real. A wind speed of 500 km/h is physically impossible at the surface. Never blindly remove statistical outliers without asking: could this value be real? For scientific data, investigate outlier causes before deciding: instrument malfunction, data entry error, or genuine extreme observation. Record your reasoning for each decision.

Handling strategies depend on the cause. Remove values that are clearly errors (physically impossible, known instrument failures): df = df[~outliers]. Cap values at reasonable thresholds (Winsorizing): df['value'] = df['value'].clip(lower=lower_bound, upper=upper_bound). Transform skewed data: df['log_value'] = np.log1p(df['value']) compresses the range while preserving order. Keep genuine extreme values and use robust statistical methods (median instead of mean, non-parametric tests instead of parametric) that are less affected by extreme values.

Multivariate outliers are observations that are unusual when multiple variables are considered together, even if each individual variable is within normal range. A person who is 150 cm tall and weighs 100 kg has a normal height and a normal weight individually, but the combination is unusual. Mahalanobis distance measures how far a multivariate observation is from the center of the data distribution. from scipy.spatial.distance import mahalanobis, computed with the inverse covariance matrix, identifies these multivariate outliers that univariate methods miss.

Step 5: Validate and Document

Assertion checks verify that cleaning achieved the intended result. Place assertions after each cleaning step: assert df.isnull().sum().sum() == 0 (no missing values remaining). assert df.duplicated().sum() == 0 (no duplicates). assert (df['age'] >= 0).all() and (df['age'] <= 120).all() (ages are physically plausible). assert df['percentage'].between(0, 100).all() (percentages are in range). These assertions serve as executable documentation of data quality requirements. If an assertion fails, it means the cleaning was incomplete or a new data quality issue appeared.

Compare before and after statistics to ensure cleaning did not introduce bias. print(f"Rows: {len(df_raw)} -> {len(df_clean)} ({len(df_raw) - len(df_clean)} removed)"). Compare means, medians, and distributions of key variables before and after cleaning. If the mean of a variable changed substantially after removing outliers or imputing missing values, investigate whether the change is justified or whether the cleaning introduced systematic distortion. Plot before-and-after histograms side by side for visual comparison.

Document every cleaning decision. In a Jupyter notebook, use Markdown cells between cleaning code to explain what was done and why: "Removed 47 rows where temperature > 60C because the sensor was known to malfunction above this threshold per equipment manual." In a script, add comments for non-obvious cleaning steps. Create a cleaning log DataFrame that records each modification: what changed, how many rows were affected, and the justification. This log is essential for peer review, regulatory compliance, and your future self trying to understand why certain decisions were made.

Save the cleaned dataset with a clear naming convention. df_clean.to_csv('data_cleaned_2026-05-18.csv', index=False) includes the date for version tracking. Keep the raw data untouched in a separate directory: never modify the original data files. The cleaning script itself is the documentation of the transformation from raw to clean. Anyone should be able to take the raw data, run the cleaning script, and reproduce the exact same clean dataset. This reproducibility is non-negotiable for scientific research.

Key Takeaway

Never clean data blindly. Every cleaning decision (dropping, imputing, transforming) should be based on understanding why the data quality issue exists, and every decision should be documented so it can be reviewed and reproduced.