How to Use pandas for Data Analysis

Updated May 2026

pandas is Python's standard library for working with structured tabular data, providing DataFrame and Series objects that handle the messy reality of real-world datasets: mixed column types, missing values, inconsistent formatting, and the need to group, merge, reshape, and aggregate before analysis. If you work with data that fits in a spreadsheet or database table, pandas is the tool that replaces manual Excel work with reproducible, scriptable analysis that handles millions of rows in seconds.

Import pandas with the standard alias: import pandas as pd. The two core data structures are Series (a one-dimensional labeled array, like a single spreadsheet column) and DataFrame (a two-dimensional labeled table, like a full spreadsheet). DataFrames have row labels (the index) and column labels (column names). Most pandas work involves DataFrames, and most operations return new DataFrames, making it natural to chain operations together: df.dropna().groupby('condition').mean() drops missing values, groups by condition, and computes means in a single readable expression.

Step 1: Load Data into DataFrames

pd.read_csv() is the most common entry point, handling comma-separated, tab-separated, and other delimited text files. pd.read_csv('experiment.csv') reads the file with automatic header detection, type inference, and handling of common missing value representations (empty cells, "NA", "NaN", "null"). Parameters customize the parsing: sep='\t' for tab-separated files, header=None if there is no header row, usecols=['col1', 'col2'] to load only specific columns, nrows=1000 to load only the first 1000 rows for quick inspection, and parse_dates=['date_column'] to automatically parse date strings into datetime objects.

pd.read_excel('data.xlsx', sheet_name='Sheet1') reads Excel files. pd.read_sql('SELECT * FROM experiments', connection) reads directly from SQL databases using a SQLAlchemy or sqlite3 connection object. pd.read_json('data.json') reads JSON files or API responses. pd.read_parquet('data.parquet') reads Parquet columnar files, which are 2 to 10 times smaller than CSV and 10 to 100 times faster to load for large datasets. pd.read_clipboard() reads whatever is currently on the clipboard, useful for quickly importing data copied from a web page or document.

Creating DataFrames directly from Python data structures is useful for testing and small datasets. pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [95, 87]}) creates a DataFrame from a dictionary of lists. pd.DataFrame(numpy_array, columns=['x', 'y', 'z']) wraps a NumPy array with column names. pd.DataFrame.from_records(list_of_dicts) converts a list of dictionaries (common from JSON APIs) into a DataFrame where dictionary keys become column names.

Step 2: Explore and Understand Your Data

Before any analysis, understand what you have. df.shape returns (rows, columns) as a tuple. df.head() shows the first 5 rows. df.tail() shows the last 5. df.info() prints column names, non-null counts, and data types, immediately revealing columns with missing data and incorrect types (a numeric column read as 'object' type usually contains non-numeric entries that need cleaning). df.describe() computes count, mean, std, min, 25%, 50%, 75%, and max for all numeric columns, giving a statistical overview that highlights outliers and unexpected distributions.

Column-level exploration reveals data quality issues. df['column'].value_counts() shows the frequency of each unique value, essential for categorical data (treatment groups, categories, yes/no fields). df['column'].nunique() returns the number of unique values. df.isnull().sum() returns the count of missing values per column. df.dtypes shows the data type of each column. df.duplicated().sum() counts duplicate rows. These quick inspections, which take seconds even on large datasets, prevent the common mistake of running analyses on data with undetected quality problems.

Visual exploration combines pandas with matplotlib. df['column'].hist() plots a histogram. df.plot.scatter(x='col1', y='col2') creates a scatter plot. df.boxplot(column='measurement', by='group') creates grouped box plots. df.corr() computes the correlation matrix between all numeric columns, which when passed to Seaborn's heatmap function (sns.heatmap(df.corr(), annot=True)) produces an annotated correlation matrix that immediately reveals relationships and potential multicollinearity in your variables.

Step 3: Clean and Transform Data

Missing value handling is the first cleaning step. df.dropna() removes rows with any missing values. df.dropna(subset=['critical_column']) removes rows only where a specific column is missing. df.fillna(0) replaces all missing values with zero. df.fillna(df.mean()) fills with column means. df['column'].fillna(method='ffill') forward-fills missing values with the last known value, appropriate for time series data where the previous measurement is the best estimate. df.interpolate() fills gaps using linear interpolation, appropriate for regularly sampled measurements.

Type conversion fixes columns that were read as the wrong type. df['count'] = df['count'].astype(int) converts to integer. pd.to_numeric(df['value'], errors='coerce') converts to numeric, replacing non-numeric entries with NaN rather than raising an error, useful for identifying and isolating problematic rows. pd.to_datetime(df['date']) parses date strings into datetime objects. df['category'] = df['category'].astype('category') converts to pandas' memory-efficient categorical type, which also enables meaningful sorting for ordinal categories.

String operations clean text data. df['name'].str.strip() removes leading and trailing whitespace. df['name'].str.lower() converts to lowercase for consistent comparison. df['name'].str.replace('old', 'new') performs text replacement. df['name'].str.contains('pattern') returns a boolean Series for filtering. df['name'].str.split(',', expand=True) splits a column into multiple columns. These string methods are vectorized, operating on every element of the column simultaneously without explicit loops.

Column operations create new derived data. df['bmi'] = df['weight'] / (df['height'] / 100) ** 2 creates a calculated column. df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 65, 100], labels=['child', 'young_adult', 'middle', 'senior', 'elderly']) bins continuous data into categories. df.rename(columns={'old_name': 'new_name'}) renames columns. df.drop(columns=['unnecessary_column']) removes columns. df.assign() creates new columns and returns a new DataFrame, useful for chaining: df.assign(bmi=lambda d: d.weight / (d.height / 100) ** 2).

Step 4: Filter, Group, and Aggregate

Boolean filtering selects rows meeting specified conditions. df[df['temperature'] > 37] selects rows where temperature exceeds 37. df[(df['group'] == 'treatment') & (df['response'] > 0)] combines conditions with & (and), | (or), and ~ (not). df.query('temperature > 37 and group == "treatment"') provides a string-based alternative that some find more readable. df.loc[df['status'].isin(['active', 'pending'])] selects rows where status is one of several values. df.nlargest(10, 'score') selects the 10 highest-scoring rows.

Grouping with groupby splits data by one or more categorical columns and applies functions to each group independently. df.groupby('treatment').mean() computes the mean of all numeric columns for each treatment group. df.groupby(['treatment', 'timepoint']).mean() groups by two variables simultaneously. df.groupby('group').agg({'measurement': ['mean', 'std', 'count'], 'duration': 'median'}) applies different aggregation functions to different columns. The result is a new DataFrame with one row per group, ready for comparison or plotting.

Custom aggregation functions handle cases where built-in functions are insufficient. df.groupby('group').apply(lambda g: g['value'].quantile(0.95)) computes the 95th percentile per group. df.groupby('group').transform('mean') returns a Series the same size as the original DataFrame where each value is replaced by its group mean, useful for computing within-group deviations: df['deviation'] = df['value'] - df.groupby('group')['value'].transform('mean').

Pivot tables reshape data from long to wide format. df.pivot_table(values='measurement', index='subject', columns='timepoint', aggfunc='mean') creates a table with subjects as rows, timepoints as columns, and measurements as values, exactly the format many statistical tests and visualizations require. pd.crosstab(df['category1'], df['category2']) creates a frequency table counting occurrences of each combination, the starting point for chi-squared tests of independence.

Step 5: Merge and Reshape Datasets

Merging combines DataFrames that share a common column. pd.merge(df1, df2, on='subject_id') performs an inner join, keeping only rows present in both DataFrames. The how parameter controls join type: 'inner' (default, intersection), 'outer' (union), 'left' (keep all rows from df1), 'right' (keep all rows from df2). pd.merge(df1, df2, left_on='id', right_on='subject_id') handles cases where the matching columns have different names. This is the same logic as SQL joins and is essential for combining data from multiple sources.

Concatenation stacks DataFrames vertically or horizontally. pd.concat([df1, df2, df3]) stacks DataFrames with the same columns on top of each other (axis=0, the default), aligning by column names. pd.concat([df1, df2], axis=1) places DataFrames side by side, aligning by index. When combining data from multiple experimental runs or files, reading each into a DataFrame and concatenating is the standard pattern: pd.concat([pd.read_csv(f) for f in file_list]) loads and combines many files in one line.

Reshaping converts between long format (one observation per row) and wide format (one subject per row with multiple columns). pd.melt(df, id_vars=['subject'], value_vars=['day1', 'day2', 'day3']) converts wide format to long format by unpivoting columns into rows, producing a DataFrame with subject, variable (which day), and value columns. This long format is required by most statistical functions and visualization libraries. df.pivot(index='subject', columns='day', values='measurement') does the reverse, converting long to wide.

Exporting results completes the workflow. df.to_csv('results.csv', index=False) writes to CSV (index=False prevents writing the row index as an extra column). df.to_excel('results.xlsx', sheet_name='Analysis') writes to Excel. df.to_parquet('results.parquet') writes to Parquet for efficient storage and fast re-loading. df.to_latex() generates a LaTeX table for direct inclusion in papers. df.to_clipboard() copies to the clipboard for pasting into other applications. Each export function has parameters for controlling formatting, encoding, compression, and what to include.

Key Takeaway

pandas transforms messy, real-world data into clean, analysis-ready DataFrames through a consistent chain of operations: load, explore, clean, filter, group, aggregate, export. Master this workflow and you can handle any tabular dataset.

Step 1: Load Data into DataFrames

Step 2: Explore and Understand Your Data

Step 3: Clean and Transform Data

Step 4: Filter, Group, and Aggregate

Step 5: Merge and Reshape Datasets

Related Articles

How to Use NumPy

How to Clean Data in Python

How to Use matplotlib

Data Collection Methods