Geospatial Analysis with Python
Geographic data adds a spatial dimension to standard data analysis. Instead of asking "what is the average income by county?" you ask "what is the average income by county, and do neighboring counties have similar values?" The spatial component reveals patterns invisible in tabular data: clustering, spatial autocorrelation, distance decay, and boundary effects. Python's geospatial stack brings the full power of pandas (filtering, grouping, merging) to spatial data while adding operations that understand geography: "find all hospitals within 10 km of this location," "calculate the area of each watershed," or "identify which census tracts overlap with this flood zone."
Step 1: Load and Inspect Spatial Data
GeoPandas extends pandas with a geometry column that stores spatial information. import geopandas as gpd, then gdf = gpd.read_file('boundaries.shp') reads shapefiles, GeoJSON, GeoPackage, and dozens of other spatial formats. The result is a GeoDataFrame: a regular pandas DataFrame with an extra geometry column containing Shapely geometry objects (Point, LineString, Polygon, MultiPolygon). All standard pandas operations work: gdf.head(), gdf.describe(), gdf.groupby(), gdf.merge(). The geometry column adds spatial capabilities: gdf.area returns the area of each polygon, gdf.length returns the length of each line, gdf.centroid returns the center point of each feature.
Common data sources for geographic analysis include Natural Earth (free global boundaries and physical features at multiple resolutions), the US Census Bureau TIGER/Line shapefiles (US states, counties, tracts, roads, water bodies), OpenStreetMap (global crowdsourced geographic data, downloadable by region), and government open data portals that provide spatial datasets for environmental, demographic, and infrastructure analysis. gpd.read_file() handles URLs directly: gdf = gpd.read_file('https://example.com/data.geojson') downloads and parses in one step.
Inspect spatial data visually with gdf.plot(), which produces a matplotlib figure with each geometry drawn as a shape. gdf.plot(column='population', cmap='YlOrRd', legend=True) creates a choropleth map colored by population. gdf.plot(figsize=(12, 8), edgecolor='black', linewidth=0.5) controls the figure size and boundary styling. gdf.boundary.plot() draws only the outlines. gdf.explore() (requires folium) creates an interactive web map in a Jupyter notebook where you can zoom, pan, and click features to see their attributes.
Step 2: Work with Coordinate Reference Systems
Every spatial dataset uses a Coordinate Reference System (CRS) that defines how 2D coordinates on a map relate to locations on the 3D Earth. The most common CRS is EPSG:4326 (WGS84), which uses latitude and longitude in degrees. gdf.crs shows the current CRS. gdf.crs == 'EPSG:4326' checks whether the data uses WGS84. If two datasets use different CRS values, spatial operations between them will produce wrong results because the coordinates do not align. Always check CRS before combining datasets.
Projected CRS values convert the curved Earth surface to a flat plane, enabling accurate distance and area calculations. In a geographic CRS (degrees), one degree of longitude covers 111 km at the equator but 0 km at the poles, so distance calculations in degrees are meaningless. gdf_projected = gdf.to_crs('EPSG:3857') converts to Web Mercator (meters). gdf_projected = gdf.to_crs(epsg=32617) converts to UTM Zone 17N (meters, accurate for eastern US). After projection, gdf_projected.area returns areas in square meters and gdf_projected.length returns lengths in meters. Choose a projection appropriate for your study region: UTM zones for local studies, equal-area projections for area comparisons, equidistant projections for distance measurements.
Setting a CRS differs from transforming between CRS values. gdf.set_crs('EPSG:4326') tells GeoPandas what CRS the existing coordinates are in, without changing the coordinates. gdf.to_crs('EPSG:3857') transforms the coordinates from the current CRS to a new one, changing the actual numbers. If you set the wrong CRS and then transform, the results will be incorrect because the transformation assumes the starting coordinates are in the specified CRS. When loading data without CRS metadata (common with CSV files containing lat/lon columns), always set the CRS before transforming.
Step 3: Perform Spatial Operations
Spatial joins combine two GeoDataFrames based on spatial relationships rather than shared column values. gpd.sjoin(points_gdf, polygons_gdf, predicate='within') finds which polygon each point falls inside, adding the polygon's attributes to each point's row. predicate='intersects' matches features that overlap or touch. predicate='contains' matches polygons that completely contain the other feature. This replaces the tedious manual process of looking up which county, district, or zone each data point belongs to, and it handles millions of points in seconds.
Buffer operations create areas around features. gdf.buffer(1000) creates a 1000-unit buffer around each geometry (1000 meters if the CRS uses meters). This answers "what is within 1 km of each hospital?" by creating circular zones, then using spatial joins to find features inside those zones. For non-circular buffers, Shapely provides custom shapes. For network-based distances (travel time along roads rather than straight-line distance), use the OSMnx library which downloads road networks from OpenStreetMap and computes shortest paths.
Geometric set operations combine shapes. gdf1.overlay(gdf2, how='intersection') finds the areas where two polygon layers overlap. how='union' combines both layers, keeping all areas. how='difference' subtracts one layer from another. how='symmetric_difference' keeps areas in one layer but not both. These operations answer questions like "which parts of the flood zone overlap with residential areas?" or "what percentage of the national park falls within the proposed development zone?"
Distance calculations measure how far apart features are. gdf1.geometry.distance(gdf2.geometry) computes the minimum distance between each pair of geometries. For nearest-neighbor analysis, gpd.sjoin_nearest(points_gdf, facilities_gdf, distance_col='dist_to_nearest') finds the closest facility to each point and reports the distance. For full distance matrices between all point pairs, use scipy.spatial.distance.cdist on the coordinate arrays. Always compute distances in a projected CRS (meters) rather than geographic (degrees) for meaningful results.
Step 4: Analyze Raster Data
Rasterio reads raster data (satellite imagery, elevation models, climate grids) into NumPy arrays with geographic metadata. import rasterio, then with rasterio.open('elevation.tif') as src: data = src.read(1), transform = src.transform, crs = src.crs. The data array contains pixel values (elevation in meters, temperature in Celsius, reflectance values). The transform maps pixel coordinates to geographic coordinates. The CRS defines the coordinate system. Rasterio handles GeoTIFF, NetCDF, HDF5, and dozens of other raster formats used in earth science.
Extracting raster values at point locations is a common research task: "what is the elevation at each sample site?" Use rasterio's sample method: with rasterio.open('elevation.tif') as src: values = [val[0] for val in src.sample([(x, y) for x, y in zip(points.geometry.x, points.geometry.y)])]. Add the extracted values to your point GeoDataFrame: points['elevation'] = values. For extracting statistics within polygon boundaries (mean temperature per county, total rainfall per watershed), use rasterstats: from rasterstats import zonal_stats, stats = zonal_stats(polygons_gdf, 'raster.tif', stats=['mean', 'min', 'max', 'count']).
Raster algebra operates on entire grids using NumPy. Compute the Normalized Difference Vegetation Index (NDVI) from satellite bands: ndvi = (nir_band - red_band) / (nir_band + red_band). Reclassify elevation into categories: landform = np.where(elevation < 200, 1, np.where(elevation < 1000, 2, 3)). Compute slope from elevation: from scipy.ndimage import sobel, then slope = np.sqrt(sobel(elevation, axis=0)**2 + sobel(elevation, axis=1)**2). These operations run at NumPy speed on million-pixel grids.
Xarray extends NumPy arrays with labeled dimensions and coordinates, ideal for multi-dimensional geographic data (time series of satellite images, 3D climate model output). import xarray as xr, ds = xr.open_dataset('climate.nc'). Access data by label: ds['temperature'].sel(time='2025-06', lat=slice(30, 50)). Compute statistics along dimensions: ds['temperature'].mean(dim='time') computes the temporal mean at each location. Xarray integrates with Dask for out-of-core computation on datasets larger than memory, essential for global climate datasets that can reach terabytes.
Step 5: Create Geographic Visualizations
Choropleth maps color regions by a data variable. gdf.plot(column='income_median', cmap='viridis', legend=True, legend_kwds={'label': 'Median Income ($)', 'shrink': 0.7}, edgecolor='gray', linewidth=0.3, figsize=(12, 8)). Choose colormaps that match the data type: sequential ('viridis', 'YlOrRd') for values that go from low to high, diverging ('RdBu', 'coolwarm') for values centered on a meaningful midpoint (deviation from average, positive/negative change), qualitative ('Set2', 'tab10') for categorical data (land use types, political parties). Always include a legend with units.
Point maps show discrete locations with optional size and color encoding. ax.scatter(gdf.geometry.x, gdf.geometry.y, c=gdf['value'], s=gdf['magnitude']*10, cmap='plasma', alpha=0.6, edgecolors='black', linewidth=0.3). For point density visualization, use kernel density estimation: from scipy.stats import gaussian_kde, then compute density on a grid and display with imshow or contourf. Hexbin plots (ax.hexbin(x, y, gridsize=30, cmap='YlOrRd')) aggregate points into hexagonal cells and color by count, effective for showing spatial patterns in large point datasets without overplotting.
Interactive web maps with Folium enable zooming, panning, and clicking. import folium, m = folium.Map(location=[40, -100], zoom_start=4). Add markers: for _, row in gdf.iterrows(): folium.Marker([row.geometry.y, row.geometry.x], popup=row['name']).add_to(m). Add choropleth layers: folium.Choropleth(geo_data=gdf, data=df, columns=['id', 'value'], key_on='feature.properties.id', fill_color='YlOrRd').add_to(m). Save: m.save('map.html'). Interactive maps are powerful for data exploration and for sharing results with collaborators who want to examine specific regions.
Contextual base maps provide geographic reference. contextily (pip install contextily) adds basemap tiles (OpenStreetMap, satellite imagery, terrain) behind your data. import contextily as ctx, then after plotting your data on a projected axes: ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik). The data must be in Web Mercator (EPSG:3857) for basemap tiles to align. For publication figures, use a subtle basemap (ctx.providers.Stamen.TonerLite) that provides geographic context without overwhelming the data layer.
Always verify that your datasets share the same CRS before combining them, and always project to a meter-based CRS before computing distances or areas. These two habits prevent the most common errors in geospatial analysis.