API Integration with Python

Updated May 2026
APIs (Application Programming Interfaces) provide structured, programmatic access to data from scientific databases, government repositories, cloud services, and research platforms. Python's requests library makes API integration straightforward: send HTTP requests with parameters, receive JSON responses, and parse them into pandas DataFrames for analysis. For researchers, APIs replace manual data downloads with automated, reproducible data collection that can be scheduled, scaled, and incorporated directly into analysis pipelines.

REST APIs are the most common type for data access. They use standard HTTP methods (GET to retrieve data, POST to send data, PUT to update, DELETE to remove) on URL endpoints that represent resources. A request to a weather API might look like: GET /api/v1/weather?lat=40.7&lon=-74.0&date=2026-05-19. The server returns a JSON response with the requested data. Python's requests library handles the HTTP communication, and json parsing converts the response into Python dictionaries and lists that feed directly into pandas DataFrames for analysis.

Step 1: Understand REST API Basics

Every API request has four components: the HTTP method (usually GET for data retrieval), the URL (base URL plus endpoint path), query parameters (key-value pairs that filter or specify the request), and headers (metadata including authentication tokens and content type preferences). The API documentation specifies available endpoints, required and optional parameters, response format, rate limits, and authentication requirements. Always read the documentation before writing code; it saves time and prevents wasted requests.
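
The four components can be inspected without sending anything over the network. The sketch below builds a prepared request with the requests library; the host api.example.com and the endpoint path are placeholders chosen for illustration, not a real service.

```python
import requests

# Hypothetical endpoint: api.example.com is a placeholder, not a real API.
request = requests.Request(
    method="GET",                                              # 1. HTTP method
    url="https://api.example.com/v1/weather",                  # 2. base URL + endpoint path
    params={"lat": 40.7, "lon": -74.0, "date": "2026-05-19"},  # 3. query parameters
    headers={"Accept": "application/json"},                    # 4. headers
)

# prepare() assembles the final request but does not send it.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["Accept"])
```

Printing prepared.url shows the query parameters URL-encoded and appended to the endpoint, which is a quick way to confirm a request matches what the documentation expects before spending any of your rate limit.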

Responses come with HTTP status codes that indicate success or failure. 200 means success. 201 means resource created (after POST). 400 means bad request (invalid parameters). 401 means unauthorized (authentication required or failed). 403 means forbidden (authenticated but not authorized for this resource). 404 means not found (the endpoint or the requested resource does not exist). 429 means too many requests (rate limit exceeded). 500 means internal server error (the server failed while handling the request, which is usually a problem on the provider's side rather than in your code). In general, 2xx codes signal success, 4xx codes signal a problem with the request, and 5xx codes signal a problem on the server. Your code should check status codes and handle each category appropriately rather than assuming every response is successful.
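
One way to handle status codes by category is a small helper that maps a code to a handling decision. This is a minimal sketch; the function name classify_status and the category labels are illustrative choices, not part of any library.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse handling category (labels are illustrative)."""
    if 200 <= status_code < 300:
        return "ok"              # 200, 201, ...: safe to parse the body
    if status_code in (401, 403):
        return "auth_error"      # fix credentials or permissions; retrying will not help
    if status_code == 404:
        return "not_found"       # wrong endpoint or missing resource
    if status_code == 429:
        return "rate_limited"    # wait, then retry
    if 400 <= status_code < 500:
        return "client_error"    # other request problems, e.g. 400 invalid parameters
    if 500 <= status_code < 600:
        return "server_error"    # provider-side failure; often worth retrying later
    return "unexpected"
```

A collection script can branch on the returned label, for example retrying on "rate_limited" and "server_error" while stopping immediately on "auth_error".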

JSON (JavaScript Object Notation) is the standard response format for modern APIs. JSON objects map to Python dictionaries: {"name": "value", "count": 42}. JSON arrays map to Python lists: [1, 2, 3]. Nested structures combine both: {"results": [{"id": 1, "value": 42}, {"id": 2, "value": 87}]}. Python's json module (json.loads for parsing, json.dumps for formatting) and the requests library's .json() method handle the conversion automatically.
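
The mapping between JSON and Python types can be seen with the standard json module alone. The sketch below parses the nested example from the paragraph above and formats it back into text.

```python
import json

# The nested example from the text, as it would arrive in a response body.
raw = '{"results": [{"id": 1, "value": 42}, {"id": 2, "value": 87}]}'

data = json.loads(raw)                       # JSON object -> dict, JSON array -> list
values = [item["value"] for item in data["results"]]

print(type(data).__name__, type(data["results"]).__name__)
print(values)

pretty = json.dumps(data, indent=2)          # dict -> formatted JSON text
print(pretty)
```

When the data comes from requests, response.json() performs the json.loads step for you.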

Step 2: Make Your First API Request

Start with import requests, then send the request: response = requests.get('https://api.example.com/data', params={'query': 'term', 'limit': 100}). The params dictionary is automatically URL-encoded and appended to the URL. response.status_code holds the HTTP status. response.json() parses the JSON body into Python dictionaries and lists, and raises an exception if the body is not valid JSON. response.text gives the raw text response. response.headers holds the response headers (which often include rate limit information).
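
The calls above can be wrapped in a small function. This is a sketch rather than a definitive client: the name fetch_results is hypothetical, and the get parameter exists only so the function can be exercised with a stand-in instead of a live server.

```python
import requests


def fetch_results(url, params=None, get=requests.get):
    """Send a GET request and return (status_code, parsed_json).

    `get` defaults to requests.get; it is a parameter only so a stand-in
    can be supplied when no real API is available.
    """
    response = get(url, params=params, timeout=30)
    response.raise_for_status()   # raise on 4xx/5xx instead of parsing an error body
    return response.status_code, response.json()
```

A call such as status, data = fetch_results('https://api.example.com/data', params={'query': 'term', 'limit': 100}) would then return the status code and the parsed body, assuming that placeholder URL were a real endpoint.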

Parse JSON responses into pandas DataFrames for analysis (import pandas as pd). For flat JSON, where data['results'] is a list of records: data = response.json(), then df = pd.DataFrame(data['results']). For nested JSON, use json_normalize: from pandas import json_normalize, then df = json_normalize(data['results'], record_path='measurements', meta=['id', 'name']). This produces one row per entry in each record's measurements list and repeats the parent id and name on every row, flattening the nested structure into a tabular format. For deeply nested responses, extract the relevant portion first: records = [{'id': item['id'], 'value': item['data']['measurement']['value']} for item in data['results']], then df = pd.DataFrame(records).
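
To see what json_normalize does with record_path and meta, it helps to run it on a small hand-written payload. The payload below is invented for illustration; the station names and the t and value fields are assumptions, not the format of any particular API.

```python
import pandas as pd

# Illustrative payload: each result has an id, a name, and a nested list of measurements.
data = {
    "results": [
        {"id": 1, "name": "station_a",
         "measurements": [{"t": 0, "value": 42}, {"t": 1, "value": 43}]},
        {"id": 2, "name": "station_b",
         "measurements": [{"t": 0, "value": 87}]},
    ]
}

# One row per measurement; the parent id and name are repeated on each row.
df = pd.json_normalize(data["results"], record_path="measurements", meta=["id", "name"])
print(df)
```

The resulting frame has three rows (two for the first station, one for the second), with the measurement fields first and the meta columns after them.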

Scientific API examples that are freely available: NCBI E-utilities (PubMed, GenBank, Protein), NASA Open APIs (Mars rover photos, near-earth objects, astronomy picture of the day), NOAA Climate Data Online, World Bank Open Data, CrossRef (publication metadata), OpenAlex (research papers and citations), USGS Earthquake API, and EPA Air Quality API. Each has its own endpoint structure and parameters, but the requests.get() pattern is the same for all of them. Start with a simple query in a browser to see the response format before writing Python code.

Step 3: Authenticate and Manage Keys

Most APIs require authentication to track usage and enforce rate limits. API key authentication is the simplest: include the key as a query parameter (params={'api_key': key}) or as a header (headers={'Authorization': 'Bearer ' + key}). The API documentation specifies which method to use. Register for a free API key on the provider's developer portal. Some APIs (PubMed, CrossRef) work without authentication but impose lower rate limits.

Never hardcode API keys in source code. Use environment variables: import os, then api_key = os.environ['MY_API_KEY'], which raises a KeyError if the variable is missing so a misconfigured environment fails immediately. Set the variable in your shell: export MY_API_KEY='your-key-here' (add to .bashrc for persistence). For Jupyter notebooks, use a .env file with python-dotenv: from dotenv import load_dotenv, load_dotenv(), then api_key = os.getenv('MY_API_KEY'), which returns None instead of raising if the variable is absent. Add .env to .gitignore so keys are never committed to version control. Accidentally publishing an API key to a public repository is a common security mistake that can lead to unauthorized usage charges.
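
Reading the key from the environment and turning it into a header can be sketched with the standard library alone. The helper names load_api_key and bearer_headers are hypothetical, and the Bearer scheme is only one of the conventions an API might use; check the provider's documentation for the exact header.

```python
import os


def load_api_key(var_name="MY_API_KEY"):
    """Read an API key from the environment and fail with a clear message if absent."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


def bearer_headers(key):
    """Build an Authorization header using the Bearer scheme (provider-dependent)."""
    return {"Authorization": f"Bearer {key}"}
```

The result of bearer_headers(load_api_key()) can be passed as the headers argument of requests.get, keeping the key itself out of the source file.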

OAuth 2.0 authentication is more complex but required by some APIs (Google APIs, GitHub API for private repos). The flow: register your application, obtain a client ID and secret, redirect the user to the authorization page, receive an authorization code, exchange it for an access token, include the token in subsequent requests. The requests-oauthlib library simplifies this: from requests_oauthlib import OAuth2Session. For service-to-service authentication without user interaction, use a service account with a pre-authorized token or client credentials grant.
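
To make the authorization-code flow more concrete, the sketch below builds the two pieces of data that the flow exchanges, using only the standard library. It is not a substitute for requests-oauthlib: the endpoint URL, client ID, and function names are placeholders, and some providers expect the client credentials in an HTTP Basic header rather than in the form body.

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope):
    """Return (url, state) for the step that redirects the user to the provider."""
    state = secrets.token_urlsafe(16)   # random value, compared on return to guard against CSRF
    query = urlencode({
        "response_type": "code",        # ask for an authorization code
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{authorize_endpoint}?{query}", state


def token_request_body(code, client_id, client_secret, redirect_uri):
    """Form fields POSTed to the token endpoint to exchange the code for an access token."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    }
```

In practice a library such as requests-oauthlib generates the state value, performs the token exchange, and attaches the resulting token to later requests, so hand-rolled code like this is mainly useful for understanding what the library does.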

Step 4: Handle Pagination and Rate Limits

APIs return results in pages to avoid sending massive responses. Offset-based pagination: params={'offset': 0, 'limit': 100} for the first page, params={'offset': 100, 'limit': 100} for the second, continuing until fewer than 100 results are returned. Cursor-based pagination: the response includes a next_cursor value that you pass in the next request. Link-header pagination: the response headers contain a Link header with the URL for the next page. Build a loop that follows the pagination mechanism until all results are collected.
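
The offset-based and cursor-based loops described above can be sketched independently of any particular API by passing in a function that fetches one page. The names collect_offset_pages, collect_cursor_pages, and fetch_page are hypothetical; in real use fetch_page would wrap a requests call.

```python
def collect_offset_pages(fetch_page, limit=100):
    """Collect all records from an offset-paginated source.

    fetch_page(offset, limit) must return the list of records for that page.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        records.extend(page)
        if len(page) < limit:   # a short or empty page means there is no more data
            break
        offset += limit
    return records


def collect_cursor_pages(fetch_page):
    """Collect all records from a cursor-paginated source.

    fetch_page(cursor) must return (records, next_cursor); next_cursor is None
    on the last page. The first call receives cursor=None.
    """
    records = []
    cursor = None
    while True:
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            break
    return records
```

For Link-header pagination, requests exposes the parsed header as response.links, so the loop can follow response.links['next']['url'] until no 'next' entry is present.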

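Rate limiting controls how many requests you can make per time period. Check the API documentation for limits (e.g., "100 requests per minute"). The simplest approach is a fixed pause with time.sleep(): at 100 requests per minute the minimum spacing is 60 / 100 = 0.6 seconds, so time.sleep(0.6) between requests keeps you at or below the limit (request latency adds further margin). More sophisticated rate limiting reads the response headers: many APIs return X-RateLimit-Remaining and X-RateLimit-Reset headers, although the names and semantics vary by provider. Where X-RateLimit-Reset is a Unix timestamp, sleep until the reset time once the remaining count hits zero: if int(response.headers['X-RateLimit-Remaining']) == 0: reset_time = int(response.headers['X-RateLimit-Reset']), time.sleep(max(0, reset_time - time.time())). Some APIs report the number of seconds until reset instead of a timestamp, in which case you sleep for that value directly.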
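
Both approaches reduce to a small calculation. The sketch below assumes that X-RateLimit-Reset holds a Unix timestamp in seconds, which is common but not universal; the function names are illustrative.

```python
import time


def min_interval(requests_per_minute):
    """Smallest pause between requests that stays within a per-minute limit."""
    return 60 / requests_per_minute


def seconds_until_reset(headers, now=None):
    """Return how long to sleep based on rate-limit headers, or 0.0 if quota remains.

    Assumes X-RateLimit-Reset is a Unix timestamp in seconds. Missing headers
    are treated as "no information", so no extra wait is applied.
    """
    now = time.time() if now is None else now
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)
```

Inside a collection loop this would be used as time.sleep(seconds_until_reset(response.headers)) after each response, with min_interval supplying the fixed pause when the API sends no rate-limit headers.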

Combine pagination and rate limiting in a collection loop. Initialize results = [] and url = base_url. Then, while url is set: response = requests.get(url, headers=auth_headers), response.raise_for_status(), data = response.json(), results.extend(data['items']), url = data.get('next_page_url'), time.sleep(0.5). The loop ends when a response no longer contains a next_page_url. After the loop: df = pd.DataFrame(results). This pattern, with appropriate error handling, reliably collects complete datasets from paginated APIs, provided the accumulated results fit in memory; for very large collections, write each page to disk as it arrives.
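
Written out as a function, the loop looks like the sketch below. The field names items and next_page_url follow the example in the text and will differ between APIs; the session argument is expected to behave like a requests.Session, and the sleep parameter exists only so the pause can be replaced when no real waiting is wanted.

```python
import time


def collect_all(session, base_url, headers=None, delay=0.5, sleep=time.sleep):
    """Follow next_page_url links until exhausted and return all items.

    Assumes each page is a JSON object with an 'items' list and an optional
    'next_page_url' field (names taken from the example in the text).
    """
    results = []
    url = base_url
    while url:
        response = session.get(url, headers=headers, timeout=30)
        response.raise_for_status()          # stop on 4xx/5xx rather than collect bad pages
        data = response.json()
        results.extend(data["items"])
        url = data.get("next_page_url")      # None or missing on the last page ends the loop
        if url:
            sleep(delay)                     # pause only when another request will follow
    return results
```

Calling collect_all(requests.Session(), base_url, headers=auth_headers) returns a list that can be passed straight to pd.DataFrame.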

Step 5: Build Robust API Clients

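Retry logic handles transient failures. from requests.adapters import HTTPAdapter, from urllib3.util.retry import Retry. session = requests.Session(). retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504]). session.mount('https://', HTTPAdapter(max_retries=retry)). This automatically retries failed requests up to three times with exponential backoff. In urllib3's implementation the first retry is sent without delay and later retries wait backoff_factor * 2^(n-1) seconds, where n is the number of consecutive failures so far, so with backoff_factor=1 the waits are roughly 0s, 2s, and 4s; a Retry-After header on a 429 or 503 response takes precedence by default. Status-based retries apply by default only to idempotent methods such as GET, not to POST. This handles temporary server errors and rate limit responses without manual intervention. Use session for all subsequent requests, and mount the adapter on 'http://' as well if you call plain-HTTP endpoints.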
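
The retry setup reads more clearly as a small factory function. This sketch uses the same Retry parameters as the text; the function name make_retrying_session is an illustrative choice.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=3, backoff_factor=1):
    """Return a requests.Session that retries transient failures with backoff."""
    retry = Retry(
        total=total,                                  # maximum number of retries
        backoff_factor=backoff_factor,                # scales the wait between retries
        status_forcelist=[429, 500, 502, 503, 504],   # statuses that trigger a retry
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Creating the session once with session = make_retrying_session() and reusing it for every call also keeps connections alive between requests, which reduces per-request overhead.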

Response caching prevents redundant API calls during development and re-runs. requests-cache (pip install requests-cache) transparently caches responses: import requests_cache, requests_cache.install_cache('api_cache', expire_after=3600). After this call, repeated identical requests made through requests are served from a local cache (a SQLite file by default) for one hour instead of hitting the API, which speeds up script development and reduces API usage. Disable or clear the cache for production runs when you need fresh data: requests_cache.clear().

Structured error handling wraps API calls in try/except blocks that handle each failure mode appropriately. try: response = session.get(url, timeout=30). except requests.Timeout: log('Request timed out'). except requests.ConnectionError: log('Network unreachable, skipping'). Catch Timeout before ConnectionError, because a connection timeout (requests.ConnectTimeout) is a subclass of both and would otherwise be reported as a network failure. The timeout parameter (in seconds) prevents requests from hanging indefinitely on unresponsive servers. Always set a timeout; the default is no timeout, meaning a stalled server can hang your script indefinitely.
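
A wrapper along these lines keeps the try/except logic in one place. It is a sketch under a simple policy (log the failure and return None); the name safe_get and that policy are assumptions, and a real client might instead re-raise or retry.

```python
import logging

import requests

logger = logging.getLogger("api_client")


def safe_get(session, url, timeout=30):
    """Return the parsed JSON body, or None if the request failed."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Checked first: ConnectTimeout subclasses both Timeout and ConnectionError.
        logger.warning("Request timed out: %s", url)
    except requests.ConnectionError:
        logger.warning("Network unreachable: %s", url)
    except requests.HTTPError as exc:
        status = exc.response.status_code if exc.response is not None else "unknown"
        logger.warning("HTTP error %s for %s", status, url)
    return None
```

Callers can then treat None as "skip this item" without repeating the exception handling at every call site.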

Logging and monitoring track API collection progress and diagnose issues. Use Python's logging module to record: each request URL, response status codes, number of results per page, cumulative progress, and any errors. Save raw API responses to files (with open(f'responses/page_{n}.json', 'w') as f: json.dump(data, f)) so you can re-parse without re-downloading if your parsing logic changes; the with block ensures each file is flushed and closed. For long-running collection jobs, add email or Slack notifications on completion or failure so you do not need to watch the terminal.
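
Saving raw pages and logging progress can be combined in one helper. The sketch below assumes each page is a dictionary with an items list, as in the earlier collection loop; the function name save_raw_page and the log format are illustrative.

```python
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("collector")


def save_raw_page(data, page_number, out_dir="responses"):
    """Write one raw API page to <out_dir>/page_<n>.json and log what was saved."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)       # create the folder on first use
    file_path = out_path / f"page_{page_number}.json"
    with file_path.open("w", encoding="utf-8") as fh:
        json.dump(data, fh)
    logger.info(
        "Saved page %d with %d items to %s",
        page_number, len(data.get("items", [])), file_path,
    )
    return file_path
```

If the parsing logic changes later, the saved files can be reloaded with json.load and re-parsed without making any further requests.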

Key Takeaway

Build API clients with retries, timeouts, rate limiting, and caching from the start. These four features transform fragile scripts that fail on the first hiccup into reliable data collection tools that run unattended.