Web Scraping for Research with Python
Web scraping sits in a legal and ethical gray area that researchers must navigate carefully. Always check the site's robots.txt file (e.g., example.com/robots.txt), which specifies which pages may be crawled by automated tools. Review the site's terms of service for restrictions on automated access. Prefer APIs over HTML scraping when available, because APIs are designed for programmatic access and are more stable. Rate-limit your requests (1 to 2 per second maximum for most sites) to avoid overloading servers. For academic research, many organizations grant special access or provide bulk data downloads upon request, which is faster and more reliable than scraping.
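The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt string so that it runs offline; the rules, the ResearchBot/1.0 user agent, and the example.com URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real script would load the live file with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


USER_AGENT = "ResearchBot/1.0"  # assumed name for this example
parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether this user agent may fetch each URL under the parsed rules.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/data/page1.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/report.html")

# Crawl-delay is a non-standard directive; crawl_delay() returns None if absent.
crawl_delay = parser.crawl_delay(USER_AGENT)
```

If crawl_delay is not None, it is a reasonable lower bound for the pause between your requests.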
Step 1: Check for APIs and Structured Data
Before writing any scraping code, check whether the data is available through a cleaner channel. Many data sources provide REST APIs with structured JSON responses: government databases (Census Bureau, EPA, FDA, NASA, NOAA), social platforms (Twitter/X API, Reddit API), academic databases (PubMed E-utilities, CrossRef, OpenAlex, Semantic Scholar), financial data (FRED, Alpha Vantage; Yahoo Finance is usually reached through unofficial wrappers rather than a supported API), and geographic data (OpenStreetMap Overpass API, Google Maps API). Access tiers, quotas, and pricing vary and change over time, so confirm the current terms before building a study around one of them. API data is structured, documented, and comes with explicit usage terms, which makes it preferable to HTML scraping whenever it covers what you need.
Many datasets are available for direct download. Kaggle hosts thousands of datasets. data.gov provides US government datasets. The UCI Machine Learning Repository provides classic ML datasets. Scientific data repositories (Dryad, Figshare, Zenodo) host research datasets. Google Dataset Search indexes datasets across the web. Checking these sources first saves hours of scraping and produces cleaner, better-documented data.
Even when scraping is necessary, check the page source for embedded structured data. Many websites embed JSON-LD or microdata in their HTML for search engines. View the page source and search for "application/ld+json" or "itemtype." Wikipedia embeds structured data through Wikidata. News sites embed article metadata. E-commerce sites embed product data. Extracting embedded structured data is simpler and more reliable than parsing visual HTML layout.
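As a minimal sketch of pulling embedded JSON-LD out of a page, the example below uses only the standard library's html.parser. The sample HTML and its field values are invented for illustration; with BeautifulSoup (covered in Step 3) the same lookup would be soup.find_all('script', type='application/ld+json').

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block rather than abort the whole page


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD object embedded in the given HTML string."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.items


# Invented sample page for demonstration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Article", "headline": "Example headline", "datePublished": "2024-01-15"}
</script>
</head><body><p>Visible text</p></body></html>
"""

articles = extract_json_ld(SAMPLE_HTML)
```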
Step 2: Fetch Web Pages with requests
The requests library sends HTTP requests and handles responses. After import requests, call response = requests.get(url, timeout=10); requests applies no timeout by default, so without one a stalled server can hang the script indefinitely. response.status_code is 200 for success, 404 for not found, 403 for forbidden, and 429 when you are being rate-limited. response.text returns the body as a string. response.json() parses a JSON body into Python dictionaries and lists. Always check the status code before processing: if response.status_code != 200, log the error and skip the page.
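The check-then-skip pattern can be wrapped in one helper. This is a sketch under assumptions: the function name fetch_html, the 10-second timeout, and the "scraper" logger name are choices made for this example, and the optional session argument only needs to provide a get() method.

```python
import logging
from typing import Optional

import requests

logger = logging.getLogger("scraper")  # logger name is an arbitrary choice


def fetch_html(url: str, session=None, timeout: float = 10.0) -> Optional[str]:
    """Return the page body, or None if the request fails or is not a 200."""
    http = session if session is not None else requests
    try:
        response = http.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and invalid URLs.
        logger.warning("Request to %s failed: %s", url, exc)
        return None
    if response.status_code != 200:
        logger.warning("Skipping %s: HTTP %s", url, response.status_code)
        return None
    return response.text
```

Returning None instead of raising keeps a long crawl running when individual pages fail; the caller decides whether to skip or retry.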
Headers and sessions let you identify your scraper and carry state across requests. requests.get(url, headers={'User-Agent': 'ResearchBot/1.0 (university.edu; researcher@university.edu)'}) identifies your scraper honestly, which is the ethical approach and gives site operators a way to contact you. A Session object (session = requests.Session()) maintains cookies and authentication across multiple requests, which is necessary for sites that require login. session.post(login_url, data={'username': user, 'password': pw}) authenticates, and subsequent session.get() requests include the session cookies.
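A small sketch of setting the User-Agent once on a session so that every later request carries it. The helper names make_session and log_in are hypothetical, the contact string is the placeholder from the text, and the 'username' and 'password' form fields are assumptions that differ from site to site.

```python
import requests

# Placeholder contact details from the example above; replace with your own.
USER_AGENT = "ResearchBot/1.0 (university.edu; researcher@university.edu)"


def make_session(user_agent: str = USER_AGENT) -> requests.Session:
    """Create a session whose requests all send the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def log_in(session: requests.Session, login_url: str, username: str, password: str):
    """Post credentials; the form field names here are assumed, check the real form."""
    response = session.post(
        login_url,
        data={"username": username, "password": password},
        timeout=10,
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of continuing silently
    return response


session = make_session()
```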
Rate limiting prevents overloading servers and getting blocked. import time, then add time.sleep(1) between requests for a 1-second delay. For more sophisticated rate limiting, use the ratelimit library: from ratelimit import limits, sleep_and_retry. Decorate your fetch function with @sleep_and_retry and @limits(calls=1, period=2) to enforce 1 request per 2 seconds automatically. Respectful rate limiting is both ethical (do not overwhelm someone else's server) and practical (aggressive scraping gets your IP banned).
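If you would rather not add a dependency, the same effect can be sketched with the standard library. The Throttle class below is a hypothetical helper written for this example, not part of requests or ratelimit; it sleeps only for the part of the interval that has not already elapsed.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()
```

Create one instance per target site, for example throttle = Throttle(1.0), and call throttle.wait() immediately before each request. Time spent parsing the previous page counts toward the interval, so the scraper does not wait longer than it has to.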
Error handling with retries handles transient failures. from requests.adapters import HTTPAdapter, from urllib3.util.retry import Retry. Configure a session with automatic retries: retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504]), adapter = HTTPAdapter(max_retries=retry), then session.mount('https://', adapter) and session.mount('http://', adapter) so both schemes are covered. Failed requests are then retried automatically with exponentially increasing delays scaled by backoff_factor (the exact schedule depends on the urllib3 version), handling the temporary network errors and server hiccups that are inevitable when making thousands of requests.
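Put together, the retry configuration might look like the sketch below. The function name make_retrying_session is an assumption for this example; the Retry parameters are the ones given in the text, and the adapter is mounted for both http and https.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total: int = 3, backoff_factor: float = 1.0) -> requests.Session:
    """Return a session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=total,                                  # maximum number of retries
        backoff_factor=backoff_factor,                # scales the delay between attempts
        status_forcelist=[429, 500, 502, 503, 504],   # also retry on these responses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_retrying_session()
```

By default urllib3 retries on these status codes only for idempotent methods such as GET, so a failed POST (a login, for instance) is not resent automatically.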
Step 3: Parse HTML with BeautifulSoup
BeautifulSoup converts HTML text into a searchable tree structure. from bs4 import BeautifulSoup, then soup = BeautifulSoup(response.text, 'html.parser'). The 'html.parser' is Python's built-in parser. For faster parsing of large pages, use 'lxml' (pip install lxml). For handling malformed HTML, use 'html5lib' (most lenient parser). The soup object represents the entire HTML document as a tree of nested elements that you can search, filter, and extract text from.
Finding elements uses CSS selectors or tag-based methods. soup.select('table.data-table tr td') finds all td elements inside tr elements inside tables with class "data-table," using CSS selector syntax. soup.find('div', {'class': 'article-body'}) finds the first div with a specific class and returns None when nothing matches, so check the result before using it. soup.find_all('a', href=True) finds all links that have an href attribute. element.get_text(strip=True) (or element.text) extracts the text content. element['href'] extracts an attribute value. element.find_next_sibling('p') navigates to the next sibling p element. Combined, these methods reach most data in a static HTML page.
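The lookups described above, applied to a small invented document so the results are easy to follow. The HTML, class names, and link target are made up for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup for demonstration.
SAMPLE_HTML = """
<div class="article-body">
  <h2>Results</h2>
  <p>First paragraph.</p>
  <p>Second paragraph with a <a href="/data/file.csv">download link</a>.</p>
</div>
<table class="data-table">
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: every td inside a tr inside table.data-table.
cell_texts = [td.get_text(strip=True) for td in soup.select("table.data-table tr td")]

# Tag-based lookup: find() returns None when nothing matches, so guard the result.
body = soup.find("div", {"class": "article-body"})
heading = body.find("h2") if body is not None else None

# Attribute access: href=True keeps only anchors that actually have an href.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Sibling navigation: the first p element that follows the heading.
first_paragraph = heading.find_next_sibling("p") if heading is not None else None
```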
Table extraction is the most common research scraping task. pandas.read_html(url) extracts all HTML tables from a page as a list of DataFrames, often with a single line of code (it needs lxml, or BeautifulSoup with html5lib, installed). For more control, find the specific table with BeautifulSoup: table = soup.find('table', {'id': 'results'}), then build a list of rows: rows = [], and for row in table.find_all('tr'): rows.append([td.text.strip() for td in row.find_all(['td', 'th'])]). Convert the list of rows to a DataFrame, using the first row as the header: pd.DataFrame(rows[1:], columns=rows[0]).
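A runnable sketch of the manual route, using an invented table so it works offline. The function name table_to_dataframe and the sample columns are assumptions; the first row is treated as the header, as in the text.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample table for demonstration.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Site</th><th>Year</th><th>Count</th></tr>
  <tr><td>North</td><td>2021</td><td>14</td></tr>
  <tr><td>South</td><td>2022</td><td> 9 </td></tr>
</table>
"""


def table_to_dataframe(html: str, table_id: str) -> pd.DataFrame:
    """Convert the table with the given id into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        raise ValueError(f"No table with id={table_id!r} found")
    rows = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if cells:  # ignore rows without any cells
            rows.append(cells)
    return pd.DataFrame(rows[1:], columns=rows[0])


df = table_to_dataframe(SAMPLE_HTML, "results")
# Scraped cells are always strings; convert numeric columns explicitly.
df["Count"] = pd.to_numeric(df["Count"])
```

This simple version assumes every row has the same number of cells; tables with rowspan or colspan attributes need extra handling.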
Handling pagination and multi-page results requires building a loop that follows "next page" links or increments page numbers. Start with page 1, extract data, find the "next" link (next_link = soup.find('a', {'class': 'next'})), and if it exists, fetch that URL and repeat. For numbered pagination, construct URLs with page parameters: f'{base_url}?page={page_num}' for page_num in range(1, total_pages + 1). Collect results from all pages into a single list, then convert to a DataFrame at the end.
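The follow-the-next-link loop can be sketched as below. The fetch function is passed in so the loop can be exercised without network access; the li.item selector, the function name scrape_all_pages, and the max_pages safety cap are assumptions made for this example.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(
    start_url: str,
    fetch_html: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> List[str]:
    """Follow 'next' links from start_url and collect the text of each li.item."""
    results: List[str] = []
    url: Optional[str] = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        html = fetch_html(url)
        if html is None:  # fetch failed; stop rather than loop forever
            break
        soup = BeautifulSoup(html, "html.parser")
        results.extend(li.get_text(strip=True) for li in soup.select("li.item"))
        next_link = soup.find("a", {"class": "next"})
        if next_link is not None and next_link.get("href"):
            # Links are often relative, so resolve them against the current URL.
            url = urljoin(url, next_link["href"])
        else:
            url = None
        pages_seen += 1
    return results
```

In a real run, fetch_html would wrap requests.get with a status check and a delay between calls. The max_pages cap guards against sites whose "next" link loops back on itself.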
Step 4: Handle Dynamic Content
Many modern websites load data dynamically with JavaScript after the initial HTML loads. When requests.get() returns HTML that does not contain the data you see in the browser, the data is loaded by JavaScript. Before reaching for a browser automation tool, check the browser's Network tab (Developer Tools, F12) for XHR/Fetch requests: the data often comes from a JSON API endpoint that you can call directly with requests, which is faster and simpler than rendering the full page.
Selenium automates a real browser to handle JavaScript-rendered content. from selenium import webdriver, from selenium.webdriver.common.by import By. driver = webdriver.Chrome() launches a Chrome browser. driver.get(url) loads the page including JavaScript execution. driver.find_element(By.CSS_SELECTOR, '.data-table') finds elements after JavaScript has finished rendering. driver.page_source gives the fully rendered HTML for BeautifulSoup parsing. Use WebDriverWait to wait for specific elements to appear before extracting data, preventing errors from trying to read content that has not loaded yet.
Playwright is a newer alternative to Selenium that is often faster and less flaky. Install it with pip install playwright, then playwright install chromium to download the browser. from playwright.sync_api import sync_playwright. Inside with sync_playwright() as p:, call browser = p.chromium.launch(headless=True), page = browser.new_page(), page.goto(url), page.wait_for_selector('.data-table'), content = page.content(), and finally browser.close(). Playwright tends to handle modern web frameworks (React, Vue, Angular) more reliably, waits for elements automatically in many operations, and supports interception of network requests so you can capture API responses directly.
Headless mode runs the browser without displaying a window, which is essential for automated scripts and server environments. Create the options first: from selenium.webdriver.chrome.options import Options, options = Options(), options.add_argument('--headless=new') (plain '--headless' on older Chrome versions), then driver = webdriver.Chrome(options=options). Headless browsers typically use fewer resources, but some sites detect headless browsers and block them. Tools such as undetected-chromedriver try to make an automated browser harder to distinguish from a regular session, though they do not always succeed; treat a block as a signal to re-check the site's terms or request access before working around it.
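Combining headless Chrome with an explicit wait, a minimal sketch might look like this. The function name fetch_rendered_html and the 15-second default are assumptions, and it presumes Selenium 4 with a local Chrome installation. The imports sit inside the function only so the surrounding module still loads on machines where Selenium is not installed.

```python
def fetch_rendered_html(url: str, selector: str, timeout: float = 15.0) -> str:
    """Load url in headless Chrome and return the HTML once selector is present."""
    # Deferred imports: Selenium is only required when this function is called.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # use "--headless" on older Chrome builds
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element exists in the DOM; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process, even on errors
```

The returned string can be handed to BeautifulSoup exactly like response.text. Starting a browser for every page is slow, so for many pages it is usually better to keep one driver open and reuse it.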
Step 5: Store and Validate Collected Data
Save scraped data incrementally rather than keeping everything in memory. After each page or batch, append results to a CSV file: from pathlib import Path, then df.to_csv('results.csv', mode='a', header=not Path('results.csv').exists(), index=False), which writes the header only when the file is first created. This ensures that if the script crashes after collecting 5000 of 10000 records, you still have the first 5000. For structured storage, use SQLite: import sqlite3, conn = sqlite3.connect('research_data.db'), df.to_sql('records', conn, if_exists='append', index=False). SQLite handles deduplication and complex queries better than flat files, and it allows multiple readers at once, although only one process can write at a time.
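The append-per-batch pattern as a small helper. The function name append_batch and the record fields used in the usage note are assumptions for this sketch; it relies on every batch having the same columns in the same order, because appended rows are written without a header.

```python
from pathlib import Path
from typing import Dict, List

import pandas as pd


def append_batch(records: List[Dict], csv_path: str) -> int:
    """Append one batch of records to a CSV file and return the number of rows written."""
    if not records:
        return 0
    path = Path(csv_path)
    df = pd.DataFrame(records)
    # Write the header only when the file does not exist yet.
    df.to_csv(path, mode="a", header=not path.exists(), index=False)
    return len(df)
```

Calling append_batch([{"id": 1, "title": "a"}], "results.csv") after each page means a crash loses at most the batch in progress.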
Deduplication prevents redundant records when re-running scrapers. Use a unique identifier for each record (URL, ID number, date+name combination) and check for existence before inserting: if not df[df['id'] == new_id].empty: skip. In SQLite, use INSERT OR IGNORE with a unique constraint on the identifier column. For incremental collection, save the last-processed page, date, or ID and resume from that point: last_id = pd.read_sql('SELECT MAX(id) FROM records', conn).iloc[0, 0].
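A sketch of the SQLite approach using only the standard library. The records table and its id, url, and title columns are a hypothetical schema; the UNIQUE constraint on url together with INSERT OR IGNORE makes re-running the scraper safe.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the records table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL UNIQUE,"
        " title TEXT)"
    )
    return conn


def count_records(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def insert_records(conn: sqlite3.Connection, records: Iterable[Tuple[int, str, str]]) -> int:
    """Insert (id, url, title) rows, skipping duplicates; return how many were new."""
    before = count_records(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO records (id, url, title) VALUES (?, ?, ?)",
            list(records),
        )
    return count_records(conn) - before


def last_processed_id(conn: sqlite3.Connection) -> Optional[int]:
    """Return the highest stored id, or None when the table is empty."""
    return conn.execute("SELECT MAX(id) FROM records").fetchone()[0]
```

Note that MAX(id) is NULL (None in Python) on the first run, so the resume logic needs a starting default for that case.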
Data validation ensures scraped data is complete and correctly structured. After collection, check: does every record have all expected fields? Are numeric fields actually numeric? Are dates parseable? Are there unexpected null values indicating that the page structure changed? Create a validation function that runs after each scraping session and flags anomalies. Website layouts change without warning, and a scraper that worked yesterday may extract garbage today if a CSS class name changed or a table column was reordered.
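One possible shape for such a validation function, written in plain Python over a list of record dictionaries. The required fields (id, title, published, price) and the date format are a hypothetical schema chosen for this sketch; adapt them to your own data.

```python
from datetime import datetime
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("id", "title", "published", "price")  # hypothetical schema


def _is_blank(value) -> bool:
    return value is None or str(value).strip() == ""


def validate_records(
    records: List[Dict], date_format: str = "%Y-%m-%d"
) -> List[Tuple[int, str, str]]:
    """Return (record index, field, problem) tuples; an empty list means no issues found."""
    problems = []
    for index, record in enumerate(records):
        # 1. Every expected field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if _is_blank(record.get(field)):
                problems.append((index, field, "missing"))
        # 2. Numeric fields must convert to a number.
        price = record.get("price")
        if not _is_blank(price):
            try:
                float(price)
            except (TypeError, ValueError):
                problems.append((index, "price", "not numeric"))
        # 3. Dates must match the expected format.
        published = record.get("published")
        if not _is_blank(published):
            try:
                datetime.strptime(str(published), date_format)
            except ValueError:
                problems.append((index, "published", "unparseable date"))
    return problems
```

A sudden jump in the number of problems between two runs is a useful early warning that the page layout has changed.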
Caching responses avoids re-downloading pages during development and debugging. requests-cache (pip install requests-cache) automatically caches HTTP responses: import requests_cache, requests_cache.install_cache('scraper_cache', expire_after=86400). Repeated requests for the same URL are then served from the local cache for 24 hours (86400 seconds), eliminating repeat network traffic while you refine your parsing code. This speeds up development dramatically and reduces load on the target server. Disable the cache for production runs when you need fresh data.
Always check for an API or downloadable dataset before scraping HTML. When scraping is necessary, be ethical (respect robots.txt, rate-limit), be robust (retry failures, save incrementally), and be patient (a slow, reliable scraper is better than a fast one that gets banned).