NLP with Python

Updated May 2026

Natural language processing (NLP) in Python uses libraries like NLTK, spaCy, and Hugging Face Transformers to extract structure, meaning, and patterns from text data. For researchers, this means automatically analyzing literature corpora, classifying documents by topic, extracting entities and relationships from papers, measuring sentiment in survey responses, and summarizing large collections of text. These tasks could take months by hand but can run in minutes with the right Python tools.

Python's NLP ecosystem spans three levels of complexity. NLTK (Natural Language Toolkit) is educational and comprehensive, providing implementations of classic NLP algorithms alongside extensive documentation and example corpora. spaCy is production-focused, providing fast, accurate pipelines for standard NLP tasks (tokenization, POS tagging, NER, dependency parsing) that process thousands of documents per second. Hugging Face Transformers provides access to state-of-the-art pre-trained language models (BERT, GPT, T5, and thousands of variants) for tasks like text classification, summarization, question answering, and text generation. Most research projects use spaCy for preprocessing and linguistic analysis, then Transformers for tasks requiring deep language understanding.

Step 1: Preprocess Text Data

Tokenization splits text into individual words (or subword units) that can be analyzed independently. spaCy's tokenizer handles complex cases correctly: contractions ("don't" to "do" + "n't"), hyphenated words, URLs, email addresses, and punctuation attached to words. In spaCy: import spacy; nlp = spacy.load('en_core_web_sm'); doc = nlp(text); tokens = [token.text for token in doc]. NLTK's word_tokenize is a simpler alternative, available after a one-time nltk.download of the Punkt tokenizer data: from nltk.tokenize import word_tokenize; tokens = word_tokenize(text). Sentence tokenization uses doc.sents in spaCy or nltk.sent_tokenize(text) in NLTK.
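As a minimal sketch of the tokenization behaviour described above, the following uses spacy.blank('en') rather than en_core_web_sm. That choice is an assumption made here so the example runs without a model download; the blank pipeline should apply the same English tokenizer rules. The rule-based sentencizer component stands in for the trained sentence boundaries, and the sample text is invented for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components to download.
nlp = spacy.blank("en")
# Rule-based sentence splitter; a full model such as en_core_web_sm
# would instead derive doc.sents from its trained components.
nlp.add_pipe("sentencizer")

text = "Don't parse this by hand. Visit https://example.org for details."
doc = nlp(text)

# Word-level tokens: the contraction is split, the URL stays in one piece.
tokens = [token.text for token in doc]
# Sentence-level segmentation.
sentences = [sent.text for sent in doc.sents]

print(tokens)
print(sentences)
```

Swapping spacy.blank('en') for spacy.load('en_core_web_sm') should leave the token list unchanged while adding tags, lemmas, and entities to each token.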

Stopword removal eliminates common words (the, is, at, which, on) that carry little semantic meaning. spaCy marks stopwords on each token: meaningful_tokens = [token for token in doc if not token.is_stop and not token.is_punct]. NLTK provides a stopword list, available after a one-time nltk.download('stopwords'): from nltk.corpus import stopwords; stops = set(stopwords.words('english')); filtered = [w for w in tokens if w.lower() not in stops]. Be cautious with stopword removal, because in some contexts stopwords carry the meaning: "to be or not to be" consists entirely of stopwords. For topic modeling and text classification, removing stopwords usually helps. For sentiment analysis, negation words (not, never, no) are critical and should be kept, even though standard stopword lists include them.
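The caution about negation can be made concrete with a short sketch. It filters stopwords using spaCy's token.is_stop flag but keeps a small allow-list of negation words. The NEGATIONS set and the sample sentence are assumptions chosen for illustration, and spacy.blank('en') is used only so the example needs no model download.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical allow-list: stopwords we keep because they flip sentiment.
NEGATIONS = {"not", "never", "no", "n't"}

doc = nlp("The treatment was not effective and never improved outcomes.")

# Naive filtering drops every stopword, including the negations.
naive = [t.text for t in doc if not t.is_stop and not t.is_punct]

# Negation-aware filtering keeps a token if it is not a stopword
# or if it appears in the allow-list.
kept = [
    t.text
    for t in doc
    if not t.is_punct and (not t.is_stop or t.lower_ in NEGATIONS)
]

print(naive)
print(kept)
```

With naive filtering, "not effective" collapses to "effective", which reverses the meaning a sentiment model would see.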

Lemmatization reduces words to their base form: "running" to "run", "better" to "good", "mice" to "mouse". spaCy provides lemmas automatically: lemmas = [token.lemma_ for token in doc]. Stemming is a cruder alternative that strips suffixes: from nltk.stem import PorterStemmer; stemmer = PorterStemmer(); stems = [stemmer.stem(w) for w in tokens]. Stemming is faster but produces non-words ("studying" becomes "studi"), while lemmatization produces valid words. Use lemmatization for analysis where readability matters, and stemming for high-speed indexing where it does not.
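A short sketch of the stemming side of this trade-off, using NLTK's PorterStemmer (which needs no extra data download). The word list is chosen here for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["studying", "studies", "running"]

# Map each word to its Porter stem; stems need not be dictionary words.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Because "studying" and "studies" share one stem, they collapse into a single index term, which is the property that makes stemming useful for search despite the non-word output.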

Text cleaning handles the noise in real-world text, and the order of the steps matters. Remove HTML tags first: from bs4 import BeautifulSoup; text = BeautifulSoup(html, 'html.parser').get_text(). Remove URLs: import re; text = re.sub(r'https?://\S+', '', text). Lowercase: text = text.lower(). Expand contractions before stripping punctuation, because the apostrophe is itself a special character: text = text.replace("can't", "cannot").replace("won't", "will not"). Then remove special characters: text = re.sub(r'[^a-zA-Z0-9\s]', '', text). For scientific text, preserve domain-specific terms, chemical formulas, gene names, and numeric values that generic cleaning might destroy.
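The cleaning steps can be combined into one function. This is a minimal sketch using only the standard library; the function name clean_text and the two-entry CONTRACTIONS table are hypothetical. It expands contractions before stripping special characters, since doing it the other way round would delete the apostrophes the expansion depends on. HTML stripping is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny contraction table for illustration.
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}


def clean_text(text: str) -> str:
    """Lowercase, drop URLs, expand contractions, strip special characters."""
    text = text.lower()
    # Remove URLs before punctuation stripping would break them apart.
    text = re.sub(r"https?://\S+", " ", text)
    # Expand contractions while the apostrophes are still present.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("We CAN'T share it yet; see https://example.org/paper.pdf!")
print(cleaned)
```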

Step 2: Extract Linguistic Features

Part-of-speech (POS) tagging assigns grammatical categories to each word. spaCy's tagger identifies nouns, verbs, adjectives, adverbs, and other parts of speech: [(token.text, token.pos_) for token in doc] produces pairs like [('Python', 'PROPN'), ('processes', 'VERB'), ('text', 'NOUN')]. POS tags enable analyses like extracting all adjectives used to describe a product (sentiment analysis), identifying technical nouns in scientific abstracts (terminology extraction), and filtering words by grammatical role for more targeted text analysis.

Named entity recognition (NER) identifies real-world entities in text: people, organizations, locations, dates, monetary values, and more. spaCy's NER model extracts entities directly: [(ent.text, ent.label_) for ent in doc.ents] produces [('Harvard University', 'ORG'), ('January 2026', 'DATE'), ('Boston', 'GPE')]. For scientific text, the standard models miss domain-specific entities (gene names, chemical compounds, diseases). SciSpacy (pip install scispacy) provides biomedical NER models trained on medical literature, and custom NER models can be trained on annotated examples from your specific domain.
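The doc.ents interface can be shown without a trained model by swapping in spaCy's rule-based entity_ruler component. This is a stand-in chosen for illustration, not the statistical NER described above, and the patterns, the GENE label, and the sample sentence are all assumptions.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based stand-in for a trained NER component: exact phrase patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "GENE", "pattern": "BRCA1"},
        {"label": "ORG", "pattern": "Harvard University"},
    ]
)

doc = nlp("BRCA1 mutations were studied at Harvard University.")

# Same access pattern as with a trained model: iterate over doc.ents.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A pattern list like this can also serve as a quick baseline when no domain-specific model or annotated training data is available.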

Dependency parsing reveals the grammatical structure of sentences: which words modify which, what is the subject of a verb, what is the object. spaCy's parser produces a tree: for token in doc: print(token.text, token.dep_, token.head.text) shows that in "The large model predicts outcomes," "model" is the subject of "predicts," "large" and "The" modify "model," and "outcomes" is the object of "predicts." This structure enables extracting subject-verb-object triples (who did what), identifying relationships between entities, and analyzing sentence complexity.

Similarity measurement compares how semantically related two texts are. spaCy's medium and large models include word vectors, and doc1.similarity(doc2) returns the cosine similarity of the two document vectors. The value can range from -1 to 1 but usually falls between 0 (unrelated) and 1 (near-identical vectors) for natural text. This enables finding documents similar to a query, clustering related papers, detecting near-duplicate content, and measuring how research topics evolve over time. For higher accuracy, use sentence-transformers (from sentence_transformers import SentenceTransformer), which encodes entire sentences into fixed-length vectors using pre-trained transformer models.
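Vector-based similarity scores of this kind are typically cosine similarities. The sketch below computes one with NumPy on hand-made toy vectors. The vectors are illustrative assumptions, not real embeddings, and the zero-vector fallback is a design choice made here.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; ranges from -1 to 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption: treat an all-zero vector as unrelated to everything.
        return 0.0
    return float(np.dot(a, b) / denom)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length, so a long document and a short one on the same topic can still score close to 1.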

Step 3: Represent Text Numerically

Machine learning algorithms require numerical input, so text must be converted to numbers. Bag-of-words represents each document as a vector of word counts: from sklearn.feature_extraction.text import CountVectorizer; vectorizer = CountVectorizer(max_features=5000); X = vectorizer.fit_transform(documents). Each document becomes a row with up to 5000 columns (one per word in the vocabulary), and each value is the count of that word in the document. This representation ignores word order but captures topic content effectively for classification and clustering.

TF-IDF (Term Frequency-Inverse Document Frequency) improves on raw counts by weighting words that are distinctive to specific documents: from sklearn.feature_extraction.text import TfidfVectorizer; vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2)); X = vectorizer.fit_transform(documents). Words that appear in many documents (common words) get low weights. Words that appear frequently in only a few documents (topic-specific words) get high weights. Setting ngram_range=(1, 2) includes both single words and two-word phrases, so "machine learning" becomes a feature of its own alongside the separate "machine" and "learning" features.
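The weighting effect can be checked on a toy corpus. In this sketch the three documents are invented, and the inverse-document-frequency values are read from the fitted vectorizer's idf_ attribute. The word "for" appears in every document, while "surveys" appears in only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "machine learning for text",
    "machine learning for images",
    "text mining for surveys",
]

# Unigrams and bigrams, as in the ngram_range=(1, 2) setting above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(documents)

vocab = vectorizer.vocabulary_  # feature string -> column index
idf = vectorizer.idf_           # one IDF weight per column

idf_common = idf[vocab["for"]]     # appears in all three documents
idf_rare = idf[vocab["surveys"]]   # appears in one document

# By default each row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)

print("machine learning" in vocab, idf_common, idf_rare)
```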

Word embeddings represent words as dense vectors in a continuous space where similar words are close together. Pre-trained embeddings like Word2Vec, GloVe, and FastText capture semantic relationships: the vector for "king" minus "man" plus "woman" is close to "queen." spaCy includes word vectors in its medium (md) and large (lg) models. Gensim (pip install gensim) provides tools for loading pre-trained embeddings and training custom embeddings on domain-specific text. To represent a document, average its word vectors: import numpy as np; doc_vector = np.mean([token.vector for token in doc if token.has_vector], axis=0). Guard against documents in which no token has a vector, because the mean of an empty list is undefined. spaCy's doc.vector provides a similar average over all tokens.
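The averaging step can be sketched with NumPy alone. The two-dimensional TOY_VECTORS table is a made-up stand-in for real pre-trained embeddings, and the helper name document_vector is hypothetical. The function returns a zero vector when no token is in the vocabulary, so the mean is never taken over an empty list.

```python
import numpy as np

# Made-up 2-D "embeddings" for illustration; real ones have 100+ dimensions.
TOY_VECTORS = {
    "gene": np.array([1.0, 0.0]),
    "protein": np.array([0.8, 0.2]),
    "galaxy": np.array([0.0, 1.0]),
}


def document_vector(tokens, vectors, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


bio_doc = document_vector(["gene", "protein", "unknownword"], TOY_VECTORS)
empty_doc = document_vector(["unknownword"], TOY_VECTORS)

print(bio_doc, empty_doc)
```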

Transformer-based representations from models like BERT, SciBERT, and Sentence-BERT produce contextualized embeddings where the same word has different representations depending on its surrounding context: from sentence_transformers import SentenceTransformer; model = SentenceTransformer('all-MiniLM-L6-v2'); embeddings = model.encode(documents). These embeddings usually capture meaning far better than bag-of-words or averaged word vectors, and they give strong results for semantic search, document clustering, and text classification. The trade-off is computational cost: without a GPU, encoding 10,000 documents can take minutes, where a bag-of-words vectorizer needs a second or less.

Step 4: Analyze Text at Scale

Sentiment analysis classifies text as positive, negative, or neutral. For quick analysis, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based tool that works well on social media and informal text. It needs a one-time nltk.download('vader_lexicon'): from nltk.sentiment import SentimentIntensityAnalyzer; sia = SentimentIntensityAnalyzer(); scores = sia.polarity_scores(text). For higher accuracy, use a pre-trained transformer: from transformers import pipeline; sentiment = pipeline('sentiment-analysis'); result = sentiment('The experiment was remarkably successful.'). The call returns a list with one dictionary per input, such as [{'label': 'POSITIVE', 'score': 0.99}].

Topic modeling discovers the latent themes in a document collection. Latent Dirichlet Allocation (LDA) is the standard approach. LDA models word counts, so fit it on a bag-of-words matrix from CountVectorizer rather than on TF-IDF weights: from sklearn.decomposition import LatentDirichletAllocation; lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(count_matrix). Each topic is a distribution over words, and the top 10 words per topic reveal what each topic is about. Each document is a mixture of topics, and lda.transform(document_vector) shows the topic proportions. BERTopic (pip install bertopic) provides a modern transformer-based alternative that often produces more coherent topics and labels them automatically.

Text classification assigns predefined categories to documents. Train a classifier: from sklearn.naive_bayes import MultinomialNB; model = MultinomialNB().fit(X_train_tfidf, y_train); predictions = model.predict(X_test_tfidf). Naive Bayes works well for text classification with minimal tuning. For better accuracy, use scikit-learn's SGDClassifier with TF-IDF features or fine-tune a transformer model. For zero-shot classification (categorizing without training data), use Hugging Face: from transformers import pipeline; classifier = pipeline('zero-shot-classification'); result = classifier(text, candidate_labels=['physics', 'biology', 'chemistry']). The result contains a confidence score for each candidate label, with no training required.
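The TF-IDF plus Naive Bayes approach can be sketched as a single scikit-learn pipeline, which keeps vectorizing and classification together so the same vocabulary is applied at training and prediction time. The four training texts and their labels are invented toy data, far too few for a real classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data for illustration only.
train_texts = [
    "quantum particles and energy fields",
    "particle accelerator measures energy",
    "cells divide and proteins fold",
    "gene expression in cells",
]
train_labels = ["physics", "physics", "biology", "biology"]

# Vectorizer and classifier chained: fit() learns both in one call.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

new_texts = ["energy of quantum particles", "proteins in cells"]
predictions = list(classifier.predict(new_texts))

print(predictions)
```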

Information extraction pulls structured data from unstructured text. Combine NER (what entities exist) with dependency parsing (how they relate) to extract relationships: "Aspirin inhibits COX-2" yields (Aspirin, inhibits, COX-2). Regular expressions extract patterned data: re.findall(r'\d+\.?\d*\s*(?:mg|ml|kg|g|mol)\b', text) extracts quantities with units from scientific text. The trailing \b stops "5 grams" from matching as "5 g". For processing thousands of research abstracts, spaCy's nlp.pipe(texts, batch_size=50) streams documents through the pipeline in batches, which is much faster than calling nlp on each text separately. By default it runs in a single process; pass n_process=-1 to spread the work across all CPU cores. With a small model, throughput can reach thousands of documents per second.
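A sketch of the regular-expression approach for quantities with units. It uses capture groups to return the number and unit separately, and a trailing word boundary so that a unit prefix inside a longer word is not matched. The sample sentence and the unit list are illustrative.

```python
import re

# Number (integer or decimal), optional whitespace, then a whole-word unit.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|ml|kg|g|mol)\b")

text = (
    "Patients received 5 mg daily, then 2.5 ml of solution "
    "and 0.1 mol reagent per 70 kg. About 3 grams were lost."
)

# findall returns one (number, unit) tuple per match.
quantities = QUANTITY.findall(text)

# Convert the numeric part to float for downstream analysis.
parsed = [(float(value), unit) for value, unit in quantities]

print(parsed)
```

The phrase "3 grams" is skipped on purpose: without the word boundary it would be reported as 3 g, which may or may not be wanted depending on the corpus.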

Step 5: Use Pre-trained Language Models

Hugging Face Transformers provides access to thousands of pre-trained models through a simple API: from transformers import pipeline. The pipeline function selects a default model for the named task, downloads it on first use (it is cached for future runs), and handles tokenization, inference, and output formatting automatically. Summarization: summarizer = pipeline('summarization'); summary = summarizer(long_text, max_length=130, min_length=30). Question answering: qa = pipeline('question-answering'); answer = qa(question='What method was used?', context=paper_abstract). Translation: translator = pipeline('translation_en_to_fr'); translated = translator(english_text). For reproducible results, pass an explicit model name with the model argument instead of relying on the task default.

Domain-specific models generally outperform general models on specialized text. SciBERT (trained on scientific papers) provides better embeddings for scientific text than standard BERT. BioBERT (trained on biomedical literature) excels at biomedical NER and relation extraction. PubMedBERT (trained on PubMed abstracts) is a strong choice for medical text analysis. Load these models from Hugging Face: from transformers import AutoModel; model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased'). The Hugging Face Model Hub hosts hundreds of thousands of models, many specialized for specific domains, languages, or tasks.

Fine-tuning adapts a pre-trained model to your specific task. Start with a pre-trained model, add a classification head, and train on your labeled data. With Hugging Face: from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments. Load the model, prepare your labeled dataset, configure training arguments (learning rate, epochs, batch size), and call trainer.train(). Because the pre-trained model already encodes language structure, fine-tuning can often reach strong performance with as few as 100 to 1000 labeled examples, though the number needed depends on the task and the number of classes. For text classification, a fine-tuned transformer often outperforms TF-IDF + Naive Bayes, commonly by 5 to 15 percentage points of accuracy, although the gap varies by dataset.

Computational considerations for transformer models: they require significant memory (2 to 4 GB for base models, 8+ GB for large models) and are slow compared to classical NLP tools. Processing 10,000 documents with spaCy takes seconds; with a transformer, it takes minutes to hours depending on document length and GPU availability. GPU acceleration (NVIDIA CUDA) typically speeds inference by 10 to 50 times compared to CPU. For research that requires processing millions of documents, use transformer models for a representative sample and classical methods for the full corpus, or use distilled models: DistilBERT is reported to be about 60% faster than BERT while retaining about 97% of its language-understanding performance.

Key Takeaway

Use spaCy for fast, reliable preprocessing and linguistic analysis, then Hugging Face Transformers for tasks requiring deep language understanding. Start with pre-trained models and zero-shot classification before investing in custom training.
