End-to-end NLP text classification on Kaggle “Disaster Tweets” — baseline ML + multiple deep learning architectures + transfer learning (USE) in a single reproducible notebook
Built with the following tools and technologies:
Python |
TensorFlow / Keras |
TensorFlow Hub (USE) |
scikit-learn |
pandas |
NumPy |
Matplotlib |
Jupyter Notebook
- Overview
- Problem Statement
- Dataset
- Project Highlights
- Approach
- Models Implemented
- Results
- Getting Started
- How to Reproduce Exactly
- Notes on helper_functions.py
- Common Issues & Fixes
- Future Improvements
- References
- License
- Contact
Disaster Tweets NLP – Model Benchmarking is a portfolio-style NLP project built from a learning notebook and upgraded into a clean, reproducible mini-project.
The goal is to classify tweets into:
- Real disaster (target = 1)
- Not a disaster (target = 0)
This repository is intentionally structured like a mini case study:
- Start with a strong classical ML baseline (fast and competitive)
- Add multiple neural architectures (Dense, RNNs, CNN)
- Add transfer learning using Universal Sentence Encoder (USE) from TensorFlow Hub
- Compare everything using the same evaluation metrics
Tweets are short, noisy, and often ambiguous.
Examples:
- “Forest fire near La Ronge Sask. Canada” → likely a real disaster
- “This exam destroyed me 😭” → contains “disaster-like” words but not a real disaster
The challenge is to build models that learn context rather than reacting to keywords alone.
This project uses Kaggle’s dataset:
- Competition: Natural Language Processing with Disaster Tweets
- Link: https://www.kaggle.com/competitions/nlp-getting-started
- Data page: https://www.kaggle.com/competitions/nlp-getting-started/data
Typical columns:
- id: unique tweet id
- keyword: (optional) keyword for the tweet
- location: (optional) user location
- text: tweet text (main input)
- target: label (only in train.csv) where 1 = disaster, 0 = not disaster
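As a quick sanity check, the columns above can be loaded and inspected with pandas. The snippet below is a sketch using a tiny inline stand-in for train.csv (the real notebook calls pd.read_csv on the actual file):

```python
import pandas as pd

# Tiny stand-in with the same schema as train.csv (id, keyword, location, text, target).
train_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [None, "fire", None, "earthquake"],
    "location": [None, "Canada", "NYC", None],
    "text": [
        "Forest fire near La Ronge Sask. Canada",
        "This exam destroyed me",
        "Just happy about the weekend",
        "Earthquake reported downtown, buildings shaking",
    ],
    "target": [1, 0, 0, 1],
})

# Inspect class balance; the real dataset is moderately imbalanced.
print(train_df["target"].value_counts(normalize=True))
print(train_df[["text", "target"]].head())
```

In the notebook, the same calls (`value_counts`, `head`) run against the full Kaggle CSV.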
Important repo note:
- For licensing and cleanliness, you typically should NOT commit the raw Kaggle dataset into GitHub.
- Instead, place it locally (see Dataset Setup).
- End-to-end workflow in one notebook: data → split → vectorization → multiple models → evaluation → comparison
- Fair comparison:
- consistent train/validation split
- consistent evaluation metrics
- Covers both classic NLP and modern embedding-based workflows
- Includes reusable helper utilities (TensorBoard callback + plotting + metrics)
- Load dataset from CSV files
- Inspect class balance and sample texts
- Split into training and validation sets
- Prepare text for deep learning models using:
- TextVectorization layer (strings → integer token sequences)
- Embedding layer (token IDs → dense vectors)
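The split-then-vectorize steps above can be sketched as follows; the tweets, vocabulary size, and sequence length here are illustrative placeholders, not the notebook's exact settings:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Hypothetical tweets standing in for train.csv's "text" and "target" columns.
texts = np.array(["Forest fire near La Ronge", "This exam destroyed me",
                  "Flood warning issued for the coast", "Best pizza ever"])
labels = np.array([1, 0, 1, 0])

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# TextVectorization: strings -> integer token sequences.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10000, output_sequence_length=15)
vectorizer.adapt(train_texts)  # build the vocabulary from training data only

# Embedding: token IDs -> trainable dense vectors.
embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=128)

token_ids = vectorizer(train_texts)
vectors = embedding(token_ids)
print(vectors.shape)  # (num_train_tweets, sequence_length, embedding_dim)
```

Adapting the vectorizer on training data only avoids leaking validation vocabulary into preprocessing.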
This notebook progresses from simplest to strongest:
- A strong baseline model first
- Then neural models trained from scratch
- Then transfer learning (USE) as the high-performance benchmark
- Finally a “10% data” experiment to show label-efficiency of transfer learning
For each model we compute:
- Accuracy
- Precision
- Recall
- F1 score
Why F1 matters:
- Tweets can be ambiguous and noisy
- Precision/recall tradeoffs matter in “disaster detection” scenarios
- F1 gives a balanced view of classification quality
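A metrics helper in the spirit of the one used per model might look like the sketch below (the function name and weighted averaging are assumptions; the actual helper in helper_functions.py may differ in detail):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (weighted average) as a dict."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(calculate_results(y_true=[1, 0, 1, 0, 1], y_pred=[1, 0, 0, 0, 1]))
```

Using one shared function for every model is what makes the final comparison table fair.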
Baseline pipeline:
- TF-IDF Vectorizer
- Multinomial Naive Bayes classifier
Why this baseline is important:
- Fast to train
- Surprisingly strong for short-text classification
- Sets a minimum bar that neural models must beat
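The baseline pipeline is only a few lines with scikit-learn; the training texts below are toy stand-ins for the real dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

texts = ["forest fire spreading fast", "earthquake hits the city",
         "my day was a disaster lol", "great pizza tonight"]
labels = [1, 1, 0, 0]

model_0.fit(texts, labels)
print(model_0.predict(["fire in the city"]))
```

Because the whole thing trains in seconds, it doubles as a sanity check for the data pipeline before any deep learning runs.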
All neural models follow the pattern:
- Input: raw tweet strings
- TextVectorization: strings → sequences of token IDs
- Embedding: token IDs → trainable dense vectors
- Architecture-specific layers
- Output: sigmoid probability for binary classification
Neural models included:
- simple_dense
- lstm
- gru
- bidirectional
- conv1d
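The shared pattern can be sketched with the conv1d variant; layer sizes and the toy training data here are illustrative, not the notebook's exact configuration:

```python
import tensorflow as tf

texts = tf.constant(["forest fire nearby", "love this song",
                     "flood warning issued", "nice day outside"])
labels = tf.constant([1, 0, 1, 0])

# Strings -> token IDs (vocab and sequence length are placeholders).
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=5000, output_sequence_length=10)
vectorizer.adapt(texts)

inputs = tf.keras.Input(shape=(), dtype=tf.string)        # raw tweet strings in
x = vectorizer(inputs)
x = tf.keras.layers.Embedding(5000, 64)(x)                # token IDs -> dense vectors
x = tf.keras.layers.Conv1D(32, 5, activation="relu",
                           padding="same")(x)             # local n-gram features
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # disaster probability

model_conv1d = tf.keras.Model(inputs, outputs)
model_conv1d.compile(loss="binary_crossentropy",
                     optimizer="adam", metrics=["accuracy"])
model_conv1d.fit(texts, labels, epochs=1, verbose=0)
```

Swapping the Conv1D/pooling block for LSTM, GRU, Bidirectional(LSTM), or a pooled Dense layer yields the other from-scratch variants while keeping the input and output ends identical.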
Why benchmark multiple architectures?
- Different inductive biases:
- RNNs capture sequential dependencies
- CNNs capture local n-gram patterns efficiently
- Dense baselines test if simple pooling is sufficient
Universal Sentence Encoder (USE) via TensorFlow Hub:
- Encodes an entire sentence/tweet into a pretrained embedding vector
- A small classifier head is trained on top
Why USE is strong:
- Pretrained sentence embeddings often generalize well on small/medium datasets
- Particularly useful for short texts where handcrafted features can miss semantics
USE model trained on only 10% of training data:
- Shows how transfer learning behaves when labeled data is limited
- Mirrors real-world situations where labels are expensive
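Carving out the 10% subset can be as simple as a seeded random sample; the exact sampling call in the notebook may differ, and the DataFrame below is a placeholder:

```python
import pandas as pd

# Placeholder training frame with 100 rows.
train_df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "target": [i % 2 for i in range(100)],
})

# Seeded 10% sample so the experiment is repeatable.
train_10_percent = train_df.sample(frac=0.1, random_state=42)
print(len(train_10_percent))  # 10
```

The same USE architecture is then trained on this subset and evaluated on the unchanged validation split, so only the amount of labeled data varies.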
Final benchmark results from this notebook run:
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| baseline | 0.792651 | 0.811139 | 0.792651 | 0.786219 |
| simple_dense | 0.776903 | 0.779556 | 0.776903 | 0.774487 |
| lstm | 0.772966 | 0.775649 | 0.772966 | 0.770438 |
| gru | 0.769029 | 0.773154 | 0.769029 | 0.765780 |
| bidirectional | 0.750656 | 0.752468 | 0.750656 | 0.747956 |
| conv1d | 0.780840 | 0.783458 | 0.780840 | 0.778533 |
| tf_hub_sentence_encoder | 0.814961 | 0.815272 | 0.814961 | 0.814225 |
| tf_hub_10_percent_data | 0.784777 | 0.790478 | 0.784777 | 0.781448 |
Key takeaways:
- Best overall model here: tf_hub_sentence_encoder (highest F1)
- Strong classical baseline: baseline (TF-IDF + NB)
- conv1d performs well among from-scratch neural models
- Transfer learning remains competitive even with only 10% training data
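A table like the one above falls out naturally if each model's metrics dict is collected into a single pandas DataFrame; the numbers below are placeholders, not the notebook's results:

```python
import pandas as pd

# One metrics dict per model (placeholder values).
all_results = {
    "baseline": {"accuracy": 0.79, "precision": 0.81, "recall": 0.79, "f1": 0.79},
    "conv1d":   {"accuracy": 0.78, "precision": 0.78, "recall": 0.78, "f1": 0.78},
}

results_df = pd.DataFrame(all_results).transpose()
print(results_df.sort_values("f1", ascending=False))
```

Sorting by F1 makes the best overall model obvious at a glance.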
Recommended repository layout:
disaster-tweets-nlp-model-benchmarks/
├─ NLP.ipynb
├─ helper_functions.py
├─ requirements.txt
├─ screenshots/
│ ├─ results-table.png
│ ├─ dataset-preview.png
│ ├─ label-distribution.png
│ ├─ training-curves-use.png
│ └─ training-curves-baseline.png
├─ .gitignore
├─ LICENSE
└─ README.md
- Python 3.10+ recommended
- pip installed
- Optional: GPU for faster deep learning training
- Clone the repository
git clone https://github.com/brej-29/disaster-tweets-nlp-model-benchmarks.git
cd disaster-tweets-nlp-model-benchmarks
- Create and activate a virtual environment
Windows (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
- Install dependencies
pip install -r requirements.txt
This notebook expects Kaggle dataset files to be available locally.
Option A: download Kaggle dataset zip
- Download the dataset from: https://www.kaggle.com/competitions/nlp-getting-started/data
- Keep the downloaded zip named exactly: nlp_getting_started.zip
- Place it in the SAME folder as NLP.ipynb
After unzipping, you should have:
- train.csv
- test.csv
- sample_submission.csv
Option B: already extracted files
- Place train.csv/test.csv/sample_submission.csv in the same folder as NLP.ipynb
- Ensure filenames match the notebook expectations
- Start Jupyter Notebook
jupyter notebook
or
jupyter lab
- Open NLP.ipynb
- Run cells top-to-bottom
- Recommended: Restart Kernel & Run All for full reproducibility
If you have an NVIDIA GPU and want faster training:
- Follow official TensorFlow installation guidance for your OS/CUDA setup
- If you are on Google Colab:
- Runtime → Change runtime type → GPU
To reproduce results cleanly:
- Create a fresh virtual environment
- Install dependencies from requirements.txt
- Ensure dataset files are present
- Run the notebook from top to bottom without skipping cells
- Save the results table screenshot into screenshots/results-table.png
Optional:
- Capture package versions for strict reproducibility:
pip freeze > requirements-freeze.txt
helper_functions.py provides reusable utilities commonly used in ML notebooks, such as:
- TensorBoard callback creation
- Training curve plotting
- Metric calculation helpers
Keeping helpers separate makes the notebook easier to read and the evaluation more consistent.
- FileNotFoundError: nlp_getting_started.zip
- Confirm the zip file exists next to NLP.ipynb
- Confirm the filename matches exactly
- Training is slow
- Use GPU if available
- Reduce epochs temporarily while testing
- Use the baseline model for quick sanity checks
- TensorFlow Hub download takes time
- First run may download the model from TF Hub
- Ensure stable internet and rerun the cell if needed
Ideas to upgrade this from “notebook project” to “full ML project”:
- Add an inference script: input tweet → output disaster probability
- Save best model + preprocessing artifacts for deployment
- Add confusion matrix and error analysis:
- inspect false positives/false negatives
- Try transformer baselines (DistilBERT/BERT) and threshold tuning
- Add experiment tracking:
- structured results logging (CSV/JSON)
- TensorBoard organization per model run
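The inference-script idea above could start from a sketch like this; the helper name and threshold are hypothetical, and the tiny demo model stands in for a real trained one:

```python
import tensorflow as tf

def predict_disaster(model, tweets, threshold=0.5):
    """Score raw tweet strings with a trained string-in, sigmoid-out Keras model."""
    probs = model.predict(tf.constant(tweets), verbose=0).flatten()
    return [{"text": t, "probability": float(p), "is_disaster": bool(p >= threshold)}
            for t, p in zip(tweets, probs)]

# Trivially small demo model so the helper can be exercised end to end.
vec = tf.keras.layers.TextVectorization(max_tokens=100, output_sequence_length=5)
vec.adapt(tf.constant(["fire flood", "pizza party"]))
demo = tf.keras.Sequential([
    tf.keras.Input(shape=(), dtype=tf.string),
    vec,
    tf.keras.layers.Embedding(100, 8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

print(predict_disaster(demo, ["fire spreading near town"]))
```

For deployment, the vectorizer must be saved together with the model (as above, where it is baked into the graph) so training-time and inference-time preprocessing cannot drift apart.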
- Kaggle Competition: https://www.kaggle.com/competitions/nlp-getting-started
- TensorFlow Hub USE tutorial: https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder
- Keras TextVectorization docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
- scikit-learn TF-IDF docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
This project is licensed under the MIT License.
See the LICENSE file in this repository for full details.
If you’d like to discuss this project, provide feedback, or connect:
- LinkedIn: Brejesh Balakrishnan
Feel free to fork the repo, open issues, or suggest improvements!
