Project Overview
Objective
This project focused on building a classification model that identifies disaster-related tweets in a corpus of real-world Twitter data. The goal was an NLP pipeline capable of handling noisy social media content and assigning each tweet a binary label: "Disaster" or "Not Disaster".
Scope
Social media platforms often serve as real-time sources during crises. Filtering disaster-related tweets accurately can support emergency response teams, news organizations, and public safety agencies. This project simulated such a real-world deployment scenario using a Kaggle-style dataset and state-of-the-art NLP techniques.
Tools & Methodologies
- Python, Pandas, NLTK, Scikit-learn, Matplotlib, Gensim
- NLP tasks: Tokenization, Stopword Removal, Embedding Construction
- Vectorization: CountVectorizer, TF-IDF, GloVe
- Modeling: Logistic Regression, KNN, Random Forest
- Model selection: GridSearchCV, Validation Curves
The Approach and Process
Data Preprocessing
- Cleaned tweets by removing URLs, punctuation, and special characters
- Filled missing keywords and locations
- Tokenized tweets and removed NLTK stopwords
- Created features such as `text_clean` and `stopwords_removed` (sketched below)
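A minimal sketch of this cleaning pipeline, assuming a Pandas DataFrame with the Kaggle-style `text`, `keyword`, and `location` columns; the regex patterns and fill value are illustrative rather than the exact project code:

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Strip URLs, punctuation, and special characters; lowercase the rest."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[^A-Za-z\s]", "", text)            # keep letters and spaces only
    return text.lower().strip()

def drop_stopwords(text: str) -> str:
    """Tokenize with NLTK and remove English stopwords."""
    return " ".join(t for t in word_tokenize(text) if t not in STOPWORDS)

# df = pd.read_csv("train.csv")  # hypothetical input file
# df[["keyword", "location"]] = df[["keyword", "location"]].fillna("unknown")
# df["text_clean"] = df["text"].apply(clean_tweet)
# df["stopwords_removed"] = df["text_clean"].apply(drop_stopwords)
```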

Exploratory Data Analysis
- Analyzed frequent words in disaster and non-disaster tweets
- Visualized top 50 most common words in both classes
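A quick way to reproduce that frequency analysis (a sketch; the `stopwords_removed` and `target` column names carry over from the preprocessing step above):

```python
from collections import Counter

import matplotlib.pyplot as plt

def top_words(df, label, n=50):
    """Return the n most common tokens for one class (1 = disaster, 0 = not)."""
    tokens = " ".join(df.loc[df["target"] == label, "stopwords_removed"]).split()
    return Counter(tokens).most_common(n)

# words, counts = zip(*top_words(df, label=1))
# plt.bar(words, counts)
# plt.xticks(rotation=90)
# plt.title("Top 50 words in disaster tweets")
# plt.tight_layout()
# plt.show()
```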

Feature Engineering
- Created text vector representations using three schemes (sketched after this list):
  - CountVectorizer: transforms each tweet into a vector of raw token counts
  - TF-IDF Vectorizer: weights each token by how often it occurs within a tweet relative to how often it occurs across the entire dataset
  - GloVe Embeddings: represents each word with a pre-trained 100-dimensional dense vector
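Roughly how the three representations can be built. This is a sketch: mean-pooling word vectors into a single tweet vector is an assumption about how the 100-dimensional tweet representation was formed, and the Gensim download name is one common source for pre-trained GloVe vectors:

```python
import gensim.downloader as api
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["forest fire near la ronge", "love new phone"]  # toy cleaned tweets

# 1. CountVectorizer: one column per vocabulary token, raw counts
count_X = CountVectorizer().fit_transform(corpus)

# 2. TF-IDF: up-weights tokens frequent within a tweet but rare across the corpus
tfidf_X = TfidfVectorizer().fit_transform(corpus)

# 3. GloVe: average the pre-trained 100-d vectors of a tweet's in-vocabulary words
glove = api.load("glove-wiki-gigaword-100")

def tweet_vector(text: str, dim: int = 100) -> np.ndarray:
    words = [w for w in text.split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else np.zeros(dim)

glove_X = np.vstack([tweet_vector(t) for t in corpus])  # shape (n_tweets, 100)
```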
Modeling
To determine which model worked best, we trained and evaluated three common machine learning algorithms.
1. K-Nearest Neighbors (KNN)
- KNN predicts the label of a tweet based on the majority label of the K most similar tweets.
- It performs well when paired with TF-IDF features, as distances between vectorized tweets can be meaningfully compared.
- Simple and interpretable but can be slower on large datasets.
Validation Accuracy: around 77–78%
Best performance when used with TF-IDF vectorization
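A sketch of that pairing, with GridSearchCV (listed under Tools) choosing K; the parameter grid and the `train_texts` / `train_labels` names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# TF-IDF features feed directly into the distance-based classifier
knn_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Cross-validated search over K
grid = GridSearchCV(knn_pipe, {"knn__n_neighbors": [3, 5, 7, 11, 15]}, cv=5)
# grid.fit(train_texts, train_labels)
# print(grid.best_params_, grid.best_score_)
```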
2. Random Forest (RF)
- Random Forest is an ensemble method that builds multiple decision trees and combines their outputs.
- It handles heterogeneous features well and reduces overfitting compared with a single decision tree.
- Performed comparably to KNN in some TF-IDF settings but was slightly less efficient.
Validation Accuracy: approximately 77.4%
A solid baseline but not the most efficient or accurate in this case.
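The same TF-IDF features feed the forest, and a validation curve (also listed under Tools) is one way to inspect the overfitting behaviour described above. The estimator settings, depth range, and `labels` name below are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# tfidf_X is assumed from the feature-engineering step; labels are the 0/1 targets
# train_scores, val_scores = validation_curve(
#     RandomForestClassifier(n_estimators=200, random_state=42),
#     tfidf_X, labels,
#     param_name="max_depth", param_range=[4, 8, 16, 32], cv=5,
# )
```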
3. Logistic Regression (LR)
- Logistic Regression is a linear model that works particularly well with dense input vectors.
- It was paired with GloVe embeddings, where each tweet was represented by a 100-dimensional dense vector.
- This combination provided the most consistent and generalizable results.
Best Model: Logistic Regression + GloVe
Final Validation Accuracy: 81.1%
Training Accuracy: 80.1%
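A sketch of the winning combination, reusing the mean-pooled `glove_X` features from the feature-engineering step; the split ratio and solver settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# glove_X: (n_tweets, 100) averaged GloVe vectors; y: 0/1 disaster labels
# X_train, X_val, y_train, y_val = train_test_split(
#     glove_X, y, test_size=0.2, random_state=42, stratify=y
# )

lr = LogisticRegression(max_iter=1000)
# lr.fit(X_train, y_train)
# print("train accuracy:", lr.score(X_train, y_train))
# print("validation accuracy:", lr.score(X_val, y_val))
```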
