Project Overview
Objective
This project focused on building a classification model that identifies disaster-related tweets in a corpus of real-world Twitter data. The goal was an NLP pipeline capable of handling noisy social media content and assigning each tweet a binary label: "Disaster" or "Not Disaster".
Scope
Social media platforms often serve as real-time sources during crises. Filtering disaster-related tweets accurately can support emergency response teams, news organizations, and public safety agencies. This project simulated such a real-world deployment scenario using a Kaggle-style dataset and state-of-the-art NLP techniques.
Tools & Methodologies
- Python, Pandas, NLTK, Scikit-learn, Matplotlib, Gensim
- NLP tasks: Tokenization, Stopword Removal, Embedding Construction
- Vectorization: CountVectorizer, TF-IDF, GloVe
- Modeling: Logistic Regression, KNN, Random Forest
- Model selection: GridSearchCV, Validation Curves
The Approach and Process
Data Preprocessing
- Cleaned tweets by removing URLs, punctuation, and special characters
- Filled missing keywords and locations
- Tokenized tweets and removed NLTK stopwords
- Created features such as `text_clean` and `stopwords_removed` (sketched below)
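A minimal sketch of this cleaning pipeline, assuming a Pandas DataFrame with the Kaggle-style `text`, `keyword`, and `location` columns; the regex patterns and fill value are illustrative rather than the exact project code:

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Strip URLs, punctuation, and special characters; lowercase the rest."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[^A-Za-z\s]", "", text)            # keep letters and spaces only
    return text.lower().strip()

def drop_stopwords(text: str) -> str:
    """Tokenize with NLTK and remove English stopwords."""
    return " ".join(t for t in word_tokenize(text) if t not in STOPWORDS)

# df = pd.read_csv("train.csv")  # hypothetical input file
# df[["keyword", "location"]] = df[["keyword", "location"]].fillna("unknown")
# df["text_clean"] = df["text"].apply(clean_tweet)
# df["stopwords_removed"] = df["text_clean"].apply(drop_stopwords)
```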

Exploratory Data Analysis
- Analyzed frequent words in disaster and non-disaster tweets
- Visualized top 50 most common words in both classes
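A quick way to reproduce that frequency analysis (a sketch; the `stopwords_removed` and `target` column names carry over from the preprocessing step above):

```python
from collections import Counter

import matplotlib.pyplot as plt

def top_words(df, label, n=50):
    """Return the n most common tokens for one class (1 = disaster, 0 = not)."""
    tokens = " ".join(df.loc[df["target"] == label, "stopwords_removed"]).split()
    return Counter(tokens).most_common(n)

# words, counts = zip(*top_words(df, label=1))
# plt.bar(words, counts)
# plt.xticks(rotation=90)
# plt.title("Top 50 words in disaster tweets")
# plt.tight_layout()
# plt.show()
```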

Feature Engineering
- Created text vector representations using three schemes (sketched after this list):
  - CountVectorizer: transforms each tweet into a vector of raw token counts
  - TF-IDF Vectorizer: weights each token by how often it occurs within a tweet relative to how often it occurs across the entire dataset
  - GloVe Embeddings: represents each word with a pre-trained 100-dimensional dense vector
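Roughly how the three representations can be built. This is a sketch: mean-pooling word vectors into a single tweet vector is an assumption about how the 100-dimensional tweet representation was formed, and the Gensim download name is one common source for pre-trained GloVe vectors:

```python
import gensim.downloader as api
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["forest fire near la ronge", "love new phone"]  # toy cleaned tweets

# 1. CountVectorizer: one column per vocabulary token, raw counts
count_X = CountVectorizer().fit_transform(corpus)

# 2. TF-IDF: up-weights tokens frequent within a tweet but rare across the corpus
tfidf_X = TfidfVectorizer().fit_transform(corpus)

# 3. GloVe: average the pre-trained 100-d vectors of a tweet's in-vocabulary words
glove = api.load("glove-wiki-gigaword-100")

def tweet_vector(text: str, dim: int = 100) -> np.ndarray:
    words = [w for w in text.split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else np.zeros(dim)

glove_X = np.vstack([tweet_vector(t) for t in corpus])  # shape (n_tweets, 100)
```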
Modeling
To determine which model worked best, we trained and evaluated three common machine learning algorithms.
1. K-Nearest Neighbors (KNN)
- KNN predicts the label of a tweet based on the majority label of the K most similar tweets.
- It performs well when paired with TF-IDF features, as distances between vectorized tweets can be meaningfully compared.
- Simple and interpretable but can be slower on large datasets.
Validation Accuracy: around 77–78%
Best performance when used with TF-IDF vectorization
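A sketch of that pairing, with GridSearchCV (listed under Tools) choosing K; the parameter grid and the `train_texts` / `train_labels` names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# TF-IDF features feed directly into the distance-based classifier
knn_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Cross-validated search over K
grid = GridSearchCV(knn_pipe, {"knn__n_neighbors": [3, 5, 7, 11, 15]}, cv=5)
# grid.fit(train_texts, train_labels)
# print(grid.best_params_, grid.best_score_)
```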
2. Random Forest (RF)
- Random Forest is an ensemble method that builds multiple decision trees and combines their outputs.
- It handles heterogeneous features well and reduces overfitting compared with a single decision tree.
- Performed comparably to KNN in some TF-IDF settings but was slightly less efficient.
Validation Accuracy: approximately 77.4%
A solid baseline but not the most efficient or accurate in this case.
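The same TF-IDF features feed the forest, and a validation curve (also listed under Tools) is one way to inspect the overfitting behaviour described above. The estimator settings, depth range, and `labels` name below are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# tfidf_X is assumed from the feature-engineering step; labels are the 0/1 targets
# train_scores, val_scores = validation_curve(
#     RandomForestClassifier(n_estimators=200, random_state=42),
#     tfidf_X, labels,
#     param_name="max_depth", param_range=[4, 8, 16, 32], cv=5,
# )
```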
3. Logistic Regression (LR)
- Logistic Regression is a linear model that works particularly well with dense input vectors.
- It was paired with GloVe embeddings, where each tweet was represented by a 100-dimensional dense vector.
- This combination provided the most consistent and generalizable results.
Best Model: Logistic Regression + GloVe
Final Validation Accuracy: 81.1%
Training Accuracy: 80.1%
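A sketch of the winning combination, reusing the mean-pooled `glove_X` features from the feature-engineering step; the split ratio and solver settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# glove_X: (n_tweets, 100) averaged GloVe vectors; y: 0/1 disaster labels
# X_train, X_val, y_train, y_val = train_test_split(
#     glove_X, y, test_size=0.2, random_state=42, stratify=y
# )

lr = LogisticRegression(max_iter=1000)
# lr.fit(X_train, y_train)
# print("train accuracy:", lr.score(X_train, y_train))
# print("validation accuracy:", lr.score(X_val, y_val))
```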
