Project Overview

Objective

This project focused on building a classification model that identifies disaster-related tweets in a corpus of real-world Twitter data. The goal was to develop an NLP pipeline capable of handling noisy social media content and producing a binary label for each tweet: "Disaster" or "Not Disaster".

Scope

Social media platforms often serve as real-time sources during crises. Filtering disaster-related tweets accurately can support emergency response teams, news organizations, and public safety agencies. This project simulated such a real-world deployment scenario using a Kaggle-style dataset and state-of-the-art NLP techniques.

Tools & Methodologies

The pipeline paired two text representations, TF-IDF vectorization and pretrained GloVe word embeddings, with three classifiers: K-Nearest Neighbors, Random Forest, and Logistic Regression. Models were compared on validation accuracy.

The Approach and Process

Data Preprocessing

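Raw tweets are noisy, so the first step is normalizing the text before any vectorization. The snippet below is a minimal sketch of a typical cleaning pass (lowercasing, stripping URLs, @mentions, and punctuation); it illustrates the kind of step involved rather than the project's exact code.

import re

def clean_tweet(text: str) -> str:
    """Normalize a raw tweet into a plain lowercase token string."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"@\w+", " ", text)                    # strip @mentions
    text = re.sub(r"#", "", text)                        # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/xyz #wildfire"))
# -> forest fire near la ronge sask canada wildfire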

Exploratory Data Analysis

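The usual first look at a dataset like this covers class balance and tweet-length statistics. The sketch below assumes the standard Kaggle disaster-tweets layout, a train.csv file with "text" and "target" columns; those names are an assumption, not a confirmed detail of this project.

import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name and column layout

# Class balance: share of disaster vs. non-disaster tweets
print(df["target"].value_counts(normalize=True))

# Tweet length (in words) per class, a quick signal of how the classes differ
df["n_words"] = df["text"].str.split().str.len()
print(df.groupby("target")["n_words"].describe())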

Feature Engineering
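Two text representations appear in the results below: TF-IDF vectors and pretrained GloVe word embeddings. The sketch that follows shows one common way to build each, TF-IDF via scikit-learn and GloVe via mean-pooling word vectors per tweet; the GloVe file name and dimension are assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Route 1: TF-IDF features
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
# X_tfidf = tfidf.fit_transform(cleaned_texts)

# Route 2: mean-pooled GloVe embeddings
def load_glove(path="glove.6B.100d.txt", dim=100):  # assumed GloVe release
    """Load GloVe vectors from a plain-text file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors, dim

def embed_tweet(text, vectors, dim):
    """Average the GloVe vectors of a cleaned tweet's words."""
    words = [w for w in text.split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)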

Modelling

To determine which model worked best, we trained and evaluated three common machine learning algorithms; a minimal comparison sketch follows the results below.

1. K-Nearest Neighbors (KNN)

Validation Accuracy: around 77–78%

Best performance when used with TF-IDF vectorization

2. Random Forest (RF)

Validation Accuracy: approximately 77.4%

A solid baseline but not the most efficient or accurate in this case.

3. Logistic Regression (LR)

Best overall configuration: Logistic Regression with GloVe embeddings

Final Validation Accuracy: 81.1%

Training Accuracy: 80.1%

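Below is a minimal sketch of the comparison described above: one train/validation split scored with all three classifiers. Hyperparameters are scikit-learn defaults rather than the project's tuned settings, and X can be the TF-IDF matrix or the averaged GloVe matrix from the feature-engineering step.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def compare_models(X, y, seed=42):
    """Fit KNN, RF, and LR on one split and report validation accuracy."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "KNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        val_acc = accuracy_score(y_val, model.predict(X_val))
        print(f"{name}: validation accuracy = {val_acc:.3f}")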