Project Overview
This project focused on building a sentiment analysis pipeline using Support Vector Machines (SVMs) to classify tweets as either positive or negative. The task involved preprocessing noisy text data, extracting informative features, and evaluating model performance through cross-validation and error analysis.
Techniques Used
Preprocessing:
- Tokenisation with NLTK
- Lowercasing, punctuation removal, stop word filtering
- Lemmatisation and emoji handling
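
The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: whitespace splitting stands in for NLTK tokenisation, lemmatisation is left out because it needs downloaded NLTK corpora, and `STOP_WORDS`, `EMOJI_TOKENS` and `preprocess` are hypothetical names with deliberately tiny example contents.

```python
import string

# Hypothetical, deliberately tiny stand-ins for a real stop word list
# and a real emoji sentiment lexicon.
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "and", "of", "i"}
EMOJI_TOKENS = {
    "😊": "emoji_pos",
    "😍": "emoji_pos",
    "😢": "emoji_neg",
    "😡": "emoji_neg",
}


def preprocess(tweet: str) -> list:
    """Lowercase, strip punctuation, map emoji to tokens, drop stop words."""
    text = tweet.lower()
    # Remove ASCII punctuation; emoji are not in string.punctuation, so they survive.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace known emoji with placeholder tokens, padded so they split cleanly.
    for emoji, token in EMOJI_TOKENS.items():
        text = text.replace(emoji, f" {token} ")
    # Whitespace split (a simplification of NLTK tokenisation).
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("This is the BEST day, ever!!! 😊"))
# ['best', 'day', 'ever', 'emoji_pos']
```

Mapping emoji to placeholder tokens, rather than deleting them, keeps a potentially strong sentiment signal available to the vectoriser.
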
Feature Extraction:
- Bag-of-Words and TF-IDF vectorisation
- Comparison of word-frequency (count) and binary-presence features
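
To make the difference between the feature variants concrete, here is a small sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The two example documents are made up for illustration, and the vectoriser settings are library defaults rather than the settings used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up toy documents (assumed already preprocessed).
docs = ["good good day", "bad day"]

# Bag-of-Words with raw term frequencies.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Bag-of-Words with binary presence: 1 if the term occurs, else 0.
binary = CountVectorizer(binary=True).fit_transform(docs)

# TF-IDF: frequencies down-weighted for terms that occur in many documents,
# with each row L2-normalised by default.
tfidf = TfidfVectorizer().fit_transform(docs)

print(sorted(count_vec.vocabulary_))  # ['bad', 'day', 'good']
print(counts.toarray().tolist())      # [[0, 1, 2], [1, 1, 0]]
print(binary.toarray().tolist())      # [[0, 1, 1], [1, 1, 0]]
print(tfidf.shape)                    # (2, 3)
```

In the first document "good" counts twice in the frequency matrix but only once in the binary one, while TF-IDF additionally down-weights "day" because it appears in both documents.
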
Model:
- Linear Support Vector Machine (LinearSVC from scikit-learn)
- Grid search over the regularisation parameter C
- 10-fold cross-validation
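
The model setup above might look something like the sketch below. The toy dataset, the candidate values in `C_VALUES`, the step names `tfidf` and `svm`, and the choice of F1 as the selection metric are all assumptions made for illustration; the original grid and data are not given here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy data: 10 positive and 10 negative "tweets",
# the minimum needed for stratified 10-fold cross-validation.
pos_words = ["love", "great", "happy", "awesome", "good"]
neg_words = ["hate", "awful", "sad", "terrible", "bad"]
things = ["phone", "movie"]
texts = [f"i {w} this {t}" for w in pos_words + neg_words for t in things]
labels = [1] * 10 + [0] * 10

# Hypothetical grid of regularisation strengths.
C_VALUES = [0.01, 0.1, 1, 10]

# Putting the vectoriser in the pipeline means it is re-fitted on each
# training fold, so no vocabulary or IDF statistics leak from held-out data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(random_state=0)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"svm__C": C_VALUES},
    cv=10,         # stratified 10-fold for classifiers
    scoring="f1",  # assumption: select C by F1 on the positive class
)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With a dataset this small the scores are not meaningful; the point is only the shape of the search.
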
Evaluation:
- Accuracy, Precision, Recall, F1-score
- Confusion matrix
- Manual error inspection (False Positives/Negatives)
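
As a worked example of how these metrics relate to the confusion matrix, the sketch below computes them from scratch for a binary task and collects the indices of misclassified items for manual inspection. The labels and predictions are invented, and `binary_metrics` is a hypothetical helper, not part of the project.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Indices to pull up for manual error inspection.
        "false_positives": [i for i, (t, p) in enumerate(pairs) if t == 0 and p == 1],
        "false_negatives": [i for i, (t, p) in enumerate(pairs) if t == 1 and p == 0],
    }


# Invented example: 1 = positive tweet, 0 = negative tweet.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
report = binary_metrics(y_true, y_pred)
print(report["confusion"])  # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
print(report["f1"])         # 0.75
```

The `false_positives` and `false_negatives` lists can be used to look up the original tweets and check for recurring patterns in the errors.
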
Results
- The best performance was achieved with TF-IDF features and a grid-searched value of C.
- F1-score improved from a baseline of ~0.76 to ~0.83 after advanced preprocessing and feature tuning.
- The model generalised well, with minimal overfitting.
