Project Overview
This project explored distributional semantics as a way to measure character similarity in dialogue from the BBC series EastEnders. Each character's speech was converted into a single document vector, and similarity between characters was assessed using information-retrieval (IR) style ranking.
Techniques Used
Data Preparation:
- Extracted up to 300 utterances per character from training data
- Extracted up to 50 utterances from the validation and test sets
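The capping step above can be sketched as follows. This is a minimal illustration, not the project's actual loader: it assumes the data arrives as (character, utterance) pairs, and the names `collect_utterances`, `to_documents` and the `_EOL_` separator are hypothetical.

```python
from collections import defaultdict


def collect_utterances(rows, max_per_character):
    """Group utterances by character, keeping at most max_per_character each.

    `rows` is assumed to be an iterable of (character, utterance) pairs;
    the real dataset layout may differ.
    """
    by_character = defaultdict(list)
    for character, utterance in rows:
        if len(by_character[character]) < max_per_character:
            by_character[character].append(utterance)
    return dict(by_character)


def to_documents(by_character, separator=" _EOL_ "):
    """Join each character's utterances into one document string.

    The separator token is a hypothetical end-of-line marker.
    """
    return {name: separator.join(lines) for name, lines in by_character.items()}
```

Under these assumptions, `collect_utterances(train_rows, 300)` would build the training documents, and the same function with a limit of 50 would cover the validation and test sets.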
Preprocessing:
- Tokenisation, lowercasing, stopword removal, punctuation stripping
- Added part-of-speech (POS) tagging and contextual sentence selection based on scene metadata
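A minimal sketch of the basic cleaning steps listed above, using only the standard library. The stopword set is a small illustrative sample rather than the list used in the project, and `preprocess` is a hypothetical name. POS tagging and scene-based context selection are not shown, since they depend on a tagger and on metadata that this summary does not describe.

```python
import string

# Illustrative stopword sample (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "it", "i", "you"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    stripped = lowered.translate(_PUNCT_TABLE)
    tokens = stripped.split()
    return [token for token in tokens if token not in stopwords]
```

Note that stripping punctuation before tokenising merges contractions ("ain't" becomes "aint"). Whether that helps or hurts for dialogue data is a design choice worth testing on the validation set.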
Feature Engineering:
- Bag-of-words and TF-IDF vectors
- N-grams (1–3), document frequency filtering
- External resources for lexical enhancement
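The n-gram, document-frequency and TF-IDF steps above can be illustrated with a small pure-Python sketch. It assumes each character document is already a list of tokens, and it uses a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1. The exact weighting, thresholds and library used in the project are not stated here, so the function names and the `min_df` parameter are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_max=3, min_df=1):
    """Build sparse TF-IDF vectors.

    docs: dict mapping character name -> list of tokens.
    Returns a dict mapping character name -> {term: weight}.
    Terms occurring in fewer than min_df documents are dropped.
    """
    counts = {name: Counter(ngrams(tokens, n_max)) for name, tokens in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    n_docs = len(docs)
    vectors = {}
    for name, term_counts in counts.items():
        vectors[name] = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in term_counts.items()
            if df[term] >= min_df
        }
    return vectors
```

Terms shared by every character receive the lowest weight, so a character's distinctive vocabulary dominates their vector. The external lexical resources mentioned above are not covered by this sketch.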
Evaluation:
- Cosine similarity for nearest-neighbour ranking
- Metrics: Mean Rank, Accuracy, Hit@1, Hit@3
- Visualisations: heatmaps and similarity matrices
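The ranking-based evaluation above can be sketched as follows. The code assumes sparse vectors stored as {term: weight} dictionaries and that every held-out character also appears in the training set. `evaluate` is a hypothetical helper, and accuracy is treated here as equivalent to Hit@1, which may differ from the definition used in the project.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors ({term: weight} dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def evaluate(train_vecs, heldout_vecs, ks=(1, 3)):
    """Rank training characters for each held-out document.

    Returns the mean rank of the correct character and Hit@k for each k.
    """
    ranks = []
    for name, query in heldout_vecs.items():
        # Sort training characters from most to least similar to the query.
        ranked = sorted(
            train_vecs,
            key=lambda candidate: cosine(query, train_vecs[candidate]),
            reverse=True,
        )
        ranks.append(ranked.index(name) + 1)  # rank 1 means the best match

    results = {"mean_rank": sum(ranks) / len(ranks)}
    for k in ks:
        results[f"hit@{k}"] = sum(rank <= k for rank in ranks) / len(ranks)
    return results
```

A mean rank of 1.0 would mean every held-out document was matched to its own character first. The full matrix of pairwise `cosine` scores is also the kind of input a similarity heatmap would be drawn from.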
Results
- Achieved a mean rank near 1.7 and an accuracy of up to 0.62 on the validation set.
- TF-IDF combined with contextual scene information gave the best semantic separability between characters.
- Detailed analysis explained why some characters were confused with others.
