Özge Karasu

Curiosity leads me. I follow and write.

Character Similarity via Vector Space Semantics

10.01.2025

Project Overview

This project explored distributional semantics as a way to measure character similarity in dialogue from the BBC series EastEnders. Each character's speech was converted into a document vector, and similarity between characters was assessed using information-retrieval-style techniques.

Techniques Used

  • Data Preparation:

    • Extracted up to 300 utterances per character from training data
    • Capped at 50 utterances for the validation and test sets
  • Preprocessing:

    • Tokenisation, lowercasing, stopword removal, punctuation stripping
    • Added POS tagging and contextual sentence selection using scene metadata
  • Feature Engineering:

    • Bag-of-words and TF-IDF vectors
    • N-grams (1–3), document frequency filtering
    • External resources for lexical enhancement
  • Evaluation:

    • Cosine similarity for nearest-neighbour ranking
    • Metrics: Mean Rank, Accuracy, Hit@1, Hit@3
    • Visualisations: heatmaps and similarity matrices
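
To make the preprocessing and feature steps above more concrete, here is a minimal sketch in plain Python of how per-character TF-IDF vectors over 1–3-grams could be built. It is not the project's actual code: the function names, the tiny stopword list, the smoothed IDF formula, and the `min_df` parameter are all assumptions chosen for illustration, and POS tagging, scene context, and external lexical resources are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "it", "you", "i"}


def tokenise(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n_max=3):
    """Return all 1..n_max-grams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def character_documents(utterances, max_per_character=300):
    """Merge up to max_per_character utterances into one feature list per character."""
    docs = {}
    for name, lines in utterances.items():
        feats = []
        for line in lines[:max_per_character]:
            # N-grams are built per utterance, so they never span two utterances.
            feats.extend(ngrams(tokenise(line)))
        docs[name] = feats
    return docs


def tfidf_vectors(docs, min_df=1):
    """Build sparse TF-IDF vectors (dicts), keeping features with document frequency >= min_df."""
    n_docs = len(docs)
    df = Counter()
    for feats in docs.values():
        df.update(set(feats))
    # Smoothed IDF; one common variant, assumed here for illustration.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items() if d >= min_df}
    vectors = {}
    for name, feats in docs.items():
        tf = Counter(f for f in feats if f in idf)
        vectors[name] = {t: count * idf[t] for t, count in tf.items()}
    return vectors
```

Each character ends up as one sparse vector keyed by n-gram, which is the "document vector" view described in the overview. Raising `min_df` would drop features that appear for too few characters, which is one simple form of document frequency filtering.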

Results

  • Achieved a mean rank near 1.7 and accuracy of up to 0.62 on the validation set.
  • TF-IDF combined with contextual scene information gave the best semantic separability.
  • Detailed analysis explained why some characters were confused with others.
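
For readers unfamiliar with the metrics reported here, the sketch below shows one way mean rank and Hit@k could be computed from cosine similarity between held-out and training vectors. It is an illustration under assumptions rather than the project's evaluation code: vectors are taken to be sparse dicts, every held-out character is assumed to exist in the training set, and accuracy is treated as equivalent to Hit@1. The names `cosine` and `evaluate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def evaluate(train_vecs, heldout_vecs, k=3):
    """Rank training characters for each held-out character and summarise the ranks."""
    ranks = []
    for name, vec in heldout_vecs.items():
        # Sort training characters from most to least similar to this held-out vector.
        ranked = sorted(
            train_vecs,
            key=lambda c: cosine(vec, train_vecs[c]),
            reverse=True,
        )
        # Rank of the correct character (1 = best possible).
        ranks.append(ranked.index(name) + 1)
    n = len(ranks)
    return {
        "mean_rank": sum(ranks) / n,
        "hit@1": sum(r == 1 for r in ranks) / n,
        f"hit@{k}": sum(r <= k for r in ranks) / n,
    }
```

A mean rank of 1.0 would mean every held-out character document was matched to its own training document first, so a value near 1.7 indicates the correct character typically sits in the top two candidates. The same pairwise `cosine` scores could also populate a similarity matrix for a heatmap, which helps show which characters are confused with one another.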