🎬 Word2Vec Model for Personalized Movie Recommendations

This repository provides a trained Word2Vec model and associated artifacts for a content-based movie recommendation system, used in the Letterboxd Movie Recommender Gradio application.

🧠 Model Overview

Architecture: Gensim Word2Vec (CBOW)
Vector Size: 100
Window Size: 5
Min Word Count: 2
Training Data: Aggregated movie metadata:
- Descriptions
- Taglines
- Genres
- Actors
- Directors
- Themes

Each movie is represented by a 100-dimensional vector, computed by averaging the Word2Vec embeddings of the words in its metadata. Comparing these "movie vectors" by cosine similarity is what produces the personalized recommendations.
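
The averaging step can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: the function name `average_embedding`, the whitespace tokenization, and the zero-vector fallback for movies with no in-vocabulary words are all assumptions. `embeddings` can be any token-to-vector mapping; with Gensim 4.x, the `.wv` attribute of the loaded `word2vec.model` should behave the same way.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim=100):
    """Average the vectors of all tokens found in `embeddings`.

    `embeddings` is any token -> vector mapping (a plain dict here).
    Tokens missing from the vocabulary are skipped; if none remain,
    a zero vector is returned (an assumed fallback).
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy 3-dimensional vocabulary, for illustration only
toy_vocab = {
    "crime": np.array([1.0, 0.0, 0.0]),
    "drama": np.array([0.0, 1.0, 0.0]),
    "gotham": np.array([1.0, 1.0, 0.0]),
}
metadata = "crime drama gotham vigilante".split()  # "vigilante" is out of vocabulary
movie_vector = average_embedding(metadata, toy_vocab, dim=3)
```

Because out-of-vocabulary words are skipped, a movie with sparse metadata ends up averaged over very few words, which makes its vector noisier.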

🔧 Intended Use

This model powers the backend of a Gradio app that:

- Builds a user taste profile from their Letterboxd history
- Computes cosine similarity between the user's profile and all movies
- Returns the top-N most similar movies as recommendations
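
The three steps above could look roughly like the sketch below. How the app actually builds the taste profile is not documented here; this example assumes a rating-weighted mean of the watched movies' vectors. The names `recommend`, `watched_idx`, and `ratings` are hypothetical.

```python
import numpy as np

def recommend(movie_vectors, watched_idx, ratings, top_n=5):
    """Return row indices of the top_n unwatched movies closest to the user's profile."""
    watched_idx = np.asarray(watched_idx)
    weights = np.asarray(ratings, dtype=float)

    # 1. Taste profile: rating-weighted mean of the watched movies' vectors (assumed scheme)
    profile = np.average(movie_vectors[watched_idx], axis=0, weights=weights)

    # 2. Cosine similarity between the profile and every movie
    norms = np.linalg.norm(movie_vectors, axis=1) * np.linalg.norm(profile)
    sims = movie_vectors @ profile / np.where(norms == 0, 1.0, norms)

    # 3. Top-N most similar movies, skipping those already watched
    sims[watched_idx] = -np.inf
    return np.argsort(sims)[::-1][:top_n]

# Toy catalogue of five movies in a 2-dimensional embedding space
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [0.1, 0.9]])
picks = recommend(toy_vectors, watched_idx=[0], ratings=[5.0], top_n=2)
```

With the real artifacts, `movie_vectors` would be the array loaded from `movie_vectors.npy`, and the returned indices would be row positions in `movie_data`.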

📦 Repository Contents

| File | Description |
|------|-------------|
| `movie_vectors.npy` | NumPy array of 100-dim vectors representing each movie |
| `movie_data.pkl` | Pandas DataFrame with processed metadata |
| `word2vec.model` | Trained Word2Vec model |
| `README.md` | Documentation |

🧪 Example Usage

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load data from the Hub (optional)
# from huggingface_hub import hf_hub_download
# vectors_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_vectors.npy")
# data_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_data.pkl")
# movie_vectors = np.load(vectors_path)
# movie_data = pd.read_pickle(data_path)

# Or load locally
movie_vectors = np.load("movie_vectors.npy")
movie_data = pd.read_pickle("movie_data.pkl")

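# Prepare movie index for lookup.
# Row i of movie_data is assumed to correspond to row i of movie_vectors,
# so reset the index to make labels match row positions.
movie_data = movie_data.reset_index(drop=True)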
movie_data['name_year'] = movie_data['name'] + ' (' + movie_data['year'].astype(str) + ')'
movie_data_unique = movie_data.drop_duplicates(subset="name_year")
indices = pd.Series(movie_data_unique.index, index=movie_data_unique['name_year'])

# Find similar movies
movie_title = 'The Dark Knight (2008)'
try:
    idx = indices[movie_title]
    sim_scores = cosine_similarity(movie_vectors[idx].reshape(1, -1), movie_vectors)[0]
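    # Rank all movies by similarity (descending), then drop the query movie explicitly;
    # slicing off the last element is unreliable when several movies tie at the top score.
    order = np.argsort(sim_scores)[::-1]
    top_indices = order[order != idx][:5]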

    print(f"Movies similar to '{movie_title}':")
    print(movie_data['name_year'].iloc[top_indices])
except KeyError:
    print(f"Movie '{movie_title}' not found.")
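
Since the lookup above needs an exact `name_year` match, the `KeyError` branch could also suggest near matches. The sketch below uses the standard-library `difflib`; the helper name `suggest_titles` and the sample titles are hypothetical, and in the example above the candidates would come from `indices.index`.

```python
import difflib

def suggest_titles(query, known_titles, limit=3):
    """Return up to `limit` known titles that closely match `query` (case-insensitive)."""
    lowered = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=0.6)
    return [lowered[m] for m in matches]

# Hypothetical titles standing in for indices.index
titles = ["The Dark Knight (2008)", "The Dark Knight Rises (2012)", "Heat (1995)"]
suggestions = suggest_titles("Dark Knight (2008)", titles)
```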

📂 Data Source

Movie metadata was compiled using a Kaggle dataset and enriched with additional features for training.

⚠️ Limitations

- Uses content-based filtering only; no collaborative (user-to-user) data
- May not suggest serendipitous or out-of-profile recommendations
- Depends on the quality and completeness of metadata
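
One rough way to gauge the metadata dependence is to look for degenerate rows in the vector array. Whether `movie_vectors.npy` contains any depends on how movies without usable metadata were handled, which is not documented here. The helper name `find_degenerate_rows` is hypothetical.

```python
import numpy as np

def find_degenerate_rows(vectors):
    """Return indices of rows that are all zeros or contain NaN values."""
    vectors = np.asarray(vectors, dtype=float)
    is_zero = ~vectors.any(axis=1)           # every component equals 0
    has_nan = np.isnan(vectors).any(axis=1)  # at least one NaN component
    return np.flatnonzero(is_zero | has_nan)

# Toy array: row 1 is all zeros, row 2 contains a NaN
toy = np.array([[0.2, 0.1], [0.0, 0.0], [np.nan, 0.3]])
bad_rows = find_degenerate_rows(toy)
```

Rows flagged this way carry no usable similarity signal and are candidates for exclusion from recommendations.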
