Word2Vec Model for Personalized Movie Recommendations
This repository provides a trained Word2Vec model and associated artifacts for a content-based movie recommendation system, used in the Letterboxd Movie Recommender Gradio application.
Model Overview
- Architecture: Gensim Word2Vec (CBOW)
- Vector Size: 100
- Window Size: 5
- Min Word Count: 2
- Training Data: Aggregated movie metadata:
  - Descriptions
  - Taglines
  - Genres
  - Actors
  - Directors
  - Themes
Each movie is represented by a 100-dimensional vector, computed by averaging the Word2Vec embeddings of the words in its metadata. Recommendations are generated by comparing these "movie vectors" with cosine similarity.
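The averaging step described above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: the `embeddings` dict stands in for a trained model's word vectors (with Gensim, `model.wv` supports the same `in` and `[]` lookups), and the `movie_vector` helper name, the whitespace tokenization, and the zero-vector fallback are all assumptions made for the example.

```python
import numpy as np

def movie_vector(tokens, embeddings, dim=100):
    """Average the embeddings of the tokens found in the vocabulary."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # Assumption: a movie with no in-vocabulary words gets a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy 100-dim embeddings standing in for a trained Word2Vec vocabulary.
rng = np.random.default_rng(0)
embeddings = {
    word: rng.normal(size=100).astype(np.float32)
    for word in ["crime", "thriller", "gotham", "vigilante"]
}

# Hypothetical metadata string; "unseen" is out of vocabulary and is skipped.
tokens = "crime thriller gotham unseen".lower().split()
vec = movie_vector(tokens, embeddings)
print(vec.shape)
```

Words below the minimum count (2 in this model) never enter the vocabulary, so a lookup like the one above has to skip them rather than raise an error.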
Intended Use
This model powers the backend of a Gradio app that:
- Builds a user taste profile from their Letterboxd history
- Computes cosine similarity between the user's profile and all movies
- Returns the top-N most similar movies as recommendations
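The three steps above might look roughly like the sketch below. The source does not say how the app builds the taste profile, so the rating-weighted mean used here is an assumption, as are the `recommend` function name and its parameters. The toy 3-dimensional vectors stand in for the real 100-dimensional ones.

```python
import numpy as np

def recommend(movie_vectors, watched_idx, ratings, top_n=5):
    """Return indices of the top_n unwatched movies closest to the user's profile."""
    watched_idx = np.asarray(watched_idx)
    weights = np.asarray(ratings, dtype=float)
    # Assumption: the taste profile is the rating-weighted mean of the
    # vectors of the movies the user has watched (ratings must be positive).
    profile = np.average(movie_vectors[watched_idx], axis=0, weights=weights)

    # Cosine similarity between the profile and every movie vector.
    norms = np.linalg.norm(movie_vectors, axis=1) * np.linalg.norm(profile)
    scores = movie_vectors @ profile / np.maximum(norms, 1e-12)

    # Never recommend something the user has already watched.
    scores[watched_idx] = -np.inf
    return np.argsort(scores)[::-1][:top_n]

# Toy data: six movies in a 3-dim space; the user rated movie 0 highly.
movie_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
    [0.1, 0.9, 0.0],
])
top = recommend(movie_vectors, watched_idx=[0], ratings=[5.0], top_n=2)
print(top)
```

Masking watched movies before ranking keeps the result list free of titles already in the user's history, which a plain top-N over all similarities would otherwise return first.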
Repository Contents
| File | Description |
|---|---|
| movie_vectors.npy | NumPy array of 100-dim vectors representing each movie |
| movie_data.pkl | Pandas DataFrame with processed metadata |
| word2vec.model | Trained Word2Vec model |
| README.md | Documentation |
Example Usage
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Load data from the Hub (optional)
# from huggingface_hub import hf_hub_download
# vectors_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_vectors.npy")
# data_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_data.pkl")
# movie_vectors = np.load(vectors_path)
# movie_data = pd.read_pickle(data_path)
# Or load locally
movie_vectors = np.load("movie_vectors.npy")
movie_data = pd.read_pickle("movie_data.pkl")
# Prepare movie index for lookup
movie_data['name_year'] = movie_data['name'] + ' (' + movie_data['year'].astype(str) + ')'
movie_data_unique = movie_data.drop_duplicates(subset="name_year")
indices = pd.Series(movie_data_unique.index, index=movie_data_unique['name_year'])
# Find similar movies
movie_title = 'The Dark Knight (2008)'
try:
    # Assumes movie_data has a default RangeIndex aligned row-for-row with movie_vectors
    idx = indices[movie_title]
    sim_scores = cosine_similarity(movie_vectors[idx].reshape(1, -1), movie_vectors)[0]
    # Rank all movies by similarity (descending), drop the query movie itself, keep the top 5
    ranked = np.argsort(sim_scores)[::-1]
    top_indices = ranked[ranked != idx][:5]
    print(f"Movies similar to '{movie_title}':")
    print(movie_data['name_year'].iloc[top_indices])
except KeyError:
    print(f"Movie '{movie_title}' not found.")
Data Source
- Movie metadata was compiled using a Kaggle dataset and enriched with additional features for training.
Limitations
- Uses content-based filtering only; no collaborative (user-to-user) data is involved
- May not suggest serendipitous or out-of-profile recommendations
- Depends on the quality and completeness of metadata