🎬 Word2Vec Model for Personalized Movie Recommendations

This repository provides a trained Word2Vec model and associated artifacts for a content-based movie recommendation system, used in the Letterboxd Movie Recommender Gradio application.


🧠 Model Overview

  • Architecture: Gensim Word2Vec (CBOW)

  • Vector Size: 100

  • Window Size: 5

  • Min Word Count: 2

  • Training Data: Aggregated movie metadata:

    • Descriptions
    • Taglines
    • Genres
    • Actors
    • Directors
    • Themes

Each movie is represented by a 100-dimensional vector, created by averaging the Word2Vec embeddings of the words in its metadata. Cosine similarity between these "movie vectors" is then used to generate personalized recommendations.
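
The averaging step can be sketched with plain NumPy. The helper names `movie_vector` and `cosine`, the 3-dimensional toy embeddings, and the choice to skip out-of-vocabulary tokens are all assumptions made for illustration; the repository's own code may handle these details differently.

```python
import numpy as np

def movie_vector(tokens, embeddings, dim=100):
    """Average the embeddings of the tokens that have one; zeros if none do."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical 3-dimensional embeddings (the real model uses 100 dimensions).
embeddings = {
    "crime":  np.array([1.0, 0.0, 0.0]),
    "dark":   np.array([0.0, 1.0, 0.0]),
    "comedy": np.array([0.0, 0.0, 1.0]),
}

thriller = movie_vector(["dark", "crime", "unknownword"], embeddings, dim=3)
noir = movie_vector(["crime", "dark", "dark"], embeddings, dim=3)
sitcom = movie_vector(["comedy"], embeddings, dim=3)

print(thriller)
print(cosine(thriller, noir), cosine(thriller, sitcom))
```

In this toy setup, movies that share metadata words end up with a high cosine similarity, while movies with no words in common score zero.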


🔧 Intended Use

This model powers the backend of a Gradio app that:

  • Builds a user taste profile from their Letterboxd history
  • Computes cosine similarity between the user's profile and all movies
  • Returns the top-N most similar movies as recommendations
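
The three steps above could look roughly like the sketch below. How the app actually builds the taste profile is not described here, so the rating-weighted mean, the exclusion of already-watched movies, and the function name `recommend` are assumptions for illustration only.

```python
import numpy as np

def recommend(movie_vectors, watched, ratings=None, top_n=5):
    """Return indices of the top_n unwatched movies closest to the user's profile."""
    watched = np.asarray(watched)
    weights = np.ones(len(watched)) if ratings is None else np.asarray(ratings, dtype=float)

    # Taste profile: (rating-weighted) mean of the watched movies' vectors.
    profile = np.average(movie_vectors[watched], axis=0, weights=weights)

    # Cosine similarity between the profile and every movie.
    norms = np.linalg.norm(movie_vectors, axis=1) * np.linalg.norm(profile)
    sims = movie_vectors @ profile / np.where(norms == 0, 1.0, norms)

    # Never recommend something the user has already seen.
    sims[watched] = -np.inf
    ranked = np.argsort(sims)[::-1]
    return ranked[: min(top_n, len(sims) - len(watched))]

# Hypothetical 2-dimensional vectors for five movies.
vectors = np.array([
    [1.0, 0.0],   # 0: watched
    [0.9, 0.1],   # 1
    [0.0, 1.0],   # 2
    [0.7, 0.7],   # 3
    [0.8, 0.2],   # 4: watched
])
recs = recommend(vectors, watched=[0, 4], ratings=[5.0, 3.0], top_n=2)
print(recs)
```

Weighting by rating pulls the profile toward films the user liked most; an unweighted mean (`ratings=None`) treats every watched film equally.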

📦 Repository Contents

File               Description
movie_vectors.npy  NumPy array of 100-dimensional vectors, one per movie
movie_data.pkl     Pandas DataFrame with processed metadata
word2vec.model     Trained Word2Vec model
README.md          Documentation

🧪 Example Usage

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load data from the Hub (optional)
# from huggingface_hub import hf_hub_download
# vectors_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_vectors.npy")
# data_path = hf_hub_download(repo_id="n9e6y/letterboxd_movie_recommender", filename="movie_data.pkl")
# movie_vectors = np.load(vectors_path)
# movie_data = pd.read_pickle(data_path)

# Or load locally
movie_vectors = np.load("movie_vectors.npy")
movie_data = pd.read_pickle("movie_data.pkl")

# Prepare movie index for lookup
movie_data['name_year'] = movie_data['name'] + ' (' + movie_data['year'].astype(str) + ')'
movie_data_unique = movie_data.drop_duplicates(subset="name_year")
indices = pd.Series(movie_data_unique.index, index=movie_data_unique['name_year'])

# Find similar movies
movie_title = 'The Dark Knight (2008)'
try:
    idx = indices[movie_title]
    sim_scores = cosine_similarity(movie_vectors[idx].reshape(1, -1), movie_vectors)[0]
    # Rank by similarity, drop the query movie by position, keep the top 5
    order = np.argsort(sim_scores)[::-1]
    top_indices = order[order != idx][:5]

    print(f"Movies similar to '{movie_title}':")
    print(movie_data['name_year'].iloc[top_indices])
except KeyError:
    print(f"Movie '{movie_title}' not found.")

📂 Data Source

  • Movie metadata was compiled using a Kaggle dataset and enriched with additional features for training.

⚠️ Limitations

  • Only uses content-based filtering; no collaborative (user-to-user) data
  • May not suggest serendipitous or out-of-profile recommendations
  • Depends on the quality and completeness of metadata
