SigSpace: An AI Agent for the Tahoe-100M dataset

This is a submission for the Tahoe-DeepDive Hackathon 2025.

Team Name

SigSpace

Members

Project

SigSpace: An AI Agent for the Tahoe-100M dataset

Overview

We have developed an AI agent that accesses the Tahoe-100M dataset along with publicly available and novel datasets. This agent works to refine and expand the mechanisms of action (MOA) and drug signatures of the perturbations within the Tahoe-100M dataset.

Motivation

Drug discovery in the age of Large Language Models (LLMs) can be enhanced through agentic workflows that parse diverse sources of unstructured information to synthesize and connect hypotheses across different fields and modalities. However, these models are primarily trained on text data and lack the capacity to effectively interrogate rich biological databases with complex, biologically-motivated queries. In this work, we provide a proof of concept demonstrating how the Tahoe-100M dataset can be integrated with publicly available relevant datasets to expand the hypothesis space for mechanisms of action and drug responses in the perturbations tested in the Tahoe-100M dataset.

Methods

We have curated new datasets that enrich the descriptions of the drugs and cell lines present in the Tahoe-100M dataset.

Specifically:

  • TAHOE-100M: vision scores and metadata.
  • PRISM: We use PRISM drug sensitivity data, which reports the IC50, the concentration of a compound needed to reduce cancer cell viability by 50%. Measurements are based on pooled screening of barcoded cell lines and provide a high-throughput assessment of drug response across a large panel of cancer models.
  • NCI60: We use NCI-60 LC50 data, which reports the concentration of a drug that kills 50% of the cells present at the time of drug addition. It is measured across a panel of 60 human cancer cell lines using standardized multi-dose assays.
  • JUMP: We use the JUMP dataset, which captures morphological profiles of cells in response to chemical and genetic perturbations. High-content imaging and automated feature extraction are used to quantify cellular changes, enabling large-scale profiling of perturbation effects across diverse biological contexts.
  • UCE-CXG-EMBEDDING: We search for natural perturbations using UCE embeddings (an AI virtual cell model) of CXG data.
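
To make the 50% inhibition concentration described for PRISM (commonly called IC50) concrete, here is a minimal sketch that estimates it from a small dose-response series by linear interpolation on log10(dose). This is an illustration only: the function name `estimate_ic50` and the example values are hypothetical, and it is not the curve-fitting procedure used by PRISM or NCI-60.

```python
import math


def estimate_ic50(doses, viabilities):
    """Estimate IC50 by linear interpolation on log10(dose).

    doses: increasing concentrations (hypothetical units, e.g. uM)
    viabilities: viable fraction relative to untreated control at each dose
    Returns None if viability never crosses 0.5 within the tested range.
    """
    points = list(zip(doses, viabilities))
    # Walk over adjacent dose pairs and find the first one that brackets 50%.
    for (d_lo, v_lo), (d_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            # Fraction of the way from d_lo to d_hi (in log space) where v = 0.5.
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_lo, log_hi = math.log10(d_lo), math.log10(d_hi)
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    return None


# Made-up example series, not real PRISM measurements.
doses = [0.01, 0.1, 1.0, 10.0]
viabilities = [0.95, 0.80, 0.30, 0.05]
ic50 = estimate_ic50(doses, viabilities)
print(ic50)
```

A lower IC50 indicates a more sensitive cell line. Note that LC50 as described for NCI-60 is defined relative to the number of cells present at the time of drug addition, so the two values are not directly interchangeable.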

Data

The following datasets are used in our project:

  • drug_metadata_inchikey.csv: Drug metadata from Tahoe-100M including InChIKey identifiers for chemical structure representation.
  • compound_genetic_perturbation_cosine_similarity_inchikey.csv: Cosine similarity scores between compound and genetic perturbations in the JUMP dataset.
  • Tahoe_PRISM_cell_by_drug_ic50_matrix_named.csv: IC50 values showing drug sensitivity across cell lines.
  • filtered_results.csv: Filtered NCI60 LC50 data for drug response analysis.
  • cell_line_metadata.csv: Comprehensive metadata for cell lines in the Tahoe dataset.
  • drug_metadata.csv: Detailed information about drugs in the Tahoe dataset.
  • tahoe_vision_scores.h5ad: Vision scores in AnnData format capturing cellular morphological changes.
  • Tahoe_PRISM_matched_cell_metadata_final.csv: Cell metadata for PRISM-Tahoe matched cell lines.
  • Tahoe_PRISM_matched_drug_metadata_final.csv: Drug metadata for PRISM-Tahoe matched compounds.
  • in_tahoe_search_result_df.csv: Search results for perturbations within the Tahoe dataset embedded with UCE.
  • cxg_search_result_df.csv: Cross-dataset search results using CXG embeddings with UCE.
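
Because several of these files carry InChIKey identifiers, one plausible way to connect them is to join on that key. The sketch below links a drug name to its most similar genetic perturbations using only the Python standard library. The column names (`drug`, `inchikey`, `gene`, `cosine_similarity`), the inline rows, and the helper `top_genetic_matches` are all assumptions made for illustration; the actual files may be laid out differently.

```python
import csv
import io

# Hypothetical miniature stand-ins for drug_metadata_inchikey.csv and
# compound_genetic_perturbation_cosine_similarity_inchikey.csv.
# Column names and values are placeholders, not taken from the real files.
DRUG_METADATA_CSV = """drug,inchikey
DrugA,AAAAAAAAAAAAAA-BBBBBBBBBB-N
DrugB,CCCCCCCCCCCCCC-DDDDDDDDDD-N
"""

SIMILARITY_CSV = """inchikey,gene,cosine_similarity
AAAAAAAAAAAAAA-BBBBBBBBBB-N,GENE1,0.62
AAAAAAAAAAAAAA-BBBBBBBBBB-N,GENE2,0.15
CCCCCCCCCCCCCC-DDDDDDDDDD-N,GENE3,0.48
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def top_genetic_matches(drug_name, drugs, similarities, k=3):
    """Return up to k (gene, similarity) pairs for a drug, highest first."""
    # Resolve the drug name to its InChIKey(s), then filter the similarity table.
    keys = {row["inchikey"] for row in drugs if row["drug"] == drug_name}
    hits = [
        (row["gene"], float(row["cosine_similarity"]))
        for row in similarities
        if row["inchikey"] in keys
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]


drugs = load_rows(DRUG_METADATA_CSV)
similarities = load_rows(SIMILARITY_CSV)
matches = top_genetic_matches("DrugA", drugs, similarities)
print(matches)
```

Joining on InChIKey rather than on drug name avoids mismatches caused by synonyms or spelling differences between datasets.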

Results

We have developed a Gradio application that accesses these databases and performs complex queries, enhancing and grounding the reasoning in real biological measurements.

Discussion

We deployed SigSpace on a few queries and found that it was able to integrate insights from across these datasets, generating novel hypotheses about the MOA of drugs of interest.
