ITFormer-0.5B: Bridging Time Series Signals and Natural Language for Multi-Modal QA
Model Overview
ITFormer-0.5B is a state-of-the-art multi-modal framework that bridges time-series data and natural language for dynamic question answering (QA). Built on the Instruct Time Transformer (ITFormer) architecture, the model is designed to handle complex temporal-textual QA across multiple tasks. It is trained on the EngineMT-QA dataset, which focuses on real-world aircraft engine operational and maintenance data.
ITFormer-0.5B integrates time-series data (such as engine sensor readings) with natural language queries, enabling intelligent, real-time analysis across four task types: understanding, perception, reasoning, and decision-making.
Key Features
- Multi-modal Fusion: Combines time-series sensor data with textual input for cross-modal reasoning.
- Efficient and Scalable: Achieves high performance while adding fewer than 1% additional trainable parameters on top of the frozen LLM backbone.
- State-of-the-art Performance: ITFormer-0.5B outperforms existing models in temporal-textual QA tasks, showing improvements in accuracy, BLEU, and F1 scores across all tasks.
- Domain-Specific Application: Trained on the EngineMT-QA dataset, which consists of questions related to aircraft engine performance, faults, and maintenance.
Model Components
ITFormer-0.5B integrates several key components to enhance multi-modal reasoning:
- Time Token Position Encoding (TPE): Encodes temporal, channel, and segment-level position information for better time-series representation.
- Learnable Instruct Tokens (LIT): Helps align the temporal features with task-specific queries, enabling efficient multi-modal interaction.
- Instruct Time Attention (ITA): A mechanism that dynamically aligns and fuses temporal features with textual queries.
- Time Token as Language (TAL): Represents temporal features as language-compatible tokens, allowing for smooth integration with LLMs.
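To make the interplay of these components concrete, the following is a minimal, illustrative sketch of an Instruct-Time-Attention-style fusion block in PyTorch: a small set of learnable instruct tokens attends over the encoded time-series features and produces language-compatible time tokens. The class name, dimensions, and structure are assumptions for illustration only and do not reproduce the released ITFormer implementation.

# Illustrative sketch only: simplified cross-attention fusion in the spirit of
# Instruct Time Attention (ITA). Names and dimensions are assumptions, not the
# released ITFormer code.
import torch
import torch.nn as nn

class InstructTimeAttentionSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_instruct_tokens=16):
        super().__init__()
        # Learnable Instruct Tokens (LIT): task-level queries that pull
        # information out of the temporal feature sequence.
        self.instruct_tokens = nn.Parameter(torch.randn(n_instruct_tokens, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, time_features):
        # time_features: (batch, seq_len, d_model) from the time-series encoder,
        # already carrying temporal/channel/segment position information (TPE).
        batch = time_features.size(0)
        queries = self.instruct_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(queries, time_features, time_features)
        # The fused tokens act as language-compatible "time tokens" (TAL) that
        # can be prepended to the LLM's text embeddings.
        return self.norm(fused + queries)

# Example: a batch of 4 signals, 256 time steps, projected to d_model=512
fused_tokens = InstructTimeAttentionSketch()(torch.randn(4, 256, 512))
print(fused_tokens.shape)  # torch.Size([4, 16, 512])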
Model Architecture
ITFormer-0.5B integrates a time-series encoder with a frozen large language model (LLM). The encoder extracts semantic features from the time-series data, and the LLM processes the corresponding textual query. The fused representation is then passed through the LLM’s decoder to generate the final answer.
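At a high level, this flow can be pictured as prepending the language-aligned time tokens to the embedded text query and letting the frozen LLM decode the answer. The sketch below illustrates that idea with the Hugging Face API; the stand-in backbone (Qwen/Qwen2.5-0.5B), the token shapes, and the use of inputs_embeds are assumptions for illustration, not the project's actual interface.

# Illustrative flow only: prepend fused time tokens to the embedded text query
# and run a frozen decoder-only LLM. Backbone name and shapes are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llm_name = "Qwen/Qwen2.5-0.5B"           # any ~0.5B decoder-only LLM as a stand-in
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)
llm.requires_grad_(False)                 # the LLM stays frozen; only adapters train

# Fused time tokens from the encoder and attention module (see the sketch above)
hidden = llm.get_input_embeddings().embedding_dim
time_tokens = torch.randn(1, 16, hidden)

query = "What is the health status of the HPT?"
text_ids = tokenizer(query, return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(text_ids)

# Concatenate time tokens with the embedded query and run the frozen decoder
inputs_embeds = torch.cat([time_tokens, text_embeds], dim=1)
outputs = llm(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)  # (1, 16 + number of text tokens, vocab_size)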
Tasks
ITFormer-0.5B is capable of answering questions across four primary tasks:
- Understanding: Interprets sensor data to understand engine status and behavior.
- Perception: Detects and diagnoses faults in the engine components based on time-series data.
- Reasoning: Makes predictions about future engine health and identifies degradation trends.
- Decision-Making: Suggests actionable maintenance strategies based on predicted trends and failure probabilities.
Dataset: EngineMT-QA
ITFormer-0.5B is trained on the EngineMT-QA dataset, a large-scale, multi-task dataset specifically created for time-series question answering. The dataset contains over 110k question-answer pairs based on real-world engine operational and maintenance scenarios.
Example Tasks from EngineMT-QA:
- Understanding: What does the increase in temperature at the LPT outlet indicate in the provided engine signal?
- Perception: What is the health status of the High-Pressure Turbine (HPT) in the given engine signal?
- Reasoning: Given the engine signal across multiple cycles, what is the predicted probability of failure?
- Decision-Making: Based on the engine signal data, what immediate actions should be taken to address observed issues?
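Conceptually, each sample pairs a multi-channel engine signal with a task-specific question and its answer. The record below is purely illustrative; the field names and file format are assumptions, not the actual EngineMT-QA schema.

# Illustrative only: one way a temporal-textual QA pair could be organized.
example_record = {
    "signal": "engine_signal_0001.npy",   # multi-channel sensor readings over operating cycles (assumed format)
    "task": "perception",
    "question": "What is the health status of the High-Pressure Turbine (HPT) in the given engine signal?",
    "answer": "<free-text description of the HPT's condition>",
}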
Model Performance
ITFormer-0.5B Achieves State-of-the-Art Results
ITFormer-0.5B outperforms existing models across all QA tasks, achieving:
- Understanding: ROUGE-L of 81.22, BLEU of 69.23
- Decision-Making: ROUGE-L of 75.42, BLEU of 54.50
- Perception: F1 score of 79.26
- Reasoning: Accuracy of 73.81
These results highlight the model's capability to handle complex temporal-textual reasoning and its robustness across different task types.
Usage
Installation
To use the ITFormer-0.5B model, install the transformers library and the huggingface_hub client:
pip install transformers
pip install huggingface_hub
Loading the Model
You can load the model using the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub (replace the repository
# name with the actual model path; a custom architecture may also require
# trust_remote_code=True)
model_name = "your_username/ITFormer-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text query
query = "What is the status of the High-Pressure Turbine (HPT)?"

# Tokenize the query and generate an answer
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(inputs["input_ids"])

# Decode the generated tokens into text
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)
Example Question-Answering
To perform multi-modal question answering, ITFormer-0.5B takes both a time-series signal and a textual query. The text side is handled exactly as in the loading example; how the preprocessed signal is passed to the model depends on the released implementation, so the signal-related lines below are placeholders. For instance:
# Example question about an engine signal
time_series_data = "path_to_time_series_data"  # Example: "engine_signal_data.csv"
query = "What is the condition of the engine?"

# Preprocess the time-series file into the tensor format expected by the
# time-series encoder (follow the model's own preprocessing code), then
# tokenize the textual query as before
inputs = tokenizer(query, return_tensors="pt")

# Generate an answer; for true multi-modal inference, the preprocessed signal
# must be supplied alongside the text according to the model's interface
outputs = model.generate(inputs["input_ids"])

# Decode and print the result
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)
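The exact preprocessing pipeline ships with the model's own code. As a rough illustration only, under assumed conventions (a CSV with one column per sensor channel, fixed-length windows, per-channel normalization), preparing a signal tensor might look like this:

# Rough illustration of preparing a sensor signal for the time-series encoder.
# Column layout, window length, and normalization are assumptions; follow the
# preprocessing shipped with the model for real use.
import numpy as np
import pandas as pd
import torch

def load_engine_signal(csv_path, window_len=256):
    df = pd.read_csv(csv_path)              # rows = time steps, columns = sensor channels
    signal = df.to_numpy(dtype=np.float32)
    # Per-channel standardization
    signal = (signal - signal.mean(axis=0)) / (signal.std(axis=0) + 1e-8)
    # Keep the most recent window and add a batch dimension: (1, window_len, n_channels)
    signal = signal[-window_len:]
    return torch.from_numpy(signal).unsqueeze(0)

# signal_tensor = load_engine_signal("engine_signal_data.csv")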
Model Deployment
ITFormer-0.5B is optimized for efficient deployment, requiring minimal computational overhead due to its small number of trainable parameters. You can fine-tune or directly use the model for specific temporal-textual tasks, such as fault detection or predictive maintenance.
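Because the LLM backbone stays frozen, fine-tuning touches only the small set of ITFormer-specific parameters. The sketch below illustrates that pattern; the submodule name fusion is hypothetical and should be replaced with the actual adapter modules from the released code.

# Minimal sketch of parameter-efficient fine-tuning: freeze the LLM backbone and
# optimize only the adapter/fusion parameters.
import torch

def trainable_parameters(model):
    # Freeze everything, then re-enable only the adapter/fusion parameters
    # ("fusion" is a hypothetical attribute name used for illustration)
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fusion.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Optimize only the trainable (<1% of total) parameters
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)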
Citation
If you use ITFormer-0.5B in your research, please cite the following:
@article{ITFormer,
  title={ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset},
  author={Yilin Wang and Peixuan Lei and Jie Song and Haoyu Zhe and Tao Chen and Yuxuan Zhang and Lei Jia and Yuanxiang Li and Zhongyu Wei},
  journal={ICML 2025},
  year={2025},
  url={https://huggingface.co/papers/2506.20093}
}
License
This model is released under the MIT License.
Acknowledgements
We would like to acknowledge the development of the EngineMT-QA dataset and the contributions of the researchers involved in time-series and multi-modal AI. Special thanks to Hugging Face for providing a platform for model sharing and research collaboration.