Seed-OSS-36B-Instruct Model Card

This model is released to foster global open research and empower developers.

Model Details

  • Model Name: Seed-OSS
  • Model Type/Structure: Causal language model
  • Model Version:
    • Seed-OSS-36B-Base
    • Seed-OSS-36B-Base-woSyn
    • Seed-OSS-36B-Instruct
  • Context Length: 512K (524,288)
  • Developed by: ByteDance Seed Team
  • Release Date: Aug 20, 2025
  • License: Apache 2.0
  • Contact: [email protected]
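
A minimal usage sketch with Hugging Face transformers is shown below. The repo id, dtype, and device placement are assumptions for illustration; substitute the Hub path of the variant you actually use.

```python
# Minimal sketch: load a Seed-OSS variant and run one chat turn.
# "ByteDance-Seed/Seed-OSS-36B-Instruct" is an assumed Hub repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-OSS-36B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # shard across available devices
)

messages = [{"role": "user", "content": "Give a one-sentence summary of open-weight models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```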

Intended Use and Limitations

Intended Uses: As a general-purpose model, Seed-OSS supports multiple use cases, including question answering, summarization, and reasoning. It takes text as input and generates text as output. The following are a few examples of potential use cases:

  • Underlying technology for chatbots and conversational AI.
  • Assistance with content creation (e.g., drafting, summarizing, editing).
  • Auxiliary information retrieval in non-critical scenarios.

Prohibited Uses:

  • Professional medical, legal, or financial advice.
  • Automated Decision-Making: High-risk, high-impact automated decision-making, unless the model has been rigorously fine-tuned and evaluated by humans for that specific use case.
  • Minor Safety: Abusing, exploiting, or harming a minor or any individual under the age of consent, including grooming or child sexual exploitation.
  • Illegal Activities: Violating any laws, including fraud, terrorism, or generating Child Sexual Abuse Material (CSAM).
  • Hate & Harassment: Generating discriminatory, hateful, or harassing content.
  • Misinformation: Engaging in disinformation, misinformation, or deceptive activities, including but not limited to passing off or representing AI-generated content as human-generated.
  • Military & Surveillance: Any use for military, weaponry, intelligence gathering, or mass surveillance purposes.
  • High-Risk Harm: Generating sexually explicit material or violating privacy (e.g., exposing personally identifiable information (PII)).

Limitations:

  • Languages: Seed-OSS is designed as an international (i18n) model, but it is primarily optimized for and evaluated in English. Performance in other languages is limited and has not been robustly tested.
  • Training Data: Because the training data is drawn predominantly from publicly available sources, the model's understanding of specific cultures, values, and historical events may be incomplete.
  • Hallucination: The model generates content based on statistical patterns in its training data and may produce information that is incorrect or entirely fictional.
  • Harmful Content Generation: Like any large language model, Seed-OSS may still produce outputs that are harmful, inaccurate, or offensive despite extensive safety training. We encourage developers to conduct their own testing, ensure human oversight, and deploy mitigation strategies for their specific use cases (a minimal sketch of such a guard follows this list).
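
As a loose illustration of the human-oversight recommendation above, the snippet below gates generated text behind a moderation check. The moderate() helper is a hypothetical placeholder, not part of Seed-OSS; in practice you would call a real moderation model or service.

```python
# Sketch: withhold flagged outputs for human review before serving them.
# moderate() is a hypothetical hook, not part of Seed-OSS or transformers.
BLOCK_MESSAGE = "This response was withheld pending human review."

def moderate(text: str) -> bool:
    """Hypothetical hook: return True if the text should be blocked."""
    banned_terms = {"example-banned-term"}  # illustrative only
    return any(term in text.lower() for term in banned_terms)

def guarded_generate(generate_fn, prompt: str) -> str:
    """Wrap a model call so flagged outputs are withheld."""
    output = generate_fn(prompt)
    return BLOCK_MESSAGE if moderate(output) else output
```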

For developers who intend to build customized models or applications on top of this model, please be aware that you are fully responsible for the customized model or application and for ensuring that it complies with all applicable laws.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Content Safety Measures

We designed measures to ensure the model's content safety throughout the entire training cycle, from training data preparation through model training and evaluation. Prior to launch, Seed-OSS also went through numerous safety and security reviews led by a global Safety and Security team based in Singapore. These measures include:

  • Training Data Filtering: Data cleaning and content filtering mechanisms are designed and executed to ensure that no CSAM or highly toxic content is included in the training dataset. PII removal is also performed through a combination of algorithmic and manual checks (a toy sketch of the algorithmic side appears after this list).
  • Safety Fine-Tuning: Safety training is executed during the Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF/PPO) training stages to minimize the likelihood of harmful outputs.
  • Evaluation: Regular safety testing and adversarial testing are conducted to identify and address safety vulnerabilities.
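
As a toy sketch of the algorithmic side of PII removal, the snippet below scrubs obvious email addresses and phone numbers with regular expressions. The patterns are illustrative assumptions only; the actual pipeline combines much broader detectors with manual review.

```python
# Toy sketch of regex-based PII scrubbing; patterns are illustrative only.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(scrub_pii("Reach [email protected] or +1 (555) 123-4567."))
# -> "Reach [EMAIL] or [PHONE]."
```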

Training Data

Seed-OSS is trained on text data from multiple sources, including publicly available data from the internet, data purchased through partnerships with external vendors, and data generated by our in-house teams. The model is pre-trained on 12 trillion tokens and has a knowledge cutoff of July 2024.

Multiple data-preprocessing measures are applied, including data deduplication, data desensitization, quality filtering, CSAM filtering, and toxic content filtering (a minimal deduplication sketch follows).
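
As a minimal sketch of one of these steps, exact-match deduplication can be done by hashing document contents; production pipelines typically layer fuzzy methods (e.g., MinHash) on top. This snippet is illustrative, not the team's actual pipeline.

```python
# Sketch: exact-match deduplication via SHA-256 content hashing.
import hashlib

def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe(["same text", "other text", "same text"]))
# -> ['same text', 'other text']
```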