License Conflict: MIT vs CC BY-NC 4.0

#3
by qiuqiu666 - opened

Hi, I’d like to report a potential license conflict in vicgalle/xlm-roberta-large-xnli-anli. Based on the model card and documentation, this model is distributed under the MIT License. However, one of the core datasets used during trained appears to be facebook/anli, which is published under the CC BY-NC 4.0 license.

That introduces a potential license conflict, since the MIT License is a permissive open-source license that allows unrestricted commercial use, while CC BY-NC 4.0 explicitly prohibits commercial use and imposes non-trivial attribution requirements.

⚠️ Key incompatibilities between MIT and CC BY-NC 4.0:

CC BY-NC 4.0 – Section 2(a)(1) & 2(a)(5):
  • Prohibits commercial use of the dataset and its derivatives
  • Requires attribution when used or adapted

MIT License – Permissions:
  • Allows unrestricted commercial use, modification, and redistribution
  • Contains no built-in mechanism for preserving upstream data obligations

This creates legal uncertainty for downstream users, especially when:

 • The model is redistributed in commercial settings (e.g. industry R&D or ML platforms)
 • Users assume commercial usage rights due to the MIT label
 • Attribution to the ANLI dataset is missing or unclear in the model card or repository

Since AI models are typically considered derivative works of their training data, the licensing obligations of the dataset may carry over to the trained model. Using a commercial-permissive license (MIT) on top of non-commercial-restricted data (CC BY-NC) may result in a violation of the dataset license.

🔹 Suggestion:

To reduce confusion and ensure alignment with upstream licensing terms, here are a few possible steps:

1. Clarify the use of CC BY-NC 4.0 datasets in the README or model card, including an attribution line for the ANLI dataset.

2. Add a note that the model may only be used for non-commercial purposes, in accordance with the data license, despite being published under MIT.

3. Alternatively, consider adjusting the license to one that reflects the non-commercial limitation (e.g., a custom research-use-only tag or a dual-license model).

4. If commercial usage is intended, retraining the model using datasets with fully commercial-compatible licenses may be needed.

Hope this helps! Let me know if you have any questions or need more info.

Thanks for your attention!

Sign up or log in to comment