tsungyi committed · verified · Commit c786949 · 1 Parent(s): df59fbe

Update README.md

Files changed (1):
  1. README.md +16 -16

README.md CHANGED
@@ -4,9 +4,9 @@ license_name: nvidia-open-model-license
 license_link: >-
   https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
 datasets:
-- nvidia/Cosmos-Reason1-SFT-Dataset-Sample
-- nvidia/Cosmos-Reason1-RL-Dataset-Sample
-- nvidia/Cosmos-Reason1-Benchmark-Sample
+- nvidia/Cosmos-Reason1-SFT-Dataset
+- nvidia/Cosmos-Reason1-RL-Dataset
+- nvidia/Cosmos-Reason1-Benchmark
 library_name: cosmos
 language:
 - en
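The hunk above swaps the sample dataset IDs for the full releases. As a quick orientation, below is a minimal sketch of pulling one of the newly referenced repos with the `datasets` library; the repo ID comes from the updated frontmatter, but whether it loads without an explicit config or split name is an assumption about the repo layout.

```python
from datasets import load_dataset

# Hypothetical usage: the repo ID is taken from the updated frontmatter, but
# loading without an explicit config/split is an assumption about the repo layout.
ds = load_dataset("nvidia/Cosmos-Reason1-Benchmark")
print(ds)                 # DatasetDict: shows the available splits and feature schema
split = list(ds.keys())[0]
print(ds[split][0])       # peek at one text-annotation record
```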
@@ -148,21 +148,24 @@ All datasets go through the data annotation process described in the technical p
 * HoloAssist: Hybrid: Human,Automated
 * AV: Hybrid: Human,Automated
 
+**Metrics**:
+We report the model accuracy on the embodied reasoning benchmark introduced in [Cosmos-Reason1](https://arxiv.org/abs/2503.15558). The results differ from those presented in Table 9 due to additional training aimed at supporting a broader range of Physical AI tasks beyond the benchmark.
+| | [RoboVQA](https://robovqa.github.io/) | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | [HoloAssist](https://holoassist.github.io/) | [RoboFail](https://robot-reflect.github.io/) | Average |
+|---|---|---|---|---|---|---|---|
+| **Accuracy** | 87.3 | 70.8 | 63.7 | 48.9 | 62.7 | 57.2 | 65.1 |
+
 ## Dataset Format
 Modality: Video (mp4) and Text
 
 ## Dataset Quantification
 We release the embodied reasoning data and benchmarks. Each data sample is a pair of video and text. The text annotations include understanding and reasoning annotations described in the Cosmos-Reason1 paper. Each video may have multiple text annotations. The quantity of the video and text pairs is described in the table below.
 
-| Dataset | SFT Data | RL Data | Benchmark Data |
-|--------------|---------:|--------:|---------------:|
-| [RoboVQA](https://robovqa.github.io/) | 1.14m | 252 | 110 |
-| AV | 24.7k | 200 | 100 |
-| [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | 258k | 240 | 100 |
-| [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | 38.9k | 200 | 100 |
-| [HoloAssist](https://holoassist.github.io/) | 273k | 200 | 100 |
-| [RoboFail](https://robot-reflect.github.io/) | N/A | N/A | 100 |
-| **Total Storage Size** | **300.6GB** | **2.6GB** | **1.5GB** | |
+| | [RoboVQA](https://robovqa.github.io/) | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | [HoloAssist](https://holoassist.github.io/) | [RoboFail](https://robot-reflect.github.io/) | Total Storage Size |
+|---|---|---|---|---|---|---|---|
+| **SFT Data** | 1.14m | 24.7k | 258k | 38.9k | 273k | N/A | **300.6GB** |
+| **RL Data** | 252 | 200 | 240 | 200 | 200 | N/A | **2.6GB** |
+| **Benchmark Data** | 110 | 100 | 100 | 100 | 100 | 100 | **1.5GB** |
+
 
 
 We release text annotations for all embodied reasoning datasets and videos for RoboVQA and AV datasets. For other datasets, users may download the source videos from the original data source and find corresponding video sources via the video names. The held-out RoboFail benchmark is released for measuring the generalization capability.
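The added metrics table reports an Average over the six benchmarks; a one-line check (a sketch, assuming an unweighted mean) reproduces the 65.1 figure:

```python
# Per-dataset accuracies copied from the added metrics table.
accuracies = {
    "RoboVQA": 87.3, "AV": 70.8, "BridgeDataV2": 63.7,
    "Agibot": 48.9, "HoloAssist": 62.7, "RoboFail": 57.2,
}
print(round(sum(accuracies.values()) / len(accuracies), 1))  # -> 65.1
```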
@@ -237,7 +240,4 @@ We value you, the datasets, the diversity they represent, and what we have been
 | Model Application(s): | Physical AI common sense understanding and embodied reasoning |
 | Describe the life critical impact (if present). | None Known |
 | Use Case Restrictions: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) |
-| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog. |
-
-
-
+| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog. |
 