tsungyi committed · verified · Commit c786949 · 1 Parent(s): df59fbe

Update README.md

Files changed (1):
  1. README.md +16 -16

README.md CHANGED
@@ -4,9 +4,9 @@ license_name: nvidia-open-model-license
 license_link: >-
   https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license
 datasets:
-- nvidia/Cosmos-Reason1-SFT-Dataset-Sample
-- nvidia/Cosmos-Reason1-RL-Dataset-Sample
-- nvidia/Cosmos-Reason1-Benchmark-Sample
+- nvidia/Cosmos-Reason1-SFT-Dataset
+- nvidia/Cosmos-Reason1-RL-Dataset
+- nvidia/Cosmos-Reason1-Benchmark
 library_name: cosmos
 language:
 - en
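The hunk above swaps the sample dataset IDs for the full releases. As a quick orientation, below is a minimal sketch of pulling one of the newly referenced repos with the `datasets` library; the repo ID comes from the updated frontmatter, but whether it loads without an explicit config or split name is an assumption about the repo layout.

```python
from datasets import load_dataset

# Hypothetical usage: the repo ID is taken from the updated frontmatter, but
# loading without an explicit config/split is an assumption about the repo layout.
ds = load_dataset("nvidia/Cosmos-Reason1-Benchmark")
print(ds)                 # DatasetDict: shows the available splits and feature schema
split = list(ds.keys())[0]
print(ds[split][0])       # peek at one text-annotation record
```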
@@ -148,21 +148,24 @@ All datasets go through the data annotation process described in the technical p
 * HoloAssist: Hybrid: Human,Automated
 * AV: Hybrid: Human,Automated
 
+**Metrics**:
+We report the model accuracy on the embodied reasoning benchmark introduced in [Cosmos-Reason1](https://arxiv.org/abs/2503.15558). The results differ from those presented in Table 9 due to additional training aimed at supporting a broader range of Physical AI tasks beyond the benchmark.
+| | [RoboVQA](https://robovqa.github.io/) | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | [HoloAssist](https://holoassist.github.io/) | [RoboFail](https://robot-reflect.github.io/) | Average |
+|---|---|---|---|---|---|---|---|
+| **Accuracy** | 87.3 | 70.8 | 63.7 | 48.9 | 62.7 | 57.2 | 65.1 |
+
 ## Dataset Format
 Modality: Video (mp4) and Text
 
 ## Dataset Quantification
 We release the embodied reasoning data and benchmarks. Each data sample is a pair of video and text. The text annotations include understanding and reasoning annotations described in the Cosmos-Reason1 paper. Each video may have multiple text annotations. The quantity of the video and text pairs is described in the table below.
 
-| Dataset | SFT Data | RL Data | Benchmark Data |
-|--------------|---------:|--------:|---------------:|
-| [RoboVQA](https://robovqa.github.io/) | 1.14m | 252 | 110 |
-| AV | 24.7k | 200 | 100 |
-| [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | 258k | 240 | 100 |
-| [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | 38.9k | 200 | 100 |
-| [HoloAssist](https://holoassist.github.io/) | 273k | 200 | 100 |
-| [RoboFail](https://robot-reflect.github.io/) | N/A | N/A | 100 |
-| **Total Storage Size** | **300.6GB** | **2.6GB** | **1.5GB** | |
+| | [RoboVQA](https://robovqa.github.io/) | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | [HoloAssist](https://holoassist.github.io/) | [RoboFail](https://robot-reflect.github.io/) | Total Storage Size |
+|---|---|---|---|---|---|---|---|
+| **SFT Data** | 1.14m | 24.7k | 258k | 38.9k | 273k | N/A | **300.6GB** |
+| **RL Data** | 252 | 200 | 240 | 200 | 200 | N/A | **2.6GB** |
+| **Benchmark Data** | 110 | 100 | 100 | 100 | 100 | 100 | **1.5GB** |
+
 
 
 We release text annotations for all embodied reasoning datasets and videos for RoboVQA and AV datasets. For other datasets, users may download the source videos from the original data source and find corresponding video sources via the video names. The held-out RoboFail benchmark is released for measuring the generalization capability.
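The added metrics table reports an Average over the six benchmarks; a one-line check (a sketch, assuming an unweighted mean) reproduces the 65.1 figure:

```python
# Per-dataset accuracies copied from the added metrics table.
accuracies = {
    "RoboVQA": 87.3, "AV": 70.8, "BridgeDataV2": 63.7,
    "Agibot": 48.9, "HoloAssist": 62.7, "RoboFail": 57.2,
}
print(round(sum(accuracies.values()) / len(accuracies), 1))  # -> 65.1
```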
@@ -237,7 +240,4 @@ We value you, the datasets, the diversity they represent, and what we have been
 | Model Application(s): | Physical AI common sense understanding and embodied reasoning |
 | Describe the life critical impact (if present). | None Known |
 | Use Case Restrictions: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) |
-| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog. |
-
-
-
+| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog. |
 