Update README.md
README.md (CHANGED)
<div align="center">

[Project Page](https://farlongctx.github.io/)
[arXiv](https://arxiv.org/abs/2503.19325)
[FAR Models (Hugging Face)](https://huggingface.co/guyuchao/FAR_Models)
[Video Generation on UCF-101 (Papers with Code)](https://paperswithcode.com/sota/video-generation-on-ucf-101)

</div>

<p align="center">
<a href="https://arxiv.org/abs/2503.19325">Long-Context Autoregressive Video Modeling with Next-Frame Prediction</a>
</p>



## 📢 News

FAR (i.e., <u>**F**</u>rame <u>**A**</u>uto<u>**R**</u>egressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns well with video modeling, much like next-token prediction in language modeling.


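
For intuition, here is a minimal, hypothetical sketch of the next-frame objective: corrupt one frame's continuous latent and ask a causal model to recover it from the clean preceding frames. Everything below (the `ToyFAR` module, the latent shapes, and the linear noise schedule) is an illustrative assumption, not the training code or exact objective of this repository.

```python
# Hypothetical illustration only: a toy frame-autoregressive training step with a
# denoising loss on continuous per-frame latents. Names, shapes, and the noise
# schedule are assumptions, not the FAR codebase.
import torch
import torch.nn as nn


class ToyFAR(nn.Module):
    """Tiny transformer that predicts the noise added to the last (target) frame latent,
    conditioned on the clean latents of all preceding context frames.
    Positional/temporal embeddings are omitted for brevity."""

    def __init__(self, latent_dim=16, width=256, depth=4, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, width)
        layer = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out_proj = nn.Linear(width, latent_dim)

    def forward(self, context, noisy_target):
        # context: (B, T_ctx, D) clean frame latents; noisy_target: (B, 1, D)
        x = torch.cat([context, noisy_target], dim=1)
        h = self.blocks(self.in_proj(x))
        return self.out_proj(h)[:, -1:]  # noise estimate for the target frame


def training_step(model, frames):
    """frames: (B, T, D) clean per-frame latents. Pick a target frame, corrupt it,
    and regress the injected noise given the clean autoregressive context."""
    b, t, _ = frames.shape
    idx = torch.randint(1, t, (1,)).item()       # target index (at least one context frame)
    context, target = frames[:, :idx], frames[:, idx:idx + 1]
    noise = torch.randn_like(target)
    tau = torch.rand(b, 1, 1)                    # per-sample noise level in [0, 1)
    noisy_target = (1 - tau) * target + tau * noise
    pred = model(context, noisy_target)
    return ((pred - noise) ** 2).mean()


if __name__ == "__main__":
    model = ToyFAR()
    latents = torch.randn(2, 8, 16)              # (batch, frames, latent dim)
    loss = training_step(model, latents)
    loss.backward()
    print(float(loss))
```
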
### 🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/converenge.jpg?raw=true" width=55%>
</p>

### 🔥 FAR leverages clean visual context without additional image-to-video fine-tuning

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frames = 0) and video prediction (context frames ≥ 1) within a single model.

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/performance.png?raw=true" width=75%>
</p>
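
A rough sketch of why one model covers both settings: the sampler simply starts its frame-by-frame rollout from however many clean context latents it is given (zero for generation, one or more for prediction). The `rollout` function, the `denoiser` interface, and the denoising schedule below are assumptions for illustration, not this repository's sampling code.

```python
# Hypothetical sketch of a unified rollout: start from zero context frames for video
# generation, or from one or more clean context frames for video prediction.
# The `denoiser` interface and the denoising schedule are illustrative assumptions.
from typing import Callable, Optional

import torch


@torch.no_grad()
def rollout(
    denoiser: Callable[[torch.Tensor], torch.Tensor],  # (B, T, D) -> (B, 1, D) noise estimate for the last frame
    num_new_frames: int,
    context: Optional[torch.Tensor] = None,            # (B, T_ctx, D) clean latents, or None for pure generation
    latent_dim: int = 16,
    batch_size: int = 1,
    denoise_steps: int = 10,
) -> torch.Tensor:
    frames = context if context is not None else torch.empty(batch_size, 0, latent_dim)
    taus = torch.linspace(0.99, 0.0, denoise_steps + 1)   # noise levels, near-pure noise -> clean
    for _ in range(num_new_frames):
        x = torch.randn(frames.shape[0], 1, latent_dim)   # each new frame starts as noise
        for i in range(denoise_steps):
            tau, tau_next = float(taus[i]), float(taus[i + 1])
            eps_hat = denoiser(torch.cat([frames, x], dim=1))
            x0_hat = (x - tau * eps_hat) / (1.0 - tau)     # invert x = (1 - tau) * x0 + tau * eps
            x = (1.0 - tau_next) * x0_hat + tau_next * eps_hat
        frames = torch.cat([frames, x], dim=1)             # the denoised frame joins the context
    return frames


if __name__ == "__main__":
    def dummy(seq: torch.Tensor) -> torch.Tensor:
        # Untrained stand-in for a real model, so the sketch runs end to end.
        return torch.zeros(seq.shape[0], 1, seq.shape[-1])

    generated = rollout(dummy, num_new_frames=4)                                  # generation: 0 context frames
    predicted = rollout(dummy, num_new_frames=4, context=torch.randn(1, 8, 16))   # prediction: 8 context frames
    print(generated.shape, predicted.shape)                                       # (1, 4, 16) and (1, 12, 16)
```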
|
### 🔥 FAR supports 16x longer temporal extrapolation at test time

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/extrapolation.png?raw=true" width=100%>
</p>

### 🔥 FAR supports efficient training on long video sequences with manageable token lengths

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/long_short_term_ctx.jpg?raw=true" width=55%>
</p>
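
One way to read the figure above: the token count stays manageable if only a short window of recent frames keeps full token resolution, while older long-term context is pooled down to a few tokens per frame. The sketch below illustrates that budgeting idea only; the window size, pooling scheme, and function names are hypothetical and not taken from this repository.

```python
# Illustrative token budgeting for long-video training, under the assumption that
# recent frames keep full token resolution while distant context is pooled down to a
# few tokens per frame. Function and parameter names are hypothetical.
import torch
import torch.nn.functional as F


def build_long_short_term_context(
    frame_tokens: torch.Tensor,   # (B, T, N, D): N spatial tokens per frame
    short_window: int = 4,        # most recent frames kept at full resolution
    long_tokens: int = 4,         # token budget per older (long-term) frame
) -> torch.Tensor:
    """Returns a flattened context sequence (B, S, D) with far fewer tokens than T * N."""
    b, t, n, d = frame_tokens.shape
    split = max(t - short_window, 0)
    long_part, short_part = frame_tokens[:, :split], frame_tokens[:, split:]

    if split > 0:
        # Pool each old frame's N tokens down to `long_tokens` tokens (adaptive average
        # pooling over the token axis; a learned resampler could be used instead).
        pooled = F.adaptive_avg_pool1d(
            long_part.reshape(b * split, n, d).transpose(1, 2),  # (B*split, D, N)
            long_tokens,
        ).transpose(1, 2).reshape(b, split * long_tokens, d)
    else:
        pooled = frame_tokens.new_zeros(b, 0, d)

    short = short_part.reshape(b, -1, d)
    return torch.cat([pooled, short], dim=1)


if __name__ == "__main__":
    tokens = torch.randn(1, 64, 256, 32)       # 64 frames x 256 tokens = 16384 tokens
    ctx = build_long_short_term_context(tokens)
    print(ctx.shape)                           # torch.Size([1, 1264, 32])
```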

#### 📑 For more details, check out our [paper](https://arxiv.org/abs/2503.19325).