Update README.md
README.md (changed)
<div align="center">

[Project Page](https://farlongctx.github.io/)
[Paper (arXiv)](https://arxiv.org/abs/2503.19325)
[Models (Hugging Face)](https://huggingface.co/guyuchao/FAR_Models)
[Video Generation on UCF-101 (Papers with Code)](https://paperswithcode.com/sota/video-generation-on-ucf-101)

</div>

<a href="https://arxiv.org/abs/2503.19325">Long-Context Autoregressive Video Modeling with Next-Frame Prediction</a>
</p>

## 📢 News

FAR (i.e., <u>**F**</u>rame <u>**A**</u>uto<u>**R**</u>egressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns well with video modeling, similar to next-token prediction in language modeling.

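To make the objective concrete, below is a minimal PyTorch-style sketch of one frame-autoregressive training step under simple assumptions: latent videos of shape `[B, T, C, H, W]`, a per-frame diffusion-style noise level, and a hypothetical `model` interface (not the repository's actual API) whose frame-level causal masking lets each noisy frame attend only to the clean frames before it.

```python
# Minimal sketch of a frame-autoregressive (next-frame prediction) training step.
# Assumptions (not the repository's API): latents of shape [B, T, C, H, W],
# a toy per-frame noise schedule, and a `model` whose internal frame-level
# causal mask lets noisy frame t attend only to clean frames < t.
import torch
import torch.nn.functional as F

def far_training_step(model, latents, num_steps=1000):
    B, T, C, H, W = latents.shape
    # Sample an independent noise level for every frame in the sequence.
    t = torch.randint(0, num_steps, (B, T), device=latents.device)
    noise = torch.randn_like(latents)
    alpha = (1.0 - t.float() / num_steps).view(B, T, 1, 1, 1)  # toy linear schedule
    noisy = alpha.sqrt() * latents + (1.0 - alpha).sqrt() * noise

    # The model denoises each frame using the clean previous frames as context,
    # which mirrors next-token prediction at frame granularity.
    pred = model(noisy_frames=noisy, clean_context=latents, timesteps=t)
    return F.mse_loss(pred, noise)
```
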
### 🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/converenge.jpg?raw=true" width=55%>
</p>

### 🔥 FAR leverages clean visual context without additional image-to-video fine-tuning

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/performance.png?raw=true" width=75%>
</p>

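One way to picture how a single model covers both settings is a rollout loop where the only knob is how many clean context frames are supplied: zero frames gives unconditional generation, one or more gives video prediction. The `sample_frame` callable and its interface below are assumptions for illustration, not functions from this repository.

```python
# Sketch: one rollout loop covers both generation and prediction.
# `sample_frame(frames)` is an assumed callable that denoises one new latent
# frame conditioned on the list of clean frames gathered so far.
import torch

@torch.no_grad()
def rollout(sample_frame, context_frames, num_new_frames):
    frames = list(context_frames)            # 0 frames => generation, >=1 => prediction
    for _ in range(num_new_frames):
        frames.append(sample_frame(frames))  # new clean frame re-enters the context
    return torch.stack(frames, dim=0)
```

Because each newly generated frame is appended as clean context, this sketch needs no separate image-to-video fine-tuning stage.
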
### 🔥 FAR supports 16x longer temporal extrapolation at test time

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/extrapolation.png?raw=true" width=100%>
</p>

### 🔥 FAR supports efficient training on long video sequences with manageable token lengths

<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/long_short_term_ctx.jpg?raw=true" width=55%>
</p>

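The asset name `long_short_term_ctx` suggests a long short-term context split. As a rough, assumed illustration (the latent size, patch sizes, and window length below are made-up numbers, not the repository's settings), keeping only the most recent frames at fine token resolution and coarsening older frames keeps the total token length manageable as videos grow longer.

```python
# Assumed illustration of a long short-term context token budget.
# Numbers (latent size 32x32, patch sizes 2 and 8, 16-frame short window) are made up.
def token_count(num_frames, short_window=16, hw=32, short_patch=2, long_patch=8):
    short = min(num_frames, short_window)      # recent frames: fine-grained tokens
    long = max(num_frames - short_window, 0)   # older frames: coarse tokens
    return short * (hw // short_patch) ** 2 + long * (hw // long_patch) ** 2

# 256 frames at full resolution would be 256 * 256 = 65,536 tokens,
# while the split context gives 16 * 256 + 240 * 16 = 7,936 tokens.
print(token_count(256))  # 7936
```
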
#### 📚 For more details, check out our [paper](https://arxiv.org/abs/2503.19325).