---
license: mit
library_name: fasttext
pipeline_tag: data-filtering
tags:
- pretraining-data-selection
---
This fastText model is a filter for selecting high-quality pretraining data, as described in the paper *Improving Pretraining Data Using Perplexity Correlations*. It targets the LAMBADA IT task.
The model uses perplexity correlations to identify text segments whose language-model perplexity correlates strongly with performance on downstream benchmarks. It is not meant for general-purpose text classification; instead, it outputs a score indicating how suitable a text segment is for pretraining.
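As a rough illustration of how such a score might be read out and used, here is a minimal sketch built on the `fasttext` Python package. It makes several assumptions that are not confirmed by this card: the label name `__label__include` is hypothetical (check `model.get_labels()` for the real one), the helpers `load_filter`, `quality_score`, and `select_top_fraction` are illustrative names, and keeping a fixed top fraction is just one possible selection rule, not necessarily the one used in the paper.

```python
from typing import List

# Assumption: hypothetical name for the "keep this text" label.
# Inspect model.get_labels() on the real model to find the actual label.
POSITIVE_LABEL = "__label__include"


def load_filter(model_path: str):
    """Load a fastText model from a local .bin file (path supplied by the caller)."""
    import fasttext  # imported lazily so the helpers below run without it

    return fasttext.load_model(model_path)


def quality_score(model, text: str, positive_label: str = POSITIVE_LABEL) -> float:
    """Return the probability assigned to `positive_label`, or 0.0 if it is absent."""
    # fastText's predict() rejects newlines, so collapse all whitespace to spaces.
    cleaned = " ".join(text.split())
    labels, probs = model.predict(cleaned, k=-1)  # k=-1 asks for every label
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0


def select_top_fraction(model, docs: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Keep the highest-scoring fraction of `docs` (illustrative selection rule)."""
    if not docs:
        return []
    ranked = sorted(docs, key=lambda d: quality_score(model, d), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

With a downloaded model file, usage would look something like `model = load_filter("path/to/model.bin")` followed by `select_top_fraction(model, documents, keep_fraction=0.5)`, where the path and the fraction are placeholders to adapt.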
For complete usage instructions and the theoretical background, please refer to the project's GitHub repository.