You are viewing main version, which requires installation from source. If you'd like
regular pip install, checkout the latest stable version (v3.2.0).
Load text data
This guide shows you how to load text datasets. To learn how to load any type of dataset, take a look at the general loading guide.
Text files are one of the most common file types for storing a dataset. By default, 🤗 Datasets samples a text file line by line to build the dataset.
>>> from datasets import load_dataset
>>> dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})
# Load from a directory
>>> dataset = load_dataset("text", data_dir="path/to/text/dataset")
To sample a text file by paragraph or even an entire document, use the sample_by
parameter:
# Sample by paragraph
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="paragraph")
# Sample by document
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="document")
You can also use grep patterns to load specific files:
>>> from datasets import load_dataset
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
To load remote text files via HTTP, pass the URLs instead:
>>> dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")
To load XML data you can use the “xml” loader, which is equivalent to “text” with sample_by=“document”:
>>> from datasets import load_dataset
>>> dataset = load_dataset("xml", data_files={"train": ["my_xml_1.xml", "my_xml_2.xml"], "test": "my_xml_file.xml"})
# Load from a directory
>>> dataset = load_dataset("xml", data_dir="path/to/xml/dataset")