HuggingFace Datasets

Datasets

Splits and subsets

In most cases the splits are train, validation, and test; where subsets exist, they are often one per language.

code:usage.py

import datasets

# Inspect the dataset's metadata without loading the data itself
builder = datasets.load_dataset_builder(...)
builder.info.features
builder.info.splits
datasets.get_dataset_config_names(...)  # list the configs, which roughly tells you the subsets

# Load the dataset
dataset = datasets.load_dataset('path')
dataset = datasets.load_dataset('path', 'name')  # name (= config) usually corresponds to a subset
dataset['train'][0]  # row 0

# Target one specific file
dataset = datasets.load_dataset("allenai/c4", data_files="multilingual/c4-ja.tfrecord-00002-of-01024.json.gz")

Loading methods

split='train'

streaming=True

revision=...

trust_remote_code=True

Streaming

Differences between Dataset and IterableDataset

code:usage.py

from datasets import load_dataset

# Stream part of the dataset as an iterable
iterable_dataset = load_dataset(..., split='train', streaming=True)
for row in iterable_dataset.take(10):
    print(row)

# Alternatively, call dataset.to_iterable_dataset() on an already-downloaded dataset

Preprocessing

code:usage.py

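# Assumes tokenizer was created beforehand, e.g. with transformers.AutoTokenizer.from_pretrained(...)
def tokenization(example):
    return tokenizer(example["text"])

tokenized_dataset = dataset.map(tokenization, batched=True)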

Splitting data

dataset["train"].filter(lambda x: len(x["text"]) < 100)

dataset['train'].shard(num_shards=10, index=0)

dataset["train"].select(range(0, 1000))

It is tempting to write dataset['train'][0:10] instead, but slicing returns a dict of column lists rather than a Dataset.

code:split.py

split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

Combining datasets

code:join.py

from datasets import concatenate_datasets, interleave_datasets

# Simply concatenate
dataset = concatenate_datasets([ds1['train'], ds2])

# Sample rows from each dataset in turn
dataset = interleave_datasets(
    [ds1['train'], ds2],
    probabilities=[0.7, 0.3],  # per-row sampling probabilities summing to 1.0 (not 70% of A plus 30% of B)
    stopping_strategy="first_exhausted",  # with all_exhausted, sampling continues until every dataset is used up, so the smaller one is drawn from repeatedly
)

Process

Utilities

from transformers.pipelines.pt_utils import KeyDataset

KeyDataset(dataset, 'input') returns a Dataset that exposes only the input field of each row.

Converting a dataset into an existing ML library's dataset format

Using Datasets with TensorFlow

https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.set_format

https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.to_tf_dataset

code:dataset.py

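from transformers import DataCollatorWithPadding

# PyTorch: return the selected columns as tensors
dataset.set_format(type="torch", columns=...)

# TensorFlow: convert to a tf.data.Dataset, padding each batch with the collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["label"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True
)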

UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

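Passing a generator to pipe should be enough to avoid this warning, but it may not work properly when the generator yields chat-template inputs that contain lists?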