In my previous blog post, I showed how I created a Hugging Face dataset for detecting OCR mistakes. One thing that annoyed me about this dataset is that it didn’t contain the names of the token labels. I searched for a solution and tried different things, but couldn’t figure out how to do it. Then, when I had some time and was browsing the Hugging Face datasets documentation, I found the methods cast() and cast_column(), which allow you to update the dataset features and properly set the class labels. Here is how to do it.
First, load the dataset without the class labels:
from datasets import load_from_disk
dataset = load_from_disk('data/dataset')
A sample from this dataset has the following features:
dataset['train'][0]
{
'key': 'FR/FR1/499.txt',
'start_token_id': 0,
'score': 0.0464135021,
'tokens': ['Johannes,', 'Dei', 'gratia,', 'Francorum', 'rex.', 'Notum', 'facimus', 'universis,', 'tam', 'presentibus', 'quam', 'futuris,', 'nobis,', 'ex', 'parte', 'Petri', 'juvenis', 'sentiferi', 'qui', 'bene', 'et', 'fideliter', 'in', 'guerris', 'nostris', 'nobis', 'servivit', 'expositum', 'fuisse,', 'qod', 'cum', 'ipse,', 'tam', 'nomine', 'suo'],
'tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
'language': 'FR'
}
When looking at the features of the dataset, we see that the tags column is of type Sequence of Value, and not Sequence of ClassLabel.
dataset['train'].features
{'key': Value(dtype='string', id=None),
'language': Value(dtype='string', id=None),
'score': Value(dtype='float64', id=None),
'start_token_id': Value(dtype='int64', id=None),
'tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
The next step is to call the cast_column method with the correct properties:
from datasets import Sequence, ClassLabel
dataset = dataset.cast_column('tags', Sequence(feature=ClassLabel(num_classes=3, names=['O', 'OCR-Mistake-B', 'OCR-Mistake-I']), length=-1))
Loading cached processed dataset at data/dataset/train/cache-7695d0b08b5f7b4d.arrow
Loading cached processed dataset at data/dataset/val/cache-b0a1c2c8a428d020.arrow
Loading cached processed dataset at data/dataset/test/cache-9e879e4bbea50e50.arrow
After this update, the label names and the mapping from label name to integer id are stored in the dataset:
dataset['train'].features["tags"].feature.names
['O', 'OCR-Mistake-B', 'OCR-Mistake-I']
dataset['train'].features["tags"].feature._str2int
{'O': 0, 'OCR-Mistake-B': 1, 'OCR-Mistake-I': 2}