Storing custom token classification labels in a Hugging Face dataset

tips and tricks
Published

December 18, 2022

In my previous blog post, I showed how I created a Hugging Face dataset for detecting OCR mistakes. One thing that annoyed me about this dataset is that it didn’t contain the names of the token labels. I searched for a solution and tried different things, but couldn’t figure out how to do it. Then, when I had some time and was browsing the Hugging Face dataset documentation, I found the methods cast() and cast_column(), which allow you to update the dataset features and properly set the class labels. Here is how to do it.

Twelve species of fish, Carl Cristiaan Fuchs (1802 - 1855)

First, load the dataset without the class labels:

from datasets import load_from_disk

dataset = load_from_disk('data/dataset')

A sample from this dataset has the following features:

dataset['train'][0]
{
    'key': 'FR/FR1/499.txt',
    'start_token_id': 0,
    'score': 0.0464135021,
    'tokens': ['Johannes,', 'Dei', 'gratia,', 'Francorum', 'rex.', 'Notum', 'facimus', 'universis,', 'tam', 'presentibus', 'quam', 'futuris,', 'nobis,', 'ex', 'parte', 'Petri', 'juvenis', 'sentiferi', 'qui', 'bene', 'et', 'fideliter', 'in', 'guerris', 'nostris', 'nobis', 'servivit', 'expositum', 'fuisse,', 'qod', 'cum', 'ipse,', 'tam', 'nomine', 'suo'],
    'tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    'language': 'FR'
}

When looking at the features of the dataset, we see that the tags column is a Sequence of Value features (plain integers) and not a Sequence of ClassLabel features.

dataset['train'].features
{'key': Value(dtype='string', id=None),
 'language': Value(dtype='string', id=None),
 'score': Value(dtype='float64', id=None),
 'start_token_id': Value(dtype='int64', id=None),
 'tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

The next step is to call the cast_column method with the correct properties:

from datasets import Sequence, ClassLabel

dataset = dataset.cast_column('tags', Sequence(feature=ClassLabel(num_classes=3, names=['O', 'OCR-Mistake-B', 'OCR-Mistake-I']), length=-1))
Loading cached processed dataset at data/dataset/train/cache-7695d0b08b5f7b4d.arrow
Loading cached processed dataset at data/dataset/val/cache-b0a1c2c8a428d020.arrow
Loading cached processed dataset at data/dataset/test/cache-9e879e4bbea50e50.arrow

After this update, the label names and the mapping from label name to integer id are stored in the dataset:

dataset['train'].features["tags"].feature.names
['O', 'OCR-Mistake-B', 'OCR-Mistake-I']
dataset['train'].features["tags"].feature._str2int
{'O': 0, 'OCR-Mistake-B': 1, 'OCR-Mistake-I': 2}