Some years ago, I did a project with the Dutch National Library on OCR post-correction. I wanted to investigate the potential of Deep Learning for correcting OCR errors in text. For various reasons, we never got very good results. Around the same time, two competitions on post-OCR text correction were organized at the ICDAR conference (2017 and 2019). I remained interested in the problem and started working on reproducing the competition results in my free time.
The competition divided the challenge of OCR post-correction into two tasks:
- Detection
- Correction
This post is about my first experiences with solving the detection task. The paper about the results contains only very brief descriptions of the competitors’ solutions, which makes it hard to reproduce their models. The paper states that the winner used multilingual BERT with CNN layers for recognizing tokens with OCR mistakes. For simplicity, I decided to start with a simpler model: BERT for token classification.
## The data
The competition dataset consists of (historical) newspaper data in 10 languages. Each text file contains three lines, e.g.,
```
[OCR_toInput] This is a cxample...
[OCR_aligned] This is a@ cxample...
[ GS_aligned] This is an example.@@
```
The first line contains the OCR input text. The second line contains the aligned OCR and the third line contains the aligned gold standard (GS). The @ is the alignment character and # represents tokens in the OCR that do not occur in the gold standard (noise).
Task 1 of the competition is about finding tokens with OCR mistakes. In this context, a token refers to a string between two whitespaces. The goal of this task is to predict the position and length of OCR mistakes. I created a Python library called ocrpostcorrection that contains functionality for OCR post-correction, including converting the ICDAR dataset into a Hugging Face dataset with ‘sentences’ of a certain length. This notebook contains the code used to create the dataset. I will now explain the most important steps.
First, a text is divided into aligned tokens by splitting the aligned OCR and GS on matching whitespace. The ocrpostcorrection library contains a dataclass AlignedToken, which is used to store the results:

```python
from dataclasses import dataclass

@dataclass
class AlignedToken:
    ocr: str          # String in the OCR text (excluding alignment characters)
    gs: str           # String in the gold standard (excluding alignment characters)
    ocr_aligned: str  # String in the aligned OCR text (including alignment characters)
    gs_aligned: str   # String in the aligned GS text (including alignment characters)
    start: int        # The index of the first character in the OCR text
    len_ocr: int      # The length of the OCR string
```

The tokenize_aligned function is used to divide an input text into AlignedTokens:
```python
from ocrpostcorrection.icdar_data import tokenize_aligned

tokenize_aligned('This is a@ cxample...', 'This is an example.@@')
```
```
[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4),
 AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2),
 AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1),
 AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.@@', start=10, len_ocr=10)]
```
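For intuition, splitting on matching whitespace can be sketched as follows. This is a simplified illustration, not the actual tokenize_aligned implementation; it assumes the two aligned strings have the same length (which the alignment guarantees) and it does not strip the alignment characters or compute character offsets.

```python
def split_on_matching_whitespace(ocr_aligned: str, gs_aligned: str):
    """Yield (ocr, gs) chunks, splitting where both aligned strings have a space."""
    ocr_chunk, gs_chunk = [], []
    for o, g in zip(ocr_aligned, gs_aligned):
        if o == ' ' and g == ' ':
            yield ''.join(ocr_chunk), ''.join(gs_chunk)
            ocr_chunk, gs_chunk = [], []
        else:
            ocr_chunk.append(o)
            gs_chunk.append(g)
    yield ''.join(ocr_chunk), ''.join(gs_chunk)

list(split_on_matching_whitespace('This is a@ cxample...', 'This is an example.@@'))
# [('This', 'This'), ('is', 'is'), ('a@', 'an'), ('cxample...', 'example.@@')]
```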
The OCR text of an AlignedToken may still consist of multiple tokens; this is the case when the OCR text contains one or more spaces. To make sure a token always gets the same (sub)tokenization, whether or not it was split completely in the first pass, another round of tokenization is added. Using the get_input_tokens function, every AlignedToken is split on whitespace. Each subtoken is stored in the InputToken dataclass:
```python
from dataclasses import dataclass

@dataclass
class InputToken:
    ocr: str      # OCR text
    gs: str       # GS text
    start: int    # Character offset in the original OCR text
    len_ocr: int  # Length of the OCR text
    label: int    # Class label: [0, 1, 2]
```
This dataclass also adds the class labels. There are three classes:
- 0: No OCR mistake
- 1: Start token of an OCR mistake
- 2: Inside token of an OCR mistake
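The idea behind these labels can be sketched as follows. This is a simplification of what get_input_tokens does (the real function also tracks character offsets and the GS text); it only illustrates how the classes are assigned to the whitespace-separated subtokens of an AlignedToken:

```python
def sketch_labels(ocr: str, gs: str) -> list[int]:
    """Assign a class label to every whitespace-separated subtoken (simplified sketch)."""
    subtokens = ocr.split()
    if ocr == gs:
        return [0] * len(subtokens)          # no OCR mistake
    return [1] + [2] * (len(subtokens) - 1)  # start of a mistake, then inside tokens

sketch_labels('is', 'is')              # [0]
sketch_labels('a', 'an')               # [1]
sketch_labels('Long ow.', 'Longhow.')  # [1, 2]
```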
This example code shows how an AlignedToken is divided into InputTokens:
```python
from ocrpostcorrection.icdar_data import AlignedToken, get_input_tokens

t = AlignedToken('Long ow.', 'Longhow.', 'Long ow.', 'Longhow.', 24, 8)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
```
The output produced by this code is:

```
AlignedToken(ocr='Long ow.', gs='Longhow.', ocr_aligned='Long ow.', gs_aligned='Longhow.', start=24, len_ocr=8)
InputToken(ocr='Long', gs='Longhow.', start=24, len_ocr=4, label=1)
InputToken(ocr='ow.', gs='', start=29, len_ocr=3, label=2)
```
A text can be tokenized by combining the tokenize_aligned and get_input_tokens functions. Texts are stored in another dataclass:
```python
from dataclasses import dataclass

@dataclass
class Text:
    ocr_text: str       # OCR input text
    tokens: list        # List of AlignedTokens
    input_tokens: list  # List of InputTokens
    score: float        # Normalized edit distance between OCR and GS text
```
A text file can be tokenized using the function process_text:
```python
from pathlib import Path

from ocrpostcorrection.icdar_data import process_text

in_file = Path('example.txt')
text = process_text(in_file)
```
which results in the following instance of the Text dataclass:
```
Text(ocr_text='This is a cxample...',
     tokens=[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4),
             AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2),
             AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1),
             AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.@@', start=10, len_ocr=10)],
     input_tokens=[InputToken(ocr='This', gs='This', start=0, len_ocr=4, label=0),
                   InputToken(ocr='is', gs='is', start=5, len_ocr=2, label=0),
                   InputToken(ocr='a', gs='an', start=8, len_ocr=1, label=1),
                   InputToken(ocr='cxample...', gs='example.', start=10, len_ocr=10, label=1)],
     score=0.2)
```
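The score field holds the normalized edit distance between the OCR and GS text. As a rough sketch of how such a score could be computed (the library may normalize differently), using the editdistance package:

```python
import editdistance  # pip install editdistance

def normalized_ed(ocr: str, gs: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(ocr), len(gs))
    return editdistance.eval(ocr, gs) / longest if longest else 0.0

normalized_ed('This is a cxample...', 'This is an example.')  # 0.2, as in the example above
```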
The next step is processing the entire dataset. This can be done with the generate_data function. The output of this function consists of a dictionary containing <file name>: Text pairs and a pandas DataFrame containing metadata. For each file, the metadata contains the file name, language, score (normalized edit distance), and the numbers of aligned and input tokens:
| | language | file_name | score | num_tokens | num_input_tokens |
|---|---|---|---|---|---|
| 0 | SL | SL/SL1/29.txt | 0.463415 | 7 | 7 |
| 1 | SL | SL/SL1/15.txt | 0.773294 | 155 | 246 |
| 2 | SL | SL/SL1/114.txt | 0.019256 | 268 | 272 |
The train set consists of 11662 text files. The mean number of InputTokens per text is 269.51, with a standard deviation of 200.61. The minimum number of InputTokens is 0 and the maximum is 3068. The histogram below shows the distribution of the number of InputTokens. Most texts have fewer than 250 InputTokens, and there are some very long texts.
The mean normalized edit distance between the OCR and GS texts is 0.21, with a standard deviation of 0.13. The minimum is 0.00 and the maximum is 1.00. Smaller distances are better (fewer OCR mistakes). The distribution of the normalized edit distance shows two peaks: one close to zero and one between 0.2 and 0.3. Most texts have a low edit distance, which means that most texts should be of high enough quality to learn from.
The ICDAR dataset consists of a train and test set. For validation, I split off 10% of the texts from the train set, stratified on language.
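One way to make such a split is scikit-learn's train_test_split with the stratify argument. A sketch, assuming the metadata DataFrame produced by generate_data is called md (the notebook may implement this differently):

```python
from sklearn.model_selection import train_test_split

# Split off 10% of the training texts for validation, stratified on language.
md_train, md_val = train_test_split(
    md, test_size=0.1, stratify=md['language'], random_state=42
)
```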
Because BERT has a limit on input length and the lengths of the texts vary, the texts are split up into smaller sequences. As an approximation of sentence length, for this first experiment, I chose a sequence length of 35 tokens (with an overlap of 5 tokens). The generate_sentence function returns sequences of a certain length and overlap, given the metadata DataFrame and the dictionary of Text instances.
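The windowing itself boils down to taking slices of 35 tokens that start 30 tokens apart (35 minus the overlap of 5). A minimal sketch, not the actual generate_sentence implementation:

```python
def sliding_windows(tokens, size=35, overlap=5):
    """Yield (start_token_id, window) pairs of at most `size` tokens."""
    step = size - overlap
    for start in range(0, max(len(tokens), 1), step):
        yield start, tokens[start:start + size]
```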
The sequences are returned as a pandas DataFrame, which can be converted to a Hugging Face Dataset using the Dataset.from_pandas() method. The first two ‘sentences’ in the train set look like:
```python
{
    'key': 'FR/FR1/499.txt',
    'start_token_id': 0,
    'score': 0.0464135021,
    'tokens': ['Johannes,', 'Dei', 'gratia,', 'Francorum', 'rex.', 'Notum', 'facimus', 'universis,', 'tam', 'presentibus', 'quam', 'futuris,', 'nobis,', 'ex', 'parte', 'Petri', 'juvenis', 'sentiferi', 'qui', 'bene', 'et', 'fideliter', 'in', 'guerris', 'nostris', 'nobis', 'servivit', 'expositum', 'fuisse,', 'qod', 'cum', 'ipse,', 'tam', 'nomine', 'suo'],
    'tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    'language': 'FR'
},
{
    'key': 'FR/FR1/499.txt',
    'start_token_id': 30,
    'score': 0.0204918033,
    'tokens': ['cum', 'ipse,', 'tam', 'nomine', 'suo', 'quam', 'ut', 'tutor', 'et', 'ha', 'bens', 'gubernacionem', 'seu', 'ballum', 'fratrum', 'et', 'sororum', 'suorum', 'in', 'minori', 'etate', 'constitutorum,', 'possessionem', 'aliquorum', 'bonorum', 'mobi', 'lium', 'et', 'inmobilium', 'apprehenderit,', 'quorum', 'possessionem', 'Thomas', 'juvenis', 'pater'],
    'tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    'language': 'FR'
}
```
Each sample specifies key, start_token_id, score, tokens, tags, and language. The key links the sample to the text file the sequence was taken from. start_token_id is used to merge the sequences, so we get predictions for all tokens in the text; this way, performance can be calculated for complete texts instead of sequences. score (normalized edit distance) is used for selecting high-quality data. For the first experiment, sequences with a normalized edit distance > 0.3 were removed from the train and validation sets (but not from the test set!). tokens and tags contain the data that is used to train the classifier. language was not used for the first experiment.
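Putting the preprocessing together, building the Hugging Face dataset and dropping the low-quality training and validation sequences could look roughly like this. The DataFrame and split names are assumptions; the notebook may organize this differently:

```python
from datasets import Dataset, DatasetDict

# train_df, val_df and test_df are assumed to hold the sequences for each split.
# Sequences with a normalized edit distance > 0.3 are dropped from train and
# validation; the test set is left untouched.
train_df = train_df[train_df['score'] <= 0.3].reset_index(drop=True)
val_df = val_df[val_df['score'] <= 0.3].reset_index(drop=True)

icdar_dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'val': Dataset.from_pandas(val_df),
    'test': Dataset.from_pandas(test_df),
})
```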
## The model
The code for training the model can be found in this notebook. After loading the dataset, there is one more detail that needs to be taken care of. BERT uses subword tokenization, while the dataset contains labels for complete words. Also, BERT tokenizers add the special tokens [CLS] and [SEP]. This means that after BERT tokenization, the input labels don't match the tokens anymore, e.g.,
Input sequence (and labels):
```python
{
    'tokens': ['This', 'is', 'a', 'cxample...'],
    'tags': [0, 0, 1, 1]
}
```
Because the ICDAR dataset is multilingual, I selected bert-base-multilingual-cased as the base model. Tokenized with the bert-base-multilingual-cased tokenizer, the sequence becomes:
```
['[CLS]', 'This', 'is', 'a', 'c', '##xa', '##mp', '##le', '.', '.', '.', '[SEP]']
```
To be able to train the model, the labels have to be realigned. The Hugging Face task guide on token classification contains an example tokenize_and_align_labels function for doing so. A slightly adapted version was added to the ocrpostcorrection package. This function is a partial, which allows the tokenizer to be passed in separately. This makes it more convenient to apply the function to a dataset using Dataset.map, because there is no need to wrap it in a lambda. To use the function, do:
```python
from ocrpostcorrection.token_classification import tokenize_and_align_labels

tokenized_icdar = icdar_dataset.map(tokenize_and_align_labels(tokenizer), batched=True)
```
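For reference, the core of the realignment is the same as in the Hugging Face guide: special tokens and all but the first subtoken of a word get the label -100, so they are ignored by the loss. A sketch of that inner logic (the version in ocrpostcorrection may differ in details):

```python
def align_labels_with_tokens(word_ids, labels):
    """Map word-level labels onto subtokens; -100 marks positions to ignore."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:                  # [CLS], [SEP], padding
            aligned.append(-100)
        elif word_id != previous_word_id:    # first subtoken of a word
            aligned.append(labels[word_id])
        else:                                # remaining subtokens of the same word
            aligned.append(-100)
        previous_word_id = word_id
    return aligned

align_labels_with_tokens([None, 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, None], [0, 0, 1, 1])
# [-100, 0, 0, 1, 1, -100, -100, -100, -100, -100, -100, -100]
```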
After preparing the dataset and instantiating a data collator, model, and trainer, training can start. For this experiment, the model was trained on Google Colab, using the following training arguments:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='choose/your/own/output/directory',
    evaluation_strategy='epoch',
    num_train_epochs=3,
    load_best_model_at_end=True,
    save_strategy='epoch',
    per_device_train_batch_size=16,
)
```
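The remaining pieces follow the standard Hugging Face token-classification setup. A sketch of what instantiating the data collator, model, and trainer could look like, assuming the dataset splits are named train and val (the actual notebook may differ, for instance in how label names are configured):

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer)

checkpoint = 'bert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_icdar['train'],
    eval_dataset=tokenized_icdar['val'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```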
The best model came from epoch 2. For this model, training and validation loss were 0.254 and 0.291, respectively.
## Results
The code that was used to determine the performance of the model can be found in this notebook. Performance is calculated using the competition evaluation script. This script expects input in the form:
```
{
    "<language>/<set>/<number>.txt":
    {
        "0:1": {},
        "4:2": {},
        ...
    }
    ...
}
```
The first number in the keys for a text represents the start index of the OCR mistake. The second number is the number of (input) tokens that are incorrect. The evaluation script calculates precision, recall and F-measure on the token level.
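For the running example ‘This is a cxample...’, with mistakes starting at character offsets 8 (‘a’) and 10 (‘cxample...’), each spanning a single input token, the entry would look roughly like this (the file name is made up):

```
{
    "EN/EN1/example.txt": {
        "8:1": {},
        "10:1": {}
    }
}
```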
It takes quite a few steps to transform the (sub)token-level predictions that the model outputs into the format accepted by the evaluation script. First, predictions for subtokens are merged into predictions for InputTokens. An InputToken is considered an OCR mistake if at least one of its subtokens is predicted to be an OCR mistake. Next, sequences of InputToken-level predictions are merged into predictions for an entire text. If the predictions for overlapping InputTokens differ, the token is considered an OCR mistake. Finally, the predictions for individual tokens are translated to character offset:number of tokens pairs. The predictions2icdar_output function is available for this conversion process. It takes as input the tokenized test set, the predicted labels, the tokenizer, and a dictionary with <file name>: Text pairs, and returns the expected ICDAR output format:
```python
from ocrpostcorrection.utils import predictions2icdar_output, predictions_to_labels

output = predictions2icdar_output(
    tokenized_icdar['test'],
    predictions_to_labels(predictions),
    tokenizer,
    data_test,
)
```
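For intuition, the first merging step inside this conversion, from subtoken predictions back to one label per InputToken, can be sketched as follows (simplified; the actual implementation in predictions2icdar_output differs):

```python
def merge_subtoken_predictions(word_ids, subtoken_labels):
    """A token counts as an OCR mistake if any of its subtokens is predicted as one."""
    token_labels = {}
    for word_id, label in zip(word_ids, subtoken_labels):
        if word_id is None:  # skip special tokens
            continue
        token_labels[word_id] = max(token_labels.get(word_id, 0), label)
    return [token_labels[i] for i in sorted(token_labels)]

merge_subtoken_predictions([None, 0, 1, 2, 3, 3, 3, None], [0, 0, 0, 1, 0, 1, 2, 0])
# [0, 0, 1, 2]
```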
When saved to a JSON file, the output dictionary can be used to calculate performance using the runEvaluation function. The runEvaluation code was taken from the original evalTool_ICDAR2017.py (CC0 License) via Kotwic4/ocr-correction. In addition to the JSON file, the function requires the (path to the) test set as input. The function creates a CSV file containing precision, recall, and F-measure for all texts in the test set. The following table contains the mean results grouped by language.
| Language | Precision | Recall | F-measure | F-measure CCC (2019 competition winner) |
|---|---|---|---|---|
| BG | 0.88 | 0.67 | 0.74 | 0.77 |
| CZ | 0.81 | 0.55 | 0.64 | 0.70 |
| DE | 0.98 | 0.89 | 0.93 | 0.95 |
| EN | 0.85 | 0.54 | 0.62 | 0.67 |
| ES | 0.91 | 0.46 | 0.59 | 0.69 |
| FI | 0.89 | 0.77 | 0.82 | 0.84 |
| FR | 0.81 | 0.49 | 0.59 | 0.67 |
| NL | 0.87 | 0.60 | 0.66 | 0.71 |
| PL | 0.89 | 0.70 | 0.77 | 0.82 |
| SL | 0.80 | 0.58 | 0.64 | 0.69 |
The last column of the table reports the mean F-measure for CCC, the 2019 competition winner. CCC outperforms the new model on every language, although for some languages the difference is quite small. However, for a first attempt, I think the results are not bad at all!