Functionality for working with the ICDAR dataset

The ICDAR 2019 Competition on Post-OCR Text Correction dataset (zenodo record) contains text files in the following format:

[OCR_toInput] This is a cxample...
[OCR_aligned] This is a@ cxample...
[ GS_aligned] This is an example.@@

The first line contains the OCR input text, the second line contains the aligned OCR text, and the third line contains the aligned gold standard (GS). @ is the alignment character and # represents characters in the OCR that do not occur in the gold standard.

To work with this data, the first 14 characters of each line (the label and the space that follows it) have to be removed. Leading and trailing whitespace is removed as well.


 remove_label_and_nl (line:str)
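
A minimal sketch of this step, based only on the description above: drop the 14-character label prefix and strip the surrounding whitespace. The name `strip_label` and the constant `LABEL_WIDTH` are hypothetical and serve as illustration; the actual `remove_label_and_nl` may differ in detail.

```python
LABEL_WIDTH = 14  # length of "[OCR_toInput] ", "[OCR_aligned] " and "[ GS_aligned] "


def strip_label(line: str) -> str:
    """Drop the label prefix, then strip leading and trailing whitespace."""
    return line[LABEL_WIDTH:].strip()


print(strip_label("[OCR_toInput] This is a cxample...\n"))
```

All three labels have the same width, so a fixed slice is enough and no pattern matching is needed.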


Task 1 of the competition is about finding tokens with OCR mistakes. In this context, a token is a string delimited by whitespace.


 AlignedToken (ocr:str, gs:str, ocr_aligned:str, gs_aligned:str,
               start:int, len_ocr:int)

Dataclass for storing aligned tokens


 tokenize_aligned (ocr_aligned:str, gs_aligned:str)

Get a list of AlignedTokens from the aligned OCR and GS strings

tokenize_aligned("This is a@ cxample...", "This is an example.##")
[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4),
 AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2),
 AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1),
 AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.##', start=10, len_ocr=10)]
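
One way to read the example above: the aligned strings are split only at positions where both the OCR and the GS string contain a space, and `start` counts characters in the OCR text with alignment characters removed. The sketch below follows that reading. The names `split_aligned` and `strip_alignment` are hypothetical, plain tuples stand in for `AlignedToken`, and it is an assumption that both `@` and `#` are stripped; the actual `tokenize_aligned` may handle more edge cases.

```python
from typing import List, Tuple


def strip_alignment(s: str) -> str:
    """Remove the alignment characters (assumed here to be '@' and '#')."""
    return s.replace("@", "").replace("#", "")


def split_aligned(ocr_aligned: str, gs_aligned: str) -> List[Tuple[str, str, int]]:
    """Return (ocr, gs, start) tuples, splitting where both strings have a space."""
    assert len(ocr_aligned) == len(gs_aligned), "aligned strings must be equally long"
    pairs = []
    begin = 0  # start of the current token in the aligned strings
    start = 0  # start of the current token in the cleaned OCR text
    n = len(ocr_aligned)
    for i in range(n + 1):
        if i == n or (ocr_aligned[i] == " " and gs_aligned[i] == " "):
            ocr = strip_alignment(ocr_aligned[begin:i])
            gs = strip_alignment(gs_aligned[begin:i])
            if i > begin:  # skip empty spans caused by consecutive shared spaces
                pairs.append((ocr, gs, start))
            start += len(ocr) + 1  # +1 for the space that separates tokens
            begin = i + 1
    return pairs


for pair in split_aligned("This is a@ cxample...", "This is an example.##"):
    print(pair)
```

Because a split requires a space in both strings, an OCR string such as "I@NEV@R" aligned with "I NEVER" stays a single token.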

The OCR text of an AlignedToken may still consist of multiple tokens. This is the case when the OCR text contains one or more spaces. To make sure the OCR text is always (sub)tokenized in the same way, regardless of whether the first round split it completely, a second round of tokenization is applied to each AlignedToken.


 InputToken (ocr:str, gs:str, start:int, len_ocr:int, label:int)

Dataclass for the tokenization within AlignedTokens


 leading_whitespace_offset (string:str)

Return the leading whitespace offset for an aligned ocr string

If an aligned ocr string contains leading whitespace, the offset needs to be added to the start index of the respective input tokens.

Args: string (str): aligned ocr string to determine the leading whitespace offset for

Returns: int: leading whitespace offset for input tokens
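
The offset described above can be sketched as the number of whitespace characters before the first non-whitespace character. The function name `count_leading_whitespace` is hypothetical, and this is an illustration rather than the actual implementation of `leading_whitespace_offset`.

```python
def count_leading_whitespace(string: str) -> int:
    """Number of whitespace characters at the start of the string."""
    return len(string) - len(string.lstrip())


print(count_leading_whitespace("  ow."))
```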


 get_input_tokens (aligned_token:__main__.AlignedToken)

Tokenize an AlignedToken into subtokens and assign task 1 labels

t = AlignedToken("Major", "Major", "Major", "Major", 19, 5)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='Major', gs='Major', ocr_aligned='Major', gs_aligned='Major', start=19, len_ocr=5)
InputToken(ocr='Major', gs='Major', start=19, len_ocr=5, label=0)

t = AlignedToken("INEVR", "I NEVER", "I@NEV@R", "I NEVER", 0, 5)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='INEVR', gs='I NEVER', ocr_aligned='I@NEV@R', gs_aligned='I NEVER', start=0, len_ocr=5)
InputToken(ocr='INEVR', gs='I NEVER', start=0, len_ocr=5, label=1)

t = AlignedToken("Long ow.", "Longhow.", "Long ow.", "Longhow.", 24, 8)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='Long ow.', gs='Longhow.', ocr_aligned='Long ow.', gs_aligned='Longhow.', start=24, len_ocr=8)
InputToken(ocr='Long', gs='Longhow.', start=24, len_ocr=4, label=1)
InputToken(ocr='ow.', gs='', start=29, len_ocr=3, label=2)
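
The three examples suggest a labelling scheme that can be sketched as follows. It is an assumption inferred from the output above, not a documented rule: label 0 when OCR and GS are identical, label 1 for the first subtoken of an erroneous token, and label 2 for each further subtoken, which also gets an empty GS string. The name `label_subtokens` is hypothetical and plain tuples stand in for `InputToken`.

```python
import re
from typing import List, Tuple


def label_subtokens(ocr: str, gs: str, start: int) -> List[Tuple[str, str, int, int, int]]:
    """Return (ocr, gs, start, len_ocr, label) tuples for each whitespace-separated part."""
    is_error = ocr != gs
    result = []
    for i, match in enumerate(re.finditer(r"\S+", ocr)):
        if not is_error:
            label = 0  # OCR matches the gold standard
        elif i == 0:
            label = 1  # first subtoken of an OCR mistake
        else:
            label = 2  # continuation of the same OCR mistake
        part = match.group()
        # match.start() already accounts for leading whitespace in the OCR text
        result.append((part, gs if i == 0 else "", start + match.start(), len(part), label))
    return result


for row in label_subtokens("Long ow.", "Longhow.", 24):
    print(row)
```

Using `re.finditer` instead of `str.split` keeps the character position of every subtoken, so `start` stays correct even when subtokens are separated by more than one space.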

Process a text file

Next, we need functions for processing a text in the ICDAR data format.


 Text (ocr_text:str, tokens:list, input_tokens:list, score:float)

Dataclass for storing a text in the ICDAR data format


 clean (string:str)

Remove alignment characters from a text


 normalized_ed (ed:int, ocr:str, gs:str)

Return the normalized edit distance
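
As an illustration of the idea, the sketch below computes a Levenshtein edit distance in pure Python and divides it by the length of the longer string. The normalization by the longer string is an assumption (it is consistent with the score of 0.2 for the example text in this document), and the names `levenshtein` and `normalize_distance` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize_distance(ed: int, ocr: str, gs: str) -> float:
    """Divide the edit distance by the length of the longer string (assumed)."""
    longest = max(len(ocr), len(gs))
    return ed / longest if longest else 0.0


ocr, gs = "This is a cxample...", "This is an example."
print(normalize_distance(levenshtein(ocr, gs), ocr, gs))
```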


 process_text (in_file:pathlib.Path)

Process a text from the ICDAR dataset

Extract AlignedTokens and InputTokens, and calculate the normalized edit distance.

Processing the example text:

in_file = Path(os.getcwd()) / "data" / "example.txt"
text = process_text(in_file)
Text(ocr_text='This is a cxample...', tokens=[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4), AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2), AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1), AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.@@', start=10, len_ocr=10)], input_tokens=[InputToken(ocr='This', gs='This', start=0, len_ocr=4, label=0), InputToken(ocr='is', gs='is', start=5, len_ocr=2, label=0), InputToken(ocr='a', gs='an', start=8, len_ocr=1, label=1), InputToken(ocr='cxample...', gs='example.', start=10, len_ocr=10, label=1)], score=0.2)

Process the entire dataset

File structure of the ICDAR dataset

├── <data_dir>
│   ├── <language>
│   │   ├── <language (set)>1
│   │   ...
│   │   └── <language (set)>n
│   ...


 generate_data (in_dir:pathlib.Path)

Process all texts in the dataset and return a dataframe with metadata
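
A rough sketch of how the directory layout above could be walked to collect per-file metadata. It assumes the texts are `.txt` files located two levels below the data directory, and it returns plain dictionaries instead of a dataframe. The name `collect_metadata` and the keys `subset` and `file_name` are hypothetical; the actual `generate_data` also processes each text.

```python
from pathlib import Path
from typing import Dict, List


def collect_metadata(in_dir: Path) -> List[Dict[str, str]]:
    """List all text files with the language and subset taken from their path."""
    rows = []
    for path in sorted(in_dir.rglob("*.txt")):
        rel = path.relative_to(in_dir)
        if len(rel.parts) < 3:
            continue  # not in <language>/<language (set)>/, skip
        rows.append(
            {
                "language": rel.parts[0],
                "subset": rel.parts[1],
                "file_name": rel.as_posix(),
            }
        )
    return rows
```

A list of dictionaries like this can be passed directly to `pd.DataFrame` if a dataframe is needed.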


 get_intermediate_data (zip_file:str)

Get the data and metadata for the train and test sets from the zip file.


 extract_icdar_data (out_dir:str, zip_file:str)

Generate input ‘sentences’

The following functions can be used to generate sequences of a certain length with possible overlap.


 window (iterable, size=2)

Given an iterable, return all subsequences of a certain size
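
A sliding window of this kind can be sketched with a bounded `collections.deque`. The name `sliding_window` is hypothetical and the sketch yields tuples; the actual `window` function may return a different container type.

```python
from collections import deque
from itertools import islice


def sliding_window(iterable, size=2):
    """Yield every run of `size` consecutive items as a tuple."""
    it = iter(iterable)
    win = deque(islice(it, size), maxlen=size)
    if len(win) == size:
        yield tuple(win)
    for item in it:
        win.append(item)  # the oldest item drops out automatically
        yield tuple(win)


print(list(sliding_window(["This", "is", "a", "cxample..."], size=2)))
```

If the iterable has fewer than `size` items, nothing is yielded.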


 generate_sentences (df:pandas.core.frame.DataFrame,
                     data:Dict[str,__main__.Text], size:int=15,

Generate sequences of a certain length and possible overlap
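
The interaction between sequence length and overlap can be sketched on a plain list of tokens. Here `size` is the sequence length and `step` is the distance between the starts of consecutive sequences, so consecutive sequences overlap when `step` is smaller than `size`. The name `chunk_tokens` is hypothetical, and returning a short text as a single sequence is an assumption; the actual `generate_sentences` works on the metadata dataframe and `Text` objects.

```python
from typing import List, Sequence, Tuple


def chunk_tokens(tokens: Sequence[str], size: int, step: int) -> List[Tuple[int, List[str]]]:
    """Return (start_token_id, tokens) pairs for windows of `size` taken every `step` tokens."""
    if len(tokens) <= size:
        # Assumption: texts shorter than `size` become a single sequence
        return [(0, list(tokens))]
    return [
        (start, list(tokens[start:start + size]))
        for start in range(0, len(tokens) - size + 1, step)
    ]


for start, chunk in chunk_tokens(["This", "is", "a", "cxample..."], size=2, step=2):
    print(start, chunk)
```

In this sketch, tokens at the end of a text are dropped when they do not fill a complete window at a `step` boundary.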


 process_input_ocr (text:str)

Generate Text object for OCR input text (without aligned gold standard)

text = process_input_ocr("This is a cxample...")
data = {"ocr_input": text}
md = pd.DataFrame(
    {
        "language": ["?"],
        "file_name": ["ocr_input"],
        "score": [text.score],
        "num_tokens": [len(text.tokens)],
        "num_input_tokens": [len(text.input_tokens)],
    }
)

df = generate_sentences(md, data, size=2, step=2)

assert 2 == df.shape[0]
assert [0, 2] == list(df.start_token_id)
         key  start_token_id  score           tokens    tags language
0  ocr_input               0    0.0       [This, is]  [0, 0]       oc
1  ocr_input               2    0.0  [a, cxample...]  [0, 0]       oc