ICDAR Data

Functionality for working with the ICDAR dataset

The ICDAR 2019 Competition on Post-OCR Text Correction dataset (Zenodo record) contains text files in the following format:

[OCR_toInput] This is a cxample...
[OCR_aligned] This is a@ cxample...
[ GS_aligned] This is an example.@@
01234567890123

The first line contains the OCR input text. The second line contains the aligned OCR and the third line contains the aligned gold standard. @ is the alignment character and # represents characters in the OCR that do not occur in the gold standard.

For working with this data, the first 14 characters of each line (the label; the position ruler above marks these 14 positions) have to be removed. We also remove leading and trailing whitespace.


remove_label_and_nl

 remove_label_and_nl (line:str)

Remove the label (the first 14 characters) and the newline from a line
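
The label [OCR_toInput] plus the trailing space spans exactly 14 characters, so a hypothetical call looks like this:

remove_label_and_nl("[OCR_toInput] This is a cxample...\n")
'This is a cxample...'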

Tokenization

Task 1 of the competition is about finding tokens with OCR mistakes. In this context, a token is a string between two whitespaces.
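
In other words, the aligned strings are simply split on spaces; for the running example:

"This is a@ cxample...".split(" ")
['This', 'is', 'a@', 'cxample...']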


AlignedToken

 AlignedToken (ocr:str, gs:str, ocr_aligned:str, gs_aligned:str,
               start:int, len_ocr:int)

Dataclass for storing aligned tokens
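A minimal sketch of such a dataclass, reconstructed from the signature and the examples below (the field comments are interpretations, not the library's own documentation):

from dataclasses import dataclass

@dataclass
class AlignedToken:
    ocr: str          # OCR text of the token, without alignment characters
    gs: str           # gold standard text, without alignment characters
    ocr_aligned: str  # aligned OCR text (may contain '@')
    gs_aligned: str   # aligned gold standard text (may contain '@' or '#')
    start: int        # start index of the token in the OCR input text
    len_ocr: int      # length of the OCR text of the token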


tokenize_aligned

 tokenize_aligned (ocr_aligned:str, gs_aligned:str)

Get a list of AlignedTokens from the aligned OCR and GS strings

tokenize_aligned("This is a@ cxample...", "This is an example.##")
[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4),
 AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2),
 AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1),
 AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.##', start=10, len_ocr=10)]

The OCR text of an AlignedToken may itself still consist of multiple tokens; this is the case when the OCR text contains one or more spaces. To make sure a token is (sub)tokenized in the same way, regardless of whether it was already completely tokenized, a second round of tokenization is applied.


InputToken

 InputToken (ocr:str, gs:str, start:int, len_ocr:int, label:int)

Dataclass for the tokenization within AlignedTokens


leading_whitespace_offset

 leading_whitespace_offset (string:str)

Return the leading whitespace offset for an aligned OCR string

If an aligned OCR string contains leading whitespace, the offset needs to be added to the start index of the respective input tokens.

Args:
    string (str): aligned OCR string to determine the leading whitespace offset for

Returns:
    int: leading whitespace offset for input tokens
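
A minimal sketch, under the assumption that the offset is simply the number of leading whitespace characters:

def leading_whitespace_offset(string: str) -> int:
    # Assumption: the offset is the number of leading whitespace
    # characters in the aligned OCR string.
    return len(string) - len(string.lstrip())

leading_whitespace_offset("  a@ cxample...")
2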


get_input_tokens

 get_input_tokens (aligned_token:__main__.AlignedToken)

Tokenize an AlignedToken into subtokens and assign task 1 labels

As the examples below show, label 0 marks a token without OCR mistakes, label 1 the (first) subtoken of a token with an OCR mistake, and label 2 a subsequent subtoken of the same mistake.

t = AlignedToken("Major", "Major", "Major", "Major", 19, 5)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='Major', gs='Major', ocr_aligned='Major', gs_aligned='Major', start=19, len_ocr=5)
InputToken(ocr='Major', gs='Major', start=19, len_ocr=5, label=0)
t = AlignedToken("INEVR", "I NEVER", "I@NEV@R", "I NEVER", 0, 5)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='INEVR', gs='I NEVER', ocr_aligned='I@NEV@R', gs_aligned='I NEVER', start=0, len_ocr=5)
InputToken(ocr='INEVR', gs='I NEVER', start=0, len_ocr=5, label=1)
t = AlignedToken("Long ow.", "Longhow.", "Long ow.", "Longhow.", 24, 8)
print(t)

for inp_tok in get_input_tokens(t):
    print(inp_tok)
AlignedToken(ocr='Long ow.', gs='Longhow.', ocr_aligned='Long ow.', gs_aligned='Longhow.', start=24, len_ocr=8)
InputToken(ocr='Long', gs='Longhow.', start=24, len_ocr=4, label=1)
InputToken(ocr='ow.', gs='', start=29, len_ocr=3, label=2)

Process a text file

Next, we need functions for processing a text in the ICDAR data format.


Text

 Text (ocr_text:str, tokens:list, input_tokens:list, score:float)

Dataclass for storing a text in the ICDAR data format


clean

 clean (string:str)

Remove alignment characters from a text
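
For example, assuming both the '@' alignment character and the '#' placeholder count as alignment characters:

clean("This is a@ cxample..."), clean("example.##")
('This is a cxample...', 'example.')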


normalized_ed

 normalized_ed (ed:int, ocr:str, gs:str)

Return the normalized edit distance
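
The exact normalization is not spelled out here; dividing by the length of the longer string reproduces the score of 0.2 in the process_text example below:

# Assumption: normalized_ed(ed, ocr, gs) == ed / max(len(ocr), len(gs))
ocr = "This is a cxample..."  # 20 characters
gs = "This is an example."   # 19 characters
ed = 4  # insert 'n', substitute 'c' -> 'e', delete two of the three dots
ed / max(len(ocr), len(gs))
0.2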


process_text

 process_text (in_file:pathlib.Path)

Process a text from the ICDAR dataset

Extract AlignedTokens and InputTokens, and calculate the normalized edit distance.

Processing the example text:

import os
from pathlib import Path

in_file = Path(os.getcwd()) / "data" / "example.txt"
text = process_text(in_file)
text
Text(ocr_text='This is a cxample...', tokens=[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4), AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2), AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1), AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.@@', start=10, len_ocr=10)], input_tokens=[InputToken(ocr='This', gs='This', start=0, len_ocr=4, label=0), InputToken(ocr='is', gs='is', start=5, len_ocr=2, label=0), InputToken(ocr='a', gs='an', start=8, len_ocr=1, label=1), InputToken(ocr='cxample...', gs='example.', start=10, len_ocr=10, label=1)], score=0.2)

Process the entire dataset

File structure of the ICDAR dataset

.
├── <data_dir>
│   ├── <language>
│   │   ├── <language (set)>1
│   │   ...
│   │   └── <language (set)>n
│   ...
...

generate_data

 generate_data (in_dir:pathlib.Path)

Process all texts in the dataset and return a dataframe with metadata
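
A hypothetical call (the directory name is an assumption, and so is the unpacking into the processed texts and the metadata dataframe, which matches what generate_sentences consumes below):

# Hypothetical usage; return shape is an assumption.
data, md = generate_data(Path("data") / "ICDAR2019")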


get_intermediate_data

 get_intermediate_data (zip_file:str)

Get the data and metadata for the train and test sets from the zip file.


extract_icdar_data

 extract_icdar_data (out_dir:str, zip_file:str)

Extract the ICDAR dataset from the zip file into the output directory

Generate input ‘sentences’

The following functions can be used to generate sequences of a certain length with possible overlap.


window

 window (iterable, size=2)

Given an iterable, return all subsequences of a certain size
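
A quick illustration, assuming the function yields tuples, as in the standard itertools sliding-window recipe:

list(window([1, 2, 3, 4, 5], size=3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]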


generate_sentences

 generate_sentences (df:pandas.core.frame.DataFrame,
                     data:Dict[str,__main__.Text], size:int=15,
                     step:int=10)

Generate sequences of a certain length, with possible overlap between consecutive sequences


process_input_ocr

 process_input_ocr (text:str)

Generate Text object for OCR input text (without aligned gold standard)

import pandas as pd

text = process_input_ocr("This is a cxample...")
data = {"ocr_input": text}
md = pd.DataFrame(
    {
        "language": ["?"],
        "file_name": ["ocr_input"],
        "score": [text.score],
        "num_tokens": [len(text.tokens)],
        "num_input_tokens": [len(text.input_tokens)],
    }
)

df = generate_sentences(md, data, size=2, step=2)

assert 2 == df.shape[0]
assert [0, 2] == list(df.start_token_id)
df
         key  start_token_id  score           tokens    tags language
0  ocr_input               0    0.0       [This, is]  [0, 0]        ?
1  ocr_input               2    0.0  [a, cxample...]  [0, 0]        ?