The ICDAR 2019 Competition on Post-OCR Text Correction dataset (Zenodo record) contains text files in the following format:
[OCR_toInput] This is a cxample...
[OCR_aligned] This is a@ cxample...
[ GS_aligned] This is an example.@@
01234567890123
The first line contains the OCR input text, the second line contains the aligned OCR, and the third line contains the aligned gold standard. @ is the alignment character and # represents characters in the OCR that do not occur in the gold standard.
To work with this data, the first 14 characters (the label and the space following it) have to be removed. We also remove leading and trailing whitespace.
remove_label_and_nl
remove_label_and_nl (line:str)
Remove the label (the first 14 characters) and the trailing newline from a line
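A minimal sketch of such a helper, assuming the 14-character label prefix and the whitespace handling described above; the actual implementation may differ:

def remove_label_and_nl(line: str) -> str:
    # Sketch: drop the 14-character label prefix (e.g. "[OCR_aligned] ") and
    # strip leading/trailing whitespace, including the trailing newline.
    return line[14:].strip()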
Tokenization
Task 1 of the competition is about finding tokens with OCR mistakes. In this context, a token refers to a string between two whitespace characters.
AlignedToken
AlignedToken (ocr:str, gs:str, ocr_aligned:str, gs_aligned:str,
start:int, len_ocr:int)
Dataclass for storing aligned tokens
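Based on the fields listed in the signature, the dataclass can be pictured roughly as follows; the field comments are interpretations, not taken from the source:

from dataclasses import dataclass

@dataclass
class AlignedToken:
    ocr: str          # OCR token without alignment characters
    gs: str           # gold standard token without alignment characters
    ocr_aligned: str  # aligned OCR form (may contain '@')
    gs_aligned: str   # aligned gold standard form (may contain '@' or '#')
    start: int        # start index of the token in the unaligned OCR input
    len_ocr: int      # length of the OCR token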
tokenize_aligned
tokenize_aligned (ocr_aligned:str, gs_aligned:str)
Get a list of AlignedTokens from the aligned OCR and GS strings
tokenize_aligned("This is a@ cxample..." , "This is an example.##" )
[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4),
AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2),
AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1),
AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.##', start=10, len_ocr=10)]
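A simplified sketch of how the alignment can be exploited for tokenization; it reproduces the output above for this example, but the actual implementation may handle additional edge cases:

def tokenize_aligned(ocr_aligned: str, gs_aligned: str) -> list:
    # Sketch: split the aligned OCR on spaces and take the gold standard
    # characters at the same (aligned) positions.
    tokens = []
    start = 0    # start index in the unaligned OCR input
    cursor = 0   # position in the aligned strings
    for ocr_part in ocr_aligned.split(" "):
        gs_part = gs_aligned[cursor:cursor + len(ocr_part)]
        ocr = ocr_part.replace("@", "").replace("#", "")
        gs = gs_part.replace("@", "").replace("#", "")
        if ocr:
            tokens.append(AlignedToken(ocr, gs, ocr_part, gs_part, start, len(ocr)))
        start += len(ocr) + 1       # +1 for the whitespace between tokens
        cursor += len(ocr_part) + 1
    return tokens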
The OCR text of an AlignedToken may still consist of multiple tokens; this is the case when the OCR text contains one or more spaces. To make sure that the (sub)tokenization of a token is always the same, whether or not it has already been tokenized completely, another round of tokenization is applied, as illustrated by the sketch below.
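For illustration, a hypothetical helper (the name and return type are not part of the documented API) that splits the OCR form of a single AlignedToken into sub-tokens with recomputed start indices:

def split_ocr_token(token: AlignedToken) -> list:
    # Hypothetical sketch: a second round of whitespace tokenization on the
    # OCR text of an AlignedToken, recomputing the start index of each part.
    parts = []
    offset = 0
    for part in token.ocr.split(" "):
        if part:
            parts.append((part, token.start + offset))
        offset += len(part) + 1  # +1 for the whitespace between sub-tokens
    return parts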
leading_whitespace_offset
leading_whitespace_offset (string:str)
Return the leading whitespace offset for an aligned OCR string.
If an aligned OCR string contains leading whitespace, the offset needs to be added to the start index of the respective input tokens.
Args: string (str): aligned OCR string to determine the leading whitespace offset for
Returns: int: leading whitespace offset for the input tokens
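A minimal sketch, assuming the offset is simply the number of leading whitespace characters:

def leading_whitespace_offset(string: str) -> int:
    # Sketch: count the leading whitespace characters of the aligned OCR
    # string; this offset is added to the start index of the input tokens.
    return len(string) - len(string.lstrip())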
Process a text file
Next, we need functions for processing a text in the ICDAR data format.
Text
Text (ocr_text:str, tokens:list, input_tokens:list, score:float)
Dataclass for storing a text in the ICDAR data format
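The documented fields suggest roughly the following shape; the field comments are interpretations, not taken from the source:

from dataclasses import dataclass

@dataclass
class Text:
    ocr_text: str       # the unaligned OCR input text
    tokens: list        # AlignedTokens extracted from the aligned strings
    input_tokens: list  # (sub)tokenized input tokens with labels
    score: float        # normalized edit distance between OCR and gold standard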
clean
clean (string:str)
Remove alignment characters from a text
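A minimal sketch; based on the example outputs above, both the '@' alignment character and the '#' placeholder are assumed to be stripped:

def clean(string: str) -> str:
    # Sketch: remove the '@' alignment character; removing '#' as well is an
    # assumption based on the example outputs shown above.
    return string.replace("@", "").replace("#", "")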
normalized_ed
normalized_ed (ed:int, ocr:str, gs:str)
Return the normalized edit distance
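A sketch of one common normalization, dividing by the length of the longer string. This choice is an assumption, but it is consistent with the score of 0.2 in the example below (edit distance 4, longest string 20 characters):

def normalized_ed(ed: int, ocr: str, gs: str) -> float:
    # Assumed normalization: edit distance divided by the length of the
    # longer of the two strings (guarding against empty strings).
    return ed / max(len(ocr), len(gs), 1)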
process_text
process_text (in_file:pathlib.Path)
Process a text from the ICDAR dataset
Extract AlignedTokens and InputTokens, and calculate the normalized edit distance.
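A simplified sketch of how process_text could combine the helpers above. It is not the actual implementation: the second round of tokenization is omitted, the InputToken construction and label assignment (0 for correct tokens, 1 for OCR mistakes) are inferred from the example output below, and the editdistance package is only a stand-in for any Levenshtein implementation:

from pathlib import Path

import editdistance  # assumption: any edit distance implementation will do

def process_text(in_file: Path) -> Text:
    with open(in_file, encoding="utf-8") as f:
        lines = f.readlines()

    ocr_input = remove_label_and_nl(lines[0])
    ocr_aligned = remove_label_and_nl(lines[1])
    gs_aligned = remove_label_and_nl(lines[2])

    tokens = tokenize_aligned(ocr_aligned, gs_aligned)
    # Sketch: label a token 1 when its OCR form differs from the gold standard
    # (sub-tokenization of multi-word OCR tokens is omitted here).
    input_tokens = [
        InputToken(ocr=t.ocr, gs=t.gs, start=t.start, len_ocr=t.len_ocr,
                   label=int(t.ocr != t.gs))
        for t in tokens
    ]

    gs_text = clean(gs_aligned)
    ed = editdistance.eval(ocr_input, gs_text)
    return Text(ocr_text=ocr_input, tokens=tokens, input_tokens=input_tokens,
                score=normalized_ed(ed, ocr_input, gs_text))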
Processing the example text:
import os
from pathlib import Path

in_file = Path(os.getcwd()) / "data" / "example.txt"
text = process_text(in_file)
text
Text(ocr_text='This is a cxample...', tokens=[AlignedToken(ocr='This', gs='This', ocr_aligned='This', gs_aligned='This', start=0, len_ocr=4), AlignedToken(ocr='is', gs='is', ocr_aligned='is', gs_aligned='is', start=5, len_ocr=2), AlignedToken(ocr='a', gs='an', ocr_aligned='a@', gs_aligned='an', start=8, len_ocr=1), AlignedToken(ocr='cxample...', gs='example.', ocr_aligned='cxample...', gs_aligned='example.@@', start=10, len_ocr=10)], input_tokens=[InputToken(ocr='This', gs='This', start=0, len_ocr=4, label=0), InputToken(ocr='is', gs='is', start=5, len_ocr=2, label=0), InputToken(ocr='a', gs='an', start=8, len_ocr=1, label=1), InputToken(ocr='cxample...', gs='example.', start=10, len_ocr=10, label=1)], score=0.2)
Process the entire dataset
File structure of the ICDAR dataset
.
├── <data_dir>
│ ├── <language>
│ │ ├── <language (set)>1
│ │ ...
│ │ └── <language (set)>n
│ ...
...
generate_data
generate_data (in_dir:pathlib.Path)
Process all texts in the dataset and return a dataframe with metadata
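An assumed usage sketch; the directory path is a placeholder and the exact columns of the returned dataframe depend on the implementation:

from pathlib import Path

# Placeholder path: point this at the root of the extracted ICDAR dataset.
md = generate_data(Path("data") / "ICDAR2019")
print(md.head())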