Utils

Utility functions

Set random seed


set_seed

 set_seed (seed:int)

Set the random seed in the Python standard library and PyTorch

Args: seed (int): Value of the random seed

set_seed(23)
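
A minimal sketch of what such a helper might look like, assuming it seeds Python's `random` module and, when it is installed, PyTorch. The name `set_seed_sketch` and the optional-import guard are illustrative assumptions, not the library's actual implementation.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed Python's random module and, if available, PyTorch."""
    random.seed(seed)
    try:
        import torch  # optional in this sketch; the real helper presumably requires it
    except ImportError:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Reseeding with the same value reproduces the same random draws.
set_seed_sketch(23)
first = [random.random() for _ in range(3)]
set_seed_sketch(23)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```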

Convert predictions into ICDAR output format


predictions_to_labels

 predictions_to_labels (predictions)

separate_subtoken_predictions

 separate_subtoken_predictions (word_ids, preds)

merge_subtoken_predictions

 merge_subtoken_predictions (subtoken_predictions)
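
The two signatures above suggest a two-step reduction from subtoken-level to token-level predictions. The following is a sketch under stated assumptions: `word_ids` is taken to have the shape returned by a Hugging Face fast tokenizer (`None` for special tokens), and the merge step uses a majority vote, which is a swapped-in technique; the library may combine subtoken predictions differently. The names `separate_sketch` and `merge_sketch` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def separate_sketch(word_ids: List[Optional[int]], preds: List[int]) -> Dict[int, List[int]]:
    """Group subtoken-level predictions by the word they belong to."""
    grouped = defaultdict(list)
    for word_id, pred in zip(word_ids, preds):
        if word_id is not None:  # None marks special tokens such as [CLS]/[SEP]
            grouped[word_id].append(pred)
    return dict(grouped)


def merge_sketch(grouped: Dict[int, List[int]]) -> List[int]:
    """Reduce each word's subtoken labels to one label by majority vote (assumption)."""
    return [Counter(labels).most_common(1)[0][0] for _, labels in sorted(grouped.items())]


word_ids = [None, 0, 1, 1, 1, 2, None]
preds = [0, 0, 1, 1, 0, 0, 0]
grouped = separate_sketch(word_ids, preds)
print(grouped)                # {0: [0], 1: [1, 1, 0], 2: [0]}
print(merge_sketch(grouped))  # [0, 1, 0]
```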

gather_token_predictions

 gather_token_predictions (preds)

Gather potentially overlapping token predictions


labels2label_str

 labels2label_str (labels, text_key)

extract_icdar_output

 extract_icdar_output (label_str, input_tokens)

predictions2icdar_output

 predictions2icdar_output (samples, predictions, tokenizer, data_test)

Convert predictions into ICDAR output format

from transformers import AutoTokenizer

bert_base_model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_base_model_name)

# 1000 words, which exceeds the model's maximum sequence length of 512
tokens = ["the"] * 1000

r = tokenizer(tokens, is_split_into_words=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (1002 > 512). Running this sequence through the model will result in indexing errors
len(r.input_ids)
1002

Convert predictions into entity output format


create_entity

 create_entity (entity_tokens)

In the entity output format, an entity looks as follows:

input_tokens = [
    InputToken(
        ocr="one",
        gs="one",
        start=0,
        len_ocr=3,
        label=0,
    ),
    InputToken(
        ocr="tow",
        gs="two",
        start=4,
        len_ocr=3,
        label=1,
    ),
]
create_entity(input_tokens)
{'entity': 'OCR mistake', 'word': 'one tow', 'start': 0, 'end': 7}

The entity output format consists of a list of such entities.
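
The example output above is consistent with an entity that joins the OCR strings with spaces and spans from the first token's start to the last token's end. A minimal sketch under that assumption follows; `TokenSketch` and `create_entity_sketch` are hypothetical stand-ins, not the library's `InputToken` or `create_entity`.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class TokenSketch:
    """Hypothetical stand-in with the InputToken fields needed here."""
    ocr: str
    start: int
    len_ocr: int


def create_entity_sketch(tokens: List[TokenSketch]) -> Dict[str, Union[str, int]]:
    """Build one entity spanning the first to the last token."""
    first, last = tokens[0], tokens[-1]
    return {
        "entity": "OCR mistake",
        "word": " ".join(t.ocr for t in tokens),
        "start": first.start,
        "end": last.start + last.len_ocr,  # end offset is exclusive
    }


entity = create_entity_sketch([TokenSketch("one", 0, 3), TokenSketch("tow", 4, 3)])
print(entity)
# {'entity': 'OCR mistake', 'word': 'one tow', 'start': 0, 'end': 7}
```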


extract_entity_output

 extract_entity_output (label_str:str, input_tokens)

Convert label string to the entity output format


predictions2entity_output

 predictions2entity_output (samples, predictions, tokenizer, data_test)

Convert predictions into entity output format


create_perfect_icdar_output

 create_perfect_icdar_output (data)

Running the ICDAR evaluation script

This code was taken from the original evalTool_ICDAR2017.py (CC0 License) via Kotwic4/ocr-correction.


EvalContext

 EvalContext (filePath, verbose=False)


reshape_input_errors

 reshape_input_errors (tokenPosErr, evalContext, verbose=False)

runEvaluation

 runEvaluation (datasetDirPath, pathInputJsonErrorsCorrections,
                pathOutputCsv, verbose=False)

Main evaluation method

| Parameter | Type | Default | Details |
|---|---|---|---|
| datasetDirPath | | | path to the dataset directory (ex: r"./dataset_sample") |
| pathInputJsonErrorsCorrections | | | input path to the JSON result (ex: r"./inputErrCor_sample.json"), format given on https://sites.google.com/view/icdar2017-postcorrectionocr/evaluation |
| pathOutputCsv | | | output path to the CSV evaluation results (ex: r"./outputEval.csv") |
| verbose | bool | False | |

read_results

 read_results (csv_file)

Read csv with evaluation results

Convert ICDAR output format to SimpleCorrectionDataset


icdar_output2simple_correction_dataset_df

 icdar_output2simple_correction_dataset_df
                                            (output:Dict[str,Dict[str,Dict]],
                                            data:Dict[str,ocrpostcorrection.icdar_data.Text],
                                            dataset:str='test')

Convert the ICDAR error detection output into input for SimpleCorrectionDataset

Because the gold standard for input_tokens is not available, the resulting DataFrame cannot be used for evaluation.

import os
from pathlib import Path

data_dir = Path(os.getcwd()) / "data" / "dataset_training_sample"
data, md = generate_data(data_dir)

detection_output = create_perfect_icdar_output(data)

df = icdar_output2simple_correction_dataset_df(detection_output, data)
print(f"DataFrame contains {df.shape[0]} samples")

dataset = SimpleCorrectionDataset(df, max_len=10)
print(f"Dataset contains {len(dataset)} samples")
2it [00:00, 1710.91it/s]
DataFrame contains 40 samples
Dataset contains 35 samples

Summarize ICDAR results


aggregate_ed_results

 aggregate_ed_results (csv_file)

aggregate_results

 aggregate_results (csv_file)

Development


reduce_dataset

 reduce_dataset (dataset, n=5)

Return dataset with the first n samples for each split
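
A sketch of the idea, assuming the dataset can be treated as a mapping from split name to a sequence of samples. The real function may operate on a different container type; `reduce_dataset_sketch` and the plain-dict input are illustrative assumptions.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")


def reduce_dataset_sketch(dataset: Dict[str, List[T]], n: int = 5) -> Dict[str, List[T]]:
    """Keep only the first n samples of every split."""
    return {split: samples[:n] for split, samples in dataset.items()}


toy = {"train": list(range(100)), "val": list(range(20)), "test": list(range(3))}
small = reduce_dataset_sketch(toy, n=5)
print({split: len(samples) for split, samples in small.items()})
# {'train': 5, 'val': 5, 'test': 3}
```

Splits with fewer than n samples are returned unchanged, which keeps the helper safe to use on small development sets.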