Utils
Set random seed
set_seed
set_seed (seed:int)
*Set the random seed in the Python standard library and PyTorch
Args: seed (int): Value of the random seed*
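The helper wraps the usual seeding boilerplate. A minimal sketch of what such a function typically does, assuming only the two libraries named in the docstring (the exact body may differ):

```python
import random

import torch


def set_seed_sketch(seed: int) -> None:
    """Seed Python's `random` module and PyTorch for reproducible runs."""
    random.seed(seed)
    torch.manual_seed(seed)  # in current PyTorch this also seeds all CUDA devices
```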
Convert predictions into ICDAR output format
predictions_to_labels
predictions_to_labels (predictions)
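The signature does not document the expected input; a common pattern for a helper with this name, shown here purely as an assumption, is an argmax over the class dimension of the model's logits:

```python
import numpy as np

# Hypothetical logits of shape (batch, seq_len, n_classes)
logits = np.random.rand(2, 8, 3)
labels = np.argmax(logits, axis=-1)  # -> shape (2, 8), one class id per subtoken
```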
separate_subtoken_predictions
separate_subtoken_predictions (word_ids, preds)
merge_subtoken_predictions
merge_subtoken_predictions (subtoken_predictions)
gather_token_predictions
gather_token_predictions (preds)
Gather potentially overlapping token predictions
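BERT-style tokenizers split words into subtokens, so the model returns one prediction per subtoken; the three helpers above regroup those predictions per word and reduce each group to a single token label. The sketch below shows that idea; the max-label merge rule is an illustrative assumption, not necessarily the rule these functions use:

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_word(word_ids: List[Optional[int]], preds: List[int]) -> Dict[int, List[int]]:
    """Collect subtoken predictions per word id (None marks special tokens)."""
    grouped: Dict[int, List[int]] = defaultdict(list)
    for word_id, pred in zip(word_ids, preds):
        if word_id is not None:  # skip [CLS], [SEP], padding
            grouped[word_id].append(pred)
    return dict(grouped)


def merge_to_token_labels(grouped: Dict[int, List[int]]) -> List[int]:
    """Reduce each word's subtoken predictions to one label (here: the max)."""
    return [max(grouped[word_id]) for word_id in sorted(grouped)]


word_ids = [None, 0, 0, 1, 2, 2, None]  # [CLS] wo ##rd1 word2 wo ##rd3 [SEP]
preds = [0, 1, 0, 0, 2, 2, 0]
print(merge_to_token_labels(group_by_word(word_ids, preds)))  # [1, 0, 2]
```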
labels2label_str
labels2label_str (labels, text_key)
extract_icdar_output
extract_icdar_output (label_str, input_tokens)
predictions2icdar_output
predictions2icdar_output (samples, predictions, tokenizer, data_test)
Convert predictions into ICDAR output format
= "bert-base-multilingual-cased"
bert_base_model_name = AutoTokenizer.from_pretrained(bert_base_model_name)
tokenizer
= ["the" for i in range(1000)]
tokens
= tokenizer(tokens, is_split_into_words=True) r
Token indices sequence length is longer than the specified maximum sequence length for this model (1002 > 512). Running this sequence through the model will result in indexing errors
len(r.input_ids)
1002
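Long texts therefore have to be split or shortened before they reach the model. One standard way to stay within the limit, using the tokenizer's built-in truncation (not necessarily how this library handles it), is:

```python
r = tokenizer(tokens, is_split_into_words=True, truncation=True, max_length=512)
len(r.input_ids)  # 512
```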
Convert predictions into entity output format
create_entity
create_entity (entity_tokens)
In the entity output format, an entity looks as follows:
```python
input_tokens = [
    InputToken(ocr="one",
               gs="one",
               start=0,
               len_ocr=3,
               label=0,
    ),
    InputToken(ocr="tow",
               gs="two",
               start=4,
               len_ocr=3,
               label=1,
    ),
]
create_entity(input_tokens)
```
{'entity': 'OCR mistake', 'word': 'one tow', 'start': 0, 'end': 7}
The entity output format consists of a list of such entities.
extract_entity_output
extract_entity_output (label_str:str, input_tokens)
Convert label string to the entity output format
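As an illustration of this kind of conversion, the sketch below groups consecutive error-labeled tokens into entities. The label encoding (one character per token: '0' correct, '1' start of a mistake, '2' continuation) and the omission of character offsets are assumptions made for the example, not documented behaviour of `extract_entity_output`:

```python
from typing import Dict, List


def label_str_to_entities_sketch(label_str: str, words: List[str]) -> List[Dict]:
    """Group consecutive error-labeled tokens ('1' starts, '2' continues) into entities."""
    entities, current = [], []
    for word, label in zip(words, label_str):
        if label == "1":  # a new OCR mistake starts here
            if current:
                entities.append(current)
            current = [word]
        elif label == "2" and current:  # continuation of the current mistake
            current.append(word)
        else:  # '0': close any open entity
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    return [{"entity": "OCR mistake", "word": " ".join(e)} for e in entities]


print(label_str_to_entities_sketch("0120", ["a", "tow", "wrds", "b"]))
# [{'entity': 'OCR mistake', 'word': 'tow wrds'}]
```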
predictions2entity_output
predictions2entity_output (samples, predictions, tokenizer, data_test)
Convert predictions into entity output format
create_perfect_icdar_output
create_perfect_icdar_output (data)
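`create_perfect_icdar_output` builds the detection output directly from the gold standard, which is handy for testing downstream steps with error detection taken out of the equation. Going by the signature of `icdar_output2simple_correction_dataset_df` below and the format description linked in the evaluation table, the structure maps text keys to 'position:length' entries; the concrete keys here are made up for illustration:

```python
detection_output = {
    "en/eng_sample/00008060.txt": {
        "3:3": {},   # error of length 3 starting at character position 3
        "12:5": {},  # inner dicts would hold correction candidates; empty for detection
    }
}
```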
Running the ICDAR evaluation script
This code was taken from the original evalTool_ICDAR2017.py (CC0 License) via Kotwic4/ocr-correction.
EvalContext
EvalContext (filePath, verbose=False)
Initialize self. See help(type(self)) for accurate signature.
reshape_input_errors
reshape_input_errors (tokenPosErr, evalContext, verbose=False)
runEvaluation
runEvaluation (datasetDirPath, pathInputJsonErrorsCorrections, pathOutputCsv, verbose=False)
Main evaluation method
|  | Type | Default | Details |
|---|---|---|---|
| datasetDirPath |  |  | path to the dataset directory (ex: r"./dataset_sample") |
| pathInputJsonErrorsCorrections |  |  | input path to the JSON result (ex: r"./inputErrCor_sample.json"); format given on https://sites.google.com/view/icdar2017-postcorrectionocr/evaluation |
| pathOutputCsv |  |  | output path to the CSV evaluation results (ex: r"./outputEval.csv") |
| verbose | bool | False |  |
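A usage sketch based on the example paths from the parameter table (the paths are placeholders):

```python
runEvaluation(
    r"./dataset_sample",           # datasetDirPath
    r"./inputErrCor_sample.json",  # pathInputJsonErrorsCorrections
    r"./outputEval.csv",           # pathOutputCsv
)
```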
read_results
read_results (csv_file)
Read csv with evaluation results
Convert ICDAR output format to SimpleCorrectionDataset
icdar_output2simple_correction_dataset_df
icdar_output2simple_correction_dataset_df (output:Dict[str,Dict[str,Dict]], data:Dict[str,ocrpostcorrection.icdar_data.Text], dataset:str='test')
*Convert the ICDAR data error detection output to input for SimpleCorrectionDataset
Because the gold standard for input_tokens is not available, the dataset dataframe can no longer be used for evaluation.*
```python
import os
from pathlib import Path

data_dir = Path(os.getcwd()) / "data" / "dataset_training_sample"
data, md = generate_data(data_dir)
detection_output = create_perfect_icdar_output(data)

df = icdar_output2simple_correction_dataset_df(detection_output, data)
print(f"DataFrame contains {df.shape[0]} samples")

dataset = SimpleCorrectionDataset(df, max_len=10)
print(f"Dataset contains {len(dataset)} samples")
```
2it [00:00, 1710.91it/s]
DataFrame contains 40 samples
Dataset contains 35 samples
Summarize ICDAR results
aggregate_ed_results
aggregate_ed_results (csv_file)
aggregate_results
aggregate_results (csv_file)
read_results
read_results (csv_file)
Development
reduce_dataset
reduce_dataset (dataset, n=5)
Return dataset with the first n samples for each split
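A minimal sketch of that behaviour, assuming the dataset is a HuggingFace `datasets.DatasetDict` (the actual type used by the library is not shown in this excerpt):

```python
from datasets import Dataset, DatasetDict


def reduce_dataset_sketch(dataset: DatasetDict, n: int = 5) -> DatasetDict:
    """Keep only the first n samples of each split."""
    return DatasetDict(
        {split: ds.select(range(min(n, len(ds)))) for split, ds in dataset.items()}
    )


full = DatasetDict({"train": Dataset.from_dict({"x": list(range(100))})})
print({s: len(d) for s, d in reduce_dataset_sketch(full).items()})  # {'train': 5}
```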