Results
Results for the ICDAR 2019 Competition on Post-OCR Text Correction can be found in this paper. The best results are repeated in the tables below.
Task 1: Token Classification
Summarized results (F-measure)
Best results in bold.

Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
---|---|---|---|---|---|---|---|---|---|---|
CCC (2019 competition winner) | **0.77** | **0.70** | 0.95 | **0.67** | **0.69** | **0.84** | 0.67 | **0.71** | **0.82** | **0.69** |
Experiment 1 | 0.74 | 0.64 | 0.93 | 0.62 | 0.59 | 0.82 | 0.59 | 0.66 | 0.77 | 0.63 |
Experiment 2 (long sequences) | 0.74 | 0.69 | **0.96** | **0.67** | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | **0.69** |
Experiment 1 reproduced with DVC | 0.75 | 0.68 | **0.96** | 0.66 | 0.64 | 0.83 | **0.69** | 0.69 | 0.81 | 0.67 |
Experiment 1 reproduced with DVC - evaluation with old eval script | 0.75 | 0.68 | **0.96** | 0.66 | 0.64 | 0.83 | 0.66 | 0.69 | 0.81 | 0.67 |
Experiment 1 reproduced with DVC - stratified on subset | 0.75 | 0.69 | **0.96** | **0.67** | 0.63 | 0.83 | **0.69** | 0.69 | 0.81 | 0.68 |
Experiment 1 (code refactor, training batch size 32) | 0.74 | 0.68 | **0.96** | 0.66 | 0.63 | 0.83 | 0.67 | 0.69 | 0.80 | 0.68 |
Experiment 1 (code refactor, training batch size 16) | 0.75 | 0.69 | **0.96** | **0.67** | 0.63 | 0.83 | **0.69** | 0.69 | 0.81 | 0.68 |
Experiment 1: token classification with Huggingface BERT (07-02-2022)
- Validation set: 10% of texts (stratified on language)
- Dataset 07-02-2022
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
- Sequence (sentence) length: 35, step: 30 (overlap of 5 tokens)
- Pretrained model: bert-base-multilingual-cased (a fine-tuning sketch follows the results table below)
- Loss
- Train: 0.253900
- Val: 0.290570
- Test:
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.875714 | 0.665102 | 0.73898 |
CZ | 0.81 | 0.548696 | 0.635217 |
DE | 0.975809 | 0.88539 | 0.927579 |
EN | 0.850833 | 0.535625 | 0.623125 |
ES | 0.9068 | 0.464 | 0.591 |
FI | 0.895 | 0.770125 | 0.8235 |
FR | 0.806301 | 0.48875 | 0.592939 |
NL | 0.870816 | 0.596939 | 0.662449 |
PL | 0.8894 | 0.6982 | 0.7738 |
SL | 0.805 | 0.575833 | 0.63875 |
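For reference, below is a minimal sketch of this token classification setup using the Huggingface transformers API. The label count and the example window are illustrative assumptions, not the exact notebook code.

```python
# Minimal sketch of the token classification setup; num_labels and the example
# window are assumptions, not the exact notebook code.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)

# One window of at most 35 tokens; consecutive windows overlap by 5 tokens
# (length 35, step 30).
window = ["Thc", "cat", "sat", "on", "the", "mat"]
encoding = tokenizer(window, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, number of subtokens, num_labels)
predicted_labels = logits.argmax(dim=-1)
```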
Experiment 2: token classification with Huggingface BERT, long sequences
- Validation set: 10% of texts (stratified on language)
- Dataset 07-02-2022
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
- Sequence (sentence) length: 150, step: 150
- Pretrained model: bert-base-multilingual-cased
- Loss
- Train: 0.224500
- Val: 0.285791
- Test: 0.4178357720375061
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.85 | 0.693673 | 0.744286 |
CZ | 0.808043 | 0.623261 | 0.685652 |
DE | 0.971874 | 0.954152 | 0.962806 |
EN | 0.823333 | 0.618125 | 0.668333 |
ES | 0.8722 | 0.52 | 0.6254 |
FI | 0.896625 | 0.785625 | 0.833375 |
FR | 0.797703 | 0.571588 | 0.651368 |
NL | 0.875102 | 0.634286 | 0.690204 |
PL | 0.8872 | 0.7466 | 0.8026 |
SL | 0.806667 | 0.653333 | 0.692917 |
Remarks
For some texts, the sequences of length 150 have to be truncated to fit into BERT's 512-input-token limit. Consequently, we are missing predictions for the truncated tokens. It may be a good idea to decrease the step size, so that we get predictions for every token. However, this would also mean more repetition in the training set, which might negatively impact the results.
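A minimal sketch of this trade-off, assuming tokens are stored as a plain list:

```python
def windows(tokens, length=150, step=150):
    """Split a text into token windows of `length` tokens, advancing `step`
    tokens at a time.

    With step == length the windows are disjoint, so tokens that BERT truncates
    beyond its 512-subtoken limit get no prediction at all. With step < length
    (e.g. length=35, step=30) windows overlap and every token appears in more
    than one window, at the cost of repeated material in the training set.
    """
    for start in range(0, len(tokens), step):
        yield tokens[start:start + length]
```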
Experiment 1 reproduced with DVC
- ocrpostcorrection-notebooks commit: 430d228
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
- Sequence (sentence) length: 35, step: 30
- Pretrained model: bert-base-multilingual-cased
- Loss
- Train: 0.2398
- Val: 0.2871749699115753
- Test: 0.4474944472312927
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.85 | 0.71 | 0.75 |
CZ | 0.82 | 0.6 | 0.68 |
DE | 0.97 | 0.96 | 0.96 |
EN | 0.81 | 0.6 | 0.66 |
ES | 0.87 | 0.54 | 0.64 |
FI | 0.91 | 0.78 | 0.83 |
FR | 0.8 | 0.63 | 0.69 |
NL | 0.86 | 0.63 | 0.69 |
PL | 0.89 | 0.75 | 0.81 |
SL | 0.81 | 0.62 | 0.67 |
Experiment 1 reproduced with DVC, stratified on subset
- ocrpostcorrection-notebooks commit: 6231bca
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val; a split-and-filter sketch follows the results table below)
- Sequence (sentence) length: 35, step: 30
- Pretrained model: bert-base-multilingual-cased
- Loss
- Train: 0.2439
- Val: 0.2839458584785461
- Test: 0.4422231018543243
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.86 | 0.7 | 0.75 |
CZ | 0.85 | 0.6 | 0.69 |
DE | 0.97 | 0.95 | 0.96 |
EN | 0.82 | 0.61 | 0.67 |
ES | 0.89 | 0.53 | 0.63 |
FI | 0.89 | 0.79 | 0.83 |
FR | 0.81 | 0.62 | 0.69 |
NL | 0.87 | 0.64 | 0.69 |
PL | 0.89 | 0.75 | 0.81 |
SL | 0.81 | 0.64 | 0.68 |
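The split and filter settings above can be summarized in a sketch. Assumptions: a plain dict of texts per language and a hand-rolled Levenshtein implementation; the repo's actual helpers are not shown here.

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def keep_sequence(ocr: str, gold: str, threshold: float = 0.3) -> bool:
    """Keep a train/val 'sentence' only if its normalized edit distance to the
    gold transcription is below the threshold (0.3 in these experiments)."""
    return levenshtein(ocr, gold) / max(len(ocr), len(gold), 1) < threshold

def split_texts(texts_by_language: dict, seed: int = 8232, val_fraction: float = 0.1):
    """Stratified split: take ~10% of the texts of each language for validation."""
    rng = random.Random(seed)
    train, val = [], []
    for language, texts in texts_by_language.items():
        texts = texts[:]
        rng.shuffle(texts)
        n_val = max(1, round(val_fraction * len(texts)))
        val.extend(texts[:n_val])
        train.extend(texts[n_val:])
    return train, val
```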
Experiment 1 with training batch size 32 (2023-07-28)
- ocrpostcorrection-notebooks commit: 8f7329e
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
- Sequence (sentence) length: 35, step: 30
- Pretrained model: bert-base-multilingual-cased
- Loss
- Train: 0.2625
- Val: 0.2949527204036712
- Test: 0.4553228318691253
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.86 | 0.69 | 0.74 |
CZ | 0.85 | 0.59 | 0.68 |
DE | 0.97 | 0.95 | 0.96 |
EN | 0.83 | 0.59 | 0.66 |
ES | 0.88 | 0.53 | 0.63 |
FI | 0.89 | 0.79 | 0.83 |
FR | 0.8 | 0.61 | 0.67 |
NL | 0.86 | 0.64 | 0.69 |
PL | 0.89 | 0.75 | 0.8 |
SL | 0.82 | 0.63 | 0.68 |
Remarks
Training batch size was set to 32 (instead of 16). Trained on Google Colab T4 High RAM.
Experiment 1 with training batch size 16 (2023-07-28)
- ocrpostcorrection-notebooks commit: 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
- Sequence (sentence) length: 35, step: 30
- Pretrained model: bert-base-multilingual-cased
- Loss
- Train: 0.2439
- Val: 0.2839458584785461
- Test: 0.4422231018543243
language | T1_Precision | T1_Recall | T1_Fmesure |
---|---|---|---|
BG | 0.86 | 0.7 | 0.75 |
CZ | 0.85 | 0.6 | 0.69 |
DE | 0.97 | 0.95 | 0.96 |
EN | 0.82 | 0.61 | 0.67 |
ES | 0.89 | 0.53 | 0.63 |
FI | 0.89 | 0.79 | 0.83 |
FR | 0.81 | 0.62 | 0.69 |
NL | 0.87 | 0.64 | 0.69 |
PL | 0.89 | 0.75 | 0.81 |
SL | 0.81 | 0.64 | 0.68 |
Remarks
Training batch size was set to 16 again. Trained on Google Colab T4 High RAM.
The results are back to those of ‘Experiment 1 reproduced with DVC - stratified on subset’, as expected. It seems that batch size has only a small impact on performance.
Task 2: Error Correction
Task 1 Perfect
Summarized results (average percentage of improvement in edit distance between the original and the corrected text; a sketch of this metric follows the table below). The input is the ‘perfect’ error detection results.
Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
---|---|---|---|---|---|---|---|---|---|---|
CCC (2019 competition winner) | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
Experiment 1 | -38 | -64 | 47 | -13 | -1 | 14 | -8 | -4 | -14 | -34 |
Experiment 1 reproduced with DVC | 14 | -64 | 21 | -2 | 17 | 15 | nan | 8 | -5 | -25 |
Baseline 2 (hidden size 768) | 17 | -67 | 25 | -4 | 17 | 21 | -3 | 10 | -7 | -32 |
byt5-small experiment 1 | 16 | -18 | 56 | 2 | 7 | 38 | 11 | 6 | 9 | -14 |
byt5-small experiment 2 | 21 | -14 | 65 | 2 | 10 | 42 | 13 | 14 | 17 | -8 |
byt5-small experiment 3 | 25 | 8 | 66 | 17 | 19 | 48 | 10 | 24 | 27 | 18 |
byt5-small experiment 4 | 23 | 10 | 72 | 18 | 22 | 50 | 25 | 28 | 28 | 15 |
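The improvement percentages can be read as the relative reduction in edit distance to the gold standard; a negative value means the ‘correction’ moved the text further away. A sketch of the assumed formulation, reusing the levenshtein helper from the split-and-filter sketch above:

```python
def improvement_pct(ocr: str, corrected: str, gold: str) -> float:
    """Percentage of improvement in edit distance: positive when the corrected
    text is closer to the gold standard than the original OCR output."""
    before = levenshtein(ocr, gold)
    after = levenshtein(corrected, gold)
    return 100 * (before - after) / before  # assumes the OCR text is not already perfect
```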
Task 1 Results
Summarized results (average percentage of improvement in edit distance between the original and the corrected text). The input is the errors detected by a model; the experiment notes specify which error detection model was used.
Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
---|---|---|---|---|---|---|---|---|---|---|
CCC (2019 competition winner) | 9 | 6 | 24 | 11 | 11 | 8 | 5 | 12 | 17 | 14 |
Experiment 1 reproduced with DVC | -5 | -45 | 25 | -16 | -6 | 12 | -13 | -9 | -15 | -37 |
Baseline 2 (hidden size 768) | -4 | -48 | 28 | -20 | -7 | 18 | -13 | -7 | -15 | -47 |
byt5-small experiment 1 | 10 | -21 | 52 | -9 | 1 | 34 | 2 | -0 | 5 | -24 |
byt5-small experiment 2 | 13 | -15 | 59 | -7 | 3 | 37 | 1 | 6 | 11 | -19 |
byt5-small experiment 3 | 16 | -4 | 60 | -7 | 10 | 42 | -2 | 15 | 19 | 11 |
byt5-small experiment 4 | 14 | -2 | 65 | -4 | 11 | 43 | 7 | 17 | 21 | 1 |
Error Correction Experiment Notes
Correction experiment 1 reproduced with DVC (2023-07-29)
- ocrpostcorrection-notebooks commit: dc2af99
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: `SimpleCorrectionSeq2seq`
- Decoder: `GreedySearchDecoder` (see the greedy decoding sketch below)
- Loss
- Train: 8.595188051536567
- Val: 9.212168355464936
- Test: 9.366749288250466
Trained on Google Colab T4 High RAM.
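`SimpleCorrectionSeq2seq` and `GreedySearchDecoder` come from the ocrpostcorrection package; the sketch below shows the general shape of greedy decoding for such a model, with an assumed encode/decode_step interface rather than the package's actual API.

```python
import torch

def greedy_decode(model, src, sos_idx, eos_idx, max_len=22):
    """Greedy decoding for an encoder-decoder correction model: at every step,
    feed the highest-scoring token back in until EOS or max_len (here the max
    token length of 22). `model.encode` and `model.decode_step` are an assumed
    interface, not the ocrpostcorrection API."""
    memory = model.encode(src)          # encode the OCR input once
    token = torch.tensor([[sos_idx]])   # start-of-sequence token
    hidden = None
    output = []
    for _ in range(max_len):
        logits, hidden = model.decode_step(token, memory, hidden)
        token = logits.argmax(dim=-1)   # greedy choice: highest-scoring token
        if token.item() == eos_idx:
            break
        output.append(token.item())
    return output
```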
Correction experiment baseline 2 (2023-08-05)
- ocrpostcorrection-notebooks commit: 45fa416
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: `SimpleCorrectionSeq2seq`
- Decoder: `GreedySearchDecoder`
- Loss (Updated run from commit 765a7df)
- Train: 7.310251626014709
- Val: 7.631718857658534
- Test: 8.613492756178495 (Updated run from commit 3955d38)
The hidden size was set to 768 to create a baseline for an experiment with BERT hidden vectors as additional input.
Trained on Google Colab T4 High RAM.
The results in the table have been recalculated after the problem with nan and -inf values for two French texts was fixed. The results are tracked in DVC in commit 765a7df.
Experiment byt5-small 1 (2024-03-01)
- ocrpostcorrection-notebooks commit: b677b6b
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: byt5-small
- Number of epochs: 1
- Optimizer: AdamW (default)
- Loss
- Train: 0.6192
- Val: 0.4882390201091766
- Test: 0.5294567942619324
Trained on Google Colab T4.
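A minimal sketch of a single byt5-small training step with the default AdamW optimizer; everything beyond the parameters listed above (model, one epoch, default AdamW) is illustrative.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters())  # default AdamW settings

# One (OCR mistake, correction) pair as an illustrative example.
batch = tokenizer(["Thc cat"], text_target=["The cat"], return_tensors="pt")
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```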
Experiment byt5-small 2: Adafactor optimizer (2024-03-08)
- ocrpostcorrection-notebooks commit: 7665d86
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: byt5-small
- Number of epochs: 1
- Optimizer: Adafactor
- Loss
- Train: 0.4592
- Val: 0.3836239278316498
- Test: 0.4266203045845032
Trained on Google Colab T4.
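Switching to Adafactor presumably follows the transformers documentation for T5-style models (a constant learning rate instead of the relative-step schedule); the exact settings used in the notebook are not recorded here.

```python
from transformers.optimization import Adafactor

# `model` as in the previous sketch; these are the settings recommended in the
# transformers docs for T5 fine-tuning, assumed rather than taken from the notebook.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```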
Experiment byt5-small 3: language as task prefix (2024-03-17)
- ocrpostcorrection-notebooks commit: 855b2cf
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: google/byt5-small
- Number of epochs: 1
- Optimizer: AdaFactor
- Loss
- Train: 0.4285
- Val: 0.356132298707962
- Test: 0.3938777148723602
Trained on Google Colab T4.
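The task prefix presumably just prepends the language code to each model input; the exact prefix format below is an assumption.

```python
def with_language_prefix(language: str, ocr_text: str) -> str:
    # Hypothetical prefix format: e.g. "FR: Jc suis" for a French OCR mistake.
    return f"{language}: {ocr_text}"
```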
Experiment byt5-small 4: marked errors in context without task prefix (2024-07-27)
- ocrpostcorrection-notebooks commit: 2fb58d4
- Detection model from experiment 9099e78
- Dataset
- Split seed: 8232
- Validation set: 10.0%
- Max token length: 22
- Model: google/byt5-small
- Number of epochs: 1
- Optimizer: Adafactor
- Loss
- Train: 0.2885
- Val: 0.2521135210990906
- Test: 0.2787809669971466
Trained on Google Colab A100.
Ocrpostcorrection version used: e2176bb. This is the commit before filter_len_ocr_mistake_in_context was added, so no filtering on input length took place. (There was a bug in this function.) Training probably succeeded anyway because the A100 GPU has much more memory than the V100 used for the previous attempt at this experiment.
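The ‘marked errors in context’ input presumably embeds the detected error, wrapped in marker characters, in a slice of its surrounding text; the markers and context width below are assumptions.

```python
def mark_error_in_context(text: str, start: int, end: int, context: int = 10) -> str:
    """Return the detected error text[start:end] wrapped in (hypothetical)
    `**` markers, together with up to `context` characters on either side."""
    left = text[max(0, start - context):start]
    right = text[end:end + context]
    return f"{left}**{text[start:end]}**{right}"

print(mark_error_in_context("The cat sat on thc mat", 15, 18))
# -> 'at sat on **thc** mat'
```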