Results

Experimental results for the different tasks

Results for the ICDAR 2019 Competition on Post-OCR Text Correction can be found in this paper. The best results are repeated in the tables below.

Task 1: Token Classification

Summarized results (F-measure)

Best results in bold.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | 0.77 | 0.70 | 0.95 | 0.67 | 0.69 | 0.84 | 0.67 | 0.71 | 0.82 | 0.69 |
| Experiment 1 | 0.74 | 0.64 | 0.93 | 0.62 | 0.59 | 0.82 | 0.59 | 0.66 | 0.77 | 0.63 |
| Experiment 2 (long sequences) | 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.8 | 0.69 |
| Experiment 1 reproduced with DVC | 0.75 | 0.68 | 0.96 | 0.66 | 0.64 | 0.83 | 0.69 | 0.69 | 0.81 | 0.67 |
| Experiment 1 reproduced with DVC - evaluation with old eval script | 0.75 | 0.68 | 0.96 | 0.66 | 0.64 | 0.83 | 0.66 | 0.69 | 0.81 | 0.67 |
| Experiment 1 reproduced with DVC - stratified on subset | 0.75 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.69 | 0.69 | 0.81 | 0.68 |
| Experiment 1 (code refactor, training batch size 32) | 0.74 | 0.68 | 0.96 | 0.66 | 0.63 | 0.83 | 0.67 | 0.69 | 0.80 | 0.68 |
| Experiment 1 (code refactor, training batch size 16) | 0.75 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.69 | 0.69 | 0.81 | 0.68 |

Experiment 1: token classification with Hugging Face BERT (07-02-2022)

  • Validation set: 10% of texts (stratified on language)
  • Dataset 07-02-2022
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30 (overlap of 5 tokens)
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.253900
    • Val: 0.290570
    • Test:
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.875714 | 0.665102 | 0.73898 |
| CZ | 0.81 | 0.548696 | 0.635217 |
| DE | 0.975809 | 0.88539 | 0.927579 |
| EN | 0.850833 | 0.535625 | 0.623125 |
| ES | 0.9068 | 0.464 | 0.591 |
| FI | 0.895 | 0.770125 | 0.8235 |
| FR | 0.806301 | 0.48875 | 0.592939 |
| NL | 0.870816 | 0.596939 | 0.662449 |
| PL | 0.8894 | 0.6982 | 0.7738 |
| SL | 0.805 | 0.575833 | 0.63875 |
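
The dataset above cuts each text into overlapping windows of 35 tokens with a step of 30 (so consecutive windows share 5 tokens). A minimal sketch of that windowing; the function name is mine, not from the codebase:

```python
def make_windows(tokens, size=35, step=30):
    """Split a token list into overlapping windows (overlap = size - step)."""
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already covers the end of the text
    return windows
```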

Experiment 2: token classification with Hugging Face BERT, long sequences

  • Validation set: 10% of texts (stratified on language)
  • Dataset 07-02-2022
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 150, step: 150 (no overlap)
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.224500
    • Val: 0.285791
    • Test: 0.4178357720375061
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.85 | 0.693673 | 0.744286 |
| CZ | 0.808043 | 0.623261 | 0.685652 |
| DE | 0.971874 | 0.954152 | 0.962806 |
| EN | 0.823333 | 0.618125 | 0.668333 |
| ES | 0.8722 | 0.52 | 0.6254 |
| FI | 0.896625 | 0.785625 | 0.833375 |
| FR | 0.797703 | 0.571588 | 0.651368 |
| NL | 0.875102 | 0.634286 | 0.690204 |
| PL | 0.8872 | 0.7466 | 0.8026 |
| SL | 0.806667 | 0.653333 | 0.692917 |

Remarks

For some texts, the sequences of length 150 have to be truncated to fit BERT's 512-token input limit, so predictions are missing for the truncated tokens. Decreasing the step size would give us predictions for every token, but it would also introduce more repetition into the training set, which might hurt the results.
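
To estimate how many labels are lost per window, one can count the trailing tokens whose subword pieces no longer fit in the 512-token input. A back-of-the-envelope sketch (the helper name is mine, and special tokens like [CLS]/[SEP] are ignored for simplicity):

```python
def truncated_tokens(subwords_per_token, max_len=512):
    """Count trailing tokens that lose predictions because the window's
    subword expansion exceeds the model's input limit (512 for BERT)."""
    used, kept = 0, 0
    for n in subwords_per_token:
        if used + n > max_len:
            break  # everything from here on is cut off by truncation
        used += n
        kept += 1
    return len(subwords_per_token) - kept
```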

Experiment 1 reproduced with DVC

  • ocrpostcorrection-notebooks commit: 430d228
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2398
    • Val: 0.2871749699115753
    • Test: 0.4474944472312927
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.85 | 0.71 | 0.75 |
| CZ | 0.82 | 0.6 | 0.68 |
| DE | 0.97 | 0.96 | 0.96 |
| EN | 0.81 | 0.6 | 0.66 |
| ES | 0.87 | 0.54 | 0.64 |
| FI | 0.91 | 0.78 | 0.83 |
| FR | 0.8 | 0.63 | 0.69 |
| NL | 0.86 | 0.63 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.62 | 0.67 |
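
The 0.3 threshold above filters out train/validation ‘sentences’ whose OCR text diverges too much from the gold transcription. A sketch of such a filter, assuming character-level Levenshtein distance normalized by the length of the longer string (function names are mine):

```python
def levenshtein(a, b):
    """Character-level edit distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def keep_pair(ocr, gold, threshold=0.3):
    """Keep a sentence pair only if its normalized edit distance is small enough."""
    ned = levenshtein(ocr, gold) / max(len(ocr), len(gold), 1)
    return ned <= threshold
```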

Experiment 1 reproduced with DVC, stratified on subset

  • ocrpostcorrection-notebooks commit: 6231bca
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2439
    • Val: 0.2839458584785461
    • Test: 0.4422231018543243
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.7 | 0.75 |
| CZ | 0.85 | 0.6 | 0.69 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.82 | 0.61 | 0.67 |
| ES | 0.89 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.81 | 0.62 | 0.69 |
| NL | 0.87 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.64 | 0.68 |
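
All runs hold out 10% of the texts for validation, stratified on language (or, in this variant, on subset). A minimal sketch of such a per-group split; names are mine, and the actual pipeline uses its own split code and seed handling:

```python
import random
from collections import defaultdict

def stratified_split(keys, labels, val_fraction=0.1, seed=8232):
    """Hold out val_fraction of the items per label (here: per language)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for key, label in zip(keys, labels):
        by_label[label].append(key)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```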

Experiment 1 with training batch size 32 (2023-07-28)

  • ocrpostcorrection-notebooks commit: 8f7329e
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2625
    • Val: 0.2949527204036712
    • Test: 0.4553228318691253
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.69 | 0.74 |
| CZ | 0.85 | 0.59 | 0.68 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.83 | 0.59 | 0.66 |
| ES | 0.88 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.8 | 0.61 | 0.67 |
| NL | 0.86 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.8 |
| SL | 0.82 | 0.63 | 0.68 |

Remarks

Training batch size was set to 32 (instead of 16). Trained on Google Colab T4 High RAM.

Experiment 1 with training batch size 16 (2023-07-28)

  • ocrpostcorrection-notebooks commit: 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2439
    • Val: 0.2839458584785461
    • Test: 0.4422231018543243
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.7 | 0.75 |
| CZ | 0.85 | 0.6 | 0.69 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.82 | 0.61 | 0.67 |
| ES | 0.89 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.81 | 0.62 | 0.69 |
| NL | 0.87 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.64 | 0.68 |

Remarks

Training batch size was set to 16 again. Trained on Google Colab T4 High RAM.

Results are back to what they were for Experiment 1 reproduced with DVC - stratified on subset (as expected). It seems that batch size has only a small impact on performance.

Task 2: Error Correction

Task 1 Perfect

Summarized results (average % of improvement in edit distance between the original and the corrected text). The input is the ‘perfect’ (gold standard) error detection output.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Experiment 1 | -38 | -64 | 47 | -13 | -1 | 14 | -8 | -4 | -14 | -34 |
| Experiment 1 reproduced with DVC | 14 | -64 | 21 | -2 | 17 | 15 | nan | 8 | -5 | -25 |
| Baseline 2 (hidden size 768) | 17 | -67 | 25 | -4 | 17 | 21 | -3 | 10 | -7 | -32 |
| byt5-small experiment 1 | 16 | -18 | 56 | 2 | 7 | 38 | 11 | 6 | 9 | -14 |
| byt5-small experiment 2 | 21 | -14 | 65 | 2 | 10 | 42 | 13 | 14 | 17 | -8 |
| byt5-small experiment 3 | 25 | 8 | 66 | 17 | 19 | 48 | 10 | 24 | 27 | 18 |
| byt5-small experiment 4 | 23 | 10 | 72 | 18 | 22 | 50 | 25 | 28 | 28 | 15 |
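
The summary metric is the percentage of improvement in edit distance to the gold text, averaged over texts; negative values mean the ‘correction’ made the text worse. A sketch of the per-text computation, assuming character-level Levenshtein distance (function names are mine):

```python
def levenshtein(a, b):
    """Character-level edit distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pct_improvement(ocr, corrected, gold):
    """Percentage reduction of the edit distance to the gold text;
    negative values mean the correction made things worse."""
    before = levenshtein(ocr, gold)
    after = levenshtein(corrected, gold)
    if before == 0:
        return 0.0  # nothing to improve on a perfect OCR text
    return 100 * (before - after) / before
```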

Task 1 Results

Summarized results (average % of improvement in edit distance between the original and the corrected text). The input is the errors detected by a model; the experiment notes specify which error detection model was used.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | 9 | 6 | 24 | 11 | 11 | 8 | 5 | 12 | 17 | 14 |
| Experiment 1 reproduced with DVC | -5 | -45 | 25 | -16 | -6 | 12 | -13 | -9 | -15 | -37 |
| Baseline 2 (hidden size 768) | -4 | -48 | 28 | -20 | -7 | 18 | -13 | -7 | -15 | -47 |
| byt5-small experiment 1 | 10 | -21 | 52 | -9 | 1 | 34 | 2 | -0 | 5 | -24 |
| byt5-small experiment 2 | 13 | -15 | 59 | -7 | 3 | 37 | 1 | 6 | 11 | -19 |
| byt5-small experiment 3 | 16 | -4 | 60 | -7 | 10 | 42 | -2 | 15 | 19 | 11 |
| byt5-small experiment 4 | 14 | -2 | 65 | -4 | 11 | 43 | 7 | 17 | 21 | 1 |

Error Correction Experiment Notes

Correction experiment 1 reproduced with DVC (2023-07-29)

  • ocrpostcorrection-notebooks commit: dc2af99
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: SimpleCorrectionSeq2seq
  • Decoder: GreedySearchDecoder
  • Loss
    • Train: 8.595188051536567
    • Val: 9.212168355464936
    • Test: 9.366749288250466

Trained on Google Colab T4 High RAM.
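
GreedySearchDecoder picks the highest-scoring token at every step until the end-of-sequence symbol is produced. A model-agnostic sketch; `step_fn` stands in for the seq2seq decoder step and the names are mine:

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=22):
    """Greedy search: take the argmax token at each step until EOS or
    max_len (22 matches the max token length used above)."""
    out = [sos_id]
    for _ in range(max_len):
        scores = step_fn(out)  # one score per vocabulary id for the next token
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        out.append(next_id)
    return out[1:]  # drop the start-of-sequence token
```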

Correction experiment baseline 2 (2023-08-05)

  • ocrpostcorrection-notebooks commit: 45fa416
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: SimpleCorrectionSeq2seq
  • Decoder: GreedySearchDecoder
  • Loss (Updated run from commit 765a7df)
    • Train: 7.310251626014709
    • Val: 7.631718857658534
    • Test: 8.613492756178495 (Updated run from commit 3955d38)

The hidden size was set to 768 to create a baseline for an experiment that uses BERT hidden vectors as additional input.

Trained on Google Colab T4 High RAM.

The results in the table were recalculated after the problem with nan and -inf values for two French texts was fixed. The results are tracked in DVC in commit 765a7df.

Experiment byt5-small 1 (2024-03-01)

  • ocrpostcorrection-notebooks commit: b677b6b
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: byt5-small
    • Number of epochs: 1
    • Optimizer: AdamW (default)
  • Loss
    • Train: 0.6192
    • Val: 0.4882390201091766
    • Test: 0.5294567942619324

Trained on Google Colab T4.

Experiment byt5-small 2: Adafactor optimizer (2024-03-08)

  • ocrpostcorrection-notebooks commit: 7665d86
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.4592
    • Val: 0.3836239278316498
    • Test: 0.4266203045845032

Trained on Google Colab T4.

Experiment byt5-small 3: language as task prefix (2024-03-17)

  • ocrpostcorrection-notebooks commit: 855b2cf
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: google/byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.4285
    • Val: 0.356132298707962
    • Test: 0.3938777148723602

Trained on Google Colab T4.
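
Experiment 3 prepends the language as a task prefix to each input. ByT5 operates directly on UTF-8 bytes (token id = byte value + 3, with ids 0/1/2 reserved for pad/EOS/unk), so the prefix is simply extra bytes at the front. A sketch; the exact prefix format (e.g. "FR: ") is my assumption, not necessarily the notebook's:

```python
def byt5_encode(text, eos_id=1, offset=3):
    """ByT5-style encoding: each UTF-8 byte becomes (byte + 3); ids 0, 1, 2
    are reserved for pad, EOS, and unk. EOS is appended at the end."""
    return [b + offset for b in text.encode("utf-8")] + [eos_id]

def with_language_prefix(language, ocr_text):
    """Hypothetical prefix format: prepend the language, e.g. 'FR: <text>'."""
    return byt5_encode(f"{language}: {ocr_text}")
```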

Experiment byt5-small 4: marked errors in context without task prefix (2024-07-27)

  • ocrpostcorrection-notebooks commit: 2fb58d4
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: google/byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.2885
    • Val: 0.2521135210990906
    • Test: 0.2787809669971466

Trained on Google Colab A100.

ocrpostcorrection version used: e2176bb. This is the commit from before filter_len_ocr_mistake_in_context was added, so no filtering on input length took place. (There was a bug in this function.) Training probably succeeded anyway because the A100 GPU has much more memory than the V100 used for the previous attempt at this experiment.
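
Experiment 4 feeds the model each detected error marked inside its surrounding context, without a task prefix. A sketch of one way to build such an input; the delimiters and context window size are my assumptions, not the notebook's actual format:

```python
def mark_error_in_context(text, start, end, left="[", right="]", window=10):
    """Wrap the detected OCR error (text[start:end]) in delimiters and keep
    a few characters of context on either side."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    return text[lo:start] + left + text[start:end] + right + text[end:hi]
```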