Results

Experimental results for the different tasks

Results for the ICDAR 2019 Competition on Post-OCR Text Correction can be found in this paper. The best results are repeated in the tables below.

Task 1: Token Classification

Summarized results (F-measure)

Best results in bold.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | 0.77 | 0.70 | 0.95 | 0.67 | 0.69 | 0.84 | 0.67 | 0.71 | 0.82 | 0.69 |
| Experiment 1 | 0.74 | 0.64 | 0.93 | 0.62 | 0.59 | 0.82 | 0.59 | 0.66 | 0.77 | 0.63 |
| Experiment 2 (long sequences) | 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.8 | 0.69 |
| Experiment 1 reproduced with DVC | 0.75 | 0.68 | 0.96 | 0.66 | 0.64 | 0.83 | 0.69 | 0.69 | 0.81 | 0.67 |
| Experiment 1 reproduced with DVC - evaluation with old eval script | 0.75 | 0.68 | 0.96 | 0.66 | 0.64 | 0.83 | 0.66 | 0.69 | 0.81 | 0.67 |
| Experiment 1 reproduced with DVC - stratified on subset | 0.75 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.69 | 0.69 | 0.81 | 0.68 |
| Experiment 1 (code refactor, training batch size 32) | 0.74 | 0.68 | 0.96 | 0.66 | 0.63 | 0.83 | 0.67 | 0.69 | 0.80 | 0.68 |
| Experiment 1 (code refactor, training batch size 16) | 0.75 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.69 | 0.69 | 0.81 | 0.68 |

Experiment 1: token classification with Hugging Face BERT (07-02-2022)

  • Validation set: 10% of texts (stratified on language)
  • Dataset 07-02-2022
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30 (overlap of 5 tokens)
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.253900
    • Val: 0.290570
    • Test:
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.875714 | 0.665102 | 0.73898 |
| CZ | 0.81 | 0.548696 | 0.635217 |
| DE | 0.975809 | 0.88539 | 0.927579 |
| EN | 0.850833 | 0.535625 | 0.623125 |
| ES | 0.9068 | 0.464 | 0.591 |
| FI | 0.895 | 0.770125 | 0.8235 |
| FR | 0.806301 | 0.48875 | 0.592939 |
| NL | 0.870816 | 0.596939 | 0.662449 |
| PL | 0.8894 | 0.6982 | 0.7738 |
| SL | 0.805 | 0.575833 | 0.63875 |
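
The dataset above cuts each text into overlapping windows of 35 tokens with a step of 30 (so consecutive windows share 5 tokens). A minimal sketch of that windowing; the function name is mine, not from the codebase:

```python
def make_windows(tokens, size=35, step=30):
    """Split a token list into overlapping windows (overlap = size - step)."""
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already covers the end of the text
    return windows
```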

Experiment 2: token classification with Hugging Face BERT, long sequences

  • Validation set: 10% of texts (stratified on language)
  • Dataset 07-02-2022
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 150, step: 150 (no overlap)
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.224500
    • Val: 0.285791
    • Test: 0.4178357720375061
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.85 | 0.693673 | 0.744286 |
| CZ | 0.808043 | 0.623261 | 0.685652 |
| DE | 0.971874 | 0.954152 | 0.962806 |
| EN | 0.823333 | 0.618125 | 0.668333 |
| ES | 0.8722 | 0.52 | 0.6254 |
| FI | 0.896625 | 0.785625 | 0.833375 |
| FR | 0.797703 | 0.571588 | 0.651368 |
| NL | 0.875102 | 0.634286 | 0.690204 |
| PL | 0.8872 | 0.7466 | 0.8026 |
| SL | 0.806667 | 0.653333 | 0.692917 |

Remarks

For some texts, the sequences of length 150 have to be truncated to fit BERT's 512-token input limit, so predictions are missing for the truncated tokens. Decreasing the step size would give us predictions for every token, but it would also introduce more repetition into the training set, which might hurt the results.
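
To estimate how many labels are lost per window, one can count the trailing tokens whose subword pieces no longer fit in the 512-token input. A back-of-the-envelope sketch (the helper name is mine, and special tokens like [CLS]/[SEP] are ignored for simplicity):

```python
def truncated_tokens(subwords_per_token, max_len=512):
    """Count trailing tokens that lose predictions because the window's
    subword expansion exceeds the model's input limit (512 for BERT)."""
    used, kept = 0, 0
    for n in subwords_per_token:
        if used + n > max_len:
            break  # everything from here on is cut off by truncation
        used += n
        kept += 1
    return len(subwords_per_token) - kept
```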

Experiment 1 reproduced with DVC

  • ocrpostcorrection-notebooks commit: 430d228
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2398
    • Val: 0.2871749699115753
    • Test: 0.4474944472312927
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.85 | 0.71 | 0.75 |
| CZ | 0.82 | 0.6 | 0.68 |
| DE | 0.97 | 0.96 | 0.96 |
| EN | 0.81 | 0.6 | 0.66 |
| ES | 0.87 | 0.54 | 0.64 |
| FI | 0.91 | 0.78 | 0.83 |
| FR | 0.8 | 0.63 | 0.69 |
| NL | 0.86 | 0.63 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.62 | 0.67 |
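
The 0.3 threshold above filters out train/validation ‘sentences’ whose OCR text diverges too much from the gold transcription. A sketch of such a filter, assuming character-level Levenshtein distance normalized by the length of the longer string (function names are mine):

```python
def levenshtein(a, b):
    """Character-level edit distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def keep_pair(ocr, gold, threshold=0.3):
    """Keep a sentence pair only if its normalized edit distance is small enough."""
    ned = levenshtein(ocr, gold) / max(len(ocr), len(gold), 1)
    return ned <= threshold
```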

Experiment 1 reproduced with DVC, stratified on subset

  • ocrpostcorrection-notebooks commit: 6231bca
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2439
    • Val: 0.2839458584785461
    • Test: 0.4422231018543243
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.7 | 0.75 |
| CZ | 0.85 | 0.6 | 0.69 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.82 | 0.61 | 0.67 |
| ES | 0.89 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.81 | 0.62 | 0.69 |
| NL | 0.87 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.64 | 0.68 |
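
All runs hold out 10% of the texts for validation, stratified on language (or, in this variant, on subset). A minimal sketch of such a per-group split; names are mine, and the actual pipeline uses its own split code and seed handling:

```python
import random
from collections import defaultdict

def stratified_split(keys, labels, val_fraction=0.1, seed=8232):
    """Hold out val_fraction of the items per label (here: per language)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for key, label in zip(keys, labels):
        by_label[label].append(key)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```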

Experiment 1 with training batch size 32 (2023-07-28)

  • ocrpostcorrection-notebooks commit: 8f7329e
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2625
    • Val: 0.2949527204036712
    • Test: 0.4553228318691253
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.69 | 0.74 |
| CZ | 0.85 | 0.59 | 0.68 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.83 | 0.59 | 0.66 |
| ES | 0.88 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.8 | 0.61 | 0.67 |
| NL | 0.86 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.8 |
| SL | 0.82 | 0.63 | 0.68 |

Remarks

Training batch size was set to 32 (instead of 16). Trained on Google Colab T4 High RAM.

Experiment 1 with training batch size 16 (2023-07-28)

  • ocrpostcorrection-notebooks commit: 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Normalized edit distance threshold for ‘sentences’: 0.3 (only for train and val)
    • Sequence (sentence) length: 35, step: 30
  • Pretrained model: bert-base-multilingual-cased
  • Loss
    • Train: 0.2439
    • Val: 0.2839458584785461
    • Test: 0.4422231018543243
| language | T1_Precision | T1_Recall | T1_Fmeasure |
| --- | --- | --- | --- |
| BG | 0.86 | 0.7 | 0.75 |
| CZ | 0.85 | 0.6 | 0.69 |
| DE | 0.97 | 0.95 | 0.96 |
| EN | 0.82 | 0.61 | 0.67 |
| ES | 0.89 | 0.53 | 0.63 |
| FI | 0.89 | 0.79 | 0.83 |
| FR | 0.81 | 0.62 | 0.69 |
| NL | 0.87 | 0.64 | 0.69 |
| PL | 0.89 | 0.75 | 0.81 |
| SL | 0.81 | 0.64 | 0.68 |

Remarks

Training batch size was set to 16 again. Trained on Google Colab T4 High RAM.

Results are back to what they were for Experiment 1 reproduced with DVC - stratified on subset (as expected). It seems that batch size has only a small impact on performance.

Task 2: Error Correction

Task 1 Perfect

Summarized results (average % of improvement in edit distance between the original and the corrected text). The input is the ‘perfect’ (gold standard) error detection output.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Experiment 1 | -38 | -64 | 47 | -13 | -1 | 14 | -8 | -4 | -14 | -34 |
| Experiment 1 reproduced with DVC | 14 | -64 | 21 | -2 | 17 | 15 | nan | 8 | -5 | -25 |
| Baseline 2 (hidden size 768) | 17 | -67 | 25 | -4 | 17 | 21 | -3 | 10 | -7 | -32 |
| byt5-small experiment 1 | 16 | -18 | 56 | 2 | 7 | 38 | 11 | 6 | 9 | -14 |
| byt5-small experiment 2 | 21 | -14 | 65 | 2 | 10 | 42 | 13 | 14 | 17 | -8 |
| byt5-small experiment 3 | 25 | 8 | 66 | 17 | 19 | 48 | 10 | 24 | 27 | 18 |
| byt5-small experiment 4 | 23 | 10 | 72 | 18 | 22 | 50 | 25 | 28 | 28 | 15 |
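
The summary metric is the percentage of improvement in edit distance to the gold text, averaged over texts; negative values mean the ‘correction’ made the text worse. A sketch of the per-text computation, assuming character-level Levenshtein distance (function names are mine):

```python
def levenshtein(a, b):
    """Character-level edit distance via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pct_improvement(ocr, corrected, gold):
    """Percentage reduction of the edit distance to the gold text;
    negative values mean the correction made things worse."""
    before = levenshtein(ocr, gold)
    after = levenshtein(corrected, gold)
    if before == 0:
        return 0.0  # nothing to improve on a perfect OCR text
    return 100 * (before - after) / before
```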

Task 1 Results

Summarized results (average % of improvement in edit distance between the original and the corrected text). The input is the errors detected by a model; the experiment notes specify which error detection model was used.

| Method | BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCC (2019 competition winner) | 9 | 6 | 24 | 11 | 11 | 8 | 5 | 12 | 17 | 14 |
| Experiment 1 reproduced with DVC | -5 | -45 | 25 | -16 | -6 | 12 | -13 | -9 | -15 | -37 |
| Baseline 2 (hidden size 768) | -4 | -48 | 28 | -20 | -7 | 18 | -13 | -7 | -15 | -47 |
| byt5-small experiment 1 | 10 | -21 | 52 | -9 | 1 | 34 | 2 | -0 | 5 | -24 |
| byt5-small experiment 2 | 13 | -15 | 59 | -7 | 3 | 37 | 1 | 6 | 11 | -19 |
| byt5-small experiment 3 | 16 | -4 | 60 | -7 | 10 | 42 | -2 | 15 | 19 | 11 |
| byt5-small experiment 4 | 14 | -2 | 65 | -4 | 11 | 43 | 7 | 17 | 21 | 1 |

Error Correction Experiment Notes

Correction experiment 1 reproduced with DVC (2023-07-29)

  • ocrpostcorrection-notebooks commit: dc2af99
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: SimpleCorrectionSeq2seq
  • Decoder: GreedySearchDecoder
  • Loss
    • Train: 8.595188051536567
    • Val: 9.212168355464936
    • Test: 9.366749288250466

Trained on Google Colab T4 High RAM.
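
GreedySearchDecoder picks the highest-scoring token at every step until the end-of-sequence symbol is produced. A model-agnostic sketch; `step_fn` stands in for the seq2seq decoder step and the names are mine:

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=22):
    """Greedy search: take the argmax token at each step until EOS or
    max_len (22 matches the max token length used above)."""
    out = [sos_id]
    for _ in range(max_len):
        scores = step_fn(out)  # one score per vocabulary id for the next token
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        out.append(next_id)
    return out[1:]  # drop the start-of-sequence token
```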

Correction experiment baseline 2 (2023-08-05)

  • ocrpostcorrection-notebooks commit: 45fa416
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: SimpleCorrectionSeq2seq
  • Decoder: GreedySearchDecoder
  • Loss (Updated run from commit 765a7df)
    • Train: 7.310251626014709
    • Val: 7.631718857658534
    • Test: 8.613492756178495 (Updated run from commit 3955d38)

The hidden size was set to 768 to create a baseline for an experiment that uses BERT hidden vectors as additional input.

Trained on Google Colab T4 High RAM.

The results in the table were recalculated after the problem with nan and -inf values for two French texts was fixed. The results are tracked in DVC in commit 765a7df.

Experiment byt5-small 1 (2024-03-01)

  • ocrpostcorrection-notebooks commit: b677b6b
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: byt5-small
    • Number of epochs: 1
    • Optimizer: AdamW (default)
  • Loss
    • Train: 0.6192
    • Val: 0.4882390201091766
    • Test: 0.5294567942619324

Trained on Google Colab T4.

Experiment byt5-small 2: Adafactor optimizer (2024-03-08)

  • ocrpostcorrection-notebooks commit: 7665d86
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.4592
    • Val: 0.3836239278316498
    • Test: 0.4266203045845032

Trained on Google Colab T4.

Experiment byt5-small 3: language as task prefix (2024-03-17)

  • ocrpostcorrection-notebooks commit: 855b2cf
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: google/byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.4285
    • Val: 0.356132298707962
    • Test: 0.3938777148723602

Trained on Google Colab T4.
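
Experiment 3 prepends the language as a task prefix to each input. ByT5 operates directly on UTF-8 bytes (token id = byte value + 3, with ids 0/1/2 reserved for pad/EOS/unk), so the prefix is simply extra bytes at the front. A sketch; the exact prefix format (e.g. "FR: ") is my assumption, not necessarily the notebook's:

```python
def byt5_encode(text, eos_id=1, offset=3):
    """ByT5-style encoding: each UTF-8 byte becomes (byte + 3); ids 0, 1, 2
    are reserved for pad, EOS, and unk. EOS is appended at the end."""
    return [b + offset for b in text.encode("utf-8")] + [eos_id]

def with_language_prefix(language, ocr_text):
    """Hypothetical prefix format: prepend the language, e.g. 'FR: <text>'."""
    return byt5_encode(f"{language}: {ocr_text}")
```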

Experiment byt5-small 4: marked errors in context without task prefix (2024-07-27)

  • ocrpostcorrection-notebooks commit: 2fb58d4
  • Detection model from experiment 9099e78
  • Dataset
    • Split seed: 8232
    • Validation set: 10.0%
    • Max token length: 22
  • Model: google/byt5-small
    • Number of epochs: 1
    • Optimizer: Adafactor
  • Loss
    • Train: 0.2885
    • Val: 0.2521135210990906
    • Test: 0.2787809669971466

Trained on Google Colab A100.

ocrpostcorrection version used: e2176bb. This is the commit from before filter_len_ocr_mistake_in_context was added, so no filtering on input length took place. (There was a bug in this function.) Training probably succeeded anyway because the A100 GPU has much more memory than the V100 used for the previous attempt at this experiment.
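
Experiment 4 feeds the model each detected error marked inside its surrounding context, without a task prefix. A sketch of one way to build such an input; the delimiters and context window size are my assumptions, not the notebook's actual format:

```python
def mark_error_in_context(text, start, end, left="[", right="]", window=10):
    """Wrap the detected OCR error (text[start:end]) in delimiters and keep
    a few characters of context on either side."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    return text[lo:start] + left + text[start:end] + right + text[end:hi]
```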