Error Correction

Functionality for error correction (task 2 of the ICDAR post-OCR text correction competition)

Dataset Creation

A dataset for token correction consists of pairs of OCR strings and their gold standard corrections, stored as AlignedTokens. These can be extracted from the Text objects using the get_tokens_with_OCR_mistakes function. This function also adds data properties that can be used for calculating statistics about the dataset.


get_tokens_with_OCR_mistakes

 get_tokens_with_OCR_mistakes (data:Dict[str,ocrpostcorrection.icdar_data.Text],
                               data_test:Dict[str,ocrpostcorrection.icdar_data.Text],
                               val_files:List[str])

Return a pandas DataFrame with all OCR mistakes from the train, val, and test sets

The following code example shows how to use this function. For simplicity, in the example below the data dictionary (which contains <file name>: Text pairs) is used both as the train/val set and as the test set.

import os
from pathlib import Path

from ocrpostcorrection.icdar_data import generate_data

data_dir = Path(os.getcwd()) / "data" / "dataset_training_sample"
data, md = generate_data(data_dir)
val_files = ["en/eng_sample/2.txt"]

token_data = get_tokens_with_OCR_mistakes(data, data, val_files)
print(token_data.shape)
token_data.head()
2it [00:00, 1508.20it/s]
(80, 12)
   ocr       gs         ocr_aligned  gs_aligned  start  len_ocr  key                  language  subset      dataset  len_gs  diff
0  In                   In           ##              0        2  en/eng_sample/1.txt  en        eng_sample  test          0     2
1  troe      tree       troe         tree           13        4  en/eng_sample/1.txt  en        eng_sample  test          4     0
2  peremial  perennial  perem@ial    perennial      23        8  en/eng_sample/1.txt  en        eng_sample  test          9    -1
3  eLngated  elongated  eL@ngated    elongated      46        8  en/eng_sample/1.txt  en        eng_sample  test          9    -1
4  stein,    stem,      stein,       stem@,         55        6  en/eng_sample/1.txt  en        eng_sample  test          5     1
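
In these rows, diff appears to be len_ocr - len_gs, the length difference between the OCR token and its gold standard. This reading is an inference from the sample rows, not documented behavior; it can be checked with:

assert (token_data["len_ocr"] - token_data["len_gs"] == token_data["diff"]).all()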

Get the context of an OCR mistake.


get_OCR_mistakes_in_context

 get_OCR_mistakes_in_context (data:Dict[str,ocrpostcorrection.icdar_data.Text],
                              data_test:Dict[str,ocrpostcorrection.icdar_data.Text],
                              ocr_mistakes:pandas.core.frame.DataFrame,
                              offset:int)

get_context_for_dataset

 get_context_for_dataset (data:Dict[str,ocrpostcorrection.icdar_data.Text],
                          ocr_mistakes:pandas.core.frame.DataFrame,
                          offset:int)

get_closest_value

 get_closest_value (lst:List, value:int)
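
get_closest_value has no description here. Judging by its name and signature, it presumably returns the element of lst closest to value, which is useful for snapping context windows to token boundaries. A minimal sketch under that assumption (not necessarily the library's implementation):

from typing import List

def get_closest_value_sketch(lst: List[int], value: int) -> int:
    # Return the element of lst with the smallest absolute distance to value
    return min(lst, key=lambda x: abs(x - value))
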
token_data2 = get_OCR_mistakes_in_context(data, data, token_data, offset=20)
print(token_data2.shape)
token_data2.head()
100%|██████████| 4/4 [00:00<00:00, 873.54it/s]
100%|██████████| 4/4 [00:00<00:00, 913.24it/s]
(80, 15)
   ocr       gs         ocr_aligned  gs_aligned  start  len_ocr  key                  language  subset      dataset  len_gs  diff  context_before          context_after           len_mistake_in_context
0  In                   In           ##              0        2  en/eng_sample/1.txt  en        eng_sample  train         0     2                          botany, a troe is a                         22
1  troe      tree       troe         tree           13        4  en/eng_sample/1.txt  en        eng_sample  train         4     0  In botany, a            is a peremial plant                         37
2  peremial  perennial  perem@ial    perennial      23        8  en/eng_sample/1.txt  en        eng_sample  train         9    -1  botany, a troe is a     plant with an eLngated                      51
3  eLngated  elongated  eL@ngated    elongated      46        8  en/eng_sample/1.txt  en        eng_sample  train         9    -1  peremial plant with an  stein, or trunk,                            48
4  stein,    stem,      stein,       stem@,         55        6  en/eng_sample/1.txt  en        eng_sample  train         5     1  plant with an eLngated  or trunk, suppor ing                        50
token_data2.tail()
    ocr         gs          ocr_aligned  gs_aligned  start  len_ocr  key                 language  subset     dataset  len_gs  diff  context_before         context_after         len_mistake_in_context
35  test-FFF    test- FFF   test-@FFF    test- FFF      48        8  fr/fr_sample/2.txt  fr        fr_sample  test          9    -1  test -DDD test- EEE    test-GGG test - HHH                       48
36  test-GGG    test -GGG   test@-GGG    test -GGG      57        8  fr/fr_sample/2.txt  fr        fr_sample  test          9    -1  test- EEE test-FFF     test - HHH test-III                       47
37  test - HHH  test-HHH    test - HHH   test@-@HHH     66       10  fr/fr_sample/2.txt  fr        fr_sample  test          8     2  EEE test-FFF test-GGG  test-III test - JJJ                      52
38  test-III    test - III  test@-@III   test - III     77        8  fr/fr_sample/2.txt  fr        fr_sample  test         10    -2  test-GGG test - HHH    test - JJJ blablabla                      49
39  ?           !           ?            !             107        1  fr/fr_sample/2.txt  fr        fr_sample  test          1     0  test - JJJ blablabla                                             22
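
The offset parameter controls roughly how many characters of context are kept on either side of a mistake. The examples above suggest the extracted windows are snapped to token boundaries; a simplified, purely character-based sketch of the idea (an illustration, not the library's exact code):

def get_context_sketch(text: str, start: int, length: int, offset: int):
    # Take up to `offset` characters before and after the mistake;
    # the snapping of these windows to token boundaries is omitted here
    context_before = text[max(0, start - offset):start]
    context_after = text[start + length:start + length + offset]
    return context_before, context_after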

Create vocabularies

Source: https://pytorch.org/tutorials/beginner/translation_transformer.html

Define special symbols and indices, and make sure the tokens are in order of their indices, so they are inserted into the vocabulary correctly.


yield_tokens

 yield_tokens (data, col)

Helper function to create a vocabulary of characters


generate_vocabs

 generate_vocabs (train)

Generate ocr and gs vocabularies from the train set
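
The tutorial linked above builds its vocabularies with torchtext. A sketch of how generate_vocabs could be implemented along those lines, assuming the tutorial's usual special-symbol convention (<unk>, <pad>, <bos>, <eos> at indices 0-3); this is an illustration, not necessarily the library's exact code:

from torchtext.vocab import build_vocab_from_iterator

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ["<unk>", "<pad>", "<bos>", "<eos>"]

def generate_vocabs_sketch(train):
    vocab_transform = {}
    for col in ("ocr", "gs"):
        vocab_transform[col] = build_vocab_from_iterator(
            yield_tokens(train, col),
            min_freq=1,
            specials=special_symbols,
            special_first=True,
        )
        # Map out-of-vocabulary characters to <unk>
        vocab_transform[col].set_default_index(UNK_IDX)
    return vocab_transform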

Use the train set to create the ocr and gs vocabularies:

vocab_transform = generate_vocabs(token_data.query('dataset == "train"'))
len(vocab_transform["ocr"]), len(vocab_transform["gs"])
(46, 44)

Collation

The character sequences need to be transformed into tensors of indices.

Source: https://pytorch.org/tutorials/beginner/translation_transformer.html


get_text_transform

 get_text_transform (vocab_transform)

Returns text transforms to convert raw strings into tensors of indices


tensor_transform

 tensor_transform (token_ids:List[int])

Function to add BOS/EOS and create tensor for input sequence indices


sequential_transforms

 sequential_transforms (*transforms)

Helper function to club together sequential operations
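
These helpers mirror the translation tutorial cited above. A sketch of how they could fit together, reusing the special-symbol indices from the vocabulary section (illustrative only; note that the example outputs below show just the EOS index, 3, being appended, so this sketch adds EOS only):

import torch

def sequential_transforms_sketch(*transforms):
    # Compose the given transforms left to right
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

def tensor_transform_sketch(token_ids):
    # Turn a list of vocabulary indices into a tensor ending with EOS
    return torch.cat((torch.tensor(token_ids), torch.tensor([EOS_IDX])))

def get_text_transform_sketch(vocab_transform):
    # First look up the character indices, then create the tensor
    return {
        col: sequential_transforms_sketch(vocab_transform[col], tensor_transform_sketch)
        for col in ("ocr", "gs")
    }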

text_transform = get_text_transform(vocab_transform)

text_transform["ocr"](["t", "e", "s", "t", "-", " ", "A", "A", "A"])
tensor([ 4,  5,  6,  4,  7, 10, 13, 13, 13,  3])
text_transform = get_text_transform(vocab_transform)

print(text_transform["ocr"](["e", "x", "a", "m", "p", "l", "e"]))
print(text_transform["gs"](["e", "x", "a", "m", "p", "l", "e"]))
tensor([ 5,  0, 21, 34, 22, 33,  5,  3])
tensor([ 5,  0, 21, 27, 23, 26,  5,  3])

Neural network

Source: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html


EncoderRNN

 EncoderRNN (input_size, hidden_size)

Encoder network consisting of an embedding layer followed by a GRU. Based on the PyTorch seq2seq tutorial linked above.


AttnDecoderRNN

 AttnDecoderRNN (hidden_size, output_size, dropout_p=0.1, max_length=11)

Attention decoder network consisting of an embedding layer, attention and attention-combine linear layers, dropout, a GRU, and a linear output layer. Based on the PyTorch seq2seq tutorial linked above.


SimpleCorrectionSeq2seq

 SimpleCorrectionSeq2seq (input_size, hidden_size, output_size, dropout,
                          max_length, teacher_forcing_ratio, device='cpu')

Sequence-to-sequence model for token correction that combines an EncoderRNN with an AttnDecoderRNN. The teacher_forcing_ratio argument controls how often teacher forcing is applied during training.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_size = 2
hidden_size = 5
dropout = 0.1
max_token_len = 10

model = SimpleCorrectionSeq2seq(
    len(vocab_transform["ocr"]),
    hidden_size,
    len(vocab_transform["gs"]),
    dropout,
    max_token_len,
    teacher_forcing_ratio=0.5,
    device=device,
)

# Input and target batches have shape (sequence length, batch size);
# index 3 marks end-of-sequence and index 1 padding
input = torch.tensor([[6, 4], [22, 30], [0, 6], [18, 4], [11, 3], [3, 1]])
encoder_hidden = model.encoder.initHidden(batch_size=batch_size, device=device)

target = torch.tensor([[6, 4], [23, 5], [16, 6], [16, 4], [11, 4], [3, 1]])

losses, _ = model(input, encoder_hidden, target)
losses
tensor([-23.0017, -19.0353], grad_fn=<SumBackward1>)

Evaluation

model_save_path = Path(os.getcwd()) / "data" / "model.rar"

checkpoint = torch.load(model_save_path)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.eval()
SimpleCorrectionSeq2seq(
  (encoder): EncoderRNN(
    (embedding): Embedding(46, 5)
    (gru): GRU(5, 5, batch_first=True)
  )
  (decoder): AttnDecoderRNN(
    (embedding): Embedding(44, 5)
    (attn): Linear(in_features=10, out_features=11, bias=True)
    (attn_combine): Linear(in_features=10, out_features=5, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (gru): GRU(5, 5)
    (out): Linear(in_features=5, out_features=44, bias=True)
  )
)

indices2string

 indices2string (indices, itos)

indices = torch.tensor(
    [
        [20, 34, 22, 6, 1, 1, 1, 1, 1, 1],
        [22, 6, 1, 1, 1, 1, 1, 1, 1, 1],
        [21, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [4, 5, 6, 4, 1, 1, 1, 1, 1, 1],
        [29, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ]
)
indices2string(indices, vocab_transform["gs"].get_itos())
['This', 'is', 'a', 'test', '!']
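
indices2string has no description here, but the example shows that it maps each row of an index tensor back to a string, dropping special tokens such as padding. A plausible sketch, reusing the special-symbol indices from the vocabulary section (illustrative only):

def indices2string_sketch(indices, itos):
    strings = []
    for row in indices:
        chars = [
            itos[idx]
            for idx in row.tolist()
            if idx not in (UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX)  # skip special symbols
        ]
        strings.append("".join(chars))
    return strings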
from torch.utils.data import DataLoader

decoder = GreedySearchDecoder(model)

max_len = 10

test = SimpleCorrectionDataset(token_data.query('dataset == "test"'), max_len=max_len)
test_dataloader = DataLoader(test, batch_size=5, collate_fn=collate_fn(text_transform))


output_strings = predict_and_convert_to_str(
    model, test_dataloader, vocab_transform["gs"], device
)
100%|██████████| 7/7 [00:00<00:00, 163.36it/s]
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
torch.Size([1, 5, 5])
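
collate_fn is used above as a factory: it takes text_transform and returns the function that the DataLoader uses to turn a batch of samples into padded tensors. Its implementation is not shown on this page; following the padding pattern of the translation tutorial cited earlier, and assuming each dataset item is an (ocr, gs) pair of character lists, it might look roughly like this:

from torch.nn.utils.rnn import pad_sequence

def collate_fn_sketch(text_transform):
    def collate(batch):
        # Convert each sample to index tensors, then pad to equal length
        ocr_batch, gs_batch = [], []
        for ocr_sample, gs_sample in batch:
            ocr_batch.append(text_transform["ocr"](ocr_sample))
            gs_batch.append(text_transform["gs"](gs_sample))
        ocr_padded = pad_sequence(ocr_batch, padding_value=PAD_IDX)
        gs_padded = pad_sequence(gs_batch, padding_value=PAD_IDX)
        return ocr_padded, gs_padded

    return collate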
max_len = 10
test_data = (
    token_data.query('dataset == "test"')
    .query(f"len_ocr <= {max_len}")
    .query(f"len_gs <= {max_len}")
    .copy()
)

test_data["pred"] = output_strings

Performance measure: mean normalized edit distance

  • Mean (normalized) edit distance (a sketch of the computation follows this list)
    • Option: ignore "-"
    • Option: ignore case
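
edlib computes the raw edit distance below; the exact formula inside normalized_ed is not shown on this page. A common normalization, consistent with the statistics reported below, divides the edit distance by the length of the longer string. The following sketch uses that definition, but it is an assumption, not necessarily the library's:

import edlib

def normalized_ed_sketch(ed, ocr, gs):
    # Assumed normalization: edit distance divided by the longer string length
    # (the max with 1 guards against two empty strings)
    return ed / max(len(ocr), len(gs), 1)

ed = edlib.align("troe", "tree")["editDistance"]  # one substitution, so ed == 1
normalized_ed_sketch(ed, "troe", "tree")  # 0.25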
test_data["ed"] = test_data.apply(
    lambda row: edlib.align(row.ocr, row.gs)["editDistance"], axis=1
)
test_data.ed.describe()
count    35.000000
mean      1.971429
std       1.773758
min       1.000000
25%       1.000000
50%       1.000000
75%       2.000000
max       8.000000
Name: ed, dtype: float64
test_data["ed_norm"] = test_data.apply(
    lambda row: normalized_ed(row.ed, row.ocr, row.gs), axis=1
)
test_data.ed_norm.describe()
count    35.000000
mean      0.390952
std       0.326909
min       0.100000
25%       0.125000
50%       0.250000
75%       0.583333
max       1.000000
Name: ed_norm, dtype: float64
test_data["ed_pred"] = test_data.apply(
    lambda row: edlib.align(row.pred, row.gs)["editDistance"], axis=1
)
test_data.ed_pred.describe()
count    35.000000
mean      8.057143
std       3.253440
min       1.000000
25%       7.000000
50%      10.000000
75%      10.000000
max      11.000000
Name: ed_pred, dtype: float64
test_data["ed_norm_pred"] = test_data.apply(
    lambda row: normalized_ed(row.ed_pred, row.pred, row.gs), axis=1
)
test_data.ed_norm_pred.describe()
count    35.000000
mean      0.989351
std       0.030110
min       0.900000
25%       1.000000
50%       1.000000
75%       1.000000
max       1.000000
Name: ed_norm_pred, dtype: float64
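
Note that for this small demonstration setup the mean normalized edit distance of the predictions (roughly 0.99) is much higher than that of the uncorrected OCR tokens (roughly 0.39): the predictions are further from the gold standard than the raw OCR, which is unsurprising for a model with hidden size 5 loaded from a small sample checkpoint.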