```python
import os
from pathlib import Path

from ocrpostcorrection.icdar_data import generate_data, get_tokens_with_OCR_mistakes

data_dir = Path(os.getcwd()) / "data" / "dataset_training_sample"
data, md = generate_data(data_dir)
val_files = ["en/eng_sample/2.txt"]
token_data = get_tokens_with_OCR_mistakes(data, data, val_files)
print(token_data.shape)
token_data.head()
```
```
2it [00:00, 1508.20it/s]
(80, 12)
```
| | ocr | gs | ocr_aligned | gs_aligned | start | len_ocr | key | language | subset | dataset | len_gs | diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In | | In | ## | 0 | 2 | en/eng_sample/1.txt | en | eng_sample | test | 0 | 2 |
| 1 | troe | tree | troe | tree | 13 | 4 | en/eng_sample/1.txt | en | eng_sample | test | 4 | 0 |
| 2 | peremial | perennial | perem@ial | perennial | 23 | 8 | en/eng_sample/1.txt | en | eng_sample | test | 9 | -1 |
| 3 | eLngated | elongated | eL@ngated | elongated | 46 | 8 | en/eng_sample/1.txt | en | eng_sample | test | 9 | -1 |
| 4 | stein, | stem, | stein, | stem@, | 55 | 6 | en/eng_sample/1.txt | en | eng_sample | test | 5 | 1 |
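In the aligned columns, `@` appears to be alignment padding and `#` in `gs_aligned` marks OCR text with no gold-standard counterpart (row 0 has an empty `gs`, so `len_gs` is 0), while `diff` matches `len_ocr - len_gs` in every row shown. As a quick sanity check, here is a minimal sketch (assuming the `token_data` frame produced above and these alignment conventions) that strips the markers and inspects the tokens whose lengths changed:

```python
# Minimal sketch, assuming the ICDAR alignment conventions described above:
# "@" pads the alignment and "#" marks text absent from the gold standard.

def strip_alignment(token: str) -> str:
    """Remove alignment characters from an aligned token."""
    return token.replace("@", "").replace("#", "")

# De-aligning ocr_aligned should recover the raw OCR token.
print((token_data["ocr_aligned"].map(strip_alignment) == token_data["ocr"]).all())

# diff looks like len_ocr - len_gs; negative values mean the OCR token
# is shorter than its gold-standard counterpart.
print(token_data.query("diff != 0")[["ocr", "gs", "len_ocr", "len_gs", "diff"]].head())
```

Since `val_files` assigns files to the validation split, filtering on the `dataset` column (here `test`) is one way to separate the splits in the same frame.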