```python
import os
from pathlib import Path

# Module path assumed; both helpers come from the accompanying package
from ocrpostcorrection.icdar_data import generate_data, get_tokens_with_OCR_mistakes

data_dir = Path(os.getcwd()) / "data" / "dataset_training_sample"
data, md = generate_data(data_dir)

val_files = ["en/eng_sample/2.txt"]  # held out as validation set
token_data = get_tokens_with_OCR_mistakes(data, data, val_files)
print(token_data.shape)
token_data.head()
```
```
2it [00:00, 1508.20it/s]
(80, 12)
```
|   | ocr | gs | ocr_aligned | gs_aligned | start | len_ocr | key | language | subset | dataset | len_gs | diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In |  | In | ## | 0 | 2 | en/eng_sample/1.txt | en | eng_sample | test | 0 | 2 |
| 1 | troe | tree | troe | tree | 13 | 4 | en/eng_sample/1.txt | en | eng_sample | test | 4 | 0 |
| 2 | peremial | perennial | perem@ial | perennial | 23 | 8 | en/eng_sample/1.txt | en | eng_sample | test | 9 | -1 |
| 3 | eLngated | elongated | eL@ngated | elongated | 46 | 8 | en/eng_sample/1.txt | en | eng_sample | test | 9 | -1 |
| 4 | stein, | stem, | stein, | stem@, | 55 | 6 | en/eng_sample/1.txt | en | eng_sample | test | 5 | 1 |
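In the aligned columns, `@` marks an alignment gap (a character present on one side only), while `##` in `gs_aligned` (row 0) indicates that no gold standard is available for that token; `diff` is the length difference between the OCR token and its gold standard, consistent with `len_ocr` and `len_gs` in every row above. A minimal sketch of how the resulting DataFrame can be inspected, assuming only the column names visible in the table:

```python
# Sanity check: diff should equal len_ocr - len_gs (holds for all rows shown)
assert (token_data["diff"] == token_data["len_ocr"] - token_data["len_gs"]).all()

# Number of mistake tokens per split and language
print(token_data.groupby(["dataset", "language"]).size())

# Tokens whose OCR alignment contains a gap character
gaps = token_data[token_data["ocr_aligned"].str.contains("@", regex=False)]
print(gaps[["ocr", "gs", "ocr_aligned", "gs_aligned"]])
```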