Persian OCR dataset
After removing numbers and punctuation, the dataset contains 97,498 unique words. The 7k version contains 40,911 unique words.
The content of this dataset includes public and news texts.
This dataset uses the Far_ketab font. [website]
For each page in this dataset, a subfolder with the same name as the page has been created.
Each subfolder contains four files. For example, subfolder 00001 contains:
1. page_00001.png [ Page image ]
2. label_00001.xlsx [ The exact location of each word on the page ]
3. fulltext_00001.txt [ Full text of the page ]
4. line_00001.xlsx [ The exact location of each line on the page ]
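The folder layout above can be sketched as a small path helper. This is only an illustration: the function name `page_files` and the dictionary keys are hypothetical, and the five-digit zero padding is an assumption based on the `00001` example.

```python
from pathlib import Path

def page_files(root, page_number):
    """Return the expected paths of the four files for one page.

    Assumes page folders are zero-padded to five digits (as in 00001).
    """
    page_id = f"{page_number:05d}"
    folder = Path(root) / page_id
    return {
        "image": folder / f"page_{page_id}.png",         # page image
        "words": folder / f"label_{page_id}.xlsx",       # word locations
        "fulltext": folder / f"fulltext_{page_id}.txt",  # full page text
        "lines": folder / f"line_{page_id}.xlsx",        # line locations
    }

files = page_files("Arshasb_7k", 1)
```

Calling `files["image"].exists()` before loading is a cheap way to check that the padding assumption matches the copy of the dataset on disk.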
Reading the columns of label_xxxx.xlsx:
    import ast
    import pandas as pd

    # Load the word-level annotations for page 00001
    label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')
    data = []
    for j in range(len(label)):
        # Word text
        word = label['word'][j]
        # Index of the line that contains the word
        index_line = label['line'][j]
        # Corner points are stored as strings; parse them without running arbitrary code
        point1 = ast.literal_eval(label['point1'][j])
        point2 = ast.literal_eval(label['point2'][j])
        point3 = ast.literal_eval(label['point3'][j])
        point4 = ast.literal_eval(label['point4'][j])
        data.append({'number': j, 'word': word, 'line': index_line,
                     'point1': point1, 'point2': point2,
                     'point3': point3, 'point4': point4})
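Once the `data` list is built, two common follow-up steps are turning the four corner points into an axis-aligned box and grouping words by line. The sketch below is a minimal illustration, not part of the dataset tooling: the helper names `bounding_box` and `words_by_line` are hypothetical, and it assumes each point is an `(x, y)` pair. The order of the corners does not matter for this computation.

```python
from collections import defaultdict

def bounding_box(record):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the four corners.

    Assumes each point is an (x, y) pair; corner order is irrelevant.
    """
    points = [record[f"point{i}"] for i in range(1, 5)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def words_by_line(records):
    """Group word strings by line index, preserving the order of the records."""
    lines = defaultdict(list)
    for record in records:
        lines[record["line"]].append(record["word"])
    return dict(lines)
```

For example, `bounding_box(data[0])` gives a box that can be passed to an image-cropping routine, and `words_by_line(data)` returns a dictionary mapping each line index to its words. Whether the row order in the spreadsheet matches the right-to-left reading order is not stated here, so check it against fulltext_00001.txt before relying on it.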
I try to publish free Persian datasets on GitHub. Your financial support will encourage me.
Donation links:
https://www.coffeete.ir/persiandataset
https://www.patreon.com/persiandataset
If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.