Arshasb

Persian OCR dataset

This repository hosts Arshasb (an ancient Iranian name, اَرشاسب), a Persian OCR dataset.
The dataset contains 33,000 pages of Persian text, of which 7,000 pages have been published for free.
Words that appear next to each other are interdependent and belong to one subject.
More precisely, the word order is meaningful, which makes it possible to use NLP models in the OCR process.
In this dataset, the position of each word is precisely labeled. Look at this sample:

Download

Arshasb_samples.tar.gz contains 100 samples of this dataset.
You can download the 7k-page Arshasb dataset from this link (~730 MB).
If you need the full 33,000-page dataset (not free), contact me at hubare.ra[at]gmail.com.
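
As a minimal sketch, the sample archive could be unpacked with Python's standard `tarfile` module. The function name `extract_samples` and the destination folder are illustrative assumptions, and the internal layout of the archive is not documented here, so the function only reports the top-level entries it finds.

```python
import tarfile
from pathlib import Path


def extract_samples(archive, dest):
    """Extract a .tar.gz archive into dest and return the top-level entry names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Use the safer "data" extraction filter when this Python version has it
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Assumes Arshasb_samples.tar.gz is in the current directory
    print(extract_samples("Arshasb_samples.tar.gz", "Arshasb_samples"))
```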

Detail

After removing numbers and punctuation, the full dataset contains 97,498 unique words. The 7k version contains 40,911 unique words.
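
A rough sketch of how such a count could be computed from a page's full text is shown here. The exact cleaning rules behind the published figures are not specified, so this is an assumption: every character in a Unicode number, punctuation, or symbol category is treated as a separator, and the function name `unique_words` is illustrative. The result may therefore differ from the numbers above.

```python
import unicodedata


def unique_words(text):
    """Return the set of distinct words after dropping numbers and punctuation."""
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        # N* = numbers (including Persian digits), P* = punctuation, S* = symbols
        if category[0] in "NPS":
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    # Splitting on whitespace keeps the zero-width non-joiner inside Persian words
    return set("".join(cleaned).split())


sample = "سلام، دنیا! سلام ۱۲۳ 2024"
print(len(unique_words(sample)))
```

To count over the whole dataset, the sets returned for each fulltext file can be merged with a set union before taking the length.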
The content of this dataset includes public and news texts.
This dataset uses the Far_ketab font. [website]
Each page in this dataset has its own subfolder, named after the page.
Each subfolder contains 4 files. For example, subfolder 00001 contains:
- page_00001.png [ Page image ]
- label_00001.xlsx [ The exact location of each word on the page ]
- fulltext_00001.txt [ Full text of the page ]
- line_00001.xlsx [ The exact location of each line on the page ]

Columns of label_xxxx.xlsx:
- word
- line [ index of the line that contains the word ]
- point1, point2, point3, point4 [ location of each word ]
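
The per-page layout described above can be wrapped in a small helper. This is a sketch only: the function name `page_files` and the dictionary keys are illustrative, and it assumes every page folder uses a 5-digit zero-padded number, as in the 00001 example.

```python
from pathlib import Path


def page_files(root, page_number):
    """Build the paths of the four files that belong to one page."""
    page_id = f"{page_number:05d}"  # assumption: 5-digit zero padding, e.g. 00001
    folder = Path(root) / page_id
    return {
        "image": folder / f"page_{page_id}.png",
        "label": folder / f"label_{page_id}.xlsx",
        "fulltext": folder / f"fulltext_{page_id}.txt",
        "line": folder / f"line_{page_id}.xlsx",
    }


files = page_files("Arshasb_7k", 1)
print(files["label"].as_posix())
```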

Sample code for reading label_xxxx.xlsx

import ast

import pandas as pd

# Load the word-level labels for page 00001
# (reading .xlsx files with pandas requires the openpyxl package)
label = pd.read_excel('Arshasb_7k/00001/label_00001.xlsx')
data = []
for j, row in label.iterrows():
    data.append({
        'number': j,
        'word': row['word'],   # the word itself
        'line': row['line'],   # index of the line that contains the word
        # the four points are stored as text; parse them without eval
        'point1': ast.literal_eval(row['point1']),
        'point2': ast.literal_eval(row['point2']),
        'point3': ast.literal_eval(row['point3']),
        'point4': ast.literal_eval(row['point4']),
    })
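
Once the labels are loaded into a list of dictionaries like `data`, two common follow-up steps are computing a box around each word and regrouping words into lines. The sketch below is not part of the dataset tooling. It assumes each point is an (x, y) pair, which the label description does not state explicitly, and the helper names and demo values are made up for illustration.

```python
from collections import defaultdict


def bounding_box(entry):
    """Axis-aligned box (left, top, right, bottom) around a word's four points."""
    points = [entry[f"point{i}"] for i in range(1, 5)]
    xs = [p[0] for p in points]  # assumption: first value is x
    ys = [p[1] for p in points]  # assumption: second value is y
    return min(xs), min(ys), max(xs), max(ys)


def words_by_line(entries):
    """Group words by their line index, keeping the order found in the file."""
    lines = defaultdict(list)
    for entry in entries:
        lines[entry["line"]].append(entry["word"])
    return dict(lines)


# Made-up entries with the same keys as `data`
demo = [
    {"number": 0, "word": "سلام", "line": 0,
     "point1": (90, 10), "point2": (120, 10), "point3": (120, 30), "point4": (90, 30)},
    {"number": 1, "word": "دنیا", "line": 0,
     "point1": (50, 10), "point2": (85, 10), "point3": (85, 30), "point4": (50, 30)},
    {"number": 2, "word": "خوب", "line": 1,
     "point1": (95, 40), "point2": (120, 40), "point3": (120, 60), "point4": (95, 60)},
]
print(bounding_box(demo[0]))
print(words_by_line(demo))
```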

Donation

I try to publish free Persian datasets on GitHub. Your financial support will encourage me.
Donation links:
https://www.coffeete.ir/persiandataset
https://www.patreon.com/persiandataset
If you are in Iran, contact me at hubare.ra[at]gmail.com to donate.

Open Source Agenda is not affiliated with "Arshasb" Project. README Source: persiandataset/Arshasb

Repository

persiandataset/Arshasb

License

MIT
