Persian Phonemizer

A tool for translating Persian text to IPA (International Phonetic Alphabet).

persian_phonemizer

In Persian, a single written word can have several pronunciations, each with a different meaning. This library helps disambiguate such words.

Example use cases:

- Input for TTS systems
- Helping people learn Persian
- Adding pronunciations for Persian words in texts written in other languages

Installation

pip install persian_phonemizer

Usage

Quick start:

>>> from persian_phonemizer import Phonemizer
>>> phonemizer = Phonemizer()
>>> phonemizer.phonemize("آن مرد مرد.")
'ʔɒːn mæɾd moɾd .'
>>> phonemizer.phonemize("دوچرخه جدید علی گم شد.")
'dovtʃʰæɾxeje dʒædiːde ʔæliː ɡom ʃod .'

To output Persian text with eraab (diacritics) instead of IPA, set output_format:

>>> phonemizer = Phonemizer(output_format='eraab')
>>> phonemizer.phonemize("آن مرد مرد.")
'آن مَرد مُرد .'
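
For the TTS use case, it can be handy to transcribe many lines at once. The helper below is only a sketch: `phonemize_lines` is a hypothetical name, not part of the package, and it assumes nothing beyond an object that exposes the `phonemize(text)` method shown above.

```python
def phonemize_lines(phonemizer, lines):
    """Return (original, transcription) pairs for each non-empty line.

    `phonemizer` is any object with a `phonemize(text)` method, such as
    the Phonemizer instance created in the quick start.
    """
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of sending them to the model
        results.append((text, phonemizer.phonemize(text)))
    return results


# Usage (requires persian_phonemizer to be installed):
# from persian_phonemizer import Phonemizer
# pairs = phonemize_lines(Phonemizer(), ["آن مرد مرد.", "", "دوچرخه جدید علی گم شد."])
```

Passing the phonemizer in as an argument lets the same helper work with either output format, since only the constructor call changes.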

What's inside?

- A database of words with their part of speech, pronunciation, and meaning, based on the Moen dictionary
  - A script for parsing the Dehkhoda dictionary is available in the dataset directory. Its results are not used in the package, because some of its pronunciations are outdated and would do more harm than good.
- A part-of-speech tagger and a dependency parser trained on the Universal Dependencies dataset using spaCy
- A grapheme-to-phoneme model: a seq-to-seq neural network implemented in PyTorch. More information is provided in the g2p_fa repo.

These assets were created for this repo, but each one can also be used on its own.
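
To make the database description concrete, here is one way such entries could be modelled. This is an illustration only: the `Entry` class, its field names, and the sample rows are assumptions made for the example and do not reflect the package's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One dictionary row (hypothetical layout, not the package's schema)."""
    word: str           # written form, without diacritics
    pos: str            # part-of-speech tag
    pronunciation: str  # IPA transcription
    meaning: str        # short gloss


# Sample rows based on the homograph from the usage example.
ENTRIES = [
    Entry("مرد", "NOUN", "mæɾd", "man"),
    Entry("مرد", "VERB", "moɾd", "died"),
]


def build_index(entries):
    """Group entries by written form so that homographs share one key."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.word].append(entry)
    return dict(index)
```

Grouping by written form is what exposes the ambiguity: a key with more than one entry is a word that needs disambiguation.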

How does it work?

The package combines several steps to find the correct pronunciation:

- The input text is normalized and tokenized.
- The root of each word is found with a lemmatizer, to cover complex verbs and nouns.
- Each word is looked up in the database for pronunciations.
  - If no pronunciation is available, one is predicted using g2p_fa.
  - If there is exactly one pronunciation, that one is used.
  - If there is more than one pronunciation, the correct one is chosen based on the part-of-speech tag of the word.
- Suffix and prefix pronunciations are added to each word.
- The ezafe (e or je) is added between words where needed, using the dependency parser.
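
The lookup step and the linking e or je (the ezafe) can be sketched roughly as follows. This is not the package's code: the toy lexicon, the function names, the placeholder fallback, and the vowel rule are all assumptions chosen to keep the example self-contained. In the real package the fallback is the g2p_fa model, and the decision of where the ezafe belongs comes from the dependency parser.

```python
# Toy lexicon: written form -> list of (POS tag, IPA). Illustrative data only.
LEXICON = {
    "مرد": [("NOUN", "mæɾd"), ("VERB", "moɾd")],
    "آن": [("DET", "ʔɒːn")],
}

# Simplified vowel inventory used for the e / je choice (an assumption).
VOWELS = set("æeoɒiu")


def fallback_g2p(word):
    """Placeholder for a grapheme-to-phoneme model; only marks the word."""
    return f"<g2p:{word}>"


def choose_pronunciation(word, pos, lexicon=LEXICON, g2p=fallback_g2p):
    """Pick a pronunciation following the zero / one / many cases."""
    candidates = lexicon.get(word, [])
    if not candidates:
        return g2p(word)          # unknown word: predict it
    if len(candidates) == 1:
        return candidates[0][1]   # unambiguous: use the only entry
    for candidate_pos, ipa in candidates:
        if candidate_pos == pos:
            return ipa            # ambiguous: match on the POS tag
    return candidates[0][1]       # no tag matched: first entry (assumption)


def add_ezafe(ipa):
    """Append "je" after a final vowel and "e" after a final consonant."""
    last = ipa.rstrip("ː")[-1]    # ignore a trailing length mark
    return ipa + ("je" if last in VOWELS else "e")
```

With this toy data, `choose_pronunciation("مرد", "NOUN")` gives `'mæɾd'` and `choose_pronunciation("مرد", "VERB")` gives `'moɾd'`, matching the two readings in the usage example. `add_ezafe("dʒædiːd")` gives `'dʒædiːde'`, the form seen in the second usage example.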

Open Source Agenda is not affiliated with "Persian Phonemizer" Project. README Source: de-mh/persian_phonemizer

Repository

de-mh/persian_phonemizer

License

MIT
