g2p
Grapheme-to-phoneme (g2p) module. You can use it to generate phones for words in various languages. These phones are compatible with our current ASR models, but you should still check with the ASR owner. During training, an ASR model is tied to a particular phoneset, which could be IPA, CommonLabel, or anything else. An ASR model that uses a given phoneset can only be extended with phones generated in that same phoneset.
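Because an ASR can only be extended with phones from its own phoneset, it may help to validate generated pronunciations before adding them. The following is a minimal sketch and not part of g2p: the function `out_of_phoneset` and the tiny `asr_phoneset` inventory are hypothetical, and the real inventory should come from the ASR owner.

```python
from typing import Iterable, List, Set


def out_of_phoneset(pronunciations: Iterable[List[str]], phoneset: Set[str]) -> Set[str]:
    """Return every phone used in the pronunciations that the phoneset lacks."""
    unknown: Set[str] = set()
    for phones in pronunciations:
        unknown.update(p for p in phones if p not in phoneset)
    return unknown


# Hypothetical inventory; the real one must come from the ASR owner.
asr_phoneset = {"ax", "b", "hh", "ao", "r"}

# Made-up candidate pronunciations in the shape that components return.
candidates = [["ax", "b", "hh", "ao"], ["ax", "b", "zh", "ao"]]

unknown = out_of_phoneset(candidates, asr_phoneset)
if unknown:
    print("phones missing from the ASR phoneset:", sorted(unknown))
```

An empty result means every generated phone is already known to the ASR; anything else is worth raising with the ASR owner before extending the lexicon.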
1. Language Support
| Language | Phoneset | Model | Model Type |
|---|---|---|---|
| Indian English | IPA | ./models/models36/en_with_hindi_phones/en-hi-ipa-model | Sequitur |
| Hindi | CommonLabel | Use unified parser | |
| Tamil | CommonLabel | Use unified parser | |
| Telugu | CommonLabel | Use unified parser | |
| Kannada | CommonLabel | Use unified parser | |
| Malayalam | CommonLabel | Use unified parser | |
| Punjabi | IPA | TBD | TBD |
| Bengali | CommonLabel | Use unified parser | |
| Marathi | CommonLabel | Use unified parser | |
| Gujarati | CommonLabel | Use unified parser | |
2. Installation
The recommended way to install is to clone the repository and run `poetry install
--no-dev`. The system is reported to work on Python 3.6.9.
3. Usage
Depending on the model type, you can use various g2p components to convert words
into phones. For example, if you have a unified parser based model, you should use
`g2p.components.UnifiedParser`.
All components have a similar calling pattern that looks like this:

    from g2p.components import Fallback

    cmp = Fallback()

    # A component returns Optional[List[List[str]]], which looks something like
    # this: [['ax', 'b', 'hh', 'ao'], ['ax', 'b', 'hh', 'ao', 'r']]
    cmp("word")
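Since a component may return `None`, callers usually need to handle the case where no pronunciation was found. Here is a minimal sketch that assumes only the return type described above. `fake_component` is a hypothetical stand-in so the example runs without g2p installed, its single entry is made-up data, and the space-joined output format is an assumption.

```python
from typing import Callable, List, Optional

Pronunciations = Optional[List[List[str]]]


def fake_component(word: str) -> Pronunciations:
    """Hypothetical stand-in for a g2p component; knows a single made-up entry."""
    known = {"word": [["ax", "b", "hh", "ao"], ["ax", "b", "hh", "ao", "r"]]}
    return known.get(word)


def lexicon_lines(word: str, component: Callable[[str], Pronunciations]) -> List[str]:
    """Format each pronunciation as 'word phone phone ...'; empty list if none."""
    prons = component(word)
    if prons is None:
        return []
    return [f"{word} {' '.join(phones)}" for phones in prons]


print(lexicon_lines("word", fake_component))
print(lexicon_lines("xyzzy", fake_component))
```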
If you are using a model-based component, it's recommended to use it as a
fallback for a dictionary. We keep source lexicons in `./data/sources` and put
the overall generated ones in the database `./data/pronunciations.db`. In the
database, we keep pronunciations along with their source names. Here is how one
could go about the whole process for Indian English:

    from g2p.components import SequiturModel, ARPAg2p, Chain

    # The default Indian English dictionary is ./data/en-hi-names-ipa-lexicon.txt
    arpa = ARPAg2p("./data/en-hi-names-ipa-lexicon.txt", name="dict")

    model_path = "./models/models36/en_with_hindi_phones/en-hi-ipa-model"
    sm = SequiturModel(model_path, name="sequitur-en")

    chain_g2p = Chain([arpa, sm])
    chain_g2p("cow")
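The dictionary-then-model fallback can be pictured as trying each component in order and returning the first non-empty result. This is a sketch of the idea only and is not taken from g2p's `Chain` implementation, which may behave differently. `first_match`, `dictionary`, and `model` are hypothetical stand-ins, and the phones are made up.

```python
from typing import Callable, List, Optional, Sequence

Pronunciations = Optional[List[List[str]]]
Component = Callable[[str], Pronunciations]


def first_match(components: Sequence[Component], word: str) -> Pronunciations:
    """Try components in order; return the first result that is not empty."""
    for component in components:
        result = component(word)
        if result:
            return result
    return None


def dictionary(word: str) -> Pronunciations:
    """Hypothetical lexicon lookup with a made-up entry."""
    return {"cow": [["k", "aw"]]}.get(word)


def model(word: str) -> Pronunciations:
    """Hypothetical model stand-in: emits one 'phone' per letter."""
    return [list(word)] if word else None


print(first_match([dictionary, model], "cow"))  # served by the dictionary
print(first_match([dictionary, model], "cat"))  # falls through to the model
```

Putting the dictionary first means curated pronunciations take precedence, and the model is only consulted for words the lexicon does not cover.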
For languages that use the Unified Parser, you need to build unified-parser and pass the path to its binary to the g2p component, as shown below:

    from g2p.components import UnifiedParser

    up = UnifiedParser("./path-to-up-executable")
    up("केला")