g2p

This page is generated via scout. For now it just shows the project README.

Grapheme-to-phoneme (g2p) module. You can use it to generate phones for words in various languages. These phones are intended to be compatible with our current ASR models, but you should still check with the ASR owner. During training, an ASR is set up to use a particular phoneset, which could be IPA, CommonLabel, or anything else. An ASR that uses a given phoneset can only be extended with phones generated in that same phoneset.
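
To illustrate the phoneset constraint, here is a minimal sketch that checks whether a generated pronunciation stays within the phoneset an ASR was trained on. The contents of `ASR_PHONESET` and the `unknown_phones` helper are hypothetical and not part of g2p; the real phoneset has to come from the ASR owner.

```python
from typing import Iterable, List, Set

# Hypothetical phoneset for illustration only; ask the ASR owner for the real one.
ASR_PHONESET: Set[str] = {"ax", "b", "hh", "ao", "r"}


def unknown_phones(pronunciation: Iterable[str], phoneset: Set[str]) -> List[str]:
    """Return the phones in `pronunciation` that are missing from `phoneset`."""
    return [phone for phone in pronunciation if phone not in phoneset]


# A pronunciation that uses only known phones yields an empty list.
print(unknown_phones(["ax", "b", "hh", "ao"], ASR_PHONESET))  # []
# A phone outside the phoneset ("zh" here) is reported.
print(unknown_phones(["ax", "zh"], ASR_PHONESET))  # ['zh']
```

A non-empty result suggests the pronunciation was generated in a different phoneset and should not be added to that ASR's lexicon as-is.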

1. Language Support

| Language       | Phoneset    | Model                                                     | Model Type |
|----------------|-------------|-----------------------------------------------------------|------------|
| Indian English | IPA         | ./models/models36/en_with_hindi_phones/en-hi-ipa-model    | Sequitur   |
| Hindi          | CommonLabel | Use unified parser                                        |            |
| Tamil          | CommonLabel | Use unified parser                                        |            |
| Telugu         | CommonLabel | Use unified parser                                        |            |
| Kannada        | CommonLabel | Use unified parser                                        |            |
| Malayalam      | CommonLabel | Use unified parser                                        |            |
| Punjabi        | IPA         | TBD                                                       | TBD        |
| Bengali        | CommonLabel | Use unified parser                                        |            |
| Marathi        | CommonLabel | Use unified parser                                        |            |
| Gujarati       | CommonLabel | Use unified parser                                        |            |

2. Installation

The recommended way to install is to clone the repository and run poetry install --no-dev. The system is reported to work on Python 3.6.9.

3. Usage

Depending on the model type, you can use various g2p components to convert words into phones. For example, if you have a unified-parser-based model, you should use g2p.components.UnifiedParser.

All components have a similar calling pattern, which looks like this:

from g2p.components import Fallback

cmp = Fallback()
# A component returns Optional[List[List[str]]], which looks something like this:
# [['ax', 'b', 'hh', 'ao'], ['ax', 'b', 'hh', 'ao', 'r']]

cmp("word")
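
Because the return type is Optional, callers generally need to handle the case where no pronunciation comes back. The sketch below uses a hypothetical stand-in, `toy_component`, rather than a real g2p component, and it assumes that None means the component could not produce a pronunciation. The word "abhor" and the tab-separated output format are illustrative choices only.

```python
from typing import List, Optional

Pronunciations = Optional[List[List[str]]]


def toy_component(word: str) -> Pronunciations:
    """Hypothetical stand-in for a g2p component; not part of g2p."""
    lexicon = {"abhor": [["ax", "b", "hh", "ao"], ["ax", "b", "hh", "ao", "r"]]}
    return lexicon.get(word)


def format_pronunciations(word: str, result: Pronunciations) -> List[str]:
    """Render each pronunciation variant as a 'word<TAB>phones' line."""
    if result is None:
        # Assumption: None means no pronunciation was found.
        return []
    return [f"{word}\t{' '.join(phones)}" for phones in result]


print(format_pronunciations("abhor", toy_component("abhor")))
print(format_pronunciations("xyz", toy_component("xyz")))  # []
```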

If you are using a model-based component, it is recommended to use it as a fallback for a dictionary. We keep source lexicons in ./data/sources and store the overall generated ones in the database ./data/pronunciations.db, where each pronunciation is kept along with its source name. Here is how one could go about the whole process for Indian English:

from g2p.components import SequiturModel, ARPAg2p, Chain

# The default Indian English dictionary is ./data/en-hi-names-ipa-lexicon.txt
arpa = ARPAg2p("./data/en-hi-names-ipa-lexicon.txt", name="dict")

model_path = "./models/models36/en_with_hindi_phones/en-hi-ipa-model"
sm = SequiturModel(model_path, name="sequitur-en")
chain_g2p = Chain([arpa, sm])

chain_g2p("cow")
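
The dictionary-then-model fallback can be pictured with the following self-contained sketch. It does not reproduce g2p's actual Chain implementation, whose internals are not described here. `DictComponent`, `spell_out`, and `first_hit` are hypothetical stand-ins, the toy lexicon entry is made up, and the rule that the first non-None result wins is an assumption.

```python
from typing import Callable, Dict, List, Optional, Sequence

Pronunciations = Optional[List[List[str]]]
Component = Callable[[str], Pronunciations]


class DictComponent:
    """Hypothetical dictionary lookup; returns None for unknown words."""

    def __init__(self, lexicon: Dict[str, List[List[str]]]) -> None:
        self.lexicon = lexicon

    def __call__(self, word: str) -> Pronunciations:
        return self.lexicon.get(word)


def spell_out(word: str) -> Pronunciations:
    """Hypothetical stand-in for a model: emits one 'phone' per letter."""
    return [list(word)] if word else None


def first_hit(components: Sequence[Component], word: str) -> Pronunciations:
    """Try components in order and return the first non-None result."""
    for component in components:
        result = component(word)
        if result is not None:
            return result
    return None


dictionary = DictComponent({"cow": [["k", "aw"]]})  # toy entry, not real data
print(first_hit([dictionary, spell_out], "cow"))  # [['k', 'aw']]
print(first_hit([dictionary, spell_out], "cat"))  # [['c', 'a', 't']]
```

Putting the dictionary first means curated pronunciations take priority, and the model is consulted only for words the dictionary does not cover.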

For languages that use the Unified Parser, you need to build unified-parser and pass the path to its binary to the g2p component, as shown below:

from g2p.components import UnifiedParser

up = UnifiedParser("./path-to-up-executable")
up("केला")
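
Since the component depends on an external binary, it may help to confirm that the path points to an executable file before constructing it. The helper below is a hypothetical convenience and not part of g2p; it relies only on the standard library's `shutil.which`.

```python
import shutil


def require_executable(path: str) -> str:
    """Return the resolved executable path, or raise if it cannot be run.

    Hypothetical helper for illustration; g2p may do its own validation.
    """
    resolved = shutil.which(path)
    if resolved is None:
        raise FileNotFoundError(f"No executable found at {path!r}")
    return resolved


try:
    require_executable("./path-to-up-executable")
except FileNotFoundError as error:
    # Reached when the placeholder path has not been replaced with a real binary.
    print(error)
```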