g2p

1. Language Support
2. Installation
3. Usage

Grapheme-to-phoneme (g2p) module. Use it to generate phones for words in various languages. The phones are meant to be compatible with our current ASR models, but you should still check with the ASR owner. During training, an ASR is tied to a particular phoneset, which could be IPA, CommonLabel, or anything else. An ASR trained on a given phoneset can only be extended with phones generated in that same phoneset.
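
Since an ASR can only be extended with phones from its own phoneset, it can help to check generated pronunciations before handing them over. The following is a minimal sketch, not part of g2p: the function name `phones_outside_phoneset` and the tiny phoneset are hypothetical and exist only for illustration.

```python
from typing import Iterable, List, Set


def phones_outside_phoneset(
    pronunciations: Iterable[List[str]], phoneset: Set[str]
) -> Set[str]:
    """Return every phone used in the pronunciations that is not in the phoneset."""
    return {phone for pron in pronunciations for phone in pron if phone not in phoneset}


# Hypothetical phoneset and pronunciations, for illustration only.
asr_phoneset = {"ax", "b", "hh", "ao", "r"}
generated = [["ax", "b", "hh", "ao"], ["ax", "b", "zh"]]

unknown = phones_outside_phoneset(generated, asr_phoneset)
print(sorted(unknown))  # ['zh']
```

An empty result means every generated phone is already known to the ASR's phoneset.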

1. Language Support

Language	Phoneset	Model	Model Type
Indian English	IPA	`./models/models36/en_with_hindi_phones/en-hi-ipa-model`	Sequitur
Hindi	CommonLabel	Use unified parser
Tamil	CommonLabel	Use unified parser
Telugu	CommonLabel	Use unified parser
Kannada	CommonLabel	Use unified parser
Malayalam	CommonLabel	Use unified parser
Punjabi	IPA	TBD	TBD
Bengali	CommonLabel	Use unified parser
Marathi	CommonLabel	Use unified parser
Gujarati	CommonLabel	Use unified parser

2. Installation

The recommended way to install is to clone the repository and run `poetry install --no-dev`. The system is reported to work on Python 3.6.9.

3. Usage

Depending on the model type, you can use various g2p components to convert words into phones. For example, if you have a unified-parser-based model, you should use `g2p.components.UnifiedParser`.

All components have a similar calling pattern, which looks like this:

from g2p.components import Fallback

cmp = Fallback()
# A component returns Optional[List[List[str]]] which looks something like
# this [['ax', 'b', 'hh', 'ao'],['ax', 'b', 'hh', 'ao', 'r']]

cmp("word")
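
Because a component returns `Optional[List[List[str]]]`, callers need to handle both `None` and multiple pronunciations. Here is a small sketch of one way to do that. The helper `to_lexicon_lines` and the tab-separated output format are assumptions made for illustration, not something g2p provides.

```python
from typing import List, Optional


def to_lexicon_lines(word: str, pronunciations: Optional[List[List[str]]]) -> List[str]:
    """Format a component result as 'word<TAB>phone phone ...' lines.

    Returns an empty list when the component produced nothing (None or []).
    """
    if not pronunciations:
        return []
    return ["{}\t{}".format(word, " ".join(phones)) for phones in pronunciations]


# The result shape mirrors the example in the comment above.
result = [["ax", "b", "hh", "ao"], ["ax", "b", "hh", "ao", "r"]]
lines = to_lexicon_lines("word", result)
for line in lines:
    print(line)
```

In real use, `result` would be whatever a component call such as `cmp("word")` returned.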

If you are using a model-based component, it's recommended to use it as a fallback for a dictionary. We keep source lexicons in `./data/sources` and store the generated pronunciations, along with their source names, in the database `./data/pronunciations.db`. Here is how one could go about the whole process for Indian English:

from g2p.components import SequiturModel, ARPAg2p, Chain

# The default Indian English dictionary is ./data/en-hi-names-ipa-lexicon.txt
arpa = ARPAg2p("./data/en-hi-names-ipa-lexicon.txt", name="dict")

model_path = "./models/models36/en_with_hindi_phones/en-hi-ipa-model"
sm = SequiturModel(model_path, name="sequitur-en")
chain_g2p = Chain([arpa, sm])

chain_g2p("cow")
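
The dictionary-first, model-as-fallback idea can be sketched without g2p itself. The code below is an illustration under assumptions and not the actual `Chain` implementation: it assumes components are tried in order and the first non-empty result wins. `first_match`, `toy_dictionary`, and `toy_model` are hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence

Pronunciations = Optional[List[List[str]]]
Component = Callable[[str], Pronunciations]


def first_match(components: Sequence[Component], word: str) -> Pronunciations:
    """Try each component in order and return the first non-empty result."""
    for component in components:
        result = component(word)
        if result:
            return result
    return None


def toy_dictionary(word: str) -> Pronunciations:
    # Stand-in for a lexicon lookup; returns None for unknown words.
    return {"cow": [["k", "aw"]]}.get(word)


def toy_model(word: str) -> Pronunciations:
    # Stand-in for a trained model; spells the word out as placeholder phones.
    return [list(word)]


components = [toy_dictionary, toy_model]
print(first_match(components, "cow"))  # [['k', 'aw']]
print(first_match(components, "dog"))  # [['d', 'o', 'g']]
```

Putting the dictionary first means curated pronunciations take precedence, and the model is only consulted for words the dictionary does not cover.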

For languages using Unified Parser, you need to build unified-parser and pass the path of its binary to the g2p component as shown below:

from g2p.components import UnifiedParser

up = UnifiedParser("./path-to-up-executable")
up("केला")
