
Design of a Phonetically Balanced Code-Mixed Hindi-English Read Speech Corpus for Automatic Speech Recognition





by Ayushi Pandey, B M L Srivastava, Rohit Kumar, B T Nellore, K S Teja, S V Gangashetty







"Hungry kya?"
"What your bahana is?"

A few advertisement slogans:
                   
        Pepsi: "Yeh Dil Maange More"
        Coke: "Life ho to aisi"
Have you come across conversations like the ones above, spoken in a half-baked blend rather than in either pure native language? :)

A new pattern has emerged, known as Hinglish. The mix of Hindi and English is the language of the street and the college campus, and its sound sets many parents' teeth on edge. It's a bridge between two cultures that has become an island of its own, a distinct hybrid culture for people who aspire to make it rich abroad without sacrificing the sassiness of the mother tongue. And it may soon claim more native speakers worldwide than English. (Full article on Hinglish)

Bilingual and multilingual speech communities recognize code-switching and code-mixing as predominant phenomena in conversational speech. While code-switching is regarded as an inter-sentential alternation between two languages, code-mixing is a word-level embedding of one language in the matrix of another. The phenomenon holds particular relevance in speech communities where the mother tongue and the medium of instruction are different languages.
According to the 2001 census, 12.1% of speakers in India speak English as their second or third language. The widespread usage and growth of code-mixing mandates a shift in paradigm away from monolingual automatic speech recognition (ASR).

Three types of code-switching occur in data from various bilingual communities:


  • Insertion, where words or elements from one language are inserted into the frame of another. For example: “मैं जाते वक़्त उन्हें drop कर दूँगी।” meaning: "I will drop them when I go."
  • Alternation, where larger chunks of the sentence alternate between the two languages, such as a clause-level switch. For example: “मुझे अच्छा लगेगा if you could come” meaning: "I would like it if you could come."
  • Congruent lexicalisation, where a common structure emerges from overlapping words or morphemes of the two languages. For example, in the word कम्प्यूटरों = कम्प्यूटर + ओं (computers = computer + s), a Hindi inflection is accepted on an English word, computer.


The paper presents a Phonetically Balanced Code Mixed (PBCM) speech corpus, sampled from a standardized code-mixed text corpus, the Large Code Mixed (LCM) corpus. An optimal text selection procedure was used to extract 6,126 utterances from the LCM. The PBCM corpus is currently being recorded and post-processed for speech recognition purposes at IIIT Hyderabad.
 The primary objectives of the work include:

• To introduce selected sections of Hindi newspapers as a reliable site for code-mixed Hindi-English.

• To develop an optimal text selection procedure towards a Phonetically Balanced read speech corpus in Code Mixed (PBCM) Hindi-English.

• To record the utterances collected in the PBCM, through the contribution of Hindi-English bilingual speakers.

• To construct a baseline speech recognition system for code-mixed speech, extrapolating on monolingual Hindi and English training resources.

Design of data corpus

As a first step, a large body of data was scraped from three sections, namely Gadgets and Technology, Lifestyle, and Sports, of the newspapers Dainik Bhaskar and Sanjeevani.
The following example shows word-level English insertion in the matrix of a Hindi sentence.

Example:
अनहैल्थी फूड्स को अधिकतर अवॉइड करना चाहिए ।
Gloss:
[unhealthy-ENG] [foods-ENG] [case marker-HIN] [mostly-HIN] [avoid-ENG] [do-HIN] [should-HIN]
Translation:
One should mostly avoid unhealthy foods. 

Here, the English insertions have been transcribed in Devanagari within the matrix sentence. The newspaper corpus contains both English words transcribed in Devanagari, as in the example above, and a sizeable number of English words in Roman script.
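The distinction between Devanagari and Roman tokens can be illustrated with a small sketch. It labels each whitespace-separated token by script using Unicode code point ranges; the function names `token_script`, `tag_sentence`, and `roman_insertions` are hypothetical and are not taken from the paper. It only finds English words written in Roman script. English words transliterated into Devanagari (such as अवॉइड) look like Hindi to this check and would need a lexicon or language-identification step.

```python
# Minimal sketch (assumed helper names, not the authors' tooling):
# label tokens by script to spot Roman-script insertions in a Hindi sentence.

DANDA_MARKS = "\u0964\u0965"  # Devanagari danda and double danda (punctuation)


def token_script(token):
    """Return 'DEVANAGARI', 'LATIN', 'MIXED', or 'OTHER' for one token."""
    has_devanagari = any(
        "\u0900" <= ch <= "\u097F" and ch not in DANDA_MARKS for ch in token
    )
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "MIXED"
    if has_devanagari:
        return "DEVANAGARI"
    if has_latin:
        return "LATIN"
    return "OTHER"


def tag_sentence(sentence):
    """Split on whitespace and pair every token with its script label."""
    return [(tok, token_script(tok)) for tok in sentence.split()]


def roman_insertions(sentence):
    """Return the tokens written purely in Roman (Latin) script."""
    return [tok for tok, script in tag_sentence(sentence) if script == "LATIN"]


if __name__ == "__main__":
    for tok, script in tag_sentence("मुझे अच्छा लगेगा if you could come"):
        print(tok, script)
```

Tokens labeled MIXED (a Roman stem carrying a Devanagari suffix) are kept as a separate class, since they correspond to the congruent lexicalisation case rather than plain insertion.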


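How closely the sampled corpus follows the phonetic distribution of the source corpus is measured with Pearson's correlation. Equation (1) describes Pearson's correlation r, shown here in its standard computational form, where n is the number of pairs to be scored, x is the value contained in the first variable (in our case, the phonetic distribution of the LCM corpus), and y is the value contained in the second variable (the phonetic distribution of the PBCM corpus).

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \qquad (1) $$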
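A minimal sketch of this computation follows, using the standard computational form of Pearson's r with the same n, x, and y. The helper `distribution_correlation` and the dictionary layout (phonetic unit mapped to relative frequency) are assumptions made for illustration, and the frequencies in the usage lines are made-up toy values, not figures from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson's r for paired values, computational form:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of values")
    n = len(x)
    if n < 2:
        raise ValueError("at least two pairs are needed")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    spread_x = n * sum_x2 - sum_x ** 2
    spread_y = n * sum_y2 - sum_y ** 2
    if spread_x <= 0 or spread_y <= 0:
        raise ValueError("r is undefined when either variable is constant")
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(spread_x * spread_y)


def distribution_correlation(freq_a, freq_b):
    """Correlate two {unit: frequency} maps over the union of their units.

    A unit missing from one map is treated as frequency 0 (an assumption).
    """
    units = sorted(set(freq_a) | set(freq_b))
    x = [freq_a.get(u, 0.0) for u in units]
    y = [freq_b.get(u, 0.0) for u in units]
    return pearson_r(x, y)


if __name__ == "__main__":
    # Toy relative frequencies, invented for this example only.
    large_corpus = {"k": 0.40, "a": 0.35, "t": 0.20, "zh": 0.05}
    sampled_corpus = {"k": 0.38, "a": 0.34, "t": 0.21, "zh": 0.07}
    print(distribution_correlation(large_corpus, sampled_corpus))
```

A value close to 1 means the sampled corpus reproduces the relative frequencies of the larger corpus closely.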






The paper presents a phonetically balanced read speech corpus for code-mixed Hindi-English automatic speech recognition. The PBCM corpus has been sampled from the Large Code Mixed (LCM) newspaper corpus, which contains rich lexical insertions from English in a matrix of Hindi sentences. The inclusion of rare triphones in the sampled corpus has resulted in high phonetic coverage (correlation: 0.996), even with a small number of sentences.
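To make the idea of covering rare triphones with few sentences concrete, here is a rough sketch of one common approach: greedy selection that scores each sentence by the triphones it would newly cover, weighting rare triphones more heavily. This is a generic technique chosen for illustration, not the selection procedure used in the paper. The input format (sentence id mapped to a list of phones) and the names `triphones` and `greedy_select` are assumptions.

```python
from collections import Counter


def triphones(phones):
    """Return every window of three consecutive phones as a tuple."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def greedy_select(sentences, max_sentences):
    """Greedily pick sentence ids that add the most rarity-weighted new triphones.

    sentences: dict mapping a sentence id to its list of phones (assumed format).
    Returns (selected ids in pick order, set of covered triphones).
    """
    units = {sid: set(triphones(p)) for sid, p in sentences.items()}
    # Number of sentences containing each triphone; rare ones get a higher weight.
    doc_freq = Counter(u for sent_units in units.values() for u in sent_units)

    covered = set()
    selected = []
    remaining = dict(units)

    def gain(sid):
        return sum(1.0 / doc_freq[u] for u in remaining[sid] - covered)

    while remaining and len(selected) < max_sentences:
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        if gain(best) == 0:
            break  # nothing left adds a new triphone
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


if __name__ == "__main__":
    toy = {
        "s1": ["a", "b", "c", "d"],
        "s2": ["a", "b", "c"],
        "s3": ["x", "y", "z"],
    }
    print(greedy_select(toy, max_sentences=2))
```

The inverse-frequency weight is one simple way to favor rare units; other weightings or stopping criteria (for example, stopping once a target correlation with the source distribution is reached) would fit the same loop.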

To the best of the authors' knowledge, the PBCM is one of the first phonetically balanced corpora of code-mixed speech in an Indian language pair. Recordings from 100 Hindi-English bilinguals are planned for the corpus, of which 78 speakers have been recorded so far. Once post-processed, the PBCM corpus will be made available for research and related purposes.

