HMM-based speech synthesis (HTS) is a state-of-the-art approach to text-to-speech synthesis. Segmentation of the training data is essential for building any text-to-speech system. Most conventional text-to-speech systems use phones as the basic unit of synthesis and use a speech recogniser to automatically segment the data at the phone level. As Indian languages are low-resource languages, accurate transcriptions are difficult to obtain owing to the paucity of data. Manual labeling at the phone level is not only laborious but also inaccurate. HMM-based flat-start segmentation does not work well at the sentence level. In this paper, we propose an event-driven approach to obtain better phone boundaries. Syllable-like events are detected in the speech signal and matched with the syllabified transcription of the text. The syllables are converted to phoneme sequences, and Baum-Welch embedded re-estimation is restricted to the syllable level. Subjective evaluations indicate that the proposed system has a lower word error rate than a conventional system that uses flat start for obtaining phone boundaries. © 2014 IEEE.

Hema Murthy

Department of Computer Science and Engineering

S. Aswin Shanmugam

Acoustic equipment

Speech synthesis

Telephone sets

Transcription

Conventional systems

Event-driven approach

HMM-based speech synthesis

Indian languages

Low resource languages

State of the art

Subjective evaluations

Text-to-speech system

Telephone systems

IIT Madras is a public technical and research university located in Chennai, Tamil Nadu. Founded in 1959, it is recognised as an Institute of National Importance.

IIT Madras has been ranked as the top engineering institute in India for four years in a row by the National Institutional Ranking Framework of the MHRD.

It currently offers undergraduate, postgraduate and research degrees across 16 disciplines in Engineering, Sciences, Humanities and Management. About 596 faculty belonging to science and engineering departments and centres of the Institute are engaged in teaching, research and industrial consultancy.

IIT Madras

Group delay based phone segmentation for HTS

2014 20th National Conference on Communications, NCC 2014

In this work, we explore the task of musical onset detection in Carnatic music by choosing five major percussion instruments: the mridangam, ghatam, kanjira, morsing and thavil. We explore the musical characteristics of the strokes for each of the above instruments, motivating the challenge in designing an onset detection algorithm. We propose a non-model-based algorithm using the minimum-phase group delay for this task. The music signal is treated as an amplitude-frequency modulated (AM-FM) waveform, and its envelope is extracted using the Hilbert transform. Minimum-phase group delay processing is then applied to accurately determine the onset locations. The algorithm is tested on a large dataset with both controlled and concert recordings (tani avarthanams). The performance is observed to be comparable with that of the state-of-the-art technique employing machine learning algorithms. © 2015 IEEE.
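The envelope-extraction step named in this abstract can be illustrated with a minimal sketch. It computes the Hilbert envelope as the magnitude of the analytic signal, built by zeroing the negative-frequency half of the spectrum. The function names, the naive DFT and the even-length requirement are assumptions made for illustration; the paper's own implementation is not described here, and the group delay stage that follows is not shown.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT, or inverse DFT when inverse=True."""
    n_pts = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * n / n_pts)
            for n, v in enumerate(x))
        for k in range(n_pts)
    ]
    return [v / n_pts for v in out] if inverse else out

def hilbert_envelope(x):
    """Magnitude of the analytic signal of x (len(x) assumed even)."""
    n_pts = len(x)
    spectrum = dft(x)
    # Keep DC and Nyquist, double positive frequencies, zero negative ones.
    gain = [1.0] + [2.0] * (n_pts // 2 - 1) + [1.0] + [0.0] * (n_pts // 2 - 1)
    analytic = dft([g * s for g, s in zip(gain, spectrum)], inverse=True)
    return [abs(z) for z in analytic]

# Illustrative input: a pure tone with 4 cycles in 32 samples.
tone = [math.cos(2 * math.pi * 4 * n / 32) for n in range(32)]
envelope = hilbert_envelope(tone)
```

For a constant-amplitude tone the envelope is flat. For a percussive stroke it would instead trace the attack and decay of the stroke, which is the contour that onset detection works on.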

2015 21st National Conference on Communications, NCC 2015

Musical onset detection on carnatic percussion instruments

This paper describes the design and development of Indian language Text-To-Speech (TTS) synthesis systems using polysyllabic units. First, a phone-based TTS is built. Later, a monosyllable cluster unit TTS is built. It is observed that the quality of the synthesized sentences can improve if polysyllable units are used (when the appropriate units are available), since the effects of co-articulation will be preserved in such a case. Hence, we built Hindi and Tamil TTS with polysyllabic units that contain cluster units of more than one type (monosyllable, bisyllable and trisyllable). The system selects the best set of units during the unit selection process, so as to minimize the join and concatenation costs. Preliminary listening tests indicated that the polysyllable TTS has better quality. © 2010 IEEE.

Proceedings of 16th National Conference on Communications, NCC 2010

Using polysyllabic units for text to speech synthesis in Indian languages

International Journal of Signal Processing Systems

An Evaluation of Techniques Based on HMM Speech Synthesis for Using in HTS-ARAB-TALK

Traditionally, the information in speech signals is represented in terms of features derived from short-time Fourier analysis. In this analysis, the features extracted from the magnitude of the Fourier transform (FT) are considered, ignoring the phase component. Although the significance of the FT phase was highlighted in several studies over the past three decades, the features of the FT phase were not exploited fully due to difficulty in computing the phase and also in processing the phase function. The information in the short-time FT phase function can be extracted by processing the derivative of the FT phase, i.e., the group delay function. In this paper, the properties of the group delay functions are reviewed, highlighting the importance of the FT phase for representing information in the speech signal. Methods to process the group delay function are discussed to capture the characteristics of the vocal-tract system in the form of formants or through a modified group delay function. Applications of group delay functions for speech processing are discussed in some detail. They include segmentation of speech into syllable boundaries, exploiting the additive and high-resolution properties of the group delay functions. The effectiveness of segmentation of speech, and the features derived from the modified group delay, are demonstrated in applications such as language identification, speech recognition and speaker recognition. The paper thus demonstrates the need to exploit the potential of the group delay functions for development of speech systems. © 2011 Indian Academy of Sciences.
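As a minimal sketch of the group delay function referred to in this abstract (the negative derivative of the FT phase), the following Python avoids explicit phase unwrapping by using the standard identity tau(w) = (X_R Y_R + X_I Y_I) / |X|^2, where Y is the transform of n·x[n]. The function names, the naive DFT, the FFT size and the `eps` guard are illustrative assumptions and are not taken from the paper.

```python
import cmath

def dft(x, n_fft):
    """Naive O(N^2) DFT of x, zero-padded to n_fft points."""
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * n / n_fft)
            for n, v in enumerate(x))
        for k in range(n_fft)
    ]

def group_delay(x, n_fft=16, eps=1e-12):
    """Group delay in samples at each DFT bin, without phase unwrapping."""
    X = dft(x, n_fft)
    # Y is the transform of n * x[n]; it carries the derivative of X.
    Y = dft([n * v for n, v in enumerate(x)], n_fft)
    return [
        (a.real * b.real + a.imag * b.imag) / max(abs(a) ** 2, eps)
        for a, b in zip(X, Y)
    ]

# Illustrative input: a unit impulse delayed by 3 samples.
delayed_impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
tau = group_delay(delayed_impulse)
```

A pure delay has linear phase, so its group delay is the same at every frequency and equals the delay in samples (3 in this example). The `eps` guard only avoids division by zero at spectral nulls; it does not address the spikes that the modified group delay function is designed to suppress.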


Sadhana - Academy Proceedings in Engineering Sciences

Group delay functions and its applications in speech technology

Speech Communication

Statistical parametric speech synthesis

In the development of a syllable-centric automatic speech recognition (ASR) system, segmentation of the acoustic signal into syllabic units is an important stage. Although the short-term energy (STE) function contains useful information about syllable segment boundaries, it has to be processed before segment boundaries can be extracted. This paper presents a subband-based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection. By treating the STE function as a magnitude spectrum of an arbitrary signal, a minimum-phase group delay function is derived. This group delay function is found to be a better representative of the STE function for syllable boundary detection. Although the group delay function derived from the STE function of the speech signal contains segment boundaries, the boundaries are difficult to determine in the context of long silences, semivowels, and fricatives. In this paper, these issues are specifically addressed and algorithms are developed to improve the segmentation performance. The speech signal is first passed through a bank of three filters, corresponding to three different spectral bands. The STE functions of these signals are computed. Using these three STE functions, three minimum-phase group delay functions are derived. By combining the evidence derived from these group delay functions, the syllable boundaries are detected. Further, a multiresolution-based technique is presented to overcome the problem of shift in segment boundaries during smoothing. Experiments carried out on the Switchboard and OGI-MLTS corpora show that the error in segmentation is at most 25 milliseconds for 67% and 76.6% of the syllable segments, respectively. © 2004 Hindawi Publishing Corporation.
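The central idea of this abstract, treating the STE contour as if it were a magnitude spectrum and deriving a minimum-phase group delay function from it, can be sketched as follows. The abstract does not give the exact processing steps, so this sketch substitutes the textbook cepstral construction: mirror the log-energy contour, take its real cepstrum, keep only the first few coefficients for smoothing, and form the group delay as the sum of n · 2c[n] · cos(nw). The function names, `n_keep` and `floor` are assumptions. The subband filter bank, the combination of evidence across bands and the boundary-picking rule are not shown.

```python
import math

def short_term_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame."""
    return [
        sum(s * s for s in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def min_phase_group_delay(energy, n_keep=4, floor=1e-8):
    """Smoothed minimum-phase group delay of an STE contour (>= 3 frames)."""
    m = len(energy)
    log_mag = [math.log(max(e, floor)) for e in energy]
    # Mirror the contour so it is even, like the magnitude spectrum
    # of a real signal.
    sym = log_mag + log_mag[-2:0:-1]
    n_pts = len(sym)
    n_keep = min(n_keep, n_pts // 2 - 1)
    # Real cepstrum coefficients c[1..n_keep]; dropping the rest smooths.
    cep = [
        sum(v * math.cos(2 * math.pi * k * n / n_pts)
            for k, v in enumerate(sym)) / n_pts
        for n in range(1, n_keep + 1)
    ]
    # Minimum-phase cepstrum is 2*c[n] for n >= 1, so
    # tau[k] = sum_n n * 2*c[n] * cos(2*pi*k*n / n_pts).
    return [
        sum(2 * n * c * math.cos(2 * math.pi * k * n / n_pts)
            for n, c in enumerate(cep, start=1))
        for k in range(m)
    ]

# Illustrative energy contour: two syllable-like bumps with a dip between.
energy = [0.1, 0.8, 1.0, 0.7, 0.1, 0.05, 0.6, 1.0, 0.9, 0.2]
tau = min_phase_group_delay(energy)
```

The output has one value per energy frame. A smaller `n_keep` gives a smoother contour, at the cost of shifting or merging nearby features; the abstract's multiresolution technique is aimed at that boundary shift.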

Eurasip Journal on Applied Signal Processing

Subband-based group delay segmentation of spontaneous speech into syllable-like units

Journal	2014 20th National Conference on Communications, NCC 2014
Publisher	IEEE Computer Society
Open Access	No