
Automatic Speech Recognition (ASR) can be defined as the conversion of speech into text by a machine in real time; it is therefore also often referred to as Speech-to-Text conversion. Well-known systems that use ASR include Siri, Google Now, and Alexa. Speech recognition research has been going on for more than 50 years, yet it is not at the point where machines understand everything a person says in any acoustic environment. The main goal of research in ASR is to allow a computer to accurately recognize all the words that are intelligibly spoken by a speaker, irrespective of speaker characteristics, background noise, or vocabulary size. Concretely, an ASR system must convert a speech signal, typically a sequence of words interspersed with pauses and fillers, into a text transcription of the spoken words.

A speech recognizer requires two statistical models: an Acoustic Model and a Language Model. The statistical methods for continuous speech recognition were established more than 30 years ago, the most popular being Hidden Markov Models (HMMs). In speech recognition, decoding means recognizing the sequence of words given the acoustic observations.

Acoustic modelling is the heart of speech recognition. It estimates the probability of generating the acoustic features for given words and thus directly affects recognition quality. Acoustic modelling, however, has only partial information available for training the Acoustic Model parameters, because the corresponding textual transcription is not time-aligned; the hidden alignment of the words within an utterance makes acoustic modelling more challenging. A Language Model effectively reduces and, more importantly, prioritizes the acoustic modelling hypotheses. The probability of the acoustic features given a word transcription, estimated by the Acoustic Model, is combined with the probability of that transcription, estimated by the Language Model, to compute the posterior probability of the transcription.

Most current speech recognition systems use HMMs to model the temporal variability of speech, and Gaussian Mixture Models (GMMs) to determine how well each state of each HMM fits a frame of the speech input. Alternatively, this fit can be evaluated with feed-forward neural networks; feed-forward networks with many hidden layers have been reported to outperform GMMs on a variety of benchmarks. One drawback of GMMs is that they are statistically inefficient at modelling data that lie on or near a nonlinear manifold. Speech is produced by modulating a relatively small number of parameters of a dynamical system, which implies that its true underlying structure has a much lower dimension than is apparent in a window containing hundreds of coefficients. Artificial neural networks trained by backpropagating error derivatives can learn much better models of data that lie on or near a nonlinear manifold. Two decades ago, researchers found some success using a single-hidden-layer artificial neural network to predict HMM states from windows of acoustic coefficients, and advances in machine learning algorithms and hardware have since led to more efficient methods for training DNNs with many layers and a very large output layer.
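The combination of Acoustic Model and Language Model probabilities described above is the usual Bayes decision rule: writing X for the acoustic features and W for a candidate word sequence, the recognizer outputs

\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W),

where P(X \mid W) is estimated by the Acoustic Model and P(W) by the Language Model.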
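To make the hybrid idea concrete, the following is a minimal numpy sketch of a feed-forward network that predicts HMM-state posteriors from a window of acoustic coefficients. The 11-frame window of 13 coefficients, the single ReLU hidden layer, the softmax output, and all sizes are illustrative assumptions, not details taken from the text.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 11 spliced frames of 13 coefficients, one hidden
# layer, and a softmax over a toy inventory of HMM states.
n_frames, n_coeffs, n_hidden, n_states = 11, 13, 256, 48

W1 = rng.normal(scale=0.01, size=(n_frames * n_coeffs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_states))
b2 = np.zeros(n_states)

def state_posteriors(window):
    # Map an (n_frames, n_coeffs) window of acoustic features to a
    # posterior distribution over HMM states.
    x = window.reshape(-1)            # splice the frames into one vector
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU
    return softmax(h @ W2 + b2)       # probabilities over the states

window = rng.normal(size=(n_frames, n_coeffs))  # stand-in for real features
print(state_posteriors(window).shape)           # (48,)

In a full hybrid system the network would be trained on aligned data and its posteriors converted into scaled likelihoods for the HMM decoder; the sketch only shows the forward computation.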
In training, an acoustic model is built for each phoneme. Acoustic modelling with HMMs captures the distinctive properties of speech, taking into account speaker variations, pronunciation variations, and context-dependent phonetic coarticulation. For this reason, the acoustic training corpus has to be quite large to obtain a robust acoustic model. First, an initial set of single-Gaussian monophone HMMs is created, with each phoneme represented by a three-state HMM; these monophones are then expanded into context-dependent "triphone" models. The large output layer in the DNNs is required to accommodate the large number of HMM states that arise when each phone is modelled by a number of triphone HMMs.

Kaldi is an open-source toolkit that contains several recipes for implementing automatic speech recognition. It supports features such as linear transforms, MMI, boosted MMI, MCE, discriminative analysis, and deep neural networks, and consists of a library, command-line programs, and scripts for acoustic modelling.

The basic unit used in many neural networks computes the weighted sum of its inputs and then passes this sum through a nonlinear function. In a TDNN, this basic unit is modified by introducing delays: the inputs of a unit are multiplied by several weights, one for each delay. In this way, a TDNN unit can relate and compare the current input to the past history of events (a minimal sketch of such a unit is given at the end of this section). Each TDNN unit outlined here can encode temporal relationships within the range of N delays; higher layers can attend to larger time spans, so local, short-duration features form at the lower layers and more complex, longer-duration features at the higher layers. The learning procedure ensures that each unit in each layer has its weights adjusted in a way that improves the network's overall performance.

The experiments use the TIMIT corpus:
• 630 speakers in total
• Each speaker has 10 utterances, i.e., a total of 6300 utterances
• Speakers are drawn from the 8 major dialect regions of the US, labelled dr1-dr8
• The overall male-female ratio is 70-30
• Breakdown of the 6300 utterances:
1) 2 dialect sentences designed at SRI (SA)
2) 450 phonetically compact sentences designed at MIT (SX)
3) 1890 diverse sentences designed at TI (SI)

Experiments were carried out on the TIMIT dataset using GMMs and DNNs for acoustic modelling, each in combination with HMMs, and the Word Error Rate (WER) of each combination was recorded.
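The TDNN unit described above can be sketched in a few lines of numpy: the unit keeps one weight vector per delay, multiplies the current and delayed input frames by those weights, sums the results, and applies a nonlinearity. The 13-dimensional frames, N = 2 delays, and the tanh nonlinearity are assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 13-dimensional input frames and N = 2 delays, so the
# unit sees the current frame plus the two preceding frames.
n_inputs, n_delays = 13, 2

weights = rng.normal(scale=0.1, size=(n_delays + 1, n_inputs))  # one weight vector per delay
bias = 0.0

def tdnn_unit(frames, t):
    # Output of one TDNN unit at time t: a weighted sum over the current
    # and delayed input frames, passed through a nonlinearity.
    total = bias
    for d in range(n_delays + 1):
        total += weights[d] @ frames[t - d]
    return np.tanh(total)

frames = rng.normal(size=(100, n_inputs))  # stand-in acoustic frames
outputs = [tdnn_unit(frames, t) for t in range(n_delays, len(frames))]
print(len(outputs))  # 98 outputs, one per time step with full delay context

Stacking layers of such units is what lets the higher layers cover longer time spans, as described above.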
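The Word Error Rate used to compare these systems is the edit distance between the recognized and reference word sequences, normalised by the reference length. A small toolkit-independent sketch of the computation:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference
    # words, computed with the standard edit-distance dynamic programme.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("she had your dark suit", "she had dark suit"))  # 0.2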