Source: Deep Learning on Medium

Generative Adversarial Networks using Unit Selection Synthesis Based Virtual Assistant for the Hindi Language

A virtual assistant is a software program that performs tasks or services for an individual. In the coming era, technology is expected to be simple for the end user to use. Artificial intelligence and mobile technology can also help visually impaired people overcome their disability and live healthy, balanced lives. Speech recognition is now used to trigger and carry out many everyday activities, and English and many other foreign languages have seen dominant, remarkable work in this area. Companies such as Google, Microsoft and Amazon have built strong support for English and the languages of other countries, while support for Indian languages is still a matter of research. Hindi, one of the official languages of India, is spoken by 45 % of Indians, and our aim is to develop a virtual assistant for it. This research aims to contribute to the government's Digital India initiative by bringing speech technology to rural areas, where about 70 % of India's population lives. It will help ordinary people engage with technology, and the development of any nation relies on the growth and development of all its people. We focus on deploying this design using unit selection based speech synthesis. According to research, only 10 % of India's population speaks English, while 41 % speaks Hindi, so a visually impaired person in India is far more likely to speak Hindi than English. This virtual assistant will help such people communicate in their own language, perform everyday activities and overcome their disability, giving them mental support and a place in society.
In this context, the Government of India sponsored a consortium project to develop text-to-speech (TTS) systems in Indian languages in two phases, which was completed in September 2017. The main focus during development was to make mobile phones easier to use for disabled users, for users who find some phone functions hard to use, and for children and older users. The main advantage of using a GAN is that the objective it optimizes, generating artificial data that another neural network cannot distinguish from real data, is closely aligned with the goal of producing realistic data. GANs also avoid many of the prior and posterior probability calculations that competing approaches such as maximum likelihood estimation often require.
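The GAN objective described above can be made concrete with the two losses of the standard GAN formulation. The minimal sketch below is an illustration, not the project's implementation. It assumes the discriminator's probabilities on a batch of real and generated samples have already been computed, and it uses the common "non-saturating" generator loss. The function names are hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Binary cross-entropy for D: real samples labelled 1, generated samples labelled 0.

    d_real: D's probabilities that real samples are real.
    d_fake: D's probabilities that generated samples G(z) are real.
    """
    real_term = -sum(math.log(max(p, EPS)) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(max(1.0 - p, EPS)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake: Sequence[float]) -> float:
    """Non-saturating generator loss: G is rewarded when D labels its samples as real."""
    return -sum(math.log(max(p, EPS)) for p in d_fake) / len(d_fake)


# D cannot tell real from generated data and outputs 0.5 everywhere:
print(round(discriminator_loss([0.5, 0.5], [0.5, 0.5]), 3))  # 1.386 (= log 4)
# A confident, correct D has a low loss, and the generator's loss is then high.
print(round(discriminator_loss([0.99], [0.01]), 3))  # 0.02
print(round(generator_loss([0.01]), 3))  # 4.605
```

When the generated samples become indistinguishable from real data, the best the discriminator can do is output 0.5 everywhere. That is why its loss settles at log 4 at this equilibrium.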

Project summary

Artificial intelligence and speech recognition technology have created a boom in information and communication technology. Virtual assistants are in great demand for performing the various activities of daily life. They also help visually impaired people lead normal lives, closing gaps in society and bringing them into the mainstream. An online poll in May 2017 found that the most widely used personal assistants in the US were Apple's Siri (34 %), Google Assistant (19 %), Amazon Alexa (6 %) and Microsoft Cortana (4 %). Alongside many other applications, Google Home and Amazon Alexa came to market, extending the concept of a personal assistant app to a dedicated hardware device. Such systems combine automatic speech recognition (ASR) with deep learning algorithms. An ASR system is trained on speech data and uses feature extraction methods to feed a speech decoder. The decoder is backed by a lexical model and a statistical language model designed for the system.
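A statistical language model scores how likely a word sequence is. This helps the decoder choose between acoustically similar hypotheses. The minimal sketch below assumes a bigram model with add-one (Laplace) smoothing, trained on a tiny hypothetical Hindi command corpus. Real systems use far larger corpora and stronger smoothing, such as Kneser-Ney.

```python
import math
from collections import Counter
from typing import List


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences: List[List[str]]) -> None:
        self.histories: Counter = Counter()  # how often each token precedes another
        self.bigrams: Counter = Counter()
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            self.histories.update(tokens[:-1])
            self.bigrams.update(zip(tokens[:-1], tokens[1:]))
        # Tokens that can be predicted: every word plus the end-of-sentence marker.
        self.vocab = {w for s in sentences for w in s} | {"</s>"}

    def prob(self, prev: str, word: str) -> float:
        """P(word | prev) with add-one smoothing."""
        return (self.bigrams[(prev, word)] + 1) / (self.histories[prev] + len(self.vocab))

    def log_prob(self, words: List[str]) -> float:
        """Log-probability of a whole sentence, including start and end markers."""
        tokens = ["<s>"] + words + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens[:-1], tokens[1:]))


# Hypothetical toy corpus of spoken commands.
corpus = [s.split() for s in ["मौसम कैसा है", "आज मौसम कैसा है", "टैक्सी बुलाओ"]]
lm = BigramLM(corpus)
# A natural word order scores higher than the same words reversed.
print(lm.log_prob("आज मौसम कैसा है".split()) > lm.log_prob("है कैसा मौसम आज".split()))  # True
```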

Automatic Speech Recognition

Speech recognition lays the foundation by identifying the input speech and its phonetic content. Deep learning methods, with their multiple layers of abstraction, improve the quality of the output. The figure below shows how deep learning methods are used in a text-to-speech algorithm.

Deep learning method for text to speech algorithms
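The project summary mentions feature extraction for the speech decoder. As an illustration of that front end, the sketch below splits a waveform into overlapping frames and applies a Hamming window. It then computes log power spectra, the kind of frame-level features deep networks consume. The 25 ms frame, 10 ms hop, 0.97 pre-emphasis coefficient and 512-point FFT are common defaults assumed here, not values from this project. Production systems usually go on to compute mel filterbank or MFCC features.

```python
import numpy as np


def log_power_spectrogram(signal: np.ndarray, sample_rate: int = 16000,
                          frame_ms: float = 25.0, hop_ms: float = 10.0,
                          n_fft: int = 512) -> np.ndarray:
    """Return an array of shape (num_frames, n_fft // 2 + 1) of log power spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    # Pre-emphasis boosts high frequencies, which carry weak but useful energy.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    if len(emphasized) < frame_len:                 # pad very short inputs
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len
    # Index matrix: row i selects the samples belonging to frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    return np.log(power + 1e-10)                    # small floor avoids log(0)


# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
features = log_power_spectrogram(np.sin(2 * np.pi * 440 * t))
print(features.shape)  # (98, 257)
```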

Virtual assistants use Natural Language Processing (NLP) to match user text or speech input to executable commands. Many continually learn using artificial intelligence techniques including machine learning.
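The matching step can be illustrated with a deliberately simple keyword-based matcher. This is a hypothetical sketch: the intent names and Hindi keywords are invented for illustration. Real assistants use trained NLP and intent-classification models rather than keyword lists.

```python
from typing import Dict, Optional, Set

# Hypothetical intents mapped to Hindi (Devanagari) and romanized keywords.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "get_weather": {"मौसम", "mausam", "weather"},
    "call_taxi": {"टैक्सी", "taxi", "cab"},
    "set_alarm": {"अलार्म", "alarm"},
}


def match_intent(utterance: str) -> Optional[str]:
    """Return the intent whose keywords overlap most with the utterance, or None."""
    tokens = set(utterance.lower().split())
    best_intent, best_score = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


print(match_intent("आज मौसम कैसा है"))       # get_weather
print(match_intent("मेरे लिए टैक्सी बुलाओ"))  # call_taxi
print(match_intent("नमस्ते"))                # None
```

The matched intent name would then be dispatched to the code that actually executes the command.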

Virtual assistants accept input in different ways:

· Text (online chat), especially in an instant messaging app or another app

· Speech, for example with Amazon Alexa on the Amazon Echo device, or Siri on an iPhone

· Images, taken and/or uploaded, as with Samsung Bixby on the Samsung Galaxy S8

Some virtual assistants are accessible through multiple methods, such as Google Assistant via chat in the Google Allo app and via speech on Google Home smart speakers. A virtual assistant can be activated by speech using a wake word, a word or group of words such as “Alexa” or “OK Google”. Various intelligent assistant technologies have also come out of U.S. Defense Advanced Research Projects Agency (DARPA) programs. DARPA focused on enriching multimodal, speech-enabled systems with advanced conversational capabilities for engaging human users in mixed-initiative interactions. From 2003 to 2008, it funded the Cognitive Assistant that Learns and Organizes (CALO) project. CALO's key goal was to build an integrated system with the key features of true artificial intelligence (AI), such as the ability to learn and adapt in adverse situations and to interact comfortably with humans. The CALO research had a number of spin-offs, most notably SRI International's Siri intelligent software assistant.

Speech technologies are behind many successful products in different markets involving human interaction and expression. Naturalness, however, remained a research problem. Many call-centre systems followed visual-menu patterns rather than typical human interaction patterns, leading to low user satisfaction in many cases. The speech recognition landscape changed completely with the launch of Siri in 2011. Apple's launch of an intelligent personal assistant (IPA) on the iPhone marked a turning point in the mass acceptance of speech technologies. With the massive computational power available through the cloud, more applications and AI technologies were integrated into dialogue systems. Figure 2 provides an overview of the virtual personal assistants that various companies have developed and brought to market.
Consequently, IPAs were developed that let users operate devices, access information and manage personal tasks in a much richer way. Each company is launching its own AI system built on speech recognition technology. All of these systems focus on dialogue capability, question answering and simple actions. By contrast, text-based chatbots from Facebook, Google and others use dialogue technologies to automate services via Messenger, bypassing the dependency on speech technology. The rapid rise of high-quality, cloud-based IPAs has been partly attributed to recent advances in deep learning technologies, especially deep neural networks (DNNs). With the exception of speaker recognition in the late 1990s, deep learning methods have only recently surpassed methods based on hidden Markov models (HMMs). The figure below shows the progress and development of applications for the available virtual assistants.

The development of speech-assistant-enabled IPAs

The following paragraphs describe several of these personal assistants. Almost all of them are used in English, so there is significant market demand for personal assistants designed for Indian languages.

1. Google Now/Google Assistant: Google now recognizes 119 languages for speech-to-text dictation, and work on adding languages is ongoing. The assistant is widely used across countries and combines natural language processing (NLP) with speech synthesis. The figure below shows the flow of the Google virtual assistant, which captures words from a Hindi corpus and retrieves the required response.
Google Assistant

2. Apple Siri: A widely used personal assistant for iOS and macOS. It speaks 21 languages and is localised for 36 countries. It is one of the best personal assistants in terms of the naturalness and expressiveness of its synthesized speech. It uses unit selection based speech synthesis, with deep learning techniques guiding the selection and concatenation of units. The figure below shows how Siri works: speech is recorded, and the response is spoken according to the text.

Siri Workflow

3. Microsoft Cortana: Cortana supports eight languages tailored to 13 countries. An editorial team of 29 people at Microsoft works to customise it for local markets. The app was initially built only for Windows Phone but is now available on Android devices, bringing the same assistance found on the desktop to phones and tablets. The block diagram of Microsoft Cortana is shown in the following figure.

Block diagram of Microsoft Cortana

4. Amazon Echo: Amazon Echo connects to Alexa, a cloud-based speech service, to play music, make calls, set alarms and timers, answer questions, check your calendar, weather, traffic and sports scores, manage to-do and shopping lists, control compatible smart home devices, and more. “Alexa” is also the wake word that activates the speech service on the Echo. The figure below shows how it works.

Alexa Speech Service and AWS IoT
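Unit selection is the technique described for Siri above and the one this project focuses on. It picks one recorded unit per target position so as to minimize the sum of two costs. Target costs measure how well a unit matches the desired specification. Join costs measure how smoothly adjacent units concatenate. A dynamic-programming (Viterbi) search finds the cheapest sequence. The sketch below is a minimal illustration under stated assumptions: each unit is described only by a hypothetical pitch and duration, and both costs are simple weighted absolute differences. Real systems, including the deep-learning-guided ones mentioned for Siri, use much richer features and learned costs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Unit:
    label: str       # e.g. a syllable such as "न" or "स्ते"
    pitch: float     # hypothetical prosodic feature (Hz)
    duration: float  # hypothetical duration (ms)


def target_cost(target: Unit, candidate: Unit) -> float:
    """How far a candidate is from the desired specification (weights are arbitrary)."""
    return abs(target.pitch - candidate.pitch) / 10 + abs(target.duration - candidate.duration) / 20


def join_cost(left: Unit, right: Unit) -> float:
    """Penalty for a pitch discontinuity at the concatenation point."""
    return abs(left.pitch - right.pitch) / 10


def select_units(targets: Sequence[Unit], candidates: Sequence[Sequence[Unit]]) -> List[Unit]:
    """Viterbi search over candidate lists, one list per target position."""
    # cost[j]: best total cost of any path ending in candidate j at the current position.
    cost = [target_cost(targets[0], c) for c in candidates[0]]
    back: List[List[int]] = []  # back[i - 1][j]: best predecessor of candidate j at position i
    for i in range(1, len(targets)):
        new_cost, pointers = [], []
        for cand in candidates[i]:
            totals = [cost[k] + join_cost(prev, cand) for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[k_best] + target_cost(targets[i], cand))
            pointers.append(k_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Hypothetical targets and recorded candidates for the word "नमस्ते".
targets = [Unit("न", 120.0, 100.0), Unit("म", 115.0, 110.0), Unit("स्ते", 110.0, 150.0)]
candidates = [
    [Unit("न", 121.0, 100.0), Unit("न", 150.0, 90.0)],
    [Unit("म", 140.0, 110.0), Unit("म", 116.0, 105.0)],
    [Unit("स्ते", 109.0, 150.0)],
]
chosen = select_units(targets, candidates)
print([(u.label, u.pitch) for u in chosen])  # [('न', 121.0), ('म', 116.0), ('स्ते', 109.0)]
```

The search is linear in the number of positions and quadratic in the number of candidates per position. This is why real systems prune candidate lists before running it.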

Usage of Virtual Assistants

The use of virtual assistants has increased over the last few years, leading to many such products and applications on different operating systems. One survey found that Siri had a larger share of mobile search than Bing or Yahoo. The survey covered 800 US adults, split roughly evenly between iOS and Android users. Among Android users, Google's search engine had an 84 % share; among iPhone owners, Google had a 78 % share. After Google, however, more respondents named Siri as their “primary search engine” than Bing or Yahoo, as shown in the figure below.

Survey of usage for primary search engine on smartphone

Siri was the primary search engine of 13 % of iPhone owners. This is significant because of the long-term, potentially disruptive impact of speech and virtual assistants on traditional “query in a box” search results. It is important to point out, however, that these responses are self-reported and may not line up one-to-one with actual behaviour. A very large percentage of respondents (72 %) said they used virtual assistants to “supplement” more traditional mobile search. The figure below shows the share of respondents using a personal assistant on their phones in this way.

Respondents using a Virtual Personal Assistant to Supplement Their Primary Smartphone Search Engine

Only 16 % of iPhone owners did not use a virtual assistant, while just under 40 % of Android users did not. Among iPhone owners who used assistants other than Siri, 10 % used Google Now and 4 % cited Cortana. Among Android users, 24 % were using virtual assistants other than Google's own, with 10 % using Cortana and the remainder distributed across several others, including Vivo. As these figures show, there is much scope for building virtual assistants in Indian languages. Such assistants would help ordinary people across the country engage with technology and contribute to the nation's development.

Objectives of the Project:

The objective of the proposed project is to develop a Hindi-language virtual assistant designed to work on various devices and operating systems. A further goal is to develop a standardized database for the Hindi language. The main objective is to make the assistant capable of answering questions and making recommendations through a natural language processing interface. The system will be connected to web services to which it delegates requests. The concept is to develop a speech synthesizer for Hindi, letting users across the country communicate in their own language. India is a developing country, and this should help more people make use of communication and internet services. The assistant will be able to buy tickets, reserve a table and summon a taxi, all without the user having to open another app, register for a separate service or place a call, thereby improving quality of life. Its main attraction is that it will accept commands and answer questions in Hindi.

The goal is to develop unrestricted text-to-speech (TTS) systems for Hindi. These systems are intended for use by the visually challenged, in mobile and PC virtual assistants, in computer-aided learning for rural areas, and so on. India is a vast country of 28 states and 1.3 billion people who speak 22 scheduled languages and hundreds of dialects, so developing a separate speech synthesizer for every language is very difficult. To overcome this hurdle, the project also focuses on developing a Hindi corpus that supports multiple Indian accents. It also aims to build appropriate language-specific linguistic analysis modules for TTS synthesis.
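One language-specific analysis module that a Hindi TTS front end typically needs is segmentation of Devanagari text into aksharas (orthographic syllables). Aksharas are often used as units for syllable-based unit selection in Indian languages. The sketch below is simplified and rests on several assumptions. It covers only the core Devanagari block (U+0900 to U+097F), assumes punctuation has already been removed, and ignores ZWJ/ZWNJ, Vedic signs and schwa deletion. The function names are hypothetical.

```python
from typing import List

VIRAMA = "\u094D"  # halant: joins the following consonant into a conjunct


def is_consonant(ch: str) -> bool:
    cp = ord(ch)
    return 0x0915 <= cp <= 0x0939 or 0x0958 <= cp <= 0x095F


def is_independent_vowel(ch: str) -> bool:
    cp = ord(ch)
    return 0x0904 <= cp <= 0x0914 or cp in (0x0960, 0x0961) or 0x0972 <= cp <= 0x0977


def aksharas(word: str) -> List[str]:
    """Split one Devanagari word into aksharas using simplified rules."""
    result: List[str] = []
    for ch in word:
        starts_new = is_consonant(ch) or is_independent_vowel(ch)
        # A consonant right after a virama stays in the same akshara (conjunct).
        joins_conjunct = bool(result) and result[-1].endswith(VIRAMA) and is_consonant(ch)
        if starts_new and not joins_conjunct:
            result.append(ch)
        elif result:
            result[-1] += ch   # matra, nukta, virama, anusvara, visarga, conjunct consonant
        else:
            result.append(ch)  # stray sign at the start of a word: keep it on its own
    return result


print(aksharas("नमस्ते"))  # ['न', 'म', 'स्ते']
print(aksharas("हिंदी"))   # ['हिं', 'दी']
```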

Expected output and outcome of the proposal

This project will produce a design for a Hindi-based virtual assistant that lets people communicate in Hindi, with both questions and answers spoken in Hindi. For example, users will be able to find the nearest restaurants while driving simply by giving spoken commands to their phone. It will also read text messages, show maps and provide directions, make appointments, place calls on hold, and perform other activities based on the commands it is given. The figure below shows the block diagram of the Hindi-based virtual assistant, which combines speech synthesis and speech recognition. A GAN is used to build the required system, to improve the prosody and naturalness of the speech.

Block diagram of the Hindi-based virtual assistant
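The block diagram can be read as a pipeline with three stages. Speech recognition turns audio into text. An NLP step maps the text to an intent, which may be delegated to a web service. Speech synthesis then speaks the reply. The sketch below wires these stages together with placeholder functions. Every stage name and signature is hypothetical. In the real system, the recognizer, the unit-selection/GAN synthesizer and the web services would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical fallback reply ("Sorry, I did not understand") for unknown intents.
FALLBACK_REPLY = "क्षमा कीजिए, मैं समझ नहीं पाया"


@dataclass
class Assistant:
    recognize: Callable[[bytes], str]          # ASR: audio -> Hindi text
    understand: Callable[[str], str]           # NLP: text -> intent name
    handlers: Dict[str, Callable[[str], str]]  # intent -> handler (e.g. a web-service call)
    synthesize: Callable[[str], bytes]         # TTS: Hindi text -> audio

    def respond(self, audio: bytes) -> bytes:
        text = self.recognize(audio)                                       # speech recognition
        intent = self.understand(text)                                     # language understanding
        handler = self.handlers.get(intent, lambda _text: FALLBACK_REPLY)  # delegate the request
        reply = handler(text)
        return self.synthesize(reply)                                      # speech synthesis


# Stubs: "audio" is just UTF-8 text here, and the weather handler returns a fixed reply.
assistant = Assistant(
    recognize=lambda audio: audio.decode("utf-8"),
    understand=lambda text: "get_weather" if "मौसम" in text else "unknown",
    handlers={"get_weather": lambda text: "आज मौसम अच्छा है"},
    synthesize=lambda reply: reply.encode("utf-8"),
)
print(assistant.respond("आज मौसम कैसा है".encode("utf-8")).decode("utf-8"))
```

Keeping each stage behind a plain function signature lets components be swapped independently, for example a unit selection synthesizer for a GAN-based one.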