dc.contributor.advisor | Alam, Md. Golam Rabiul | |
dc.contributor.author | Mamun, Md. Adyelullahil | |
dc.contributor.author | Abdullah, Hasnat Md. | |
dc.date.accessioned | 2021-10-18T05:05:53Z | |
dc.date.available | 2021-10-18T05:05:53Z | |
dc.date.copyright | 2021 | |
dc.date.issued | 2021-01 | |
dc.identifier.other | ID 20241044 | |
dc.identifier.other | ID 20241047 | |
dc.identifier.uri | http://hdl.handle.net/10361/15324 | |
dc.description | This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2021. | en_US |
dc.description | Cataloged from PDF version of thesis. | |
dc.description | Includes bibliographical references (pages 48-56). | |
dc.description.abstract | At present, intelligent virtual assistants (IVAs) are not only about delivering functionality
and increasing performance; they also need a socially interactive personality. Since human
conversational style is characterized by qualities such as humor, personality, and tone of voice,
these qualities have become essential for conversational intelligent virtual assistants. Our
proposed system is an anthropomorphic intelligent system that can hold a proper human-like
conversation with emotion and personality. It can also imitate any person's voice, provided
that audio recordings of that voice are available. First, the temporal audio waveform is
converted to frequency-domain data (a Mel-spectrogram), which contains distinct patterns for
audio features such as notes, pitch, rhythm, and melody. A parallel CNN and
Transformer-Encoder model is used to predict the emotion of the utterance from seven classes.
The audio is also fed to DeepSpeech, an RNN model with five hidden layers, which generates
the text transcription from the spectrogram. The transcript is then passed to a multi-domain
conversational agent that uses Blended Skill Talk, a Transformer-based retrieve-and-generate
strategy, and beam-search decoding to produce an appropriate textual response. The response
is synthesized back to audio in two steps: a synthesis model generates each Mel-spectrogram
frame conditioned on the previous frames, and WaveGlow, which is based on WaveNet and
Glow and learns an invertible mapping of data to a latent space that can be manipulated,
generates the waveform from the spectrogram. A fine-tuned version of the system could be
used in applications including, but not limited to, dubbing, voice assistants, and creating new
movies featuring the voices of past actors. | en_US |
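
The pipeline stages described in the abstract can be sketched in code. The front end converts the raw waveform into a log-scaled Mel-spectrogram; below is a minimal sketch using librosa, where the frame parameters (FFT size, hop length, number of Mel bands) are illustrative assumptions, not values taken from the thesis.

```python
import librosa
import numpy as np

def audio_to_log_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a waveform and convert it to a log-scaled Mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log compression makes patterns such as pitch, rhythm, and melody
    # stand out more clearly for the downstream models.
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time)
```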
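The emotion-recognition stage pairs a CNN with a Transformer-Encoder over the Mel-spectrogram. The following PyTorch sketch shows one way to run the two branches in parallel and classify into the seven emotion classes; the layer counts and sizes are assumptions, and only the parallel-branch idea comes from the abstract.

```python
import torch
import torch.nn as nn

class ParallelEmotionNet(nn.Module):
    """Parallel CNN / Transformer-Encoder emotion classifier (illustrative)."""

    def __init__(self, n_mels=80, n_classes=7, d_model=128):
        super().__init__()
        # CNN branch: local time-frequency patterns (e.g. pitch contours).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Transformer branch: long-range temporal structure of the utterance.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                                  # mel: (batch, n_mels, time)
        cnn_feat = self.cnn(mel.unsqueeze(1)).flatten(1)     # (batch, 32)
        seq = self.proj(mel.transpose(1, 2))                 # (batch, time, d_model)
        trans_feat = self.encoder(seq).mean(dim=1)           # (batch, d_model)
        return self.head(torch.cat([cnn_feat, trans_feat], dim=1))  # 7 logits

logits = ParallelEmotionNet()(torch.randn(4, 80, 200))  # dummy batch of spectrograms
```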
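For transcription, the abstract names DeepSpeech, an RNN with five hidden layers. torchaudio ships an implementation of that architecture; the feature and class counts below are placeholders, and in practice a trained checkpoint and a proper CTC decoder would be needed.

```python
import torch
import torchaudio

# Untrained skeleton of the five-hidden-layer DeepSpeech RNN; n_feature and
# n_class are illustrative (29 = a-z, space, apostrophe, CTC blank).
model = torchaudio.models.DeepSpeech(n_feature=80, n_hidden=1024, n_class=29)
model.eval()

spec = torch.randn(1, 1, 200, 80)    # (batch, channel, time, feature) dummy input
with torch.no_grad():
    log_probs = model(spec)          # (batch, time, n_class) log-probabilities
# Greedy CTC decoding would take the argmax per frame, then collapse
# repeated symbols and drop blanks to recover the transcript.
best_path = log_probs.argmax(dim=-1)
```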
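The response-generation stage uses Blended Skill Talk with a Transformer-based retrieve-and-generate strategy and beam-search decoding. As an approximation, the sketch below uses BlenderBot (which is trained on Blended Skill Talk) from Hugging Face Transformers with beam search enabled; the specific distilled checkpoint is an assumption, standing in for whatever model the thesis fine-tuned.

```python
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"  # assumed checkpoint
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("How was your day?", return_tensors="pt")
# Beam-search decoding keeps the 5 highest-scoring partial responses and
# returns the best complete candidate.
reply_ids = model.generate(**inputs, num_beams=5, max_length=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```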
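Finally, the textual reply is synthesized to speech: a spectrogram-prediction model emits each Mel frame conditioned on the previous frames, and WaveGlow inverts its flow-based mapping to render the waveform. A sketch using NVIDIA's published Tacotron 2 and WaveGlow torch.hub entry points follows; a CUDA device is assumed, as in NVIDIA's own example, and the hub models stand in for the thesis's fine-tuned checkpoints.

```python
import torch

hub = 'NVIDIA/DeepLearningExamples:torchhub'
tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2', model_math='fp32').to('cuda').eval()
waveglow = torch.hub.load(hub, 'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
utils = torch.hub.load(hub, 'nvidia_tts_utils')

sequences, lengths = utils.prepare_input_sequence(["Hello, nice to meet you."])
with torch.no_grad():
    # The spectrogram predictor generates each Mel frame from the previous ones.
    mel, _, _ = tacotron2.infer(sequences, lengths)
    # WaveGlow's invertible flow maps latent noise plus the Mel frames to audio.
    audio = waveglow.infer(mel)
```

Cloning a specific speaker's voice, as the abstract describes, would additionally require fine-tuning the synthesis models on recordings of that speaker.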
dc.description.statementofresponsibility | Md. Adyelullahil Mamun | |
dc.description.statementofresponsibility | Hasnat Md. Abdullah | |
dc.format.extent | 56 pages | |
dc.language.iso | en | en_US |
dc.publisher | Brac University | en_US |
dc.rights | Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. | |
dc.subject | IVA | en_US |
dc.subject | NLP | en_US |
dc.subject | SER | en_US |
dc.subject | Emotion | en_US |
dc.subject | Audio-Emotion | en_US |
dc.subject | Personal-Assistant | en_US |
dc.subject.lcsh | Intelligent control systems | |
dc.title | Affective social anthropomorphic intelligent system | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Department of Computer Science and Engineering, Brac University | |
dc.description.degree | B.Sc. in Computer Science and Engineering | |