Automatic Audiobook Creation Using Neural Text-To-Speech

More than 5,000 audiobooks were generated from Project Gutenberg e-books using neural text-to-speech.

Sep 14, 2023

TL;DR: Researchers from Microsoft, MIT, Project Gutenberg, and Google have collaborated to automate audiobook creation. Using neural text-to-speech technology and the SynapseML framework, their system turns e-books into high-quality, customizable audiobooks. It avoids the robotic voices of earlier text-to-speech systems, offers voice customization, and introduces emotive reading, making literature more accessible and engaging.


In the digital age, audiobooks have emerged as a pivotal medium, enhancing the accessibility and engagement of literature for audiences worldwide. However, the traditional methods of creating these audiobooks are fraught with challenges, from the extensive time and effort required to the inconsistencies in quality. Enter the groundbreaking collaboration between researchers from tech and academic giants: Microsoft, MIT, Project Gutenberg, and Google. Together, they're reshaping the audiobook landscape, introducing automation and innovation where it's needed most.


The Need for Automated Audiobook Creation

Historically, producing an audiobook has been a labour-intensive process. Whether it relies on meticulous narration by professionals or the passionate efforts of volunteers, the journey from text to audio is long. Platforms like LibriVox, driven by human volunteers, have made commendable strides in making audiobooks accessible. However, variation in recording equipment and environments leads to inconsistent audio quality. At the other end of the spectrum, platforms like Audible offer high-quality audiobooks, but they charge for them and their catalogs are not openly available.

Unveiling the Automated System

The collaborative project introduces a system built on neural text-to-speech technology. It can generate thousands of human-quality audiobooks from the vast Project Gutenberg collection with very little manual effort. Beyond this, the system offers extensive customization. Listeners can adjust speaking speeds, choose different styles, and even modify emotional intonations. And for those who have always dreamt of hearing a book in their own voice, the system can make that a reality with just a snippet of sample audio.
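To make the speaking-speed control concrete, here is a minimal sketch of how a rate setting is commonly expressed in SSML, the W3C markup that many text-to-speech services accept. The article does not say which interface the system actually uses, so treat this as an illustration only: the function name `build_ssml` and the voice name `hypothetical-voice` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def build_ssml(text, voice="hypothetical-voice", rate="medium", lang="en-US"):
    """Wrap text in W3C SSML with a speaking-rate setting.

    `rate` takes the standard SSML labels such as "slow", "medium", "fast".
    The voice name is a placeholder, not a real voice identifier.
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    prosody = ET.SubElement(voice_el, "prosody", {"rate": rate})
    prosody.text = text  # ElementTree escapes &, <, > when serializing
    return ET.tostring(speak, encoding="unicode")

print(build_ssml("Call me Ishmael.", rate="slow"))
```

Changing `rate` from "slow" to "fast" is all it takes to request a different reading speed for the same passage; style and emotion controls are typically vendor-specific extensions and are left out of this sketch.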

Overcoming Traditional Challenges

Past attempts at automated audiobook creation often stumbled at two major hurdles: the unmistakably robotic tone of text-to-speech systems and the challenge of discerning which text segments should be vocalized (you don’t want a huge index table read out in an audiobook!). This new system not only produces high-quality, human-like audio but also intelligently navigates the content of diverse e-books, ensuring a seamless listening experience.

The Technical Backbone: SynapseML

At the heart of this effort is SynapseML, a robust, scalable machine learning framework. It orchestrates the entire audiobook creation process, from parsing thousands of Project Gutenberg e-books to generating emotive, lifelike speech.

From E-books to Audiobooks: The Process

The journey begins with e-books, specifically those in the HTML format from Project Gutenberg. Given the non-standardized nature of these files, which can contain everything from footnotes to transcriber notes, the system employs clustering on the HTML Document Object Model (DOM) tree. This approach allows for the efficient parsing of vast collections, ensuring only the relevant text makes its way to the listener's ears.
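The idea of clustering e-books by their DOM structure can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the fingerprint (counts of root-to-node tag paths), the cosine similarity measure, the greedy grouping, and the 0.8 threshold are all assumptions chosen to keep the example short and dependent only on the standard library.

```python
from collections import Counter
from html.parser import HTMLParser
import math

class TagPathCounter(HTMLParser):
    """Counts root-to-node tag paths, a crude fingerprint of DOM structure."""
    VOID = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts["/".join(self.stack + [tag])] += 1
        if tag not in self.VOID:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

def dom_fingerprint(html):
    parser = TagPathCounter()
    parser.feed(html)
    return parser.counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(fingerprints, threshold=0.8):
    """Greedy grouping: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices into `fingerprints`
    for i, fp in enumerate(fingerprints):
        for members in clusters:
            if cosine(fingerprints[members[0]], fp) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

books = [
    "<html><body><h2>Ch 1</h2><p>x</p><p>y</p></body></html>",
    "<html><body><h2>Ch 2</h2><p>z</p><p>w</p><p>q</p></body></html>",
    "<html><body><div><pre>t</pre><pre>u</pre></div></body></html>",
]
print(cluster([dom_fingerprint(b) for b in books]))
```

Once files with similar layouts are grouped, a single set of extraction rules (which elements hold chapter text, which hold footnotes or transcriber notes) can be written per group instead of per book.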

Achieving High-Quality Speech

The system's capabilities don't stop at parsing. It also accounts for the different demands of each genre. While a non-fiction work might call for a clear, neutral voice, fiction, with its dialogue and drama, comes alive with emotive reading. Using a zero-shot text-to-speech method, the system can even clone a user's voice from minimal recordings, adding a personal touch to the listening experience.

Breathing Life into Text: Emotive Reading

What truly sets this system apart is its ability to infuse emotion into the narrative. By segmenting text into narration and dialogue, identifying speakers, and predicting emotions, it crafts a dynamic listening experience. Passages with multiple characters and emotional dialogues are no longer monotonous; they're vibrant and engaging, much like a theatrical performance.
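The first of these steps, separating narration from dialogue, can be approximated with a simple quotation-mark heuristic. The sketch below is only that: it assumes dialogue is always enclosed in double quotes (straight or curly), and it does not attempt speaker identification or emotion prediction, which the article describes but does not detail. The function name `split_narration_dialogue` is made up for this example.

```python
import re

# Matches text enclosed in straight or curly double quotes.
QUOTE = re.compile(r'"([^"]*)"|“([^”]*)”')

def split_narration_dialogue(text):
    """Return (kind, text) pairs where kind is "narration" or "dialogue"."""
    segments = []
    pos = 0
    for m in QUOTE.finditer(text):
        narration = text[pos:m.start()].strip()
        if narration:
            segments.append(("narration", narration))
        spoken = (m.group(1) or m.group(2) or "").strip()
        if spoken:
            segments.append(("dialogue", spoken))
        pos = m.end()
    tail = text[pos:].strip()  # narration after the last quote, if any
    if tail:
        segments.append(("narration", tail))
    return segments

sample = '"Stop!" cried Anne. She ran to the door. "Who is there?"'
for kind, chunk in split_narration_dialogue(sample):
    print(kind, "->", chunk)
```

Each resulting segment could then be assigned a voice and an emotional style before synthesis, which is where speaker identification and emotion prediction would come in.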

Conclusion

The collaboration between researchers from Microsoft, MIT, Project Gutenberg, and Google marks a significant step forward for audiobooks. By automating the creation process, they are not only making literature more accessible but also ensuring consistent quality. If systems like this one are widely adopted, the future of audiobooks looks bright, inclusive, and exciting.