Back in July, Apple launched an online journal dedicated to its machine learning (ML) efforts, giving its researchers a way to share their work with others.
Today, Apple added three new entries to the journal, each detailing the technology behind its digital personal assistant, Siri. The first is entitled “Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization,” and it is described, in part, as follows:
“Users expect Siri speech recognition to work well regardless of language, device, acoustic environment, or communication channel bandwidth. Like many other supervised machine learning tasks, achieving such high accuracy usually requires large amounts of labeled data. Whenever we launch Siri in a new language, or extend support to different audio channel bandwidths, we face the challenge of having enough data to train our acoustic models.”
The entry discusses several transfer learning techniques that leverage acoustic models and data already in use.
Meanwhile, the second new entry is entitled, “Inverse Text Normalization as a Labeling Problem,” and it is described as:
“Siri displays entities like dates, times, addresses and currency amounts in a nicely formatted way. This is the result of the application of a process called inverse text normalization (ITN) to the output of a core speech recognition component. To understand the important role ITN plays, consider that, without it, Siri would display “October twenty third twenty sixteen” instead of “October 23, 2016”. In this work, we show that ITN can be formulated as a labelling problem, allowing for the application of a statistical model that is relatively simple, compact, fast to train, and fast to apply. We demonstrate that this approach represents a practical path to a data-driven ITN system.”
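To make the "labeling problem" idea a little more concrete, here is a minimal Python sketch. It assumes a hypothetical label scheme (a `Label` with an optional `rewrite` string and a `join_next` flag), and the labels are written by hand rather than predicted by a trained model. It is not Apple's actual label set or implementation; it only illustrates how per-token labels can turn spoken-form text into formatted text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    """Hypothetical per-token label (an assumption, not Apple's scheme)."""
    rewrite: Optional[str] = None  # replacement text; None keeps the token as-is
    join_next: bool = False        # True: no space between this token and the next


def apply_labels(tokens: List[str], labels: List[Label]) -> str:
    """Build the formatted string by applying one label to each spoken token."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    pieces = []
    for token, label in zip(tokens, labels):
        pieces.append(token if label.rewrite is None else label.rewrite)
        if not label.join_next:
            pieces.append(" ")
    return "".join(pieces).strip()


# Spoken-form output of the recognizer, using the example from the quote above.
tokens = ["october", "twenty", "third", "twenty", "sixteen"]

# In a data-driven system a statistical model would predict these labels;
# here they are assigned by hand purely for illustration.
labels = [
    Label(rewrite="October"),
    Label(rewrite="2", join_next=True),
    Label(rewrite="3,"),
    Label(rewrite="20", join_next=True),
    Label(rewrite="16"),
]

result = apply_labels(tokens, labels)
print(result)
```

The point of framing ITN this way is that the hard part becomes predicting one label per token, which is a standard sequence labeling task, while applying the labels stays as simple as the loop above.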
Finally, the third entry is entitled “Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis,” and the description reads:
“Siri is a personal assistant that communicates using speech synthesis. Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri’s personality to shine through. This article presents more details about the deep learning based technology behind Siri’s voice.”
This last one is perhaps one of the more interesting entries for end users, as it explains the reasoning behind the change in Siri’s voice. As usual, each entry is a detailed deep dive into its topic. If you’re interested in the technology that makes Siri what it is, they’re certainly worth digging into.