Soviet mathematician who lost his hearing got a job at Google to help people with hearing and speech impairments
Dmitry Kanevsky develops products that help people communicate with loved ones, colleagues, mobile devices and the whole world, VC.RU reports.
Having lost his hearing, he learned to read lips, graduated from Moscow State University, earned a Candidate of Sciences degree, moved to the USA and now works as a researcher at Google.
For the last 40 years, Dmitry has been developing devices and technologies that help people with hearing impairments. His inventions include a device that helps people "hear" through the skin, and an application that converts into text the speech of people with a strong accent, a stutter or other speech characteristics.
The inventor explained how he created his lip-reading aid, got a job at Google and helped develop the algorithm that automatically generates captions on YouTube.
What follows is his first-person account.
As a child, I lost my hearing. But I was taught to read lips, and I went to a regular school.
I had a lot of friends, and at that time I did not have great difficulty communicating. It became harder when, in the eighth grade, I moved to Moscow's Second Mathematical School. The classmates were new and the subjects were complex and technical, so I had to study mainly from textbooks.
Nevertheless, after school I entered Moscow State University in 1969, then studied mathematics for another eight years and earned a Candidate of Sciences degree with a dissertation on algebraic geometry.
I think mathematics made me more independent. In it you are one on one with the problem. You can focus on it and fight it. That suits my character.
While finishing my thesis, I met my future wife. She moved with her parents to Israel, and I decided to follow her.
I understood that in a new country I would not read lips as well as in the USSR and could not communicate freely with people. So I developed a device that helped me read lips.
The device was worn on the body and let me "hear" through the skin: it picked up sounds and converted them into vibration. The problem was that some sounds, such as “s” and “sh”, lie at high frequencies, so they are difficult to feel with the skin. So I came up with the idea of shifting high frequencies down to low ones.
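The article does not say how the device lowered frequencies. As a rough illustration only, the sketch below uses one simple known technique, slow-play transposition: the signal is stretched in time, so that at the same playback rate every frequency is divided by a chosen factor. The function names, the 16 kHz sample rate and the factor of 4 are all assumptions made for the example, not details of the original device.

```python
import math

def sine(freq_hz, duration_s, rate_hz):
    """Generate a pure tone as a list of samples (a stand-in for a high-pitched sound)."""
    n = int(duration_s * rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def lower_frequencies(samples, factor):
    """Stretch the signal in time by `factor` using linear interpolation.

    Played back at the original sample rate, every frequency in the
    input ends up divided by `factor` (and the sound lasts longer).
    """
    out_len = int((len(samples) - 1) * factor) + 1
    out = []
    for j in range(out_len):
        pos = j / factor                      # position in the original signal
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out

def estimate_frequency(samples, rate_hz):
    """Rough frequency estimate: count upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate_hz / len(samples)

if __name__ == "__main__":
    rate = 16000
    high = sine(4000, 0.1, rate)              # hard to feel as vibration
    low = lower_frequencies(high, 4)          # shifted into a lower range
    print(estimate_frequency(high, rate), estimate_frequency(low, rate))
```

A real wearable would have to work in real time and keep speech rhythm intact, which plain time stretching does not do; the sketch only shows the basic idea of moving high-frequency content into a range the skin can feel.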
I managed to make the device so small that nobody noticed it under my clothes.
I got permission to take the device to Israel, and it helped me with Hebrew, which has many words with “high-frequency” sounds, such as “Shabbat”, “Shalom” and so on.
In Israel, I showed the device to a doctor. He said it was a great thing and that we should start a company to sell it.
We called it SensorAid. In parallel with this startup, I also worked as a mathematician at the Weizmann Institute. I earned 2,000 shekels a month; this was in 1981.
The device was then used in many countries; it was suitable for use anywhere in the world. In one hospital it was compared with the implant developed by Cochlear, which places a transmitter in a person’s ear so that they can perceive sounds.
My device showed the same result as Cochlear's, but their product cost $25,000 and required serious surgery, while my version was several times cheaper and did not require surgery.
In 1984, the American company Spectro bought the rights to the device. I first worked at academic institutions in Germany and the USA, and after that I moved to IBM.
Work at IBM
First I developed an algorithm for speech recognition.
To convert speech into text, the system has to read the acoustic signal and match it to the word it represents.
To do this, the sound is represented as a sequence of numbers, which is compared with each word in the dictionary according to some criterion. The recognized word is the one that best aligns with this sequence of numbers. The criteria were polynomials with 50 million variables, or parameters.
In 1990, dynamic programming methods allowed the calculation of polynomials with 50 million parameters in linear time.
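The article does not name the model behind these polynomial criteria. One standard case where this pattern appears is the hidden Markov model: the likelihood of an observed sequence is a polynomial in the model's probabilities, and the forward algorithm (a dynamic programming method) evaluates it in time linear in the sequence length. The sketch below assumes that setting; the two toy word models and their numbers are invented purely for illustration.

```python
from itertools import product

def forward_likelihood(obs, start, trans, emit):
    """Likelihood of `obs` under a discrete HMM via the forward recursion.

    Cost grows linearly with len(obs) (times the square of the state count).
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)

def brute_force_likelihood(obs, start, trans, emit):
    """The same quantity written out as an explicit polynomial:
    one product term per state path, so the cost is exponential."""
    n = len(start)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][symbol]
        total += p
    return total

# Hypothetical two-state models for two "words"; observations are symbols 0, 1, 2.
WORDS = {
    "yes": dict(start=[0.9, 0.1],
                trans=[[0.6, 0.4], [0.1, 0.9]],
                emit=[[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    "no": dict(start=[0.5, 0.5],
               trans=[[0.5, 0.5], [0.5, 0.5]],
               emit=[[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]),
}

def recognize(obs):
    """Pick the dictionary word whose model scores the sequence highest."""
    return max(WORDS, key=lambda w: forward_likelihood(obs, **WORDS[w]))
```

Calling `recognize([0, 0, 2, 2])` scores the sequence against every word model and returns the best match. The brute-force function is there only to show that the dynamic programming recursion computes the same polynomial without enumerating every path.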
More advanced criteria were based not on polynomials but on rational functions, that is, ratios of polynomials. For a long time nobody could find a way to compute their values for 50 million parameters in linear time. I found such a method, and once it was applied, the accuracy of speech recognition improved significantly.
Alongside this, I kept working on technologies to help people with hearing impairments. Around that time the Internet emerged, and with its help I created the world's first services that helped people understand speech.
One example was a service that converted spoken language into written text. The client phoned people who could type quickly and switched on the speakerphone, and they typed what they heard during the call.
The text appeared in real time on the client’s computer screen, so he could follow what people around him were saying. Such a service cost as much as $120 to $150 per hour.
I was also involved in inventions not related to speech recognition.
One such technology is Artificial Passenger, which helped drivers stay awake at the wheel. The system monitored the driver and talked with him, so that by answering questions the driver did not fall asleep.
Another development concerned bank security. To confirm a client's identity, consultants usually asked for the name of his mother or wife.
I developed a system that allowed the bank to collect more information about the client so that employees could ask a new question each time. For example: “What is your dog’s name?” or “When did you return from vacation?”
At the same time, the technology identified the caller’s voice and checked whether it really belonged to the bank’s client. If everything was in order and the person answered the question correctly, the bank employee knew that the caller was not a fraudster.
Speech Recognition for YouTube
In 2014, I moved to Google, where I continued to work on speech recognition.
I took on the Closed Caption system for YouTube, which automatically recognizes speech in a video and turns it into subtitles. At that time the technology worked poorly, and the team and I had to improve its algorithm.
To build acoustic models of words, we needed data to train the machine: texts and their spoken versions, with the words pronounced by different voices.
Previously, people were hired to listen to audio and transcribe it into text. That produced several thousand hours of speech examples, which is not enough for a good recognition system.
YouTube is interesting because it has a huge number of videos for which both sound and text are already available. Many users upload videos with subtitles they have transcribed themselves. In part, this was done because search engines ranked subtitled videos higher.
I had the idea of using hundreds of thousands of hours of ready-made user data to train the algorithms. The only problem was that people not only make mistakes in the text but sometimes simply put a random set of letters in the subtitles to rank higher in search. We had to build filters that separated good data from bad.
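The article does not describe how these filters worked. One plausible approach, shown here strictly as an assumption, is to compare each user-supplied caption with a rough automatic transcript of the same audio and keep only the pairs that mostly agree. The function names and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def agreement(caption, rough_transcript):
    """Word-level similarity between a user caption and a rough transcript (0.0 to 1.0)."""
    a = caption.lower().split()
    b = rough_transcript.lower().split()
    return SequenceMatcher(None, a, b).ratio()

def keep_for_training(caption, rough_transcript, threshold=0.6):
    """Keep a caption only if it broadly matches what a baseline recognizer heard.

    The threshold is an illustrative value, not one taken from the article.
    """
    return agreement(caption, rough_transcript) >= threshold

if __name__ == "__main__":
    heard = "welcome to my channel today we baked bread"
    print(keep_for_training("welcome to my channel today we bake bread", heard))
    print(keep_for_training("asdf qwer zxcv best video free", heard))
```

A caption with a few human typos still overlaps heavily with the rough transcript and is kept, while keyword spam or random letters shares almost no words with it and is dropped.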
As a result, we finished development in 2016, and Closed Caption became much better at recognizing speech. What users see today when they turn on automatic subtitles is the result of this work.
Projects for people with disabilities
In 2017, I moved from the New York office to the California branch of Google.
Here, within half a year, the team and I created the Live Transcribe application. It uses the same speech-to-text technology as YouTube, but as a standalone application. With it, people with hearing problems can read what is being said to them.
The system also recognizes other sounds and tells the user about them: a dog barking, a child crying, a guitar playing, a knock on the door, laughter, and so on. This part of the audio is processed on the phone itself, while the speech transcription runs over the Internet.
One of the main creators of this application is Chet Gnegy. Google employees often develop projects to solve their colleagues' problems. Gnegy saw me using services where people typed the speech they heard for me, and decided to help.
He created the first prototype of the application. It helped us work together and eventually grew into a separate Google project called Live Transcribe.
Another project I take part in is Euphonia. This application is for people with non-standard speech: people with ALS, deaf people, people who stutter, and stroke survivors.
For this project we again need many examples of non-standard speech, only this time they cannot be found even on YouTube. Such speech is highly individual, so data collection requires a different approach.
I dictated the first 25 hours of recordings myself. I wrote out in advance the talks I planned to give and then recorded them as audio. That is how I trained the system. I could then speak, and the audience saw a text transcript of my talk.
With each new talk, the system recognized new phrases better and better. Now I no longer need to write my talks in advance: the algorithm converts absolutely everything I say into text.
So it became clear that this approach works, and we began inviting other people with non-standard speech to read texts aloud and record them as well.
In the case of people with ALS, we started by giving them typical phrases that they say to interact with, for example, Google Home. They need to repeat 100 phrases in order to train the system for themselves. It is hard for such people to talk and they tire quickly, so we cannot expect many recordings from them.
Nevertheless, we gradually began to combine examples of speech of different people with this disease in order to create a universal system in the future. This is a slow process - there is too little data, and Euphonia is still a research project, not a finished product.
Unlike Live Transcribe, Euphonia does not require an internet connection. Smartphones have limited computing power, which makes transcribing audio difficult, but the team managed to cope with this.
Many people are wary of having their data processed over the Internet. When a user visits a doctor, both he and the doctor worry that their conversation will end up on remote servers. That is not a concern here, because Euphonia does not need a network connection.
We now provide a link where people with speech problems can register and submit examples of their speech. In some cases, Google tries to build an individual speech recognizer for them free of charge.
I am also working on a sign language recognition project. Here we work with visual information. This task is even more difficult than speech recognition. Development is currently at an early stage.
In sign language, a single gesture may stand not for one letter but for a whole phrase. Again we need to find a huge number of examples. On this project we are collaborating with Gallaudet University, the only institution of higher learning in the United States for deaf and hard-of-hearing people.
In addition, I have returned to the idea of my device that converted high frequencies to low ones. My colleagues are working on a new, more modern version that will be able to transmit more information.