Why do Greek, Czech, Hungarian, and Swedish, with their 8 to 13 million speakers, have Google Translate support and robust Wikipedia presences, while languages the same size or larger, like Bhojpuri (51 million), Fula (24 million), Sylheti (11 million), Quechua (9 million), and Kirundi (9 million) languish in technological obscurity?
Swedish, Greek, Hungarian, and Czech have a wealth of language resources, created one human at a time over centuries. They're the languages of entire nation-states, with national TV and radio recordings that can be used as the foundation for text-to-speech models. Their speakers have the kind of disposable income that makes media companies translate popular novels and subtitle foreign movies and TV shows. They're found in countries that tech companies imagine their customers might be living in or might at least visit on holiday, meaning it's worth localizing interfaces and adding them as translation options. They have regularized spelling systems and dictionaries that can be rolled into spellcheckers and predictive text models. They have highly literate speakers with internet access who can contribute to projects like Wikipedia. (Speakers who can even, in the case of Swedish, create a bot to automatically make basic Wikipedia articles for rivers, mountains, and other natural features.)
Language resources don't just appear. People have to decide to create them, and those people need to be fed and watered and educated and housed and supported, whether that's by governments or by companies or by the kind of personal wealth that lets individuals take on time-consuming intellectual hobbies. Creating parallel corpora and other language resources takes years, if it happens at all, and costs tens of millions of dollars per language.
Gretchen McCulloch, The widely-spoken languages we still can’t translate online. (My latest article as Wired’s Resident Linguist.)
Lingthusiasm Episode 24: Making books and tools speak Chatino - Interview with Hilaria Cruz
As English speakers, we take for granted that we have lots of resources available in our language, from children’s books to dictionaries to automated tools like Siri and Google Translate. But for the majority of the world’s languages, this is not the case.
In this episode, your host Gretchen McCulloch interviews Dr Hilaria Cruz, a linguist and native speaker of Chatino, an Indigenous language of Mexico which is spoken by over 40,000 people. Hilaria combines her work as an Assistant Professor of linguistics at the University of Louisville with creating resources for her fellow speakers of Chatino, everything from paperback or cloth children’s books to high-tech speech recognition tools which will make it easier to create more resources like this in the future. And she’s also making these resources available for other underrepresented languages!
Click here for a link to this episode in your podcast player of choice or read the transcript here
Announcements:
There were two big announcements at the top of the episode:
The first is that we have a date for our liveshow in Melbourne! We will be at the State Library of Victoria on Friday the 16th of November. Tickets on sale soon through our EventBrite.
We are also thrilled to announce we’ll be doing a liveshow in Sydney! We’ll be at GiantDwarf on Monday the 12th of November. Tickets available through their website.
We also have new merch!
Thanks to Lucy Maddox for bringing Space Babies to life! Check out the art in this post. A portion of the proceeds from the Space Baby merch will be donated to the Resource Network for Linguistic Diversity.
We also have new scarf colours, and t-shirts that say “I want to be the English schwa. It's never stressed.” Check out our Merch page for more details.
This month’s bonus episode was about hyperforeignisms! We take an international tour through how our minds deal with the interesting edge cases of words that are kinda-English and kinda-other-languages. Support the show on Patreon to get access to this and all 19 bonus episodes.
Here are the links mentioned in this episode:
Chatino language (Wikipedia)
Lengua Chatino resources website (mostly in Spanish)
A video story told aloud in Chatino by Hilaria Cruz
Hilaria Cruz’s page at the University of Kentucky
Hilaria Cruz’s website
Joel Sherzer
Tony Woodbury
Hilaria’s PhD thesis (Linguistic poetic and rhetoric of Eastern Chatino of San Juan Quiahije)
Automatic Speech Recognition (Wikipedia)
Alexis Michaud
Oliver Adams
Tlingit, Ojibwe, Hupa languages (Wikipedia)
Here’s a photo of the children’s books that Hilaria Cruz and her students made! Books 1-6 (from left) are in Chatino. The rightmost book is in Hupa and the second from right book is in Ojibwe. All eight books are available for purchase on Amazon. (More about the book creation process.)
From the description on the ASREL Retreat website (Automatic Speech Recognition for Endangered Languages):
This retreat will foster a dialogue between computer scientists working on Automatic Speech Recognition (ASR) specifically neural networks, native speakers of endangered languages, and linguists doing research on endangered languages to address the issue of the “bottleneck” of language transcription and discuss the use of technology in the transcription of language data.
Tools and technologies to automate and expedite the transcription and translation of oral texts from endangered languages are urgently needed. Most researchers working with endangered languages process their materials manually. Some researchers estimate that it takes roughly from 1 to 50 hours to prepare one hour of spoken text manually.
ASR technologies can significantly reduce the workload of transcribing large collections of speech recordings in these lesser-studied languages. Automating the process will enable the transcriber to become more of an editor, accelerating the overall transcription process. Implementation of ASR technologies could free up time for linguists, language activists, and speakers to create materials for teaching and learning the language, rather than spending countless hours transcribing.
You can listen to this episode via Lingthusiasm.com, Soundcloud, RSS, Apple Podcasts/iTunes, Spotify, YouTube, or wherever you get your podcasts. You can also download an mp3 via the Soundcloud page for offline listening.
To receive an email whenever a new episode drops, sign up for the Lingthusiasm mailing list.
You can help keep Lingthusiasm ad-free, get access to bonus content, and more perks by supporting us on Patreon.
Lingthusiasm is on Twitter, Instagram, Facebook, and Tumblr. Email us at contact [at] lingthusiasm [dot] com
Gretchen is on Twitter as @GretchenAMcC and blogs at All Things Linguistic.
Lauren is on Twitter as @superlinguo and blogs at Superlinguo.
Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our editorial producers are Emily Gref and A.E. Prévost, our production assistants are Celine Yoon & Fabianne Anderberg, and our music is ‘Ancient City’ by The Triangles.
This episode of Lingthusiasm is made available under a Creative Commons Attribution Non-Commercial Share Alike license (CC 4.0 BY-NC-SA).
There’s now more than 500 language varieties on Gboard for Android, bringing a smart, AI-driven typing experience to even more people around the world.
The latest version of the Google keyboard supports 500 languages with keyboard layouts and autocorrect/predictive text. Here’s an excerpt from the official blog post about how they went about doing this:
Building technology that works across languages is important: without a keyboard tailored to your language, simple things like messaging friends or family can be a challenge. Often, keyboard apps don’t support the characters and scripts used for languages with a smaller speaking population. As an example, the Nigerian language "Ásụ̀sụ̀ Ị̀gbò" is impossible to type on an English keyboard. Plus, wouldn't it be frustrating to see nearly every word you type incorrectly autocorrected into another language?
Many of Gboard’s newly added languages are traditionally not widely written, such as in newspapers or books, so they’re rarely found online. But as we spend more time on our phones on messaging apps and social media, people are now typing in these languages more than ever. The ability to easily type in these languages lets people communicate with others in the language they would normally speak face-to-face as well.
How we add new languages to Gboard
In addition to designing a new keyboard layout, every time a new language is added to Gboard we create a new machine learning language model. This model trains Gboard to know when and how to autocorrect your typing, or to predict your next word. For languages like English, which has only about 30 characters and large amounts of written materials widely available, this is easy. For many of the world's languages, though, this process is much harder.
In order to train our machine learning language models, we need a text corpus (which is a database of lots of available texts written in a particular language). Often, finding text data in these languages can be challenging. When we can’t find data online, we’ll share a list of writing prompts with native speakers, so we can create new text corpora from scratch. (You can read more about our crawling efforts for these languages in one of our recent research papers.)
Read the whole post.
I’m especially impressed by the numbers here. For context, Wikipedia is available in around 300 languages. Every other multilingual tech platform I can think of has support for between 30 and ~100 languages. There are still 7000-some languages in the world, so this task is by no means complete, but support for 500 languages (and especially creating their own corpora specifically for the task) legitimately sets the bar higher.
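To make the corpus-to-language-model step from the excerpt a bit more concrete, here is a minimal sketch of next-word prediction from a text corpus. It is a toy bigram counter, not how Gboard's models are built (the post doesn't give those details), and the four-sentence English corpus and the function names are made up purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = model.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

# Hypothetical toy corpus; a real one needs far more text than this.
corpus = [
    "i am going to the market",
    "i am going home",
    "i am happy",
    "we are going to the river",
]

model = train_bigram_model(corpus)
print(predict_next(model, "am"))     # followers of "am", most frequent first
print(predict_next(model, "zebra"))  # unseen word: nothing to suggest
```

The unseen-word case is the whole problem in miniature: a model can only suggest what it has seen, which is why a language with very little written text online needs a corpus built from scratch before predictive text can work at all.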
Transcript Lingthusiasm Episode 24: Making books and tools speak Chatino - Interview with Hilaria Cruz
This is a transcript for Lingthusiasm Episode 24: Making books and tools speak Chatino - Interview with Hilaria Cruz. It’s been lightly edited for readability. Listen to the episode here or wherever you get your podcasts. Links to studies mentioned and further reading can be found on the Episode 24 show notes page.
[Music]
Lauren: Hi Lingthusiasts, Lauren here. Before we get to Gretchen's great interview with Hilaria Cruz today, I have two exciting pieces of news to share with you. The first is that we have a date for our Melbourne live show. We'll be at the State Library of Victoria on Friday the 16th of November. Also, very excited to share with you that we are doing a live show in Sydney as well. We’ll be at GiantDwarf on Monday the 12th of November. For more details and links to tickets, go to lingthusiasm.com/show. Our patrons will get a couple of free tickets. We're looking forward to meeting them and all of you as well. We're also super excited to be able to share with you some new Lingthusiasm merchandise that we've been working on, which was another Patreon goal of ours. We are very excited to bring you the space babies and space pigeon from Episode 1 of the show in full and glorious animated colour on a range of merchandise, available through our site. You can see the images, find out more about the illustrations, and our wonderful illustrator, Lucy Maddox, by visiting lingthusiasm.com/merch. And now, over to Gretchen.
[Music]
Gretchen: Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics. I'm Gretchen McCulloch, and I'm here with Dr. Hilaria Cruz, who is a Neukom Fellow at Dartmouth College and just starting as an assistant professor in linguistics at the University of Louisville, and is a native speaker of Chatino who works with Chatino as well. Welcome, Hilaria.
Hilaria: Well, thank you. Hello, everyone!
Gretchen: Thank you so much for being here!
Hilaria: You are welcome.
Gretchen: I'm here because you invited me down for a workshop at Dartmouth, and so I'm going to talk about that as well. But first, let's start with: How did you get into linguistics?
Hilaria: As a native speaker of Chatino, I grew up in a community where we all spoke Chatino, and then it came time for us to go to school, and then my father says, “Well, I would like you to get an education.” So my father then says, “We're going to go to this other town named Juquila so you guys can go to school.” We came to Juquila and, at a time in the 1970s, the Mexican government wanted indigenous children to study, so they developed these, like, boarding schools – well, it was like a boarding house where indigenous children that came from the outskirts of the Spanish-speaking towns had room and board while they went to public school. So my family came to this, what is called “the houses” there, and I was sent to elementary school not knowing a word of Spanish. It was complete immersion.
Gretchen: Wow.
Hilaria: At the time, there was just one school in that little town, just one elementary school for – I would say, I'm just guessing, 5,000 people. There were many children. There were some children that went to school in the morning. There were some children that were going to school in the evening. Since I did not know that much Spanish, my father took me there and introduced me to this class. The teacher was nice, and then I – just as a warm-up, he let me go there for a few mornings. I would just go, just for a few hours.
Gretchen: How old were you?
Hilaria: I think that I was about seven.
Gretchen: Mm-hmm.
Hilaria: I would just hang out for a few hours, and I would just take off and go back to the boarding house, where my parents were also staying with us. And then the teacher says, “Oh, this is fine. But I think that you are ready to begin your regular classes now. So you are going to come to school from 2:00 to 6:00.” Then I began to get really sad, because I did not want to go to school because I used to get bored, just to sit down there and just not understand what the teacher says. Then I began to go to these evening classes, and I was not happy. So then I decided that I want to go back to the morning class, because it was the same teacher teaching first grade in the morning and then in the evening. I will go back, and he will welcome me. “Aha, yes, come in.” I will go for two, three hours in the morning, as much as I wanted. Then, I will go back again, back home, and, to me – that was the happy medium for me. At some point, then, he stopped me, and he says, “No, no, no, no. You cannot set up your own time. You must come back here, to school.” So I –
Gretchen: You go to school, you play by the rules.
Hilaria: Yes. To me school was just horrible. But I guess I persisted, and I got really bored, and I guess I passed, and then I – when we got to sixth grade there was – I guess in that school it was only a middle school, but, actually, my family and I were not happy in that town because that was the first place where I encountered racism against Native people. Because in my community, I was just a member of society, right? But when we got to school, kids began to pinch me, and they will call me “india” and things like that. So I will come back to my father and say, “Why is it that these kids are saying this to me? Why is it that they are pinching me and pounding me?” Because I just did not understand.
Gretchen: Yeah.
Hilaria: Then my father would say, “But you know, every time they tell you that, just be proud of yourself.” But how can a kid to be proud of – how can you be proud if somebody is stopping you, right? That was my experience in that town. It was like a frontier town. There was a lot of racism towards the Chatino people, who live in the outskirts of that town. So then I told my father and my siblings too, “You know what? We're not happy in this town.” Then he told us, “Well, I understand that you're not happy. Let's go to the city.” We went to the city. And there was a more cosmopolitan – we lived in a small area of the city where there were a lot of migrants from indigenous communities, so it was better. I continued my education. My father and I talked, and he encouraged me to continue college because he told me that in college, it'll be a lot of fun. That in college, I will be able to talk to other people, and meet a lot of people, so I was excited about going to college. I continued my education because I wanted to meet interesting people in college. That was the whole goal.
Gretchen: It’s a good goal. I like that goal.
Hilaria: I wanted to have interesting conversations, meet interesting people in college.
Gretchen: Yeah. That's great. I like that.
Hilaria: I think that my father was really smart for doing that.
Gretchen: He knew you very well.
Hilaria: Yeah, I think so. So my goal was to get to college, and have wonderful conversations, and meet interesting people.
Gretchen: Mm-hmm.
Hilaria: I continued going to college. Then, in 1991, I came to the United States. I began to hear conversations about linguists working with Native American languages, reviving these moribund languages, and then I began to think, “You know what? Maybe linguists will be able to help me create an alphabet for the Chatino language.” Because I was very curious about how to represent the Chatino languages, but the only thing that I was familiar with was the Spanish alphabet.
Gretchen: Right.
Hilaria: But since these languages come from such different linguistic families, Spanish does not have all of the symbols to be able to represent a tonal language, let's say like Chatino. We would try to write it down, but when it came time to read it, we could not read it.
Gretchen: It’s kind of unsatisfying.
Hilaria: So there was something missing there. I began to think, “You know what? This sounds very interesting. I think that linguists could help me maybe find a way to write the Chatino language.” I began to write to different linguists. I would write them letters and say, “Yes. Could you please help me develop an alphabet for my language?”
Gretchen: And this is 1991, so you're writing letters.
Hilaria: Ah, well it was –
Gretchen: Or emails maybe?
Hilaria: – letter. It was letter. I was writing emails around 2000, or something like that. It wasn’t in 1991. So I began to write these letters in 2000. My sister, Emiliana, also was on the same path. It was interesting because my sister Emiliana – I would talk about all these things, and I said that I was the first one, but, quietly, she had the same idea. She was more proactive. Well, we were both working on our own ends.
Gretchen: Oh, interesting.
Hilaria: Yeah. So Emiliana was in Oaxaca City then. She had a little coffee shop down there. And there walks in this American guy, whose name is Joel Sherzer. The professor Joel Sherzer, he used to teach at the University of Texas in the anthropology department. Joel Sherzer is a wonderful, very friendly guy. Joel Sherzer began to strike up a conversation with Emiliana, and then Joel asked Emiliana, “Tell me about you. What are you interested in?”
So then Emiliana says, “Well, you know what? I would love to be able to study my language.” And Joel says, “Well, that sounds very interesting. Tell me more about it because we at the University of Texas are very interested in working with native speakers of Mexico. Actually, we're creating a program. Why don't you come and visit us –
Gretchen: Oh my god.
Hilaria: – at the University of Texas?” So Emiliana went to Texas. She joined the anthropology department at the University of Texas. Emiliana began her program at the University of Texas, and we were just all very excited because then we met Anthony Woodbury, who was very interested in working with us with Chatino. And then Emiliana says, “Well, you know, in our studies of Chatino we need linguists. I think that you should join the linguistics department.”
Gretchen: So she recruited you to do the 'stics part?
Hilaria: Yeah! So then I say, “Sure! Yeah, I would love to do that.”
Gretchen: Okay. Is she your older sister?
Hilaria: She’s younger.
Gretchen: Oh, wow!
Hilaria: Well, she always tells me what to do. So that is how I joined the linguistics department. I was doing fieldwork with them. I was not a linguistics student or anything like that. I was just like – I accompanied them because I was just so excited they were studying Chatino, and this is something that I always wanted to do. So I began to do fieldwork. I pay my own way, and I just went over there.
Gretchen: Oh my god. So you were like the consultant? They were asking you questions about Chatino?
Hilaria: No, no, no, no.
Gretchen: You were just doing it with them for fun?
Hilaria: I was just doing it for fun. No, but they also did – and this was in the summer of 2003 – they did fieldwork. I mean, Emiliana was in school. I was not. I was just like a labourer, someone who was so excited about this, you know? Because this was always what I wanted to do, right? I was just so excited about it. So Emiliana told me, “Hey, we're going to go down there, and we're going to do fieldwork.” And I said, “I’ll come.” I pay my own way. I went there.
Since Emiliana had placed this idea of me that I needed to study linguistics, then I asked Tony, “Hey, do you think that I could join the Linguistics Department?” And then he says to me, “Well, you're going to have to apply, but if you're ready to work hard, we might accept you.”
Gretchen: Did you speak English at this point?
Hilaria: Yes, I did.
Gretchen: Oh, okay.
Hilaria: So that's how I began to study linguistics.
Gretchen: Oh, that's cool. So then you became a grad student at University of Texas.
Hilaria: This is how I began a graduate [degree] in linguistics at the University of Texas in Austin.
Gretchen: Oh, cool. That's really neat. And then you wrote a dissertation about Chatino and learned a lot of stuff, including how to write it?
Hilaria: Yeah. So one of the things that I wanted to do was to describe the poetics of Chatino where, at the time, I would call it poetics. One of the things that I grew up with, and what Joel Sherzer called verbal art, is what he calls it.
Gretchen: Speech...
Hilaria: He wrote a book on speech play and verbal art. This is the title of a book that Sherzer wrote, but basically he used to call it verbal art.
Gretchen: Verbal art. Oh, yeah.
Hilaria: So what he meant by verbal art is just to take into account the different types of speech styles that exist in one community. And one of the things that I saw in growing up in San Juan Quiahije is that there are so many different types of discourse. We have ceremonial discourse. We have political discourse. We have dialogues, you know, exchanges. I wanted to record some of those discourses because some of those – so what gets transmitted in many of those discourses is the need to preserve tradition. For example, there's always a pair of lines that the orator says. This is our tradition. This is what the elders left for us since the foundation of the community, since the foundation of the mountains, and to leave this tradition will be seen as bad. So as a Chatino speaker, every time I hear these ceremonial speeches, they resonate with me a lot. So I wanted to record us. The first assignment that I had in the first moment when I was in graduate school was I proposed to record political speech. I went back to my community. I recorded political speech, and the change in the authority. I did my master's thesis on that. And then for my dissertation, I did an ethnography of speech. I described the different patterns and structures –
Gretchen: Oh, like all the different genres?
Hilaria: The different genres. And it was describing the ecosystem of the different styles –
Gretchen: Oh, that’s interesting!
Hilaria: – of speech in the community. And I worked with very gifted and talented speakers. This is something that I really wanted to do, and it was a lot of work, but it was, I think, very important work. So I have the basis now to be able to continue that kind of work for other people to do the same.
Gretchen: Yeah. So we can link to your dissertation. But that's also how you got into, “Oh my gosh, it's really hard to work with audio data.”
Hilaria: That is right! Because it was hard for two reasons to be able to transcribe speech. I was, of course, a native speaker of the language so I knew what they were saying, but the problem was, when it was time to commit this language onto paper, that I was just a beginning writer. I mean, we were just in stages of developing the alphabet for the language, and then also learning linguistics. And then Anthony would worry. He is very meticulous at what he does, so he will say, “Well, what is the alphabet that you want to use?” There were like two or three choices of alphabet, so if you're going to choose one, you're going to have to be consistent. I was just beginning.
Gretchen: That is so hard for a beginner too.
Hilaria: It was just – it was tough. But another thing that I noticed was that it was just very time-consuming to be able to be transcribing these texts. This is something that I began to realise when I began to transcribe this text. In my dissertation, I offer transcriptions of five to six ceremonial texts. All of these are semi-long texts.
Gretchen: You mean long speeches?
Hilaria: Yeah, these are long speeches, and different genres.
Gretchen: Yeah. Because I know when we're doing – for the podcast, we make transcriptions for the podcast. We put the audio onto YouTube, actually, and we use YouTube's automatic speech recognition to create the first draft of the transcript. And then we have a person who goes in and corrects it because there’re all these corrections you need to make. For one thing, YouTube never recognises the name of the podcast, Lingthusiasm, because it's not a real word.
Hilaria: Yeah.
Gretchen: And so it gives us these crazy things about, like, “link Suzy azzam.” Like, who is Suzy? Why is she here? But we're lucky because we have automatic –
Hilaria: You’re lucky!
Gretchen: – transcripts.
Hilaria: At least! At least you have – this is news to me. This is the first time I heard the process by which you do transcription.
Gretchen: But it still takes hours, and we're still paying a human to do hours of detailed work making the transcripts, even though we cut out half of it by having an automatic thing create the first draft that that person can then fix.
Hilaria: That is so interesting. I wish I had a tool like that for Chatino, you know? At least something that could help me – just to give me a little help so I won't get carpal tunnel.
Gretchen: My gosh! Yeah, I bet. And it's probably hard for you to hire Chatino-speaking research assistants here in the US because I don't imagine there are a lot of them.
Hilaria: Well, it's not only that, but since in Mexico, as part of the creation of the nation state, their policy has been to integrate indigenous people into this national language, which is Spanish. So then when students go to school, the language of instruction is Spanish.
Gretchen: Right.
Hilaria: They don't know how to read and write in Chatino.
Gretchen: So even if they're speaking Chatino, you have to teach them how to read and write first?
Hilaria: Yes, that's right. If they can be of, you know, help.
Gretchen: Yeah, absolutely. And that's part of what you're doing this weekend?
Hilaria: That is right, yes. So then what happened, we continued to do research at the University of Texas, and we developed a very strong program of Chatino studies there. We used to call ourselves the Chatino Gang. There have been like eight dissertations on different Chatino languages that came out of the University of Texas from one or two very sporadic works in Chatino. There were like eight very in-depth studies. And one of those works was by Lynn Hou and Kate Mesh. They were studying sign language and gesture. Lynn Hou is a signer herself, and she will use transcribers in any spoken language, whether it’s English, Spanish, or Chatino.
Gretchen: Right.
Hilaria: So she was doing her dissertation on language acquisition in socialisation of deaf children in San Juan Quiahije, in my community. She asked me to transcribe the audio interviews that she was doing with the families. And these were really lengthy interviews. But then I took that very seriously because, like I said, I'm Lynn's ears, and I have to do this transcription really faithfully so she can get access to this language. So in taking that work really seriously to allow her access, I began to do the transcription. But then, at that point, it became to me much more important to be able to have some tool that could help me because it was just a lot of work. So then I made a comment on Facebook, “Hey, you know what? I see that automatic speech recognition, it's just very developed in English and all of these languages, how can we get a tool to be able to transcribe this text in Chatino?” I really don’t care. I would love to just have a tool that says things in Chatino because they were repeating these things, “cha, cha,” all the time, and it was just like, “Oh my god. I just want to have something that could at least recognize a few words so that I don't have to type all of these words.”
Gretchen: I mean, because the estimates that I've seen for how long it takes to do a transcript are like one hour of transcriber work for one minute of audio.
Hilaria: Yes.
Gretchen: And that's the kind of work – so if someone has an hour of audio data, that's 60 hours of work to try to transcribe that one hour.
Hilaria: That's right. No.
Gretchen: Which is ridiculous!
Hilaria: Yeah, it’s very laborious. So then I began to ask people. In talking with some linguists, they will say, “Well, it's very difficult to do speech recognition in small languages,” because the models such as forced alignment, which is a model that they had been using at the time, needed hundreds if not thousands of hours of text, and we did not have that.
Gretchen: That’s the whole point of it being a small language, you don't have those kinds of resources.
Hilaria: Yes. So then I began to think, “Well, how can we make it – how can we speakers of minority languages make it, or facilitate, or invite these people who are doing this automatic speech recognition research to be able to do collaborations and to help us create tools?” So then I went to several meetings, and I met the people who ran Linguist List, Damir Cavar and Gosia Cavar. It seems like they have some interest in doing ASR, and it seems like when I talked to them, and I told them about the problem, that they said, “Oh yes, I think that this could be possible.” It seems like it wasn't a challenge for them. They invited me to IU, Indiana University, and one of the interesting things that we did with Damir and Gosia there, which I did not encounter before, was that Damir thought that we could entice people who were doing computational linguistics if we offer some data in open access.
Gretchen: Okay.
Hilaria: So then what I did there was that we had a little recording, and then I re-spoke many of the texts that I had transcribed for my dissertation first. I re-read them. Then we put them again into ELAN, and then we put all their – we annotated them with parts of speech, and cut and paste.
Gretchen: So you re-spoke them like in an audio booth so you'd have higher sound quality? Or was it just slower?
Hilaria: Well, we didn't have an audio booth. It was just a nice recorder.
Gretchen: Like a nice quiet room compared to being outside where they were recorded the first time?
Hilaria: Yes, it was a nice recording. And we had a good tape recorder basically.
Gretchen: Oh, okay, okay.
Hilaria: So I re-spoke them in a –
Gretchen: Like high quality, slow...
Hilaria: Yeah, something like that. I tried to re-read the text. And so we compiled a corpus of 3.5 hours, which we put in this program called GORILLA, where people can just download it, and they can use it to do any type of research that they want to. I thought that that was very clever and – because Damir says, “Well, we need to allow people to have a nice corpus so that they can use it if they wish to add a different language into their models.”
Gretchen: And so do people start using it?
Hilaria: This is how I came into contact with the people that I'm working with right now. At some point – also, Alexis Michaud, who works on a group of languages called Yongning Na, he was also asking the same question. He's a linguist, he's a phonetician, and he was working with these languages, and he also wanted to do some automatic speech recognition for the languages that he was working with.
Gretchen: Where are they spoken?
Hilaria: In China.
Gretchen: In China. Okay.
Hilaria: Yes. The Na languages are spoken in China, so he also put some high-quality data out –
Gretchen: Out there on the internet, yeah.
Hilaria: Yeah, out there in the internet. And that is how he got connected with Oliver Adams, who is one of the co-organizers for this conference that I'm doing right now. So Oliver Adams got in touch with...
Gretchen: Alexis.
Hilaria: With Alexis. So they began to do this collaboration, but then it came time when they wanted to fit the model with another language that was also a tonal language. We had this corpus that we had developed with Linguist List, which was –
Gretchen: Chatino.
Hilaria: – Chatino with open access with one speaker, me –
Gretchen: Which is also a tonal language.
Hilaria: Which is also a tonal language.
Gretchen: It’s completely unrelated to this language in China.
Hilaria: Yeah, and actually it’s spoken by a comparable size of population, like 40,000 people, kind of like that, 40-50,000 people. So that is how we began this collaboration.
Gretchen: And so is the idea to make tools that could work regardless of what the language is? Or you have to kind of – so it'll work on Yongning Na, it'll work on Chatino, it'll work on some other language, it doesn't matter? Or is it to figure out how much data you need to train a very small amount of data, and then it works specifically on the language?
Hilaria: Yes. Well, actually the methods that Oliver Adams is using is neural networks.
Gretchen: Oh, okay.
Hilaria: Yes. So he developed this software called Persephone. With Persephone, then, you can input data on – I guess in this case he was interested in tonal languages, so maybe he developed some tools so that the model could recognise tonal languages. That's why he fed two tonal languages into the model, to see what kind of outputs they had. It seems like with the corpus that Alexis was working with, the output was just excellent, because he used more data. But the output in Chatino was also very good. It's very promising.
Gretchen: So it's useful for you to take a first draft of a transcript or something?
Hilaria: I think that it’ll be very useful. I have not used it to transcribe new data, and this is the reason why at the retreat we're going to find out how can we, who are not technologically savvy people, start using and training these models with new data.
Gretchen: So at the retreat the goal is to bring together the automatic speech recognition people and the minority language documentation people and say, “Okay. How can we help each other? How can we make these tools that’ll work for everything?”
Hilaria: How can we collaborate? How can we make tools for language documentation? Yes. Because on the one hand, we linguists are not – we don't know how to operate these models, and the engineers, they know how to work these systems. So the two of us are going to come together, and we're going to have an honest conversation. We linguists will say, “How would you like us to prepare our data so you can use it for your models?” They will tell us and vice versa, “This is what we need.”
Gretchen: And you have people working on multiple different languages, and multiple different technology-type things, all together?
Hilaria: Yeah, that's right. In my conversations with Oliver Adams, right now our tools for major languages are very advanced. A lot of the problems have been solved. Actually, there are many sub-specialties within that field. For example, one of the interests that Oliver Adams has is multilingualism in ASR. So for him, this is so interesting because we're going to have different speakers. We're going to have speakers of Chatino languages, speakers of Mayan languages. Basically, what Oliver Adams says is that many of the differences sometimes could even be anatomical. He should explain what he means more, but...
Gretchen: So for multilingual automatic speech recognition, is that an automatic speech recognition tool that works for multiple languages at the same time?
Hilaria: Yeah, I think so.
Gretchen: So if you're speaking in Chatino one minute and Spanish another minute, and let's say you also happen to speak a Mayan language, you could speak to it in any of those languages and it would be able to pick up, correctly, whatever you were doing?
Hilaria: You know, I’m really new in this field, so I really cannot speak –
Gretchen: This is the hope, maybe.
Hilaria: Yeah, yeah. I think we need to ask the ASR people these particular questions.
Gretchen: Yeah. But it would be great if it would work for multiple languages. That would be really cool.
Hilaria: Yes, yes. Actually, this is the new frontier.
Gretchen: Yeah, that’s right. Because there's six, seven thousand languages in the world, and there's what, maybe ten that you have really good automatic speech recognition tools for right now?
Hilaria: That’s right, yes. Yes. But the thing is that, still, for minority languages, there are certain requirements that need to be there in place first. Like with Chatino, it was easy to do this because I have prepared the corpus. There is an alphabet we have for Chatino. So we have research in Chatino now, but many of those languages do not have this research available. So even if you have a sound file that is not transcribed in one language, it will not be useful for someone to –
Gretchen: Because you do need some training data.
Hilaria: Yeah, you do need training data, and also, you need a person to evaluate the output of the model.
Gretchen: Right. Because then you can't fix it if it’s...
Hilaria: That's right. For example, in the way we work with Oliver, it was that he put this data in, and then I as a person, as a speaker, I went out and just evaluated the output of the system. It has to be reciprocal.
Gretchen: And so what's – in 20 years when this is amazing and everything works great, what's the vision for how this works? Is it so people who speak Chatino can say, “Okay, Google,” to their phones in Chatino and it will reply back?
Hilaria: Well, I mean, people can give it many uses, right? I don't know if I can say what kind of uses people can give it if we were able to get to that point. But on a personal level, I would love to be able to have a tool that could help me transcribe text that I have. Because, actually, we have hundreds of hours of recordings of Chatino text, and it'll be wonderful to be able to have these transcriptions. And the results of these transcriptions could be fed into ongoing dictionaries to study the syntax of the language, to study morphology, or all the different aspects of the grammar.
Gretchen: Or to make books or these kinds of things in the language.
Hilaria: Or to make books in the language. For example, we have recorded many stories that – they're sitting there. We haven't been able to transcribe them. It would be nice if we had a nice transcription with the story, so then we can work with our artists and make children's books, and develop all of these materials to promote the language.
Gretchen: Yeah. Because you've made some books already, right?
Hilaria: That is right! We just had some books published with the help of many people. And I'm just so proud of this because this is one of the first times that I have seen children's books in Chatino. They are so beautiful, colorful.
Gretchen: They're really beautiful. You were showing them to me earlier and they’re really lovely.
Hilaria: This is a project that I did with my students in our language revitalization class that I taught in the winter 2018 at Dartmouth College. So one of the first things that we did was to do the drawings on cloth books. Each student developed their own theme, and they put it on cloth books. Then we had an exhibition, and then the exhibition was a success. It was really beautiful. People loved it.
Gretchen: The cloth books look so cool! They're soft and you can – you know, a baby couldn't destroy them.
Hilaria: That's right. And then I got some funding from the Neukom Foundation to do the publication in a different format of these books. One of the students had to draw pictures for many of the books because the originals were just images that students pulled out –
Gretchen: From the internet somewhere.
Hilaria: – from the internet somewhere, because we were not thinking forward about publications and things like that. But when we realised that we needed to publish them, and that Neukom was offering some funding to publish them, we realised that we did not want to get –
Gretchen: You didn’t want to get sued.
Hilaria: Yeah, sued! So now we have these new books with completely new images, and they are –
Gretchen: And they're lovely. And they're Creative Commons, and they're Open License.
Hilaria: That's right.
Gretchen: So you had a few of those books be translated into other languages that don't have enough children's books?
Hilaria: That's right. Because I had native speakers of North American languages in my class. I had my student who spoke Tlingit. There was another student who spoke Hupa, and Ojibwe. And when they saw this, they realised that they wanted to do the same version in their own languages.
Gretchen: That’s so cool.
Hilaria: It was just really amazing. So all of these books just came out.
Gretchen: So now there's this little link between the Tlingit speakers, and the Hupa speakers, and the Ojibwe speakers, and the Chatino speakers. They'll all have the same pictures in their books with the words in their own language.
Hilaria: It is just so amazing, you know? I went back to Mexico, and I took the cloth books down to Oaxaca, and there was this friend from my community who came to visit. I was visiting my mom in Oaxaca. He came to visit. And then I sat down with him, and I read him one of the children's books. And then at the end he says to me, “It is so sad,” he says, “that our language is getting lost.” That is – so really, the books really bring these conversations about the importance of language.
Gretchen: And if the kids are – because, probably, a lot of the kids still kind of speak the language at home, but then when they go to school and the only language they see written down is Spanish, whereas if they could see also written versions of Chatino so they could be bilingual, and know that there's people who care about the language, and give it more prestige, and these kinds of things.
Hilaria: That is right. I grew up in Mexico hearing that indigenous languages were not languages because they did not have a writing system. That is why I wanted to develop an alphabet to show that this is a legitimate language. By having these cute little books, it's –
Gretchen: And they look very professional, too. Like, they’re shiny. And they look very professional.
Hilaria: That’s right. Yeah. We wanted to make Chatino look good. So in this conversation that I had with this person from my community, I said to him, “One of the worries that I have,” I said, “if I distribute this book in the community, is that many of the books that I see, like textbooks that the schools give for free, they all end up in the toilet.” So then I said, “One of the worries that I have is that my book will end up in the toilet.”
Gretchen: Yeah.
Hilaria: He said to me very seriously, “You know what? I'm going to tell you one thing.” He says, “I read the Bible. I do not take the Bible in the toilet. The Bible in my house has a special place. This book will be next to the Bible.”
Gretchen: What a compliment!
Hilaria: Yeah.
Gretchen: That is so meaningful.
Hilaria: It was just really beautiful, yes.
Gretchen: Yeah.
Hilaria: So I want to use these books to promote the language. One of the things that I would like to do, since this is a personal endeavour, and I don't have the backing of the state, I don’t have unlimited resources.
Gretchen: Yeah.
Hilaria: I would like to enlist families in the community to read the books, and then take videos of them interacting with the books and reading them with their children, take videos, and then, with their permission, upload them on social media, and in this way promote reading.
Gretchen: And they can see it. Because I think this is the thing is the technology space seems like it's so dominated by just a few languages, and to say, okay, this can be a language of technology, and this can be a language of writing and of the future that you can keep passing on to your kids.
Hilaria: That is right. Yeah.
Gretchen: Yeah.
Hilaria: I sometimes put little videos saying little phrases in Chatino. There are a lot of Chatinos who have migrated to the States. And they have children, and some of them are teaching Chatino to their children. Apparently I have some toddlers that follow my little videos.
Gretchen: Oh my gosh!
Hilaria: They just watch it over and over, and they repeat the words.
Gretchen: Oh my god, you’re like their teacher, or their grandma.
Hilaria: Yes. But I wish I could do more. It's just very sporadic.
Gretchen: Yeah. But that's still so cool. So if you can get other people making videos as well, maybe that helps.
Hilaria: Yes. Yeah, just make these books in different forms, like make little –
Gretchen: Or digital versions of them or something.
Hilaria: – animation, or things like that, yeah.
Gretchen: Yeah, that's very cool. I've taken a photo of the books already. So we will share a photo of the books, and we'll also link to whatever website or something you have set up for those. People can go see them, and you can see what they look like.
Hilaria: And you know what? One of the most important things about this is that this – as you say, these books have a Creative Commons License. So if someone out there would like to create children's books, they can use the same images, and just put their own text, and use the same things to publish their own books for their own language.
Gretchen: Yeah, that's really great. Hopefully you'll get photos being sent in from around the world of people doing that.
Hilaria: That would be amazing.
Gretchen: That would be amazing. Send Hilaria your photos if you end up using them.
[Music]
Gretchen: For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on iTunes, Apple Podcasts, Google Play Music, SoundCloud, or wherever else you get your podcasts, and you can follow @Lingthusiasm on Twitter, Facebook, Instagram, and Tumblr. You can get IPA scarves and other Lingthusiasm merch at lingthusiasm.com/merch.
I can be found as @GretchenAMcC on Twitter, and my blog is AllThingsLinguistic.com. Lauren tweets and blogs as Superlinguo.
To listen to bonus episodes, ask us your linguistics questions, and help keep the show ad-free, go to patreon.com/lingthusiasm, or follow the links from our website. Can't afford to pledge? That's okay too. We also really appreciate it if you could rate us on iTunes, or recommend Lingthusiasm to anyone who needs a little more linguistics in their life.
Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our audio producer is Claire Gawne, our editorial producers are Emily Gref and A.E. Prévost, and our production assistants are Celine Yoon and Fabianne Anderberg, and our music is by The Triangles.
Hilaria: Stay Lingthusiastic!
[Music]
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.