Bona: [...]working on language is much more complicated than any other parallel that we could imagine. Like Jade was saying from wa Thiong'o, language carries a lot of things with itself. It carries the principles. It carries the values. It carries the culture. It carries heritage. It carries a lot of things that altogether give an identity to a human being – a group of human beings. I would say language is very important. Language matters.
Excerpt from Episode 57 of Lingthusiasm: Making machines learn Fon and other African languages - Interview with Masakhane
Listen to the episode, read the full transcript, or check out more links about syntax and language and technology
A series of intimate conversations could teach an AI to understand both language and culture.
An interesting article about the social process of creating machine translation datasets in Khoekhoegowab. Excerpt:
On the surface, Wilhelmina Ndapewa Onyothi Nekoto and Elfriede Gowases seem like a mismatched pair. Nekoto is a 26-year-old data scientist. Gowases is a retired English teacher in her late 60s. Nekoto, who used to play rugby in Namibia’s national league, stands about a head taller than Gowases, who is short and slight. Like nearly half of Namibians, Nekoto speaks Oshiwambo, while Gowases is one of the country’s roughly 200,000 native speakers of Khoekhoegowab.
But the women grew close over a series of working visits starting last October. At Gowases’s home, they translated sentences from Khoekhoegowab to English. Each sentence pair became another entry in a budding database of translations, which Nekoto hopes will one day power AI tools that can automatically translate between Namibia’s languages, bolstering communication and commerce within the country.
“If we can design applications that are able to translate what we’re saying in real time,” Nekoto says, “then that’s one step closer toward economic [development].” That’s one of the goals of the Masakhane project, which organizes natural language processing researchers like Nekoto to work on low-resource African languages.
Compiling a dataset to train an AI model is often a dry, technical task. But Nekoto’s self-driven project, rooted in hours of close conversation with Gowases, is anything but. Each datapoint contains fragments of cultural knowledge preserved in the stories, songs, and recipes that Gowases has translated. This information is as crucial for the success of a machine translation algorithm as the grammar and syntax embedded in the training data.
Lingthusiasm Episode 57: Making machines learn Fon and other African languages - Interview with Masakhane
When you see something on social media in a language you don’t read, it’s really handy to have a quick and good-enough “click to translate” option. But despite the fact that 2000 of the world’s languages are African, machine translation and other language tech tools don’t yet exist for most of them.
In this episode, your host Gretchen McCulloch interviews Jade Abbott and Bonaventure Dossou of Masakhane, a grassroots organisation whose mission is to strengthen and spur Natural Language Processing research in African languages, for Africans, by Africans. We talk about how they started working on language tech, Bona’s machine translator in Fon, and alternative models of participatory research and collective co-authorship.
Click here for a link to this episode in your podcast player of choice or read the transcript here.
Announcements:
This month’s bonus episode is about the linguistics of Pokémon names! Which sounds cuter, a Pikachu or a Charmander? Which sounds like it would be more likely to win in a fight, a Squirtle or a Blastoise? Even if you're not familiar with these pocket monsters, or if you're encountering new Pokémon you haven't heard of before, you might still have a vague sense of which names sound big or small, cuddly or powerful. This has led to the creation of the delightful and entirely real linguistic subfield of Pokémonastics.
Join us on Patreon to learn more, and get access to 51 other bonus episodes! You’ll also get access to our Discord server, where you can chat about your favourite Pokémon names with other language nerds!
Here are the links mentioned in this episode:
Masakhane website
Masakhane on Twitter
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages - by Masakhane
Masakhane keynote presentation - “Low-resourcedness” Beyond Data
Masakhane wins the 2021 Wikimedia Foundation Research Award of the Year
Fon-French machine translator
Speech recognition programme for Fon
BBC article about Bona’s Fon translator
Making books and tools speak Chatino - Interview with Hilaria Cruz (Lingthusiasm Episode 24)
You can listen to this episode via Lingthusiasm.com, Soundcloud, RSS, Apple Podcasts/iTunes, Spotify, YouTube, or wherever you get your podcasts. You can also download an mp3 via the Soundcloud page for offline listening.
To receive an email whenever a new episode drops, sign up for the Lingthusiasm mailing list.
You can help keep Lingthusiasm ad-free, get access to bonus content, and more perks by supporting us on Patreon.
Lingthusiasm is on Twitter, Instagram, Facebook, and Tumblr. Email us at contact [at] lingthusiasm [dot] com
Gretchen is on Twitter as @GretchenAMcC and blogs at All Things Linguistic.
Lauren is on Twitter as @superlinguo and blogs at Superlinguo.
Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our production editor is Sarah Dopierala, our production manager is Liz McCullough, and our music is ‘Ancient City’ by The Triangles.
This episode of Lingthusiasm is made available under a Creative Commons Attribution Non-Commercial Share Alike license (CC 4.0 BY-NC-SA).
Transcript Lingthusiasm Episode 57: Making machines learn Fon and other African languages - Interview with Masakhane
This is a transcript for Lingthusiasm Episode 57: Making machines learn Fon and other African languages - Interview with Masakhane. It’s been lightly edited for readability. Listen to the episode here or wherever you get your podcasts. Links to studies mentioned and further reading can be found on the Episode 57 show notes page.
[Music]
Gretchen: Welcome to Lingthusiasm, a podcast that’s enthusiastic about linguistics. I’m Gretchen McCulloch. I’m here with Jade Abbott and Bonaventure Dossou from the Masakhane Initiative. Today we’re getting enthusiastic about natural language processing research in African languages – for Africans, by Africans. Hello! Welcome to the show.
Jade: Thank you for having us, Gretchen, hi.
Bona: Hi. Thank you for having us.
Gretchen: Thank you both for coming. I know that Masakhane is a big group of people. So, you are two parts of it, but there are other people who are also involved. Let’s start with backing up a step and talking about how you both got interested in language in the first place. Jade, do you wanna start?
Jade: Of course. I got interested in language from a very young age when I – I originate from South Africa, which has a very traumatic past with a lot of division. One of the things, when I was growing up, I noticed that I was never taught to speak African languages despite the fact I was living in South Africa. When I did try and learn the languages, I ended up with a lot of anxiety trying to get through this. Being in computer science, I turned to technology as a means to do this. The reason that it mattered to me so much was – and I don’t wanna be the cheesy person to quote Nelson Mandela, but I’m gonna do it because he says when you speak to someone in a language they understand, it goes to their head. If you speak to them in their own language, it goes to their heart. This I fundamentally believe on the importance of communication. For me it was a matter of bridging not only just a communication divide but also a cultural divide and bringing people together. This is why I cared a lot about language. It was from that point of view, that feeling, that the Masakhane Initiative was envisioned from my side. I started getting into it from a technological perspective and being like, “How can we use technology to support either translation of these languages or using them in a technological space?” We realised that there was just nothing – no research. Nothing was available. That’s what spawned off Masakhane.
Gretchen: Cool. And Bona, how did you start getting involved with this?
Bona: I grew up in Benin – a country where French is spoken. And also, like Jade, I did not get much exposed to my native language, which is Fon and also Igbo, simply because, in Benin, everything is done in French. Also at school, I mean, you were forced to speak French. Otherwise, if you were caught speaking the language, I mean, your native language and everything, you could have been punished or expelled from the school, either temporarily or even definitively. I always had passion for language but not my native language. I had passion for English because I wanted to become a computer scientist. I wanted to travel abroad, do great things, and everything. I knew that English will be definitely important for me. As a matter of fact, I was a very, very big fan of Michael Jackson. I used to buy his CDs. I used to go on the internet and google the lyrics and look at the sounds, trying to understand every meaning of the lyrics because I believed that understanding the song could help me to dance better somehow. Yeah, that’s how I got interested in language. But only recently when I travelled abroad in Russia, I had a lot of communication issues, mainly with my mom who cannot really speak well French. That’s basically from where the whole idea of working on African languages started. Like they say, great and beautiful minds always meet together, so finally we made Masakhane. So, here we are today.
Gretchen: The Masakhane Initiative started in 2019, is that right?
Jade: Yes.
Gretchen: And Bona, you got involved in 2020?
Bona: Yeah, I remember it was January 2020.
Gretchen: What an interesting time to get involved in something. You two are part of this organisation. Do you wanna talk a bit more about how – maybe, Jade, this is a question for you – people started coming together and forming groups and collaborations and so on within this project?
Jade: How it started was we were sitting – me and it was my co-author, Laura, who was the person I used to originally write these publications with around African languages – we were at a conference – the ACL conference.
Gretchen: That’s the Association for Computational Linguistics?
Jade: Yes.
Gretchen: Not the American one. There are ones in lots of areas.
Jade: There are a whole bunch. There isn’t one in Africa yet. [Laughs]
Gretchen: Yet.
Jade: Yet.
Gretchen: You could change that.
Jade: Yes. We were publishing at what they call the “WiNLP,” the Widening Natural Language Processing workshop. When we were there, we realised we were one of two groups of African attendees. The one group was from Ethiopia and the other group was us – to the point where researchers who cared about low-resource languages were actively seeking us out to be like, “Oh, great, we have people from the continent of Africa here! We should meet them.” From chatting to everyone and from our own research, we realised that the problem is so dire. There’s so little representation of African languages in language technology that we needed to build a pan-African wide community. Thankfully, there’s a great initiative called the Deep Learning Indaba. The goal of the Deep Learning Indaba is to strengthen machine learning in Africa. They deal with all aspects of machine learning. What they essentially did is brought together researchers and industry professionals from around the continent and put them through these workshops. What was great was that I knew that I could go present this idea of, hey, let’s build a community for African languages. Let’s make these NLP tools. Let’s work together. And then let’s write this huge paper with 50 authors to show people how we’re gonna change the world. What was great is that we already had a big group of people who were involved. Initially, we started up already with 50 people from all over the continent who were active in the Slack groups and helping write code. Right now, since that moment, we’ve been working virtually. When the pandemic happened, and suddenly everyone was confined, we had little to no change. In fact, it might’ve been the opposite. I think people had a lot more spare time to work on – [laughs] – to work on it.
Gretchen: Because you were already doing things across the entire continent, and so you were collaborating virtually anyway, so didn’t change that much. You were piggybacking on an existing group of people who were interested in technology stuff and saying, “Hey, maybe you should care about technology and language together”?
Jade: Yes, pretty much. There was already a group caring about it, but we said, “Let’s make a focus group. Let’s write a paper together. Let’s try to solve this representation issue as a group.”
Gretchen: How did you get involved, Bona?
Bona: I remember it was – I mean, the story was funny because I was – sorry. If I can go a little bit back to give more context. I was working at a company, and I built let me call it a “translation tool” that I was using – I mean, that had the languages of Google Translate, but I was not really using Google Translate. I did in Python, and I tried to implement the API and everything, integrate into the application that I was also building. It was an African social network application. I integrated it to the feed. For instance, I publish something in English or French, someone else can see it in Yoruba or whatever. I mean, those languages, for example, are on Google Translate. I also integrated it to the chat, so that if I write something in French, you decide in which language you want me to see. We run into a conflict of – I won’t call it “conflict of interest” – but I built the whole thing because of my interest in NLP, which is also dependent on another let me say “environment.” It happened to my brother/colleague, Chris, and I when we were in Russia. That would be too long to talk about. Finally, the one IT thing, the product and everything, talking about intellectual property and everything – iPhone, iPhone, iPhone. Finally, I gave up. After Chris joined us – Chris is the one that brought that idea of working on African languages. We started working on Lingala. It didn’t go quite well because we’re trying to let them know how it should be. But they were like, “Who are you? How dare you? How can you speak like that? Who do you think you are?” And then things went down, down, down, broke down, and broke down, and finally, we left. Finally, we left. Then we were actively looking for people working on those ideas – those African languages. We left the company somewhere in November. Then in December I think it was Chris who found on Twitter exactly Deep Learning Indaba, which was retweeting about AI for development. They talked about that platform, Zindi.
Then there was a contest about African languages data set. We’re like, okay, great. That’s a good idea. Then we just jumped in, and we started working on gathering data for Fon and everything. And I mean, you know, when you retweet, retweet, retweet, you finally can reach the source, right? In one of those retweets, I found the Twitter post of Masakhane, and then I went to them, and I said, I was like, “I should” – I was like, wow, this is great if we can join and everything. Chris was like, “Well, let’s wait. Maybe there are only professionals there. Let us finish because we’re still undergraduates.” Everything we’ve been doing now we honestly just planned it for NOW-now – I mean, to start now or to start after we graduate from the masters. But I was like, “Okay, I don’t care.” Most of the time it’s like that. Chris is a little bit of a pessimist, and I’m just like, okay, I just do it. The worst that can happen is it doesn’t work, but then at least I have tried. I just went to the website, and there was a form to fill out, so I sent both of our CVs and everything. Yeah, we just got accepted. We just got in. That’s how it started.
Jade: For context, Gretchen, Bona comes as a duo. It’s usually Bona and Chris. I met them together.
Gretchen: It’s technical limitations that prevent us from interviewing all three of you. There’s a really interesting presentation if people wanna see some of the faces of more of the group of Masakhane. The video presentation that you did a few months ago, which we can link to, that has snippets from about, what, 10 or so people and different parts of the project that they’re involved with.
Jade: It was a very fun talk. We took full advantage of the digital format, and edited – and edited, and edited – with a whole bunch of people and made things move around.
Gretchen: It’s such a good talk because a lot of conference talks are like, okay, here’s one person, and there’s a bad-quality video over some slides. Yours is actually a multimedia thing. There’s music, and there’s little animations of the people moving in and out. I think it’s just a really beautiful video, especially considering that it was originally a conference talk, which are often pretty boring.
Jade: That was actually all edited by Chris, I think. The amount of energy that went into editing that and putting it together and coordinating so many people.
Gretchen: All the talks fit together really well. Of course, none of you would’ve been in the same location at all.
Jade: If we’d done this conference talk, we would not have all been there, right, just simply due to visa issues and costs. We could never have done this if it weren’t digital.
Gretchen: You’re not gonna fly a dozen people to a conference to give a single talk where each of them talks for five minutes or something like that.
Jade: I mean, we would do that. It’s just that we probably wouldn’t get everyone in.
Gretchen: It’d just be a lot of logistical issues preventing it.
Jade: So many. The logistical issues that face Africans outweigh the logistical issues for many parts of the world. For instance, the majority of Africa struggles to get into the US. They’ve had multiple instances of conference goers – people who are running and organising a conference – cannot get a visa for Canada, where the conference is being held. It’s actually really interesting to see how a lot of people in this conference – conferences changed from being something really dull and boring, but it actually enabled so many people from around the world, especially from Africa, to actually attend and participate.
Gretchen: So, there’s some benefits to the digital conferences because now you don’t need a visa.
Jade: No. [Laughs]
Gretchen: This maybe gets us into talking about how you’ve also creatively subverted the traditional authorship models when it comes to writing papers as a collective for Masakhane.
Jade: Yeah, it was really interesting because we have this idea, and it’s particularly rare in the technology sphere, where typically you have the researchers, and then you’ve got everyone else in the world. Everyone else is things being studied, data being collected, participants. They’re not actually part of the research process. We said everyone who contributed, no matter how big or small – so, was it an edit you did, was it a model you contributed, was it a small evaluation you did – you were added as an author. We were sitting discussing, and we had 50 authors on this paper, and we’re trying to figure out what order they should be in. We randomised it. We set it from reverse alphabetical because that’s, you know, fine. Then someone was like, “Hey, why don’t we just put a symbol to represent the community?” because that’s actually what we want the first author to be.
Gretchen: Especially with a paper with 50 authors, it often gets cited as “So-and-So et al.”
Jade: Exactly.
Gretchen: And yet, if that turns into “Abbott et al.” that looks like it’s your project, which it’s not. You’re trying to decentralise that.
Jade: Yeah. We added in the for all symbol as our first author.
Gretchen: That’s this upside-down capital A.
Jade: The mathematical symbol, yeah. It’s the whole set, right. It’s actually encapsulated in our logo. If you look at the Masakhane logo, there’s actually a for all symbol located in there amongst it. We did that. The conferences weren’t happy. They kept trying to ignore the symbol. We kept trying to argue otherwise. It turns out they don’t represent the character sets on the proceedings websites. That’s a bit strange to me.
Gretchen: Especially for something that’s a computational linguistics conference, you would think that they would have Unicode support and the ability to support special characters and things. Because lots of people actually have those in their names as well.
Jade: Yes. By the end of it, I wasn’t sure if it was a technical incapability, or if they just didn’t want to set a precedent that you can just do this, but I can’t see why you shouldn’t be able to just do this. This seems fine. We should be able to have symbols as authors. If we can have cats as 10th authors and dogs as 5th authors, then why can’t we have a symbol?
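[Editorial sketch: the “for all” symbol discussed above is an ordinary Unicode character, U+2200. The short Python example below is only an illustration of why a proceedings system might or might not cope with it; it is not a description of any conference’s actual software. The author list is a hypothetical placeholder.]

```python
import unicodedata

# The mathematical "for all" symbol (the upside-down capital A).
FOR_ALL = "\u2200"


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical author list with the symbol in first position.
authors = [FOR_ALL, "Example Author"]

# A UTF-8 pipeline stores the symbol as three bytes and reads it back intact.
utf8_bytes = ", ".join(authors).encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# An ASCII-only pipeline cannot represent it; here it degrades to "?".
ascii_fallback = FOR_ALL.encode("ascii", errors="replace")

print(describe(FOR_ALL))
print(round_trip)
print(ascii_fallback)
```

A system that stores text as UTF-8 end to end keeps the symbol unchanged, whereas one that assumes ASCII at any stage will drop or replace it, which is one plausible way a character like this could go missing on a website.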
Gretchen: There are, indeed, famously papers that have cats sometimes as a co-author. To talk about some of the specific projects that people have been working on under the whole Masakhane umbrella – Bona, I know you’ve been making a translator for Fon.
Bona: Yeah. Masakhane is a huge family. Of course, the focus is on NLP. We first started as a machine translation community, mainly, and then broadened to more topics of NLP, like speech and everything. It started with a translation engine for Fon, which is my native language widely spoken in Benin – the most spoken language if we discount the colonial language and administrative language, which is French. So, we basically started by Fon. Thankfully, we managed, and the platform’s actually already online – useful. We got thousands, already, of translation solutions – feedbacks and everything. We’ve been working on many things. The other second major thing is a speech recognition model for Fon and for Igbo. For Fon, it’s actually – I mean, it’s already working. We have a model working even though for Fon it was merged better. For Fon, we made it public, and it’s also hosted on the website of HuggingFace, so people can actually just go there, register their voices, and then have the transcription. That’s basically what I’m working on. But there are other people who have focused on the same thing, like MT, machine translation, for the languages like Yoruba and everything. There’s been recognition as well – Named Entity Recognition and Part of Speech plus tagging and everything, so some things are broad actually. Somehow, we might need to get at least a bit of each of the members. I think that’s how the strength is built.
Gretchen: It was neat. I was going on your website earlier last week and clicking around and saying, okay, I can try to use this Fon translator. I can type something in in French because, you know, I speak French, and then it presents me with this result. I have, obviously, no way of telling whether it’s a good Fon translation or not because I don’t speak it. But it looked like it was doing something, which was all that I could evaluate. That’s the whole problem because not everyone necessarily speaks every language involved there. But the thing that I did notice was that it had an option for “Is this translation incorrect?”, “Do you wanna change something?”, and presumably you’re using that to help improve the translator.
Bona: Yes. You’re definitely right. We included that to have much more translations to include more people. Because I work on the project with Chris and with two other people who are not computer scientists, who are not in the field of data science, NLP. They’re just like the linguistics people – like journalists, media communicators, and everything. So, they don’t actually know anything about those tools, but they had us checking the quality of the translation and everything. On our site, we wanted to have more data, more translations, to improve the service, but we knew we could not do it alone. We knew that it would make a huge difference involving people that speak the language. This is what made us put that form, so that people could suggest translations and score the translations. The translation score which happens to fit the – because if someone is giving a translation with a confidence score of 1 over 5, we are actually not really sure that we’ll go for it. Maybe we select it, but then we won’t give it much more attention than someone who is sure – who has given a 2.5 or 4 over 5 to the translation he suggested.
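[Editorial sketch: Bona describes contributors attaching a confidence score out of 5 to each suggested translation, with low-confidence suggestions getting less attention. One way this could be expressed in Python is shown below. The `Suggestion` fields, the `min_confidence` cutoff, and the placeholder strings are all assumptions made for illustration; the transcript does not say how the Fon platform actually stores or weights suggestions.]

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    source: str        # sentence shown to the contributor
    translation: str   # contributor's suggested translation
    confidence: float  # contributor's self-reported score, from 1 to 5


def weight(suggestion: Suggestion, max_score: float = 5.0) -> float:
    """Scale the self-reported score to the range 0..1."""
    return suggestion.confidence / max_score


def rank(suggestions, min_confidence: float = 2.0):
    """Drop suggestions below a (hypothetical) cutoff, most confident first."""
    kept = [s for s in suggestions if s.confidence >= min_confidence]
    return sorted(kept, key=weight, reverse=True)


# Placeholder data only; no real translations are reproduced here.
submitted = [
    Suggestion("Bonjour", "<suggestion A>", 1.0),
    Suggestion("Bonjour", "<suggestion B>", 4.0),
    Suggestion("Bonjour", "<suggestion C>", 2.5),
]

for s in rank(submitted):
    print(s.translation, weight(s))
```

Keeping the weight rather than discarding everything below the top score means a cautious suggestion can still be reviewed later by someone who speaks the language.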
Gretchen: Because some users could be trolls or something and not giving very good results, and then you won’t be able to filter for which of those translations are good. One of the things with machine translation and with natural language processing and stuff in general is it generally needs a whole lot of data, right, to do it. Especially for translation, you need translation pairs where you have something that’s “Here’s a bit in one language. Here’s a bit in another language that corresponds to it,” and then eventually if you have enough of these, you can try to teach the computer to produce other translations. What are the challenges with getting some of this data in the first place and keeping on doing things with it afterwards?
Jade: I mean, what’s fun about the majority of the languages we see that translation systems exist for is you can just go on the internet and scrape these parallel corpuses and do something that they call “alignment,” which tries to say, okay, these are the sentences that are parallel. Typically, people will use Wikipedia as a good example to try and source these parallel pairs. They’ll use news websites where they’ve got translations for them. That’s all good and well for languages which are well-represented on the internet. But for many of the African languages, that is simply not the case. In fact, sometimes even the keyboards, in order to type in that language, don’t even exist. Bona actually worked on one for Fon so that people could download it to use it.
Gretchen: So, it does now, but it didn’t.
Jade: Yeah, and it didn’t. Or spellcheckers don’t exist in those languages. So, there’s also been a recent project by one of our collaborators, Sabelo Mhlambi, who’s also put together a number of spell checkers for Southern African languages. So, it doesn’t exist. I think what’s really interesting when you speak to Western or people who are not used to working with low-resource languages, they contact us, and they want to find out what websites can we scrape. We have to come back and say, “No, no, you have to hire people to translate,” or “You need to go back into archives and look up stuff.”
Gretchen: You need to take three steps back and actually start at this earlier thing.
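[Editorial sketch: for well-resourced languages, Jade mentions scraping parallel text and running “alignment” to decide which sentences correspond. Real aligners are considerably more sophisticated; the Python below is a deliberately simple stand-in that pairs sentences by position and keeps only pairs whose character lengths are roughly comparable. The function name and the `max_ratio` threshold are assumptions for illustration, not part of any Masakhane tool.]

```python
def length_ratio_filter(src_sents, tgt_sents, max_ratio: float = 2.0):
    """Pair sentences by position and keep pairs of comparable length.

    This is a crude filter, not a full alignment algorithm: it assumes
    the two documents already list their sentences in the same order.
    """
    pairs = []
    for src, tgt in zip(src_sents, tgt_sents):
        if not src or not tgt:
            continue  # skip empty lines on either side
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio <= max_ratio:
            pairs.append((src, tgt))
    return pairs


english = ["Language matters.", "Thank you.", "Hello."]
french = [
    "La langue compte.",
    "Merci.",
    "Bonjour à tous et bienvenue dans cette émission de radio.",
]

for pair in length_ratio_filter(english, french):
    print(pair)
```

The third pair is dropped because the French side is far longer than the English side, which suggests the two lines are not translations of each other. A filter like this only helps once parallel text exists, which is exactly what is missing for many of the languages discussed here.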
Jade: We’ve had really fun stories of, for instance, the South African parliament got a great law that says we have to translate all of our parliamentary proceedings into as many of our 11 official languages, which apparently does happen but doesn’t get published. The only way we got a hold of that data was one of the translators for parliament happens to be on our project. She was like, “I can get the data. I can even get permission. You just have to help me with this really annoying tool that I can’t figure out to get it out of there.”
Gretchen: Because they’re all in PDFs or something annoying?
Jade: They’re in PDFs. Sometimes they’re in DOCXs. You have to download them from some weird FTP. You can only do it in tiny batches because otherwise the system falls over. It’s like trying to get it, and they’re like, “You can have it,” but it’s like, “But how? How do we get to it?” We’ve had a couple of participants get really involved and saying, you know, they wanna really dictate what gets translated. So, not just what’s already been translated; they wanna translate or use translations – they wanna capture any data about their language in question, and they wanna decide what goes in. They don’t want it to be what’s typically available, which is – religious texts are very widely translated. That’s the majority of what we have. They go into their communities. They sit with their grandmother on a Sunday, drink coffee, and work together. She’ll translate from English into your language and write it up, and then we’d collect those pieces of paper. Someone would scan them in. Someone would then type it up using their own keyboard if they can get a hold of it. There’s quite a lengthy process, which is part preservation and part totally can be used to build tools that can facilitate learning these languages. If we don’t do that, then the models do very strange things.
Gretchen: Well, I imagine religious texts, most people don’t talk like a religious text in their ordinary life.
Jade: No, they don’t.
Gretchen: The language model only understands what it’s already seen. You could try to say something pretty normal, and it would make it sound like a weird prophecy or something.
Jade: We do that over the – in one of our first papers, we did an analysis. Someone said, “Oh, we need to translate COVID-19 surveys. You should use machine translation.” We said, “That’s gonna be a fun evaluation exercise.” There’s no way you can use anything that comes out of it. But it was a great demonstration to show the failures of this process.
Gretchen: Because all of this medical vocabulary is not gonna be in the religious text that you’re using.
Jade: Yeah. For instance, Canada got translated into “Canaan.” Things became biblified.
Gretchen: Like, Canaan, oh no. [Laughter]
Jade: That would keep happening, so we got a lot of that. We used that as an evaluation exercise, so we’d have a native speaker come and correct whatever the machine did. It was quite a lot for us but also a little bit horrifying if you’re thinking that people are actually training these models without knowing how good or bad they are because they don’t have a team who understands it.
Gretchen: I imagine giving someone this advice, you know, “Save yourself from the plague of locusts” or something “by going to Canaan,” and you’re like, “Wait. No. There’s got to be something going on here, but I don’t know what it is.”
Jade: “What are you trying to tell me, computer? Why are you telling me this?”
Gretchen: Now you must have a bunch of translated data from COVID announcements and the pandemic and things like that because a lot of those did get produced in lots of languages.
Jade: I think some of them we do have. I don’t think we’ve done a recent gathering of them. We probably should and just kind of bolster our training sets. I think we’re still overwhelmingly dominated by religious text, but we’re sorting that out. There’s a number of funds that have come through, and they want to – for example, Bona was talking earlier about these Zindi competitions. Zindi is a data science platform where someone will release a data set or a challenge, and programmers will go on – or data scientists will go on – and participate in this competition. Zindi is an African data science platform. One of our close collaborators was running a competition that was I think funded by UNESCO. The goal there was to fund creation of data sets, which was really helpful. Now we have data sets that are being created, which means people go get things annotated. They work with journalists to get things written. We knew people who were working on annotating and translating their masters theses or a collection of poems they’d written to do this. There’re some funds that are available to do so. So, instead of scraping, we go and create more content and try and spur that into the world so that we have more.
Gretchen: That’s really neat. Of course, as Bona was just saying, once you can put some sort of machine translation or something up then you can also get people to contribute directly to rating it and adding new things.
Jade: Exactly. Which is what we’ve done. We actually recently deployed translate.masakhane.io where we have a number of English to about six or seven African languages. We actually have 38 models that currently exist, we just haven’t deployed them all. Right now, it only goes English into the African languages, but we’ll swap that around, so it does from African languages to English, too, and then also from African language to African language. It’s also this idea of extending participation from the Masakhane group, which is quite large, but extending that to as many people across the continent who’d like to participate and add translations.
Gretchen: That’s great. I guess there’s sort of, at the individual levels, a bunch of people are still gonna keep making projects underneath that. Bona, do you wanna talk about where you see things going? What’s next for you and what else you’re working on?
Bona: Okay. What’s next? I think it depends. I think already the bigger and the main goal of the family or the community is already, I mean, being step-by-step, slowly and steadily achieved. People know more now about African languages. People are more conscious about it, about the challenges. We’re seeing more tools are created. African languages have much more visibility now. I think it was hard to even find a paper that was talking about Fon because it simply did not exist. But actually, thanks to the Masakhane community, Chris, and I, and everything, there are not a lot but at least already more than two, three, and everything. On the personal side, Jade mentioned African to African, which is good to me and also to Chris. We started with an MT, machine translation. Now, we’re done, and we’re still working on speech. That’s basically, like, an MT, speech, and maybe we’ll do that cycle and actually more to something else. We also talk about organisation to improve the translation quality. So, me, I’m very interested in a future where, for instance, we could have an assistant with which we could just talk in Fon. Instead of having Siri or Alexa to which we are speaking French or English, we could have those assistants to which we are speaking our languages. That would be fun.
Gretchen: Yeah, that would be great.
Bona: That would actually give a whole more sense to everything we are working on – having those language models that we are training integrated to mobile phones. On the keyboard, we are building all the spellcheckers that Jade mentioned. Those being integrated to every daily technology – phones, laptops, and everything – I think that would be a milestone. I don’t know what would come after that, but I think that would be already a very great step.
Gretchen: By the time you get there, you could notice other things that you want as well.
Bona: Yes, definitely. One step at a time, like they used to say.
Gretchen: Absolutely. Masakhane won an award from Wikimedia for a lot of this work in this area.
Bona: Yes, Masakhane did. We all did. [Laughs]
Jade: It was really cool because Bona and Chris accepted the award directly from Jimmy Wales himself and gave beautiful acceptance speeches. I think they’re somewhere on the internet if anyone wants to go listen. They’re great.
Gretchen: Oh, we’ll find the links for those.
Jade: Yeah, they’re lovely.
Gretchen: If somebody says, “Wow, this is really cool! I hadn’t thought about this before. I wanna learn more about this project,” or “I wanna get more involved with this project,” where would you suggest they go?
Bona: First of all, I think, of course, I would invite them to come to the community first. I think people will benefit a lot from it because we have weekly mentoring sessions studying NLPs, weekly meetings. People are working on very cool and interesting projects. It doesn’t matter if they are beginners, if they are computer scientists, or whatever. Everybody is taken like a baby, and then we grow step-by-step together. Then already when they join, they can now have their own interest. Depending on their interest, they now can move to anything they want. We can check the papers written by Jade and Laura or maybe the paper I wrote with Chris, the papers the community writes, like the paper for which we got the Wikimedia award – the Paper of the Year award. I mean, there are a lot of things, actually, you can do. Thankfully with our work, not to be too much showing off with our work, there’re actually much more, already, accounts and much more publications, much more resources, that can be found online for anybody that wants to start working on African low-resource languages.
Gretchen: Awesome. We’ll link to some of those in the show notes, so if people wanna follow a link to the Masakhane website or to other places. Jade, any resources that you’d leave people with?
Jade: I would recommend anyone who’s interested to get a hold of a book by a Kenyan author Ngũgĩ wa Thiong'o called Decolonising the Mind: The Politics of Language in African Literature. This book is probably one of the most impactful books for African language even though he wasn’t a linguist. He was just an author. He was an author who, after writing this, chose to only write in his native language of Gikuyu. It situates one of the big Masakhane beliefs which is – I’ll actually, if I may, read a quote, where he says, “Language carries culture. Culture carries, particularly through art or literature, the entire body of values by which we come to perceive ourselves and our place in the world. How people perceive themselves affects how they look at their culture, at their politics, and at their social production of wealth, and their entire relationship to nature and to other beings. Language is, thus, inseparable from ourselves as a community of human beings with a specific form and character, a specific history, a specific relationship to the world.” This is what Masakhane calls for is this idea of I think what happens a lot in language technologies here is language is seen as data, and data is just seen as something you put in and out of a machine, whereas only if we start acknowledging language for what it is, a representation of human beings and human culture, then we can actually start building tools that actually aid humanity rather than just exploit. It’s also how we propose solving this low-resourced issue that’s facing African languages. Yeah, I would recommend this book to anyone. It’s basically a series of essays around his thoughts around African literature and playwrights and stories.
Gretchen: That sounds great. I’m gonna have to check it out.
Jade: It’s great.
Gretchen: That almost brings us to a close. But we always like to give our guests one more chance to say, “If you could leave people knowing one thing about language or about the problems that you work on, what would that be?”
Bona: I would just say that – I forgot the name of the author. It’s a lady who said that – I don’t really remember word-for-word – but what she was saying was basically that working on language is much more complicated than any other parallel that we could imagine. Like Jade was saying from wa Thiong'o, language carries a lot of things with itself. It carries the principles. It carries the values. It carries the culture. It carries heritage. It carries a lot of things that altogether give an identity to a human being – a group of human beings. I would say language is very important. Language matters. If our bigger – I mean, our bigger vision of putting Africa on the NLP map but more generally putting, let me say, AI in the heart of Africa’s development, then it’s a necessary and compulsory start by giving value to our languages by promoting our cultures and everything. Because if we don’t do it, nobody else will do it for us. I think we’ve been waiting and waiting for too long. It’s time for us to get things going by ourselves.
[Music]
Gretchen: For more Lingthusiasm and links to all the things mentioned in this episode, go to lingthusiasm.com. You can listen to us on Apple Podcasts, Google Podcasts, Spotify, SoundCloud, YouTube, or wherever else you get your podcasts. You can follow @Lingthusiasm on Twitter, Facebook, Instagram, and Tumblr. You can get IPA scarves, “Not Judging Your Grammar, Just Analysing It” t-shirts, and other Lingthusiasm merch at lingthusiasm.com/merch. I can be found as @GretchenAMcC on Twitter, my blog is AllThingsLinguistic.com, and my book about internet language is called Because Internet. Lauren tweets and blogs as Superlinguo. The Masakhane Initiative can be found at masakhane.io. That’s M-A-S-A-K-H-A-N-E dot I-O and as @MasakhaneNLP on Twitter. Have you listened to all the Lingthusiasm episodes, and you wish there were more? You can get access to over 50 bonus episodes to listen to right now at patreon.com/lingthusiasm or follow the links from our website. Patrons also get access to our Discord chatroom to talk with other linguistics fans and other rewards as well as helping keep the show ad-free. Recent bonus topics include talking to babies, Pokémon names, and our liveshow episode on backchanneling. Can’t afford to pledge? That’s okay, too. We also really appreciate it if you’d recommend Lingthusiasm to anyone who needs a little more linguistics in their life. Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, and our music is “Ancient City” by The Triangles.
Bona: Stay lingthusiastic!
Jade: Stay lingthusiastic!
[Music]
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
A Focus on Machine Translation for African Languages
Masakhane is an interesting project to make machine translation more available in languages of Africa. From the description:
We need African researchers from ACROSS the continent to join our effort in building translation models for African languages. Masakhane means "We Build Together" in isiZulu and was inspired by the Deep Learning Indaba theme for 2018.
Phase 1
We want to develop baseline machine translation models from English-to-Target African languages, with publicly available code and data. This problem is part data gathering, part developing the translation model, and part error analysis to understand what issues the models have.
Phase 2
Once we, as a collaborative African NLP team, have trained baseline translation models for many of our languages, we combine our datasets and do transfer learning with fine-tuning across the languages.
Phase 3
Write & submit a paper, with all of our work, to a top-tier NLP conference and in doing so, once and for all put Africa on the NLP map.