Discover Top Posts Tagged with #corpora


Provide feedback now on a “Policy on Conduct” to the Corporate Policies of SCA, Inc. suggested by the Society President to aid transparency

Use your words, folks. Let them know what you think.

#mysca #society for creative anachronism #sca #corpora #transparency

Personally I believe the world would benefit greatly from the existence of corpora from fanfiction.

Imagine a corpus that draws its data from every single ao3 fanfiction.

The number of occurrences for the most painfully specific words would be off the charts.

The collocations would be insane.

I am haunted by the possibility of a corpus in which the word "oh" occurs a million times and half of them are followed by another "oh".

Delightful
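For anyone curious how the "oh oh" count would actually be measured, here is a minimal sketch using only the Python standard library. The sample sentence is made up, the tokenizer is deliberately crude, and pairs are counted across sentence boundaries, which a real corpus tool might not do.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def bigram_counts(tokens):
    # Count each pair of adjacent tokens (a very simple collocation measure).
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical stand-in for a (much larger) fanfiction corpus.
sample = "Oh. Oh, he thought. Oh no. Oh, oh, oh."

tokens = tokenize(sample)
word_counts = Counter(tokens)
pairs = bigram_counts(tokens)

oh_total = word_counts["oh"]        # occurrences of "oh"
oh_then_oh = pairs[("oh", "oh")]    # "oh" immediately followed by "oh"

print(f'"oh" occurs {oh_total} times; {oh_then_oh} are followed by another "oh"')
```

Raw adjacent-pair counts are only the starting point; corpus tools usually rank collocations with an association score rather than plain frequency.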

#corpus #corpus linguistics #linguistics #linguistic corpus #corpora #ao3 #fanfiction #archive of our own #text corpus

Betty Tompkins Ousted from Instagram for Posting Classic Painting of Penetration

#betty tompkins #corpora

Lingthusiasm Episode 61: Corpus linguistics and consent - Interview with Kat Gupta

If you want to know what a particular person, era, or society thinks about a given topic, you might want to read what that person or people have written about it. Which would be fine if your topic and people are very specific, but what if you’ve got, say, “everything published in English between 1800 and 2000” and you’re trying to figure out how the use of a particular word (say, “the”) has been changing? In that case, you might want to turn to some of the text analysis tools of corpus linguistics -- the area of linguistics that makes and analyzes corpora, aka collections of texts.
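As a rough illustration of the "how has this word's use been changing" question, here is a minimal sketch that groups dated texts by decade and reports a word's frequency per million tokens. The function name `per_million_by_decade` and the three tiny documents are invented for the example; real diachronic corpora are far larger and come with their own query tools.

```python
import re
from collections import defaultdict

def per_million_by_decade(docs, word):
    """Return {decade: occurrences of `word` per million tokens}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        decade = year // 10 * 10
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(word)
    # Normalize so decades with different amounts of text are comparable.
    return {d: hits[d] / totals[d] * 1_000_000 for d in sorted(totals) if totals[d]}

# Hypothetical (year, text) pairs standing in for a dated corpus.
docs = [
    (1805, "The cat sat on the mat"),
    (1809, "A dog barked"),
    (1995, "The end"),
]

rates = per_million_by_decade(docs, "the")
for decade, rate in rates.items():
    print(decade, round(rate))
```

Normalizing by total tokens matters because a raw count would mostly reflect how much text survives from each period, not how the word was used.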

In this episode, your host Gretchen McCulloch gets enthusiastic about corpus linguistics with Dr Kat Gupta, a lecturer in English Language and Linguistics at the University of Roehampton in London, UK. We talk about how Kat’s interests changed along their path in linguistics, what to think about when pulling together a bunch of texts to analyze, and two of Kat’s cool research projects -- one using a corpus of newspaper articles to analyze how people perceived the various groups within the suffrage movement, and one about what we can learn about consent from their 1.4 billion-word corpus of online erotica.

Click here for a link to this episode in your podcast player of choice or read the transcript here

Announcements:

There's just under two weeks left to sign up for the Lingthusiastic Sticker Pack! Become a Ling-phabet patron or higher by November 3, 2021 (anywhere on earth) and we'll send you a pack of four fun Lingthusiasm-related stickers! Plus, if we hit our stretch goal, that'll also include the two bouba and kiki stickers below for all sticker packs. Tea and scarf, sadly, not included, but the usual tier rewards of IPA wall of fame tile and Lingthusiast sticker are. (That could be seven stickers!)

In this month’s patron bonus episode, Lauren and Gretchen get enthusiastic about improving linguistics content on Wikipedia! We talk about gaps and biases that still exist for linguistics-related articles, getting started with Wikipedia edit-a-thons for linguists (#lingwiki) in 2015, how Wikipedia can fit into academia (from wiki journals to classroom editing assignments), and the part that Wikipedia played in the Lingthusiasm origin story. To access this and 55 other bonus episodes, join the Lingthusiasm patreon.

Here are links mentioned in this episode:

Kat Gupta’s website

Kat Gupta on Twitter

Wikipedia entry for WordSmith Software

Lexically

Aimee Bailey’s work on homonormativity in queer women’s media

Response and responsibility: Mainstream media and Lucy Meadows in a post-Leveson context

Representation of the British Suffrage Movement

British National Corpus

Corpus of Contemporary American English

Lingthusiasm sticker pack special offer

You can listen to this episode via Lingthusiasm.com, Soundcloud, RSS, Apple Podcasts/iTunes, Spotify, YouTube, or wherever you get your podcasts. You can also download an mp3 via the Soundcloud page for offline listening.

To receive an email whenever a new episode drops, sign up for the Lingthusiasm mailing list.

You can help keep Lingthusiasm ad-free, get access to bonus content, and more perks by supporting us on Patreon.

Lingthusiasm is on Twitter, Instagram, Facebook, and Tumblr. Email us at contact [at] lingthusiasm [dot] com

Gretchen is on Twitter as @GretchenAMcC and blogs at All Things Linguistic.

Lauren is on Twitter as @superlinguo and blogs at Superlinguo.

Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our production editor is Sarah Dopierala, our production manager is Liz McCullough, and our music is ‘Ancient City’ by The Triangles.

This episode of Lingthusiasm is made available under a Creative Commons Attribution Non-Commercial Share Alike license (CC 4.0 BY-NC-SA).

#language #linguistics #lingthusiasm #episode 61 #podcast #episodes #interviews #Kat Gupta #corpus linguistics #consent #corpus #corpora #suffragette #suffragists #English #history #interview

Linguists worry judges aren’t using it appropriately.

The federal judge who overturned the mask mandate on planes based her decision in part on a search for the word “sanitation” in the Corpus of Historical American English.

Here's a neat article discussing this episode, and the use of corpus linguistics in law more generally.

#linguistics #language #law #textualism #corpus linguistics #corpora

A series of intimate conversations could teach an AI to understand both language and culture.

An interesting article about the social process of creating machine translation datasets in Khoekhoegowab. Excerpt:

On the surface, Wilhelmina Ndapewa Onyothi Nekoto and Elfriede Gowases seem like a mismatched pair. Nekoto is a 26-year-old data scientist. Gowases is a retired English teacher in her late 60s. Nekoto, who used to play rugby in Namibia’s national league, stands about a head taller than Gowases, who is short and slight. Like nearly half of Namibians, Nekoto speaks Oshiwambo, while Gowases is one of the country’s roughly 200,000 native speakers of Khoekhoegowab.

But the women grew close over a series of working visits starting last October. At Gowases’s home, they translated sentences from Khoekhoegowab to English. Each sentence pair became another entry in a budding database of translations, which Nekoto hopes will one day power AI tools that can automatically translate between Namibia’s languages, bolstering communication and commerce within the country.

“If we can design applications that are able to translate what we’re saying in real time,” Nekoto says, “then that’s one step closer toward economic [development].” That’s one of the goals of the Masakhane project, which organizes natural language processing researchers like Nekoto to work on low-resource African languages.

Compiling a dataset to train an AI model is often a dry, technical task. But Nekoto’s self-driven project, rooted in hours of close conversation with Gowases, is anything but. Each datapoint contains fragments of cultural knowledge preserved in the stories, songs, and recipes that Gowases has translated. This information is as crucial for the success of a machine translation algorithm as the grammar and syntax embedded in the training data.

Read the whole thing.

#linguistics #khoekhoegowab #namibia #khoekhoe #masakhane #natural language processing #nlp #nlproc #machine translation #corpora #parallel corpora #ai #data #low-resource languages #low resource languages #digitally disadvantaged languages #data science

Corpus

A lot of people who are interested in procgen get intimidated by the vocabulary that gets thrown around. I think that’s a pity, particularly since some of the concepts are pretty simple once you get past the initial difficulty.

One term that gets thrown around a lot is “corpus”. In procgen, this just means “a collection of stuff that we use as data for the generator”. Most often, this is used in the context of text: a corpus of words for a Tracery grammar, or a training corpus for a chatbot.

A corpus is useful for more than just text: a building generator might have a corpus of 3D models of architectural elements, and a music generator might have a corpus of motifs.

Corpora show up in a lot of places. The Corpora project is a repository of a number of small corpora of texts: colors, books, rivers, etc. Last year at Roguelike Celebration, everest pipkin gave a talk about curating your own personal corpora of text.
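To make the idea concrete, here is a minimal sketch of a text generator fed by a small corpus. It is not Tracery itself, just a hand-rolled expander that borrows the `#symbol#` placeholder style; the word lists, the `origin` rule, and the `expand` function are all made up for illustration.

```python
import random
import re

# Hypothetical mini-corpus: plain lists of words, grouped by kind.
corpus = {
    "color": ["crimson", "teal", "ochre"],
    "river": ["Danube", "Mekong", "Zambezi"],
}

# The grammar is one template rule plus the corpus lists.
grammar = {
    "origin": ["The #color# banks of the #river#."],
    **corpus,
}

def expand(symbol, grammar, rng):
    # Pick one option for the symbol, then recursively fill any #placeholders#.
    template = rng.choice(grammar[symbol])
    return re.sub(r"#(\w+)#", lambda m: expand(m.group(1), grammar, rng), template)

rng = random.Random(0)  # seeded so repeated runs give the same sentence
sentence = expand("origin", grammar, rng)
print(sentence)
```

Swapping in a bigger or stranger word list changes the output without touching the generator, which is most of what people mean when they talk about curating a corpus.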

#corpus #corpora #tracery #procgen #procedural generation #text generation

corpora aka god’s gift to socially awkward linguists

#linguistics #corpora

#corpora
