Our human understanding of coherence derives from our ability to recognize interlocutors’ beliefs and intentions within context. That is, human language use takes place between individuals who share common ground and are mutually aware of that sharing (and its extent), who have communicative intents which they use language to convey, and who model each others’ mental states as they communicate. As such, human communication relies on the interpretation of implicit meaning conveyed between individuals. The fact that human-human communication is a jointly constructed activity is most clearly true in co-situated spoken or signed communication, but we use the same facilities for producing language that is intended for audiences not co-present with us (readers, listeners, watchers at a distance in time or space) and in interpreting such language when we encounter it. It must follow that even when we don’t know the person who generated the language we are interpreting, we build a partial model of who they are and what common ground we think they share with us, and use this in interpreting their words.
Text generated by an LM [language model] is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that. This can seem counter-intuitive given the increasingly fluent qualities of automatically generated text, but we have to account for the fact that our perception of natural language text, regardless of how it was generated, is mediated by our own linguistic competence and our predisposition to interpret communicative acts as conveying coherent meaning and intent, whether or not they do. The problem is, if one side of the communication does not have meaning, then the comprehension of the implicit meaning is an illusion arising from our singular human understanding of language (independent of the model). Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Emily M. Bender, Timnit Gebru, Angela McMillan-Major, Shmargaret Shmitchell [cheeky alias], & 3 others suppressed by Google.
There are two different perspectives from which one can look at the progress of a field. Under a bottom-up perspective, the efforts of a scientific community are driven by identifying specific research challenges. A scientific result counts as a success if it solves such a specific challenge, at least partially. As long as such successes are frequent and satisfying, there is a general atmosphere of sustained progress. By contrast, under a top-down perspective, the focus is on the remote end goal of offering a complete, unified theory for the entire field. This view invites anxiety about the fact that we have not yet fully explained all phenomena and raises the question of whether all of our bottom-up progress leads us in the right direction.
There is no doubt that NLP [natural language processing] is currently in the process of rapid hill-climbing. Every year, states of the art across many NLP tasks are being improved significantly—often through the use of better pretrained LMs [language models]—and tasks that seemed impossible not long ago are already old news. Thus, everything is going great when we take the bottom-up view. But from a top-down perspective, the question is whether the hill we are climbing so rapidly is the right hill. How do we know that incremental progress on today’s tasks will take us to our end goal, whether that is “General Linguistic Intelligence” (Yogatama et al., 2019) or a system that passes the Turing test or a system that captures the meaning of English, Arapaho, Thai, or Hausa to a linguist’s satisfaction?
It is instructive to look at the past to appreciate this question. Computational linguistics has gone through many fashion cycles over the course of its history. Grammar- and knowledge-based methods gave way to statistical methods, and today most research incorporates neural methods. Researchers of each generation felt like they were solving relevant problems and making constant progress, from a bottom-up perspective. However, eventually serious shortcomings of each paradigm emerged, which could not be tackled satisfactorily with the methods of the day, and these methods were seen as obsolete. This negative judgment— we were climbing a hill, but not the right hill—can only be made from a top-down perspective. We have discussed the question of what is required to learn meaning in an attempt to bring the top-down perspective into clearer focus.
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, Emily M. Bender & Alexander Koller. DOI: 10.18653/v1/2020.acl-main.463 [italics in the original]
Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin. Findings of the Association for Computational Linguistics:
In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages.
Our approach consists of a lossless compressor, a compressor-based distance metric, and a k-Nearest-Neighbor classifier. Lossless compressors aim to represent information using as few bits as possible by assigning shorter codes to symbols with higher probability. The intuition of using compressors for classification is that (1) compressors are good at capturing regularity; (2) objects from the same category share more regularity than those from different categories.
Being parameter-free, our method doesn’t rely on GPU force but CPU resources only. Thus, it does not bring negative environmental impacts revolving around GPU. In terms of overgeneralization, we conduct our experiments on both in-distribution and out-of-distribution datasets, covering six languages. As compressors are data-type agnostic, they are more inclusive to datasets, which allows us to classify low-resource languages like Kinyarwanda, Kirundi, and Swahili and to mitigate the underexposure problem.
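The three components named in the excerpt above (a lossless compressor, a compressor-based distance, and a k-nearest-neighbor classifier) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. It assumes Normalized Compression Distance as the distance metric, which the excerpt does not name, and it joins the two texts with a space before compressing them. The function names and the toy training data are hypothetical.

```python
import gzip
from collections import Counter


def compressed_len(text: str) -> int:
    """Size in bytes of `text` after gzip compression."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: small when x and y share regularity."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)  # joining with a space is an assumption
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k training texts nearest to `query`."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data, invented for illustration only.
train = [
    ("the striker scored a late goal and the team won the match", "sports"),
    ("the goalkeeper saved a penalty in the final match", "sports"),
    ("simmer the sauce, then add garlic, basil and chopped tomatoes", "cooking"),
    ("bake the bread until the crust is golden and crisp", "cooking"),
]
print(knn_classify("the team won the final match with a late penalty", train, k=1))
```

Nothing here is trained: the only "model" is the compressor, so the same code runs unchanged on text in any language or script, which is the property the excerpt highlights for low-resource languages.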
Publications talking about the application of large LMs to meaning-sensitive tasks tend to describe the models with terminology that, if interpreted at face value, is misleading. Here is a selection from academically-oriented pieces (emphasis added):
(1) In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task. (Devlin et al., 2019)
(2) Using BERT, a pretraining language model, has been successful for single-turn machine comprehension . . . (Ohsugi et al., 2019)
(3) The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. (Petroni et al., 2019)
If the highlighted terms are meant to describe human-analogous understanding, comprehension, or recall of factual knowledge, then these are gross overclaims. If, instead, they are intended as technical terms, they should be explicitly defined.
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, Emily M. Bender & Alexander Koller. DOI: 10.18653/v1/2020.acl-main.463
Interview with Jon Dehdari, computational linguist, part 1/2
Today we talk to Jon Dehdari, a well-known computational linguist from Saarland University. Jon, let me ask you to introduce yourself: what did you study before, how did you reach the point you are at now?
Hi, sure, I’m Jon Dehdari. I’m doing a post-doc here at Saarland University and DFKI, Germany. And before that I was working on a PhD at Ohio State University in the US. I was working on different kinds of NLP-related topics. I started out working on parsing and formal analysis of syntax as well. And then I drifted into statistical NLP, machine translation, and then to neuroscience-informed NLP, I guess.
I worked with professor William Schuler on an EEG-inspired language model. EEG is a way of studying how the brain is working at a large, macroscopic level.
The timing is very fast, so you are able to see in real time how the brain is processing language and other things that are happening.
Well, it’s a wide range of topics – parsing, syntax, brain processing from an EEG point of view. But do you have a main topic of interest now?
Machine translation and language modeling are the two main areas I’m currently interested in. Language modeling (LM) is a general field within NLP and related areas. It’s basically just modeling language, and typically people think of LM as incremental language modeling. So, within the context of a sentence, given the previous few words, it tries to predict what word will come next. That is not necessarily the only use of language modeling, but it is a very common scenario.
So, for both machine translation and speech recognition you have input coming in incrementally and in real time, and you want to know what word somebody spoke into the microphone or how a given word or phrase should be translated. So typical decoders for machine translation or speech recognition work incrementally - that’s the most efficient way of dealing with those inputs.
What languages do you speak? It’s very important for a computational linguist to have a background in several languages.
That’s true. So, English is my first language, I am also fluent in Spanish, and I have studied Persian and Arabic extensively, and also German. I’ve studied five or ten other languages to a lesser extent. Enough to get into trouble, but not enough to get out of trouble, I would say :)
I do have a linguistics background and some computer science background as well. I have an appreciation and fascination with human languages. I am always trying to learn new languages. Obviously, to acquire complete fluency in any language you need years and years of practice.
Let’s talk about your PhD work. What was the main problem, the proposed solution, main challenges, in your opinion?
I was initially interested in an unsupervised parser. Parsing is the task of annotating a sentence with a syntactic structure, often in the form of a constituency tree. I was trying to find phrases and phrases of those phrases recursively, or trying to figure out what dependency relationships exist between words in a sentence. Usually that is done by learning from a treebank, an existing dataset of annotated syntactic trees. But I was interested in whether it is possible to learn those structures and relationships without the use of labeled data. There is plenty of labeled data for English and a few other languages, but for the vast majority of the world’s languages there is simply no labeled data, or very little of it. I am interested in this partly from the theoretical perspective of NLP. But also, as a linguist, I am interested in developing algorithms for all of the world’s languages or for most of them. So, unsupervised learning is an obvious choice of machine learning paradigm for working with all the world’s languages.
As I worked more and more with unsupervised parsing and unsupervised grammar induction, it became apparent that this information wasn’t that useful for a machine translation decoder, or was only marginally useful. There are a lot of different ways machine translation goes wrong, and any given approach will have only a small impact. So I became less interested in unsupervised parsing and more interested in language modeling, which is a related area with a slightly different emphasis. N-gram language modeling is and has been a common technique for modeling language, where you simply take the previous few words and base the probability of the next word on them. But it requires the previous few words to occur in your training set (the data you used to train your language model) exactly as they are. That does not work very well for free word order, which is quite common in morphologically rich languages. An n-gram-based language model works well for English and for other languages with fixed word order. That’s why I was interested in working with more morphologically rich languages.
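The n-gram idea described in this answer, and its weakness with unseen word orders, can be made concrete with a small count-based sketch. It is a toy illustration under simple assumptions: plain maximum-likelihood counts with no smoothing or back-off, a made-up eleven-word corpus, and hypothetical function names. It is not the language model discussed in the interview.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n=3):
    """Count which word follows each (n-1)-word history in `tokens`."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        history = tuple(tokens[i:i + n - 1])
        counts[history][tokens[i + n - 1]] += 1
    return counts


def next_word_prob(counts, history, word):
    """Maximum-likelihood P(word | history); None if the history was never seen."""
    followers = counts.get(tuple(history))
    if not followers:
        return None
    return followers[word] / sum(followers.values())


# Made-up corpus, for illustration only.
corpus = "the dog chased the cat and the dog chased the ball".split()
model = train_ngram(corpus, n=3)

# History seen in training: "chased the" is followed once by "cat", once by "ball".
print(next_word_prob(model, ["chased", "the"], "cat"))   # 0.5
# Same two words in a different order: the model has no estimate at all.
print(next_word_prob(model, ["the", "chased"], "cat"))   # None
```

The second query uses the same two words in a different order and gets no estimate, because the model only knows histories that occurred verbatim in training. In a language with freer word order, many perfectly natural histories fall into that second case.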
At that time neural language models were very new and were extremely slow. As more and more data for modeling became available, it became apparent to me that I should first develop a language model for these languages that could accommodate large amounts of data. I was interested in working within the constraints of a language model that was fast, didn’t require any manual annotations, and could make use of a longer history than just the previous three or four words. I developed a language model that unified a lot of existing techniques and also extended them, so that it was both fast and accommodated freer word order as well. It was also inspired by the way that the brain processes semantic information at an incremental level. There is an event-related potential (ERP), which is an activity in the brain, called the N400.
What we see is that in a particular part of the brain, usually on the left side, in the middle, whenever we hear part of a sentence and then hear another word that is semantically incongruent with the words preceding it, we’re surprised a little bit to hear that word. What happens is that the voltage becomes more negative than is typical around 0.4 s after we hear that word coming in.
The classic example is “A sparrow is not a ... boat” or “A sparrow is not a ... car”. When I say “a sparrow is not a car”, that is logically true, but typically when we hear the word “sparrow”, we expect some other animal-type word to follow, or something that would be semantically associated with it. And when we hear the word “car”, it’s surprising to a certain extent, and the N400 would spike in that context. People have looked at the N400 from a variety of different angles and tried to pin down when and how it happens, and in what contexts it does and doesn’t occur. What people found was that the more words we hear in a given sentence, the less surprised we are going to be, especially if all the preceding history is semantically congruent. But at the beginning of the sentence we don’t know what the first word of that sentence will be. So the history can help to eliminate some of the surprisal in the input. I developed a language model that reflects those patterns in the N400.
As more and more history comes in, the less surprisal there is going to be, especially if the words are congruent with the preceding words.
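Surprisal has a standard information-theoretic definition: the surprisal of a word is -log2 P(word | history), measured in bits. The sketch below estimates it from relative frequencies in a tiny made-up corpus, to show how conditioning on more history can lower the surprisal of a congruent word. It is only an illustration under those assumptions. It is not the language model described in the interview, it makes no claim about the N400 itself, and the corpus and function names are hypothetical.

```python
import math

# Made-up corpus, for illustration only.
corpus = "a sparrow is a bird . a robin is a bird . a car is a vehicle .".split()


def count(seq):
    """Number of times the token sequence `seq` occurs contiguously in the corpus."""
    seq = tuple(seq)
    n = len(seq)
    return sum(1 for i in range(len(corpus) - n + 1) if tuple(corpus[i:i + n]) == seq)


def surprisal(word, history=()):
    """Surprisal in bits, -log2 P(word | history), from relative frequencies."""
    history = tuple(history)
    denom = count(history) if history else len(corpus)
    num = count(history + (word,))
    if num == 0:
        return math.inf  # never observed after this history
    return math.log2(denom / num)


# The same word becomes less surprising as more congruent history is supplied.
for history in [(), ("a",), ("is", "a"), ("sparrow", "is", "a")]:
    print(history, round(surprisal("bird", history), 3))

# An incongruent continuation was never observed, so its surprisal is unbounded.
print(surprisal("car", ("sparrow", "is", "a")))
```

In this toy corpus the surprisal of “bird” falls steadily as the history grows, while “car” after “sparrow is a” was never observed. A real model would smooth its estimates so that unseen continuations get a large but finite surprisal.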
The second part will be published soon and will tell you more about the N400 and neural networks applied to machine translation.