Notes about Week 2 videos
A span of +/- 5 words is the most useful window for finding collocates
Minimum frequency threshold: 10 co-occurrences
Take sentence boundaries into account?
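A minimal sketch of collocate counting with a word span and a minimum frequency threshold. The function name collocates and the toy sentences are made up for illustration; the defaults follow the values noted above. Each sentence is scanned separately, which is one possible way of respecting sentence boundaries.

```python
from collections import Counter

def collocates(sentences, node, span=5, min_freq=10):
    """Count words within +/- span tokens of each occurrence of node.

    sentences is a list of token lists. Each sentence is scanned on its
    own, so a window never crosses a sentence boundary.
    """
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != node:
                continue
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    # Keep only collocates that reach the minimum co-occurrence threshold.
    return {word: c for word, c in counts.items() if c >= min_freq}

# Toy data (hypothetical), with a small span and threshold to suit it.
sentences = [
    ["smoking", "can", "cause", "serious", "damage"],
    ["stress", "may", "cause", "damage", "too"],
]
result = collocates(sentences, "cause", span=2, min_freq=2)
print(result)
```

To ignore sentence boundaries instead, the whole corpus could be passed as a single token list.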
Mutual information: how often two words co-occur, relative to how often they occur without one another.
Mutual information may not be very accurate for low-frequency data
Other measures are available that can give different results: for example the Dice coefficient
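A rough sketch of the two measures, assuming the common textbook formulations: MI as log2 of observed over expected co-occurrence, and Dice as 2 * f(xy) / (f(x) + f(y)). The videos may use a variant (for instance one adjusted for span size), and all the frequencies below are invented.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """log2 of observed co-occurrences over the count expected by chance.

    f_xy: co-occurrences of the two words; f_x, f_y: their individual
    frequencies; n: corpus size in tokens.
    """
    expected = f_x * f_y / n
    return math.log2(f_xy / expected)

def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * co-occurrences / sum of the two frequencies."""
    return 2 * f_xy / (f_x + f_y)

N = 1_000_000  # hypothetical corpus size

# Two words that each occur once, together: very little evidence.
mi_rare = mutual_information(1, 1, 1, N)
# A frequent pair that co-occurs 5 times more often than chance predicts.
mi_frequent = mutual_information(1000, 10_000, 20_000, N)

print(mi_rare, mi_frequent)
print(dice(1000, 10_000, 20_000))
```

The single co-occurrence scores far higher on MI than the well-attested pair, which is one reason to apply a minimum frequency threshold before ranking by MI.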
Colligation: a word collocates with a particular grammatical class
Semantic preference: collocation with words in a specific semantic family
Discourse prosody characterizes the speaker's attitude. Example: the verb 'cause', which is very often associated with negative events.
Keyword list: produced by comparing 2 frequency lists
Selecting the most 'key' keywords: consider only the top 10, 50 or 100 keywords; apply a minimum frequency (e.g. a word must occur at least 20 times to be a keyword); or consider only well-distributed words (e.g. words occurring in at least 20 texts of the corpus)
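A sketch of keyword selection from two frequency lists with the three filters above. The notes do not name a keyness statistic, so a simple ratio of relative frequencies (with add-one smoothing) is assumed here for brevity. The function name keywords, its parameters and the toy corpora are all hypothetical. The filters are alternatives in the notes; setting min_freq or min_texts to 1 switches that filter off.

```python
from collections import Counter

def keywords(study_texts, reference_texts, top_n=10, min_freq=20, min_texts=20):
    """Rank words that are unusually frequent in the study corpus.

    Both corpora are lists of texts, each text being a list of tokens.
    """
    study_freq = Counter(tok for text in study_texts for tok in text)
    ref_freq = Counter(tok for text in reference_texts for tok in text)
    # Number of study texts each word appears in (its distribution).
    text_range = Counter(tok for text in study_texts for tok in set(text))
    study_total = sum(study_freq.values())
    ref_total = sum(ref_freq.values())

    scored = []
    for word, freq in study_freq.items():
        if freq < min_freq or text_range[word] < min_texts:
            continue
        study_rel = freq / study_total
        # Add-one smoothing (an assumption) avoids division by zero for
        # words that never occur in the reference corpus.
        ref_rel = (ref_freq[word] + 1) / (ref_total + 1)
        scored.append((study_rel / ref_rel, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Toy corpora (hypothetical), with thresholds lowered to suit them.
study = [["virus", "the", "spread"], ["virus", "the", "vaccine"], ["the", "virus"]]
reference = [["the", "cat", "sat"], ["the", "dog", "ran"]]
top = keywords(study, reference, top_n=1, min_freq=2, min_texts=2)
print(top)
```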