DataCamp Intermediate R Ch. 4 - Regular Expressions
So, regular expressions are mysterious and powerful creatures. I know what they can do: go through character strings and see if a string has the expression or pattern you’re looking for.
I’ve only dealt with them superficially because they’re like the cool kids and I’m still a dorky noob when it comes to programming. I’ve caught glimpses here and there (not to brag) on how they work, so I’m
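In R, you use the grepl(), grep(), sub(), and gsub() functions to work with patterns in character strings: grepl() and grep() search for a pattern, while sub() and gsub() search and replace.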
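Here's a small sketch of how the two search functions differ. The emails vector is made up for illustration and is not part of the exercise.

```r
# Hypothetical example data, invented for this sketch
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org")

# grepl() returns one TRUE/FALSE per element of the vector
hits <- grepl("edu", emails)
print(hits)   # TRUE TRUE FALSE

# grep() returns the positions of the elements that matched
idx <- grep("edu", emails)
print(idx)    # 1 2
```

Note that "edu" is found anywhere in the string, which is why "education@world.gov" counts as a hit too.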
You use a “^a” to look for a character (in these cases, an “a”) at the beginning of a string and a “a$” to look for an “a” at the end of a string.
You can use the caret, ^, and the dollar sign, $ to match the content located in the start and end of a string, respectively.
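A quick sketch of the two anchors in action, using a made-up words vector (my own example, not from the course):

```r
# Hypothetical example strings
words <- c("apple", "banana", "area")

# "^a": the string has to start with an "a"
starts_with_a <- grepl("^a", words)
print(starts_with_a)   # TRUE FALSE TRUE

# "a$": the string has to end with an "a"
ends_with_a <- grepl("a$", words)
print(ends_with_a)     # FALSE TRUE TRUE
```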
.*, which matches any character (.) zero or more times (*). Both the dot and the asterisk are metacharacters.
\\. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.
So, if I understand this correctly, the period in a regular expression represents any character. The asterisk represents zero or more times. Okay. See? This is tricky. I’m used to the asterisk as a wild card, you know? But in regexes, it’s the period. But now, we cool.
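To see the difference between the plain dot, the escaped dot, and the dot plus asterisk, here is a minimal sketch on three made-up strings:

```r
# Hypothetical example strings
x <- c("a.c", "abc", "ac")

# Plain dot: exactly one character of any kind between "a" and "c"
any_char <- grepl("a.c", x)
print(any_char)      # TRUE TRUE FALSE

# Escaped dot: only a real "." between "a" and "c"
literal_dot <- grepl("a\\.c", x)
print(literal_dot)   # TRUE FALSE FALSE

# Dot plus asterisk: zero or more characters between "a" and "c"
zero_or_more <- grepl("a.*c", x)
print(zero_or_more)  # TRUE TRUE TRUE
```

The last call is the one that behaves like the wild card asterisk I'm used to from file names.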
While grep() and grepl() were used to simply check whether a regular expression could be matched with a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If the regular expression pattern is found inside an element of the character vector x, the matching part will be replaced with replacement. sub() only replaces the first match in each element, whereas gsub() replaces all matches.
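The first-match versus all-matches difference is easy to see on a couple of made-up strings (the fruit vector is my own example):

```r
# Hypothetical example strings
fruit <- c("banana", "papaya")

# sub() replaces only the first "a" in each element
first_only <- sub("a", "o", fruit)
print(first_only)   # "bonana" "popaya"

# gsub() replaces every "a" in each element
all_matches <- gsub("a", "o", fruit)
print(all_matches)  # "bonono" "popoyo"
```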
Regular expressions are a typical concept that you'll learn by doing and by seeing other examples. Before you rack your brains over the regular expression in this exercise, have a look at the new things that will be used:
.*: A usual suspect! It can be read as "any character that is matched zero or more times".
\\s: Match a whitespace character, such as a space. The "s" is normally a plain character; escaping it (\\) makes it a metacharacter.
[0-9]+: Match a digit from 0 to 9, at least once (+).
([0-9]+): The parentheses are used to make parts of the matching string available to define the replacement. The \\1 in the replacement argument of sub() gets set to the string that is captured by the regular expression [0-9]+.
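Before tackling the exercise, here is a smaller sketch of these pieces on a made-up string. The strings and variable names are my own, not from the course.

```r
# Hypothetical example string
s <- "room 42 is free"

# \\s followed by [0-9]+ : whitespace, then one or more digits
has_number <- grepl("\\s[0-9]+", s)
print(has_number)    # TRUE

# The parentheses capture the digits; \\1 puts them in the replacement.
# The pattern matches the whole string, so the whole string is replaced.
number_only <- sub(".*\\s([0-9]+)\\s.*", "\\1", s)
print(number_only)   # "42"

# With two groups, \\1 and \\2 refer to them in order
swapped <- sub("([a-z]+)\\s([0-9]+)", "\\2 \\1", "room 42")
print(swapped)       # "42 room"
```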
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
So, what will the result be after calling the sub() function?
Well, the regular expression is looking for a pattern, specifically: any characters (zero or more), then a whitespace character, then one or more digits 0 through 9, then another whitespace character, then the word “nomination”, then any characters (this allows for either nomination or nominations) up to the end of the string. If that sequence is found, sub() replaces everything the pattern matched with “\\1”, which stands for whatever digits the parentheses captured. It’s like saying, “This \\1 means that if you find the pattern, replace all those characters with just the number that came before ‘nomination’.”
So, for the first string, “Won 1 Oscar.”, the regex doesn’t find a match, so no replacement takes place.
In the second string, there’s a match: the leading .* soaks up everything before “ 24 nominations.”, so the pattern covers the whole string and sub() replaces all of it with “24”.
In the third string, there’s a match, so that result is “2”.
In the fourth string, there’s a match, so that result is “3”.
In the fifth string, there’s a match, so that result is “2”.
In the last string, there’s a match, so that result is “1”.
The complete result is a character vector containing six elements: “Won 1 Oscar.”, “24”, “2”, “3”, “2”, “1”.
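To double-check the walkthrough, here is the exercise as a runnable sketch. The variable names pattern, matched, and result are my own additions; the grepl() call is only there to show which strings the pattern finds.

```r
awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")

pattern <- ".*\\s([0-9]+)\\snomination.*$"

# Which elements contain the pattern at all?
matched <- grepl(pattern, awards)
print(matched)   # FALSE TRUE TRUE TRUE TRUE TRUE

# Replace each matching string with the captured number
result <- sub(pattern, "\\1", awards)
print(result)    # "Won 1 Oscar." "24" "2" "3" "2" "1"
```

The first element comes back unchanged because sub() leaves a string alone when the pattern isn't found.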
There’s still a bit I don’t understand about how .* works, like is it for just one character or more than one? Fiddling around with the exercise, I found that you get the same result for the regular expression “.*\\s([0-9]+)\\snom.*$”. This tells me the .* is for anything. Zero or more characters. And I don’t know about the “\\1” thing. How does that work? There must be a list of escape characters that do different things. I’ll leave that for later, though. I don’t want my brain to melt just yet.
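A small experiment that seems to settle the .* question. This is a sketch with my own variable names, and the first test vector is made up. It also shows that .* is greedy: it grabs as much as it can, which is why the \\s in front of the digits matters.

```r
# .* can stand for nothing at all, one character, or many characters
flexible <- grepl("^a.*c$", c("ac", "abc", "abbbbc"))
print(flexible)        # TRUE TRUE TRUE

s <- "Won 1 Oscar. Another 9 wins & 24 nominations."

# With \\s before the digits, the capture has to start right after a space
with_space <- sub(".*\\s([0-9]+)\\snom.*$", "\\1", s)
print(with_space)      # "24"

# Without it, the greedy .* swallows the "2" and leaves only "4" to capture
without_space <- sub(".*([0-9]+)\\snom.*$", "\\1", s)
print(without_space)   # "4"
```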
The ([0-9]+) selects the entire number that comes before the word “nomination” in the string, and the entire match gets replaced by this number because the \\1 refers to the content inside the parentheses.










