Data @brentmayberry - Tumblr Blog

DataCamp Intermediate R Ch. 4 - Regular Expressions

So, regular expressions are mysterious and powerful creatures. I know what they can do: go through character strings and see if a string has the expression or pattern you’re looking for.

I’ve only dealt with them superficially because they’re like the cool kids and I’m still a dorky noob when it comes to programming. I’ve caught glimpses here and there (not to brag) on how they work, so I’m

In R, you use grepl() and grep() to search character strings for a pattern, and sub() and gsub() to replace what matches.

You use a “^a” to look for a character (in these cases, an “a”) at the beginning of a string and a “a$” to look for an “a” at the end of a string.

You can use the caret, ^, and the dollar sign, $ to match the content located in the start and end of a string, respectively.

.*, which matches any character (.) zero or more times (*). Both the dot and the asterisk are metacharacters.

\\. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.
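To convince myself, here's a quick sketch of the anchors, the dot-star, and the escaped dot working together. The emails vector is made up for illustration; it isn't taken from the course.

```r
# Hypothetical strings, invented for this example
emails <- c("john.doe@ivyleague.edu", "education@world.gov",
            "invalid.edu", "quant@bigdatacollege.edu")

# ^ anchors the pattern to the start of the string
starts_with_j <- grepl("^j", emails)       # TRUE FALSE FALSE FALSE

# $ anchors the pattern to the end of the string
ends_with_edu <- grepl("edu$", emails)     # TRUE FALSE TRUE TRUE

# "@", then any characters (.*), then a literal dot (\\.) and "edu" at the end
looks_like_edu_mail <- grepl("@.*\\.edu$", emails)  # TRUE FALSE FALSE TRUE
```

grep() takes the same arguments but returns the positions of the matching elements instead of a logical vector.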

So, if I understand this correctly, the period in a regular expression represents any character. The asterisk represents zero or more times. Okay. See? This is tricky. I’m used to the asterisk as a wild card, you know? But in regexes, it’s the period. But now, we cool.

While grep() and grepl() were used to simply check whether a regular expression could be matched with a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If the regular expression pattern is found inside the character vector x, the matching part of each element will be replaced with replacement. sub() only replaces the first match in each element, whereas gsub() replaces all matches.
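The difference between the two is easiest to see on a tiny string. This is just a sketch with a made-up input:

```r
x <- "banana"

# sub() stops after the first match
first_only <- sub("a", "o", x)    # "bonana"

# gsub() keeps going and replaces every match
all_matches <- gsub("a", "o", x)  # "bonono"
```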

Regular expressions are a typical concept that you'll learn by doing and by seeing other examples. Before you rack your brains over the regular expression in this exercise, have a look at the new things that will be used:

.*: A usual suspect! It can be read as "any character that is matched zero or more times".

\\s: Match a whitespace character, such as a space. The "s" is normally a character; escaping it (\\) makes it a metacharacter.

[0-9]+: Match the numbers 0 to 9, at least once (+).

([0-9]+): The parentheses are used to make parts of the matching string available to define the replacement. The \\1 in the replacement argument of sub() gets set to the string that is captured by the regular expression [0-9]+.

awards <- c("Won 1 Oscar.",
            "Won 1 Oscar. Another 9 wins & 24 nominations.",
            "1 win and 2 nominations.",
            "2 wins & 3 nominations.",
            "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
            "4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)

So, what will the result be after calling the sub() function?

Well, the regular expression is looking for a pattern, specifically: any characters (zero or more), then a space, then one or more digits 0 through 9, then a space, then the word “nomination”, then any characters (this allows for either nomination or nominations) up to the end of the string. If that sequence is found, the “\\1” replaces the whole match with whatever digits the parentheses captured. It’s like saying, “This \\1 means that if you find the pattern, replace all those characters with just the numeric characters you captured.”

So, for the first string, “Won 1 Oscar.”, the regex doesn’t find a match, so no replacement takes place.

In the second string, there’s a match ending in “24 nominations.” Because the .* parts soak up everything before and after the number, sub() replaces the whole string with “24”.

In the third string, there’s a match, so that result is “2”.

In the fourth string, there’s a match, so that result is “3”.

In the fifth string, there’s a match, so that result is “2”.

In the last string, there’s a match, so that result is “1”.

The complete result is a character vector containing six elements: “Won 1 Oscar.”, “24”, “2”, “3”, “2”, “1”.

There’s still a bit I don’t understand about how .* works, like is it for just one character or more than one? Fiddling around with the exercise, I found that you get the same result for the regular expression “.*\\s([0-9]+)\\snom.*$”. This tells me the .* is for anything. Zero or more characters. And I don’t know about the “\\1” thing. How does that work? There must be a list of escape characters that do different things. I’ll leave that for later, though. I don’t want my brain to melt just yet.

The ([0-9]+) selects the entire number that comes before the word “nomination” in the string, and the entire match gets replaced by this number because the \\1 refers to the content captured inside the parentheses.

Udacity Intro to Data Science Lesson 1 - Intro (pt. 1)

What is a Data Scientist?

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Lulz! (Probably true. DataCamp calls these people “as rare as unicorns”.)

What do data scientists do? They have to locate data (I don’t think data scientists are much involved with creating data, but I could be wrong. I bet some do.), extract it from some kind of warehouse, they have to “clean it up”, meaning they have to make sure the data is in the right format, that NULL and missing values are accounted for, and that any “errors” or mistakes in the data are recognized. Once that’s done, they work on analyzing the data and discovering insights hidden within it. Then they draw conclusions from their analysis and present them (along with visualizations, summaries, explanations) to relevant stakeholders. Does that sound about right? What am I missing?

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Here’s a Venn diagram of what a data scientist might do.

What basic skills does a data scientist need?

Knows what questions to ask

Interprets the data well

Understands the structure of the data

Able to work in teams (strengths and weaknesses among team members)

Simpson’s Paradox

I actually looked at this data from UC Berkeley and performed chi-squared tests to see if there were gaping discrepancies in acceptance rates. I started by looking at each group. In Group A, there was a statistically significant discrepancy, with a p-value = 0.0000328. In this case, women in Group A were accepted at a much different rate than men. In Groups B through F, there was NOT a statistically significant difference. However, with all the groups taken together, there was a significant difference. Why? This is Simpson’s Paradox, beautifully explained here. (I gotta learn me some D3.js!)
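Here's roughly how that analysis could be redone in R. This is a sketch under an assumption: it uses the UCBAdmissions table that ships with R (Admit x Gender x Dept), which may not be exactly the data from the lesson, and it turns off the continuity correction, which is my choice rather than anything the course specified.

```r
# One chi-squared test per department (no continuity correction)
depts <- dimnames(UCBAdmissions)$Dept
dept_p <- sapply(depts, function(d) {
  chisq.test(UCBAdmissions[, , d], correct = FALSE)$p.value
})
round(dept_p, 5)

# Collapse over departments and test the combined 2 x 2 table
overall <- margin.table(UCBAdmissions, c(1, 2))
overall_p <- chisq.test(overall, correct = FALSE)$p.value
overall_p
```

Only department A should come out significant on its own, while the pooled table is strongly significant. The pooled gap appears because men and women applied to the departments in very different proportions, and the departments have very different acceptance rates.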

What Problems do Data Scientists Solve?

This was a cool look at how the Obama campaign used data-driven solutions:

http://kylerush.net/blog/optimization-at-the-obama-campaign-ab-testing/

Data science tries to solve problems in government, sports, science, health...not just at tech start-ups.

DataCamp Cleaning Data in R Ch.3 - Preparing Data for Analysis

Type Conversion

Working with different types of variables: characters, numerics, integers, factors, logicals, etc. You have to make sure the types are right so everything works together.

The class() function is a good way to examine the type of data you’re working with. Sometimes you have to convert data to a different type.

The class() function tells you what type of object you're working with. (There are subtle differences between the class, type, and mode of an object.)

Change the object within each call of the class() function to make it evaluate to the following (in order):

character

numeric

integer

factor

logical

Add or remove quotes, add an L to numerics to make them integers and use the factor() function when appropriate to accomplish this.
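A sketch of what that looks like in practice. The values themselves are arbitrary; only the quoting, the L suffix, and factor() matter:

```r
class("true")          # "character": quotes make it a string
class(5.5)             # "numeric"
class(5L)              # "integer": the L suffix
class(factor("yes"))   # "factor"
class(TRUE)            # "logical": no quotes
```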

It is often necessary to change, or coerce, the way that variables in a dataset are stored. This could be because of the way they were read into R (with read.csv(), for example) or perhaps the function you are using to analyze the data requires variables to be coded a certain way.

Only certain coercions are allowed, but the rules for what works are generally pretty intuitive. For example, trying to convert a character string that isn't a number doesn't work: as.numeric("some text") returns NA with a warning.

There are a few less intuitive results. For example, under the hood, the logical values TRUE and FALSE are coded as 1 and 0, respectively. Therefore, as.logical(1) returns TRUE and as.numeric(TRUE) returns 1.
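A few coercions tried out in one place, as a small sketch. The inputs are made up, and the failing conversion is wrapped in suppressWarnings() only to keep the output quiet:

```r
num_ok  <- as.numeric("3.14")                        # 3.14
num_bad <- suppressWarnings(as.numeric("some text")) # NA (normally with a warning)
lgl     <- as.logical(1)                             # TRUE
one     <- as.numeric(TRUE)                          # 1
chr     <- as.character(42)                          # "42"
int     <- as.integer("7")                           # 7L
```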

Dates

Dates can be a challenge to work with in any programming language, but thanks to the lubridate package, working with dates in R isn't so bad. Since this course is about cleaning data, we only cover the most basic functions from lubridate to help us standardize the format of dates and times in our data.

As you saw in the video, these functions combine the letters y, m, d, h, m, s, which stand for year, month, day, hour, minute, and second, respectively. The order of the letters in the function should match the order of the date/time you are attempting to read in, although not all combinations are valid. Notice that the functions are "smart" in that they are capable of parsing multiple formats.
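Here's a short sketch of those parsing functions, assuming the lubridate package is installed. The date strings are examples I picked to show that different input formats land on the same date:

```r
library(lubridate)

d1 <- ymd("2015-08-25")            # year-month-day
d2 <- ymd("2015 August 25")        # same order, different formatting
d3 <- mdy("August 25, 2015")       # month-day-year
t1 <- hms("13:33:09")              # hours, minutes, seconds (a Period)
dt <- ymd_hms("2015/08/25 13.33.09")  # date and time together (UTC by default)
```

All three date calls return the same Date, which is the "smart" part: the function name fixes the order of the pieces, not the separators.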

stringr package

One common issue that comes up when cleaning data is the need to remove leading and/or trailing white space. The str_trim() function from stringr makes it easy to do this while leaving intact the part of the string that you actually want.

> str_trim(" this is a test ")
[1] "this is a test"

A similar issue is when you need to pad strings to make them a certain number of characters wide. One example is if you had a bunch of employee ID numbers, some of which begin with one or more zeros. When reading these data in, you find that the leading zeros have been dropped somewhere along the way (probably because the variable was thought to be numeric and in that case, leading zeros would be unnecessary.)

> str_pad("24493", width = 7, side = "left", pad = "0")
[1] "0024493"

In addition to trimming and padding strings, you may need to adjust their case from time to time. Making strings uppercase or lowercase is very straightforward in (base) R thanks to toupper() and tolower(). Each function takes exactly one argument: the character string (or vector/column of strings) to be converted to the desired case.

The stringr package provides two functions that are very useful for finding and/or replacing strings: str_detect() and str_replace().

Like all functions in stringr, the first argument of each is the string of interest. The second argument of each is the pattern of interest. In the case of str_detect(), this is the pattern we are searching for. In the case of str_replace(), this is the pattern we want to replace. Finally, str_replace() has a third argument, which is the string to replace with.
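Putting the string helpers together, here's a sketch assuming stringr is installed. The ids and fruit vectors are invented for illustration:

```r
library(stringr)

ids <- c("  24493 ", "1180")           # hypothetical employee IDs
trimmed <- str_trim(ids)                # "24493" "1180"
padded  <- str_pad(trimmed, width = 7, side = "left", pad = "0")
                                        # "0024493" "0001180"

fruit <- c("apple", "banana")          # hypothetical strings
shout    <- toupper(fruit)              # "APPLE" "BANANA" (base R)
has_an   <- str_detect(fruit, "an")     # FALSE TRUE
replaced <- str_replace(fruit, "an", "AN")  # "apple" "bANana" (first match only)
```

str_replace() behaves like sub() here: only the first "an" in "banana" changes. stringr also has str_replace_all() for the gsub()-style behavior.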

Missing Values

You’ll come across many kinds of “missing values”. In R, they’re represented by NA, but they may be coded differently in data you import from other sources. Related special values are Inf (infinite) and NaN (not a number).

Check for NAs by is.na(df). Are there any NAs? any(is.na(df)). How many NAs? sum(is.na(df)). Summary tells you the number of NAs for each variable in a data set. summary(df).

complete.cases(df) returns a logical vector with one element per row in the data set. It’s TRUE if a row has no missing values and FALSE otherwise. You can subset your data set to keep only those rows with “complete” observations.

df[complete.cases(df), ]

na.omit(df) removes any rows (observations) with missing values. It’s the same as df[complete.cases(df), ].
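A quick sketch of all of these on a tiny data frame. The df below is made up just so there are a few NAs to find:

```r
# Hypothetical data with three missing values
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1))

has_na   <- any(is.na(df))        # TRUE
n_na     <- sum(is.na(df))        # 3
complete <- complete.cases(df)    # TRUE FALSE TRUE FALSE

kept_a <- df[complete.cases(df), ]  # rows 1 and 3
kept_b <- na.omit(df)               # same rows, plus a note of what was dropped
summary(df)                          # reports NA counts per column
```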

As you've seen, missing values in R should be represented by NA, but unfortunately you will not always be so lucky. Before you can deal with missing values, you have to find them in the data.

If missing values are properly coded as NA, the is.na() function will help you find them. Otherwise, if your dataset is too big to just look at the whole thing, you may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.

Outliers and Obvious Errors

Extreme values for the context, values that don’t make sense.

Data entry errors, sensor errors, values that signify a missing value (-1, for example.)

These kinds of errors should be removed or replaced.

You can use summary() to check your data for extremes. Use hist() (the breaks argument lets you specify the number of buckets in the histogram, and the right = FALSE argument buckets 0 values to the right of the 0 label.). Use boxplot() to see outliers.

Another useful way of looking at strange values is with boxplots. Simply put, boxplots draw a box around the middle 50% of values for a given variable, with a bolded horizontal line drawn at the median. Values that fall far from the bulk of the data points (i.e. outliers) are denoted by open circles. (If you're curious about the exact formula for determining what is "far", check out ?boxplot.)
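Here's a small sketch of hunting for obvious errors. The ages vector is invented, with a -1 (a typical stand-in for "missing") and a 250 (a typo) planted in it:

```r
# Hypothetical ages with two planted errors
ages <- c(23, 35, 41, 29, 52, 38, 44, 31, 47, 36, -1, 250)

summary(ages)                            # min and max give the game away
hist(ages, breaks = 20, right = FALSE)   # most bars bunch up; the errors sit alone

# boxplot() draws the plot and also returns the points it flagged as outliers
flagged <- boxplot(ages)$out
flagged
```

What to do with the flagged values is a judgment call: here the -1 would probably become NA, and the 250 would need checking against the source.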

#datacamp #R #data science

Udacity CS101 Ch. 2.5 - How to Solve Problems

Goal: Draw general lessons about how to solve problems.

How do you get started? What’s the first step?

Make sure you understand the problem. What does it mean to understand a computational problem?

All computational problems have these things in common:

A possible set of inputs

A desired output

The solution is to find a procedure that takes any input from the possible set and correctly produces the desired output.

So, the first step in understanding the problem is to know what the possible inputs are.

The “zeroth” rule is: DON’T PANIC. (Too late! Coding kicks my butt!)

The first rule is: WHAT ARE THE INPUTS? What is the set of valid inputs? Also, to be a “defensive programmer”, your procedure should check to make sure the inputs are valid. How are the inputs represented?

The second rule is: WHAT ARE THE OUTPUTS?

The third rule is: SOLVE THE PROBLEM! (Easy, right?) Understand the rules. Work out some examples by hand. Write some pseudocode. How would a human solve this?

The fourth rule is: FIND A SIMPLE MECHANICAL SOLUTION. Try to find a simple mechanical solution, but don’t worry about optimizing right away. How would a machine solve this? (You can use “brute force” methods, initially, before optimizing.)

Hint: Break a complex problem into many smaller problems and write helper procedures to “eat the elephant” one bite at a time.

The fifth rule is: DEVELOP INCREMENTALLY AND SOLVE AS YOU GO. Use code stubs that you know are wrong, test, then change a bit and re-test.

This exercise killed me. The goal was to calculate the number of days between two dates, factoring in leap years and all that. It was really complicated, and we had to write a bunch of “helper procedures/functions” that figured things out before putting it all together. I got stuck on trying to solve this from a human’s perspective instead of a machine’s. It was rough, but overall a good learning experience.
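For the record, here's a sketch of the brute-force, helper-procedure approach. Two caveats: the course is taught in Python and I'm writing this in R to match the rest of my notes, and the function names are my own, not the course's. Dates are plain c(year, month, day) vectors.

```r
# Helper 1: leap year rule
is_leap_year <- function(year) {
  (year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0
}

# Helper 2: number of days in a given month
days_in_month <- function(year, month) {
  if (month == 2) return(if (is_leap_year(year)) 29 else 28)
  if (month %in% c(4, 6, 9, 11)) 30 else 31
}

# Helper 3: step one day forward
next_day <- function(date) {
  y <- date[1]; m <- date[2]; d <- date[3]
  if (d < days_in_month(y, m)) return(c(y, m, d + 1))
  if (m < 12) return(c(y, m + 1, 1))
  c(y + 1, 1, 1)
}

# Main procedure: count steps from start to end (simple, not fast)
days_between <- function(start, end) {
  key <- function(d) d[1] * 10000 + d[2] * 100 + d[3]
  # Defensive programming: check the inputs before looping
  stopifnot(length(start) == 3, length(end) == 3, key(start) <= key(end))
  days <- 0
  while (any(start != end)) {
    start <- next_day(start)
    days <- days + 1
  }
  days
}

days_between(c(2012, 1, 1), c(2013, 1, 1))   # 366 (2012 is a leap year)
```

This is the "how would a machine solve this?" version: keep adding one day until you arrive. Each helper can be tested on its own before the pieces are combined.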

#udacity #cs101 #data #data science

DataCamp Cleaning Data in R Ch2. - Tidy Data

Tidy data. What is it? Here’s Hadley Wickham’s paper about it. Each observation is a row, each variable (attribute) is a named column, and each value is an entry in a cell. Each type of observational unit, meaning the information relevant to one particular kind of subject, gets its own table.

One example of dirty data is when the values of a variable are used as column names.

tidyr

By Hadley Wickham. Helps you apply the principles of tidy data.

The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.

The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw in the video, running the following command on wide_df will make it long:

gather(wide_df, my_key, my_val, -col)

The opposite of gather() is spread(), which takes key-values pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.

The easiest way to visualize the effect of spread() is that it makes long datasets wide. As you saw in the video, running the following command will make long_df wide:

spread(long_df, my_key, my_val)
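To see both directions at once, here's a sketch assuming tidyr is installed. The contents of wide_df are made up; only the names wide_df, long_df, my_key, my_val, and col come from the commands above:

```r
library(tidyr)

# Hypothetical wide data: A, B, C are really values of one variable
wide_df <- data.frame(col = c("X", "Y"),
                      A = c(1, 4), B = c(2, 5), C = c(3, 6))

# Wide to long: every column except col collapses into key-value pairs
long_df <- gather(wide_df, my_key, my_val, -col)
long_df            # 6 rows: col, my_key, my_val

# Long to wide: my_key values become column names again
wide_again <- spread(long_df, my_key, my_val)
wide_again
```

Newer versions of tidyr recommend pivot_longer() and pivot_wider() for the same jobs, but gather() and spread() still work.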

The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.

We've loaded the small dataset from the video called treatments into your workspace. This dataset obeys the principles of tidy data, but we'd like to split the treatment dates into two separate columns: year and month. This can be accomplished with the following:

separate(treatments, year_mo, c("year", "month"))

The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.

We've loaded the treatments data into your workspace again, but this time the year_mo column has been separated into year and month. The original column can be recreated by putting year and month back together:

unite(treatments, year_mo, year, month)
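A sketch of the round trip, assuming tidyr is installed. I don't have the course's treatments data, so the rows below are invented; only the column names follow the text:

```r
library(tidyr)

# Hypothetical stand-in for the treatments data
treatments <- data.frame(patient  = c("X", "Y"),
                         year_mo  = c("2010-10", "2012-08"),
                         response = c(1, 4))

# Split on the non-alphanumeric character (the dash)
separated <- separate(treatments, year_mo, c("year", "month"))
separated$year    # "2010" "2012"
separated$month   # "10"   "08"

# Paste back together: underscore by default, or pick your own separator
united_default <- unite(separated, year_mo, year, month)
united_dash    <- unite(separated, year_mo, year, month, sep = "-")
united_default$year_mo   # "2010_10" "2012_08"
united_dash$year_mo      # "2010-10" "2012-08"
```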

DataCamp Writing Functions Ch. 5 - Robust Functions

Errors

Learned about the stopifnot() and stop() functions. These help you check the arguments passed to the function you’re writing and raise errors when something is wrong. The stop() function lets you write your own error message, while stopifnot() just reports which condition wasn’t TRUE. I need to do a better job explaining this.
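Here's my attempt at a better explanation, as a sketch. The two functions below are hypothetical examples I wrote for these notes; they do the same check in the two different styles:

```r
# Style 1: stopifnot() is short, but the message is generic
both_na_v1 <- function(x, y) {
  stopifnot(length(x) == length(y))
  which(is.na(x) & is.na(y))
}

# Style 2: stop() lets you say what went wrong in plain words
both_na_v2 <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  which(is.na(x) & is.na(y))
}

# Capture the messages instead of halting the script
msg_v1 <- tryCatch(both_na_v1(1:3, 1:2), error = function(e) conditionMessage(e))
msg_v2 <- tryCatch(both_na_v2(1:3, 1:2), error = function(e) conditionMessage(e))
msg_v1   # "length(x) == length(y) is not TRUE"
msg_v2   # "x and y must have the same length"
```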

Type-Unstable Functions

Make your functions type-consistent. This means that the output should be the same type (type of object) no matter the input. Avoid type-unstable functions--these are functions that return different types of objects depending on your inputs. One example is a function that either returns a data frame or a vector depending on the input. Apparently the sapply() function is type-unstable.
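A small sketch of why sapply() counts as type-unstable. The inputs are made up, and I'm using base R's vapply() as the stable alternative here (the course itself leans on the purrr map functions):

```r
# Same function, three different kinds of result
a <- sapply(list(1:3, 4:6), range)   # a matrix
b <- sapply(list(1:3, 4:6), mean)    # a numeric vector
c0 <- sapply(list(), mean)           # an empty list

# vapply() makes you declare the result type, so it can't surprise you
d <- vapply(list(1:3, 4:6), mean, numeric(1))   # always numeric
e <- vapply(list(), mean, numeric(1))           # numeric(0), still numeric
```

The empty-input case is the sneaky one: code that expects a numeric vector gets a list only when there happens to be no data.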

(Note: You can use the map functions to subset. Instead of passing a function as the second argument, you can pass a number, and the map function will pull out the element at that position from each item in the list.)

DataCamp Cleaning Data in R - Intro

Most data is “dirty”: it has little or no consistent structure.

Four parts to dealing with data:

Collecting

Cleaning (tidying, wrangling, munging)

Analyzing

Reporting

First Step in Cleaning Data: Exploring

Understanding structure

Looking at data

Visualizing data

To understand data, take a look at it. Use class() to make sure it’s tabular (like a data frame or matrix or data table.) Use dim() to look at the data’s dimensions; the result is the number of rows, then number of columns. Use names() to get the column names. Use str() to get an idea of the data’s structure. Rows are “observations” and columns are “variables”.

The dplyr package uses glimpse(), which is similar to str(). (On the surface, I can’t tell the difference besides minor cosmetic changes.)

The summary() function gives you the min, first quartile, median, mean, third quartile, and max for each numeric variable (column).

The head() function shows the first 6 rows of a data frame. Use head(x, n) to display the first n rows instead. The same goes for the tail() function, which displays the last 6 rows by default.

hist() plots a histogram and is useful when looking at the distribution of a variable. plot() creates a scatterplot that allows you to look at the relationship between two variables.
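Here's the whole exploring routine as one sketch. It uses mtcars, a data set that ships with R, as a stand-in; the course used its own data:

```r
df <- mtcars          # built-in data set used as a stand-in

class(df)             # "data.frame"
dim(df)               # rows, then columns
names(df)             # column names
str(df)               # structure: type and first values of each column
summary(df)           # min, quartiles, median, mean, max per column

head(df, 3)           # first 3 rows instead of the default 6
tail(df, 3)           # last 3 rows

hist(df$mpg)          # distribution of one variable
plot(df$wt, df$mpg)   # relationship between two variables
```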

DataCamp Importing Data Into R Ch. 5 - Importing data from the web

So much data is on the web. (Duh...all of it, in my case.) So, HTTP is the way machines communicate over the WWW.

You’d usually download files from a website to your computer, then load it into R. BUT WAIT! You can do it from R directly.

What about HTTPS? Newer versions of R can handle it automatically.

You can use download.file() to get files. You have to define the url and the destination path as arguments. You can specify the file name if you don’t want to keep the original.

Sometimes you have to specify authentication parameters (depending on the web page). For that, the httr package can come in handy.

The utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web. (This means you don’t have to use download.file().) You can also use the readr package and use read_csv() and read_tsv() to download the appropriate files from the web.

gdata can handle .xls files that are located on a remote web server. readxl can't, at least not yet.

With download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also RData files. An RData file is a very efficient format for storing R data.

You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. You have to first download the RData file securely, and then import the local data file.

(Whoops! Hold on there. According to DataCamp, you can't pass a URL string directly to load(). You either download the file first with download.file(), or you wrap the URL in url() inside load(), which loads the data but does not save the RData file locally. I was pretty confused, so I had to experiment to see what’s going on. Okay, so you have to use load(url("<url here>")) to make it work. Got it.)
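Here's a sketch of both routes. So that it runs without an internet connection, it first saves a tiny RData file to a temp folder and points a file:// URL at it; that URL is a stand-in for a real https:// address, and the scores object is made up. The file:// trick assumes a Unix-style path.

```r
# Set-up: make a small RData file to play the part of the remote file
src <- tempfile(fileext = ".RData")
scores <- c(90, 85, 77)
save(scores, file = src)
rm(scores)
remote <- paste0("file://", src)   # stand-in for "https://..."

# Route 1: download to disk, then load the local copy
dest <- tempfile(fileext = ".RData")
download.file(remote, destfile = dest, mode = "wb", quiet = TRUE)
load(dest)
from_download <- scores
rm(scores)

# Route 2: load straight from the URL (nothing is saved locally)
load(url(remote))
from_url <- scores
```

Calling load(remote) with the bare string would fail, because load() treats a character string as a local file name.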

httr

Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content.

You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don't tell content() how to retrieve the content through the as argument, it'll try its best to figure out which type is most appropriate based on the content-type.

Here’s an example of using the GET() function:

# Load the httr package
library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp

# Get the raw content of resp
raw_content <- content(resp, as = "raw")

# Print the head of content
head(raw_content)

And there you have it. If you don't specify the as argument, the content() function figures out what type of data you're dealing with and parses it for you. Otherwise, you pass a string to the as argument: "raw", "text", or "parsed".

DataCamp Importing Data Into R Ch. 5 - Importing Data from the Web (pt. 2)

Talkin’ about JSON today! (I really like JSON’s format...it makes sense, and you can pack in a lot of info in a relatively small space.)

JSON Objects

A JSON object is an unordered collection of “name”:value pairs. The names are strings, and the values can be strings, numbers, booleans, null, another JSON object, or a JSON array.

Here’s an example:

{"id":1,"name":"Frank","age":23,"married":false}

To import into R code:

> library(jsonlite)
> # Use single quotes so you don’t have to escape the double quotes every time. (Which is nice.)
> x <- '{"id":1,"name":"Frank","age":23,"married":false}'
> r <- fromJSON(x)
> str(r)
List of 4
 $ id     : int 1
 $ name   : chr "Frank"
 $ age    : int 23
 $ married: logi FALSE

JSON Arrays

JSON arrays are ordered sequences of 0 or more values. JSON arrays are heterogeneous, so they can contain more than one datatype, but when you convert a flat JSON array into R, R coerces all the elements of the array into a vector with a single datatype.

JSON Nesting

You can nest other JSON objects and arrays.
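A sketch of arrays and nesting, assuming the jsonlite package is installed. The JSON strings are made up for illustration:

```r
library(jsonlite)

# A flat array of numbers becomes a plain vector
nums <- fromJSON('[1, 2, 3]')

# A mixed array still becomes one vector, so everything turns into strings
mixed <- fromJSON('[1, "a", true]')
class(mixed)   # "character"

# Nested objects and arrays become nested lists and vectors
nested <- fromJSON('{"a": 1, "b": {"x": 5, "y": [6, 7]}}')
nested$b$x     # 5
nested$b$y     # 6 7
```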

Prettify and Minify

JSONs can come in different formats. Take these two JSONs, that are in fact exactly the same: the first one is in a minified format, the second one is in a pretty format with indentation, whitespace and new lines:

# Mini
{"a":1,"b":2,"c":{"x":5,"y":6}}

# Pretty
{
  "a": 1,
  "b": 2,
  "c": {
    "x": 5,
    "y": 6
  }
}

Unless you're a computer, you surely prefer the second version. However, the standard form that toJSON() returns is the minified version, as it is more concise. You can adapt this behavior by setting the pretty argument inside toJSON() to TRUE. If you already have a JSON string, you can use prettify() or minify() to make the JSON pretty or as concise as possible.
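Here's a sketch of going back and forth between the two formats, assuming jsonlite is installed. The list x mirrors the example above; auto_unbox = TRUE is my addition so that single values print as 1 rather than [1]:

```r
library(jsonlite)

x <- list(a = 1, b = 2, c = list(x = 5, y = 6))

mini   <- toJSON(x, auto_unbox = TRUE)                 # minified by default
pretty <- toJSON(x, auto_unbox = TRUE, pretty = TRUE)  # indented version

mini
pretty

# Converting an existing JSON string either way
prettify(mini)
back_to_mini <- minify(pretty)
```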

And that concludes the Importing Data into R course. Lots of good stuff. Some of it I learned from Coursera’s Data Science specialization, but I like that this is interactive and a little more rigorous. After I take DataCamp’s Cleaning Data course, maybe I’ll head back to Coursera to see if I can’t finish that specialization.

Udacity Intro to Inferential Statistics Lesson 16 - Chi Square Tests

I’ll have to find the ASCII or UTF-8 encoding for the Greek letter chi. Maybe it's Χ. Ha! Thanks, Internet. You’re the best. Oh, wait, this looks just like a regular capital X. Oops. Well, I learned about X² tests today.

Sometimes your data doesn’t deal with means and standard deviations. Sometimes your data is categorical: male, female; yes, no, that kind of thing. X² tests allow you to see if your observed data is significantly different from expected data. You subtract your expected values from your observed values, square that result, then divide by the expected value. You sum up all those results from all the different “categories”, and that’s your X² value.

X² values can never be negative, and the more categories you have, the more degrees of freedom you get (# categories - 1).

One cool thing I learned is that if you have n number of observations, but you don’t know what the breakdown for each category is (for your expected values), then your null hypothesis is that there’s no “preference” and each category gets the same number of observations. So, it’s 1/# categories times n.
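Here's that calculation as a sketch in R, with made-up counts and the "no preference" null hypothesis. Doing it by hand first and then checking against chisq.test() is my own sanity check, not something from the lesson:

```r
# Hypothetical observed counts for three categories
observed <- c(red = 30, blue = 50, green = 20)

n <- sum(observed)               # 100 observations
k <- length(observed)            # 3 categories
expected <- rep(n / k, k)        # no preference: n / k in each category

# Sum over categories of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)   # 14
df <- k - 1                                         # 2 degrees of freedom
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# The built-in test assumes equal expected counts by default
builtin <- chisq.test(observed)
```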

This is the last lesson for the Intro to Inferential Statistics class. I’m really enjoying it, and I think Udacity’s platform is top rate. They make learning easy.

DataCamp Importing Data Into R Ch. 4 - Importing data from relational databases

(Note: The images posted here are taken from the awesome DataCamp Course Importing Data Into R. I am posting them here guided by what I consider to be fair use. I do not claim ownership or copyrights to these images. DataCamp is a great place to learn R, so if you like what you see here, I encourage you to sign up and take their classes.)

This is about getting data from SQL, Oracle, other databases.

Step 1: Connect

The first step to import data from a SQL database is creating a connection to it. You need different packages depending on the database you want to connect to. All these packages do this in a uniform way, as specified in the DBI package.

dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIDriver object that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().

If the MySQL database is a remote database hosted on a server, you'll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password.

Step 2: List tables

After connecting to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. It takes the connection object con as its argument and returns a character vector of table names.

Step 3: Import

To import the table data into your R session, use the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import ("table_name"). The resulting object is a standard R data frame.

SQL Queries from Inside R

This is super cool. You can make SQL queries from R. Awesome. Use the dbGetQuery() function. Here’s an example:

This is sweet!

Here’s a comparison between dbGetQuery() and another way to get records from a MySQL database:

You can use the dbFetch() command to specify the number of records you want imported into R. People are smart!

Don’t forget to disconnect from your remote database by using the dbDisconnect() function.
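Since the screenshots don't carry over, here's a sketch of the whole workflow. Big assumption: the course connects to a remote MySQL database with RMySQL, but so that this runs anywhere I'm using an in-memory SQLite database through the RSQLite package instead. The users table and its contents are made up. The DBI function calls are the same either way.

```r
library(DBI)

# Step 1: connect (RSQLite in memory here; the course used RMySQL::MySQL()
# with dbname, host, port, user and password)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Set-up only: create a hypothetical table to query
dbWriteTable(con, "users",
             data.frame(id = 1:4,
                        name = c("ann", "bo", "cy", "di"),
                        age = c(31, 25, 42, 19)))

# Step 2: list tables
tables <- dbListTables(con)

# Step 3: import a whole table as a data frame
users <- dbReadTable(con, "users")

# Run a SQL query and get the result in one go
over_30 <- dbGetQuery(con, "SELECT name FROM users WHERE age > 30")

# Or send the query, then fetch only as many records as you want
res <- dbSendQuery(con, "SELECT * FROM users")
first_two <- dbFetch(res, n = 2)
dbClearResult(res)

# Always clean up
dbDisconnect(con)
```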

This chapter was awesome. I love working with SQL, and being able to do it in R is BRILLIANT!

DataCamp - Importing Data Into R Ch. 3 - haven and foreign

haven

SAS

haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

SAS: read_sas()

STATA: read_dta() (or read_stata(), which are identical)

SPSS: read_sav() or read_por(), depending on the file type.

All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.

STATA

Next up are STATA data files; you can use read_dta() for these.

When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it's best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you'll need haven's as_factor() function.

Here’s an example:

variable$column_name <- as_factor(variable$column_name)

Be careful because base R’s conversion functions use a dot (as.factor()), whereas haven’s use an underscore (as_factor()). You can also chain them: as.Date(as_factor(<variable$column_name>))

Plotting

A plot can be very useful to explore the relationship between two variables. If you pass the plot() function two arguments, the first one will be plotted on the x-axis, the second one will be plotted on the y-axis.

SPSS

The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you're working with, you'll need either read_sav() - for .sav files - or read_por() - for .por files.

(Note: Was reminded that when you subset, using subset(), you can specify conditions based on column names all by themselves. You don’t need the variable name and $. You also can combine conditions using just & and |. Subsetting is something I really need to practice.)
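Some of that practice, as a sketch on the built-in mtcars data (my choice of data set, not the course's):

```r
# Column names on their own, conditions combined with &
small_efficient <- subset(mtcars, mpg > 25 & cyl == 4)

# The same thing the long way, with the data frame name and $
long_way <- mtcars[mtcars$mpg > 25 & mtcars$cyl == 4, ]

# Conditions combined with |, keeping only some columns
either <- subset(mtcars, mpg > 30 | hp > 300, select = c(mpg, hp))
```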

foreign

The foreign package can handle more file types.

SAS

The sas7bdat package is an alternative for reading SAS files, because foreign can’t read individual SAS data files. It can only read SAS “libraries”.

STATA

read.dta() (notice the . vs the _ in haven’s read_dta()) converts factors and dates by default.

Data can be very diverse, going from character vectors to categorical variables, dates and more. It's in these cases that the additional arguments of read.dta() will come in handy.

The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore.

SPSS

read.spss()

All great things come in pairs. Where foreign provided read.dta() to read STATA data, there's also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().

Another argument, use.value.labels, specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default, which means that so-called labelled variables inside SPSS are converted to factors inside R.

(I’m gonna have to “consult the documentation” to figure out all the function parameters for these beauties.)

Kinda random

If you're familiar with statistics, you'll have heard about Pearson's Correlation. It is a measurement to evaluate the linear dependency between two variables, say X and Y. It can range from -1 to 1; if it's close to 1 it means that there is a strong positive association between the variables. If X is high, Y also tends to be high. If it's close to -1, there is a strong negative association: if X is high, Y tends to be low. When the Pearson correlation between two variables is 0, these variables are possibly independent: there is no linear association between X and Y.

You can calculate the correlation between two vectors with the cor() function. Take this code for example, that computes the correlation between the columns height and width of a fictional data frame size:

cor(size$height, size$width)
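To make that runnable, here's a small sketch with made-up numbers for the fictional size data frame. The second part is my own addition: it shows why a correlation of 0 only means "possibly independent".

```r
# Fictional data frame "size" with made-up measurements
size <- data.frame(
  height = c(150, 160, 170, 180, 190),
  width  = c(40, 44, 47, 52, 55)
)

# Pearson correlation (the default method) between the two columns
r <- cor(size$height, size$width)
print(r)

# Zero correlation does not guarantee independence:
# y depends entirely on x here, but not in a linear way
x <- -2:2
y <- x^2
r_curved <- cor(x, y)
print(r_curved)
```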

Overall, this was a pretty good chapter, even though it was like drinking out of a fire hose. I got a lot out of it, and I’m sure I’m going to come across some SAS, STATA, or SPSS files out there. It’s good to know how to handle them in R.

Udacity Intro to Inferential Statistics - Regression

I did the first three-fourths of lesson 15 in Udacity’s Intro to Inferential Statistics course. This is a good refresher for me when it comes to regression. Given a two-dimensional data set in a scatter plot, what’s the equation of the line that minimizes the distance from the points to the line? That’s regression. You find the line by minimizing the sum of the squared residuals, where a residual is the observed value minus the expected value (the value the line predicts).

You can find the slope of the regression line, or line of best fit, by calculating the correlation coefficient (r) and multiplying that by the sample standard deviation of the dependent variable (y) divided by the sample standard deviation of the independent variable (x). You find the intercept of the line by plugging in a known value of x and y that falls on the line. How are you supposed to know that? Well, one point (x, y) that you know will have a zero residual is the point made up of the means of x and y. Cool, huh?

This link explains why that is. Although, if you think about it for a second, you can see why: a residual is the observed value minus the expected value (the value on the line), and least squares sets the intercept to the mean of y minus the slope times the mean of x. Plug the mean of x into the line and you get back the mean of y, so the residual there is zero. This puts the point of means on the regression line.
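Here's a quick sketch to check the slope and intercept recipe against R's built-in lm(). The x and y values are made up.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope = r * (sd of y / sd of x)
slope <- cor(x, y) * sd(y) / sd(x)

# Intercept: make the line pass through (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

# Compare with R's least-squares fit
fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))

# Residual at the point of means (zero, up to rounding error)
residual_at_means <- mean(y) - (intercept + slope * mean(x))
print(residual_at_means)
```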

Anyway, fun lesson. The more I study statistics, the more I enjoy it.

DataCamp Importing Data Into R Ch. 2 - Importing Data From Excel (pt. 2)

gdata package

Another package you can use to import data from Excel spreadsheets into R is the gdata package. It doesn’t handle .xlsx files without an additional driver. gdata is an established package, while readxl is still under development. (I don’t know...it seems clunky.) And you can’t import a single sheet.

Remember how read.xls() actually works? It basically comes down to two steps: converting the Excel file to a .csv file using a Perl script, and then reading that .csv file with the read.csv() function that is loaded by default in R, through the utils package.

This means that all the options that you can specify in read.csv(), can also be specified in read.xls().

So, I think the reason you’d use the gdata package is to get really specific with your arguments since read.xls() can take the same arguments as read.csv() (which are a lot, apparently). If you want to really massage that data when you import it, this seems like a good way to go.

XLConnect package

It’s a bridge between Excel and R. Whatever you can do in Excel, you can do in R. Cool! The package can work with both .xls and .xlsx files. The package depends on Java. The appeal of XLConnect is that you can make reproducible processes.

Typically, the first step will be to load a workbook in your R session with loadWorkbook(); this function will build a "bridge" between your Excel file and your R session.

To list the sheets in an Excel file, use getSheets(). To actually import data from a sheet, you can use readWorksheet(). Both functions require an XLConnect workbook object as the first argument. readWorksheet() returns the results as a data.frame automatically, so you don’t have to do any further conversion.

Where readxl and gdata were only able to import Excel data, XLConnect's approach of providing an actual interface to an Excel file makes it able to edit your Excel files from inside R.

(Note: I learned that when using the data.frame() function, you can specify column names within the function call. I didn’t know you could do that. Or, I forgot you could do that. So, data.frame(temp = <vector>, month = <vector>) would have two columns, one called “temp” and the other “month”.)

Here’s an example of using some common functions in XLConnect:

# Build connection to latitude.xlsx
library(XLConnect)
my_book <- loadWorkbook("latitude.xlsx")
# Create data frame: summ
dims1 <- dim(readWorksheet(my_book, 1))
dims2 <- dim(readWorksheet(my_book, 2))
summ <- data.frame(sheets = getSheets(my_book), nrows = c(dims1[1], dims2[1]), ncols = c(dims1[2], dims2[2]))
# Add a worksheet to my_book, named "data_summary"
createSheet(my_book, "data_summary")
# Populate "data_summary" with summ
writeWorksheet(my_book, summ, "data_summary")
# Save workbook as latitude_with_summ.xlsx
saveWorkbook(my_book, "latitude_with_summ.xlsx")

DataCamp Importing Data Into R Ch. 2 - Importing from Excel (pt. 1)

readxl package by Hadley Wickham.

Excel files are common in data analysis. (Is this true? I guess people use Excel for your basic stuff, and that's probably all you need.)

The readxl package has two main functions: excel_sheets() and read_excel(). excel_sheets("file_name.xlsx") returns a character vector with the names of the different Excel sheets. read_excel() can handle both .xls and .xlsx files. By specifying the sheet argument, you can tell read_excel() which sheet to import (you can specify the sheet index or sheet name). Other arguments include col_names, col_types, and skip. The first two are self-explanatory. skip is to skip a certain number of rows before starting the import. You can't specify the number of rows to import...yet.

For col_types, you can use values such as text, numeric, date and blank; blank skips the column and doesn't import it.

Importing using read_excel()

You can import the spreadsheet with the read_excel() function. Have a look at this recipe:

data <- read_excel("data.xlsx", sheet = "my_sheet")

This call simply imports the sheet with the name "my_sheet" from the "data.xlsx" file. You can also pass a number to the sheet argument; this will cause read_excel() to import the sheet with the given sheet number. sheet = 1 will import the first sheet, sheet = 2 will import the second sheet, and so on.

Importing Multiple Sheets using lapply()

Loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply().

Have a look at the example code below:

my_workbook <- lapply(excel_sheets("data.xlsx"), read_excel, path = "data.xlsx")

The read_excel() function is called multiple times on the "data.xlsx" file and each sheet is loaded in one after the other. The result is a list of data frames, each data frame representing one of the sheets in data.xlsx.

col_names argument

Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.

By default it is TRUE, denoting that the first row in the Excel sheet contains the column names. If this is not the case, you can set col_names to FALSE. In this case, R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. It works exactly the same as in the readr package.

skip argument

Another argument that can be very useful when reading in Excel files that are less tidy, is skip. With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you're trying to pull data from. Have a look at this example:

read_excel("data.xlsx", skip = 15)

In this case, the first 15 rows in the first sheet of "data.xlsx" are ignored.

If the first row of this sheet contained the column names, this information will also be ignored by readxl. Make sure to set col_names to FALSE or manually specify column names in this case!

DataCamp Importing Data into R Ch. 1 - Flat Files

What’s a flat file? Is it one that has up to 2 dimensions? Not like arrays that can have more than 2? It’s a record that doesn’t contain “structured relationships” (and I don’t know what that means). So, I guess it can have more than 2 dimensions.

read.table() is the basic function to read flat files. By default, the header argument is FALSE and the separator is "" (any whitespace).

read.csv() wraps around read.table()

read.csv2() uses commas as the decimal point and semicolons as the separator value (you Europeans be trippin!).

read.delim() is for tab-delimited files. The separation value is sep = "\t".

read.delim2() is for the Europeans. The decimal point is a comma.

The file name is enclosed in quotes.

read.csv("file_name.csv")

read.table

If you're dealing with more exotic flat file formats, you'll want to use read.table(). It's the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is "" by default.
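To see the different defaults side by side, here's a small sketch. Instead of real files it uses the text argument of read.table() (which the wrappers pass along), and the tiny data sets are made up.

```r
# Small in-memory "files", so nothing has to exist on disk
csv_text  <- "type,calories\nBeef,186\nPoultry,102"
csv2_text <- "type;price\nBeef;1,50\nPoultry;2,25"
tab_text  <- "type\tcalories\nBeef\t186\nPoultry\t102"

# read.csv(): header = TRUE and sep = "," by default
from_csv <- read.csv(text = csv_text)

# read.csv2(): sep = ";" and dec = "," by default
from_csv2 <- read.csv2(text = csv2_text)

# read.delim(): header = TRUE and sep = "\t" by default
from_tab <- read.delim(text = tab_text)

# read.table(): header = FALSE and sep = "" by default, so spell both out
from_table <- read.table(text = csv_text, header = TRUE, sep = ",")

print(from_csv2$price)  # the decimal commas were read as numbers
print(from_table)
```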

(Note: Use the head() function to return the first 6 rows of a data frame/table. You can change the number of rows with the n argument.)

stringsAsFactors

You already learned by now how to use the header and sep arguments. You can also specify stringsAsFactors: it tells R whether it should convert strings in the flat file to factors.

For all importing functions in the utils package, this argument was TRUE by default at the time of this course, which means that you import strings as factors (since R 4.0.0, the default is FALSE). This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.
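A small sketch of the difference, using made-up hot dog data passed in through the text argument. Setting stringsAsFactors explicitly means the result doesn't depend on which default your R version uses.

```r
hotdog_text <- "type\tcalories\nBeef\t186\nPoultry\t102\nBeef\t149"

# Strings become a factor (a categorical variable)
as_factor_df <- read.delim(text = hotdog_text, stringsAsFactors = TRUE)

# Strings stay plain character values
as_char_df <- read.delim(text = hotdog_text, stringsAsFactors = FALSE)

print(class(as_factor_df$type))
print(levels(as_factor_df$type))
print(class(as_char_df$type))
```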

(Note: When you’re subsetting, you can use the which.min() and which.max() functions to find the INDEX of the min and max value, respectively, for a certain field. The example given was hotdogs[which.min(hotdogs$calories), ] to return the row in the hotdogs data frame that had the smallest value in the calories field. Pretty cool!)
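Here's that trick as a runnable sketch. The hotdogs values below are made up, not the course data.

```r
# Made-up stand-in for the hotdogs data frame
hotdogs <- data.frame(
  type     = c("Beef", "Meat", "Poultry"),
  calories = c(186, 173, 102),
  sodium   = c(495, 458, 396)
)

# which.min() / which.max() return the row INDEX, not the value itself
lowest_cal_index <- which.min(hotdogs$calories)

# Use the index to pull the whole row (note the trailing comma)
lowest_cal  <- hotdogs[lowest_cal_index, ]
highest_sod <- hotdogs[which.max(hotdogs$sodium), ]

print(lowest_cal)
print(highest_sod)
```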

Column Classes

If your data file doesn’t come with headers, you can specify them when you import the file into R using the col.names argument. You just assign it a character vector with the names you want.

Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:

read.delim("my_file.txt", colClasses = c("character", "numeric", "logical"))

This approach can be useful if you have some columns that should be factors and others that should be characters. You don't have to bother with stringsAsFactors anymore; just state for each column what the class should be.

If a column is set to "NULL" in the colClasses vector, this column will be skipped and will not be loaded into the data frame.

hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))

# Adding column classes to the data frame. Note that the second column will not be imported because its class was set to "NULL".

# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"), colClasses = c("factor", "NULL", "numeric"))

Other Packages for Importing Data

readr - by Hadley Wickham, faster, more consistent naming

Use read_delim(), and you specify the path to your file and the delim argument. Specify col_names with a character vector to name your columns. If left blank, col_names uses the first row as column names by default. If you set col_names to FALSE, then the columns have variable names X1, X2, X3, and so on. col_types lets you set the column classes. If it’s blank, read_delim() tries to figure it out from the first rows of the input file (30 rows in the version used in the course; more recent versions of readr look at up to 1000 rows by default). You can specify column types with short string representations.

Just as read.table() was the main utils function, read_delim() is the main readr function.

read_delim() takes two mandatory arguments:

file: the file that contains the data

delim: the character that separates the values in the data file

You can read in a certain number of records with the skip and n_max arguments. skip passes over the number of records you specify. n_max reads up to a specified number of records.

Apart from controlling how columns are named, you can also specify which types the columns should be in your imported data frame. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string, where each character denotes the class of the column: c for character, d for double, i for integer and l for logical. _ skips the column as a whole.

Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.

For a complete list of collector functions, you can take a look at the collector documentation. For this exercise you will need two collector functions:

col_integer(): the column should be interpreted as an integer.

col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels.

Through skip and n_max you can also control which part of your flat file you're actually importing into R. Watch out: Once you skip some lines, you also skip the first line that can contain column names.
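Here's a rough sketch that puts col_types, the collectors, and n_max together. It assumes the readr package is installed (the block does nothing otherwise), and the hot dog file is a made-up temporary file, not the course data.

```r
# Sketch only: runs when the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  library(readr)

  # Hypothetical tab-delimited file without a header row
  path <- tempfile(fileext = ".txt")
  writeLines(c("Beef\t186\t495", "Meat\t173\t458", "Poultry\t102\t396"), path)
  hotdog_names <- c("type", "calories", "sodium")

  # Short string: c = character, i = integer, _ = skip the column
  short <- read_delim(path, delim = "\t", col_names = hotdog_names,
                      col_types = "ci_")

  # Collectors: the same idea, spelled out per column in a list()
  collected <- read_delim(path, delim = "\t", col_names = hotdog_names,
                          col_types = list(
                            col_factor(levels = c("Beef", "Meat", "Poultry")),
                            col_integer(),
                            col_integer()
                          ))

  # n_max limits how many records are read
  first_two <- read_delim(path, delim = "\t", col_names = hotdog_names,
                          col_types = "cii", n_max = 2)

  print(short)
  print(collected)
  print(first_two)
}
```

Passing col_names explicitly matters here because the file has no header row; otherwise the first record would be used up as column names.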

read_csv() and read_tsv() (tab-delimited) are function wrappers for read_delim().

Source: DataCamp.com, https://campus.datacamp.com/courses/importing-data-into-r/chapter-1-importing-data-from-flat-files?ex=9

data.table - speed, fread().

fread() can infer column types and separators. You can specify a bunch of parameters.

Now that you know the basics about fread(), you should know about two arguments of the function: drop and select, to drop or select variables of interest.

Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named "a" and "e". The following options will all do the trick:

fread("path/to/file.txt", drop = 2:4)
fread("path/to/file.txt", select = c(1, 5))
fread("path/to/file.txt", drop = c("b", "c", "d"))
fread("path/to/file.txt", select = c("a", "e"))
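To convince myself the four options really are equivalent, here's a sketch that assumes the data.table package is installed (the block does nothing otherwise). It feeds fread() a made-up five-column data set through its text argument instead of a file path.

```r
# Sketch only: runs when the data.table package is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Made-up data with five variables, a through e
  csv_text <- "a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10"

  by_drop_pos    <- fread(text = csv_text, drop = 2:4)
  by_select_pos  <- fread(text = csv_text, select = c(1, 5))
  by_drop_name   <- fread(text = csv_text, drop = c("b", "c", "d"))
  by_select_name <- fread(text = csv_text, select = c("a", "e"))

  print(by_select_name)
}
```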

The fread() function produces data frames that look slightly different when you print them out. That's because another class named data.table is assigned to the resulting data frames. The printout of such data.table objects is different. Does something similar happen with the data frames generated by readr?

The class of the result of fread() is both data.table and data.frame. read_tsv() creates an object with three classes: tbl_df, tbl and data.frame.

What's the benefit of these additional classes? Well, it allows for a different treatment of printouts, for example.


Udacity Intro to Inferential Statistics Ch. 12 - One-Way ANOVA

ANOVA = ANalysis Of VAriance

This chapter was all about testing three or more independent samples, and it was awesome. You basically measure the variance between samples (how much their means differ from one another) and divide by the variance within samples (the variability of each sample itself).

It was a little light on the actual explanation, especially on the degrees of freedom part, but that’s probably because they covered it already, and I just need to remember how degrees of freedom work. If you have X number of values that have to add up to a certain sum, you can choose X-1 values, but the final value is fixed in order to get your sum. So, that’s why you see n-1 degrees of freedom for things.

In ANOVA, there are two degrees of freedom, one for the between samples variance (the numerator) and one for the within samples variance (the denominator). The numerator’s df is the total number of samples (K) minus 1. The denominator’s df is the total number of observations from all samples (N), minus K.

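This expression gives you the F-statistic, and it’s not normally distributed. That’s because it’s a ratio of two variances, which can’t be negative, so the F-statistic can never be negative either and its distribution is skewed to the right. There are different critical F values depending on the alpha level and the two degrees of freedom.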
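Here's a sketch of the calculation by hand, checked against R's own ANOVA table. The scores are made up, and the variable names are just mine.

```r
# Made-up scores for three independent samples of three observations each
scores <- c(2, 3, 4,  5, 6, 7,  6, 8, 10)
group  <- factor(rep(c("A", "B", "C"), each = 3))

K <- nlevels(group)   # number of samples
N <- length(scores)   # total number of observations

group_means <- tapply(scores, group, mean)
group_sizes <- tapply(scores, group, length)

# Between-samples variability: how far each sample mean is from the grand mean
ss_between <- sum(group_sizes * (group_means - mean(scores))^2)

# Within-samples variability: how far each score is from its own sample mean
ss_within <- sum((scores - ave(scores, group))^2)

# F = (between variance, df = K - 1) / (within variance, df = N - K)
f_stat <- (ss_between / (K - 1)) / (ss_within / (N - K))
print(f_stat)

# Same number from R's built-in ANOVA table
f_builtin <- anova(lm(scores ~ group))[["F value"]][1]
print(f_builtin)

# Critical F value for alpha = 0.05 with these degrees of freedom
f_critical <- qf(0.95, df1 = K - 1, df2 = N - K)
print(f_critical)
```

If f_stat is bigger than f_critical, you reject the null hypothesis that all the sample means are equal.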

I can’t recommend Udacity’s courses enough. They have the best UX of all the MOOCs I’ve taken, and they do a terrific job of walking you step by step through the instruction material. It’s a lot of fun. I just wish I were rich enough to afford the $200/month nanodegree program. Maybe some day.
