Discover Top Posts Tagged with #tidyverse

I figured out how to use ggplot to make crochet in the round charts to plan colorwork 👀

For any fellow R and fiber arts nerds, here's my code: https://pastebin.com/U1A7mB3B

hey!! i saw a post of yours reblogged and you said you were using the program R! im registering for my first year of college rn and ill be taking a class with that program. any tips or info you can share?

Hey you, @flightymacaw!

I’d suggest installing RStudio (linked), which gives you a great interface for writing your scripts and shows output, libraries and help/package documentation in one window.

Learn what data types there are.

Learn about data structures and data frames.

Get familiar with the general structure of the R language. If you can, maybe already figure out what topics will be covered in the class and read up on the stats? (Although this is reaaally ambitious, by my standards, for a first-year student, but it can’t hurt.)

On the RStudio website there are also cheatsheets and links to blogs.

r-bloggers is also a great resource but this might be more helpful once you are more advanced and have specific issues to solve. I find it difficult to pinpoint where to start because I just learnt it on my own and that probably wasn’t the best way.

Most valuable packages:

anything in the tidyverse but mostly dplyr and ggplot2. Here’s a good book to learn how to work them (online and free).

I hope this isn’t too overwhelming.

Generally, I’d suggest you just play around in RStudio. Generate or download a data set and write a few lines, calculate some ratios, plot something, just to get familiar with it.
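As a starting point for that kind of playing around, here is a minimal sketch in base R (no packages needed). The data set is made up purely for practice, and the column names are just examples, so swap in whatever you like.

```r
# Made-up practice data: three yarn purchases (values are illustrative only)
yarn <- data.frame(
  colour = c("red", "blue", "green"),
  grams  = c(100, 50, 200),
  price  = c(6.5, 4.0, 11.0)
)

# Inspect the structure: column names, data types and first values
str(yarn)

# Calculate a ratio and store it as a new column: price per 100 g
yarn$price_per_100g <- yarn$price / yarn$grams * 100

# Plot something: a quick base-R scatter plot of weight against price
plot(yarn$grams, yarn$price, xlab = "Grams", ylab = "Price")
```

Running it line by line in RStudio (Ctrl/Cmd + Enter) and watching the Environment pane change is a good way to see what each step does.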

If your code won’t run, check whether you closed all brackets and installed and loaded the packages/libraries you need. Those are common issues. You can’t really break anything, so just play around and enjoy the steep learning curve :’D

#ask me ask me ask me #i wont say no how could i #r #r programming #r studio #tidyverse #r-bloggers #data science #flightymacaw

Updates on negative filtering of data-frames in R!

In a recent post, I wrote about how negative filtering ("keep only the entries which do not have X") works with missing values, in the hope of sparing at least some other people bad surprises.

There is an update! There is a new version out of the package dplyr, and it has a function called "filter_out" which could help you (depending on what you want).

dplyr::filter_out will remove rows from a data-frame that match a certain condition. Unlike != with plain dplyr::filter, however, it will retain the missing values. I have created a small illustration of the differences between different ways of negative filtering data-frames below.

dplyr::filter_out() could help you, depending on what you are doing and what your underlying assumptions are (what will you expect of your code a year from now?). It isn't necessary to change your whole way of doing things just because this option now exists; that could have bad consequences if you haven't fully thought it through. As always, be safe by making your assumptions explicit and testing them.
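To make the two behaviours concrete, here is a rough sketch in base R only. It does not call dplyr::filter_out() itself (that needs the new dplyr version); it just spells out the two possible treatments of missing values by hand, using the same toy data as the earlier post. The object names are my own.

```r
df <- data.frame(
  column1 = c("apples", "pears", "mango", NA, NA),
  column2 = c("The Big Short", "The Big Short", NA, NA, NA)
)

# TRUE where the row matches, NA where column2 is missing
is_match <- df$column2 == "The Big Short"

# Behaviour 1: keep only rows known NOT to match, so rows with NA are dropped
# (this is what a != condition inside dplyr::filter() gives you)
known_non_match <- df[which(!is_match), , drop = FALSE]

# Behaviour 2: drop only rows known to match, so rows with NA are kept
# (this is the behaviour described above for filter_out)
not_known_match <- df[!(is_match %in% TRUE), , drop = FALSE]

nrow(known_non_match)
nrow(not_known_match)
```

Which of the two you want depends on what a missing value means in your data, which is exactly the assumption worth writing down.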

All the best Hedders

#rstats #tidyverse #r

The tidyverse is a war crime.

#r #tidyverse

Unexpected issue with negative filters that can cause big troubles

I'm back with yet another potentially painful lesson (see previous post on subsetting tibbles). This may be expected for you, in which case - great! I love that for you. However, I have made this mistake, found the weirdness and corrected it more than once, which is why I'm doing a post here. This should sear this fact into my memory so I don't have to do the same dance again. Maybe it'll also be useful to someone else.

Because of the way missing values work when you do logical tests, you need to think a little bit extra hard when you use dplyr::filter() with a negative condition.

In short, this code:

df %>% dplyr::filter(column2 != "The Big Short") %>% nrow()

will only return rows for which the cells in column2 aren't exactly "The Big Short" and also AREN'T MISSING!

You can test this out yourself with the code below:

library(dplyr)  # provides the %>% pipe
df <- data.frame(column1 = c("apples", "pears", "mango", NA, NA),
                 column2 = c("The Big Short", "The Big Short", NA, NA, NA))
df %>% dplyr::filter(column2 != "The Big Short") %>% nrow()

This code will return 0, because there are no rows where column2 isn't "The Big Short" and isn't missing.

This all boils down to the fact that if you compare NA to any other value, the answer is... drumroll: NA! dplyr::filter only returns rows where the test evaluates to TRUE, so outcomes that result in NA are excluded.
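If you do want the rows with missing values to survive the negative filter, one common workaround is to say so explicitly with is.na(). This is a minimal sketch using the same toy data; the variable names n_strict and n_keep_na are just for illustration.

```r
library(dplyr)

df <- data.frame(
  column1 = c("apples", "pears", "mango", NA, NA),
  column2 = c("The Big Short", "The Big Short", NA, NA, NA)
)

# The comparison on its own: NA wherever column2 is missing
df$column2 != "The Big Short"

# Plain negative filter: rows where the test is NA are dropped
n_strict <- df %>%
  filter(column2 != "The Big Short") %>%
  nrow()

# Explicitly keep the missing values too: NA | TRUE evaluates to TRUE
n_keep_na <- df %>%
  filter(column2 != "The Big Short" | is.na(column2)) %>%
  nrow()
```

Writing the is.na() part out has the side benefit of documenting, right in the code, what you decided to do with missing values.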

We are actually reminded of this in the description of dplyr::filter()

To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped (dplyr package v1.1.4 description for filter)

For a full reminder of how different types of missing values work, I recommend running through these tests:

NA != NA
NA != "mango"
NULL == NULL
NULL != "mango"
NaN != NaN
NaN != "mango"
"" != ""
"" != "mango"
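For reference, here is a small sketch that stores some of those results so they can be inspected, with the outcomes I am sure of noted in comments. The variable names are mine. I have left out NaN != "mango" because it involves coercing NaN to text, so it is best to run that one yourself.

```r
na_vs_na       <- NA != NA          # NA: unknown compared with unknown
na_vs_text     <- NA != "mango"     # NA
null_vs_null   <- NULL == NULL      # logical(0): zero-length, not NA
null_vs_text   <- NULL != "mango"   # logical(0)
nan_vs_nan     <- NaN != NaN        # NA
empty_vs_empty <- "" != ""          # FALSE: an empty string is a real value
empty_vs_text  <- "" != "mango"     # TRUE

# isTRUE() turns "NA or TRUE or FALSE" into a strict yes/no answer
isTRUE(na_vs_text)      # FALSE
isTRUE(empty_vs_text)   # TRUE
```

The NULL cases are the odd ones out: the result has length zero, so there is nothing for a filter to keep or drop at all.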

Over and out!

#rstats #tidyverse #r

Unexpected behaviour from tidyverse tibbles that can cause big problems

Tidyverse is a set of R-packages that work on the same principles and work well with each other, and they have revolutionised the R-universe. Besides creating a whole lot of handy functions (my favourite function is dplyr::select) and bringing pipes (%>%) into the R-universe (since 2021, we also have pipes in base-R |>) - tidyverse also introduced a new data object: tibbles!

Rumour has it that they are called tibbles because that's sort of what it sounds like when someone from New Zealand says "tables", and R in general, and tidyverse in particular, has a NZ-bent.

Tibbles are much like base-R data.frames, but with some extra functionality. Most of the time, I don't make use of the added features besides grouping. However, I make use of tidyverse functions and sometimes that can turn data-frames into tibbles without me being fully aware - and this can cause problems for subsetting!

Recently, I learned about a behaviour that sets tibbles apart from data.frames that caused me some serious confusion and troubles: Unlike regular base data-frames, tbl[,2] will not produce a vector - but a tibble. I'll explain.

Naturally, some readers may not be surprised by this. If that is you, you can stop reading here and congratulations on your assumptions aligning with the world. Let's start with data-frames, the more common and familiar type of object. Columns in data-frames are vectors. In fact, data-frames in base-R are a "list of equal length vectors" in terms of R-ontology. You can call on specific columns using the $-operator or []-indexing.

So, if we have a data-frame and would like to subset its columns and check their structure we can do this:

> df <- data.frame(name = c("Hedvig", "Stephen", "Filip"), age = c(36, 37, NA))
> df$age %>% str()
 num [1:3] 36 37 NA
> df[,2] %>% str()
 num [1:3] 36 37 NA

Those str() calls tell us that both instances of subsetting to the second column are numeric vectors of length 3 (num [1:3]). If we pass these through a function like as.character(), they will still be vectors of length 3 - just characters instead (chr [1:3]).

> df$age %>% as.character() %>% str()
 chr [1:3] "36" "37" NA
> df[,2] %>% as.character() %>% str()
 chr [1:3] "36" "37" NA

If we then turn that data-frame into a tibble and do the same thing, something different happens.

> df_tibble <- dplyr::as_tibble(df)
> df_tibble$age %>% str()
 num [1:3] 36 37 NA
> df_tibble[,2] %>% str()
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
 $ age: num [1:3] 36 37 NA

The subsetting with the $-operator will indeed be reported to be a numeric vector of length 3. However, using the []-indexing will result in a tibble rather than a vector!

This has the consequence that if we then proceed to use as.character on our subsetted object, the second instance will look quite different indeed:

> df_tibble$age %>% as.character() %>% str()
 chr [1:3] "36" "37" NA
> df_tibble[,2] %>% as.character() %>% str()
 chr "c(36, 37, NA)"

In the second instance, instead of a character vector of length 3, we get a single string of length 1 with all the content collapsed. This is because when we select a column of a tibble using the []-notation, we get a tibble, and when a tibble is passed to as.character, each column is collapsed to one character string. Technically, this happens if you pass a data.frame to as.character as well. If the tibble or data-frame has many columns, they are each collapsed separately.

> df %>% as.character() %>% str()
 chr [1:2] "c(\"Hedvig\", \"Stephen\", \"Filip\")" "c(36, 37, NA)"

Why is this an issue?

The difference in behaviour between df[,2] and df_tibble[,2] can become a problem if you don't anticipate that you're dealing with a tibble and expect a data-frame. For example, if you pass a data-frame through tidyverse functions such as tidyr::pivot_longer() it will become a tibble. It is possible that later on in your code, you use []-notation to subset this object as if it were a data-frame and then things may break. Here's an example of a data-frame transforming to a tibble because of tidyr::pivot_longer:

> df %>% tidyr::pivot_longer(cols = "name") %>% str()
tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
 $ age  : num [1:3] 36 37 NA
 $ name : chr [1:3] "name" "name" "name"
 $ value: chr [1:3] "Hedvig" "Stephen" "Filip"

Why is it even like this?

Tibbles are designed to be more consistent and safer for data science workflows because they never change the type of the returned object. This prevents bugs that could occur when a data frame with one column suddenly becomes a vector instead.

This is generally preferred in the data science community; it is, for example, how data-frames work in Python's pandas. This behaviour is mainly an "issue" if you are used to working with data-frames in R and therefore expect to get a vector when using []-subsetting - it's a habits issue rather than an "actual" issue.

Some say that it "should" be tibbles all the way down.

What can you do to ameliorate the situation?

Fear not, all is not lost. There are several ways to sort this out. Before changing your code, update your assumptions: do not expect that [] will return a vector from a 2-dimensional object.

Then, consider one or several of these options:

1. perhaps a cop-out, but works!

Use as.data.frame() after certain/most tidyverse functions. If you don't need tibble-specific functionality, then this can work well if you've already got a project up-and-running and want to just continue on "data-frame only"-habits.

2. replace [] with $

Use $ instead; see the examples above.

3. replace [] with dplyr::pull()

> df_tibble %>% dplyr::pull(2) %>% str()
 num [1:3] 36 37 NA

Bonus: dplyr::pull() works with numeric for position of column, column name in quotation marks and even column name without quotation marks. The second option below, without quotation marks, is discouraged as it could refer to a dynamically defined variable elsewhere in the environment.

> df_tibble %>% dplyr::pull("age") %>% str()
 num [1:3] 36 37 NA
> df_tibble %>% dplyr::pull(age) %>% str()
 num [1:3] 36 37 NA

4. more technically elegant but requires that you are very sure of your data flow

When you use the []-subsetting, add [[1]] like so:

> df_tibble[,2][[1]] %>% as.character() %>% str()
 chr [1:3] "36" "37" NA

That will access the right level of structure, the column you're after "inside" the tibble you subsetted to. However, note that if you do input a data-frame and use this, you'll get the first item in the vector instead:

> df[,2][[1]] %>% as.character() %>% str()
 chr "36"

So this only works if you know you're getting tibbles and NOT data-frames.

5. even better than (4)!

After I made this post originally, Ben Bolker on Bluesky pointed out that a more reliable way of indexing with [] is to use the actual strings of column names. So:

> df[,"name"]
[1] "Hedvig"  "Stephen" "Filip"
> df_tibble[,"name"]
# A tibble: 3 x 1
  name
  <chr>
1 Hedvig
2 Stephen
3 Filip

Using a string with [,] makes the code clearer and avoids position errors, but if the object is a tibble you'll still get a tibble, not a vector. However... Ben pointed out that we can update our approach from (4) to

> df[["name"]]
[1] "Hedvig"  "Stephen" "Filip"

> df_tibble[["name"]]
[1] "Hedvig"  "Stephen" "Filip"

and then we do get a vector even from the tibble! Very nice, thanks Ben! I don't know why I didn't think of that.
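If you want to bake that into your own code, one option is a tiny wrapper around [[ ]] that also checks its assumptions. This is only a sketch: column_as_vector is a made-up helper name, not something from dplyr or tibble, and the checks are just the ones I would want.

```r
library(dplyr)

# Hypothetical helper (not part of any package): return one column as a plain
# vector, whether the input is a base data-frame or a tibble
column_as_vector <- function(data, column) {
  stopifnot(is.data.frame(data))   # tibbles count as data frames too
  out <- data[[column]]            # [[ ]] extracts the column itself
  stopifnot(!is.null(out))         # NULL would mean the column does not exist
  stopifnot(is.atomic(out))        # make sure we really got a plain vector
  out
}

df <- data.frame(name = c("Hedvig", "Stephen", "Filip"), age = c(36, 37, NA))
df_tibble <- as_tibble(df)

column_as_vector(df, "age") %>% str()
column_as_vector(df_tibble, "age") %>% str()
```

Both calls should report a numeric vector of length 3, so code further down no longer needs to care which kind of object it was handed.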

Hope that helps! Thanks a million to my smart and handsome husband Stephen for helping me debug code today which was malfunctioning due to these issues.

Extra:

Data-frames can also behave like tibbles in terms of []-subsetting, i.e. remain a data-frame instead of becoming a vector. The [ operator for data-frames has an argument called "drop", which is TRUE by default when you subset columns this way. If you set it to FALSE, the dimensions of the object are not dropped: you get a data-frame back, which is the same behaviour as with tibbles.

> df[,2, drop = TRUE] %>% str()
 num [1:3] 36 37 NA
> df[,2, drop = FALSE] %>% str()
'data.frame': 3 obs. of 1 variable:
 $ age: num 36 37 NA

#r #rstats #tidyverse

Real-World Statistics with R: Solving Problems with Data

In today's data-driven world, the ability to analyze data effectively is crucial. R, a powerful statistical programming language, offers a plethora of tools for data analysis. Whether you're interested in understanding economic trends or delving into public health data, R can help you uncover insights hidden within complex datasets. This blog will guide you through the essentials of statistical analysis in R, including linear regression, ANOVA, and chi-square tests. Additionally, we'll explore how to clean and prepare real-world datasets using popular R packages like tidyverse, dplyr, and readr. Finally, we'll provide a real-world hook by examining how R can be used to analyze COVID-19 trends or population health surveys.

Introduction to Statistical Analysis in R

R is renowned for its robust statistical capabilities. Here are some foundational techniques you'll encounter when working with R:

Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. In R, you can perform linear regression using the lm() function. This technique is invaluable for predicting outcomes and understanding correlations within your data.
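As a minimal sketch of what this looks like in practice, the example below fits a simple model on mtcars, a data set that ships with R. The choice of variables (fuel economy explained by weight) is only for illustration.

```r
# Model miles per gallon as a linear function of car weight
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table, R-squared and p-values
summary(fit)

# Just the estimated intercept and slope
coef(fit)

# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs)
predicted_mpg <- predict(fit, newdata = data.frame(wt = 3))
```

The formula interface (outcome ~ predictor) is shared by most modelling functions in R, so it is worth getting comfortable with.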

ANOVA (Analysis of Variance)

ANOVA is used to determine whether there are significant differences between the means of three or more groups. This is particularly useful in experimental designs where you want to compare different treatment effects. In R, the aov() function is your go-to for conducting ANOVA tests.
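A short sketch, assuming the built-in PlantGrowth data set (plant weights under a control and two treatments) is a reasonable stand-in for an experiment with three groups:

```r
# One-way ANOVA: does mean plant weight differ between the three groups?
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# ANOVA table with the F statistic and p-value
summary(fit_aov)

# Pull the p-value for the group effect out of the table
p_value <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# Follow-up pairwise comparisons to see which groups differ
TukeyHSD(fit_aov)
```

A significant F test only says that at least one group mean differs, which is why a follow-up such as TukeyHSD() is often run afterwards.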

Chi-Square Test

The chi-square test is a non-parametric test used to examine the association between categorical variables. It helps determine if the distribution of sample categorical data matches an expected distribution. You can perform chi-square tests in R using the chisq.test() function.
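Here is a small sketch of a test of association on a 2 x 2 table. The counts are invented for illustration and do not come from any real survey.

```r
# Made-up contingency table: two groups answering a yes/no question
counts <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), answer = c("yes", "no"))
)

# Chi-square test of independence between group and answer
result <- chisq.test(counts)

result$p.value    # probability of a table this extreme if there is no association
result$expected   # expected counts under independence; should not be too small
```

With real data you would usually build the table with table() from two categorical columns rather than typing the counts in by hand.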

Cleaning and Preparing Real-World Datasets

Before you can analyze data effectively, you need to ensure it's clean and well-prepared. Real-world datasets often come with missing values, inconsistencies, and irrelevant information. Here's how you can tackle these challenges:

Using tidyverse

The tidyverse is a collection of R packages designed for data science. It includes tools for data manipulation, visualization, and more. tidyverse makes data cleaning intuitive and efficient.

Data Manipulation with dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs to help solve data problems. With functions like filter(), select(), mutate(), and summarize(), you can transform your data into a tidy format that is easy to work with.
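To show how these verbs chain together, here is a minimal sketch on the built-in mtcars data. The new column name wt_kg and the grouping by cylinder count are just examples.

```r
library(dplyr)

summary_by_cyl <- mtcars %>%
  filter(!is.na(mpg)) %>%               # keep rows with a known mpg
  select(mpg, cyl, wt) %>%              # keep only the columns we need
  mutate(wt_kg = wt * 453.592) %>%      # wt is in 1000 lbs; convert to kg
  group_by(cyl) %>%                     # one group per cylinder count
  summarize(mean_mpg = mean(mpg), n = n())

summary_by_cyl
```

Each verb takes a data frame and returns a data frame, which is what makes it possible to read the pipeline top to bottom.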

Reading Data with readr

readr is a part of the tidyverse and specializes in reading rectangular data like CSV files into R. The read_csv() function is particularly useful for importing large datasets efficiently.
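A self-contained sketch: it first writes a tiny CSV to a temporary file (made-up rows, one with a missing value) and then reads it back with read_csv(). In real use you would point read_csv() at your own file path instead.

```r
library(readr)

# Write a small example file so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("region,cases", "north,12", "south,", "east,7"), csv_path)

# Read it back; show_col_types = FALSE silences the column-type message
cases <- read_csv(csv_path, show_col_types = FALSE)

cases                      # a tibble with one row per region
sum(is.na(cases$cases))    # the empty field is read as a missing value
```

Note that read_csv() returns a tibble rather than a base data-frame.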

Real-World Hook: Analyzing COVID-19 Trends with R

To see R in action, let's consider the analysis of COVID-19 trends. During the pandemic, vast amounts of data were collected on infection rates, vaccination progress, and more. R can help you analyze this data to uncover patterns and trends.

For example, you can use linear regression to predict future case numbers based on current trends or conduct ANOVA to compare infection rates across different regions. With the help of visualization tools like ggplot2, another package in the tidyverse, you can create informative charts that highlight key insights.

FAQs

1. What is R and why is it popular for statistical analysis? R is a programming language and environment used for statistical computing and graphics. It's popular due to its extensive library of packages and tools for data manipulation, analysis, and visualization.

2. How can I get started with R for statistical analysis? Begin by installing R and RStudio, an integrated development environment for R. Familiarize yourself with basic R syntax and explore the tidyverse packages for data manipulation and analysis.

3. What are some common challenges in real-world data analysis? Common challenges include dealing with missing data, inconsistent data formats, and large datasets. R provides numerous functions and packages to handle these issues effectively.

4. How can I visualize my data in R? R offers several packages for data visualization, with ggplot2 being one of the most popular. It allows you to create complex plots from data in a straightforward manner.

5. Can I use R for machine learning tasks? Yes, R supports machine learning through packages like caret and randomForest. These tools enable you to implement various machine learning algorithms for predictive analytics.

By leveraging R's capabilities, you can turn raw data into actionable insights, making informed decisions based on solid statistical analysis. Whether you're a beginner or a seasoned analyst, R's tools are invaluable for solving real-world problems with data.

A post shared by Sunshine Digital Services (@sunshinedigital.services)

#StatisticalAnalysis #DataWithR #RForDataScience #Tidyverse #RealWorldData #SolveWithData #RStats #DataCleaning #DataDrivenDecisions #SunshineDigitalServices #Instagram #Youtube

A post shared by Assignment On Click (@assignmentonclick)

#DataWrangling #dplyr #tidyr #Tidyverse #RForDataScience #RProgramming #DataCleaningR #DataManipulation #LearnR #TechForStudents #AssignmentHelp #AssignmentOnClick #assignment #assignment help #assignment service #assignmentexperts #assignmentwriting #Instagram