Hedvig's Rstats outlet @hedvigsr - Tumblr Blog

Updates on negative filtering of data-frames in R!

In a recent post, I wrote about how negative filtering ("remove all entries which do not have X") works with missing values - in the hope of sparing at least some other people from bad surprises.

There is an update! There is a new version out of the package dplyr, and it has a function called "filter_out" which could help you (depending on what you want).

dplyr::filter_out() will remove rows from a data-frame that match a certain condition. Unlike != with plain dplyr::filter(), however, it will retain the missing values. I have created a small illustration of the differences between different ways of negative filtering data-frames below.

dplyr::filter_out() could help you, depending on what you are doing and what your underlying assumptions are (what will you a year from now expect of your code?). It isn't necessary to change your whole way of doing things just because this option now exists; that could have bad consequences if you haven't fully thought it through. As always, be safe by making your assumptions explicit and testing them.

All the best Hedders

#rstats #tidyverse #r

Unexpected issue with negative filters that can cause big troubles

I'm back with yet another potentially painful lesson (see previous post on subsetting tibbles). This may be expected for you, in which case - great! I love that for you. However, I have made this mistake, found the weirdness and corrected it more than once, which is why I'm doing a post here. This should sear this fact into my memory so I don't have to do the same dance again. Maybe it'll also be useful to someone else.

Because of the way missing values work when you do logical tests, you need to think a little bit extra hard when you use dplyr::filter() with a negative condition.

In short, this code:

df %>% dplyr::filter(column2 != "The Big Short") %>% nrow()

will only return rows for which the cells in column2 aren't exactly "The Big Short" and also AREN'T MISSING!

You can test this out yourself with the code below:

library(dplyr) # provides filter() and the %>% pipe
df <- data.frame(column1 = c("apples", "pears", "mango", NA, NA), column2 = c("The Big Short", "The Big Short", NA, NA, NA))
df %>% dplyr::filter(column2 != "The Big Short") %>% nrow()

This code will return 0, because there are no rows where column2 isn't "The Big Short" and also isn't missing.

This all boils down to the fact that if you compare NA to any other value, the answer is... drumroll: NA! dplyr::filter only returns rows where the test evaluates to TRUE, so outcomes that result in NA are excluded.

We are actually reminded of this in the description of dplyr::filter()

To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped (dplyr package v1.1.4 description for filter)
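If you do want the rows with missing values to survive a negative filter, one way is to say so in the condition itself. This is only a small sketch using the same toy data as above, with dplyr::filter() and is.na(); the object names dropped_na and kept_na are made up for the illustration.

```r
df <- data.frame(
  column1 = c("apples", "pears", "mango", NA, NA),
  column2 = c("The Big Short", "The Big Short", NA, NA, NA)
)

# The comparison itself yields NA wherever column2 is missing
df$column2 != "The Big Short"

# Plain negative filter: rows where the test evaluates to NA are dropped
dropped_na <- dplyr::filter(df, column2 != "The Big Short")

# Explicitly keep the rows where column2 is missing as well
kept_na <- dplyr::filter(df, is.na(column2) | column2 != "The Big Short")

nrow(dropped_na)
nrow(kept_na)
```

Here the first filter leaves 0 rows and the second leaves the 3 rows where column2 is NA, because TRUE | NA evaluates to TRUE.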

For a full reminder of how different types of missing values work, I recommend running through these tests:

NA != NA
NA != "mango"
NULL == NULL
NULL != "mango"
NaN != NaN
NaN != "mango"
"" != ""
"" != "mango"

Over and out!

#rstats #tidyverse #r

Unexpected behaviour from tidyverse tibbles that can cause big problems

Tidyverse is a set of R-packages that work on the same principles and work well with each other, and they have revolutionised the R-universe. Besides creating a whole lot of handy functions (my favourite function is dplyr::select) and bringing pipes (%>%) into the R-universe (since 2021, we also have pipes in base-R |>) - tidyverse also introduced a new data object: tibbles!

Rumour has it that they are called tibbles because that's sort of what it sounds like when someone from New Zealand says "tables", and R in general, and tidyverse in particular, has a NZ-bent.

Tibbles are much like base-R data.frames, but with some extra functionality. Most of the time, I don't make use of the added features besides grouping. However, I make use of tidyverse functions and sometimes that can turn data-frames into tibbles without me being fully aware - and this can cause problems for subsetting!

Recently, I learned about a behaviour that sets tibbles apart from data.frames that caused me some serious confusion and troubles: Unlike regular base data-frames, tbl[,2] will not produce a vector - but a tibble. I'll explain.

Naturally, some readers may not be surprised by this. If that is you, you can stop reading here and congratulations on your assumptions aligning with the world. Let's start with data-frames, the more common and familiar type of object. Columns in data-frames are vectors. In fact, data-frames in base-R are a "list of equal length vectors" in terms of R-ontology. You can call on specific columns using the $-operator or []-indexing.

So, if we have a data-frame and would like to subset its columns and check their structure we can do this:

> df <- data.frame(name = c("Hedvig", "Stephen", "Filip"), age = c(36, 37, NA))
> df$age %>% str()
 num [1:3] 36 37 NA
> df[,2] %>% str()
 num [1:3] 36 37 NA

Those str() calls tell us that both instances of subsetting to the second column are numeric vectors of length 3 (num [1:3]). If we pass these through a function like as.character(), they will still be vectors of length 3 - just characters instead (chr [1:3]).

> df$age %>% as.character() %>% str()
 chr [1:3] "36" "37" NA
> df[,2] %>% as.character() %>% str()
 chr [1:3] "36" "37" NA

If we then turn that data-frame into a tibble and do the same thing, something different happens.

> df_tibble <- dplyr::as_tibble(df)
> df_tibble$age %>% str()
 num [1:3] 36 37 NA
> df_tibble[,2] %>% str()
tibble [3 × 1] (S3: tbl_df/tbl/data.frame)
 $ age: num [1:3] 36 37 NA

The subsetting with the $-operator will indeed be reported to be a numeric vector of length 3. However, using the []-indexing will result in a tibble rather than a vector!

This has the consequence that if we then proceed to use as.character() on our subsetted object, the second instance will look quite different indeed:

> df_tibble$age %>% as.character() %>% str()
 chr [1:3] "36" "37" NA
> df_tibble[,2] %>% as.character() %>% str()
 chr "c(36, 37, NA)"

In the second instance, instead of outputting a character vector of length 3, we get a single string (a character vector of length 1) with all the content collapsed. This is because when we select a column of a tibble using the []-notation, we get a tibble, and when a tibble is passed to as.character(), each column is collapsed to a single character string. Technically, this happens if you pass a data.frame to as.character() as well. If the tibble or data-frame has many columns, they are each collapsed separately.

> df %>% as.character() %>% str()
 chr [1:2] "c(\"Hedvig\", \"Stephen\", \"Filip\")" "c(36, 37, NA)"

Why is this an issue?

The difference in behaviour between df[,2] and df_tibble[,2] can become a problem if you don't anticipate that you're dealing with a tibble and expect a data-frame. For example, if you pass a data-frame through tidyverse functions such as tidyr::pivot_longer() it will become a tibble. It is possible that later on in your code, you use []-notation to subset this object as if it were a data-frame and then things may break. Here's an example of a data-frame transforming to a tibble because of tidyr::pivot_longer:

> df %>% tidyr::pivot_longer(cols = "name") %>% str()
tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
 $ age  : num [1:3] 36 37 NA
 $ name : chr [1:3] "name" "name" "name"
 $ value: chr [1:3] "Hedvig" "Stephen" "Filip"

Why is it even like this?

Tibbles are designed to be more consistent and safer for data science workflows because they never change the type of the returned object. This prevents bugs that could occur when a data frame with one column suddenly becomes a vector instead.

This is generally preferred in the data science community; it is for example how data-frames work in Python's pandas. This behaviour is mainly an "issue" if you are used to working with data-frames in R and therefore expect to get a vector when using []-subsetting - it's a habits issue rather than an "actual" issue.

Some say that it "should" be tibbles all the way down.

What can you do to ameliorate the situation?

Fear not, all is not lost. There are several ways to sort this out. Before changing your code: update your assumptions, do not expect that [] will return a vector from a 2-dimensional object.

Then, consider one or several of these options:

1. perhaps a cop-out, but works!

use as.data.frame() after certain/most tidyverse functions. If you don't need tibble-specific functionality, then this can work well if you've already got a project up-and-running and want to just continue on "data-frame only"-habits.

2. replace [] with $

Use $ instead, see examples above.

3. replace [] with dplyr::pull()

> df_tibble %>% dplyr::pull(2) %>% str()
 num [1:3] 36 37 NA

Bonus: dplyr::pull() works with numeric for position of column, column name in quotation marks and even column name without quotation marks. The second option below, without quotation marks, is discouraged as it could refer to a dynamically defined variable elsewhere in the environment.

> df_tibble %>% dplyr::pull("age") %>% str()
 num [1:3] 36 37 NA
> df_tibble %>% dplyr::pull(age) %>% str()
 num [1:3] 36 37 NA

4. more technically elegant but requires that you are very sure of your data flow

When you use the []-subsetting, add [[1]] like so:

> df_tibble[,2][[1]] %>% as.character() %>% str()
 chr [1:3] "36" "37" NA

That will access the right level of structure, the column you're after "inside" the tibble you subsetted to. However, note that if you do input a data-frame and use this, you'll get the first item in the vector instead:

> df[,2][[1]] %>% as.character() %>% str()
 chr "36"

So this only works if you know you're getting tibbles and NOT data-frames.

5) even better than (4)!

After I made this post originally, Ben Bolker on Bluesky pointed out that a more reliable way of indexing with [] is to use the actual strings of column names. So:

> df[,"name"]
[1] "Hedvig" "Stephen" "Filip"
> df_tibble[,"name"]
# A tibble: 3 x 1
  name
  <chr>
1 Hedvig
2 Stephen
3 Filip

Using a string with [,] makes the code clearer and avoids position errors, but if the object is a tibble you'll still get a tibble - not a vector. However, Ben pointed out that we can update our approach from (4) to

> df[["name"]]
[1] "Hedvig" "Stephen" "Filip"

> df_tibble[["name"]]
[1] "Hedvig" "Stephen" "Filip"

and then we do get a vector even from the tibble! Very nice, thanks Ben! I don't know why I didn't think of that.
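To make that habit stick, the [[ ]]-approach can be wrapped in a tiny helper that also checks its assumptions. This is just a sketch: get_column() is a made-up name, not a function from any package, and it assumes the tibble package is installed.

```r
df <- data.frame(name = c("Hedvig", "Stephen", "Filip"), age = c(36, 37, NA))
df_tibble <- tibble::as_tibble(df)

# Hypothetical helper: return one column as a plain vector,
# regardless of whether the input is a data-frame or a tibble
get_column <- function(x, column_name) {
  stopifnot(is.data.frame(x), column_name %in% names(x))
  x[[column_name]]
}

ages_df <- get_column(df, "age")
ages_tbl <- get_column(df_tibble, "age")

# Both are numeric vectors of length 3
str(ages_df)
str(ages_tbl)
identical(ages_df, ages_tbl)
```

The stopifnot() call is there to make the assumptions explicit: the input has to be some kind of data-frame and the column has to exist.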

Hope that helps! Thanks a million to my smart and handsome husband Stephen for helping me debug code today which was malfunctioning due to these issues.

Extra:

Data-frames can also behave like tibbles in terms of []-subsetting - i.e. remain a data-frame instead of becoming a vector. By default, []-subsetting of data-frames has an argument called "drop" set to TRUE. If you change it to FALSE, the dimensions of the object are not dropped: you get a data-frame back, which is the same behaviour as with tibbles.

> df[,2, drop = TRUE] %>% str()
 num [1:3] 36 37 NA
> df[,2, drop = FALSE] %>% str()
'data.frame': 3 obs. of 1 variable:
 $ age: num 36 37 NA

#r #rstats #tidyverse

problems with citing R-packages automagically

It's nice to cite people whose work has aided you, and this goes for R-packages as well. You can use the command citation("name of package") to get information on how the package creators want to be cited for their work. See example below:

______________________________________

> citation("ape")
To cite ape in a publication please use:

Paradis E, Schliep K (2019). “ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R.” Bioinformatics, 35, 526-528. doi:10.1093/bioinformatics/bty633 https://doi.org/10.1093/bioinformatics/bty633.

A BibTeX entry for LaTeX users is

@Article{,
  title = {ape 5.0: an environment for modern phylogenetics and evolutionary analyses in {R}},
  author = {Emmanuel Paradis and Klaus Schliep},
  journal = {Bioinformatics},
  year = {2019},
  volume = {35},
  pages = {526-528},
  doi = {10.1093/bioinformatics/bty633},
}

ape is evolving quickly, so you may want to cite its version number (found with 'library(help = ape)' or 'packageVersion("ape")').

______________________________________

Recently, I wrote a function (SH.misc::credit_packages()) which looks at all the R-scripts in your project and identifies which packages you're using. Then it generates references for them in a bibTeX file, a string that can be imported to TeX so that all packages are named, and tables with packages and versions. The function is actually a wrapper for two functions: NCmisc::list.functions.in.file() and xfun::pkg_bib().

This is so nice right, you get to reference everyone!

However, since then I've found three problems.

Problem 1) citation information can be wrong

The citation information for the package, which is pulled from the package DESCRIPTION or inst/CITATION files, can have errors in it.

So far I've seen two kinds

1a) wrong data for the reference, such as wrong year (yes, people don't necessarily write the right citation info for their own work)

1b) the bibTeX formatting can be wrong (e.g. unescaped &)

Neither is super common, but the more packages you cite the more likely these things are to crop up. While the function SH.misc::credit_packages can't automatically detect 1a type problems, it does sort out 1b) by asking xfun::pkg_bib to set tweak to T (ampersands escaped etc).

Problem 2) missing date accessed/retrieved

R-package citations are not necessarily the typical journal article or book format. Often they'll contain web links as well, like this one:

Kassambara, A. (2023) ggpubr: ggplot2 based publication ready plots. https://rpkgs.datanovia.com/ggpubr/. R package version 0.6.0.

Some publishers require references with web links to have a date accessed/retrieved specified. This is in spite of it not being necessary according to the popular bibliography style of the American Psychological Association (APA). APA states that you only need to include a retrieval date "if the work is unarchived and designed to change over time". Since citations like the one above for ggpubr include a version, they are considered archived and do not need an additional date accessed/retrieved. However, specific publishers may still require this information anyway.

What to do? Well, you'll need to insert this information. Pick a date when you ran the code in full or some other relevant date like when you installed the packages and specify that manually as date accessed/retrieved. You can set it in the field urldate and/or in note in the bibTeX entry, depending on the bibliography style. If you're a linguist using unified.bst, set it in note as {Date accessed [YY-MM-DD]}.
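If you'd rather not edit the bib-file by hand, the field can also be added with a few lines of base R. This is a rough sketch that treats the bibTeX entry as plain lines of text; the date is a placeholder, the citation for "stats" stands in for whatever package you're citing, and whether urldate is actually displayed depends on your bibliography style.

```r
# BibTeX entry as a plain character vector, one element per line
bib_lines <- as.character(utils::toBibtex(utils::citation("stats")))

access_date <- "2024-01-31" # placeholder: use the date that is relevant for you

# Find the closing brace of the entry and insert the new field just before it
closing <- max(which(trimws(bib_lines) == "}"))
with_date <- append(
  bib_lines,
  paste0("  urldate = {", access_date, "},"),
  after = closing - 1
)

writeLines(with_date)
```

For a note field instead, swap urldate for note and put the text your style expects inside the braces.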

Problem 3) sometimes some packages are missed

Unfortunately, the way that SH.misc::credit_packages() loops through your R-scripts to identify which packages you've used is not perfect. Sometimes it fails to recognise packages. Therefore it's important to check which ones were included, and if necessary supplement with package names in the argument pkgs_vec_manual.

Partial solution

All of the problems above can be ameliorated by not citing every package you've used, but just the most crucial ones. You can use the argument pkgs_vec_manual for the function SH.misc::credit_packages() to manually specify specific packages instead of all packages in your scripts. It's not ideal, you'll need to cherry pick but maybe it's worth it 🍒?

If you are asked to supply date accessed/retrieved, check first if they're asking this just out of habit or if their style-guide actually does require this information even when version data is present (as is the case for most R-packages).

Anyway, give thanks and thanks will come to you.

#rstats #r

Credit R-packages and indicate versions

There are lots of brilliant R-packages out there, on CRAN and elsewhere. Truly, if you've got a task or a problem - chances are someone else has already made a neat and robust function for just that! And all of this work is being done for free by brilliant package creators who spend their valuable time and energy making all of our lives better!

In this post, I'll show you a function I've made to create references for all the packages you use, a table of package versions and, if you'd like, a TeX-file that can be plugged into your article directly to list all of the citations.

Giving credit

When you use someone else's package, it's nice to credit their work by referencing them. The easiest way to do this is

utils::citation("name of package")

then you'll get the citation that the package creator(s) want you to use. You can even get an entry in bibTeX for import to Zotero, bib-file etc. How nice is that?
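As a small illustration using only functions that ship with R (utils::citation() and utils::toBibtex()), here is how to get both the plain citation and the bibTeX entry. The package "stats" is used only because it is always installed; swap in the package you actually want to cite.

```r
# Citation object for a package that is installed
cit <- utils::citation("stats")

# Human-readable citation
print(cit)

# The same citation as a bibTeX entry, ready to paste into a bib-file
bib <- utils::toBibtex(cit)
print(bib)
```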

Reproducibility

Another advantage of referencing packages is that the citations typically contain the version number of the package you used. This can be important because R-packages get updated regularly. Bugs are found and functions are improved upon generally. If you want users to be able to recreate your exact code, you're gonna have to tell them the versions of all the packages you use.
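A bare-bones way of recording versions, without any extra packages, could look like the sketch below. The vector pkgs is just an example; list the packages your own scripts use.

```r
# Example packages (all ship with R); replace with the ones you use
pkgs <- c("stats", "utils", "tools")

# One row per package with the installed version as a string
versions <- data.frame(
  package = pkgs,
  version = vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1),
    USE.NAMES = FALSE
  )
)

versions

# Optionally save as a tsv-file to keep alongside your scripts:
# write.table(versions, "package_versions.tsv", sep = "\t",
#             quote = FALSE, row.names = FALSE)
```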

Uri Simonsohn points out in his blog post a change in the behaviour of dplyr::distinct(). Before 2016-06-24, all columns were kept, but after that date only the one indicated was kept in the resulting data-frame. That kind of dramatic change is, in my experience, not common but it can happen. By keeping track of package versions, you can mitigate the problem.

Image from https://datacolada.org/95

There are several approaches to loading specific versions of packages, one popular option being groundhog::groundhog.library() (read more about that here: https://groundhogr.com ). I've had some trouble with groundhog, so I'm currently workshopping a much less sophisticated function that takes a tsv-table of packages and their versions and loops through uninstalling, installing etc.

There are also different approaches to switching between versions of R itself, I'm currently working with Rswitch and it's going well.

Setting that aside for now, regardless of how you install and load packages it's good if in your publication you credit and indicate versions of packages. To that end, I've written

SH.misc::credit_packages()

This function takes a vector of file-paths (the R-scripts you're using), finds all the packages you're using, creates a table of versions (can be printed as a LaTeX table with xtable::xtable() and/or as a plain tsv-file), a bibTeX file of all citations (can be imported to reference managers like Zotero) and optionally a tex-file that can be imported directly into a LaTeX document to list all of the references.

The function also reports which packages are used the most, which script uses the most different functions and lets you know if there are packages which are loaded but not used.

Caveats: in cases where a function name exists in more than one package and you don't use :: the function can get confused. Also, it only tracks direct calls. If you load a package because it's necessary for another package, but you don't call functions in the first package directly, it won't pick up on that and therefore ignore it for the output. To this end, I've added the argument "extra_pkgs" where you can list any packages that you don't want it to miss. Otherwise, it'll only include in the output packages which have functions that are called directly in your scripts.

This function is accomplished using a combination of the following packages: NCmisc, xtable, utils, bib2df, dplyr, tibble, knitr, reader, magrittr and tidyr. In the case of NCmisc, I had to adapt the function list.functions.in.file a little to output each instance of a package being called. base-R functions like readLines are also used. See citations for all packages used at the end of this post.

Example output

Table of packages and versions in tsv-file

Table of packages and their versions in TeX (using R-package xtable)

bibTeX file with entries for all packages used

TeX-file listing all of the packages citekeys (can be included into LaTeX with \input)

References

Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.

Cooper N (2022). NCmisc: Miscellaneous functions for creating adaptive functions and scripts. R package version 1.2.0, https://cran.r-project.org/web/packages/NCmisc/NCmisc.pdf.

Cooper N (2017). reader: Suite of Functions to Flexibly Read Data from Files. R package version 1.0.6, https://CRAN.R-project.org/package=reader.

Dahl D, Scott D, Roosen C, Magnusson A, Swinton J (2019). xtable: Export Tables to LaTeX or HTML. R package version 1.8-4, https://CRAN.R-project.org/package=xtable.

Müller K, Wickham H (2023). tibble: Simple Data Frames. R package version 3.2.1, https://CRAN.R-project.org/package=tibble.

Ottolinger P (2019). bib2df: Parse a BibTeX File to a Data Frame. R package version 1.1.1, https://CRAN.R-project.org/package=bib2df.

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.

Wickham H, Vaughan D, Girlich M (2023). tidyr: Tidy Messy Data. R package version 1.3.0, https://CRAN.R-project.org/package=tidyr.

Xie Y (2023). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.43, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963.

Xie Y (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595.

#rstats #r #tidyverse

dimensionality reduction

This post is not technically about R, it's a more general note to remind myself of the very basics of different approaches to reducing multidimensional data to various latent dimensions.

Principal Components Analysis (PCA)

PCA takes a table of observations (rows) described by a set of continuous numeric variables (columns). PCA needs a complete matrix, no missing data. If you've got some missing data, you can consider pruning observations and variables to remove it, or imputing it using an appropriate method.

There is a variant of PCA, nipalsPca ( https://rdrr.io/bioc/pcaMethods/man/nipalsPca.html ), which can handle small amounts of missing data with an iterative approach that estimates the principal components by extracting them one at a time.

PCA produces new dimensions, called components. PCA also tells you how much of the variance in the data each component explains, with the first component explaining the most and so on. You can use a nongraphical Cattell's scree test to find the optimal number of components to explain your data.

PCA preserves covariance of data.

Besides the reduced dimensions, PCA will also tell you which variables in your original data load onto each dimension, which helps you interpret what the new latent dimensions mean. MCA (discussed later) does the same.

The best intro to PCA I know is by the brilliant Julia Silge: https://juliasilge.com/blog/stack-overflow-pca/

There exists phylogenetic PCA, which factors in relatedness of data-points as it finds latent dimensions.

Image 1: 3-dimensional scatterplot of first three principal components in some structural linguistic data (Sahul-project, Reesink, G., & Dunn, M. (2012). Systematic typological comparison as a tool for investigating language history.). Code: https://github.com/HedvigS/personal-cookbook/blob/main/R/example_data/PCA_RGB_plot.R

Image 2: plots of the top-40 variables in the original data that load onto the first and second components in Grambank data (v1). Source Skirgård, H., Haynie, H. J., Blasi, D. E., Hammarström, H., Collins, J., Latarche, J. J., ... & Gray, R. D. (2023). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances, 9(16), eadg6175. Code: https://github.com/grambank/grambank-analysed/tree/main/R_grambank/PCA

PCA preserves the covariance structure of data and should be used on numeric continuous variables.

Input: A complete table of numeric continuous variables, with no missing data – imputation may be necessary. Alternatively, you can use nipalsPCA which can handle missing data.

Binary data is probably better handled with MCA than with PCA

Ordinal data could possibly be handled with PCA, but requires thinking about whether the levels are evenly spaced or not (is the step between 1 and 2 the same distance as the step between 2 and 3?)

Output:

New dimensions (principal components), ordered by ability to explain variance in the data

Variance explained per dimension (eigenvalues)

Loadings of input variables on each component

Scores for each observation on the components
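To make the input and output concrete, here is a minimal sketch with base R's stats::prcomp() on the built-in USArrests data (four continuous numeric variables, no missing values). The choice of data set and of scaling the variables is mine, purely for illustration.

```r
# PCA on a complete numeric table; scale. = TRUE standardises the variables
pca <- stats::prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# Proportion of variance explained per component, in decreasing order
variance_explained <- eigenvalues / sum(eigenvalues)
round(variance_explained, 3)

# Loadings of the input variables on each component
pca$rotation

# Scores for each observation on the components
head(pca$x)
```

summary(pca) prints the same variance information in a compact table.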

Multidimensional Scaling (MDS)

MDS takes distances between points as input (you can calculate those distances however you like). The distance matrix of all observations needs to be complete, but the underlying data needn't be. For example, Gower-distances (a.k.a. relative Hamming) can be calculated even when there is missing data. However, that missing data should be considered and addressed carefully at that step (again, pruning or imputation can be used).

MDS is also known as Principal Coordinates Analysis (PCoA) or Torgerson–Gower scaling.

Most often, MDS refers to "Classical" MDS, however there is also

Metric multidimensional scaling (mMDS)

Non-metric multidimensional scaling (NMDS)

Generalized multidimensional scaling

PCoA is only the same as classical MDS.

Multidimensional Scaling (MDS) preserves distances between observations and “knows” nothing about the input variables used to calculate those distances.

Input: A distance or dissimilarity matrix between observations (e.g., Euclidean, Gower). The underlying data can have missing values if the distance method allows for it (e.g., Gower).

Output:

New dimensions, ordered by ability to explain variance in the distance matrix

Variance explained per dimension (eigenvalues)

Scores for each observation in the new dimensions
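Here is the matching sketch for classical MDS with base R's stats::cmdscale(), again on USArrests and with plain Euclidean distances (both are my choices for the example). With Euclidean distances, classical MDS gives the same coordinates as PCA scores up to a sign flip, which the last lines check.

```r
# Distances between observations (Euclidean, on standardised variables)
scaled <- scale(USArrests)
distances <- stats::dist(scaled)

# Classical MDS / PCoA in two dimensions, keeping the eigenvalues
mds <- stats::cmdscale(distances, k = 2, eig = TRUE)

# Scores for each observation in the new dimensions
head(mds$points)

# Share of the positive eigenvalues captured by the first two dimensions
positive_eig <- mds$eig[mds$eig > 0]
round(positive_eig[1:2] / sum(positive_eig), 3)

# Comparison with PCA scores on the same data: correlation of +1 or -1
pca_scores <- stats::prcomp(scaled)$x
first_dim_agreement <- abs(stats::cor(mds$points[, 1], pca_scores[, 1]))
first_dim_agreement
```

Note that there are no loadings in the output: MDS only sees the distances, not the variables behind them.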

Other

Other techniques that I know less about but will list for reference. May be updated in future. Last update 2025-10-01.

Multiple Correspondence Analysis (MCA)

MCA is similar to PCA, but instead of continuous data it takes categorical data without order (colors, brands etc) or binary data.

Input: A complete table of categorical or binary variables. If not binary, variables are turned into binary via one-hot encoding. No missing values allowed, so imputation may be necessary.

Output:

New dimensions (factors/components), ordered by ability to explain variance in the data

Variance explained per dimension (eigenvalues)

Loadings of variables on each dimension (technically of the one-hot encoded variables)

Scores for each observation on the new dimensions

Factor Analysis of Mixed Data (FAMD)

FAMD is a combination of PCA and MCA; it can take both categorical and continuous data. As with PCA, you can use a nongraphical Cattell's scree test to find the optimal number of components to explain your data. Used for example in: Kalyan, S., & Donohue, M. (2023). The Dimensions of Morphosyntactic Variation: Whorf, Greenberg and Nichols were right. Linguistic Typology at the Crossroads, 3(2), 132-190.

Input: A complete table of mixed variables (both numeric continuous and categorical). No missing values – imputation may be necessary.

Output:

New dimensions (components), ordered by ability to explain variance in the data

Variance explained per dimension (eigenvalues)

Loadings for numeric variables and categorical variables (technically the one-hot encoded versions)

Scores for observations on each dimension

t-distributed stochastic neighbor embedding (t-SNE)

t-SNE shows clusters of data. It takes as its input the observations and variables, not their distances (i.e. unlike MDS). t-SNE relies on the researcher defining a "perplexity" value which has to do with how many other observations each point is compared to. There is no a priori way of choosing a perplexity value based on the data, the researcher has to choose it themselves. It is possible that t-SNE outputs clusters that are not "real" due to an ill-chosen perplexity value. The size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. If you use t-SNE, you gotta figure out a principled way of defining perplexity.

Image 3: Same data, different perplexity measurements in t-SNE. Source: https://www.scdiscoveries.com/blog/knowledge/what-is-t-sne-plot/

Uniform Manifold Approximation and Projection (UMAP)

UMAP is similar to t-SNE. It also features a parameter that needs hand-tuning, the number of neighbours.

I am sceptical of both t-SNE and UMAP since they require humans a priori to set a variable for finding dimensions and clusters, which opens it up for "hacking" (setting a value to find what you want to find).

There has been some recent progress made. If you absolutely want to use t-SNE and UMAP, check out: Xia, Lucy, Christy Lee, and Jingyi Jessica Li (2024) "Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters." Nature Communications 15.1

Summary of PCA, MCA, FAMD and MDS input and output

#r #rstats

updated R-function for spatial covariance

When dealing with fixed spatial covariance in R, don't give longitudes and latitudes to the function geoR::varcov.spatial. Instead, use the function varcov.spatial.3D from the package rgrambank or give the geoR::varcov.spatial distances.

This is a problem specifically when you have a table of observations with longitudes and latitudes and you ask geoR::varcov.spatial to make a spatial covariance matrix with some specified decay. This problem doesn't exist if you instead choose to give the function distances that you've already calculated, then all works as it should.

The problem with giving longitude and latitude to geoR::varcov.spatial is that it doesn't calculate distances correctly before it computes the covariance. The function assumes the world is a flat rectangle with UK in the center instead of a globe with no surface center. This means that Fiji is extremely far from Samoa even though we know they're next to each other, you'd have to go all around the earth because you're on the edge of the rectangular map.
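The size of the mismatch is easy to see with a few lines of base R. The sketch below compares a "flat rectangle" distance in degrees with a great-circle distance from the haversine formula. The helper haversine_km() is written just for this illustration (it is not taken from geoR or rgrambank) and the coordinates for Suva (Fiji) and Apia (Samoa) are approximate.

```r
# Great-circle distance in km between two lon/lat points (haversine formula)
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate coordinates, on either side of the 180th meridian
suva <- c(lon = 178.44, lat = -18.14)  # Fiji
apia <- c(lon = -171.76, lat = -13.83) # Samoa

# Flat-map distance in degrees: treats the world as a rectangle
flat_degrees <- sqrt(
  (suva[["lon"]] - apia[["lon"]])^2 + (suva[["lat"]] - apia[["lat"]])^2
)

# Distance on a globe, in km
globe_km <- haversine_km(suva[["lon"]], suva[["lat"]], apia[["lon"]], apia[["lat"]])

flat_degrees
globe_km
```

On the flat map the two capitals are about 350 degrees of longitude apart, nearly the full width of the map, while the great-circle distance comes out at roughly 1,100 to 1,200 km.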

Here is more on the problem itself:

For geographic coordinates, the most common way to specify them scientifically is with Longitude and Latitude (first proposed in Greece befo

Like I said earlier, if you give geoR::varcov.spatial distances instead, you're golden. Here's more on how to do that:

Sometimes giving data to a function is a bit like trying to feed Ryan Gosling cereal, he wants it but you're not yet sure how to give it to

However, if you don't want to go through that rigamarole you can give longitudes and latitudes to my function rgrambank::varcov.spatial.3D instead. The package is on GitHub, and the repo also includes example scripts that showcase how each function works:

Contribute to HedvigS/rgrambank development by creating an account on GitHub.

Over and out, good luck!

#rstats #r #phylogenetics #linguistictypology

D-estimate test of phylo signal - sanity checks

Just a brief message about a problem with Fritz and Purvis' test of phylogenetic signal in binary traits - D-estimates.

When the distribution of tip values is highly skewed (e.g. 5% one way, 95% the other) - the assumptions of the test break down.

I've written about this in a recent paper: https://www.jbe-platform.com/content/journals/10.1075/dia.22022.ski More details are in Supplementary Material I. The Supplementary Material isn't up yet but will be soon. **EDIT**: Supplementary material here: https://doi.org/10.1075/dia.22022.ski.additional

I've also written a wrapper function in R to caper::phylo.d() that runs some sanity checks and if you clear those you proceed to the regular function. It's in a package on GitHub that is a work in progress, please lodge issues via GitHub if you find any.

library(remotes)
library(caper)
remotes::install_github("HedvigS/SH.misc")
SH.misc::phylo.d_wrapper()

Old blogpost that roughly explains the problem https://hedvigsr.tumblr.com/post/702460017373216768/d-estimates-going-nuts-when-values-are-skewed

If you're using this test for something then you are more than welcome to use the wrapper function to check if this problem exists for your project.

#rstats #r #phylogenetics

All tibbles are also data-frames

There's no need to worry too much about tibbles, they're basically just data-frames with some added bells and whistles. All tibbles are also data.frames and can be turned into purely data-frames with as.data.frame().

One of the snazzy things that I find useful is that once you've run group_by() it'll remember the row groups even if you stack further functions.

But, overall if you've got a workflow with data.frames you don't need to change much if some end up also being tibbles. If you get annoyed by tibbles being a little bit more finicky and demanding more from you, you can always run as.data.frame().
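Here's a small sketch of both points, assuming dplyr is installed. The column names family and value are made up for illustration. It checks that a tibble passes as a data frame, that the grouping survives a further step, and that as.data.frame() strips it all back down.

```r
library(dplyr)

# toy data, the column names are hypothetical
tb <- tibble(family = c("a", "a", "b"), value = c(1, 2, 3))

# a tibble is a data frame with extra classes on top
is_df <- is.data.frame(tb)
tb_classes <- class(tb)

# the grouping set by group_by() is still there after mutate()
grouped <- tb %>%
  group_by(family) %>%
  mutate(family_mean = mean(value))
remembered_groups <- group_vars(grouped)

# as.data.frame() drops the tibble and grouping classes
plain <- as.data.frame(grouped)
plain_classes <- class(plain)

print(is_df)
print(tb_classes)
print(remembered_groups)
print(plain_classes)
```

If a function from another package gets confused by a tibble, passing it plain instead is usually enough.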

#rstats #tidyverse

giving geoR::varcov.spatial() distances instead of coordinates - how to feed the baby

Sometimes giving data to a function is a bit like trying to feed Ryan Gosling cereal, he wants it but you're not yet sure how to give it to him. Some trial and erroring, looking at source code and a good old mind-pause can often do the trick.

I recently had this with geoR::varcov.spatial() and its argument "dists.lowertri" and here is the solution.

As I recently learned, stats::dist() doesn't calculate geographic distances well. It messes up distances over the antimeridian, and it doesn't take into account the curvature of the earth.

Now, if you're using BRMS, INLA or some other regression model where you want to include a spatial variance-covariance matrix as a predictor you may turn to the package geoR and their varcov.spatial() function. However, if you give this function the coordinates of your data points it will run stats::dist() on them. This is not desirable, because stats::dist() is not good at geographic distances.

What you wanna do instead is calculate distances in a good way and then feed geoR::varcov.spatial() the argument dists.lowertri instead of coords. The function takes either: if you give it coordinates, it'll calculate a distance matrix with stats::dist() and proceed from there. If you give it distances, it'll take those instead and not calculate its own.

We can see that this is the case in the first couple of lines of the source code for varcov.spatial(), as there are several sanity checks for one of the arguments coords or dists.lowertri being specified (but not both).

Now, the documentation says that the argument dists.lowertri should be "a vector with the lower triangle of the matrix of distances between pairs of data points". This was at first a bit odd to me, typically distances are in a pairwise matrix (observations as rows and columns, cell values as distances). I was a bit stumped at how it expected this in a flat vector form. Again I turned to the source code, and found that if coords are specified, this is what happens:

dists.lowertri <- as.vector(dist(coords))

I had a look at this, and it's a numeric vector of the lower triangle of a distance matrix (diagonal excluded). If I have some coords, I can use fields::rdist.earth() in this way to feed it to varcov.spatial:

rdist.earth_dists <- fields::rdist.earth(x1 = coords, x2 = coords, miles = FALSE)

rdist.earth_dists[upper.tri(rdist.earth_dists, diag = TRUE)] <- NA

dists_vector <- as.vector(rdist.earth_dists) %>% na.omit()

spatial_covar_mat <- varcov.spatial(dists.lowertri = dists_vector, cov.pars = c(1, 1.15), kappa = 2)$varcov
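If you want to convince yourself that this recipe gives the values in the same order as as.vector(dist(coords)), here is a quick base R check. It's only a sketch: the four points are made up, and plain dist() is used on both sides so that the ordering alone is being compared, not the quality of the distances.

```r
# four made-up points, just to have a small distance matrix
coords <- matrix(c(0, 0,
                   3, 4,
                   6, 8,
                   1, 1), ncol = 2, byrow = TRUE)

# what the source code does when it is given coords
reference <- as.vector(dist(coords))

# the recipe: full matrix -> blank out upper triangle and diagonal -> flatten -> drop NAs
full <- as.matrix(dist(coords))
full[upper.tri(full, diag = TRUE)] <- NA
recipe <- as.vector(na.omit(as.vector(full)))

# a shortcut that picks out the same cells in the same order
m <- as.matrix(dist(coords))
shortcut <- m[lower.tri(m)]

print(all.equal(recipe, reference))
print(all.equal(shortcut, reference))
```

Both approaches walk down the columns of the lower triangle, which is the same order dist() stores its values in. That's why the flat vector lines up with what the function expects.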

Naturally, you don't have to use fields::rdist.earth() specifically. You can use some other way of doing distances (maybe cost-distances?). However you make your distances, set the upper triangle and diagonal to missing values and do as.vector() and na.omit(). (For language cost-distances, check this resource out.)

There you are, we've fed the baby and we can move on with our lives.

P.S. If you struggle installing and running the package geoR on macOS this could be due to problems with XQuartz. If all you want is geoR::varcov.spatial(), you actually don't need XQuartz. One solution is to just copy-paste over the specific code for the function you need, that way you bypass the geoR package installation problems. This is what we did in a recent paper, see code here. Make sure to credit the developers still though. Also, note that since you copy it over at a certain point in time, you won't be caught up with bug-fixes and other improvements to the function.

P.P.S. For some more sample code on vcv:s, look here.

#rstats #r

don't use stats::dist for geographic distances! It can't handle the antimeridian correctly!

For geographic coordinates, the most common way to specify them scientifically is with Longitude and Latitude (first proposed in Greece before the birth of Christ and later standardized at a conference in the USA in the 1800s).

Longitudes run from -180 to +180, Latitudes run from -90 to +90. Longitudes center on Greenwich (England) and Latitudes on the equator. Based on the prime meridian (the Longitude running through England), the antimeridian (the opposite one running through the Pacific) and the equator, you can divide up the earth into 4 quadrants.

The antimeridian is often not depicted. If the map is centered on Greenwich, then the antimeridian is the west and east edges. Below is a Pacific-centered map showing the antimeridian.

The antimeridian in the Pacific results in everything to the west of it having positive values for Longitude (for example, Fiji is roughly at Longitude 177) and to the east, it's negative values (for example Samoa is around Longitude -171).

Now, if you want to calculate distances between points that straddle the antimeridian, you don't wanna be silly and go aaaalll the way around, you want to take the shortest path. Take for example Samoa and Fiji. I've drawn a map below which shows the antimeridian (gray line) and lines between Fiji and Samoa, and also Fiji and Santo (Vanuatu). The line between Fiji and Samoa crosses the antimeridian, whereas the line between Fiji and Santo does not. We'd expect the distances to be quite similar and for any smart calculation of these distances to not go aaaaall the way around the earth to go from Fiji to Samoa.

However, this is not what happens if we calculate distances using the base R function stats::dist(). In addition to not accounting for the curvature of the earth, stats::dist() also goes AAALL the way around the earth to get from Fiji to Samoa. This is probably bad for whatever you want to do with geographic distances.

The illustration below shows the distance on a flat 2D surface, which is what stats::dist() measures. A flat surface holds no information about the west and east edges of the map being connected, the way they are on a 3D globe.

Instead, use rdist.earth() from the R-package fields. It'll do things correctly and also account for the curvature of the earth(!). Below is some code illustrating the differences using coordinates of the languages Bislama (spoken on Santo), Fijian and Samoan.

library(tidyverse)
library(fields)
library(reshape2)

coords <- data.frame(
  Glottocode = c("fiji1242", "bisl1239", "samo1305"),
  Longitude = c(177.772, 166.890, -171.830),
  Latitude = c(-17.8148, -15.4000, -13.9200)) %>%
  column_to_rownames("Glottocode") %>%
  as.matrix()

rdist.earth_dists <- fields::rdist.earth(x1 = coords, x2 = coords, miles = FALSE) %>%
  reshape2::melt() %>%
  filter(Var1 == "fiji1242" & Var2 == "bisl1239" | Var1 == "fiji1242" & Var2 == "samo1305") %>%
  arrange(Var2)

percent <- paste0(round(rdist.earth_dists[1,3] / rdist.earth_dists[2,3] * 100, digits = 2), "%")

cat(paste0("fields::rdist.earth thinks that the distance Fijian <-> Bislama is ", percent, " of the distance Fijian <-> Samoan"))

dist_dists <- dist(coords) %>%
  as.matrix() %>%
  reshape2::melt() %>%
  filter(Var1 == "fiji1242" & Var2 == "bisl1239" | Var1 == "fiji1242" & Var2 == "samo1305") %>%
  arrange(Var2)

percent <- paste0(round(dist_dists[1,3] / dist_dists[2,3] * 100, digits = 2), "%")

cat(paste0("stats:dist thinks that the distance Fijian <-> Bislama is ", percent, " of the distance Fijian <-> Samoan"))

The output is:

fields::rdist.earth thinks that the distance Fijian <-> Bislama is 99.74% of the distance Fijian <-> Samoan

stats:dist thinks that the distance Fijian <-> Bislama is 3.19% of the distance Fijian <-> Samoan

Thanks to Ezequiel and Angela for noticing and discussions.

There are of course even more distance measurements you can use besides fields::rdist.earth(). rdist.earth() is an upgrade from stats::dist(), but you may want even more sophisticated measurements like cost-surfaces etc.

If you are using varcov.spatial() from geoR note that if you give it coordinates, it will calculate distances using stats::dist(). Instead, you can give it distances right away, with the argument dists.lowertri and calculate those yourself in a sensible way (for example with fields::rdist.earth()).
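If you don't have fields installed, the same contrast can be sketched in base R. The haversine_km() helper below is my own hypothetical stand-in for a great-circle distance, not what fields::rdist.earth() runs internally, and the earth radius of 6371 km is an assumption. The coordinates are the same three as in the example above.

```r
# coordinates from the example above (longitude, latitude in degrees)
coords <- rbind(
  fiji1242 = c(177.772, -17.8148),
  bisl1239 = c(166.890, -15.4000),
  samo1305 = c(-171.830, -13.9200)
)
colnames(coords) <- c("Longitude", "Latitude")

# hypothetical helper: great-circle distance in km with the haversine formula
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

fiji <- unname(coords["fiji1242", ])
bislama <- unname(coords["bisl1239", ])
samoa <- unname(coords["samo1305", ])

# great-circle: the two distances come out nearly equal
gc_ratio <- haversine_km(fiji[1], fiji[2], bislama[1], bislama[2]) /
  haversine_km(fiji[1], fiji[2], samoa[1], samoa[2])

# flat map: stats::dist() treats degrees as x/y and goes the long way round
flat <- as.matrix(dist(coords))
flat_ratio <- unname(flat["fiji1242", "bisl1239"] / flat["fiji1242", "samo1305"])

cat("great-circle ratio:", round(gc_ratio * 100, 2), "%\n")
cat("stats::dist ratio:", round(flat_ratio * 100, 2), "%\n")
```

The haversine formula handles the antimeridian on its own, because it only uses the sine of half the longitude difference, and that is the same whichever way round the globe you measure.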

#rstats

bib2df bug - hacky solution

bib2df is an R-package that reads in bibtex files and makes a data frame of it, where every field is a column, every entry a row.

There's a bug in this package: if there are no spaces before and after the equal sign for the field assignment, it doesn't work. It is known.

One hacky solution is to just read in the bibtex file as lines, insert spaces in all such places and then read it in with bib2df. It's not pretty, but it works and is an easy fix if you find yourself in this trouble.

Here's a script that does this:

Contribute to HedvigS/personal-cookbook development by creating an account on GitHub.

Please note that bib2df assumes that fields are like this:

author = {Hedvig}

if your fields are like this:

author = "Hedvig"

it will struggle.
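The read-lines-and-insert-spaces workaround described above can be sketched roughly like this in base R. This is not the script from the repository, just a minimal illustration: fix_bibtex_equals() is a hypothetical helper, the BibTeX entry is made up, and the regular expression assumes each field starts on its own line.

```r
# hypothetical helper: put spaces around the first "=" on lines that look like `field=value`
# (equal signs further along the line, e.g. inside a title, are left alone)
fix_bibtex_equals <- function(lines) {
  sub("^(\\s*[A-Za-z][A-Za-z0-9_-]*)\\s*=\\s*", "\\1 = ", lines, perl = TRUE)
}

# small made-up BibTeX entry written to a temporary file
bib_in <- tempfile(fileext = ".bib")
writeLines(c(
  "@book{hedvig2020,",
  "  author={Hedvig},",
  "  title ={A title with x=y in it},",
  "  year= {2020}",
  "}"
), bib_in)

# read as plain lines, fix the spacing, write back out
fixed <- fix_bibtex_equals(readLines(bib_in))
bib_out <- tempfile(fileext = ".bib")
writeLines(fixed, bib_out)

print(fixed)
# bib_out is the file you would then read in with bib2df
```

A value that wraps over several lines and happens to start a line with something like word=... would also get touched, so have a look at the result before trusting it.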

#rstats #tidyverse #bibtex #latex

D-estimates going nuts when values are skewed

D-estimates are a tool for measuring phylogenetic signal in a set of binary data. The method was proposed by Fritz and Purvis (2010) and is implemented in the R-package caper by Fritz and Orme. It's fast, it's neat and very useful. (Note: it measures phylogenetic signal not conservativeness/stability.) I've noticed something that can be an issue, and that's good to know about if you are using this method.

Caveat: I could be wrong about what is going on here, this is what I think based on my understanding.

The input is a set of binary data linked to a tree, and the output is a value - the D-estimate. If this value is 1 or higher, your data is similar to what would happen if the traits were randomly generated, and if it's 0 it's more similar to Brownian evolution. You should also look at the p-values (pval0 and pval1) that the function outputs, as these take into account sample size and tell you how similar your set is to 0 or 1 given that. Values can also be lower than 0 and higher than 1. In my experience, values lower than -7 and higher than 7 are very rare and probably require closer scrutiny.

I've noticed that sometimes the D-estimate can get much, much larger than 1 and much, much lower than 0. This occurs when the data you have is extremely skewed, for example when all but one tip is of the same value (all tips are 1 except one that is 0). I've created an example in an R-script to illustrate and I have suggestions for solutions.

The tree below has 155 tips. 154 of those have a 0 for our binary trait, and one tip has the value 1. In cases like this, the D-estimate algorithm will severely struggle. It's suggested D-estimates of 1520 once when it was run, and -21 another time. These are VERY different outcomes. What is going on?

Each time you measure D-estimate, a set of random value distributions are created and one Brownian simulation is carried out. This is why you get different D-estimates every time you run it (unless you set a random seed).

The default number of random permutations for caper::phylo.d() is 1000. In cases where there is only one tip of one state, it just so happens that the random distributions sometimes hit very close to that. The random distributions are less likely to be similar to a more complicated pattern with more tips of either state.

To illustrate this further, I used the tree above and a couple of different distributions of feature states.

only one tip of state 0, either almost directly daughter of root, smack in the middle or random position

sister pairs of same state, all other different

two random tips of same state

triplets of same states

three random tips of same state

quadruplets of same state

four random tips of same state

one larger clade of 31 tips of same state

31 random tips of same state

R-script here

You can see the distribution of these simulated variables below:

I then ran the D-estimate algorithm on this tree and these traits 8 times each and with different numbers of random permutations (1000, 20000 and 30000). This generates 408 D-estimates.

The plot below shows on the x-axis the number of features that are of the minority state (i.e. 1 tip with value 0, 2 tips with value 0 etc) and the y-axis is the D-estimate value. The three panels represent the different permutation values.

While it is true that the D-estimate values for 1 tip only in one state are still varying a lot more than those with 2 etc - this seems to be improved with an increased number of permutations. The amount of variation in the output goes down as the number of permutations goes up, it gets a better sense of what was just a chance random similarity and what is more likely.

Even when you set the number of permutations to 3000, the variation for cases where there is only one singleton tip is still pretty high (579 in my example), and the variance is also quite high for the cases of 2 and 4 tips compared to the case where a larger cluster exists.

If you want to dig deeper, there are also systematic differences if the singleton value is in a more or less direct daughter to the root ("outlier" in my code), smack in the middle or at a random position. It has to do with the way it does ancestral state reconstruction, which is through Felsenstein's contrasting algorithm (sort of like max parsimony but smarter because it cares about branch lengths).

The R-code I've uploaded doesn't require you to download any data or anything, the tree and data are all there directly in the script. It doesn't take that long to run, and when it's done it'll do a little "pling!". You can easily try it out and poke around yourself in the resulting data-frame.

Solution

If you are getting D-estimates that are varying wildly, first take a close look at your feature value distributions. Consider increasing the number of permutations to rein it in a bit.

However, there's something fundamental you need to consider if you've got very skewed data: in order for the D-measure to work, it needs groups of tips to latch onto, clumps. One tip is not a group. Two tips is better, but still a bit sus. You may want to disregard the D-estimates in such cases entirely, and separate those instances out in your data and compare and summarise them in a different way. Set them aside in one bucket and present them in some other way, and compare D-estimates of features that are more suited for this kind of analysis. Have a think about it, have a poke around - run the D-estimate many times and have a look at the variance you get.
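A first look at the feature value distribution could be as simple as the sketch below. check_binary_skew() is a hypothetical helper written for this post, not part of caper, and the cut-off of 3 minority tips is an arbitrary assumption of mine that you should adjust to your own data.

```r
# hypothetical helper: count the rarer state of a 0/1 trait and flag very skewed cases
check_binary_skew <- function(x, min_minority = 3) {
  x <- x[!is.na(x)]
  counts <- table(factor(x, levels = c(0, 1)))
  minority_n <- as.integer(min(counts))
  list(
    counts = counts,
    minority_n = minority_n,
    minority_prop = minority_n / length(x),
    ok = minority_n >= min_minority  # FALSE means: set this trait aside
  )
}

# the case from the post: 155 tips, 154 with state 0 and a single tip with state 1
trait <- c(rep(0, 154), 1)
skew <- check_binary_skew(trait)

print(skew$minority_n)
print(round(skew$minority_prop, 4))
print(skew$ok)
```

Traits where ok comes back FALSE are the ones I would put in the separate bucket rather than feed to the D-estimate.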

Good luck!

P.S. Big thanks to my wonderfully smart husband Stephen (left below) for helping me work out the mathematics of all of this. He is a kind and intelligent person.

#Rstats #R #statistics #phylogenetics #trees #caper #phylo.d

caper::comparative.data() complaining about rownames: solution

I was using caper::comparative.data() just now to prep some data for PGLS analysis. It was giving me an odd error message:

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: Warning message:
Setting row names on a tibble is deprecated.

I couldn't understand what was going on, the function doesn't even ask for rownames at all. And what's this warning from tidyverse about rownames on tibbles?

After digging into the source code for this function and having a chat with my good friend and colleague Dr Hannah Haynie, we figured it out: the data frame I was feeding it was also a tibble. The tibble warning, though only a warning, was throwing a spanner in the works for the caper function, such that the warning turned into an error.

The solution is to just turn the tibble into a plain data frame by using as.data.frame().

#rstats #caper #phylo.d

Handling command line arguments to R scripts

Do you need to specify arguments to R scripts, but sometimes need to run through it in Rstudio? Here's a solution.

You can execute R scripts from the command line by going

Rscript script.R

This is equivalent to sourcing a script inside Rstudio or inside another script, it'll run the entire script from start to finish.

There's a neat feature here, you can pass arguments to the script directly from the command line. For example, if you have a script that looks like this:

script.R:

args <- commandArgs(trailingOnly = TRUE)
argument <- args[1]
print(argument)

And you call on it in the command line like so:

Rscript script.R "tomato"

It'll pass the string "tomato" to the script, through the function commandArgs(). With trailingOnly = TRUE, this function returns a character vector of only the arguments you typed after the script name. Without it, the vector starts with the path to the R executable and R's own flags, so args[1] would not be "tomato". In this case, there's only one item and it ends up as the argument of print - so it'll be spit back to the prompt.

You can also give it more complex objects, and many at a time. You can give it the file path to a file that you want read in, or specify hyper-parameters in a statistical model. For example:

script.R:

args <- commandArgs(trailingOnly = TRUE)
print(paste(args[1], args[2]))

Command line:

Rscript script.R "tomato" "potato"

This will spit back "tomato potato": the two strings are pasted together by paste() and then printed.

However, when you're developing code you might want to try different things out without calling it from the command line. Maybe you're debugging and you're running through the code chunk wise in Rstudio to find the issue. If so, you will run into trouble because the vector args will be empty. Here's a solution:

args <- commandArgs(trailingOnly = TRUE)

if(length(args) != 0){
  argument <- args[1]
  cat(paste0("I'm being called from the command line and you gave me one argument. It was:\n", argument, "\n"))
} else {
  #if you're running this script chunkwise in Rstudio or similar instead of via command line, you'll read in the parameters this way:
  argument <- "potato"
  cat(paste0("I'm being called from inside Rstudio or from the command line without an argument. The hardcoded argument there is:\n", argument, "\n"))
}

There's an if-statement in this code, it tests if there is indeed something in the vector args. If nothing was specified in the command line, the length of args will be 0. In that case, there's a hardcoded option - "potato". If the code is run on the command line with no other specified arguments or if it's run chunk wise in Rstudio, the default value "potato" will be used. This can be handy.
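If you have more than one argument, the same idea can be wrapped in a tiny helper so each argument gets its own fallback. This is just a sketch: get_arg() is a hypothetical name of mine, and the second argument (a number of iterations) is made up for illustration.

```r
# hypothetical helper: return the argument at `position`, or a default if it wasn't given
get_arg <- function(args, position, default) {
  if (length(args) >= position) args[position] else default
}

# trailingOnly = TRUE keeps only what was typed after the script name
args <- commandArgs(trailingOnly = TRUE)

vegetable <- get_arg(args, 1, default = "potato")

# command line arguments always arrive as text, so convert numbers yourself
n_iter <- as.integer(get_arg(args, 2, default = "1000"))

cat("vegetable:", vegetable, "\n")
cat("n_iter:", n_iter, "\n")
```

Called as Rscript script.R "tomato", this would use "tomato" and fall back to 1000 for the second value. Run chunk wise in Rstudio, both defaults kick in.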

Over and out.

#rstats

don't use duplicated() for getting rid of duplicates if you want to pick randomly

You might have duplicates in your data, and you might want to get rid of them. Unless you know things about these duplicates, the best choice is probably to pick randomly.

If you use the function duplicated() to find duplicates and remove them, you will always be keeping the first record you have and removing all subsequent duplicates. This isn't random. You'll always retain the first one in the dataset, and discard all the others. Is that what you want?

If you feel comfortable with tidyverse packages, there's a great solution here using group_by() and either sample_n() or slice_sample(). I've written examples below showcasing the output using duplicated() and the tidyverse method. Each method is shown 10 times just so you can see what's going on.

library(tidyverse)

df <- tibble(movies= c("Clueless", "Get it on", "Bianca", "Bianca"), score = c(5, 4, 3, 4))

for(i in 1:10){
  df_no_dupes_1 <- df[!duplicated(df$movies), ]
  cat(paste0("With the duplicated method, I kept the score ", df_no_dupes_1[3,2], " for the movie Bianca.\n"))
}

for(i in 1:10){ df_no_dupes_2 <- df %>% group_by(movies) %>% sample_n(size = 1) #you can use slice_sample(n = 1) as well

kept <- df_no_dupes_2 %>% dplyr::filter(movies == "Bianca") %>% .[,2]

cat(paste0("With the tidyverse method, I kept score ", kept, " for the movie Bianca.\n"))

}

Finally, you may know something about your data and want to keep a particular entry, say the one with the highest value etc. You can use the weight argument in sample_n(), or you can use arrange() and then distinct().

for(i in 1:10){ df_no_dupes_3 <- df %>% arrange(desc(score)) %>% distinct(movies, .keep_all = T)

kept <- df_no_dupes_3 %>% dplyr::filter(movies == "Bianca") %>% .[,2]

cat(paste0("With the distinct method of keeping the highest score, I kept score ", kept, " for the movie Bianca.\n"))

}
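If you'd rather stay in base R, one way to get a random pick is to shuffle the rows before calling duplicated(), so that "the first one" is a random one. This is only a sketch of that idea, using the same toy movies as above but as a plain data frame.

```r
df <- data.frame(movies = c("Clueless", "Get it on", "Bianca", "Bianca"),
                 score = c(5, 4, 3, 4))

# duplicated() flags the later repeats, so the first "Bianca" row (score 3) always survives
first_kept <- df[!duplicated(df$movies), ]

# shuffle the row order first, then the surviving "Bianca" row is a random one of the two
shuffled <- df[sample(nrow(df)), ]
random_kept <- shuffled[!duplicated(shuffled$movies), ]

print(first_kept)
print(random_kept)
```

Set a seed with set.seed() beforehand if you need the random pick to be reproducible.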

#rstats

if a file doesn't exist, run this script to create it

Sometimes when you're running code you're relying on files created by other scripts. For example, in my previous post I used a script that needed certain data tables to exist as tsv-files.

There's a couple of different ways to go here: you can rely on the script always being run in a particular order and therefore the necessary things existing (using for example a makefile or an r or shell script which calls r scripts in a given order), you can just source the relevant scripts at the beginning regardless or...

... or you can use a little if statement and check if the file doesn't exist, and if so source the relevant script. That's what I do here:

autotyp_area_fn <- "output_tables/glottolog_AUTOTYP_areas.tsv"

if (!file.exists(autotyp_area_fn)) {

source("assigning_AUTOTYP_areas.R") }

autotyp_area <- read_tsv(autotyp_area_fn, col_types = cols()) %>% dplyr::select(Language_ID, AUTOTYP_area)

This code snippet first makes a variable for the expected file location, then goes to an if statement that is activated if it is true that the file does NOT exist. If the file is missing, it sources the appropriate script. If the file is already there, it does nothing.

Then it reads in the file.
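If you find yourself doing this in many scripts, the pattern can be wrapped in a small function. The sketch below is an assumption-laden illustration: ensure_file() is a hypothetical helper, and the demo uses temporary files rather than my actual AUTOTYP script. It also stops with an error if the sourced script didn't produce the file, which the snippet above doesn't do.

```r
# hypothetical helper: source `script` only if `path` doesn't exist yet
ensure_file <- function(path, script) {
  if (!file.exists(path)) {
    source(script)
  }
  # fail loudly if the script ran but didn't produce the expected file
  if (!file.exists(path)) {
    stop("Sourcing ", script, " did not create ", path)
  }
  invisible(path)
}

# demo with temporary files: a tiny made-up script that writes a tsv
out_path <- tempfile(fileext = ".tsv")
script_path <- tempfile(fileext = ".R")
writeLines(
  sprintf("writeLines('Language_ID\\tAUTOTYP_area', %s)", deparse(out_path)),
  script_path
)

existed_before <- file.exists(out_path)
ensure_file(out_path, script_path)
exists_after <- file.exists(out_path)

print(existed_before)
print(exists_after)
```

Calling ensure_file() a second time does nothing, because the file is already there.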

I actually use this in combination with makefiles, because I want there to be a set order but I also want to be able to go in and spot check specific scripts without getting into a hassle about what needed to exist.

This is not a super-user sophisticated thing, but it improves my life somewhat. Maybe it'll improve yours too?

#rstats
