This post is not technically about R; it's a more general note to remind myself of the very basics of different approaches to reducing multidimensional data to a smaller set of latent dimensions.
Principal Components Analysis (PCA)
PCA takes a table of observations (rows) described by any set of continuous numeric variables (columns). PCA needs a complete matrix, with no missing data. If you've got some missing data, you can either prune observations and variables to remove it, or impute it using an appropriate method.
There is a variant of PCA, NIPALS PCA (nipalsPca in the pcaMethods package: https://rdrr.io/bioc/pcaMethods/man/nipalsPca.html), which can handle small amounts of missing data by using an iterative approach that estimates the principal components one at a time.
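To make the "one at a time" idea concrete, here is a minimal Python/NumPy sketch of the NIPALS iteration. It is not the pcaMethods implementation: the function name nipals_pca, its parameters and the convergence rule are my own assumptions, and it assumes every row and every column has at least one observed value.

```python
import numpy as np

def nipals_pca(X, n_components=2, max_iter=500, tol=1e-10):
    """Extract components one at a time, skipping NaN cells in every sum."""
    X = np.array(X, dtype=float)
    X = X - np.nanmean(X, axis=0)            # centre each column on its observed mean
    observed = ~np.isnan(X)
    R = np.where(observed, X, 0.0)           # residuals; missing cells contribute 0
    scores, loadings = [], []
    for k in range(n_components):
        # Start from the column with the largest observed sum of squares
        t = R[:, np.argmax((R ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            # Regress each column on t, using only rows observed in that column
            p = (R.T @ t) / (observed.T @ (t ** 2))
            p /= np.linalg.norm(p)
            # Regress each row on p, using only columns observed in that row
            t_new = (R @ p) / (observed @ (p ** 2))
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        # Deflate: remove this component before estimating the next one
        R = np.where(observed, R - np.outer(t, p), 0.0)
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings)

# Hypothetical data: 30 observations, 4 variables, two cells knocked out
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_missing = X.copy()
X_missing[0, 1] = np.nan
X_missing[5, 2] = np.nan
scores, loadings = nipals_pca(X_missing, n_components=2)
```

On complete data this should land on the same first component as an ordinary SVD-based PCA (up to sign), which is a handy sanity check.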
PCA produces new dimensions, called components. PCA also tells you how much of the variance in the data each component explains, with the first component explaining the most and so on. You can use a nongraphical Cattell’s scree test to find the optimal number of components to explain your data.
PCA preserves covariance of data.
Besides the reduced dimensions, PCA will also tell you which variables in your original data load onto each dimension, which helps you interpret what the new latent dimensions mean. MCA (discussed later) does the same.
The best intro to PCA I know is by the brilliant Julia Silge: https://juliasilge.com/blog/stack-overflow-pca/
There exists phylogenetic PCA, which factors in relatedness of data-points as it finds latent dimensions.
Image 1: 3-dimensional scatterplot of first three principal components in some structural linguistic data (Sahul-project, Reesink, G., & Dunn, M. (2012). Systematic typological comparison as a tool for investigating language history.). Code: https://github.com/HedvigS/personal-cookbook/blob/main/R/example_data/PCA_RGB_plot.R
Image 2: plots of the top-40 variables in the original data that load onto the first and second components in Grambank data (v1). Source Skirgård, H., Haynie, H. J., Blasi, D. E., Hammarström, H., Collins, J., Latarche, J. J., ... & Gray, R. D. (2023). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances, 9(16), eadg6175. Code: https://github.com/grambank/grambank-analysed/tree/main/R_grambank/PCA
PCA preserves the covariance structure of data and should be used on numeric continuous variables.
Input:
A complete table of numeric continuous variables, with no missing data – imputation may be necessary. Alternatively, you can use nipalsPCA which can handle missing data.
Binary data is probably better handled with MCA than with PCA
Ordinal data could possibly be handled with PCA, but that requires thinking about whether the levels are evenly spaced or not (is the step between 1 and 2 the same distance as the step between 2 and 3?)
Output:
New dimensions (principal components), ordered by ability to explain variance in the data
Variance explained per dimension (eigenvalues)
Loadings of input variables on each component
Scores for each observation on the components
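The inputs and outputs listed above map onto a few lines of linear algebra. Below is a minimal Python/NumPy sketch, assuming PCA via singular value decomposition of the centred (and, by default, scaled) table. The function name pca, the scale argument and the toy numbers are made up for illustration.

```python
import numpy as np

def pca(X, scale=True):
    """PCA of a complete numeric table (rows = observations, columns = variables)."""
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("PCA needs a complete matrix: prune or impute first")
    Z = X - X.mean(axis=0)
    if scale:
        Z = Z / X.std(axis=0, ddof=1)        # unit variance: PCA on the correlation matrix
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s ** 2 / (len(X) - 1)      # variance carried by each component
    return {
        "eigenvalues": eigenvalues,
        "explained": eigenvalues / eigenvalues.sum(),
        # Eigenvectors: one row per input variable, one column per component
        # (some packages additionally scale these by sqrt(eigenvalue))
        "loadings": Vt.T,
        "scores": U * s,                     # one row per observation
    }

# Hypothetical toy data: 6 observations, 3 continuous variables
X = np.array([[2.5, 2.4, 1.0], [0.5, 0.7, 3.1], [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.8], [3.1, 3.0, 0.4], [2.3, 2.7, 1.2]])
result = pca(X)
print(result["explained"].round(3))          # share of variance per component
print(result["loadings"][:, 0].round(3))     # how each variable loads on component 1
```

Whether to scale the variables is a judgment call: scaling stops variables measured in large units from dominating the first component.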
Multidimensional Scaling (MDS)
MDS takes distances between points as input (you can calculate those distances however you like). The distance matrix of all observations needs to be complete, but the underlying data needn't be. For example, Gower distances (a.k.a. relative Hamming) can be calculated even when there is missing data. However, that missing data should be considered and addressed carefully at that step (again, pruning or imputation can be used).
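As an illustration of how a distance can stay defined despite gaps, here is a minimal Python/NumPy sketch for binary or categorical data only, assuming categories are coded as numbers and missing cells as NaN. The function name gower_categorical is hypothetical, and the full Gower distance also has a rule for numeric variables that is left out here.

```python
import numpy as np

def gower_categorical(X):
    """Pairwise share of mismatching variables, over variables observed in both rows."""
    X = np.asarray(X, dtype=float)           # categories coded as numbers, NaN = missing
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not both.any():
                D[i, j] = D[j, i] = np.nan   # no overlap at all: distance is undefined
            else:
                D[i, j] = D[j, i] = np.mean(X[i, both] != X[j, both])
    return D

# Hypothetical toy data: 3 observations, 4 binary variables, 2 missing cells
X = np.array([[1, 0, 1, np.nan],
              [1, 1, np.nan, 0],
              [0, 1, 1, 0]])
D = gower_categorical(X)
print(D.round(3))
```

Note that each pair is compared on a different subset of variables, which is exactly why the missing data deserves attention at this step: a pair with very little overlap gets a distance based on very little evidence.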
Classical MDS is also known as Principal Coordinates Analysis (PCoA) or Torgerson–Gower scaling.
Most often, MDS refers to "classical" MDS; however, there are also other variants:
Metric multidimensional scaling (mMDS)
Non-metric multidimensional scaling (NMDS)
Generalized multidimensional scaling
PCoA is equivalent only to classical MDS, not to these other variants.
Multidimensional Scaling (MDS) preserves distances between observations and “knows” nothing about the input variables used to calculate those distances.
Input:
A distance or dissimilarity matrix between observations (e.g., Euclidean, Gower). The underlying data can have missing values if the distance method allows for it (e.g., Gower).
Output:
New dimensions, ordered by ability to explain variance in the distance matrix
Variance explained per dimension (eigenvalues)
Scores for each observation in the new dimensions
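Classical MDS (PCoA) can be sketched in Python/NumPy as an eigendecomposition of the double-centred squared distance matrix. This is a minimal sketch under that textbook formulation; the function name classical_mds and the toy points are made up for illustration.

```python
import numpy as np

def classical_mds(D, n_dims=2):
    """Classical MDS / PCoA from a complete distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    order = np.argsort(eigenvalues)[::-1]        # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Negative eigenvalues signal non-Euclidean distances; clip them for the scores
    keep = eigenvalues[:n_dims].clip(min=0)
    scores = eigenvectors[:, :n_dims] * np.sqrt(keep)
    explained = keep / eigenvalues[eigenvalues > 0].sum()
    return scores, eigenvalues, explained

# Hypothetical toy example: distances between 5 points that really live in 2D
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
scores, eigenvalues, explained = classical_mds(D, n_dims=2)
print(explained.round(3))
```

The function only ever sees D, never the variables behind it, so there is nothing like loadings in the output.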
Other techniques that I know less about but will list for reference. May be updated in future. Last update 2025-10-01.
Multiple Correspondence Analysis (MCA)
MCA is similar to PCA, but instead of continuous data it takes unordered categorical data (colors, brands, etc.) or binary data.
Input:
A complete table of categorical or binary variables. If not binary, variables are turned into binary via one-hot encoding. No missing values allowed, so imputation may be necessary.
Output:
New dimensions (factors/components), ordered by ability to explain variance in the data
Variance explained per dimension (eigenvalues)
Loadings of variables on each dimension (technically of the one-hot encoded variables)
Scores for each observation on the new dimensions
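The one-hot step and the decomposition can be sketched in Python/NumPy as follows, assuming MCA computed as correspondence analysis of the one-hot (indicator) matrix. The function names one_hot and mca, the var0=... labels and the toy table are made up for illustration, and dedicated packages apply further corrections that are not shown.

```python
import numpy as np

def one_hot(table):
    """Turn each categorical column into one 0/1 column per level."""
    table = np.asarray(table)
    blocks, labels = [], []
    for j in range(table.shape[1]):
        levels, codes = np.unique(table[:, j], return_inverse=True)
        blocks.append(np.eye(len(levels))[codes])
        labels += [f"var{j}={level}" for level in levels]
    return np.hstack(blocks), labels

def mca(table, n_dims=2):
    """MCA as correspondence analysis of the indicator matrix."""
    Z, labels = one_hot(table)
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    eigenvalues = s ** 2                                 # inertia per dimension
    row_scores = (U * s)[:, :n_dims] / np.sqrt(r)[:, None]
    level_coords = (Vt.T * s)[:, :n_dims] / np.sqrt(c)[:, None]
    return row_scores, level_coords, eigenvalues, labels

# Hypothetical toy data: 6 observations, 3 unordered categorical variables
table = [("red", "yes", "a"), ("blue", "no", "a"), ("green", "yes", "b"),
         ("red", "no", "b"), ("blue", "yes", "a"), ("red", "yes", "b")]
row_scores, level_coords, eigenvalues, labels = mca(table)
for label, coord in zip(labels, level_coords.round(2)):
    print(label, coord)
```

The printed coordinates belong to levels such as var0=red rather than to the original variables, which is what "technically of the one-hot encoded variables" means in practice.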
Factor Analysis of Mixed Data (FAMD)
FAMD is a combination of PCA and MCA: it can take both categorical and continuous data. As with PCA, you can use a nongraphical Cattell’s scree test to find the optimal number of components to explain your data. Used, for example, in: Kalyan, S., & Donohue, M. (2023). The Dimensions of Morphosyntactic Variation: Whorf, Greenberg and Nichols were right. Linguistic Typology at the Crossroads, 3(2), 132-190.
Input:
A complete table of mixed variables (both numeric continuous and categorical). No missing values – imputation may be necessary.
Output:
New dimensions (components), ordered by ability to explain variance in the data
Variance explained per dimension (eigenvalues)
Loadings for numeric variables and categorical variables (technically the one-hot encoded versions)
Scores for observations on each dimension
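One common way to describe FAMD is as a PCA on a combined table in which the continuous variables are standardised and the one-hot columns are weighted by one over the square root of their frequency. The Python/NumPy sketch below assumes that formulation; the function name famd and the toy data are made up for illustration, and it is not a substitute for a dedicated package.

```python
import numpy as np

def famd(numeric, categorical, n_dims=2):
    """FAMD sketch: standardised numeric columns + weighted one-hot columns, then SVD."""
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    # Continuous part: centre and scale to unit variance, as in scaled PCA
    Zn = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)
    # Categorical part: one-hot encode every variable
    blocks = []
    for j in range(categorical.shape[1]):
        levels, codes = np.unique(categorical[:, j], return_inverse=True)
        blocks.append(np.eye(len(levels))[codes])
    G = np.hstack(blocks)
    p = G.mean(axis=0)                       # frequency of each level
    Zc = (G - p) / np.sqrt(p)                # rare levels get more weight, as in MCA
    Z = np.hstack([Zn, Zc])
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s ** 2 / len(Z)
    scores = (U * s)[:, :n_dims]
    return scores, eigenvalues, eigenvalues / eigenvalues.sum()

# Hypothetical toy data: 6 observations, 2 numeric and 2 categorical variables
numeric = [[1.0, 10.0], [2.0, 8.0], [3.0, 9.0], [4.0, 3.0], [5.0, 2.0], [6.0, 1.0]]
categorical = [("red", "yes"), ("blue", "no"), ("green", "yes"),
               ("red", "no"), ("blue", "yes"), ("red", "yes")]
scores, eigenvalues, explained = famd(numeric, categorical)
print(explained.round(3))
```

With this weighting, each numeric variable contributes a variance of 1 and each categorical variable contributes its number of levels minus 1, so neither type of variable swamps the other by construction.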
t-distributed stochastic neighbor embedding (t-SNE)
t-SNE shows clusters of data. It takes as its input the observations and variables, not their distances (i.e. unlike MDS). t-SNE relies on the researcher defining a "perplexity" value, which has to do with how many other observations each point is compared to. There is no a priori way of choosing a perplexity value based on the data; the researcher has to choose it themselves. It is possible that t-SNE outputs clusters that are not "real" due to an ill-chosen perplexity value. The size of clusters produced by t-SNE is not informative, and neither is the distance between clusters.
If you use t-SNE, you gotta figure out a principled way of defining perplexity.
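To get a feel for what perplexity measures, here is a minimal Python/NumPy sketch of the quantity itself: 2 raised to the entropy of the Gaussian neighbour distribution around one point. In t-SNE, a kernel width is tuned for every point until this quantity matches the perplexity you asked for. The function name perplexity_of_point, the sigma argument and the toy points are made up for illustration; this is not a t-SNE implementation.

```python
import numpy as np

def perplexity_of_point(X, i, sigma):
    """Perplexity of the Gaussian neighbour distribution around observation i."""
    X = np.asarray(X, dtype=float)
    sq_dist = ((X - X[i]) ** 2).sum(axis=1)
    logits = -sq_dist / (2 * sigma ** 2)
    logits[i] = -np.inf                      # a point is not its own neighbour
    p = np.exp(logits - logits.max())        # subtract the max for numerical stability
    p /= p.sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return 2 ** entropy

# Hypothetical toy data: point 0 has one close neighbour and two distant ones
line = np.array([[0.0], [0.1], [5.0], [9.0]])
for sigma in (0.5, 5.0, 50.0):
    print(sigma, perplexity_of_point(line, 0, sigma))
```

A narrow kernel gives a perplexity near 1 (only the closest neighbour counts) and a wide kernel pushes it towards the number of other points, which is why perplexity is often read as an effective number of neighbours.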
Image 3: Same data, different perplexity measurements in t-SNE. Source: https://www.scdiscoveries.com/blog/knowledge/what-is-t-sne-plot/
Uniform Manifold Approximation and Projection (UMAP)
UMAP is similar to t-SNE. It also features a parameter that needs hand-tuning, the number of neighbours.
I am sceptical of both t-SNE and UMAP since they require a human to set a parameter a priori for finding dimensions and clusters, which opens them up to "hacking" (setting a value to find what you want to find).
There has been some recent progress. If you absolutely want to use t-SNE or UMAP, check out: Xia, Lucy, Christy Lee, and Jingyi Jessica Li (2024) "Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters." Nature Communications 15.1
Summary of PCA, MCA, FAMD and MDS input and output