22  Multidimensional scaling

Multidimensional scaling (MDS) is another approach to ordination. The purpose is similar to the goals of PCA, but the methods are a bit different. With MDS, the starting point is a similarity, or distance matrix, providing measurements between all pairs of data. Many different functions can be used to compute this similarity matrix. Commonly distance functions are used. If quantitative data have very skewed distributions, or there is no good way to interpret the distance between points, the data may be converted to ranks (1, 2, 3, …) before the similarity measure is computed; then the method is called non-metric multidimensional scaling (NMDS) in recognition of the fact that the information being presented is not related to a distance.

22.0.1 Cities on a map

An easy to understand example of MDS starts with a matrix giving the distance between each pair of cities in a set. The MDS visualization then scatters these points across the plane, reconstructing the geographic separation of the points.

The dataset below is built from a database of world cities, selected to have only cities of more than 5 million plus some cities in Canada. I included a maximum of 4 cities per country so that no one region of the globe was too strongly concentrated in points. You can get the data from simplemaps and make your own subset. I’ll plot the data twice: once just as points in the plane, and once with a map of the major landforms of earth as a background.

cities <- read_csv("static/selected_cities.csv")
Rows: 61 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): city, city_ascii, country, iso2, iso3, admin_name, capital
dbl (4): lat, lng, population, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cities, aes(x=lng, y=lat)) + 
  geom_point() +
  geom_text_repel(aes(label=city), size=2) 

Now we calculate the distance between all pairs of cities. I’ve shown two ways to do this – using great circle distance along the surface of the Earth, and using a distance that projects the surface of the Earth onto a plane first. If the goal is to reconstruct the map shown above, use this projected distance. If you would like instead to show great-circle distances between cities in the plane, use the geodesic distance.

city_distance <- dist(cities |> dplyr::select(lng, lat) |> as.matrix())
  # distance in equirectangular projection
  # https://en.wikipedia.org/wiki/Equirectangular_projection

Perform MDS (principal coordinates analysis). The direction of the two main axes could be reversed relative to the original map; I’ve reversed the x and y axes to match our customary view of the world.

mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) |>
  ggplot(aes(x = V1, y = V2)) + 
  geom_point() + 
  geom_text_repel(aes(label = city), size=2) +
  scale_x_reverse() + scale_y_reverse() +
  labs(title = "Map reconstructed from equirectangular distance matrix",
       x = "", y = "")

city_distance <- rdist.earth(cities |> dplyr::select(lng, lat) |> as.matrix(), 
                             miles = FALSE) # geodesic distance
mds1 <- cmdscale(city_distance)
colnames(mds1) <- c("V1", "V2")
bind_cols(cities, as_tibble(mds1)) |>
  ggplot(aes(x = V1, y = V2)) + 
  geom_point() + 
  geom_text_repel(aes(label = city), size=2) +
  scale_x_reverse() + scale_y_reverse() +
  labs(title = "Map reconstructed from great-circle distance matrix",
       x = "", y = "")

22.1 NMDS example

Morse code is a way of sending text messages using just two symbols: dot and dash, which was designed to be transmitted by a person clicking a key to make sounds and the receiver listening to the sounds and translating the message as it arrives.

The dataset below was created as part of an experiment to measure the rate at which patterns of sounds for one symbol were confused with sounds for a different symbol. The matrix is symmetric dissimilarity measure. Rows and columns vary across the 36 symbols tested (26 letters and 10 numeric digits). All diagonals are 0s. The off diagonal values are large if the sounds are not likely to be confuesed. Smaller dissimilarities correspond to symbols that are more likely to be confused. A small excerpt of the table is shown.

morse.dist <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-dist.txt',
                          row.names = 1, head = T)
names(morse.dist) <- rownames(morse.dist)
morse.dist[1:5,1:5] |> kable() |> row_spec(0, monospace=TRUE) |> 
  column_spec(1, monospace=TRUE) |>
  kable_styling(full_width = FALSE)
.- -... -.-. -.. .
.- 0 167 169 159 180
-... 167 0 96 79 163
-.-. 169 96 0 141 166
-.. 159 79 141 0 172
. 180 163 166 172 0

Use the function metaMDS from the package vegan to perform the NMDS ordination. The ordiplot function shows the objects from the dissimilarity matrix on a two-dimensional “ordination plot”. Points that are closer together are more likely to be confused (they are less dissimilar).

FIX ME.

NMDS <- metaMDS(morse.dist, trace=0)
NMDS

Call:
metaMDS(comm = morse.dist, trace = 0) 

global Multidimensional Scaling using monoMDS

Data:     morse.dist 
Distance: user supplied 

Dimensions: 2 
Stress:     0.1800255 
Stress type 1, weak ties
Best solution was repeated 5 times in 20 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation 
Species: scores missing
plot(NMDS, type = "text", cex = 1.5)

# ordiplot(NMDS, cex = 1.5, type = 't')
# stressplot(NMDS)

This next plot compares the distances in the original data to the distances as represented by the ordination. If the ordination represents the original data well, this will be close to a straight line. There is a point for every pairwise comparison. In this case there are 36 * 35 / 2 = 630 distances to compare. I have labeled some of the points with the largest distortion imposed by the ordination.

symbols <- rownames(NMDS$points)
NMDS[c("dist", "dhat", "iidx", "jidx")] |> as_tibble() |> 
   mutate(comparison = paste0(symbols[iidx], "/", symbols[jidx]),
          comparison2 = case_when( abs(dist-dhat) < 40 ~ "", TRUE ~ comparison)) |>
   ggplot(aes(dist, dhat)) + 
   geom_point(size=0.5, color="blue") + 
   geom_text_repel(aes(label = comparison2)) +
   theme_bw()

You can also access the points from the ordination and make the plot using ggplot. The relative position of points on the plat is the only thing that matters – any rotation or translation of the plot contains the same information.

NMDS$points |> as_tibble(rownames = "Symbol") |>
   ggplot(aes(x = MDS1, y = MDS2 )) +
   geom_text(aes(label=Symbol)) + 
   theme_bw() + 
   geom_point(size = 0.5) + 
   geom_text_repel(aes(label = Symbol))

You might want to understand the NMDS analysis in terms of some properties of the Morse Code signals. This table has the length (1-5) and the ratio of short to long (dots and dashes) signals in each symbol. The envfit function then finds the direction each of these variables increases most rapidly across the ordination plane. The summary reports the direction and the correlation between these variables and the position on the ordination plot. The arrow for “length” follows the pattern in the ordination very well, while the arrow for the ratio of short to long only accounts for about half of the variation in the ordination plot.

morse.attr <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/morsecodes-attr.txt',
                        row.names = 1, head = T)
ef <- envfit(NMDS, morse.attr)
ef

***VECTORS

                    NMDS1    NMDS2     r2 Pr(>r)    
length            0.88381 -0.46784 0.9287  0.001 ***
ratio.short.long -0.39383 -0.91918 0.5067  0.001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Permutation: free
Number of permutations: 999

The arrows show the direction of most rapid average increase of each variable.

plot(NMDS, type = "text", cex = 1.5)
species scores not available
plot(ef)

Here is a method to reproduce this plot using ggplot. First use str(ef) to examine the structure of the result from envfit. Then plot the points and arrows using ggplot.

arrows1 <- ef$vectors$arrows |> as_tibble(rownames = "code")
as_tibble(NMDS$points, rownames = "code") |>
   ggplot(aes(x = MDS1, y = MDS2, label = code)) + 
   geom_text() +
   geom_text(data = arrows1, aes(x = 25*NMDS1, y = 25*NMDS2)) +
   geom_segment(data = arrows1, aes(x = 20*NMDS1, y = 20*NMDS2, xend = 0, yend = 0)) +
     theme_bw()

22.2 Packages used

In addition to tidyverse, kableExtra and broom, this lesson uses functions from the packages:

  • fields (for computing a distance matrix)
  • vegan (for the ordination functions)
  • ggmap (for drawing maps)
  • ggrepel (for labelling points on the maps)

22.3 Further reading

  • Vegan package and the ggvegan package
  • https://jonlefcheck.net/2012/10/24/nmds-tutorial-in-r/
  • https://rpubs.com/collnellphd/GWU_nmds
  • Morse code experiment description and analysis