Global and Local Measures of Spatial Autocorrelation

This post aims to summarize the techniques available to investigate spatial autocorrelation in the social sciences, rather than to present the theory behind it. For the theory there are great books available online, like those by Anselin, LeSage, and Bivand, just to cite a few.

The techniques presented here work for a spatial polygons data frame. The difference between spatial points and spatial polygons data frames is not that big: the idea is the same, and most of what I do here can also be applied to point data.

Why do we look at spatial autocorrelation at all? Because spatial autocorrelation leads to biased results in regressions. This is why we compute Moran’s I, and why we model spatial autocorrelation explicitly when its measurement proves to be significant and non-random.

Spatial autocorrelation can be investigated globally or locally. “Globally” means that the measure you obtain refers to the dataset as a whole, whether that is a country, continent, or region. “Locally” means that you take each and every polygon into consideration and obtain a measure for each one of them.

We start by loading the data, projecting them (important for the distance-based measures: the earth is not flat, whatever they may say!), and constructing neighborhood relations (here Queen and Rook, but it could be any other). For more detail see this post on how to construct neighbor relations.

library(maptools)
library(spdep)
NC= readShapePoly(system.file("shapes/sids.shp", package="maptools")[1], IDvar="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))
nb.FOQ = poly2nb(NC, queen=TRUE, row.names=NC$FIPSNO)
nb.RK = poly2nb(NC, queen=FALSE, row.names=NC$FIPSNO)

Global measures of spatial autocorrelation

There are two main measures of global spatial autocorrelation: Moran’s I and Geary’s C. Moran’s I is the most used in my experience, but both are perfectly acceptable.
Moran’s I ranges between -1 (strong negative spatial autocorrelation with a dispersed pattern) and 1 (strong positive spatial autocorrelation with a clustered pattern) with 0 being the absence of spatial autocorrelation.
Geary’s C ranges between 0 and 2, with positive spatial autocorrelation ranging from 0 to 1 and negative spatial autocorrelation between 1 and 2.
Of course, since these are inferential measures, if the p-value is not significant we cannot exclude that the pattern is random(!)
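To make the definition concrete, here is a minimal base-R sketch of Moran’s I computed from first principles on a toy four-area example (invented data; in practice moran.test does all of this for you, plus the inference):

```r
# Moran's I by hand on four areas in a row (toy data)
x <- c(10, 12, 3, 2)                         # variable of interest
W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), 4, byrow = TRUE)  # binary rook-style weights
n <- length(x)
z <- x - mean(x)                             # deviations from the mean
I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
round(I, 3)                                  # 0.271
```

The positive I reflects the high values (10, 12) sitting next to each other, and likewise the low ones (3, 2).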

In this case Moran’s I is positive and significant: the standard deviate reported by moran.test is a z-score, and since it is positive it implies spatial clustering, so we can reject the null hypothesis of no spatial autocorrelation.

library(spdep)
nwb <- NC$NWBIR74
moran.test(nwb, listw = nb2listw(nb.RK))
##
##  Moran I test under randomisation
##
## data:  nwb
## weights: nb2listw(nb.RK)
##
## Moran I statistic standard deviate = 3.0787, p-value = 0.001039
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance
##       0.185965551      -0.010101010       0.004055701
geary.test(nwb, listw = nb2listw(nb.RK))
##
##  Geary C test under randomisation
##
## data:  nwb
## weights: nb2listw(nb.RK)
##
## Geary C statistic standard deviate = 2.0324, p-value = 0.02106
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance
##        0.83274888        1.00000000        0.00677185

If you have polygons with no neighbors, remember to specify zero.policy=TRUE.
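A sketch of what that looks like (both the weights construction and the test take the argument):

```r
# weights and test both need zero.policy when some polygons have no links
lw <- nb2listw(nb.RK, zero.policy = TRUE)
moran.test(nwb, listw = lw, zero.policy = TRUE)
```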

moran.plot(nwb, listw = nb2listw(nb.RK))

moran plot

moran.mc(nwb, listw = nb2listw(nb.RK), nsim=100)
##
##  Monte-Carlo simulation of Moran I
##
## data:  nwb
## weights: nb2listw(nb.RK)
## number of simulations + 1: 101
##
## statistic = 0.18597, observed rank = 100, p-value = 0.009901
## alternative hypothesis: greater

permutation Moran's I

plot(moran.mc(nwb, listw = nb2listw(nb.RK), nsim=100))


The same thing can be done for Geary’s C:

geary.mc(nwb, listw = nb2listw(nb.RK), nsim=100)
##
##  Monte-Carlo simulation of Geary C
##
## data:  nwb
## weights: nb2listw(nb.RK)
## number of simulations + 1: 101
##
## statistic = 0.83275, observed rank = 5, p-value = 0.0495
## alternative hypothesis: greater
plot(geary.mc(nwb, listw = nb2listw(nb.RK), nsim=100))

Local measures of spatial autocorrelation

Local Moran and Local G

locm <- localmoran(nwb, listw = nb2listw(nb.RK))
locG <- localG(nwb, listw = nb2listw(nb.RK))

Get the neighbor matrix into a listwise format with nb2listw. There are two options here: row-standardized weights (style = "W") create proportional weights when polygons have an unequal number of neighbors, balancing out observations with few neighbors; binary weights (style = "B") upweight observations with many neighbors.
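A quick way to see the difference on a tiny grid (a sketch; cell2nb builds a toy 1x3 chain of cells just for illustration):

```r
library(spdep)
nb3 <- cell2nb(1, 3)                  # three cells in a row: 1-2, 2-3
nb2listw(nb3, style = "W")$weights    # cell 2 splits its weight: 0.5 and 0.5
nb2listw(nb3, style = "B")$weights    # binary: every link counts 1
```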

library(classInt)
library(dplyr)
myvar <- NC$NWBIR74
nb <- nb.RK
# Define weight style
ws <- c("W")

# Define significance for the maps
significance <- 0.05
plot.only.significant <- TRUE

# Transform the neigh mtx into a listwise object
listw <- nb2listw(nb, style=ws)

# Create the lagged variable
lagvar <- lag.listw(listw, myvar)

# get the mean of each
m.myvar <- mean(myvar)
m.lagvar <- mean(lagvar)

The next step is to derive the quadrants and set the coloring scheme. I like to color the border of each polygon with the color of its local Moran score, regardless of its p-value, and then fill only the significant ones.

n <- length(NC)
#
vec <- c(1:n)
vec <- ifelse(locm[,5] < significance, 1,0)

# Derive quadrants
q <- c(1:n)
for (i in 1:n) {
  if (myvar[[i]]>=m.myvar & lagvar[[i]]>=m.lagvar) q[i] <- 1
  if (myvar[[i]]<m.myvar & lagvar[[i]]<m.lagvar) q[i] <- 2
  if (myvar[[i]]<m.myvar & lagvar[[i]]>=m.lagvar) q[i] <- 3
  if (myvar[[i]]>=m.myvar & lagvar[[i]]<m.lagvar) q[i] <- 4
}

# set coloring scheme
q.all <- q
colors <- c(1:n)
for (i in 1:n) {
  if (q.all[i]==1) colors[i] <- "red"
  if (q.all[i]==2) colors[i] <- "blue"
  if (q.all[i]==3) colors[i] <- "lightblue"
  if (q.all[i]==4) colors[i] <- "pink"
  if (q.all[i]==0) colors[i] <- "white"
  if (q.all[i]>4) colors[i] <- "white"
}

# Mark all non-significant regions white
locm.dt <- q*vec
colors1 <- colors
for (i in 1:n)
{
  if ( !(is.na (locm.dt[i])) )  {
  if (locm.dt[i]==0) colors1[i] <- "white"
}
}
# tag each quadrant color with its significance flag (e.g. "blue 0" = Low-Low, not significant)
colors2 <- paste(colors, vec)

blue0 <- which(colors2=="blue 0")
red0 <- which(colors2=="red 0")
lightblue0 <- which(colors2=="lightblue 0")
pink0 <- which(colors2=="pink 0")
labels=c("High-High", "High-Low", "Low-High", "Low-Low")

# plot the map
if (plot.only.significant==TRUE) plot(NC, col=colors1,border=F) else
  plot(NC, col=colors,border=F)
plot(NC[blue0,],border="blue",lwd=0.5,add=T)
plot(NC[lightblue0,],border="lightblue",add=T,lwd=0.5)
plot(NC[red0,],border="red",add=T,lwd=0.5)
plot(NC[pink0,],border="pink",add=T,lwd=0.5)
legend("bottomleft", legend = labels, fill = c("red", "pink", "lightblue", "blue"), bty = "n")

lisa map
Local G returns z-scores: high values indicate a possible local cluster of high values of the variable being analyzed, while very low values indicate a similar cluster of low values.

library(RColorBrewer)

nclassint <- 3
colpal <- brewer.pal(nclassint,"PiYG")
cat.G <- classIntervals(as.numeric(locG), nclassint, style = "jenks")
color.z <- findColours(cat.G, colpal)

plot(NC, col= color.z, border=T)

Rplot

# color only significant polygons
plot(NC, border=T)
plot(NC[which(vec==1),], col=color.z[which(vec==1)], border=T, add=T)

lisa map2

Creating neighborhood matrices for Spatial Polygons in R (updated)

One of the first steps in spatial analysis is to create a neighborhood matrix, that is to say, to define a relationship/connection between each and (ideally!) every polygon. Why? Given that the premise of spatial analysis is that neighboring locations are more similar than faraway ones, we need to define what “near” means: a set of neighbors for each location that captures such dependence.

There are many ways to define neighbors, and usually, they are not interchangeable, meaning that one neighborhood definition will capture spatial autocorrelation differently from another.

In R the package spdep allows you to create a neighbor matrix according to a wide range of definitions: contiguity, radial distance, graph based, and triangulation (and more). The three main and most used types of neighbors are:

A) Contiguity based of order 1 or higher (most used in social sciences)

B) Distance based

C) Graph based

Install and load the maptools and spdep libraries, then read the shapefile of North Carolina counties:

library(maptools)
library(spdep)
NC= readShapePoly(system.file("shapes/sids.shp", package="maptools")[1], IDvar="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))

A. Contiguity based relations

Contiguity-based relations are the most used in the presence of irregular polygons with varying shape and surface, since contiguity ignores distance and focuses instead on the location of an area. The function poly2nb allows you to create 2 types of contiguity-based relations:

1. First Order Queen Contiguity

FOQ contiguity defines a neighbor when at least one point on the boundary of one polygon is shared with at least one point of its neighbor (common border or corner);

nb.FOQ = poly2nb(NC, queen=TRUE, row.names=NC$FIPSNO)
#row.names refers to the unique names of each polygon
nb.FOQ
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 490
## Percentage nonzero weights: 4.9
## Average number of links: 4.9

Calling nb.FOQ prints a summary of the neighbor matrix, including the total number of areas/counties and the average number of links.
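You can also inspect it directly, e.g. per-county link counts with card and a single county’s neighbors by index (a quick sketch):

```r
card(nb.FOQ)[1:5]   # number of neighbours of the first five counties
nb.FOQ[[1]]         # indices of county 1's neighbours
summary(nb.FOQ)     # link-number distribution, least/most connected regions
```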

2. First Order Rook Contiguity

FOR contiguity does not include corners, only borders, thus comprising only polygons sharing more than one boundary point;

nb.RK = poly2nb(NC, queen=FALSE, row.names=NC$FIPSNO)
nb.RK
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 462
## Percentage nonzero weights: 4.62
## Average number of links: 4.62

NB: if there is a region without any link, there will be a message like this:
Neighbour list object:
Number of regions: 910
Number of nonzero links: 4620
Percentage nonzero weights: 0.5924405
Average number of links: 5.391209
10 regions with no links:
1014 3507 3801 8245 9018 10037 22125 30005 390299 390399

where you can identify the regions with no links (1014, 3507, …) using which(…); in R it is possible to “manually” connect them, or to change the neighbor definition so that they are included (e.g. graph- or distance-based neighbors).
Sometimes polygons that have been retouched (sounds like a blasphemy, but it happens a lot with historical maps) may not recognize shared borders. This is when manually setting up neighbors comes in handy (you can’t do that in Geoda).
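A minimal sketch of such a manual fix, as a small helper (connect is my own name, and the indices in the example are made up; spdep also offers edit.nb for doing this interactively):

```r
# symmetrically add a link between regions i and j of an nb object
# (a 0L entry marks a region with no neighbours and is dropped)
connect <- function(nb, i, j) {
  nb[[i]] <- sort(unique(c(nb[[i]][nb[[i]] != 0L], as.integer(j))))
  nb[[j]] <- sort(unique(c(nb[[j]][nb[[j]] != 0L], as.integer(i))))
  nb
}
# e.g. nb.RK <- connect(nb.RK, 14, 49)  # hypothetical island pair
```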

Contiguity

Higher-order neighbors are useful when looking at the effect of lags on spatial autocorrelation and in spatial autoregressive models like SAR with a more global spatial autocorrelation:

nb.SRC = nblag(nb.RK,2) #second order rook contiguity
nb.SRC
## [[1]]
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 490
## Percentage nonzero weights: 4.9
## Average number of links: 4.9
##
## [[2]]
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 868
## Percentage nonzero weights: 8.68
## Average number of links: 8.68
##
## attr(,"call")
## nblag(neighbours = nb.RK, maxlag = 2)

Contiguity2

B. Distance based neighbors

DBN defines a set of connections between polygons based either on (1) a defined Euclidean distance between centroids (dnearneigh) or on (2) a certain number of nearest neighbors (knn2nb, e.g. the 5 nearest);

coordNC = coordinates(NC) #get centroids coordinates
d05m = dnearneigh(coordNC, 0, 0.5, row.names=NC$FIPSNO)
nb.5NN = knn2nb(knearneigh(coordNC,k=5),row.names=NC$FIPSNO) #set the number of neighbors (here 5)
d05m
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 430
## Percentage nonzero weights: 4.3
## Average number of links: 4.3
nb.5NN
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 500
## Percentage nonzero weights: 5
## Average number of links: 5
## Non-symmetric neighbours list

A little trick: if you want information on neighbor distances, whatever the type of neighborhood may be:

distance = unlist(nbdists(nb.5NN, coordNC))
distance
##   [1] 0.3613728 0.3693554 0.3864847 0.2766561 0.5168459 0.3709748 0.2607982
##   [8] 0.3232974 0.4376632 0.2862144 0.5773310 0.3778483 0.4463538 0.2914539
## ...
## [498] 0.3407192 0.3995114 0.1838115

Distance

C. Graph based (I’ve never used them, but it’s good to know that they exist)

Delaunay triangulation tri2nb constructs neighbors through Voronoi triangles such that each centroid is a triangle node. As a consequence, DT ensures that every polygon has a neighbor, even in the presence of islands. The “problem” with this specification is that it treats our study area as if it were an island itself, without any neighbors (as if North Carolina had no Virginia or South Carolina around it)… Therefore distant points that should not be neighbors (such as Cherokee and Brunswick counties) become such;
Gabriel Graph gabrielneigh is a particular case of the DT, where two points a and b are neighbors if the circle with diameter ab contains no other point/centroid;
Sphere of Influence soi.graph: two points a and b are SOI neighbors if the circles centered on a and b, with radii equal to their respective nearest-neighbor distances, intersect twice. It is a sort of Delaunay triangulation without the longest connections;
Relative Neighbors relativeneigh is a particular case of GG: a and b are relative neighbors if the lens formed by the two circles centered on a and b with radius ab contains no other point.

delTrinb = tri2nb(coordNC, row.names=NC$FIPSNO) #Delaunay triangulation
summary(delTrinb)
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 574
## Percentage nonzero weights: 5.74
## Average number of links: 5.74
## Link number distribution:
##
##  2  3  4  5  6  7  8  9 10
##  1  2 13 29 27 22  3  1  2
## 1 least connected region:
## 37039 with 2 links
## 2 most connected regions:
## 37005 37179 with 10 links
GGnb = graph2nb(gabrielneigh(coordNC), row.names=NC$FIPSNO) #gabriel graph
summary(GGnb)
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 204
## Percentage nonzero weights: 2.04
## Average number of links: 2.04
## 20 regions with no links:
## 37109 37131 37137 37141 37145 37147 37151 37159 37161 37165 37173 37175 37179 37183 37185 37187 37189 37195 37197 37199
## Non-symmetric neighbours list
## Link number distribution:
##
##  0  1  2  3  4  5  6  7
## 20 27 16 15 13  7  1  1
## 27 least connected regions:
## 37047 37053 37055 37075 37091 37105 37107 37113 37115 37117 37119 37121 37129 37133 37135 37139 37143 37149 37153 37155 37157 37163 37167 37177 37181 37191 37193 with 1 link
## 1 most connected region:
## 37057 with 7 links
SOInb = graph2nb(soi.graph(delTrinb, coordNC), row.names=NC$FIPSNO) #sphere of influence
summary(SOInb)
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 470
## Percentage nonzero weights: 4.7
## Average number of links: 4.7
## Link number distribution:
##
##  1  2  3  4  5  6  7  9
##  1  5 12 26 30 15 10  1
## 1 least connected region:
## 37031 with 1 link
## 1 most connected region:
## 37097 with 9 links
RNnb = graph2nb(relativeneigh(coordNC), row.names=NC$FIPSNO) #relative graph
summary(RNnb)
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 133
## Percentage nonzero weights: 1.33
## Average number of links: 1.33
## 31 regions with no links:
## 37047 37053 37097 37107 37109 37115 37131 37137 37141 37143 37145 37147 37151 37155 37159 37161 37163 37165 37167 37173 37175 37179 37183 37185 37187 37189 37191 37193 37195 37197 37199
## Non-symmetric neighbours list
## Link number distribution:
##
##  0  1  2  3  4
## 31 30 18 17  4
## 30 least connected regions:
## 37009 37027 37031 37035 37037 37039 37055 37073 37075 37083 37091 37095 37105 37113 37117 37119 37121 37125 37127 37129 37133 37135 37139 37149 37153 37157 37169 37171 37177 37181 with 1 link
## 4 most connected regions:
## 37001 37003 37059 37079 with 4 links

GraphBased

What to do with all this stuff? …

compute and compare global Moran’s I
LISA maps
Variograms and correlograms
…?
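As an example of the first bullet, one can run the same test over several of the neighbor definitions built above (a sketch; zero.policy guards against the no-neighbor regions some definitions produce):

```r
# compare global Moran's I across neighbour definitions
defs <- list(queen = nb.FOQ, rook = nb.RK, knn5 = nb.5NN, delaunay = delTrinb)
sapply(defs, function(nb) {
  lw <- nb2listw(nb, zero.policy = TRUE)
  unname(moran.test(NC$NWBIR74, lw, zero.policy = TRUE)$estimate["Moran I statistic"])
})
```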

Moran plots in ggplot2

Moran plots are one of the many ways to depict spatial autocorrelation:
moran.plot(varofint, listw)
where “varofint” is the variable we are studying, “listw” a listwise neighbourhood matrix, and the function “moran.plot”, included in the spdep functionality, draws the Moran scatterplot. The same plot can be produced with the ggplot2 library. Provided that we already have our listwise matrix of neighborhood relationships listw, we first define the variable and the lagged variable under study, compute their means, and save them into a data frame (there are a lot of datasets you can find implemented in R: afcon, columbus, syracuse, just to cite a few). The purpose is to obtain something that looks like this (I have used my own *large* set of Spanish data to obtain it):

ggplot2.moranplot1

Load your data. Here I use Anselin’s (1995) data on African conflicts, afcon, shipped with spdep together with the paper.nb neighbour list:

library(spdep)
library(ggplot2)
library(grid) # for unit() in plot.margin
data(afcon)
varofint <- afcon$totcon
listw <- nb2listw(paper.nb)
varlag <- lag.listw(listw, varofint)
var.name <- "Total Conflicts"
m.varofint <- mean(varofint)
m.varlag <- mean(varlag)
and compute the local Moran statistic using localmoran:

lisa <- localmoran(varofint, listw)

and save everything into a data frame:

df <- data.frame(varofint, varlag, m.varofint, m.varlag)

use these variables to derive the four sectors "High-High" (red), "Low-Low" (blue), "Low-High" (lightblue), "High-Low" (pink):

significance <- 0.05
vec <- ifelse(lisa[,5] < significance, 1, 0)
df$sector <- 0
df$sector[df$varofint>=df$m.varofint & df$varlag>=df$m.varlag] <- 1
df$sector[df$varofint<df$m.varofint & df$varlag<df$m.varlag] <- 2
df$sector[df$varofint<df$m.varofint & df$varlag>=df$m.varlag] <- 3
df$sector[df$varofint>=df$m.varofint & df$varlag<df$m.varlag] <- 4

df$sec.data <- df$sector * vec

df$sector.col[df$sec.data==1] <- "red"
df$sector.col[df$sec.data==2] <- "blue"
df$sector.col[df$sec.data==3] <- "lightblue"
df$sector.col[df$sec.data==4] <- "pink"
df$sector.col[df$sec.data==0] <- "white"

df$sizevar <- df$varofint
df$sizevar <- pmax(df$sizevar, 0.1) # floor the point size at 0.1
df$FILL <- df$sector.col # fill is white unless significant
df$BORDER <- c("red","blue","lightblue","pink")[df$sector] # border keeps the quadrant color
to get the ggplot graph:
p <- ggplot(df, aes(x=varofint, y=varlag))+
geom_point(shape=21, aes(fill=factor(sec.data), size=sizevar), colour=df$BORDER)+
scale_fill_manual(values=c("white","red","blue","lightblue","pink"),
labels=c("p > 0.05", "High-High", "Low-Low","Low-High","High-Low"))+
scale_x_continuous(name=var.name)+
scale_y_continuous(name=paste("Lagged",var.name))+
theme(axis.line=element_line(color="black"),
axis.title.x=element_text(size=20,face="bold",vjust=0.1),
axis.title.y=element_text(size=20,face="bold",vjust=0.1),
axis.text= element_text(colour="black", size=20, angle=0,face = "plain"),
plot.margin=unit(c(0,1.5,0.5,2),"lines"),
panel.background=element_rect(fill="white",colour="black"),
panel.grid=element_line(colour="grey"),
axis.text.x  = element_text(hjust=.5, vjust=.5),
axis.text.y  = element_text(hjust=1, vjust=1),
strip.text.x  = element_text(size = 20, colour ="black", angle = 0),
plot.title= element_text(size=20))+
stat_smooth(method="lm",se=F,colour="black", size=1)+
geom_vline(xintercept=m.varofint,colour="black",linetype="longdash")+
geom_hline(yintercept=m.varlag,colour="black",linetype="longdash")+
theme(legend.background =element_rect("white"))+
theme(legend.key=element_rect("white",colour="white"),
legend.text =element_text(size=20))

Check out the interactive shiny version on pracademic

Location, location, location! Why space matters in demography and why we should care.

Read my first contribution to the Demotrends blog! And don’t forget to like Demotrends on Facebook or Twitter 🙂
Of course all graphics have been realized in R (maptools library and a bunch of others).