A match made in R: checking the order of geographical areas in shape files and in your data frames

Not every shape file is as nice as those provided in libraries. Sometimes we have to deal with historical maps that have been hand-drawn, retouched and whatnot. To work with geo-referenced data, it is essential that both the shape file and the data frame contain a variable with unique codes, covering exactly the same areas in exactly the same order.

A quick way to check if shapefile and dataframe have the same number of areas:

nrow(df) == length(shape.file$Code)
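Equal counts do not guarantee that the codes themselves agree. Here is a minimal sketch that also checks for duplicates and for codes present on only one side. It uses two small made-up vectors in place of shape.file$Code and df$Code (the names shape.codes and df.codes and all values are hypothetical):

```r
# Hypothetical stand-ins for shape.file$Code and df$Code
shape.codes <- c("15001", "15002", "15078", "15080")
df.codes    <- c("15002", "15001", "15078", "15079")

# Same number of areas on both sides?
same.length <- length(shape.codes) == length(df.codes)

# Are the codes unique within each vector?
no.dups <- anyDuplicated(shape.codes) == 0 && anyDuplicated(df.codes) == 0

# Do both sides contain exactly the same codes, whatever the order?
same.set <- setequal(shape.codes, df.codes)

# Codes that appear on one side only
only.in.shape <- setdiff(shape.codes, df.codes)
only.in.df    <- setdiff(df.codes, shape.codes)
```

In this toy case the lengths match and there are no duplicates, yet one code differs on each side, which is exactly the situation a plain count comparison would miss.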

In the shapefile, one can also select a couple of areas big enough so that they can easily be located, and plot them as “control” areas.
For instance, I want to select the area with code “15078” in the shapefile:
> which(shape.file$Code == "15078")
[1] 271

which is the area in the 271st position (in the same way, shape.file$Code[271] gives the code of area 271).

This is an easy way to locate your “control” area(s).
Ideally, your data frame should have a variable identical to the one in the shapefile: a unique code, the name of the area, or some other factor that lets you locate the area in space.

An easy way to check if both shape file and data frame have the same ordering of geographical areas is to test it:
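> # Pair each position with its code; as.vector() keeps a factor from being turned into integer level codes
> code.sh <- cbind(seq_along(shape.file$Code), as.vector(shape.file$Code))
> code.df <- cbind(seq_len(nrow(df)), as.vector(df$Code))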
[,1]  [,2]
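One way to run that test, sketched on made-up codes rather than the real files, is to compare the two code vectors position by position after converting both to character. The names shape.codes and df.codes are hypothetical stand-ins for shape.file$Code and df$Code:

```r
# Hypothetical stand-ins for shape.file$Code and df$Code
shape.codes <- c("15001", "15002", "15078")
# A factor, as older versions of read.csv returned by default
df.codes <- factor(c("15001", "15078", "15002"))

# TRUE only if both vectors hold the same codes in the same order
same.order <- identical(as.character(shape.codes), as.character(df.codes))

# Positions at which the two orderings disagree
mismatch <- which(as.character(shape.codes) != as.character(df.codes))
```

Here same.order is FALSE and mismatch points at positions 2 and 3, where the two files list the areas in a different order.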

What if it’s not?
First option: the inelegant solution
Manually change the order of the areas in a csv file so that it follows the exact order they have in the shape file. This is easy: create an ordinal index for the shapefile codes, paste it into Excel, and assign it to the data rows with a VLOOKUP function.
Second option: the smart R match
In R there is a function called match that returns a vector of the positions of first matches of the first argument in its second:
> my.match <- match(df$Code, shape.file$Code)
NB: to use match, the two code variables must contain exactly the same set of unique codes. Otherwise funny stuff happens: a code with no counterpart yields NA, and a duplicated code is only ever matched to its first occurrence. To check that everything is in its right place, you can plot the two “control” spatial polygons we chose in the beginning, using their position in the dataframe rather than in the shapefile.
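A small sketch of how match can be used to put the data frame in shapefile order, on toy data. The vector shape.codes stands in for shape.file$Code, and the data frame, its TFR column and all values are made up for illustration:

```r
# Hypothetical stand-in for shape.file$Code
shape.codes <- c("15001", "15002", "15078")

# Toy data frame whose rows are in a different order
df <- data.frame(Code = c("15078", "15001", "15002"),
                 TFR  = c(1.2, 1.5, 1.1),
                 stringsAsFactors = FALSE)

# For each data-frame row, its position in the shapefile
my.match <- match(df$Code, shape.codes)

# For each shapefile area, the data-frame row that holds it
idx <- match(shape.codes, df$Code)

# Stop early if some area has no row in the data frame
stopifnot(!anyNA(idx))

# Data frame rearranged to follow the shapefile order
df.sorted <- df[idx, ]
```

Note the direction of the two calls: my.match says where each data row sits in the shapefile, while idx (the arguments swapped) is the index needed to reorder the data frame itself.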

Game of Thrones maps in R…

The map of GOT world with rivers, roads, lakes, the Wall, and main cities:


Neighborhood relations according to Sphere of Influence pretty much coincide with roads and rivers (package spdep):


Paste some images to locate the (surviving) Stark family members, using readPNG from the png package to load them and rasterImage from base graphics to draw them:


A space-time box plot of Spain’s TFR for 910 comarcas.

The idea behind spatial analysis is that space matters and near things are more similar: a variable measured in city A is (ideally) different from the same variable measured in city B. A simple way to get a feeling for this hypothesis and to represent it is through graphical visualization, usually one or more maps.


However, when dealing with time series, maps are cumbersome, and sometimes information is lost, such as the national average or the convergence path. Box plots are a simple yet very effective way to synthesize a lot of information in one graph. The following plot depicts TFR over a 30-year period for 910 Spanish areas, relative to the yearly central value across areas (the thick black line in the middle of each box; note that geom_boxplot draws the median, not the mean).

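library(ggplot2)
# One box per year; refer to columns by name inside aes() rather than via dat$
p <- ggplot(dat, aes(x = factor(YEAR), y = TFR))
p <- p + geom_boxplot()
# Label every fifth year; values outside the y limits are dropped before the box statistics are computed
p <- p + scale_y_continuous(limits = c(0, 2.5)) + scale_x_discrete("YEAR", breaks = as.character(seq(1981, 2011, by = 5)))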