tidy datasets: ‘col_types’ to read a variable in the format you want

I have been having issues with my datasets because one variable contains both numbers and text. Instead of arguing with Excel, you can use readr (part of the tidyverse), which offers a quick and painless solution, the col_types argument:

library(readr) # read_csv() comes from readr, part of the tidyverse
dt <- read_csv("_yourdataset_.csv", col_types = cols(tricky_variable = "c")) # "c" = character

Here "c" stands for character. Other available types include date, time, number, integer, double, logical, guess (let readr guess the type) and skip.
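As a minimal sketch of how several column types can be combined in one call, the example below writes a small made-up CSV to a temporary file and reads it back. The file contents and the column names (id, when, count, extra) are hypothetical and only serve the illustration.

```r
library(readr)

# Hypothetical toy file: "id" mixes numbers and text, "extra" is not needed
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,when,count,extra",
             "001,2020-01-31,3,a",
             "B12,2020-02-29,7,b"), tmp)

dt <- read_csv(tmp, col_types = cols(
  id    = col_character(), # same as the shortcut "c"; keeps the leading zeros
  when  = col_date(),      # expects ISO 8601 dates by default
  count = col_integer(),
  extra = col_skip()       # this column is dropped while reading
))

str(dt)
```

The same specification can also be written as a compact string with one letter per column, in file order: col_types = "cDi_".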

Long to wide format with tidyr (and save it in n files)

The data come from the UN population projections (https://esa.un.org/unpd/wpp/).

library(tidyr) # load tidyr, or the tidyverse (https://www.tidyverse.org/), which is a collection of packages

setwd("/Users/...") #set your working directory

dt <- read.csv("mydataset.csv", header = TRUE) # read data

head(dt) #look at data

##   Index       Country Year Age Male_Pop Female_Pop
## 1     1 AmericanSamoa 2000   0      874        836
## 2     2 AmericanSamoa 2000   1      773        747
## 3     3 AmericanSamoa 2000   2      760        735
## 4     4 AmericanSamoa 2000   3      783        760
## 5     5 AmericanSamoa 2000   4      820        796
## 6     6 AmericanSamoa 2000   5      851        825

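The idea is to have a KEY column with the variable names and a VALUE column with the values. Since we have 2 value columns (Male_Pop and Female_Pop), we first need to gather them into 1 value column (value), with a key column (Pop_sex) recording which sex each value belongs to. We then paste Pop_sex together with Age to get keys such as Male_Pop_0, and spread those keys into columns.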

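# get it into the right format for "spread"
dt1 <- dt %>%
  dplyr::select(-Index) %>% # drop the row index (needs dplyr installed), otherwise spread keeps one row per Index
  gather(Pop_sex, value, Male_Pop, Female_Pop) %>% # stack the two value columns
  unite(Pop_age, Pop_sex, Age, sep = "_", remove = TRUE) %>% # paste Pop_sex and Age
  spread(Pop_age, value) # spread into wide format

# this is optional; kept outside the pipe so that dt1 holds the data and not the NULL returned by write.csv
write.csv(dt1, file = "~/My folder of choice/nameofmyfile.csv")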
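Recent versions of tidyr mark gather() and spread() as superseded in favour of pivot_longer() and pivot_wider(). The sketch below shows how the same long-to-wide reshape might look with pivot_wider(), assuming tidyr 1.0.0 or later. The data frame is a toy one: the rows for 2000 are taken from the output above, while the rows for 2001 are invented for illustration.

```r
library(tidyr)

# Toy data: 2000 rows copied from head(dt), 2001 rows are made up
dt <- data.frame(
  Index      = 1:4,
  Country    = "AmericanSamoa",
  Year       = c(2000, 2000, 2001, 2001),
  Age        = c(0, 1, 0, 1),
  Male_Pop   = c(874, 773, 880, 780),
  Female_Pop = c(836, 747, 840, 750)
)

wide <- pivot_wider(
  dt,
  id_cols     = c(Country, Year),        # one row per country and year; Index is dropped
  names_from  = Age,                     # ages become the column suffixes
  values_from = c(Male_Pop, Female_Pop)  # gives Male_Pop_0, Male_Pop_1, Female_Pop_0, ...
)

wide
```

Because both value columns are passed to values_from, there is no need for a separate gather and unite step.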

There’s a useful trick I’ve been using to get n csv files out of one long format dataset (e.g. 1 file per year). I found it somewhere on Stack Overflow:

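library(dplyr) # group_by() and do() come from dplyr

# write one group's rows to "name_<year>.csv" and return them unchanged
customFun <- function(mydt) {
  write.csv(mydt, paste0("name_", unique(mydt$year), ".csv"))
  return(mydt)
}

mydt %>%
  unite(newvar, 3:4, sep = "_", remove = TRUE) %>% # paste cols 3 and 4
  spread(newvar, value) %>%
  group_by(year) %>% # one group, and so one file, per year
  do(customFun(.))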
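If you would rather not depend on dplyr for this step, one possible alternative is base R's split(). This is only a sketch: the data frame, the output folder (a temporary directory here) and the file name prefix are hypothetical.

```r
# Hypothetical toy data with a "year" column
mydt <- data.frame(
  year    = c(2000, 2000, 2001),
  country = c("A", "B", "A"),
  value   = c(1, 2, 3)
)

out_dir <- tempdir() # assumption: replace with your folder of choice

# split() returns a named list with one data frame per year
pieces <- split(mydt, mydt$year)

for (yr in names(pieces)) {
  write.csv(pieces[[yr]],
            file.path(out_dir, paste0("name_", yr, ".csv")),
            row.names = FALSE)
}

list.files(out_dir, pattern = "^name_")
```

The list names come from the values of year, so each file is named after the year it contains.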

Author's note: wide formats are never very useful, but in case you really need them (linear regression & co.), tidyr is a very compact solution. Be mindful that spreading over more than 1000 columns takes time. To get back from wide to long format, use gather.
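As a rough sketch of the way back, the example below gathers a small wide data frame into long format and then splits the combined key into its two parts with separate(). The wide data frame is built by hand from the values shown earlier, and the column names Sex and Age are chosen for the illustration.

```r
library(tidyr)

# Wide toy data: one row per country and year, one column per sex and age
wide <- data.frame(
  Country      = "AmericanSamoa",
  Year         = 2000,
  Male_Pop_0   = 874,
  Male_Pop_1   = 773,
  Female_Pop_0 = 836,
  Female_Pop_1 = 747
)

long <- wide %>%
  gather(Pop_age, value, -Country, -Year) %>% # every column except Country and Year
  separate(Pop_age, into = c("Sex", "Age"),
           sep = "_Pop_", convert = TRUE)     # "Male_Pop_0" becomes "Male" and 0

long
```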