Lecture 12: Data carpentry with `tidyverse`¶

STAT598z: Intro. to computing for statistics¶

Vinayak Rao¶

Department of Statistics, Purdue University¶

We will use a dataset of movies scraped off IMDB: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

Available from the class website

In [ ]:

movies_orig <- read.csv('./Data/movie_metadata.csv')
movies      <- movies_orig
# Can view this in RStudio using View(movies)
movies[1,]

In [ ]:

unique(movies$director_name)

In [ ]:

unique(movies$director_name[movies$imdb_score>8.5])

In [ ]:

(movies$movie_title[movies$imdb_score>9])

In [ ]:

library('tidyverse')
movies <- as_tibble(movies)

Most functions that works with dataframes works with tibbles

functions in tidyverse require tibbles
additionally, tibbles have some nice conveniences

In [ ]:

my_rnd <- tibble(x=rnorm(10), y = x+1, z = x>0)
print(my_rnd) # tibbles also print a bit more nicely

The 'pipe' operator `%>%`¶

tidyuniverse gets this from package purrr

magrittr offers additional functionality

A side point on infix functions¶

%func_name% is syntax for infix (rather than prefix) functions:

In [ ]:

'%plus%' <- function(x,y) x+y
    
1 %plus% 2; '%plus%'(3,4)

%>% pipes output of first function to first argument of the second

Can give more readable code. E.g. consider

In [ ]:

range(
      movies$actor_1_facebook_likes[
        order(
              movies$imdb_score, decreasing = T
             )
       ][1:10]
     )
# range(movies$actor_1_facebook_likes[
#   order(movies$imdb_score, decreasing = T)][1:10])

Have to parse code from inside to outside.

In [ ]:

movies$imdb_score %>% 
  order(decreasing = T) %>% 
  movies$actor_1_facebook_likes[.] %>%  #What are the arguments to '['?
  .[1:10] %>%
  range

By default, output of function to left of %>% is the first argument of the function to the right

Use . as placeholder if argument you are piping to is not the first

In [ ]:

4 %>% log(2) # log(4,2)

In [ ]:

4 %>% log(2,.) # log(2,4)

Can pipe to multiple arguments

In [ ]:

2 %>% log(.+6,.) # log(8,2)

Pipes in pipes are possible (but be careful)

In [ ]:

2 %>% log(.+6 %>% .^2 %>% print,.); log(38,2)

tiyverse gets %>% from the purrr package

The magrittr package provides more such functions.

E.g. the T-pipe %T>% passes the LHS onwards

useful for functions like plot where output isn't important

In [ ]:

library(magrittr); rnorm(100) %T>% hist  %>% mean

Our next package from tidyverse is dplyr

filter: pick observations by values (rows)
arrange: reorder rows
select: pick variables (columns) by their names
mutate: create new variables from existing variables
summarise: summarise many values

The scope of each is determined by group_by

For a more thorough overview, look at R for Data Science (http://r4ds.had.co.nz/transform.html#datatransformation)

The filter `filter()` function¶

Select observations/rows based on value

Cleaner alternative to indexing with logicals and which

In [ ]:

deniro <- filter(movies,actor_1_name == "Robert De Niro")
deniro[,c('imdb_score','movie_title')]

In [ ]:

deniro_good <- movies %>% 
                filter(actor_1_name == "Robert De Niro", 
                        imdb_score > 8)
deniro_good[,c('imdb_score','movie_title')]

Multiple argument are equivalent to logical AND (&):

 deniro_good <- filter(movies,actor_1_name == "Robert De Niro" & imdb_score > 7)

Logical or's must be written using |

In [ ]:

dnr_pcn<-movies %>% 
         filter((actor_1_name=="Robert De Niro") |
                            (actor_1_name=="Al Pacino"), 
                        imdb_score > 7)
dnr_pcn[,c('actor_1_name','imdb_score','movie_title')]

The `select()` function¶

Unlike filter(), select() picks columns of a tibble

In [ ]:

select(deniro_good, movie_title, imdb_score)

In [ ]:

select(deniro_good, director_name:actor_2_name)[1:10,]

Can also use - to eliminate columns:

In [ ]:

select(deniro_good, -(director_name:actor_2_name))

Also includes convenience functions like contains("actor") and num_range("var",1:3)

The `arrange()` function¶

Orders rows in increasing order of any chosen column

Additional columns can be provided to break ties
desc() can be used to sort in decreasing order

Missing values always go at the end

In [ ]:

movies %>% arrange(desc(imdb_score),desc(gross)) %>% 
   select(movie_title, imdb_score, gross) %>% .[1:10,]

In [ ]:

arrange(movies, imdb_score, gross) %>% 
  select(movie_title, imdb_score, gross) %>% .[1:10,]

In [ ]:

tmp <- arrange(movies, desc(imdb_score), desc(gross))

The `mutate()` function¶

Creates new columns at the end of current data.frame

In [ ]:

movies %>% filter(country== "USA") %>% 
     select(movie_title, imdb_score, gross, budget) %>%
     mutate(succ = gross/budget) %>% 
     arrange(desc(succ)) %>% .[1:20,]

mutate can refer to functions just created

In [ ]:

movies %>% filter(country == "USA") %>%
  select(movie_title, imdb_score, gross, budget) %>%
  mutate(succ = gross-budget, perc= 100*succ/budget) %>%
  distinct() %>% arrange((succ))

distinct() is a useful function to remove repeated rows

can provide column names as arguments for partial repetitions

transmute() is useful if we only care about the new column

`summarise()` and `group_by()`¶

Summarise collapses a dataframe to a single row:

In [ ]:

summarise(movies, score = mean(imdb_score))

Becomes very powerful in conjunction with group_by()

In [ ]:

top_dir <- movies %>% group_by(director_name) %>% 
        summarise(score = mean(imdb_score)) %>% 
        arrange(desc(score)) 
top_dir[1:15,]

n() is a convenient function to get number of elements

In [ ]:

top_dir <- movies %>% group_by(director_name) %>% 
        summarise(count=n(), score = mean(imdb_score)) %>% 
        arrange(desc(score)) %>% 
        filter(count>=5)
top_dir

In [ ]:

yr_scr <- movies %>% group_by(title_year) %>% 
         summarise(count=n(), score = median(imdb_score), 
                              ymin = quantile(imdb_score,.1),
                              ymax=quantile(imdb_score,.9)) %>% 
         arrange(desc(score)) %>% filter(count>=5)
yr_scr

In [ ]:

ggplot(yr_scr , aes(x=title_year, y = score)) + 
    geom_line() + 
    geom_errorbar(aes(ymin=ymin,ymax=ymax))

Can have nested groupings (can revert with ungroup())

In [ ]:

act_dir<-movies %>% group_by(actor_1_name,director_name) %>% 
        distinct(movie_title, .keep_all = T) %>% 
        summarise(num = n(), scr = mean(gross-budget),
                  ttl = paste(movie_title, collapse=";")) %>% 
        arrange(desc(scr)) %>% filter(num>2) 
act_dir[1:20,]

Let's try something more complicated:

Can we analyse scores/earnings across genres?

Things are actually a bit more complicated:

In [ ]:

levels(movies$genres)
#movies %>% select(movie_title, genres) %>% .[1:10,]

In [ ]:

gnr_type <- as.character(levels(movies_orig$genres)) %>% 
       strsplit('\\|') %>% #will see regular expressions later
       unlist %>% unique
gnr_type

In [ ]:

movies <- movies_orig
movies[,gnr_type] <- F
movies$genres <- as.character(movies$genres)
movies[29:54]

In [ ]:

for(ii in 1:nrow(movies)) { # Will look at better approaches
  movies[ii,gnr_type] <- 
   gnr_type %in% strsplit(movies$genres[ii],"\\|")[[1]]
}

colnames(movies)[38] <- "Sci_fi"
colnames(movies)[51] <- "Reality_TV"
colnames(movies)[53] <- "Film_Noir"
colnames(movies)[54] <- "Game_Show"
gnr_type <- colnames(movies)[29:54]
movies[1:10,29:54]

In [ ]:

#lm(imdb_score ~ Action + Adventure, movies  )
rslt <- lm(paste("gross ~", 
            (paste(gnr_type,collapse = '+'))), movies)
rslt

In [ ]:

movies$ntile <- ntile(movies$imdb_score,10)
movies %>% select(movie_title, ntile) %>% .[1:10,]

summarise_each let's one summarize multiple columns easily

In [ ]:

gnr_frac <- movies %>% group_by(ntile) %>% 
                       select(Action:Game_Show)%>% 
                       summarise_each(funs(mean, sum))
gnr_frac

In [ ]:

library('RColorBrewer')
gnr_frac %>% 
  gather('Genre','Count',Action:Game_Show) %>% ggplot() + 
  geom_line(aes(x=ntile,y=Count,color=Genre,linetype=Genre),
             size=1) + 
  scale_linetype_manual(values=c(rep("solid", 12),rep("dashed",11),
                                 rep("twodash",3))) +
  scale_color_manual(values=c(brewer.pal(12,"Set3"),
            brewer.pal(11,"Set3"), brewer.pal(3,"Set3"))) + 
  scale_y_log10()

mutate_each allows you to transform multiple columns

In [ ]:

gnr_frac %>% mutate_each(funs(./sum(.)),Action:Game_Show)

Generating tidy data¶

The `gather' function, allows you to combine multiple columns into 2 coloumns.

turns wide data into tall data

Tall data is useful for e.g. ggplot

In [ ]:

state_info <- as_tibble(state.x77)
(state_info[1:10,])

In [ ]:

state_info %>% 
  gather(Illiteracy:`HS Grad`, key='InfoType',
         value='InfoValue')%>%
  ggplot + 
   geom_smooth(aes(x=Income, y=InfoValue,color=InfoType))

In [ ]:

movies %>% filter(country=="USA") %>% 
     select(title_year, budget, gross)  %>% 
    gather(budget,gross, key = 'type', value='amt') %>% 
      ggplot + geom_smooth(aes(x=title_year,y=log10(amt), 
                               group=type, color=type)) +
      facet_wrap(~type)

In [ ]:

movies %>% filter(country=="USA") %>% 
    select(title_year, budget, Action:Musical)  %>% 
    gather(Action:Musical, key = 'type', value='amt') %>% 
    filter(amt==TRUE) %>%  
      ggplot + geom_smooth(aes(x=title_year,y=log10(budget), 
                               group=type, color=type),se=F)

spread() does the opposite

turns a tall data.frame into a wide one

Wide data is useful for e.g. lm

In [ ]:

spread(state_tall, key = InfoType, value=InfoValue)

In [ ]:

stdnt <- tibble(
  name      = rep(c("Alice", "Bob"), each=4),
  year      = c(2015, 2015, 2016, 2016, 2015, 2015, 2016, 2016),
  semester  = c("Spring","Fall","Spring","Fall", "Spring","Fall","Spring","Fall"),
  gpa = c(3.2, 3.9, 3.1, 3.6, 3.1, 3.9, 3.3, 3.3)
  )
stdnt

In [ ]:

stdnt %>% spread(key=semester, value=gpa)

What if there are missing/extra values?

melt() and dcast() from package reshape are slightly more powerful

however, tidyr with splyr should meet all your needs

Another useful pair of functions in separate() and unite()

In [ ]:

tmp <- movies %>% separate(director_name,c("First","Last"), 
                           sep=" ")

Can control what to do with missing/extra elements:

tmp <- movies %>% separate(director_name,c("First","Last"), sep=" ", extra="merge", fill="left")

Regular expressions will allow more expressivity

unite() does the opposite

In [ ]:

tmp <- unite(stdnt, yr_sm, year, semester)

In [ ]:

spread(tmp, key=yr_sm, value=gpa)

Lecture 12: Data carpentry with tidyverse¶

STAT598z: Intro. to computing for statistics¶

Vinayak Rao¶

Department of Statistics, Purdue University¶

The 'pipe' operator %>%¶

A side point on infix functions¶

The filter filter() function¶

The select() function¶

The arrange() function¶

The mutate() function¶

summarise() and group_by()¶

Generating tidy data¶

Lecture 12: Data carpentry with `tidyverse`¶

The 'pipe' operator `%>%`¶

The filter `filter()` function¶

The `select()` function¶

The `arrange()` function¶

The `mutate()` function¶

`summarise()` and `group_by()`¶