6  The Tidyverse approach to Data Analysis

The tidyverse is a collection of R packages, developed and promoted by Posit, that share a common design philosophy and aim to standardize the typical steps of a data science workflow.

https://www.tidyverse.org/

Tidyverse design principles

6.1 Packages included

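  • readr for reading rectangular data (CSV and similar files)
  • tibble for a modern version of the data frame
  • dplyr for data manipulation
  • tidyr for “tidy”ing data
  • forcats for working with factors
  • ggplot2 for plotting data
  • purrr for functional programming
  • stringr for string manipulation
  • magrittr, which provides the pipe operator %>%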

6.1.1 The Pipe operator in packages magrittr and dplyr

The pipe %>% operator was first introduced in the magrittr package and became central to the tidyverse. It passes the left-hand side as the first argument to the right-hand side function, allowing readable left-to-right chains.

Since R 4.1 (2021), base R ships its own native pipe |> which covers most common use cases without loading any package:

# magrittr pipe (dplyr re-exports %>% and provides filter())
library(dplyr)
mtcars %>% filter(cyl > 4) %>% summary()

# native pipe (base R >= 4.1, no package needed)
mtcars |> subset(cyl > 4) |> summary()

Both operators are functionally equivalent for the examples in this book. The native |> is often preferred in new code; %>% remains common in older scripts, supports the . placeholder, and comes with companion operators in magrittr (%T>%, %<>%) for advanced use.

Either pipe comes in handy when applying a sequence of functions to a data frame.
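As a minimal sketch of what the pipe does, the following compares a nested call with its piped form. It uses base R only and assumes R 4.1 or later; the variable names nested and piped are illustrative.

```r
# Nested call: has to be read from the inside out.
nested <- summary(subset(mtcars, cyl > 4))

# Piped call: reads left to right, one step after another.
# x |> f(y) is parsed as f(x, y).
piped <- mtcars |> subset(cyl > 4) |> summary()

# Both forms produce the same object.
identical(nested, piped)
```

The piped form avoids both nesting and intermediate variables, which matters more as the chain grows.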

6.2 dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • group_by() allows you to perform any of these operations “by group”.

6.2.1 Examples

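glimpse prints a compact overview of a data frame: one line per column, with its type and first values.

library(dplyr)  # the examples in this section rely on dplyr
glimpse(mtcars)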

select is used for subsetting variables (columns).

select(mtcars, mpg)

select(mtcars, mpg:disp, -cyl) # mpg to disp, except cyl

mutate adds new columns to a dataset, computed from the existing ones.
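A short sketch of mutate on mtcars follows. The new column name hp_per_wt and the result name cars_ratio are illustrative choices, not part of the dataset.

```r
library(dplyr)

# Add a power-to-weight column; the original columns are kept.
# wt is recorded in units of 1000 lbs, so this is hp per 1000 lbs.
cars_ratio <- mtcars %>%
  mutate(hp_per_wt = hp / wt)

# Show only the columns involved, for the first rows.
cars_ratio %>%
  select(hp, wt, hp_per_wt) %>%
  head()
```

mutate returns a new data frame and leaves mtcars itself unchanged.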

filter keeps the rows (cases) whose values satisfy a condition.

mtcars %>% filter(hp > 100)

group_by is used to group data by one or more columns. It is usually followed by summarize, which then returns one row per group.

mtcars %>% filter(hp > 100) %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg))

arrange is used to sort cases in ascending or descending order.

mtcars %>% filter(hp > 100) %>% group_by(cyl) %>% summarize(avg_mpg = mean(mpg)) %>% arrange(desc(cyl))
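Longer chains are usually easier to read with one verb per line. The sketch below lays out the same pipeline that way; the result name mpg_by_cyl and the extra n_cars column are added here for illustration.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%            # keep cars with more than 100 hp
  group_by(cyl) %>%               # one group per cylinder count
  summarize(
    avg_mpg = mean(mpg),          # mean mpg within each group
    n_cars  = n()                 # number of cars in each group
  ) %>%
  arrange(desc(cyl))              # largest cylinder count first

mpg_by_cyl
```

Adding a count next to each mean makes it easier to judge how much weight a group average deserves.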

6.3 Tibble

“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).” https://tibble.tidyverse.org/
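The following sketch illustrates one practical difference between the two structures, assuming the tibble package is installed. The objects df and tb are illustrative. Note that since R 4.0.0, data.frame() no longer converts strings to factors by default either, so the subsetting behaviour is the more visible difference today.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# Selecting a single column with [ : a data frame silently drops
# to a plain vector, while a tibble stays a tibble.
class(df[, "x"])
class(tb[, "x"])

# Character columns are kept as character in a tibble.
is.character(tb$y)
```

Because [ always returns a tibble, code that subsets columns behaves the same whether it selects one column or several.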

6.4 Tidymodels

tidymodels is the modern modeling framework in the tidyverse ecosystem. This chapter presents tidymodels as the primary approach for data preprocessing, training, tuning, and evaluation.
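As a rough sketch of how these stages fit together, the example below fits a linear regression on mtcars with tidymodels. It assumes the tidymodels package is installed; the choice of predictors (hp and wt), the split proportion, and all object names are illustrative assumptions, and tuning is left out to keep the example short.

```r
library(tidymodels)

set.seed(123)  # make the random split reproducible

# Split the data into training and test sets.
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Preprocessing: a recipe that centers and scales the predictors.
car_recipe <- recipe(mpg ~ hp + wt, data = car_train) %>%
  step_normalize(all_numeric_predictors())

# Model specification: ordinary linear regression via lm().
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Bundle preprocessing and model, then train on the training set.
car_wf <- workflow() %>%
  add_recipe(car_recipe) %>%
  add_model(lm_spec)

car_fit <- fit(car_wf, data = car_train)

# Evaluation: predict on the test set and compute the RMSE.
car_pred <- predict(car_fit, new_data = car_test) %>%
  bind_cols(car_test)

rmse(car_pred, truth = mpg, estimate = .pred)
```

Keeping the recipe and the model in one workflow means the preprocessing learned on the training data is applied unchanged when predicting on new data.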

The tidymodels Package

6.4.1 Criticisms of this approach