6 The Tidyverse approach to Data Analysis
The tidyverse is a collection of R packages, promoted by Posit, that share a common design philosophy and aim to standardize the data science workflow.
6.1 Packages included
- readr
- tibble
- dplyr for data manipulation
- tidyr for “tidy”ing data
- forcats
- ggplot2 for plotting data
- purrr
- stringr
- and the pipe operator %>%
6.1.1 The Pipe operator in packages magrittr and dplyr
The pipe %>% operator was first introduced in the magrittr package and became central to the tidyverse. It passes the left-hand side as the first argument to the right-hand side function, allowing readable left-to-right chains.
Since R 4.1 (2021), base R ships its own native pipe |> which covers most common use cases without loading any package:
# magrittr pipe (filter() comes from dplyr, which also re-exports %>%)
library(dplyr)
mtcars %>% filter(cyl > 4) %>% summary()
# native pipe (base R >= 4.1, no package needed)
mtcars |> subset(cyl > 4) |> summary()
Both operators are functionally equivalent for the examples in this book. The native |> is preferred in new code; %>% remains common in older scripts, and magrittr provides additional operators (%T>%, %<>%) for advanced use.
The pipe comes in handy when applying a sequence of functions to data frames.
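To make the "left-hand side becomes the first argument" rule concrete, here is a minimal sketch using only base R (it assumes R 4.1 or later for the native pipe). The variable names piped and nested are introduced only for this illustration.

```r
# x |> f(y) is rewritten by the parser into f(x, y)
deparse(quote(mtcars |> head(3)))   # "head(mtcars, 3)"

# A two-step chain and its nested equivalent give the same result
piped  <- mtcars |> subset(cyl > 4) |> nrow()
nested <- nrow(subset(mtcars, cyl > 4))

identical(piped, nested)   # TRUE
```

The piped form reads in the order the operations are applied, while the nested form has to be read from the inside out.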
6.2 dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
group_by() allows you to perform any operation “by group”.
6.2.1 Examples
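glimpse prints a compact overview of a dataset, with one line per column showing its type and first values
glimpse(mtcars)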
select is used for subsetting variables
select(mtcars,mpg)
select(mtcars, mpg:disp, -cyl) # mpg to disp, except cyl
mutate adds new columns to a dataset
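As a minimal sketch of mutate, the following derives two metric columns from mtcars. The column names kpl and wt_kg and the rounded conversion factors are choices made for this illustration only.

```r
library(dplyr)

# kpl and wt_kg are hypothetical column names chosen for this example
cars_metric <- mtcars %>%
  mutate(
    kpl   = mpg * 0.4251,        # miles per US gallon -> km per litre (approx.)
    wt_kg = wt * 1000 * 0.4536   # wt is recorded in 1000 lbs -> kilograms (approx.)
  )

# Show the original and derived columns side by side
cars_metric %>% select(mpg, kpl, wt, wt_kg) %>% head(3)
```

The original columns are kept; mutate only appends the new ones (or overwrites a column if an existing name is reused).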
filter selects cases based on the values of the rows
mtcars %>% filter(hp > 100)
group_by is used to group data by one or more columns. It is usually followed by summarize, which then computes one summary row per group.
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))
arrange is used to sort cases in ascending or descending order.
mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg)) %>%
  arrange(desc(cyl))
6.3 Tibble
“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).” https://tibble.tidyverse.org/
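One behavioural difference can be sketched as follows, assuming the tibble package is installed; the toy data frame df is made up for this illustration. Note that the character-to-factor conversion mentioned in the quote has also been off by default in data.frame() since R 4.0, so the sketch focuses on subsetting instead.

```r
library(tibble)

# Hypothetical toy data for the comparison
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

# Selecting one column with [ : a data frame drops to a vector,
# while a tibble always stays a tibble
class(df[, "x"])    # "integer"
class(tbl[, "x"])   # "tbl_df" "tbl" "data.frame"

# Character columns remain character vectors
class(tbl$y)        # "character"
```

Because a tibble still inherits from data.frame, it can generally be passed to functions that expect a regular data frame.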
6.4 Tidymodels
tidymodels is the modern modeling framework in the tidyverse ecosystem. This chapter presents tidymodels as the primary approach for data preprocessing, training, tuning, and evaluation.
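A minimal sketch of the split, train, and evaluate steps might look like the following, assuming the tidymodels meta-package is installed. The seed, the 75/25 split, and the formula mpg ~ wt + hp are illustrative choices, not taken from the text, and preprocessing and tuning are left out.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed so the split is reproducible

# Hold out 25% of mtcars for evaluation
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Model specification: linear regression fitted with base R's lm()
lm_spec <- linear_reg() %>% set_engine("lm")

# Fit on the training set only (illustrative formula)
lm_fit <- lm_spec %>% fit(mpg ~ wt + hp, data = car_train)

# predict() returns a tibble with a .pred column; attach it to the test set
car_preds <- predict(lm_fit, new_data = car_test) %>% bind_cols(car_test)

# Root mean squared error on the held-out data
car_preds %>% rmse(truth = mpg, estimate = .pred)
```

Separating the model specification (lm_spec) from the fit means the same pipeline can be reused with another engine by changing only the set_engine() call.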
6.4.1 Criticisms of this approach
- https://towardsdatascience.com/a-thousand-gadgets-my-thoughts-on-the-r-tidyverse-2441d8504433
- https://github.com/matloff/TidyverseSkeptic/blob/master/READMEFull.md
- https://www.r-bloggers.com/2019/12/why-i-dont-use-the-tidyverse/
- Jared P. Lander, R for Everyone. Advanced Analytics and Graphics, 2nd ed., Addison-Wesley, 2017