5 R Packages

5.1 Modern Package Equivalents

This book includes examples accumulated over multiple years. Some packages are kept for pedagogical continuity, but modern projects should generally prefer the following ecosystem updates:

Legacy package / pattern	Recommended modern equivalent	Notes
`caret`	`tidymodels` (`parsnip`, `recipes`, `workflows`, `tune`, `yardstick`)	Main modeling framework used in updated chapters
`FSelector`	`FSelectorRcpp`, `Boruta`, or `recipes` + `embed`	`FSelector` is aging; modern feature selection is usually recipe-based or model-based
`randomForest` (standalone workflows)	`ranger` via `parsnip`	Faster engine and native integration with tuning/resampling
`RMySQL`	`DBI` + `RMariaDB`	DBI-compliant connector and actively maintained backend
`foreign::read.arff`	`farff::readARFF` or `RWeka::read.arff`	Better maintained import options for ARFF data
`party`	`partykit`	Modern tree infrastructure
`reshape`	`tidyr`	`pivot_longer()` / `pivot_wider()` are current standard
`plyr`	`dplyr` / `purrr`	Superseded by tidyverse verbs and functional workflows
`preProcess` pipelines	`recipes` steps	Better integration with resampling and workflows
ad-hoc evaluation helpers	`yardstick`	Standardized metrics API
`ROCR` / manual ROC plotting	`yardstick` + `pROC` + `ggplot2`	Cleaner ROC/PR workflows with tidy outputs
`e1071::naiveBayes` only	`naivebayes` or `discrim` in `tidymodels`	More consistent modern modeling interfaces
`DMwR2::SelfTrain`-style workflows	pseudo-labeling with `tidymodels`	Flexible SSL templates in current chapters
`tm` text pipelines	`tidytext` (+ `quanteda` for corpus workflows)	More modern tokenization and text-analysis ecosystem

Packages still used in some examples (foreign, FSelector, tm, vioplot, UsingR) remain for compatibility with historical datasets and classroom material. They can be progressively replaced as equivalent examples are added.
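As a rough aid when migrating older scripts, a small lookup table can flag legacy packages and point to the replacement suggested above. This is only a sketch: the helper name suggest_modern() and the regular expression used to detect library() calls are assumptions made for illustration, and the mapping covers just a few rows of the table.

```r
# Partial mapping taken from the table above (legacy -> suggested replacement)
modern_equivalent <- c(
  caret        = "tidymodels",
  randomForest = "ranger (via parsnip)",
  RMySQL       = "DBI + RMariaDB",
  party        = "partykit",
  reshape      = "tidyr",
  plyr         = "dplyr / purrr"
)

# Hypothetical helper: scan script lines for library()/require() calls
# and report the legacy packages that have an entry in the mapping.
suggest_modern <- function(script_lines, map = modern_equivalent) {
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z0-9.]+)[\"']?"
  hits <- regmatches(script_lines, regexec(pattern, script_lines))
  # each hit is c(full match, function name, package name) or character(0)
  pkgs <- unique(as.character(unlist(
    lapply(hits, function(m) if (length(m) == 3) m[3])
  )))
  legacy <- intersect(pkgs, names(map))
  data.frame(legacy = legacy, suggested = unname(map[legacy]),
             stringsAsFactors = FALSE)
}

old_script <- c("library(caret)", "library(ggplot2)", "require('plyr')")
suggest_modern(old_script)
```

Packages without an entry in the mapping (ggplot2 in this example) are simply left out of the result.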

We are going to create a minimal package with RStudio, Positron or VS Code. The package structure will be uploaded to GitHub in order to track all changes made to the package. Other users will be able to install your package from GitHub.

5.1.1 Create a repository in your GitHub account

The repository will contain your package. Example names: hellopackage, basicpackage2, etc. Tick the box “Add README.md”: this creates the main branch in GitHub. Accordingly, we will use the main branch in our project for the first commits (and not the master branch).

5.1.2 Create your Package-Project in RStudio, allowing Git to track changes

git remote add origin <https://github.com/yourrepo......./yourpackage----.git>

Use main (and not master) in order to conform to the current GitHub default:

git pull origin main

git push -u origin main

Now we should see our changes in the GitHub repository.
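To see which files make up a minimal package before continuing, the sketch below builds a throwaway skeleton in a temporary directory. It uses base R's utils::package.skeleton() rather than the RStudio project wizard followed in this chapter, so treat it only as an illustration of the layout; the folder name skeleton-demo is an arbitrary choice.

```r
# Put one function into a private environment to seed the skeleton
seed_env <- new.env()
assign("hello", function() print("Hello, world!"), envir = seed_env)

# Create the skeleton under a temporary directory (nothing is installed)
pkg_parent <- file.path(tempdir(), "skeleton-demo")
dir.create(pkg_parent, showWarnings = FALSE)
suppressMessages(
  utils::package.skeleton(
    name = "hellopackage", list = "hello",
    environment = seed_env, path = pkg_parent, force = TRUE
  )
)

# Inspect the generated layout: DESCRIPTION, NAMESPACE, R/ and man/
pkg_dir <- file.path(pkg_parent, "hellopackage")
list.files(pkg_dir, recursive = TRUE)
```

The DESCRIPTION and NAMESPACE files listed here are the same two files that the following steps edit or regenerate.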

5.1.3 Continue creating the package with RStudio

Switch to the Build tab and click Check. Usually we get warnings about the license, the documentation and the NAMESPACE file.

Go to Tools > Project Options > Build Tools, tick the box “Generate documentation with Roxygen” and do not change the defaults.
In the Build panel, select More > Clean and Rebuild.
Check.
Delete the NAMESPACE file, because roxygen2 regenerates it when documenting the package (it will not overwrite a NAMESPACE file that it did not create).
Go to the DESCRIPTION file and change the License to GPL-3 (or whatever). Save the file.
Documentation warnings. We may include comments and other information for our functions using #' in the .R files. We should include @export for exporting the functions and @param for describing the parameters.

Example: code your own function, such as the funhist2df function below that plots a histogram and boxplot and returns a summary data frame:

#' Creates some plots and a numeric summary
#'
#' @param x numeric vector
#' @return a one-row data frame with the minimum, median, mean and maximum of x
#' @export
funhist2df <- function(x) {
  # draw both plots side by side and restore the previous layout on exit
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  hist(x, col = rainbow(30))
  boxplot(x, col = "green")
  data.frame(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

Go to More > Document (or press Ctrl+Shift+D).
More > Clean and Rebuild.
Check.
Most probably, there is an error in the file hello.R. We must document and export that function, too. Include roxygen lines such as:

#'
#' @export

We should see 0 errors | 0 warnings | 0 notes.
Save all changes.

5.1.4 Commit and push all your changes to GitHub

In order to install your new package and to see your changes, close the project without saving the workspace and restart RStudio with a clean environment.

5.1.4.1 To install the new package from GitHub

Load devtools and install from GitHub:

library(devtools)
devtools::install_github("yourusername/the_name_of_the_repo_containing_your_package")

5.1.5 Adding a vignette with data analysis

A vignette is a standard form of writing long and detailed documentation for a package. That includes any type of report. Type in the console (change names as you wish):

usethis::use_vignette(name = "vignette1", title = "My analysis of the data")

You will see that a file vignette1.Rmd has been created. You can place any set of R chunks there. Each package that you use in the vignette must be declared in the DESCRIPTION file. This can be done automatically with

usethis::use_package("...whatever..package...")
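usethis::use_package() edits the dependency fields of the DESCRIPTION file. To make that concrete, the sketch below writes a made-up DESCRIPTION to a temporary file and reads the fields back with base R's read.dcf(). The package names and field values are illustrative assumptions, not taken from a real package.

```r
# A made-up DESCRIPTION, written to a temporary directory
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: hellopackage",
  "Version: 0.1.0",
  "Imports: ggplot2",
  "Suggests: knitr, rmarkdown"
), desc_path)

# read.dcf() returns a character matrix with one column per field
deps <- read.dcf(desc_path, fields = c("Imports", "Suggests"))

# Split the comma-separated Suggests field into package names
suggested <- trimws(strsplit(deps[1, "Suggests"], ",")[[1]])
suggested
```

Packages that are needed only by a vignette are commonly declared under Suggests rather than Imports; use_package() selects the field through its type argument.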

The file vignette1.Rmd will usually contain several chunks of R code:

library(thepackagethatyouarecreating)

# load the data that you have created
data(thedatasetthatyouhavecreatedinthispackage)
# other data manipulation as examples

# do whatever with the functions and datasets that you have created
summary(thedatasetthatyou....)

You may now Install and Restart. It will create a package that can be shared.

Those files must be tracked on GitHub. If you do not want to include vignettes, you may include any number of R Markdown documents instead, as described in the next section.

5.1.6 Adding R Markdown documents

We can add any number of R Markdown files to our package. Usually we will put them in a new rmd/ folder inside the inst/ folder. This folder must be tracked on GitHub.

Now you can use your functions by typing yourpackage::yourfunction1().
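Files placed under inst/ are copied to the top level of the installed package, so they can be located with system.file(). The sketch below demonstrates the lookup first on a file that ships with base R and then on a hypothetical rmd/report.Rmd in a package called yourpackage; both of those names are placeholders.

```r
# system.file() returns the full path of a file inside an installed package
desc_file <- system.file("DESCRIPTION", package = "stats")
file.exists(desc_file)

# For a file stored as inst/rmd/report.Rmd in the source package, the
# installed path drops the inst/ prefix. "yourpackage" and "report.Rmd"
# are placeholders; an empty string means the file was not found.
report_path <- system.file("rmd", "report.Rmd", package = "yourpackage")
if (nzchar(report_path)) {
  message("Found report at: ", report_path)
} else {
  message("Report not found; is the package installed?")
}
```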

5.2 A good use of a package: to export and to share data

We may add a dataset to a package so that it can be used when the package is installed. Or we can create a package that contains only data to be shared. An example of the second use is the well-known R package gapminder. Another recent example of a data package is hagr (remotes::install_github("datawookie/hagr")).

We now focus on the second aspect and create a package that contains only data (although we may also add some reports in the form of documentation).

In RStudio, create a new project: New Project > New Directory > newname. For the sake of example we create the new project with the name “datapackaplusb” (we intend to combine two simple datasets into one single file).
Create a repository in GitHub with the same name as the project, for the sake of clarity. Add some comments to the README.md. You will see the “main” branch created for the repo. Click the Code button and copy the https://github… URL.
In RStudio, go to More > Configure Build Tools > Git/SVN, select Git, say yes to create a git repository and restart RStudio. Now you have your local repository created (most probably in the “master” branch).
Open the shell (Git > More > Shell) and paste the URL that you copied from GitHub into the command git remote add origin <https://github.com-----------.git>
Type git pull origin main. With this first pull your local directory contains all changes done in “main” in GitHub. You should now see the “main” branch in RStudio.
Important: go to the Git tab and switch from the “master” branch to the “main” branch so that both the local repository and GitHub are on the same “main” branch.
In the Git tab select the files and directories that you want to commit and push to GitHub (first commit and then push, or run git push -u origin main). Now you should see your changes in the GitHub repository.

5.2.1 Create folder for the original files to be processed

Usually external files are placed in the directory inst/extdata, so we may place the data there. Create those folders, then copy the files albrecht.csv and bailey.csv (available in datasets/effortEstimation) to inst/extdata.
Perform a first check of the package (click the Check button or run devtools::check()). There will be a warning about the license; rewrite it to, for instance, GPL-3.
Usually, the external files are not uploaded to GitHub, especially if they are large. Additionally, when building the package we may ignore files located in some directories by adding those directories to .Rbuildignore:

^data-raw$
^inst/extdata$

Clean and rebuild the package. This installs the package we are creating in our environment, so that everything is available for use.
For the sake of example, the files albrecht.csv and bailey.csv are the ones we copied into the folder inst/extdata.
We can retrieve the actual path to those extdata files with

system.file("extdata", "albrecht.csv", package = "datapackaplusb")

and then read them with read.csv() or read.table().
Process those external files into a usable data frame. We will create a script in a new data-raw folder with usethis::use_data_raw(name = "dfaandb") (give it any name you like).
Do whatever you wish with the data. In this case we simply combine two datasets. Use the following script as your data-raw/dfaandb.R:

first_file_path  <- file.path(getwd(), "inst", "extdata", "albrecht.csv")
second_file_path <- file.path(getwd(), "inst", "extdata", "bailey.csv")

data_one <- read.csv(first_file_path,  stringsAsFactors = FALSE, encoding = "UTF-8")
data_two <- read.csv(second_file_path, stringsAsFactors = FALSE, encoding = "UTF-8")

data_one$source <- "A"
data_two$source <- "B"
dfaandb <- rbind(data_one, data_two)

The final command is:

# save the dfaandb dataframe as an .rda file in datapackaplusb/data/
usethis::use_data(dfaandb, overwrite = TRUE)

This creates the data/ folder with the data frame stored as an .rda file.
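The combine-and-save pattern can be tried outside a package as well. The sketch below uses two tiny made-up data frames (not the real effort datasets) and base R's save() and load(), with a temporary file standing in for the data/ folder; all object names and values are assumptions for illustration.

```r
# Two made-up data frames with the same columns (illustrative values only)
data_one <- data.frame(size = c(10, 20), effort = c(5, 9))
data_two <- data.frame(size = c(15, 30, 45), effort = c(7, 14, 20))

# Tag each row with its origin, then stack the rows
data_one$source <- "A"
data_two$source <- "B"
toy_combined <- rbind(data_one, data_two)

# Save to an .rda file; a temporary file stands in for data/toy_combined.rda
rda_path <- file.path(tempdir(), "toy_combined.rda")
save(toy_combined, file = rda_path)

# load() restores the object under its original name
restored <- new.env()
load(rda_path, envir = restored)
identical(restored$toy_combined, toy_combined)
```

Note that rbind() on data frames requires both inputs to have the same column names, which is why the source column is added to each before stacking.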

Clean and Rebuild the package.
The data can be accessed in the environment with data("dfaandb", package = "datapackaplusb").
Document the data. Go to Build > More > Configure Build Tools and tick the box “Generate documentation with Roxygen”.
Create the empty file data.R in the R/ folder, then run More > Document or devtools::document().
Important: delete the NAMESPACE file and repeat devtools::document() so that the NAMESPACE file is regenerated.
Add the following content (change as appropriate) to the file data.R in the R/ folder:

#' Data of effort and size for several projects
#'
#' No missing values
#' A dataset containing -----  whatever you put here.
#'
#' @title DATASET OF ALBRECHT AND BAILEY
#' @format A data frame with 42 rows and three variables:
#' \describe{
#' \item{effort}{Effort measured in -------.}
#' \item{size}{Size measured in ....}
#' \item{source}{A or B indicating one source or another.}
#' }
#' @source \url{https://....domain.com... }
"dfaandb"

Roxygen transforms the code above into a dfaandb.Rd file and adds it to the man/ folder. We can view this documentation in the help pane by typing ?dfaandb in the R console.
Check. You may delete the file hello.R, or add the following lines above the function definition if you want to keep that function:

#' The most used program :-) Greeting.
#' @description Hello to the world
#' @export
#'
#' @examples
#' hello()

Commit and push all your changes to GitHub. You may ignore data-raw/ and inst/extdata.
Close the project. Restart and install the package from GitHub with devtools::install_github("yourrepo/datapackaplusb"), followed by library(datapackaplusb).
Type data("dfaandb", package = "datapackaplusb")
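The same call pattern can be tried before your own package exists by using a dataset that ships with R. The sketch below uses women from the built-in datasets package purely as a stand-in for dfaandb; the environment name demo_env is an arbitrary choice.

```r
# Load a dataset by name from a specific package into a chosen environment
demo_env <- new.env()
data("women", package = "datasets", envir = demo_env)

# The object appears under the dataset's name
ls(demo_env)
str(demo_env$women)

# Datasets can also be reached directly with the :: operator
head(datasets::women, 3)
```

The :: form works for packages that lazy-load their data; otherwise data() is the way to bring the object into an environment.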

Additional steps: vignette. If you wish, you can create a vignette in an .Rmd file with a report obtained from the data:

usethis::use_vignette(name = "effort_eda", title = "Basic EDA of the Effort data")

The directory vignettes/ is created and you may complete the .Rmd file.

You may have to install the system tool qpdf to avoid the warning about the size of the documents:

sudo apt-get install -y qpdf

Working with R Markdown. We may also add R Markdown documents to our package. We create a subfolder rmd/ in the inst/ folder.

5.3 Popular R Packages for Data Mining and Machine Learning

The tables below give a curated overview of popular packages used throughout this course and in the broader R data science ecosystem.

5.3.1 Modern ML Ecosystems

Package	Purpose	Install
tidymodels	Unified tidy ML framework (replaces caret)	CRAN
mlr3	Modular ML framework with pipelines	CRAN
caret	Classic unified interface (legacy)	CRAN

5.3.2 Classification and Regression

Package	Purpose	Install
randomForest	Random forest	CRAN
ranger	Fast random forest	CRAN
xgboost	Gradient boosting	CRAN
lightgbm	Fast gradient boosting	CRAN
e1071	SVM, naive Bayes	CRAN
glmnet	Regularised regression (ridge, lasso)	CRAN
rpart	Decision trees	CRAN

5.3.3 Unsupervised Learning

Package	Purpose	Install
cluster	PAM, CLARA, hierarchical clustering (agnes, diana)	CRAN
dbscan	Density-based clustering	CRAN
factoextra	PCA and clustering visualisation	CRAN
arules	Association rule mining	CRAN
arulesViz	Association rule visualisation	CRAN

5.3.4 Feature Engineering and Selection

Package	Purpose	Install
recipes	Pre-processing pipelines (tidymodels)	CRAN
Boruta	Feature selection via random forest	CRAN
FSelectorRcpp	Information-gain-based selection	CRAN
corrplot	Correlation matrix visualisation	CRAN

5.3.5 Text Mining and NLP

Package	Purpose	Install
tidytext	Tidy text analysis	CRAN
text2vec	Fast text vectorisation, topic models	CRAN
quanteda	Comprehensive NLP framework	CRAN
tm	Classic text mining (legacy)	CRAN

5.3.6 Network Analysis

Package	Purpose	Install
igraph	General graph analysis	CRAN
ggraph	Network visualisation (ggplot2-based)	CRAN
tidygraph	Tidy graph manipulation	CRAN
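
To find out which of the packages listed in these tables are already available locally, a check along the following lines may help. It is a minimal sketch: the vector holds only a subset of the tables, and the helper name is_installed() is an assumption made for illustration.

```r
# A subset of the packages from the tables above
course_pkgs <- c("tidymodels", "ranger", "xgboost", "glmnet",
                 "dbscan", "arules", "recipes", "tidytext", "igraph")

# Hypothetical helper: TRUE when the package directory can be located
is_installed <- function(pkg) nzchar(system.file(package = pkg))

status <- vapply(course_pkgs, is_installed, logical(1))
missing_pkgs <- names(status)[!status]

# Print the install command instead of running it, so nothing is
# downloaded without the reader's consent
if (length(missing_pkgs) > 0) {
  cat("install.packages(c(",
      paste0('"', missing_pkgs, '"', collapse = ", "), "))\n", sep = "")
} else {
  cat("All listed packages are installed.\n")
}
```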