5  R Packages

5.1 Modern Package Equivalents

This book includes examples accumulated over multiple years. Some packages are kept for pedagogical continuity, but modern projects should generally prefer the following ecosystem updates:

Legacy package / pattern | Recommended modern equivalent | Notes
caret | tidymodels (parsnip, recipes, workflows, tune, yardstick) | Main modeling framework used in updated chapters
FSelector | FSelectorRcpp, Boruta, or recipes + embed | FSelector is aging; modern feature selection is usually recipe-based or model-based
randomForest (standalone workflows) | ranger via parsnip | Faster engine and native integration with tuning/resampling
RMySQL | DBI + RMariaDB | DBI-compliant connector and actively maintained backend
foreign::read.arff | farff::readARFF or RWeka::read.arff | Better maintained import options for ARFF data
party | partykit | Modern tree infrastructure
reshape | tidyr | pivot_longer() / pivot_wider() are the current standard
plyr | dplyr / purrr | Superseded by tidyverse verbs and functional workflows
preProcess pipelines | recipes steps | Better integration with resampling and workflows
ad-hoc evaluation helpers | yardstick | Standardized metrics API
ROCR / manual ROC plotting | yardstick + pROC + ggplot2 | Cleaner ROC/PR workflows with tidy outputs
e1071::naiveBayes only | naivebayes or discrim in tidymodels | More consistent modern modeling interfaces
DMwR2::SelfTrain-style workflows | pseudo-labeling with tidymodels | Flexible SSL templates in current chapters
tm text pipelines | tidytext (+ quanteda for corpus workflows) | More modern tokenization and text-analysis ecosystem

Packages still used in some examples (foreign, FSelector, tm, vioplot, UsingR) remain for compatibility with historical datasets and classroom material. They can be progressively replaced as equivalent examples are added.
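As one illustration of the table, the reshape-to-tidyr row can be sketched as follows. The data frame and its column names are made up for this example, and the code only runs the tidyr call when that package is installed.

```r
# Made-up wide table: one size column per year (illustrative data only)
wide <- data.frame(
  project   = c("p1", "p2"),
  size_2020 = c(10, 20),
  size_2021 = c(12, 25)
)

if (requireNamespace("tidyr", quietly = TRUE)) {
  # One row per project and year instead of one column per year
  long <- tidyr::pivot_longer(
    wide,
    cols         = -project,   # every column except the identifier
    names_to     = "year",
    names_prefix = "size_",    # strip the prefix from the new 'year' values
    values_to    = "size"
  )
  print(long)
}
```

pivot_wider() performs the inverse operation, which is why the pair replaces most uses of the older reshape functions.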

We are going to create a minimal package with RStudio/Positron/VS Code. The package structure will be uploaded to GitHub in order to track all changes made to the package. Other users will be able to install your package from GitHub.

5.1.1 Create a repository in your GitHub account

The repository will contain your package. Example: hellopackage, basicpackage2, etc. Tick the box “Add README.md”: this will create the main branch in GitHub. Accordingly, we will use the main branch in our project for the first commits (and not the master branch).

5.1.2 Create your Package-Project in RStudio, allowing Git to track changes

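After creating the project as a new R package with Git enabled, open a shell in the project directory and link it to the GitHub repository. Use main (and not master) in order to conform to the current GitHub default:

git remote add origin <https://github.com/yourrepo......./yourpackage----.git>
git pull origin main
git push -u origin main

Now, we should see our changes in the GitHub repository.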
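The same three commands can also be scripted from R. The following is only a sketch: link_to_github() is a hypothetical helper, the URL is a placeholder, and by default it prints the commands instead of running them.

```r
# Hypothetical helper: print (or run) the git commands used in this section.
link_to_github <- function(remote_url, branch = "main", dry_run = TRUE) {
  steps <- list(
    c("remote", "add", "origin", remote_url),
    c("pull", "origin", branch),
    c("push", "-u", "origin", branch)
  )
  for (args in steps) {
    if (dry_run) {
      message("git ", paste(args, collapse = " "))
    } else {
      # system2() returns the exit status of the command
      status <- system2("git", args)
      if (status != 0) stop("git ", args[1], " failed")
    }
  }
  # Return the argument strings so the sequence can be inspected
  invisible(vapply(steps, paste, character(1), collapse = " "))
}

# Placeholder URL; replace it with the one copied from your repository
cmds <- link_to_github("https://github.com/youruser/yourpackage.git")
```

Calling it with dry_run = FALSE from the project directory would execute the commands, assuming git is on the PATH.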

5.1.3 Continue creating the package with RStudio

Switch to the Build tab and click Check. Usually we get warnings about the license, the documentation and the NAMESPACE file.

  • Go to Tools > Project Options > Build Tools, tick the box “Generate documentation with Roxygen” and do not change the defaults.

  • In the Build panel, select More > Clean and Rebuild.

  • Check.

  • Delete the NAMESPACE file created with the project. roxygen2 does not overwrite a NAMESPACE file that it did not generate, so the file has to be removed once; it is then regenerated whenever the package is documented.

  • Go to the DESCRIPTION file and change the License field to GPL-3 (or another license of your choice). Save the file.

  • Documentation warnings. We may include comments and other information for our functions using “#'” in the .R files. We should include @export for exporting the functions and @param for describing the parameters.

Example: code your own function, such as the funhist2df function below that plots a histogram and boxplot and returns a summary data frame:

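#' Creates some plots and a numeric summary
#'
#' @param x numeric vector
#' @return a one-row data frame with the min, median, mean and max of x
#' @importFrom graphics par hist boxplot
#' @importFrom grDevices rainbow
#' @importFrom stats median
#' @export
funhist2df <- function(x) {
  # draw both plots side by side and restore the previous layout on exit
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par), add = TRUE)
  hist(x, col = rainbow(30))
  boxplot(x, col = "green")
  data.frame(min = min(x), median = median(x), mean = mean(x), max = max(x))
}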
  • go to More > Document (or hit Ctrl+Shift+D)

  • More > Clean and Rebuild

  • Check

  • Most probably, there is an error in the file hello.R. We must document and export that function, too. Include roxygen lines such as:

#'
#' @export
  • We should see 0 errors | 0 warnings | 0 notes

  • Save all changes.
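Before running Check, the function can be tried on a small vector. The snippet below is a minimal sketch: it defines a stand-in copy of funhist2df() so that it runs outside the package, sends the plots to a null device, and uses made-up numbers.

```r
# Stand-in copy of funhist2df() so the snippet runs without the package
funhist2df <- function(x) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par), add = TRUE)
  hist(x, col = rainbow(30))
  boxplot(x, col = "green")
  data.frame(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

pdf(NULL)                        # discard the plots; we only inspect the summary
x <- c(2, 4, 4, 5, 7, 9, 12)     # made-up values
res <- funhist2df(x)
invisible(dev.off())

str(res)                         # one row with min, median, mean and max
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped so that the histogram and boxplot appear in the Plots pane.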

5.1.4 Commit and push all your changes to GitHub

In order to install your new package and to see your changes, close the Project without saving the data and restart RStudio with a clean environment.

5.1.4.1 To install the new package from GitHub

Load devtools and install from GitHub:

library(devtools)
devtools::install_github("yourusername/the_name_of_the_repo_containing_your_package")
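After the installation it is worth confirming that the package is available and that the functions tagged with @export are visible. The sketch below uses a hypothetical helper, describe_package(), and the base package stats as a stand-in for your own package name.

```r
# Hypothetical helper: is the package installed, and what does it export?
describe_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(list(installed = FALSE, version = NA_character_, exports = character(0)))
  }
  list(
    installed = TRUE,
    version   = as.character(utils::packageVersion(pkg)),
    exports   = sort(getNamespaceExports(pkg))
  )
}

# 'stats' stands in for the name of the package you just installed
info <- describe_package("stats")
info$installed
head(info$exports)
```

If a function is missing from the exports, the usual cause is a missing @export tag or a NAMESPACE file that was not regenerated.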

5.1.5 Adding a vignette with data analysis

A vignette is a standard form of writing long and detailed documentation for a package. That includes any type of report. Type in the console (change names as you wish):

usethis::use_vignette(name = "vignette1", title = "My analysis of the data")

You will see that a file vignette1.Rmd has been created. You can place any set of R chunks there. Each package that the vignette uses must be declared in the DESCRIPTION file. This can be done automatically with

usethis::use_package("...whatever..package...")
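As a rough picture of what this declaration amounts to, the sketch below edits a throwaway DESCRIPTION file with base R. It assumes the dependency goes into the Imports field (the default type in usethis); the package name, version and the ggplot2 dependency are placeholders, and the real function also handles ordering and existing entries.

```r
# Write a minimal, made-up DESCRIPTION file in a temporary directory
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(
  c("Package: hellopackage",
    "Version: 0.1.0",
    "Title: A Minimal Example"),
  desc_path
)

# DESCRIPTION uses the Debian control file format, which read.dcf() parses
desc <- read.dcf(desc_path)

# Declare a dependency by adding an Imports field (placeholder package name)
desc <- cbind(desc, Imports = "ggplot2")
write.dcf(desc, file = desc_path)

cat(readLines(desc_path), sep = "\n")
```

R CMD check compares the packages used in the code and vignettes with these fields, which is why undeclared packages produce warnings.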

The content of the “vignette1.Rmd” will usually contain several chunks of R code:

library(thepackagethatyouarecreating)
# load the data that you have created
data(thedatasetthatyouhavecreatedinthispackage)
# other data manipulation as examples 
# do whatever with the functions and datasets that you have created
summary(thedatasetthatyou....)

You may now Install and Restart. This builds a package that can be shared.

The vignette files must be tracked on GitHub. If you do not want to include vignettes, you may include any number of R Markdown documents instead. See the next section.

5.1.6 Adding RMarkdown documents

We can add any number of R Markdown files to our package. Usually we will put them in a new rmd/ folder inside the inst/ folder. This folder must be tracked on GitHub.

Now you can use your functions by typing yourpackage::yourfunction1().
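Files placed under inst/ are copied to the top level of the installed package, so a document stored in inst/rmd/ is found under rmd/ after installation. A minimal sketch of locating such a file follows; find_pkg_file() is a hypothetical helper, and the base package stats with its DESCRIPTION file stands in for your package and its .Rmd file.

```r
# Hypothetical helper: locate a file shipped with an installed package
find_pkg_file <- function(..., package) {
  path <- system.file(..., package = package)
  # system.file() returns "" when the file does not exist
  if (!nzchar(path)) {
    stop("File not found in package '", package, "'")
  }
  path
}

# Stand-in example with a file that every installed package has
find_pkg_file("DESCRIPTION", package = "stats")

# For your own package the call would look like this (names are placeholders):
# rmd_path <- find_pkg_file("rmd", "report.Rmd", package = "yourpackage")
# rmarkdown::render(rmd_path, output_dir = tempdir())
```

The explicit error is a design choice: a silent empty string is easy to pass on to read or render functions, where the failure message is less clear.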

5.2 A good use of a package: to export and to share data

We may add a dataset to a package so that it can be used when the package is installed. Or we can create a package that contains only data to be shared. An example of the second use is the well-known R package gapminder. Another recent example of a data package is hagr (remotes::install_github("datawookie/hagr")).

We focus now on the second aspect and create a package that contains only data (although we may also add some reports in the form of documentation).

  • RStudio: create a New Project > New Directory > newname. For the sake of example we create the new project with the name “datapackaplusb” (we intend to combine two simple datasets into one single file).

  • Create a repository in GitHub with the same name as the project, for the sake of clarity. Add some comments to the README.md. You will see the “main” branch created for the repo. Click the Code button and copy the URL that starts with https://github

  • Go to RStudio > More > Configure Build Tools > Git/SVN, select Git, say yes to create a Git repository and restart RStudio. Now you have your local repository created (most probably in the “master” branch).

  • Open the shell (Git > More > Shell) and paste the URL that you copied from GitHub into the command git remote add origin <https://github.com-----------.git>

  • Type git pull origin main. With this first pull your local directory contains all changes done in “main” in GitHub. You should now see the “main” branch in RStudio.

  • Important: go to the Git tab and switch from the “master” branch to the “main” branch so that both local and GitHub are now in the same “main” branch.

  • In the Git tab select the files and directories that you want to commit and push to GitHub (first commit and then push, or run git push -u origin main). Now you should see your changes in the GitHub repository.

5.2.1 Create folder for the original files to be processed

  1. Usually external files are placed in the directory inst/extdata, so we place the data there. Create those folders, then copy the files albrecht.csv and bailey.csv (available in datasets/effortEstimation) to inst/extdata.
  2. Perform a first check of the package (click the Check button or run devtools::check()). It warns about the license: rewrite it to, for instance, GPL-3.
  3. Usually, the external files are not uploaded to GitHub, especially if they are large. Additionally, we may leave directories out of the built package by listing them in .Rbuildignore:
^data-raw$
^inst/extdata$
  4. Clean and rebuild the package. This installs the package we are creating in our environment, so that everything is available for use.
  5. For the sake of example, the files used below are the albrecht.csv and bailey.csv that we copied to inst/extdata in step 1.
  6. We can retrieve the actual path to those extdata files with
system.file("extdata", "albrecht.csv", package = "datapackaplusb")

and then read them with read.csv() or read.table().
  7. Process those external files into a usable data frame. We create a script in a new data-raw folder with usethis::use_data_raw(name = "dfaandb") (give it the name that you like).
  8. Do whatever you wish with the data. In this case we simply combine two datasets. Use the following script as your data-raw/dfaandb.R:

# paths are relative to the package root, so run the script from the project directory
first_file_path  <- file.path("inst", "extdata", "albrecht.csv")
second_file_path <- file.path("inst", "extdata", "bailey.csv")

data_one <- read.csv(first_file_path,  stringsAsFactors = FALSE, encoding = "UTF-8")
data_two <- read.csv(second_file_path, stringsAsFactors = FALSE, encoding = "UTF-8")

data_one$source <- "A"
data_two$source <- "B"
dfaandb <- rbind(data_one, data_two)

The final command is:

# save the dfaandb dataframe as an .rda file in datapackaplusb/data/
usethis::use_data(dfaandb, overwrite = TRUE)

This creates the data/ folder with the data frame stored as an .rda file.
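The whole combine-and-save step can be tried outside a package with base R only. The sketch below uses two tiny made-up CSV files in a temporary directory instead of the real datasets, and plain save() as a stand-in for what usethis::use_data() does when it writes the .rda file into data/.

```r
# Work in a temporary directory so nothing in the project is touched
work_dir <- file.path(tempdir(), "datapkg-sketch")
dir.create(work_dir, showWarnings = FALSE)

# Two small made-up files standing in for the real CSV files
file_a <- file.path(work_dir, "a.csv")
file_b <- file.path(work_dir, "b.csv")
write.csv(data.frame(size = c(100, 250), effort = c(12, 30)),
          file_a, row.names = FALSE)
write.csv(data.frame(size = c(80, 400, 150), effort = c(9, 55, 20)),
          file_b, row.names = FALSE)

data_one <- read.csv(file_a, stringsAsFactors = FALSE)
data_two <- read.csv(file_b, stringsAsFactors = FALSE)

# rbind() on data frames needs the same column names in both inputs
stopifnot(identical(names(data_one), names(data_two)))

data_one$source <- "A"
data_two$source <- "B"
combined <- rbind(data_one, data_two)

# Store the data frame as .rda and load it back into a separate environment
rda_path <- file.path(work_dir, "combined.rda")
save(combined, file = rda_path)
env <- new.env()
load(rda_path, envir = env)
str(env$combined)
```

The check on the column names is worth keeping in a real data-raw script: if the two source files use different headers, rbind() stops with an error, and the names have to be harmonised first.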

  9. Clean and Rebuild the package.

  10. The data can be accessed in the environment with data("dfaandb", package = "datapackaplusb").

  11. Document the data. Go to Build > More > Configure Build Tools and tick “Generate documentation with Roxygen”.

  12. Create the empty file data.R in the R/ folder, then run More > Document or devtools::document().

  13. Important: delete the NAMESPACE file and repeat devtools::document() so that the NAMESPACE is regenerated.

  14. Add the following content (change as appropriate) to the file data.R in the R/ folder:

#' Data of effort and size for several projects
#'
#' No missing values
#' A dataset containing -----  whatever you put here.
#'
#' @title DATASET OF ALBRECHT AND BAILEY
#' @format A data frame with 42 rows and three variables:
#' \describe{
#' \item{effort}{Effort measured in -------.}
#' \item{size}{Size measured in ....}
#' \item{source}{A or B indicating one source or another.}
#' }
#' @source \url{https://....domain.com... }
"dfaandb"
  15. When the package is documented again, Roxygen transforms the code above into a dfaandb.Rd file and adds it to the man/ folder. We can view this documentation in the help pane by typing ?dfaandb in the R console.

  16. Check. (You may delete the file hello.R, or add the following lines to it if you want to keep that function.)

#' The most used program :-) Greeting.
#'
#' @description Hello to the world. The function takes no parameters.
#' @export
#'
#' @examples
#' hello()
  17. Commit and push all your changes to GitHub. You may ignore data-raw/ and inst/extdata.

  18. Close the project, restart RStudio, and install the package from GitHub with devtools::install_github("yourrepo/datapackaplusb"), then load it with library(datapackaplusb).

  19. Type data("dfaandb", package = "datapackaplusb").
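The @format and \item lines in R/data.R have to be kept in sync with the data by hand. As a convenience, a skeleton can be generated from the data frame itself. The helper below, roxygen_data_skeleton(), is hypothetical and not part of roxygen2 or usethis; the built-in iris data set stands in for dfaandb.

```r
# Hypothetical helper: draft the roxygen lines for a data set from the data frame
roxygen_data_skeleton <- function(df, name) {
  # first class of each column, e.g. "numeric" or "factor"
  col_classes <- vapply(df, function(col) class(col)[1], character(1))
  c(
    sprintf("#' @format A data frame with %d rows and %d variables:",
            nrow(df), ncol(df)),
    "#' \\describe{",
    sprintf("#' \\item{%s}{%s. Describe the variable here.}",
            names(df), col_classes),
    "#' }",
    sprintf("\"%s\"", name)
  )
}

# iris stands in for the data set of the package
skeleton <- roxygen_data_skeleton(iris, "iris")
writeLines(skeleton)
```

The output is only a starting point: the title, the description of each variable, the units and the @source still have to be written by the author.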

Additional steps. Vignette: if you wish, you can create a vignette (an .Rmd file) with a report obtained from the data:

usethis::use_vignette(name = "effort_eda", title = "Basic EDA of the Effort data")

The directory vignettes/ is created and you may complete the .Rmd file.

You may have to install the system tool qpdf to avoid the check warning about the size of the PDF documents. On Debian or Ubuntu:

sudo apt-get install -y qpdf
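Whether the tool is already visible to R can be checked from the console before running Check. This is a small sketch using only base R; it looks for an executable called qpdf on the PATH.

```r
# Sys.which() returns "" when the executable is not found on the PATH
has_qpdf <- unname(nzchar(Sys.which("qpdf")))

if (has_qpdf) {
  message("qpdf found at: ", Sys.which("qpdf"))
} else {
  message("qpdf not found; the check may warn about PDF size reduction.")
}
```

On Windows or macOS the installation command differs, but the check above is the same.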

Working with R Markdown. We may also add R Markdown documents to our package: create a subfolder rmd/ in the inst/ folder.

5.4 Further Information