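#' Plot a histogram and a boxplot, and return a numeric summary
#'
#' @param x A numeric vector.
#' @return A one-row data frame with the minimum, median, mean and maximum of x.
#' @export
funhist2df <- function(x) {
  # Show both plots side by side and restore the previous layout on exit
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par), add = TRUE)
  hist(x, col = rainbow(30))
  boxplot(x, col = "green")
  data.frame(min = min(x), median = median(x), mean = mean(x), max = max(x))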
}
5 R Packages
5.1 Modern Package Equivalents
This book includes examples accumulated over multiple years. Some packages are kept for pedagogical continuity, but modern projects should generally prefer the following ecosystem updates:
| Legacy package / pattern | Recommended modern equivalent | Notes |
|---|---|---|
| caret | tidymodels (parsnip, recipes, workflows, tune, yardstick) | Main modeling framework used in updated chapters |
| FSelector | FSelectorRcpp, Boruta, or recipes + embed | FSelector is aging; modern feature selection is usually recipe-based or model-based |
| randomForest (standalone workflows) | ranger via parsnip | Faster engine and native integration with tuning/resampling |
| RMySQL | DBI + RMariaDB | DBI-compliant connector and actively maintained backend |
| foreign::read.arff | farff::readARFF or RWeka::read.arff | Better maintained import options for ARFF data |
| party | partykit | Modern tree infrastructure |
| reshape | tidyr | pivot_longer() / pivot_wider() are current standard |
| plyr | dplyr / purrr | Superseded by tidyverse verbs and functional workflows |
| preProcess pipelines | recipes steps | Better integration with resampling and workflows |
| ad-hoc evaluation helpers | yardstick | Standardized metrics API |
| ROCR / manual ROC plotting | yardstick + pROC + ggplot2 | Cleaner ROC/PR workflows with tidy outputs |
| e1071::naiveBayes only | naivebayes or discrim in tidymodels | More consistent modern modeling interfaces |
| DMwR2::SelfTrain-style workflows | pseudo-labeling with tidymodels | Flexible SSL templates in current chapters |
| tm text pipelines | tidytext (+ quanteda for corpus workflows) | More modern tokenization and text-analysis ecosystem |
Packages still used in some examples (foreign, FSelector, tm, vioplot, UsingR) remain for compatibility with historical datasets and classroom material. They can be progressively replaced as equivalent examples are added.
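As one illustration of the table above, the following sketch reshapes a small data frame with tidyr's pivot_longer() and pivot_wider(), the functions listed as the replacement for reshape. The data frame and its column names (id, q1, q2, quarter, sales) are invented for this example, and the reshaping code only runs if tidyr is installed.

```r
# Hypothetical wide data: one row per id, one column per quarter
wide <- data.frame(id = c("a", "b"), q1 = c(10, 20), q2 = c(30, 40))

if (requireNamespace("tidyr", quietly = TRUE)) {
  # Wide -> long: one row per (id, quarter) pair
  long <- tidyr::pivot_longer(wide, cols = c(q1, q2),
                              names_to = "quarter", values_to = "sales")

  # Long -> wide: back to one column per quarter
  wide_again <- tidyr::pivot_wider(long, names_from = quarter,
                                   values_from = sales)

  print(long)
  print(wide_again)
}
```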
We are going to create a minimal package with RStudio, Positron or VS Code. The package structure will be uploaded to GitHub so that all changes made to the package are tracked. Other users will then be able to install your package from GitHub.
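To make "package structure" concrete, here is a minimal sketch that builds a bare skeleton by hand in a temporary directory. In practice the IDE generates these files for you. The name hellopackage is one of the example names used in this chapter, and all DESCRIPTION field values are placeholders. NAMESPACE and man/ are left out on purpose, since they are generated later from the roxygen comments.

```r
# Build a throwaway package skeleton in a temporary directory
pkg_dir <- file.path(tempdir(), "hellopackage")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION holds the package metadata (placeholder values)
writeLines(c(
  "Package: hellopackage",
  "Title: A Minimal Example Package",
  "Version: 0.1.0",
  "Description: Illustrates the minimal structure of an R package.",
  "License: GPL-3"
), file.path(pkg_dir, "DESCRIPTION"))

# R/ contains the source code of the package functions
writeLines(c(
  "hello <- function() {",
  "  print(\"Hello, world!\")",
  "}"
), file.path(pkg_dir, "R", "hello.R"))

# List what was created
print(list.files(pkg_dir, recursive = TRUE))
```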
5.1.1 Create a repository in your GitHub account
The repository will contain your package. Example names: hellopackage, basicpackage2, etc. Tick the box “Add README.md”; this creates the main branch on GitHub. Accordingly, we will use the main branch (and not the master branch) for the first commits in our project.
5.1.2 Create your Package-Project in RStudio, allowing Git to track changes
git remote add origin <https://github.com/yourrepo......./yourpackage----.git>
Use main (and not master) in order to conform to the current GitHub standard.
git pull origin main
git push -u origin main
Now, we should see our changes in the GitHub repository.
5.1.3 Continue creating the package with RStudio
Switch to the Build tab and click Check. Usually we get warnings about the license, the documentation and the NAMESPACE file.
Go to Tools > Project Options > Build Tools, tick the box “Generate documentation with Roxygen” and do not change the defaults.
In the Build panel, select More > Clean and Rebuild.
Check
Delete the NAMESPACE file, because it is created automatically when the package is built.
Go to the DESCRIPTION file and change the License to GPL-3 (or whichever license you prefer). Save the file.
Documentation warnings. We may include comments and other information for our functions using “#'” in the .R files. We should include
@export for exporting the functions and @param for describing the parameters.
Example: code your own function, such as funhist2df, which plots a histogram and a boxplot and returns a summary data frame.
Go to More > Document (or press Ctrl+Shift+D).
More > Clean and Rebuild
Check
Most probably, there is an error in the file hello.R. We must document and export its function, too. Include roxygen lines such as:
#'
#' @export
After running Check again, we should see 0 errors | 0 warnings | 0 notes.
Save all changes.
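The roxygen fix for hello.R described above might look like the following sketch. It assumes that hello.R still contains the default hello() function that RStudio places in a new package. The title line and the @return text are illustrative wording only.

```r
#' Print a greeting
#'
#' @return The greeting string, invisibly.
#' @export
hello <- function() {
  # print() returns its argument invisibly, so the string is also the result
  print("Hello, world!")
}

hello()
```

After running More > Document, the regenerated NAMESPACE file should contain an export(hello) line, together with one export() line for every other function tagged with @export.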
5.1.4 Commit and push all your changes to GitHub
In order to install your new package and to see your changes, close the Project without saving the data and restart RStudio with a clean environment.
5.1.4.1 To install the new package from GitHub
Load devtools and install from GitHub:
library(devtools)
devtools::install_github("yourusername/the_name_of_the_repo_containing_your_package")
5.1.5 Adding a vignette with data analysis
A vignette is a standard form of writing long and detailed documentation for a package. That includes any type of report. Type in the console (change names as you wish):
usethis::use_vignette(name = "vignette1", title = "My analysis of the data")
You will see that a file vignette1.Rmd has been created. You can place any set of R chunks there. Each package that you use in the vignette must be declared in the DESCRIPTION file. This can be done automatically with
usethis::use_package("...whatever..package...")
The file vignette1.Rmd will usually contain several chunks of R code:
library(thepackagethatyouarecreating)
# load the data that you have created
data(thedatasetthatyouhavecreatedinthispackage)
# other data manipulation as examples
# do whatever with the functions and datasets that you have created
summary(thedatasetthatyou....)
You may now Install and Restart. It will create a package that can be shared.
Those files must be tracked on GitHub. If you do not want to include vignettes, you may include any number of R Markdown documents instead. See the next section.
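For orientation, a vignette file typically starts with a YAML header similar to the one written by the sketch below. The title matches the use_vignette() call above. The header is written to a temporary file here so that nothing in a real package is touched, and the exact lines in your own file may differ depending on your usethis version.

```r
# Typical YAML header at the top of a vignette .Rmd file (illustrative)
header <- c(
  "---",
  "title: \"My analysis of the data\"",
  "output: rmarkdown::html_vignette",
  "vignette: >",
  "  %\\VignetteIndexEntry{My analysis of the data}",
  "  %\\VignetteEngine{knitr::rmarkdown}",
  "  %\\VignetteEncoding{UTF-8}",
  "---"
)

# Write the header to a temporary file and show it
vignette_file <- file.path(tempdir(), "vignette1.Rmd")
writeLines(header, vignette_file)
cat(readLines(vignette_file), sep = "\n")
```

The VignetteIndexEntry line holds the title shown in the list of vignettes of the installed package, so it is usually kept identical to the title field.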
5.1.6 Adding RMarkdown documents
We can add any number of R Markdown files to our package. Usually we will put them in a new rmd/ folder inside the inst/ folder. This folder must be tracked on GitHub.
Now you can use your functions by typing yourpackage::yourfunction1().
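Files placed under inst/ are copied to the top level of the installed package, so an R Markdown file stored in inst/rmd/ can be located with system.file(). The following is a minimal sketch. The helper name find_package_rmd and the file name analysis.Rmd are hypothetical, and the stats package is used only because it is always installed.

```r
# Hypothetical helper: return the installed path of an R Markdown file
# shipped in inst/rmd/ of a package, or "" if the file is not found
find_package_rmd <- function(file, package) {
  system.file("rmd", file, package = package)
}

# stats ships no rmd/ folder, so nothing is found here
path <- find_package_rmd("analysis.Rmd", package = "stats")
print(nzchar(path))
```

For your own package, a call such as find_package_rmd("analysis.Rmd", package = "yourpackage") should return the full path of the file, which can then be opened or passed to rmarkdown::render().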
5.3 Popular R Packages for Data Mining and Machine Learning
The tables below give a curated overview of popular packages used throughout this course and in the broader R data science ecosystem.
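Since every package in these tables is listed as available from CRAN, a small helper can report which ones are still missing before installing them. This is a minimal sketch. The function name missing_packages is hypothetical, and the vector of package names is just a sample taken from the tables.

```r
# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# A few names taken from the tables in this section
wanted <- c("tidymodels", "ranger", "glmnet", "tidytext", "igraph")
to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to install from CRAN (needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```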
5.3.1 Modern ML Ecosystems
| Package | Purpose | Install |
|---|---|---|
| tidymodels | Unified tidy ML framework (replaces caret) | CRAN |
| mlr3 | Modular ML framework with pipelines | CRAN |
| caret | Classic unified interface (legacy) | CRAN |
5.3.2 Classification and Regression
| Package | Purpose | Install |
|---|---|---|
| randomForest | Random forest | CRAN |
| ranger | Fast random forest | CRAN |
| xgboost | Gradient boosting | CRAN |
| lightgbm | Fast gradient boosting | CRAN |
| e1071 | SVM, naive Bayes | CRAN |
| glmnet | Regularised regression (ridge, lasso) | CRAN |
| rpart | Decision trees | CRAN |
5.3.3 Unsupervised Learning
| Package | Purpose | Install |
|---|---|---|
| cluster | PAM, CLARA, hierarchical clustering (agnes, diana) | CRAN |
| dbscan | Density-based clustering | CRAN |
| factoextra | PCA and clustering visualisation | CRAN |
| arules | Association rule mining | CRAN |
| arulesViz | Association rule visualisation | CRAN |
5.3.4 Feature Engineering and Selection
| Package | Purpose | Install |
|---|---|---|
| recipes | Pre-processing pipelines (tidymodels) | CRAN |
| Boruta | Feature selection via random forest | CRAN |
| FSelectorRcpp | Information-gain-based selection | CRAN |
| corrplot | Correlation matrix visualisation | CRAN |
5.3.5 Text Mining and NLP
| Package | Purpose | Install |
|---|---|---|
| tidytext | Tidy text analysis | CRAN |
| text2vec | Fast text vectorisation, topic models | CRAN |
| quanteda | Comprehensive NLP framework | CRAN |
| tm | Classic text mining (legacy) | CRAN |
5.3.6 Network Analysis
| Package | Purpose | Install |
|---|---|---|
| igraph | General graph analysis | CRAN |
| ggraph | Network visualisation (ggplot2-based) | CRAN |
| tidygraph | Tidy graph manipulation | CRAN |