The goal of NumericEnsembles is to automatically conduct a thorough analysis of numeric data. The user only needs to provide the data and answer a few questions (such as which column to analyze). NumericEnsembles fits 23 individual models to the training data, then makes predictions and tracks accuracy for each one. It also builds 17 ensembles from the predictions of the 23 individual models, fits each ensemble to the training data, and likewise makes predictions and tracks accuracy for each ensemble. The package automatically returns 26 plots (such as train vs. holdout for the best model), 6 tables (such as the head of the data), and a grand summary table sorted by accuracy, with the best model at the top of the report.
Installation
You can install the development version of NumericEnsembles like so:
devtools::install_github("InfiniteCuriosity/NumericEnsembles")
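Note that this assumes the devtools package is already available; if it is not, install it from CRAN first:

```r
# Install devtools from CRAN if it is not already available
install.packages("devtools")

# Then install the development version of NumericEnsembles from GitHub
devtools::install_github("InfiniteCuriosity/NumericEnsembles")
```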
Example
NumericEnsembles will automatically build 40 models to predict the median home value (medv, column 14) in the Boston housing data set.
library(NumericEnsembles)
Numeric(data = MASS::Boston,
        colnum = 14,
        numresamples = 25,
        how_to_handle_strings = 0,
        do_you_have_new_data = "N",
        save_all_trained_models = "N",
        remove_ensemble_correlations_greater_than = 1.00,
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20
)
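As a quick sanity check before running the analysis, you can confirm that column 14 of MASS::Boston is medv, the column passed above as colnum = 14 (this snippet is illustrative and not part of the package):

```r
library(MASS)

# Column 14 of the Boston data is medv, the response passed as colnum = 14
names(Boston)[14]  # "medv"

# train_amount, test_amount and validation_amount should sum to 1
```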
The 40 models are:
- Bagged Random Forest
- Bagging
- BayesGLM
- BayesRNN
- BoostRF (Random Forest)
- Cubist
- Earth
- Elastic
- Ensemble Bagged Random Forest
- Ensemble Bagging
- Ensemble BayesGLM
- Ensemble BayesRNN
- Ensemble BoostRF (Random Forest)
- Ensemble Cubist
- Ensemble Earth
- Ensemble Elastic
- Ensemble Gradient Boosted
- Ensemble K-Nearest Neighbors
- Ensemble Lasso
- Ensemble Linear
- EnsembleRF (Random Forest)
- Ensemble Ridge
- Ensemble RPart
- EnsembleSVM (Support Vector Machines)
- Ensemble Trees
- Ensemble XGBoost
- GAM (Generalized Additive Models)
- Gradient Boosted
- KNN (K-Nearest Neighbors) (tuned)
- Lasso
- Linear (tuned)
- Neuralnet
- PCR (Principal Components Regression)
- PLS (Partial Least Squares)
- RF (Random Forest)
- Ridge
- RPart
- SVM (Support Vector Machines)
- Tree
- XGBoost
The 26 plots created automatically are:
- SSE by model and resample
- MAE by model and resample
- MSE by model and resample
- Bias by model and resample
- Mean SSE barchart
- Mean MAE barchart
- Mean MSE barchart
- Mean bias barchart
- Over or underfitting barchart
- Duration barchart
- Train vs holdout by model and resample
- Model accuracy barchart
- y (the target variable) vs the predictor variables
- Boxplots of the numeric data
- Histograms of the numeric data
- Overfitting by model and resample
- Accuracy by model and resample
- Best model Q-Q plot
- Best model histogram of the residuals
- Best model residuals
- Best model predicted vs actual
- Best model four plots at once (Predicted vs actual, residuals, histogram of residuals, Q-Q plot)
- Correlation plot of the numeric data as circles and colors
- Correlation of the numeric data as numbers and colors
- Pairwise scatter plots
The tables created automatically are:
- Correlation of the ensemble
- Head of the ensemble
- Data summary
- Correlation of the data
- RMSE, means, fitting, model summaries of the train, test and validation sets
- Head of the data frame
Example using pre-trained models on totally new data in the NumericEnsembles package
The NumericEnsembles package can also save trained models and use those same pre-trained models to make predictions on totally unseen data.
The package contains two example data sets to demonstrate this. Boston_Housing is the Boston Housing data set with the first five rows removed; we will build our models on that data set. New_Boston is totally new data to those models: the first five rows of the original Boston Housing data set.
library(NumericEnsembles)
Numeric(data = Boston_Housing,
        colnum = 14,
        numresamples = 25,
        how_to_handle_strings = 0,
        do_you_have_new_data = "Y",
        save_all_trained_models = "N",
        remove_ensemble_correlations_greater_than = 1.00,
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20
)
Enter the data set New_Boston when asked “What is the URL of the new data?”
You may use external data to accomplish the same result.
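For example, assuming your own data lives in CSV files with exactly the same columns as the training data, a minimal sketch might look like the following (the file names here are hypothetical):

```r
library(NumericEnsembles)

# Train on your own data set and answer the new-data question with "Y".
# "my_training_data.csv" is a hypothetical file with the same column
# layout as the data you want to predict on later.
Numeric(data = read.csv("my_training_data.csv"),
        colnum = 14,
        numresamples = 25,
        how_to_handle_strings = 0,
        do_you_have_new_data = "Y",
        save_all_trained_models = "N",
        remove_ensemble_correlations_greater_than = 1.00,
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20
)

# When prompted "What is the URL of the new data?", supply the path or URL
# of a CSV with identical columns, e.g. "my_new_data.csv" (hypothetical).
```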