Title: | Robust Explainable Outlier Detection Based on OutlierTree |
---|---|
Description: | Bagged OutlierTrees is an explainable unsupervised outlier detection method based on an ensemble implementation of the existing OutlierTree procedure (Cortes, 2020). This implementation takes advantage of bootstrap aggregating (bagging) to improve robustness by reducing the possible masking effect and subsequent high variance (similarly to Isolation Forest), hence the name "Bagged OutlierTrees". To learn more about the base procedure OutlierTree (Cortes, 2020), please refer to <arXiv:2001.00636>. |
Authors: | Rafael Santos [aut, cre] |
Maintainer: | Rafael Santos <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2025-01-20 03:09:34 UTC |
Source: | https://github.com/rafajpsantos/bagged.outliertrees |
Fit Bagged OutlierTrees ensemble model to normal data with perhaps some outliers.
bagged.outliertrees(
  df,
  ntrees = 100L,
  subsampling_rate = 0.25,
  max_depth = 4L,
  min_gain = 0.01,
  z_norm = 2.67,
  z_outlier = 8,
  pct_outliers = 0.01,
  min_size_numeric = 25L,
  min_size_categ = 50L,
  categ_split = "binarize",
  categ_outliers = "tail",
  numeric_split = "raw",
  cols_ignore = NULL,
  follow_all = FALSE,
  gain_as_pct = TRUE,
  nthreads = parallel::detectCores()
)
df |
Data Frame with normal data that might contain some outliers. See details for allowed column types. |
ntrees |
Controls the ensemble size (i.e. the number of OutlierTrees or bootstrapped training sets). A large value is recommended for a robust and stable ensemble; decrease it if training takes too long. |
subsampling_rate |
Sub-sampling rate used for bootstrapping. A small rate results in smaller bootstrapped training sets, which should not suffer from the masking effect. This parameter should be adjusted to the size of the training data (for example, a smaller rate for large training data and a larger rate for small training data). |
max_depth |
Maximum depth of the trees to grow. Can also pass zero, in which case it will only look for outliers with no conditions (i.e. takes each column as a 1-d distribution and looks for outliers in there independently of the values in other columns). |
min_gain |
Minimum gain that a split has to produce in order to consider it (both in terms of looking for outliers in each branch, and in considering whether to continue branching from them). Note that default value for GritBot is 1e-6, with |
z_norm |
Maximum Z-value (from standard normal distribution) that can be considered as a normal observation. Note that simply having values above this will not automatically flag observations as outliers, nor does it assume that columns follow normal distributions. Also used for categorical and ordinal columns for building approximate confidence intervals of proportions. |
z_outlier |
Minimum Z-value that can be considered as an outlier. For an observation to be considered an outlier, there must be a large gap between its Z-value and that of the next observation in sorted order, given by (z_outlier - z_norm). Decreasing this parameter is likely to result in more observations being flagged as outliers. Ignored for categorical and ordinal columns. |
pct_outliers |
Approximate max percentage of outliers to expect in a given branch. |
min_size_numeric |
Minimum size that branches need to have when splitting a numeric column. In order to look for outliers in a given branch for a numeric column, it must have a minimum of twice this number of observations. |
min_size_categ |
Minimum size that branches need to have when splitting a categorical or ordinal column. In order to look for outliers in a given branch for a categorical, ordinal, or boolean column, it must have a minimum of twice this number of observations. |
categ_split |
How to produce categorical-by-categorical splits. Options are:
|
categ_outliers |
How to look for outliers in categorical variables. Options are:
|
numeric_split |
How to determine the split point in numeric variables. Options are:
This doesn't affect how outliers are determined in the training data passed in |
cols_ignore |
Vector containing columns which will not be split, but will be evaluated for usage in splitting other columns. Can pass either a logical (boolean) vector with the same number of columns as |
follow_all |
Whether to continue branching from each split that meets the size and gain criteria. This will produce exponentially many more branches and, if the depth is large, might take a very long time to finish. It will also produce many more spurious outliers. Not recommended. |
gain_as_pct |
Whether the minimum gain above should be taken in absolute terms, or as a percentage of the standard deviation (for numerical columns) or Shannon entropy (for categorical columns). Taking it in absolute terms will prefer making more splits on columns that have a large variance, while taking it as a percentage might be more restrictive on them and might create deeper trees in some columns. For GritBot this parameter would always be |
nthreads |
Number of parallel threads to use when fitting the model. |
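As a rough guide to what the defaults mean in practice, the following base R sketch turns a few of the documented parameters into concrete numbers for a data set the size of `hypothyroid`. It only restates the arithmetic given in the argument descriptions above. How the package rounds the subsample size internally is an assumption, and the variable names `rows_per_tree`, `z_gap`, `min_obs_numeric`, and `min_obs_categ` are hypothetical.

```r
# Default values copied from the usage signature of bagged.outliertrees()
subsampling_rate <- 0.25
z_norm           <- 2.67
z_outlier        <- 8
min_size_numeric <- 25L
min_size_categ   <- 50L

# Number of rows in the bundled hypothyroid data set
n_rows <- 2772L

# Approximate size of each bootstrapped training set
# (the exact rounding used inside the package is an assumption)
rows_per_tree <- round(subsampling_rate * n_rows)

# Required gap in Z-value before an observation is considered an outlier
z_gap <- z_outlier - z_norm

# A branch needs at least twice the minimum size to be searched for outliers
min_obs_numeric <- 2L * min_size_numeric
min_obs_categ   <- 2L * min_size_categ

print(c(
  rows_per_tree   = rows_per_tree,
  z_gap           = z_gap,
  min_obs_numeric = min_obs_numeric,
  min_obs_categ   = min_obs_categ
))
```

With these defaults, each tree would see roughly a quarter of the rows, which is the reason to revisit `subsampling_rate` together with `min_size_numeric` and `min_size_categ` when the training data is small.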
An object with the fitted model that can be used to detect more outliers in new data.
GritBot software: https://www.rulequest.com/gritbot-info.html
Cortes, David. "Explainable outlier detection through decision tree conditioning." arXiv preprint arXiv:2001.00636 (2020).
predict.bagged.outliertrees print.bagged.outlieroutputs hypothyroid
library(bagged.outliertrees)

### example dataset with interesting outliers
data(hypothyroid)

### fit a Bagged OutlierTrees model
model <- bagged.outliertrees(hypothyroid,
  ntrees = 10,
  subsampling_rate = 0.5,
  z_outlier = 6,
  nthreads = 1
)

### use the fitted model to find outliers in the training dataset
outliers <- predict(model,
  newdata = hypothyroid,
  min_outlier_score = 0.5,
  nthreads = 1
)

### print the top-10 outliers in human-readable format
print(outliers, outliers_print = 10)
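Since `nthreads` defaults to `parallel::detectCores()`, it can be useful to cap the thread count on shared machines. The sketch below is one possible way to do that, using only the functions and arguments documented here. Leaving one core free is a common convention rather than a recommendation of the package, and `n_threads` is a hypothetical helper variable. The model is fitted only if the package is installed.

```r
# detectCores() can return NA on some platforms, so fall back to one thread
cores <- parallel::detectCores()
n_threads <- if (is.na(cores)) 1L else max(1L, cores - 1L)

# Guard the package-specific part so the snippet can be sourced anywhere
if (requireNamespace("bagged.outliertrees", quietly = TRUE)) {
  library(bagged.outliertrees)
  data(hypothyroid)

  # Same small ensemble as in the example above, with the capped thread count
  model <- bagged.outliertrees(hypothyroid,
    ntrees = 10,
    subsampling_rate = 0.5,
    z_outlier = 6,
    nthreads = n_threads
  )

  # Reuse the same thread count when predicting
  outliers <- predict(model,
    newdata = hypothyroid,
    min_outlier_score = 0.5,
    nthreads = n_threads
  )
  print(outliers, outliers_print = 5)
}
```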
Hypothyroid
hypothyroid
An object of class data.frame with 2772 rows and 23 columns.
Predict method for Bagged OutlierTrees
## S3 method for class 'bagged.outliertrees'
predict(
  object,
  newdata,
  min_outlier_score = 0.95,
  nthreads = parallel::detectCores(),
  ...
)
object |
A Bagged OutlierTrees object as returned by |
newdata |
A Data Frame in which to look for outliers according to the fitted model. |
min_outlier_score |
Minimum outlier score to use when finding outliers. |
nthreads |
Number of threads to use when predicting. |
... |
Not used. |
Returns a list of lists with the outliers and their information (each row is an entry in the first list, with the same names as the rows in the input data frame), which can afterwards be printed in a human-readable format with the print function.
bagged.outliertrees print.bagged.outlieroutputs
library(bagged.outliertrees)

### example dataset with interesting outliers
data(hypothyroid)

### fit a Bagged OutlierTrees model
model <- bagged.outliertrees(hypothyroid,
  ntrees = 10,
  subsampling_rate = 0.5,
  z_outlier = 6,
  nthreads = 1
)

### use the fitted model to find outliers in the training dataset
outliers <- predict(model,
  newdata = hypothyroid,
  min_outlier_score = 0.5,
  nthreads = 1
)

### print the top-10 outliers in human-readable format
print(outliers, outliers_print = 10)
Pretty-prints outliers as output by the predict function from a Bagged OutlierTrees model (as generated by function bagged.outliertrees).
## S3 method for class 'bagged.outlieroutputs'
print(x, outliers_print = 15, ...)
x |
Outliers as returned by predict method on an object from |
outliers_print |
Maximum number of outliers to print. |
... |
Not used. |
The same input x that was passed, returned invisibly.
bagged.outliertrees predict.bagged.outliertrees
library(bagged.outliertrees)

### example dataset with interesting outliers
data(hypothyroid)

### fit a Bagged OutlierTrees model
model <- bagged.outliertrees(hypothyroid,
  ntrees = 10,
  subsampling_rate = 0.5,
  z_outlier = 6,
  nthreads = 1
)

### use the fitted model to find outliers in the training dataset
outliers <- predict(model,
  newdata = hypothyroid,
  min_outlier_score = 0.5,
  nthreads = 1
)

### print the top-10 outliers in human-readable format
print(outliers, outliers_print = 10)
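Because the print method returns its input invisibly, the result can be captured and reused after printing. The base R sketch below illustrates that convention with a hypothetical toy class, `toy_outputs`. It is not the package's implementation and makes no assumption about the internal structure of `bagged.outlieroutputs` objects.

```r
# Hypothetical toy print method that mimics the documented signature
print.toy_outputs <- function(x, outliers_print = 15, ...) {
  # Show at most `outliers_print` entries
  n_show <- min(length(x), outliers_print)
  for (i in seq_len(n_show)) {
    cat("row", names(x)[i], ":", x[[i]], "\n")
  }
  # Return the input without auto-printing it again
  invisible(x)
}

# Toy object standing in for the output of predict()
toy <- structure(
  list(a = "suspicious value", b = "looks fine"),
  class = "toy_outputs"
)

# The printed object is still available for further processing
returned <- print(toy, outliers_print = 1)
identical(returned, toy)
```

The same pattern would allow calls such as `print(outliers, outliers_print = 10)` to be used inside a pipeline without losing the object.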