| Title: | Robust Outlier Detection for Diverse Distributions |
|---|---|
| Description: | Provides robust outlier detection techniques for identifying anomalies in multivariate data, with a focus on methods that remain effective under non-Gaussian distributions. For more details see Saluja, Parlak, and Mejia (2026+) <doi:10.48550/arXiv.2505.11806>. |
| Authors: | Amanda Mejia [aut, cre], Damon Pham [ctb] (ORCID: <https://orcid.org/0000-0001-7563-4727>), Saranjeet Singh Saluja [ctb], Fatma Parlak [ctb], Zeshawn Zahid [ctb] |
| Maintainer: | Amanda Mejia <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.3 |
| Built: | 2026-05-09 07:13:39 UTC |
| Source: | https://github.com/mandymejia/rrobot |
Dots parameter documentation
... |
Additional arguments passed to method-specific functions. |
Alpha parameter documentation
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
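As a rough illustration of how a significance level maps to a distance threshold, the sketch below assumes that squared distances follow a chi-square distribution with p degrees of freedom, which holds approximately for clean Gaussian data. The package may rely on a different reference distribution (its method list includes an F approximation, for instance), so `rd_threshold_chisq` is a hypothetical helper and not part of the package.

```r
# Hypothetical helper: chi-square cutoff for squared Mahalanobis-type distances.
# Assumption: squared distances ~ chi-square(p) under multivariate normality.
rd_threshold_chisq <- function(p, alpha = 0.01) {
  stopifnot(p >= 1, alpha > 0, alpha < 1)
  stats::qchisq(1 - alpha, df = p)
}

set.seed(1)
p <- 5
x <- matrix(rnorm(2000 * p), ncol = p)

# Classical (non-robust) squared distances, used here only to show the cutoff in action.
d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
thr <- rd_threshold_chisq(p, alpha = 0.01)
flag_rate <- mean(d2 > thr)  # roughly alpha when the data contain no outliers
```

With alpha = 0.01 the cutoff is the 99th percentile of the assumed reference distribution, so about 1% of clean observations would be expected to exceed it.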
B parameter documentation
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
Binwidth parameter documentation
binwidth |
Histogram bin width (default = 0.1). |
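To make the role of a fixed bin width concrete, here is a small base R sketch that derives histogram breaks from `binwidth`. It uses `hist()` from base graphics purely to stay dependency-free; the package's plot method returns a ggplot object, so this is not how the package draws its histograms. The simulated `values` vector is a stand-in for robust distances.

```r
# Build histogram breaks from a fixed bin width (base R; illustration only).
set.seed(2)
values <- rchisq(500, df = 4)   # stand-in for a vector of distances
binwidth <- 0.1

# Use whole multiples of binwidth so every value falls inside some bin.
k_lo <- floor(min(values) / binwidth)
k_hi <- ceiling(max(values) / binwidth)
breaks <- binwidth * (k_lo:k_hi)

h <- hist(values, breaks = breaks, plot = FALSE)
n_bins <- length(h$counts)
```

A smaller bin width gives more, narrower bins; the total count across bins always equals the number of observations.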
Boot_quant parameter documentation
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
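The interplay of `B` and `boot_quant` can be sketched with a generic percentile bootstrap. The helper below, `boot_quantile_ci`, is hypothetical: it resamples a vector of distances `B` times, recomputes an upper quantile on each resample, and reports a `boot_quant`-level interval. Whether the package constructs its intervals in exactly this way is an assumption.

```r
# Percentile-bootstrap confidence interval for an upper quantile (illustration).
# boot_quantile_ci is a hypothetical helper, not a package function.
boot_quantile_ci <- function(d, quantile = 0.01, B = 1000, boot_quant = 0.95) {
  stopifnot(is.numeric(d), length(d) > 1, B >= 1)
  # Recompute the (1 - quantile) sample quantile on B resamples drawn with replacement.
  boot_thr <- replicate(B, {
    resample <- sample(d, size = length(d), replace = TRUE)
    unname(stats::quantile(resample, probs = 1 - quantile))
  })
  tail_prob <- (1 - boot_quant) / 2
  list(
    estimate = unname(stats::quantile(d, probs = 1 - quantile)),
    ci = unname(stats::quantile(boot_thr, probs = c(tail_prob, 1 - tail_prob)))
  )
}

set.seed(3)
d <- rchisq(300, df = 5)   # stand-in for squared robust distances
res <- boot_quantile_ci(d, quantile = 0.01, B = 200, boot_quant = 0.95)
```

Larger `B` reduces Monte Carlo noise in the interval endpoints at the cost of run time.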
Calculates the robust mean, covariance matrix, and optionally robust distances using either:
"auto" mode: automatically selects the best robust subset using covMcd
"manual" mode: uses provided robust covariance matrix and subset indices
compute_RD(
  x,
  mode = c("auto", "manual"),
  cov_mcd = NULL,
  ind_incld = NULL,
  dist = TRUE
)
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
mode |
Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values. |
cov_mcd |
Optional covariance matrix (p × p); required in "manual" mode. |
ind_incld |
Optional vector of row indices used to compute the robust mean; required in "manual" mode. |
dist |
Logical; if TRUE, compute squared robust Mahalanobis distances for all observations. |
A list with elements:
Vector of row indices used to compute the robust mean and covariance.
Vector of excluded row indices.
Number of included observations.
Robust mean vector (length p).
Robust covariance matrix (p × p).
Matrix square root of the inverse covariance matrix (p × p).
Squared robust distances for all observations (length T), or NULL if dist = FALSE.
The matched function call.
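The quantities listed above can be reproduced in spirit with a few lines of base R. The sketch below mimics a "manual"-style computation from a user-chosen subset of rows; `rd_from_subset` and its return names are hypothetical, and it omits any consistency or small-sample corrections that an MCD routine such as covMcd would normally apply.

```r
# Sketch of "manual"-style robust distances from a user-chosen subset (base R).
# rd_from_subset is a hypothetical helper, not the package's compute_RD().
rd_from_subset <- function(x, ind_incld) {
  x <- as.matrix(x)
  sub <- x[ind_incld, , drop = FALSE]
  mu <- colMeans(sub)          # mean of the included rows
  S <- stats::cov(sub)         # covariance of the included rows
  # Inverse square root of S via its eigendecomposition (S assumed positive definite).
  eig <- eigen(S, symmetric = TRUE)
  S_inv_sqrt <- eig$vectors %*%
    diag(1 / sqrt(eig$values), nrow = length(eig$values)) %*%
    t(eig$vectors)
  # Squared distance = squared norm of the centred, whitened observation.
  z <- sweep(x, 2, mu) %*% S_inv_sqrt
  list(mean = mu, cov = S, inv_sqrt = S_inv_sqrt, RD2 = rowSums(z^2))
}

set.seed(4)
x <- matrix(rnorm(100 * 3), ncol = 3)
x[1:5, ] <- x[1:5, ] + 8                    # plant five shifted rows
out <- rd_from_subset(x, ind_incld = 6:100) # estimate from the clean rows only
```

Because the mean and covariance come only from the included rows, the planted rows receive much larger squared distances than they would under the classical estimates.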
Cov_mcd parameter documentation
cov_mcd |
Optional covariance matrix (p × p); required in "manual" mode. |
Cutoff parameter documentation
cutoff |
A numeric value giving how many MADs away from the median an observation must lie to be flagged as an outlier (default = 4). |
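A MAD-based rule of this kind can be sketched in base R as follows. `flag_by_mad` is a hypothetical helper; in particular, whether the package uses the rescaled MAD returned by `stats::mad()` or the raw median absolute deviation is an assumption here.

```r
# Flag values lying more than `cutoff` MADs from the median (illustration only).
# flag_by_mad is a hypothetical helper.
flag_by_mad <- function(v, cutoff = 4) {
  centre <- stats::median(v, na.rm = TRUE)
  # stats::mad() rescales by 1.4826 by default, so it estimates the SD for normal data.
  spread <- stats::mad(v, na.rm = TRUE)
  which(abs(v - centre) / spread > cutoff)
}

set.seed(5)
v <- rnorm(100)
v[c(10, 50)] <- c(12, -15)   # plant two gross outliers
flagged <- flag_by_mad(v, cutoff = 4)
```

Because both the median and the MAD are resistant to extreme values, the planted points do not inflate the scale estimate that is used to judge them.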
Dist parameter documentation
dist |
Logical; if TRUE, compute squared robust Mahalanobis distances for all observations. |
Imp_data parameter documentation
imp_data |
A numeric matrix (T × p) of single-imputed data. |
Imp_datasets parameter documentation
imp_datasets |
A list of M numeric matrices (T × p); multiply imputed datasets. |
Impute_method parameter documentation
impute_method |
Character string; imputation method for univariate outliers. |
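For orientation, mean imputation of flagged cells might look like the sketch below. `impute_flagged_mean` and the logical `flags` matrix are hypothetical; the package's own imputation options and their exact behaviour are not documented in this entry.

```r
# Mean imputation of flagged cells, column by column (illustration only).
# impute_flagged_mean is a hypothetical helper; `flags` is a logical matrix
# marking the cells treated as univariate outliers.
impute_flagged_mean <- function(x, flags) {
  x <- as.matrix(x)
  stopifnot(identical(dim(x), dim(flags)))
  for (j in seq_len(ncol(x))) {
    keep <- !flags[, j]
    # Replace flagged cells with the mean of the unflagged cells in that column.
    x[flags[, j], j] <- mean(x[keep, j])
  }
  x
}

x <- cbind(a = c(1, 2, 3, 100), b = c(10, -50, 12, 14))
flags <- cbind(a = c(FALSE, FALSE, FALSE, TRUE),
               b = c(FALSE, TRUE, FALSE, FALSE))
x_imp <- impute_flagged_mean(x, flags)
```

Here the value 100 in column `a` is replaced by the mean of 1, 2, and 3, and the value -50 in column `b` by the mean of 10, 12, and 14.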
Ind_incld parameter documentation
ind_incld |
Optional vector of row indices used to compute the robust mean; required in "manual" mode. |
K parameter documentation
k |
Integer; number of perturbation cycles per imputation (default = 10). |
M parameter documentation
M |
Integer; number of multiply imputed datasets (default = 5). |
Threshold_method parameter documentation
method |
Character string; one of "all", "SI", "SI_boot", "MI", "MI_boot", "F", "SHASH". |
Method_univOut parameter documentation
method |
Character string. One of |
Mode parameter documentation
mode |
Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values. |
Creates diagnostic plots for robust distance analysis results.
## S3 method for class 'RD'
plot(x, type = c("histogram", "imputations", "univOut"), method = NULL, ...)
x |
An object of class "RD" from RD() or threshold_RD(). |
type |
Character string specifying plot type: "histogram" (default), "imputations", or "univOut". |
method |
Character string specifying threshold method. Auto-detected if NULL. |
... |
Additional arguments passed to plotting functions. |
A ggplot object.
Quantile parameter documentation
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
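The link between the upper quantile and the expected false positive rate can be checked empirically. In this sketch, `ref` is a simulated stand-in for distances from outlier-free data; how the package obtains its reference distribution is not assumed here.

```r
# Empirical upper-quantile threshold and the false positive rate it implies.
set.seed(6)
ref <- rchisq(10000, df = 4)   # stand-in for distances from clean data
q <- 0.01

thr <- unname(stats::quantile(ref, probs = 1 - q))  # (1 - q) sample quantile
fpr <- mean(ref > thr)                              # share of clean points above it
```

By construction, about a fraction `q` of clean observations exceed the threshold, which is why the quantile can be read as an expected false positive rate.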
RD_obj parameter documentation
RD_obj |
Pre-computed RD_result object from |
RD_org_obj parameter documentation
RD_org_obj |
Output list from |
Detects univariate outliers using an iterative SHASH fitting process with
optional pre-flagging strategies. A SHASH (Sinh-Arcsinh) distribution is
fitted to the data iteratively, each time excluding candidate outliers from
the fit, until the set of flagged observations converges or maxit
is reached.
SHASH_out(
  x,
  thr0 = 2.58,
  thr1 = 2.58,
  thr = 4,
  tail = c("both", "upper", "lower"),
  use_iso = TRUE,
  thr_iso = 0.6,
  maxit = 100,
  weight_init = NULL
)
x |
Numeric vector. May contain |
thr0 |
Positive numeric scalar. Threshold for initial outlier
pre-flagging when |
thr1 |
Positive numeric scalar. Threshold used to classify observations as inliers during iterative convergence (default: 2.58). |
thr |
Positive numeric scalar. Final threshold applied to the converged SHASH-normalised scores to declare outliers in the returned output (default: 4). |
tail |
Character string specifying which tail(s) to check for outliers.
Must be one of
|
use_iso |
Logical. If |
thr_iso |
Numeric scalar in [0, 1]. Isolation forest anomaly score
threshold above which observations are treated as candidate outliers
during pre-screening (default: 0.6). Only used when |
maxit |
Positive integer. Maximum number of fitting iterations before the algorithm stops regardless of convergence (default: 100). |
weight_init |
Optional logical vector of length |
A list of class "SHASH_out" with the following elements:
out_idx: Integer vector. Indices of observations in x that were flagged as outliers at the final threshold thr.
x_norm: Numeric vector. SHASH-normalised scores for every observation (same length as x; NA where x was NA).
SHASH_coef: Named list with elements mu, sigma, nu, and tau: the fitted SHASH parameter estimates from the final iteration (sigma and tau are on the log scale, as returned by gamlssML).
isotree_scores: Numeric vector of isolation forest anomaly scores (same length as x). NA when use_iso = FALSE or weight_init was supplied.
initial_weights: Logical vector. Inlier weights used for the very first fitting iteration (same length as x).
indx_iters: Integer matrix of dimensions length(x) × last_iter. Each column records which observations were flagged as outliers (value 1) during that iteration.
norm_iters: Numeric matrix of dimensions length(x) × last_iter. Each column records the SHASH-normalised scores from that iteration.
last_iter: Integer. The number of iterations completed before convergence or hitting maxit.
converged: Logical. TRUE if the inlier weight vector stabilised before reaching maxit.
params: List. A record of all input parameters, stored for reproducibility.
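The refit-and-flag loop described for this function can be illustrated without any SHASH machinery. The sketch below swaps in a plain normal model (mean and standard deviation of the current inliers) for the SHASH fit, and skips the pre-flagging step entirely, so it shows only the control flow: fit on inliers, score everything, update the inlier set, stop when it no longer changes. `iterative_flag` is a hypothetical helper.

```r
# Iterative refit-and-flag loop, with a normal model standing in for SHASH.
# iterative_flag is a hypothetical helper; the package fits a SHASH distribution instead.
iterative_flag <- function(x, thr1 = 2.58, thr = 4, maxit = 100) {
  inlier <- !is.na(x)            # start by treating every observed value as an inlier
  converged <- FALSE
  for (iter in seq_len(maxit)) {
    # Fit on the current inliers only, then score every observation.
    z <- (x - mean(x[inlier])) / stats::sd(x[inlier])
    new_inlier <- !is.na(z) & abs(z) <= thr1
    if (identical(new_inlier, inlier)) {   # inlier set unchanged: stop
      converged <- TRUE
      break
    }
    inlier <- new_inlier
  }
  # The final, stricter threshold decides what is reported as an outlier.
  list(out_idx = which(abs(z) > thr), x_norm = z,
       last_iter = iter, converged = converged)
}

set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)
x[c(17, 77)] <- x[c(17, 77)] + 20   # plant two gross outliers
fit <- iterative_flag(x)
```

Using a looser threshold (`thr1`) during iteration and a stricter one (`thr`) for the reported outliers keeps borderline points from contaminating the fit without automatically declaring them outliers.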
# --- Example 1: Synthetic data with known injected outliers ---------------
# Using rnorm lets us inject outliers at known positions so we can verify
# the function finds exactly what we planted.
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)

# Shift a handful of observations far into the upper tail
outlier_positions <- c(17, 77, seq(190, 200))
x[outlier_positions] <- x[outlier_positions] + 10

result_sim <- SHASH_out(
  x,
  thr0 = 2.58,
  thr1 = 2.58,
  thr = 4,
  tail = "both",
  use_iso = FALSE # skip isolation forest to keep the example fast
)
result_sim$out_idx   # should recover positions near outlier_positions
result_sim$converged # did the iterative fit stabilise?

# --- Example 2: Real benchmark data (Hawkins-Bradu-Kass) ------------------
# hbk is a classic outlier detection benchmark shipped with robustbase,
# which this package already imports, so it is always available.
data("hbk", package = "robustbase")

result_hbk <- SHASH_out(
  hbk$X1,
  thr0 = 2.58,
  thr1 = 2.58,
  thr = 4,
  tail = "both",
  use_iso = FALSE
)
result_hbk$out_idx    # flagged observations in the X1 column
result_hbk$SHASH_coef # fitted SHASH parameters; sigma and tau are log-scale

# Did the algorithm converge before hitting maxit?
result_hbk$converged

# How many iterations did it take?
result_hbk$last_iter
These two functions form a matched pair for transforming data between the
SHASH (Sinh-Arcsinh) distribution and the standard normal distribution.
SHASH_to_normal() maps SHASH-distributed observations onto an
approximately normal scale; normal_to_SHASH() is the inverse.
SHASH_to_normal(x, mu, sigma, nu, tau)
normal_to_SHASH(x, mu, sigma, nu, tau)
x |
Numeric vector of values to transform. |
mu |
Numeric scalar. Location parameter controlling the mean of the SHASH distribution. |
sigma |
Numeric scalar. Spread parameter on the log scale. The function
applies |
nu |
Numeric scalar. Skewness parameter. A value of 0 gives a symmetric distribution. |
tau |
Numeric scalar. Tail-weight parameter on the log scale. Pass
|
A numeric vector of transformed values, the same length as x.
SHASH_to_normal(): Transforms SHASH-distributed data to
approximately normal data.
normal_to_SHASH(): Transforms standard normal data back to the
SHASH-distributed scale.
set.seed(42)
x <- rnorm(200)
x[c(17, 77)] <- x[c(17, 77)] + 5
mu <- 0; sigma <- 0; nu <- 0; tau <- 0
z <- SHASH_to_normal(x, mu = mu, sigma = sigma, nu = nu, tau = tau)
x_recovered <- normal_to_SHASH(z, mu = mu, sigma = sigma, nu = nu, tau = tau)
all.equal(x, x_recovered)
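To show what a matched transform pair of this kind involves, the sketch below writes out one common sinh-arcsinh parameterisation in base R. The exact formula, including the sign convention for `nu` and the exponentiation of `sigma` and `tau`, is an assumption rather than something confirmed by this page; `shash_to_z` and `z_to_shash` are hypothetical names chosen to avoid confusion with the package functions.

```r
# Sinh-arcsinh transform pair under an ASSUMED parameterisation:
#   z = sinh(exp(tau) * asinh((x - mu) / exp(sigma)) - nu)
# with sigma and tau supplied on the log scale.
# shash_to_z and z_to_shash are hypothetical names, not the package's functions.
shash_to_z <- function(x, mu, sigma, nu, tau) {
  sinh(exp(tau) * asinh((x - mu) / exp(sigma)) - nu)
}

z_to_shash <- function(z, mu, sigma, nu, tau) {
  # Algebraic inverse of shash_to_z: undo sinh, shift by nu, rescale, undo asinh.
  mu + exp(sigma) * sinh((asinh(z) + nu) / exp(tau))
}

x <- seq(-3, 3, by = 0.5)
z <- shash_to_z(x, mu = 1, sigma = log(2), nu = 0.3, tau = log(1.5))
x_back <- z_to_shash(z, mu = 1, sigma = log(2), nu = 0.3, tau = log(1.5))

# With mu = 0, nu = 0 and log-scale sigma = tau = 0 the map reduces to the identity.
z_identity <- shash_to_z(x, mu = 0, sigma = 0, nu = 0, tau = 0)
```

Under this parameterisation, setting all four parameters to zero leaves the data unchanged, which matches the round-trip behaviour shown in the example above.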
Summary method for Hardin & Rocke F results
## S3 method for class 'F_result'
summary(object, ...)
object |
An object of class "F_result" or "HR_result" |
... |
Additional arguments passed to method-specific functions. |
NULL, invisibly
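A summary method that prints its report and returns `NULL` invisibly follows a standard S3 pattern, sketched below. The class name "demo_result" and its fields `threshold` and `flagged` are invented for this illustration and do not describe the structure of the package's result objects.

```r
# Minimal S3 summary method that prints and returns NULL invisibly.
# "demo_result" and its fields are made up for this illustration.
summary.demo_result <- function(object, ...) {
  cat("Threshold:", format(object$threshold), "\n")
  cat("Flagged observations:", length(object$flagged), "\n")
  invisible(NULL)
}

res <- structure(
  list(threshold = 15.09, flagged = c(3L, 8L)),
  class = "demo_result"
)
ret <- summary(res)           # prints two lines; the return value is NULL
vis <- withVisible(summary(res))
```

Because the value is returned invisibly, calling `summary()` at the console shows only the printed report and no trailing `NULL`.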
Summary method for MI_boot results
## S3 method for class 'MI_boot_result'
summary(object, ...)
object |
An object of class "MI_boot_result" |
... |
Additional arguments passed to method-specific functions. |
NULL, invisibly
Summary method for MI results
## S3 method for class 'MI_result'
summary(object, ...)
object |
An object of class "MI_result" |
... |
Additional arguments passed to method-specific functions. |
NULL, invisibly
Summary method for SI_boot results
## S3 method for class 'SI_boot_result'
summary(object, ...)
object |
An object of class "SI_boot_result" |
... |
Additional arguments passed to method-specific functions. |
NULL, invisibly
Summary method for SI results
## S3 method for class 'SI_result'
summary(object, ...)
object |
An object of class "SI_result" |
... |
Additional arguments passed to method-specific functions. |
NULL, invisibly
Thr parameter documentation
thr |
Threshold multiplier for outlier detection (default = 4). |
Thresh_result parameter documentation
thresh_result |
A threshold result object from any threshold method containing threshold information. |
Performs univariate outlier detection + imputation, robust distance, and multiple thresholding methods.
threshold_RD(
  x,
  w = NULL,
  method = c("SI_boot", "MI", "MI_boot", "SI", "F", "SHASH", "all"),
  RD_obj = NULL,
  impute_method = "mean",
  cutoff = 4,
  trans = "SHASH",
  M = 50,
  k = 100,
  alpha = 0.01,
  quantile = 0.01,
  verbose = FALSE,
  boot_quant = 0.95,
  B = 1000
)
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
method |
Character string; one of "all", "SI", "SI_boot", "MI", "MI_boot", "F", "SHASH". |
RD_obj |
Pre-computed RD_result object from |
impute_method |
Character string; imputation method for univariate outliers. |
cutoff |
A numeric value giving how many MADs away from the median an observation must lie to be flagged as an outlier (default = 4). |
trans |
Character string; transformation method, one of "SHASH" or "robZ". |
M |
Integer; number of multiply imputed datasets (default = 50). |
k |
Integer; number of perturbation cycles per imputation (default = 100). |
alpha |
Significance level used to compute RD threshold (default = 0.01 for 99th percentile). |
quantile |
Numeric in (0,1) specifying the upper quantile for thresholding; the expected False Positive Rate for the chosen threshold. |
verbose |
Logical; if TRUE, print progress messages. |
boot_quant |
Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI). |
B |
Integer; number of bootstrap samples per imputed dataset (default = 1000). |
A list with:
Result from the specific threshold method, or list of all methods if "all".
The robust distance object from compute_RD().
The matched function call.
Trans parameter documentation
trans |
Character string; transformation method, one of "SHASH" or "robZ". |
Verbose parameter documentation
verbose |
Logical; if TRUE, print progress messages. |
W parameter documentation
w |
A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI). |
X parameter documentation
x |
A numeric matrix or data frame of dimensions T × p (observations × variables). |