Evaluating Outlier Detection Strategies

Using batting stats from Major League Baseball’s 2023 season

Shohei Ohtani, photo by Erik Drost on Flickr, CC BY 2.0

Outlier detection is an unsupervised machine learning task to identify anomalies (unusual observations) within a given data set. This task is helpful in many real-world cases where our available dataset is already “contaminated” by anomalies. Scikit-learn implements several outlier detection algorithms, and in cases where we have an uncontaminated baseline, we can also use these algorithms for novelty detection, a semi-supervised task that predicts whether new observations are outliers.

Overview

The four outlier detection algorithms we’ll compare are:

  • Elliptic Envelope is suitable for normally-distributed data with low dimensionality. As its name implies, it uses the multivariate normal distribution to create a distance measure to separate outliers from inliers.
  • Local Outlier Factor is a comparison of the local density of an observation with that of its neighbors. Observations with much lower density than their neighbors are considered outliers.
  • One-Class Support Vector Machine (SVM) with Stochastic Gradient Descent (SGD) is an O(n) approximate solution of the One-Class SVM. Note that the O(n²) One-Class SVM works well on our small example dataset but may be impractical for your actual use case.
  • Isolation Forest is a tree-based approach where outliers are more quickly isolated by random splits than inliers.

Since our task is unsupervised, we don’t have ground truth to compare accuracies of these algorithms. Instead, we want to see how their results (player rankings in particular) differ from one another and gain some intuition into their behavior and limitations, so that we might know when to prefer one over another.

Let’s compare a few of these techniques using two metrics of batter performance from 2023’s Major League Baseball (MLB) season:

  • On-base percentage (OBP), the rate at which a batter reaches base (by hitting, walking, or getting hit by pitch) per plate appearance
  • Slugging (SLG), the average number of total bases per at bat
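
As a rough sketch of how these two metrics are computed from counting stats, the helper functions below (hypothetical names, not part of pybaseball) use the standard formulas. Note that the official OBP denominator is at bats plus walks, hit-by-pitches, and sacrifice flies, which is slightly narrower than all plate appearances. The SLG example uses Shohei Ohtani's counting stats as printed in the results table later in this article; the OBP example uses a made-up batter because sacrifice flies are not shown in that table.

```python
def slugging(ab, hits, doubles, triples, home_runs):
    """Total bases per at bat."""
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / ab


def on_base_pct(ab, hits, walks, hit_by_pitch, sac_flies):
    """Times on base divided by (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (ab + walks + hit_by_pitch + sac_flies)


# Ohtani's 2023 line from the results table: AB=497, H=151, 2B=26, 3B=8, HR=44
ohtani_slg = slugging(ab=497, hits=151, doubles=26, triples=8, home_runs=44)
print(round(ohtani_slg, 3))  # 0.654, matching the SLG column in the table

# Hypothetical batter (illustrative numbers only)
hypo_obp = on_base_pct(ab=500, hits=150, walks=50, hit_by_pitch=5, sac_flies=5)
print(round(hypo_obp, 3))  # 0.366
```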

There are many more sophisticated metrics of batter performance, including OBP plus SLG (OPS), weighted on-base average (wOBA), and adjusted weighted runs created (wRC+). However, we’ll see that in addition to being commonly used and easy to understand, OBP and SLG are moderately correlated and approximately normally distributed, making them well suited for this comparison.

Dataset preparation

We use the pybaseball package to obtain hitting data. This Python package is under MIT license and returns data from Fangraphs.com, Baseball-Reference.com, and other sources that have in turn obtained official records from Major League Baseball.

We use pybaseball’s 2023 batting statistics, which can be obtained either by batting_stats (FanGraphs) or batting_stats_bref (Baseball Reference). It turns out that the player names are more correctly formatted from FanGraphs, but player teams and leagues from Baseball Reference are better formatted in the case of traded players. For a dataset with improved readability, we actually need to merge three tables: FanGraphs, Baseball Reference, and a key lookup.

from pybaseball import (cache, batting_stats_bref, batting_stats,
                        playerid_reverse_lookup)
import pandas as pd

cache.enable()  # avoid unnecessary requests when re-running

MIN_PLATE_APPEARANCES = 200

# For readability and reasonable default sort order
df_bref = (
    batting_stats_bref(2023)
    .query(f"PA >= {MIN_PLATE_APPEARANCES}")
    .rename(columns={"Lev": "League", "Tm": "Team"})
)
df_bref["League"] = (
    df_bref["League"].str.replace("Maj-", "")
    .replace("AL,NL", "NL/AL")
    .astype('category')
)

df_fg = batting_stats(2023, qual=MIN_PLATE_APPEARANCES)

# Lookup table that maps MLB IDs (Baseball Reference) to FanGraphs IDs
key_mapping = (
    playerid_reverse_lookup(df_bref["mlbID"].to_list(), key_type='mlbam')
    [["key_mlbam", "key_fangraphs"]]
    .rename(columns={"key_mlbam": "mlbID", "key_fangraphs": "IDfg"})
)

# FanGraphs stats and names, with team and league from Baseball Reference
df = (
    df_fg.drop(columns="Team")
    .merge(key_mapping, how="inner", on="IDfg")
    .merge(df_bref[["mlbID", "League", "Team"]], how="inner", on="mlbID")
    .sort_values(["League", "Team", "Name"])
)

Data Exploration

First, we note that these metrics differ in mean and variance and are moderately correlated. We also note that each metric is fairly symmetric, with median value close to mean.

print(df[["OBP","SLG"]].describe().round(3))

print(f"\nCorrelation: {df[['OBP','SLG']].corr()['SLG']['OBP']:.3f}")
           OBP      SLG
count  362.000  362.000
mean     0.320    0.415
std      0.034    0.068
min      0.234    0.227
25%      0.300    0.367
50%      0.318    0.414
75%      0.340    0.460
max      0.416    0.654

Correlation: 0.630

Let’s visualize this joint distribution, using:

  • Scatterplot of the players, colored by National League (NL) vs American League (AL)
  • Bivariate kernel density estimator (KDE) plot of the players, which smoothes the scatterplot with a Gaussian kernel to estimate density
  • Marginal KDE plots of each metric

import matplotlib.pyplot as plt
import seaborn as sns

g = sns.JointGrid(data=df, x="OBP", y="SLG", height=5)
g = g.plot_joint(func=sns.scatterplot, data=df, hue="League",
                 palette={"AL":"blue","NL":"maroon","NL/AL":"green"},
                 alpha=0.6
                )
g.figure.suptitle("On-base percentage vs. Slugging\n2023 season, min "
                  f"{MIN_PLATE_APPEARANCES} plate appearances"
                 )
g.figure.subplots_adjust(top=0.9)
sns.kdeplot(x=df["OBP"], color="orange", ax=g.ax_marg_x, alpha=0.5)
sns.kdeplot(y=df["SLG"], color="orange", ax=g.ax_marg_y, alpha=0.5)
sns.kdeplot(data=df, x="OBP", y="SLG",
            ax=g.ax_joint, color="orange", alpha=0.5
           )

# Label the players with the minimum and maximum OBP or OPS
df_extremes = df[ df["OBP"].isin([df["OBP"].min(),df["OBP"].max()])
                | df["OPS"].isin([df["OPS"].min(),df["OPS"].max()])
                ]

for _,row in df_extremes.iterrows():
    g.ax_joint.annotate(row["Name"], (row["OBP"], row["SLG"]), size=6,
                        xycoords='data', xytext=(-3, 0),
                        textcoords='offset points', ha="right",
                        alpha=0.7)
plt.show()

The top-right corner of the scatterplot shows a cluster of excellence in hitting corresponding to the heavy upper tails of the SLG and OBP distributions. This small group excels at getting on base and hitting for extra bases. How much we consider them to be outliers (because of their distance from the majority of the player population) versus inliers (because of their proximity to one another) depends on the definition used by our chosen algorithm.

Apply outlier detection algorithms

Scikit-learn’s outlier detection algorithms typically have fit() and predict() methods, but there are exceptions and also differences between algorithms in their arguments. We’ll consider each algorithm individually, but we’ll fit each to a matrix of attributes (n=2) per player (m=362). We’ll then score not only each player but a grid of values spanning the range of each attribute, to help us visualize the prediction function.

To visualize decision boundaries, we need to take the following steps:

  1. Create a 2D meshgrid of input feature values.
  2. Apply the decision_function to each point on the meshgrid, which requires unstacking the grid.
  3. Re-shape the predictions back into a grid.
  4. Plot the predictions.

We’ll use a 200×200 grid to cover the existing observations plus some padding, but you could adjust the grid to your desired speed and resolution.

import numpy as np

X = df[["OBP","SLG"]].to_numpy()

GRID_RESOLUTION = 200

# Pad the display range beyond the observed min and max of each feature
disp_x_range, disp_y_range = ( (.6*X[:,i].min(), 1.2*X[:,i].max())
                               for i in [0,1]
                             )
xx, yy = np.meshgrid(np.linspace(*disp_x_range, GRID_RESOLUTION),
                     np.linspace(*disp_y_range, GRID_RESOLUTION)
                    )
grid_shape = xx.shape
# One row per grid point, one column per feature
grid_unstacked = np.c_[xx.ravel(), yy.ravel()]
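
To make the unstack-and-reshape round trip concrete, here is a minimal sketch on a tiny 3×2 grid. The function toy_score is a hypothetical stand-in for a fitted model's decision_function; it only needs to accept one row per point.

```python
import numpy as np

# A tiny grid: 3 values on the x axis, 2 values on the y axis
xs = np.linspace(0.0, 1.0, 3)
ys = np.linspace(10.0, 20.0, 2)
gx, gy = np.meshgrid(xs, ys)            # both arrays have shape (2, 3)

# Unstack: one row per grid point, one column per feature
points = np.c_[gx.ravel(), gy.ravel()]  # shape (6, 2)


def toy_score(pts):
    # Hypothetical scoring function standing in for decision_function
    return pts[:, 0] + pts[:, 1]


# Score the flat list of points, then restore the grid layout for plotting
scores = toy_score(points).reshape(gx.shape)
print(points.shape, scores.shape)  # (6, 2) (2, 3)
```

Because ravel and reshape use the same row-major order, scores[i, j] is the score of the point (gx[i, j], gy[i, j]), which is what contourf expects.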

Elliptic Envelope

The shape of the elliptic envelope is determined by the data’s covariance matrix, which gives the variance of feature i on the main diagonal [i,i] and the covariance of features i and j in the [i,j] positions. Because the covariance matrix is sensitive to outliers, this algorithm uses the Minimum Covariance Determinant (MCD) Estimator, which is recommended for unimodal and symmetric distributions, with shuffling determined by the random_state input for reproducibility. This robust covariance matrix will come in handy again later.

Because we want to compare the outlier scores in their ranking rather than a binary outlier/inlier classification, we use the decision_function to score players.

from sklearn.covariance import EllipticEnvelope

ell = EllipticEnvelope(random_state=17).fit(X)
df["outlier_score_ell"] = ell.decision_function(X)
Z_ell = ell.decision_function(grid_unstacked).reshape(grid_shape)
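
To get a feel for the scores, the sketch below fits an Elliptic Envelope to synthetic correlated data (not the baseball dataset) and scores two probe points. In scikit-learn, decision_function returns positive values for inliers and negative values for outliers, and predict returns 1 or -1 accordingly. The names toy_X, toy_ell, and probes are illustrative only.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic, positively correlated 2D data (illustrative only)
rng = np.random.default_rng(0)
toy_cov = [[1.0, 0.6], [0.6, 1.0]]
toy_X = rng.multivariate_normal(mean=[0.0, 0.0], cov=toy_cov, size=300)

toy_ell = EllipticEnvelope(random_state=0).fit(toy_X)

probes = np.array([[0.0, 0.0],     # at the center of the data
                   [6.0, -6.0]])   # far away, against the correlation
probe_scores = toy_ell.decision_function(probes)
probe_labels = toy_ell.predict(probes)

# The central probe scores positive (inlier); the distant probe negative (outlier)
print(probe_scores, probe_labels)
```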

Local Outlier Factor

This approach to measuring isolation is based on k-nearest neighbors (KNN). We calculate the total distance from each observation to its nearest neighbors to define local density, and then we compare each observation’s local density with that of its neighbors. Observations with local density much less than their neighbors are considered outliers.
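
The idea can be sketched in a few lines of NumPy. This is a simplified density ratio for intuition only, not scikit-learn's implementation: the real LOF uses reachability distances rather than raw neighbor distances. The data, the value of k, and all variable names here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(30, 2))  # a tight cluster
pts = np.vstack([cluster, [[2.0, 2.0]]])                # row 30 is isolated
k = 5

# Pairwise Euclidean distances between all points
diffs = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Indices and distances of each point's k nearest neighbors (column 0 is itself)
nn_idx = np.argsort(dist, axis=1)[:, 1:k + 1]
nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

# Local density: inverse of the average distance to the k nearest neighbors
density = k / nn_dist.sum(axis=1)

# Ratio of the neighbors' average density to the point's own density;
# values well above 1 mean the point is much sparser than its neighbors
ratio = density[nn_idx].mean(axis=1) / density
print(int(np.argmax(ratio)))  # 30, the isolated point
```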

Choosing the number of neighbors to include: In KNN, a rule of thumb is to let K = sqrt(N), where N is your observation count. From this rule, we obtain a K close to 20 (which happens to be the default K for LOF). You can increase or decrease K to reduce overfitting or underfitting, respectively.

K = int(np.sqrt(X.shape[0]))

print(f"Using K={K} nearest neighbors.")
Using K=19 nearest neighbors.

Choosing a distance measure: Note that our features are correlated and have different variances, so Euclidean distance is not very meaningful. We’ll use Mahalanobis distance, which accounts for feature scale and correlation.

In calculating the Mahalanobis distance, we’ll use the robust covariance matrix. If we had not already calculated it via Elliptic Envelope, we could calculate it directly.

from scipy.spatial.distance import pdist, squareform

# If we didn't have the elliptical envelope already,
# we could calculate robust covariance:
#   from sklearn.covariance import MinCovDet
#   robust_cov = MinCovDet().fit(X).covariance_
# But we can just re-use it from elliptical envelope:
robust_cov = ell.covariance_

print(f"Robust covariance matrix:\n{np.round(robust_cov,5)}\n")

inv_robust_cov = np.linalg.inv(robust_cov)

D_mahal = squareform(pdist(X, 'mahalanobis', VI=inv_robust_cov))

print(f"Mahalanobis distance matrix of size {D_mahal.shape}, "
      f"e.g.:\n{np.round(D_mahal[:5,:5],3)}...\n...\n")
Robust covariance matrix:
[[0.00077 0.00095]
 [0.00095 0.00366]]

Mahalanobis distance matrix of size (362, 362), e.g.:
[[0.    2.86  1.278 0.964 0.331]
 [2.86  0.    2.63  2.245 2.813]
 [1.278 2.63  0.    0.561 0.956]
 [0.964 2.245 0.561 0.    0.723]
 [0.331 2.813 0.956 0.723 0.   ]]...
...
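
To see why the choice of distance matters here, the sketch below takes the rounded robust covariance matrix printed above and compares two offsets of equal Euclidean length: one along the OBP/SLG correlation and one against it. The offsets and the helper function are made up for illustration.

```python
import numpy as np

# Rounded robust covariance matrix from the printed output (illustration only)
cov = np.array([[0.00077, 0.00095],
                [0.00095, 0.00366]])
cov_inv = np.linalg.inv(cov)


def mahalanobis(delta, vi):
    """Mahalanobis length of an offset vector, given the inverse covariance."""
    return float(np.sqrt(delta @ vi @ delta))


along = np.array([0.03, 0.03])     # higher OBP and higher SLG
against = np.array([0.03, -0.03])  # higher OBP but lower SLG

# Same Euclidean length for both offsets...
print(np.linalg.norm(along), np.linalg.norm(against))
# ...but the offset that runs against the correlation is farther in Mahalanobis terms
print(mahalanobis(along, cov_inv), mahalanobis(against, cov_inv))
```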

Fitting the Local Outlier Factor: Note that using a custom distance matrix requires us to pass metric="precomputed" to the constructor and then the distance matrix itself to the fit method. (See documentation for more details.)

Also note that unlike other algorithms, with LOF we are instructed not to use the score_samples method for scoring existing observations; this method should only be used for novelty detection.

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=K, metric="precomputed", novelty=True
                        ).fit(D_mahal)

# Scores of the training observations (lower means more abnormal)
df["outlier_score_lof"] = lof.negative_outlier_factor_

Create the decision boundary: Because we used a custom distance metric, we must also compute that custom distance between each point in the grid to the original observations. Before we used the spatial measure pdist for pairwise distances between each member of a single set, but now we use cdist to return the distances from each member of the first set of inputs to each member of the second set.

from scipy.spatial.distance import cdist

D_mahal_grid = cdist(XA=grid_unstacked, XB=X,
                     metric='mahalanobis', VI=inv_robust_cov
                    )
Z_lof = lof.decision_function(D_mahal_grid).reshape(grid_shape)
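
The difference between pdist and cdist is easiest to see from the output shapes. The sketch below uses small random arrays and the default Euclidean metric; the array names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
obs = rng.normal(size=(4, 2))      # 4 "fitted" observations
new_pts = rng.normal(size=(3, 2))  # 3 new points to score

condensed = pdist(obs)             # 4*3/2 = 6 unique pairs within one set
square = squareform(condensed)     # expanded to a symmetric 4x4 matrix
cross = cdist(new_pts, obs)        # 3x4: each new point to each observation

print(condensed.shape, square.shape, cross.shape)  # (6,) (4, 4) (3, 4)
```

A model fitted with metric="precomputed" on an n×n matrix expects an m×n matrix at scoring time, one row per new point and one column per fitted observation, which is the layout cdist returns.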

Support Vector Machine (SGD-One-Class SVM)

SVMs use the kernel trick to transform features into a higher dimensionality where a separating hyperplane can be identified. The radial basis function (RBF) kernel requires the inputs to be standardized, but as the documentation for StandardScaler notes, that scaler is sensitive to outliers, so we'll use RobustScaler. We'll pipe the scaled inputs into Nyström kernel approximation, as suggested by the documentation for SGDOneClassSVM.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM

suv = make_pipeline(
    RobustScaler(),
    Nystroem(random_state=17),
    SGDOneClassSVM(random_state=17)
).fit(X)

df["outlier_score_svm"] = suv.decision_function(X)

Z_svm = suv.decision_function(grid_unstacked).reshape(grid_shape)
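
The motivation for RobustScaler can be sketched with plain NumPy. By default, StandardScaler centers on the mean and divides by the standard deviation, while RobustScaler centers on the median and divides by the interquartile range. The two helper functions below are simplified one-dimensional stand-ins written for this illustration, not the scikit-learn classes themselves.

```python
import numpy as np


def standard_scale(x):
    # Mean and standard deviation: both are pulled around by extreme values
    return (x - x.mean()) / x.std()


def robust_scale(x):
    # Median and interquartile range: largely unaffected by a few extremes
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (x - q50) / (q75 - q25)


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dirty = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # same data with one extreme value

# Compare how the first four (unchanged) values are scaled in each case
print(standard_scale(clean)[:4], standard_scale(dirty)[:4])
print(robust_scale(clean)[:4], robust_scale(dirty)[:4])
```

With the extreme value present, standard scaling squeezes the four ordinary values together, while the robust version scales them exactly as before.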

Isolation Forest

This tree-based approach to measuring isolation performs random recursive partitioning. If the average number of splits required to isolate a given observation is low, that observation is considered a stronger candidate outlier. Like Random Forests and other tree-based models, Isolation Forest does not assume that the features are normally distributed or require them to be scaled. By default, it builds 100 trees. Our example only uses two features, so we do not enable feature sampling.

from sklearn.ensemble import IsolationForest

iso = IsolationForest(random_state=17).fit(X)

df["outlier_score_iso"] = iso.score_samples(X)

Z_iso = iso.decision_function(grid_unstacked).reshape(grid_shape)
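
To build intuition for why outliers need fewer random splits, here is a bare-bones sketch of the isolation idea on synthetic data. It is not scikit-learn's implementation, which also subsamples the data for each tree and normalizes path lengths; the function isolation_depth and all data below are assumptions made for illustration.

```python
import numpy as np


def isolation_depth(point, data, rng, max_depth=50):
    """Count random axis-aligned splits needed to isolate `point` within `data`."""
    subset = data
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        feature = rng.integers(subset.shape[1])
        lo, hi = subset[:, feature].min(), subset[:, feature].max()
        if lo == hi:
            break  # cannot split further on this feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point
        if point[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth


rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 2))
outlier = np.array([6.0, 6.0])
data = np.vstack([cloud, outlier])
# The observation closest to the origin serves as a typical inlier
central = cloud[np.argmin(np.linalg.norm(cloud, axis=1))]

n_trials = 200
depth_outlier = np.mean([isolation_depth(outlier, data, rng) for _ in range(n_trials)])
depth_central = np.mean([isolation_depth(central, data, rng) for _ in range(n_trials)])
print(depth_outlier, depth_central)  # the outlier is isolated in far fewer splits
```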

Results: inspecting decision boundaries

Note that the predictions from these models have different distributions. We apply QuantileTransformer to make them more visually comparable on a given grid. From the documentation, please note:

Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
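
The effect of a quantile transform can be sketched with a simple rank-based mapping. This is a simplified stand-in for intuition: QuantileTransformer itself interpolates between a limited number of reference quantiles (n_quantiles) rather than ranking every value. The function to_quantiles and the synthetic scores are assumptions for illustration.

```python
import numpy as np


def to_quantiles(values):
    """Map each value to its empirical quantile in [0, 1] (assumes no ties)."""
    ranks = np.argsort(np.argsort(values))
    return ranks / (len(values) - 1)


rng = np.random.default_rng(0)
scores_a = rng.exponential(scale=1.0, size=1000)  # skewed, small scale
scores_b = 1000.0 * rng.normal(size=1000)         # symmetric, large scale

q_a, q_b = to_quantiles(scores_a), to_quantiles(scores_b)

# Both sets of scores now span exactly [0, 1], and ordering is preserved
print(q_a.min(), q_a.max(), q_b.min(), q_b.max())
```

Because only the ordering of scores survives the transform, contour levels on the transformed grid mark equal shares of the grid rather than equal steps in the raw score.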

from adjustText import adjust_text
from sklearn.preprocessing import QuantileTransformer

N_QUANTILES = 8  # This many color breaks per chart
N_CALLOUTS = 15  # Label this many top outliers per chart

fig, axs = plt.subplots(2, 2, figsize=(12, 12), sharex=True, sharey=True)

fig.suptitle("Comparison of Outlier Identification Algorithms", size=20)
fig.supxlabel("On-Base Percentage (OBP)")
fig.supylabel("Slugging (SLG)")

ax_ell = axs[0,0]
ax_lof = axs[0,1]
ax_svm = axs[1,0]
ax_iso = axs[1,1]

model_abbrs = ["ell","iso","lof","svm"]

qt = QuantileTransformer(n_quantiles=N_QUANTILES)

for ax, nm, abbr, zz in zip( [ax_ell,ax_iso,ax_lof,ax_svm],
                             ["Elliptic Envelope","Isolation Forest",
                              "Local Outlier Factor","One-class SVM"],
                             model_abbrs,
                             [Z_ell,Z_iso,Z_lof,Z_svm]
                           ):
    ax.title.set_text(nm)
    outlier_score_var_nm = f"outlier_score_{abbr}"

    # Map the grid scores to quantiles so the charts share a color scale
    qt.fit(np.sort(zz.reshape(-1,1)))
    zz_qtl = qt.transform(zz.reshape(-1,1)).reshape(zz.shape)

    cs = ax.contourf(xx, yy, zz_qtl, cmap=plt.cm.OrRd.reversed(),
                     levels=np.linspace(0,1,N_QUANTILES)
                    )
    ax.scatter(X[:, 0], X[:, 1], s=20, c="b", edgecolor="k", alpha=0.5)

    # Lowest scores are the strongest outliers
    df_callouts = df.sort_values(outlier_score_var_nm).head(N_CALLOUTS)
    texts = [ ax.text(row["OBP"], row["SLG"], row["Name"], c="b",
                      size=9, alpha=1.0)
              for _,row in df_callouts.iterrows()
            ]
    adjust_text(texts,
                df_callouts["OBP"].values, df_callouts["SLG"].values,
                arrowprops=dict(arrowstyle='->', color="b", alpha=0.6),
                ax=ax
               )

plt.tight_layout(pad=2)
plt.show()

for var in ["OBP","SLG"]:
    df[f"Pctl_{var}"] = 100*(df[var].rank()/df[var].size).round(3)

model_score_vars = [f"outlier_score_{nm}" for nm in model_abbrs]
model_rank_vars = [f"Rank_{nm.upper()}" for nm in model_abbrs]

# Rank 1 is the lowest score, i.e. the strongest outlier
df[model_rank_vars] = df[model_score_vars].rank(axis=0).astype(int)

# Averaging the ranks is arbitrary; we just need a countdown order
df["Rank_avg"] = df[model_rank_vars].mean(axis=1)

print("Counting down to the greatest outlier...\n")
print(
    df.sort_values("Rank_avg", ascending=False
                  ).tail(N_CALLOUTS)[["Name","AB","PA","H","2B","3B",
                                      "HR","BB","HBP","SO","OBP",
                                      "Pctl_OBP","SLG","Pctl_SLG"
                                     ] + model_rank_vars
                                    ].to_string(index=False)
)
Counting down to the greatest outlier...

            Name  AB  PA   H 2B 3B HR  BB HBP  SO   OBP Pctl_OBP   SLG Pctl_SLG Rank_ELL Rank_ISO Rank_LOF Rank_SVM
   Austin Barnes 178 200  32  5  0  2  17   2  43 0.256      2.6 0.242      0.6       19        7       25       12
   J.D. Martinez 432 479 117 27  2 33  34   2 149 0.321     52.8 0.572     98.1       15       18        5       15
      Yandy Diaz 525 600 173 35  0 22  65   8  94 0.410     99.2 0.522     95.4       13       15       13       10
       Jose Siri 338 364  75 13  2 25  20   2 130 0.267      5.5 0.494     88.4        8       14       15       13
       Juan Soto 568 708 156 32  1 35 132   2 129 0.410     99.2 0.519     95.0       12       13       11       11
    Mookie Betts 584 693 179 40  1 39  96   8 107 0.408     98.6 0.579     98.3        7       10       20        7
   Rob Refsnyder 202 243  50  9  1  1  33   5  47 0.365     90.5 0.317      6.6        5       19        2       14
  Yordan Alvarez 410 496 120 24  1 31  69  13  92 0.407     98.3 0.583     98.6        6        9       18        6
 Freddie Freeman 637 730 211 59  2 29  72  16 121 0.410     99.2 0.567     97.8        9       11        9        8
      Matt Olson 608 720 172 27  3 54 104   4 167 0.389     96.5 0.604     99.2       11        6        7        9
   Austin Hedges 185 212  34  5  0  1  11   2  47 0.234      0.3 0.227      0.3       10        1        4        3
     Aaron Judge 367 458  98 16  0 37  88   0 130 0.406     98.1 0.613     99.4        3        5        6        4
Ronald Acuna Jr. 643 735 217 35  4 41  80   9  84 0.416    100.0 0.596     98.9        2        3       10        2
    Corey Seager 477 536 156 42  0 33  49   4  88 0.390     97.0 0.623     99.7        4        4        3        5
   Shohei Ohtani 497 599 151 26  8 44  91   3 143 0.412     99.7 0.654    100.0        1        2        1        1

Evaluation and Conclusions

It looks like the four implementations mostly agree on how to define outliers, but with some noticeable differences in scores and also in ease of use.

Elliptic Envelope has narrower contours around the ellipse’s minor axis, so it tends to highlight those interesting players who run contrary to the overall correlation between features. For example, Rays outfielder José Siri ranks as more of an outlier under this algorithm due to his high SLG (88th percentile) versus low OBP (5th percentile), which is consistent with an aggressive hitter who swings hard at borderline pitches and either crushes them or gets weak-to-no contact.

Elliptic Envelope is also easy to use without configuration, and it provides the robust covariance matrix. If you have low-dimensional data and a reasonable expectation for it to be normally distributed (which is often not the case), you might want to try this simple approach first.

One-class SVM has more uniformly spaced contours, so it tends to emphasize observations along the overall direction of correlation more than the Elliptic Envelope. All-Star first basemen Freddie Freeman (Dodgers) and Yandy Diaz (Rays) rank more strongly under this algorithm than under others, since their OBP and SLG are both excellent (99th and 97th percentile for Freeman, 99th and 95th for Diaz).

The RBF kernel required an extra step for standardization, but it also seemed to work well in this simple example without fine-tuning.

Local Outlier Factor picked up on the “cluster of excellence” mentioned earlier with a small bimodal contour (barely visible in the chart). Since the Dodgers’ outfielder/second-baseman Mookie Betts is surrounded by other excellent hitters including Freeman, Yordan Alvarez, and Ronald Acuña Jr., he ranks as only the 20th-strongest outlier under LOF, versus 10th or stronger under the other algorithms. Conversely, Braves outfielder Marcell Ozuna had slightly lower SLG and considerably lower OBP than Betts, but he is more of an outlier under LOF because his neighborhood is less dense.

LOF was the most time-consuming to implement since we created robust distance matrices for fitting and scoring. We could have spent some time tuning K as well.

Isolation Forest tends to emphasize observations at the corners of the feature space, because splits are distributed across features. Backup catcher Austin Hedges, who played for the Pirates and Rangers in 2023 and signed with the Guardians for 2024, is strong defensively but the worst batter (with at least 200 plate appearances) in both SLG and OBP. Hedges can be isolated in a single split on either OBP or SLG, making him the strongest outlier. Isolation Forest is the only algorithm that didn’t rank Shohei Ohtani as the strongest outlier: since Ohtani was edged out in OBP by Ronald Acuña Jr., both Ohtani and Acuña can be isolated in a single split on only one feature.

As with common supervised tree-based learners, Isolation Forest does not extrapolate, making it better suited for fitting to a contaminated dataset for outlier detection than for fitting to an anomaly-free dataset for novelty detection (where it wouldn’t score new outliers more strongly than the existing observations).

Although Isolation Forest worked well out of the box, its failure to rank Shohei Ohtani as the greatest outlier in baseball (and probably all professional sports) illustrates the primary limitation of any outlier detector: the data you use to fit it.

Not only did we omit defensive stats (sorry, Austin Hedges), we didn’t bother to include pitching stats. Because pitchers don’t even try to hit anymore… except for Ohtani, whose season included the second-best batting average against (BAA) and 11th-best earned run average (ERA) in baseball (minimum 100 innings), a complete-game shutout, and a game in which he struck out ten batters and hit two home runs.

It has been suggested that Shohei Ohtani is an advanced extraterrestrial impersonating a human, but it seems more likely that there are two advanced extraterrestrials impersonating the same human. Unfortunately, one of them just had elbow surgery and won’t pitch in 2024… but the other just signed a record 10-year, $700 million contract. And thanks to outlier detection, now we can see why!


Evaluating Outlier Detection Strategies was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

