
Winsorizing

Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical
data to reduce the effect of possibly spurious outliers. It is named after the engineer-turned-biostatistician
Charles P. Winsor (1895–1951). The effect is the same as clipping in signal processing.

The distribution of many statistics can be heavily influenced by outliers. A typical strategy is to set all
outliers to a specified percentile of the data; for example, a 90% winsorization would see all data below the
5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
Winsorized estimators are usually more robust to outliers than their more standard forms, although there are
alternatives, such as trimming, that will achieve a similar effect.

Example
Consider the data set consisting of:

{92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41}       (N = 20, mean = 101.5)

The data below the 5th percentile lies between −40 and −5, while the data above the 95th percentile lies
between 101 and 1053; accordingly, a 90% winsorization would result in the following:

{92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41}       (N = 20, mean = 55.65)

After winsorization the mean has dropped to nearly half its previous value, and is consequently more in line
with the data it represents.
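The replacement can also be carried out by hand. The following is a minimal sketch (assuming NumPy is available; the helper winsorize_manual is hypothetical, not part of any library) that replaces the lowest and highest 5% of values with the nearest retained order statistics, reproducing the winsorized mean of 55.65:

import numpy as np

def winsorize_manual(x, lower=0.05, upper=0.05):
    # Illustrative helper: replace the k smallest and k largest values
    # with the nearest retained order statistics.
    x = np.asarray(x, dtype=float)
    n = len(x)
    k_lo = int(np.floor(n * lower))     # here: 1 value at the bottom
    k_hi = int(np.floor(n * upper))     # here: 1 value at the top
    srt = np.sort(x)
    lo_val = srt[k_lo]                  # -5 for the example data
    hi_val = srt[n - k_hi - 1]          # 101 for the example data
    return np.clip(x, lo_val, hi_val)

data = [92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]
print(winsorize_manual(data).mean())    # 55.65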

Python can winsorize data using the SciPy library:

from scipy.stats.mstats import winsorize

# Limits of 0.05 at each end replace the lowest and highest 5% of values
winsorize([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85,
           15, 89, 89, 28, -5, 41], limits=[0.05, 0.05])

R can winsorize data using the DescTools package:[1]

library(DescTools)
a <- c(92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41)
DescTools::Winsorize(a, probs = c(0.05, 0.95))

Distinction from trimming


Winsorizing is not equivalent to simply excluding data, which is a simpler procedure called trimming or
truncation; rather, winsorizing is a method of censoring data.

In a trimmed estimator, the extreme values are discarded; in a winsorized estimator, the extreme values are
instead replaced by certain percentiles (the trimmed minimum and maximum).
Thus a winsorized mean is not the same as a truncated mean. For instance, the 10% trimmed mean is the
average of the 5th to 95th percentile of the data, while the 90% winsorized mean sets the bottom 5% to the
5th percentile, the top 5% to the 95th percentile, and then averages the data. In the previous example the
trimmed mean would be obtained from the smaller set:

{92, 19, 101, 58, 91, 26, 78, 10, 13, 101, 86, 85, 15, 89, 89, 28, −5, 41}       (N = 18, mean = 56.5)
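For comparison, the trimmed mean can be computed directly; the call below is a brief sketch using SciPy's trim_mean on the same data, where the second argument is the proportion cut from each end:

from scipy.stats import trim_mean

data = [92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]
trim_mean(data, 0.05)   # drops one value from each tail; returns 56.5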

In this case, the winsorized mean can equivalently be expressed as a weighted average of the truncated
mean and the 5th and 95th percentiles: for the 90% winsorized mean, this is 0.05 times the 5th percentile,
plus 0.9 times the 10% trimmed mean, plus 0.05 times the 95th percentile. In general, however, winsorized
statistics need not be expressible in terms of the corresponding trimmed statistic.
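As a quick arithmetic check of this identity on the example data (a sketch using the values already computed above):

# 90% winsorized mean as a weighted average of the 5th percentile (-5),
# the 10% trimmed mean (56.5), and the 95th percentile (101)
0.05 * (-5) + 0.90 * 56.5 + 0.05 * 101   # = 55.65, the 90% winsorized mean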

More formally, they are distinct because the order statistics are not independent.

Uses
Winsorization is used in the survey methodology context in order to "trim" extreme survey non-response
weights.[2]

It is also used in the construction of some stock indexes when looking at the range of certain factors (for
example growth and value) for particular stocks.[3]

See also
Trimmed estimator
Huber loss
Robust regression

References
1. Andri Signorell et al. (2021). DescTools: Tools for Descriptive Statistics. R package version 0.99.41.
2. Lee, Brian K.; Lessler, Justin; Stuart, Elizabeth A. (2011). "Weight trimming and propensity score
   weighting". PLOS ONE 6 (3): e18174.
   https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018174
3. MSCI Global Investable Market Value and Growth Index Methodology, section 2.2.1.
   https://www.msci.com/eqb/methodology/meth_docs/MSCI_GIMIVGMethod_Feb2021.pdf

Hastings, Cecil Jr.; Mosteller, Frederick; Tukey, John W.; Winsor, Charles P. (1947). "Low moments for
small samples: a comparative study of order statistics". Annals of Mathematical Statistics. 18 (3): 413–426.
doi:10.1214/aoms/1177730388.
Dixon, W. J. (1960). "Simplified Estimation from Censored Normal Samples". Annals of Mathematical
Statistics. 31 (2): 385–391. doi:10.1214/aoms/1177705900.
Tukey, J. W. (1962). "The Future of Data Analysis". Annals of Mathematical Statistics. 33 (1): 1–67 [p. 18].
doi:10.1214/aoms/1177704711. JSTOR 2237638.

External links
"Winsorization" (https://www.r-bloggers.com/winsorization/). R-bloggers. June 30, 2011.

