Statistical Functions in MySQL - Open Source Is Everything
Robert Eisele
Computer Science & Machine Learning
some of the calculations. You can check out the source of the UDF on
GitHub:
Arithmetic mean
The classical arithmetic mean can already be calculated natively, as
indicated above, with the AVG() function, or by hand as SUM(x) / COUNT(x).
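As a quick sanity check, the hand-rolled form can be compared with AVG(). This is only a sketch: an in-memory SQLite database stands in for a MySQL server, and the table t1, the column x and the sample values are made up for illustration. The SELECT itself is plain SQL that MySQL accepts as well.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; t1, x and the values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x REAL)")
conn.executemany("INSERT INTO t1 (x) VALUES (?)", [(2.0,), (4.0,), (9.0,)])

# Built-in average next to the "do it yourself" version.
builtin, manual = conn.execute(
    "SELECT AVG(x), SUM(x) / COUNT(x) FROM t1"
).fetchone()
print(builtin, manual)  # 5.0 5.0
```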
Weighted average
A weighted average can be obtained in a similar way by dividing two
sums, SUM(x * w) / SUM(w), where w is the per-row weight.
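A minimal sketch of the weighted average, again with in-memory SQLite standing in for MySQL. The table t1, its columns x and w, and the two sample rows are assumptions chosen so the result is easy to verify by hand: (10 * 1 + 20 * 3) / (1 + 3) = 17.5.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; t1, x, w and the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x REAL, w REAL)")
conn.executemany("INSERT INTO t1 (x, w) VALUES (?, ?)", [(10.0, 1.0), (20.0, 3.0)])

# Sum of weighted values divided by the sum of the weights.
(weighted,) = conn.execute("SELECT SUM(x * w) / SUM(w) FROM t1").fetchone()
print(weighted)  # 17.5
```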
Harmonic average
The harmonic average, which is used for rates and ratios, for example, can
also be calculated quite easily with native functions. Suppose you want to
calculate the average cost of data transmission. One hosting package gives
you 9 GiB per dollar and another 17 GiB per dollar, and you move the same
amount of data through each. An arithmetic mean would give you an average
of 13 GiB / dollar, which is wrong. The correct solution is
2 / (1 / 9 + 1 / 17) ≈ 11.77 GiB / dollar, or in abstract MySQL syntax
COUNT(x) / SUM(1 / x).
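The hosting example can be reproduced with the two rates as rows. This is a sketch under assumptions: in-memory SQLite stands in for MySQL, and t1 and x are illustrative names. The expression COUNT(x) / SUM(1 / x) is the usual definition of the harmonic mean written with native aggregates.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; t1 and x are illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x REAL)")
conn.executemany("INSERT INTO t1 (x) VALUES (?)", [(9.0,), (17.0,)])

# Arithmetic mean versus harmonic mean of the two rates (GiB per dollar).
arithmetic, harmonic = conn.execute(
    "SELECT AVG(x), COUNT(x) / SUM(1 / x) FROM t1"
).fetchone()
print(arithmetic)          # 13.0
print(round(harmonic, 2))  # 11.77
```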
https://www.xarg.org/2012/07/statistical-functions-in-mysql/ 2/10
2/10/2019 Statistical functions in MySQL • Open Source is Everything
Geometric mean
The geometric average, which usually comes into play when averaging
factors that multiply, such as tiered discounts or similar quantities, can
also be calculated with a little algebra. The product of several positive
numbers equals e raised to the sum of their logarithms. In MySQL syntax,
the product is therefore EXP(SUM(LN(x))).
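A sketch of the product-by-logarithms idea, assuming a table t1 with a strictly positive column x (both names are illustrative). In-memory SQLite stands in for MySQL here. LN and EXP are registered from Python's math module because not every SQLite build ships math functions, whereas MySQL provides LN() and EXP() natively.

```python
import math
import sqlite3

# In-memory SQLite stands in for MySQL; t1 and x are illustrative names.
conn = sqlite3.connect(":memory:")
# Register natural log and exponential so the query runs on any SQLite build.
conn.create_function("LN", 1, math.log)
conn.create_function("EXP", 1, math.exp)
conn.execute("CREATE TABLE t1 (x REAL)")
conn.executemany("INSERT INTO t1 (x) VALUES (?)", [(2.0,), (4.0,), (8.0,)])

# Product of all x, computed as e raised to the sum of the logarithms.
(product,) = conn.execute("SELECT EXP(SUM(LN(x))) FROM t1").fetchone()
print(round(product, 6))  # 64.0
```

The result carries floating-point rounding, and the logarithm is undefined for zero or negative values, so such rows would need separate handling.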
With this knowledge in mind, we can easily get from the product to the
geometric mean by dividing the sum of the logarithms by the number of
values: EXP(SUM(LN(x)) / COUNT(x)).
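The step from product to geometric mean might look like the following sketch. As before, in-memory SQLite stands in for MySQL, t1 and x are illustrative names, and LN and EXP are registered from Python so the query does not depend on how SQLite was built. The sample values 1, 3 and 9 have the geometric mean 3.

```python
import math
import sqlite3

# In-memory SQLite stands in for MySQL; t1 and x are illustrative names.
conn = sqlite3.connect(":memory:")
conn.create_function("LN", 1, math.log)
conn.create_function("EXP", 1, math.exp)
conn.execute("CREATE TABLE t1 (x REAL)")
conn.executemany("INSERT INTO t1 (x) VALUES (?)", [(1.0,), (3.0,), (9.0,)])

# n-th root of the product: average the logarithms, then exponentiate.
(geomean,) = conn.execute(
    "SELECT EXP(SUM(LN(x)) / COUNT(x)) FROM t1"
).fetchone()
print(round(geomean, 6))  # 3.0
```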
Midrange
The midrange only takes into account the extremes of a data set and can
be computed as (MAX(x) + MIN(x)) / 2.
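A small sketch of the midrange, with in-memory SQLite standing in for MySQL and an illustrative table t1 with column x. Only the smallest and largest values matter, so the 7 in the sample data has no influence on the result.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; t1 and x are illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x REAL)")
conn.executemany("INSERT INTO t1 (x) VALUES (?)", [(3.0,), (7.0,), (20.0,)])

# Midpoint between the two extremes of the data set.
(midrange,) = conn.execute("SELECT (MAX(x) + MIN(x)) / 2 FROM t1").fetchone()
print(midrange)  # 11.5
```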
Median
There are some good examples in the comments of the documentation
of how the median can be implemented with MySQL. However, a
spoiled Excel user will run in circles screaming in the face of such
cruelties. That's why I've added a median function to my UDF.
The mode, the most frequent value of x, can be read off a grouped count
natively:
SELECT x, COUNT(*)
FROM t1
GROUP BY x
ORDER BY COUNT(*) DESC
LIMIT 1;
VARIANCE(x) is a synonym for VAR_POP(x), the population variance.
Covariance
Oracle provides the additional functions COVAR_POP(x, y) and
COVAR_SAMP(x, y) to calculate the population and sample covariance, a
measure of how two random variables vary together. With MySQL, this
functionality can be simulated with native functions as follows:
COVAR_POP(x, y):
SELECT (SUM(x * y) - SUM(x) * SUM(y) / COUNT(x)) / COUNT(x) FROM t1;
COVAR_SAMP(x, y):
SELECT (SUM(x * y) - SUM(x) * SUM(y) / COUNT(x)) / (COUNT(x) - 1) FROM t1;
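To see that the two expressions above agree with the textbook definition, they can be run on a tiny data set. This is only a sketch: in-memory SQLite stands in for MySQL, the three sample rows are made up, and the reference value is computed in plain Python as the sum of (x - mean_x) * (y - mean_y).

```python
import sqlite3

rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]  # illustrative (x, y) pairs

# In-memory SQLite stands in for MySQL; t1, x and y follow the article's naming.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x REAL, y REAL)")
conn.executemany("INSERT INTO t1 (x, y) VALUES (?, ?)", rows)

covar_pop, covar_samp = conn.execute(
    "SELECT (SUM(x * y) - SUM(x) * SUM(y) / COUNT(x)) / COUNT(x), "
    "       (SUM(x * y) - SUM(x) * SUM(y) / COUNT(x)) / (COUNT(x) - 1) "
    "FROM t1"
).fetchone()

# Reference: sum of products of deviations from the means.
n = len(rows)
mean_x = sum(x for x, _ in rows) / n
mean_y = sum(y for _, y in rows) / n
deviations = sum((x - mean_x) * (y - mean_y) for x, y in rows)

print(round(covar_pop, 4), covar_samp)  # 1.6667 2.5
```

Dividing `deviations` by n gives the population value and by n - 1 the sample value, matching the two SQL results.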
This task would be cleaner and more efficient with a native function
COVARIANCE(x, y), which I've added to my infusion extension as a
shortcut for the COVAR_POP() example above.
Row Ranking
If you want to give each line of a MySQL result a unique serial number,
you must use a little trick with user variables: initialize a variable to
zero and increment it once per row in the select list. That may be easy in
a simple example, but it gets complicated with more complex queries. I
don't know why MySQL doesn't have a function for this, but I've covered it
in my infusion extension with a function modeled on its T-SQL counterpart.
Longtail Analysis
I think it's better to start with an example to illustrate further
considerations. By long-tail analysis I mean the representation of a
frequency distribution, that is, a histogram. For search engine
optimization, for example, you can determine how search terms were
distributed over a certain period of time. Since the proportion of unique
search terms is usually relatively high, and since such a graph almost
always looks the same, it makes sense not to run GROUP BY queries on a
large data set simply to reproduce that familiar curve.
} else {
print (count - less) / count * 100, "% are better than average"
}
Fadi (itoctopus)
commented after 4 days
Justin Swanhart
commented after one day
Jonathan:
http://rpbouman.blogspot.com/2008/07/calculating-percentiles-with-mysql.html
Jonathan Levin
commented after 8 hours
Great job!
Do you do requests?
I would really love for a 95% function.