Statistical and Mathematical Functions for Derivations

The following is a list of all the statistical and mathematical functions available for derivations in SuperCROSS.

The statistical functions compute various descriptive statistics for tabulation fields representing ranged (for example, age) or bracketed (for example, income) values using the frequencies found in the populated table.

Statistical function derivations do not work directly on the values in a table. Instead, they are weighted by the ranges defined on each item in a recode. The label of the range in which the calculated value falls is displayed, rather than the calculated value itself.

These ranges are automatically created if the labels of recode items can be parsed as integers. If the labels cannot be automatically parsed as integers, these ranges are not created and none of the statistical functions (for example, Mean, Median and Stdev) work.

For example, in the Retail Banking database if you add the Age field (excluding unknown), count Customers, and add a mean field derivation, the value displayed will be the (approximate) average age of customers, rather than the average of the number of customers in each age group.

When using statistical function derivations in this manner you need to keep the following in mind:

The function does not operate directly on the tabulation values being displayed. For example, Mean is not the same as the value of sum(A:Z)/n (where n is the number of cells in the range A:Z).
The function does not work if SuperCROSS could not parse the recode item labels as integers and form ranges from them, or if the user did not double-click on each recode item label and explicitly set a range.
You cannot operate directly on the values being displayed.
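
The difference between a range-weighted mean and a mean of the displayed cell values can be illustrated with a minimal Python sketch. The age ranges and customer counts below are hypothetical, and weighting by the midpoint of each range is an assumption made for illustration; this page does not state the exact weighting SuperCROSS applies.

```python
# Hypothetical recode ranges (lower, upper) and the customer count in each range.
age_ranges = [(20, 29), (30, 39), (40, 49), (50, 59)]
customer_counts = [120, 200, 150, 30]


def grouped_mean(ranges, counts):
    """Approximate the mean of the underlying values from grouped frequencies.

    Assumption: each range is represented by its midpoint.
    """
    total = sum(counts)
    weighted = sum(((low + high) / 2) * n for (low, high), n in zip(ranges, counts))
    return weighted / total


# Approximate average age of customers (what a Mean derivation is meant to convey).
approx_mean_age = grouped_mean(age_ranges, customer_counts)

# Average of the displayed cell values, that is sum(A:Z)/n. This is a different number.
mean_of_cell_values = sum(customer_counts) / len(customer_counts)

print(approx_mean_age, mean_of_cell_values)
```

The first result is an age, while the second is an average number of customers per age group, which is why the two cannot be used interchangeably.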

Function

Description

Exp

Exponential function where Exp = exp(x)

Ln

Natural logarithm function where Ln = ln(x)

Round

Round function where Round = floor(x+0.5)
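
The definition floor(x+0.5) rounds halves upwards, which is not how every tool rounds. The following Python sketch (the helper name round_half_up is hypothetical) contrasts it with Python's built-in round, which rounds halves to the nearest even integer.

```python
import math


def round_half_up(x):
    """Round as defined above: floor(x + 0.5)."""
    return math.floor(x + 0.5)


# Halves always move towards positive infinity under floor(x + 0.5).
print(round_half_up(2.5), round_half_up(-2.5))

# Python's built-in round uses round-half-to-even instead.
print(round(2.5), round(-2.5))
```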

Sqrt

Square root function where Sqrt = sqrt(x)

Median

The middle value of a set or series of values (after data has been ordered from lowest to highest).

For example, the median value of the set 22, 25, 34, 46, 90 is 34.

If there are an even number of values in the set (and therefore no single middle value), then the median is calculated as the mean of the two middle values.

For example, for the set 14, 16, 23, 26, 28, 33, the median is calculated as (23 + 26) / 2 = 24.5.
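
Both median cases can be checked with Python's standard statistics module. This sketch works on plain lists of values; it does not reproduce the range weighting that SuperCROSS applies to tabulation fields.

```python
from statistics import median

odd_set = [22, 25, 34, 46, 90]       # odd number of values: single middle value
even_set = [14, 16, 23, 26, 28, 33]  # even number of values: mean of the two middle values

median_odd = median(odd_set)
median_even = median(even_set)

print(median_odd, median_even)
```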

Percentile

The rank of a data point in a group of data. A percentile can only be derived from a set of numeric values.

Pareto

The Pareto index is a measure of the breadth of income or wealth distribution.

One of the simplest characterisations of the Pareto distribution, when used to model the distribution of income, is that the proportion of the population whose income exceeds any positive number x is:

(x_m / x)^α

Where:

x_m is the wealth of the poorest people (the minimum income)
α is the Pareto index

Since a proportion must be between 0 and 1 inclusive, the index α must be positive.

For example, to find the proportion of the population whose income exceeds $80,000 per year where the minimum income is $33,750 per year and the index is 4/3:

(33,750 / 80,000)^(4/3) ≈ 0.316

This indicates that approximately 31.6% of the population is earning more than $80,000 per year.
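
The income example can be reproduced with a short Python sketch. The function name and its parameters are hypothetical and only evaluate the proportion formula given above; they are not part of SuperCROSS.

```python
def pareto_tail_proportion(x, x_min, index):
    """Proportion of the population whose income exceeds x: (x_min / x) ** index."""
    if x_min <= 0 or index <= 0:
        raise ValueError("x_min and index must be positive")
    if x < x_min:
        # Everyone earns at least the minimum income.
        return 1.0
    return (x_min / x) ** index


# Minimum income $33,750, threshold $80,000, Pareto index 4/3.
proportion = pareto_tail_proportion(80_000, 33_750, 4 / 3)
print(f"{proportion:.1%}")
```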

Mean

A measure of the centre of a set of values: the ordinary arithmetic average (the sum of all values divided by the number of values).

Variance

The variance is used as a measure of how far a set of numbers are spread out from each other, describing how far the numbers lie from the mean (expected value).

Variance is calculated as the average of the squared differences from the mean. For example:

Values in data set:

5, 7, 6, 1, 8

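Mean of this data set:

(5 + 7 + 6 + 1 + 8) / 5 = 27 / 5 = 5.4

Squared differences are:

(5 - 5.4)^2 = 0.16, (7 - 5.4)^2 = 2.56, (6 - 5.4)^2 = 0.36, (1 - 5.4)^2 = 19.36, (8 - 5.4)^2 = 6.76

Variance is:

(0.16 + 2.56 + 0.36 + 19.36 + 6.76) / 5 = 29.2 / 5 = 5.84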

Stdev

Standard deviation is a measure of the typical amount by which values in a distribution differ from the mean, regardless of the sign of the difference. Standard deviation is calculated by taking the square root of the variance.

For example, using the example above of the data set 5, 7, 6, 1, 8, the variance is 5.84, so the standard deviation is sqrt(5.84) ≈ 2.417.
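
The variance and standard deviation of the data set 5, 7, 6, 1, 8 can be worked through step by step in Python. The sketch divides by the number of values (population variance), which is the form that gives the 5.84 quoted above; dividing by n - 1 (sample variance) would give a different result.

```python
from statistics import pstdev, pvariance

data = [5, 7, 6, 1, 8]

# Step 1: mean of the data set.
mean = sum(data) / len(data)

# Step 2: squared difference of each value from the mean.
squared_diffs = [(value - mean) ** 2 for value in data]

# Step 3: variance is the average of the squared differences.
variance = sum(squared_diffs) / len(data)

# Step 4: standard deviation is the square root of the variance.
std_dev = variance ** 0.5

print(mean, variance, std_dev)

# Cross-check against the standard library's population statistics.
print(pvariance(data), pstdev(data))
```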

Asymmetry

Measures the lack or absence of balance (symmetry) about the mean.

Skewness

A measure of the asymmetry from the normal distribution in a set of statistical data.

The skewness value can be positive or negative, or even undefined:

A negative skew indicates that the "tail" on the left side of the probability density function is longer than the right side and the bulk of the values (possibly including the median) lie to the right of the mean.
A positive skew indicates that the "tail" on the right side is longer than the left side and the bulk of the values lie to the left of the mean.
A zero value indicates that the values are relatively evenly distributed on both sides of the mean, typically, but not necessarily, implying a symmetric distribution.
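
The sign conventions above can be illustrated with a minimal Python sketch. It uses the moment coefficient of skewness (third central moment divided by the variance raised to the power 1.5), which is an assumption made for illustration; this page does not state which skewness formula SuperCROSS uses. The function name and data sets are hypothetical.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        # All values are equal, so skewness is undefined.
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


left_tailed = moment_skewness([5, 7, 6, 1, 8])    # low outlier: negative skew
symmetric = moment_skewness([1, 2, 3, 4, 5])      # balanced about the mean: zero
right_tailed = moment_skewness([1, 1, 2, 10])     # high outlier: positive skew

print(left_tailed, symmetric, right_tailed)
```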

Forecast

A forecast of a future (or past) value, based on the existing values. See Forecasting for more information about creating a forecast, and how forecasts are calculated.