Recursive statistics for data streams
If you ever have to process a data stream of indeterminate length, it’s useful to be able to compute some basic statistics as you go. The key phrase there is as you go - we could easily compute the mean of an arbitrary number of samples, but that means we have to keep every sample in memory! That’s $O(n)$ space complexity, which for very large $n$ is unacceptable.
Recursive mean
Fortunately, we can compute the mean recursively given
- The previous mean $\bar{x}_{n-1}$
- The most recent sample $x_n$
- The number of samples observed so far, $n$

Then all we need to do is apply the update

$$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n}$$

starting with $\bar{x}_0 = 0$ and $n = 0$. Note that $n$ is incremented before each update, so the first update uses $n = 1$ and we avoid division by zero. Think of $n$ as counting the number of observations so far - we can’t calculate anything until we’ve seen at least one anyway. Simple Python example:
import random

mean_prev = 0
i = 0
for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}".format(i, x, mean))
[1] x=-0.55, mean=-0.55
[2] x=-4.28, mean=-2.41
[3] x=9.66, mean=1.61
[4] x=7.60, mean=3.11
[5] x=-4.35, mean=1.62
[6] x=8.14, mean=2.70
[7] x=-8.69, mean=1.08
[8] x=-6.55, mean=0.12
[9] x=7.41, mean=0.93
[10] x=6.21, mean=1.46
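As a quick sanity check (a standalone sketch of my own, not from the post - the variable names are mine), the recursive update reproduces the classical batch mean to within floating-point rounding:

```python
import random
from statistics import mean as batch_mean

random.seed(0)
xs = [random.uniform(-10, 10) for _ in range(1000)]

# Recursive mean, exactly as in the loop above
m = 0.0
for i, x in enumerate(xs, start=1):
    m = ((i - 1) * m + x) / i

# Agrees with the classical batch mean to within float rounding
assert abs(m - batch_mean(xs)) < 1e-9
```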
Recursive variance and standard deviation
To calculate the variance, we also need the running sum of the squared deviations of the data points from the mean:

$$S_n = S_{n-1} + (x_n - \bar{x}_n)^2, \qquad \sigma_n^2 = \frac{S_n}{n}$$

Dividing by $n$ gives the population variance; dividing by $n - 1$ instead applies Bessel’s correction, which gives an unbiased estimate of the sample variance. Then the standard deviation is simply the square root of the variance.
It is simple to add this to our Python sample:
import random
import math

mean_prev = 0
sum_sq = 0
i = 0
for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    sum_sq += (x - mean) ** 2
    var = sum_sq / i
    std = math.sqrt(var)
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}, std={:.2f}".format(i, x, mean, std))
[1] x=-9.89, mean=-9.89, std=0.00
[2] x=4.05, mean=-2.92, std=4.93
[3] x=1.69, mean=-1.38, std=4.40
[4] x=-2.57, mean=-1.68, std=3.83
[5] x=1.94, mean=-0.96, std=3.67
[6] x=6.77, mean=0.33, std=4.26
[7] x=9.95, mean=1.71, std=5.02
[8] x=-7.99, mean=0.49, std=5.57
[9] x=-5.04, mean=-0.12, std=5.51
[10] x=-6.69, mean=-0.78, std=5.55
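One caveat worth knowing: the loop above squares each deviation against the mean that already includes the new sample, so sum_sq slightly underestimates the true sum of squared deviations (the effect vanishes as $n$ grows). If exactness matters, Welford’s algorithm gives the exact running variance at every step for one extra variable. A sketch (mine, not from the post):

```python
import math
import random

random.seed(1)
xs = [random.uniform(-10, 10) for _ in range(1000)]

# Welford's algorithm: exact running mean and sum of squared deviations
mean = 0.0
m2 = 0.0  # running sum of squared deviations from the current mean
for n, x in enumerate(xs, start=1):
    delta = x - mean           # deviation from the *old* mean
    mean += delta / n
    m2 += delta * (x - mean)   # old and new mean together make this exact

var = m2 / len(xs)             # population variance; divide by len(xs) - 1 for Bessel's correction
std = math.sqrt(var)

# Matches a classical two-pass computation
batch_mean = sum(xs) / len(xs)
batch_var = sum((x - batch_mean) ** 2 for x in xs) / len(xs)
assert abs(var - batch_var) < 1e-9
```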
And that’s all there is to calculating the most common stream statistics.
Addendum: Recursive estimate of the covariance
Something I’ve also had to deal with in multivariate streams is a recursive estimate of the covariance matrix. Fortunately we can do this using the means we have already calculated. Bear in mind what this computes: the divisor is $n$, so you will get a (slightly) different matrix than the Bessel-corrected sample covariance you’d calculate classically over all the points. But if you can fit all the points in memory, then what’s the point of doing this?
Here is an example Java implementation from a stream clusterer I am writing:
private double[] totals;   // per-dimension running sums
private double[] mean;     // current running mean
private double[][] cov;    // running covariance estimate
private int weight;        // number of examples seen so far

... (Rest of class redacted) ...

/**
 * Adds the example to the cluster by updating the cluster statistics.
 * @param example The next stream example to be added to the cluster.
 */
public void add(double[] example) {
    weight++;
    // Keep a copy of the old mean: the covariance update needs both old and new
    double[] prevMean = Arrays.copyOf(mean, mean.length);
    for (int i = 0; i < example.length; i++) {
        totals[i] += example[i];
        mean[i] = totals[i] / weight;
        for (int j = 0; j < example.length; j++) {
            cov[i][j] = (cov[i][j] * (weight - 1)
                    + (example[i] - mean[i]) * (example[j] - prevMean[j])) / weight;
        }
    }
}
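To see what the update above actually computes, here is a line-for-line Python port (mine, not the author’s) checked against a classical batch calculation. The mix of new mean for dimension $i$ and old mean for dimension $j$ is the Welford-style cross term, so the streaming result lands on the batch population covariance (divisor $n$) to within floating-point rounding:

```python
import random

random.seed(2)
dim = 3
xs = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(2000)]

# Streaming update, ported line-for-line from the Java add() above
totals = [0.0] * dim
mean = [0.0] * dim
cov = [[0.0] * dim for _ in range(dim)]
weight = 0

for x in xs:
    weight += 1
    prev_mean = mean[:]  # copy of the old mean, as in Arrays.copyOf
    for i in range(dim):
        totals[i] += x[i]
        mean[i] = totals[i] / weight
        for j in range(dim):
            cov[i][j] = (cov[i][j] * (weight - 1)
                         + (x[i] - mean[i]) * (x[j] - prev_mean[j])) / weight

# Classical batch population covariance for comparison
n = len(xs)
bmean = [sum(x[i] for x in xs) / n for i in range(dim)]
bcov = [[sum((x[i] - bmean[i]) * (x[j] - bmean[j]) for x in xs) / n
         for j in range(dim)] for i in range(dim)]

assert all(abs(cov[i][j] - bcov[i][j]) < 1e-8
           for i in range(dim) for j in range(dim))
```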