Recursive statistics for data streams

If you ever have to process a stream of data of indeterminate length, it's useful to be able to compute some basic statistics as you go. The key phrase there is as you go - we could easily compute the mean of an arbitrary number of samples $N$

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i $$

but that means we have to keep all $N$ examples in memory! That's $O(N)$ space complexity, which for a very large $N$ is unacceptable.

Recursive mean

Fortunately, we can compute the recursive mean given

  • The previous mean $\bar{x}_{i-1}$
  • The most recent example $x_i$
  • The number of examples observed so far $i$

Then all we need to do is the following update, where the factor $i-1$ is simply the number of examples that contributed to the previous mean:

$$ \bar{x}_i = \frac{(i-1) \times \bar{x}_{i-1} + x_i}{i} $$

Starting from $\bar{x}_0 = 0$ and $i = 1$. Note that we "start" counting from $1$ to avoid division by zero - we can't calculate anything until we've seen at least one observation anyway. A simple Python example:

import random

mean_prev = 0
i = 0

for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}".format(i, x, mean))
[1] x=-0.55, mean=-0.55
[2] x=-4.28, mean=-2.41
[3] x=9.66, mean=1.61
[4] x=7.60, mean=3.11
[5] x=-4.35, mean=1.62
[6] x=8.14, mean=2.70
[7] x=-8.69, mean=1.08
[8] x=-6.55, mean=0.12
[9] x=7.41, mean=0.93
[10] x=6.21, mean=1.46
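The update above can be rearranged into an equivalent incremental form, $\bar{x}_i = \bar{x}_{i-1} + (x_i - \bar{x}_{i-1}) / i$, which avoids the multiply-then-divide. A minimal sketch (the function name `running_mean` is mine, not from the original) checking it against the batch mean:

```python
import math
import random

def running_mean(xs):
    """Incremental form of the recursive mean: m += (x - m) / i."""
    m = 0.0
    for i, x in enumerate(xs, start=1):
        m += (x - m) / i
    return m

xs = [random.uniform(-10, 10) for _ in range(1000)]
assert math.isclose(running_mean(xs), sum(xs) / len(xs), abs_tol=1e-9)
```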
Recursive variance and standard deviation

To calculate the variance, we also need the running sum of the squares of the data points seen so far:

$$ \sigma^2_i = \frac{1}{i} \sum_{j=1}^{i} x_j^2 - \bar{x}^2_i $$

This is the population variance (we divide by $i$); dividing by $i-1$ instead gives the Bessel-corrected sample variance.

Then the standard deviation is simply the square root of the variance.

$$ \sigma_i = \sqrt{\sigma^2_i} $$

It is simple to add this to our Python sample:

import math
import random

mean_prev = 0
sum_sq = 0
i = 0

for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    sum_sq += x * x
    var = sum_sq / i - mean * mean
    std = math.sqrt(var)
    mean_prev = mean

    print("[{:d}] x={:.2f}, mean={:.2f}, std={:.2f}".format(i, x, mean, std))
[1] x=-9.89, mean=-9.89, std=0.00
[2] x=4.05, mean=-2.92, std=6.97
[3] x=1.69, mean=-1.38, std=6.09
[4] x=-2.57, mean=-1.68, std=5.30
[5] x=1.94, mean=-0.96, std=4.96
[6] x=6.77, mean=0.33, std=5.36
[7] x=9.95, mean=1.71, std=6.00
[8] x=-7.99, mean=0.49, std=6.46
[9] x=-5.04, mean=-0.12, std=6.34
[10] x=-6.69, mean=-0.78, std=6.33

And that's all there is to calculating the most common stream statistics.
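As a sanity check, the one-pass recursion can be wrapped in a function (the name `stream_stats` is mine) and compared against a straightforward batch computation over the same data:

```python
import math
import random

def stream_stats(xs):
    """One pass, O(1) memory: recursive mean plus sum-of-squares variance."""
    mean_prev = 0.0
    sum_sq = 0.0
    mean = var = 0.0
    for i, x in enumerate(xs, start=1):
        mean = ((i - 1) * mean_prev + x) / i
        sum_sq += x * x
        var = sum_sq / i - mean * mean
        mean_prev = mean
    return mean, var

xs = [random.uniform(-10, 10) for _ in range(1000)]
mean, var = stream_stats(xs)

# Batch equivalents, computed with all points in memory
batch_mean = sum(xs) / len(xs)
batch_var = sum((x - batch_mean) ** 2 for x in xs) / len(xs)

assert math.isclose(mean, batch_mean, abs_tol=1e-9)
assert math.isclose(var, batch_var, rel_tol=1e-6, abs_tol=1e-9)
```

The two agree to floating-point precision; the streaming version simply never holds more than a handful of scalars.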

Addendum: Recursive estimate of the covariance

Something I've also had to deal with in multivariate streams is a recursive estimate of the covariance matrix. Fortunately we can do this using the means we have already calculated. Note that, as with the variance above, this is the divide-by-$i$ (population) form; if you took all the points and calculated the Bessel-corrected sample covariance classically, you would get a slightly different matrix. But if you could fit all the points in memory, what would be the point of doing this?

$$
C_i(x,y) = \frac{C_{i-1}(x,y) \cdot (i-1) + (x_i - \bar{x}_i)(y_i - \bar{y}_{i-1})}{i}
$$

Here is an example Java implementation from a stream clusterer I am writing:

    private double[] totals;
    private double[] mean;
    private double[][] cov;
    private int      weight;

... (Rest of class redacted) ...

    /**
     * Adds the example to the cluster by updating the cluster statistics.
     * @param example The next stream example to be added to the cluster.
     */
    public void add(double[] example) {
        weight++;

        double[] prevMean = Arrays.copyOf(mean, mean.length);

        for(int i=0;i<example.length;i++) {
            totals[i] += example[i];
            mean[i] = totals[i] / weight;

            for(int j=0;j<example.length;j++) {
                cov[i][j] = (cov[i][j] * (weight-1) + (example[i] - mean[i]) * (example[j] - prevMean[j]))/weight;
            }
        }
    }
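For reference, here is the same update as a small Python sketch (the function name `add_example` and the batch comparison are my own, not from the original class); it checks the recursive result against the divide-by-$N$ covariance computed classically over the same points:

```python
import math
import random

def add_example(example, mean, cov, n):
    """Recursive update of the mean vector and covariance matrix.
    n is the number of examples seen, including this one."""
    d = len(example)
    prev_mean = mean[:]  # snapshot of the previous means, like Arrays.copyOf
    for i in range(d):
        mean[i] = ((n - 1) * mean[i] + example[i]) / n
    for i in range(d):
        for j in range(d):
            # New mean for dimension i, previous mean for dimension j,
            # mirroring the Java loop above.
            cov[i][j] = (cov[i][j] * (n - 1)
                         + (example[i] - mean[i]) * (example[j] - prev_mean[j])) / n

d = 2
mean = [0.0] * d
cov = [[0.0] * d for _ in range(d)]
data = [[random.uniform(-10, 10) for _ in range(d)] for _ in range(500)]

for n, x in enumerate(data, start=1):
    add_example(x, mean, cov, n)

# Classical divide-by-N covariance with all points in memory
N = len(data)
bmean = [sum(p[i] for p in data) / N for i in range(d)]
bcov = [[sum((p[i] - bmean[i]) * (p[j] - bmean[j]) for p in data) / N
         for j in range(d)] for i in range(d)]

for i in range(d):
    for j in range(d):
        assert math.isclose(cov[i][j], bcov[i][j], rel_tol=1e-6, abs_tol=1e-6)
```

In exact arithmetic the recursion reproduces the divide-by-$N$ covariance; only floating-point rounding separates the two results.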