
Recursive statistics for data streams

#statistics #data-science

If you ever have to process a data stream of indeterminate length, it’s useful to be able to compute some basic statistics as you go. The key phrase there is *as you go* - we could easily compute the mean of an arbitrary number of samples $N$

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

but that means we have to keep $N$ examples in memory! That’s $O(N)$ space complexity, which for a very large $N$ is unacceptable.

Recursive mean

Fortunately, we can compute the mean recursively given

  • The previous mean $\bar{x}_{i-1}$
  • The most recent example $x_i$
  • The number of examples observed so far, $i$

Then the update is as follows. (Note that multiplying by $i-1$ simply recovers the previous running total from the previous mean.)

$$\bar{x}_i = \frac{(i-1)\,\bar{x}_{i-1} + x_i}{i}$$

We start from $\bar{x}_0 = 0$ and count observations from $i = 1$, which avoids division by zero. Think of $i$ as counting the number of observations so far - we can’t calculate anything until we’ve seen at least one anyway. Simple Python example:

import random

mean_prev = 0
i = 0

for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    mean_prev = mean
    print("[{:d}] x={:.2f}, mean={:.2f}".format(i, x, mean))
[1] x=-0.55, mean=-0.55
[2] x=-4.28, mean=-2.41
[3] x=9.66, mean=1.61
[4] x=7.60, mean=3.11
[5] x=-4.35, mean=1.62
[6] x=8.14, mean=2.70
[7] x=-8.69, mean=1.08
[8] x=-6.55, mean=0.12
[9] x=7.41, mean=0.93
[10] x=6.21, mean=1.46
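As a sanity check, the recursive mean should agree (up to floating-point error) with the batch mean over the same data. A minimal sketch, using nothing beyond the standard library:

```python
import random

xs = [random.uniform(-10, 10) for _ in range(1000)]

# Recursive mean: O(1) state, just the previous mean and a counter.
mean = 0.0
for i, x in enumerate(xs, start=1):
    mean = ((i - 1) * mean + x) / i

# Batch mean: requires all samples in memory at once.
batch_mean = sum(xs) / len(xs)

assert abs(mean - batch_mean) < 1e-9
```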

Recursive variance and standard deviation

To calculate the variance, we also need a running sum of the squares of the data points $x_1, \dots, x_i$.

$$\sigma^2_i = \frac{1}{i} \sum_{j=1}^{i} x_j^2 - \bar{x}^2_i$$

Then the standard deviation is simply the square root of the variance.

$$\sigma_i = \sqrt{\sigma^2_i}$$

It is simple to add this to our Python sample:

import math
import random

mean_prev = 0
sum_sq = 0
i = 0

for x in [random.uniform(-10, 10) for _ in range(10)]:
    i += 1
    mean = ((i - 1) * mean_prev + x) / i
    sum_sq += x**2                 # running sum of squares
    var = sum_sq / i - mean**2     # E[x^2] - E[x]^2
    std = math.sqrt(var)
    mean_prev = mean

    print("[{:d}] x={:.2f}, mean={:.2f}, std={:.2f}".format(i, x, mean, std))
[1] x=-9.89, mean=-9.89, std=0.00
[2] x=4.05, mean=-2.92, std=6.97
[3] x=1.69, mean=-1.38, std=6.09
[4] x=-2.57, mean=-1.68, std=5.30
[5] x=1.94, mean=-0.96, std=4.96
[6] x=6.77, mean=0.33, std=5.36
[7] x=9.95, mean=1.71, std=6.00
[8] x=-7.99, mean=0.49, std=6.46
[9] x=-5.04, mean=-0.12, std=6.34
[10] x=-6.69, mean=-0.78, std=6.33

And that’s all there is to calculating the most common stream statistics.
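The two loops above fold naturally into a single helper. Below is a minimal sketch (the `StreamStats` name is my own, not from any library) that keeps only the O(1) state - a counter, the running mean and the running sum of squares:

```python
import math

class StreamStats:
    """Tracks mean, variance and standard deviation of a stream in O(1) space."""

    def __init__(self):
        self.i = 0         # number of observations so far
        self.mean = 0.0    # recursive mean
        self.sum_sq = 0.0  # running sum of x^2

    def add(self, x):
        self.i += 1
        self.mean = ((self.i - 1) * self.mean + x) / self.i
        self.sum_sq += x * x

    def variance(self):
        # E[x^2] - E[x]^2; valid once at least one sample has been seen.
        return self.sum_sq / self.i - self.mean ** 2

    def std(self):
        # Guard against tiny negative values caused by floating-point cancellation.
        return math.sqrt(max(self.variance(), 0.0))

s = StreamStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    s.add(x)
print(s.mean, s.variance())  # prints 2.5 1.25
```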

Addendum: Recursive estimate of the covariance

Something I’ve also had to deal with in multivariate streams is a recursive estimate of the covariance matrix. Fortunately, we can do this using the means we have already calculated. To be clear, this provides an estimate of the sample covariance - if you took all the points and calculated the covariance classically, you would get a (slightly) different matrix. But if you can fit all the points in memory, there’s little point in a streaming estimate in the first place.

$$C_i(x,y) = \frac{(i-1)\,C_{i-1}(x,y) + (x_i - \bar{x}_i)(y_i - \bar{y}_{i-1})}{i}$$

Here is an example Java implementation from a stream clusterer I am writing:

    private double[] totals;
    private double[] mean;
    private double[][] cov;
    private int      weight;

... (Rest of class redacted) ...

    /**
     * Adds the example to the cluster by updating the cluster statistics.
     * @param example The next stream example to be added to the cluster.
     */
    public void add(double[] example) {
        weight++;

        // Snapshot the previous means before they are updated below.
        double[] prevMean = Arrays.copyOf(mean, mean.length);

        for (int i = 0; i < example.length; i++) {
            totals[i] += example[i];
            mean[i] = totals[i] / weight;

            for (int j = 0; j < example.length; j++) {
                // One factor uses the updated mean, the other the previous mean.
                cov[i][j] = (cov[i][j] * (weight - 1)
                        + (example[i] - mean[i]) * (example[j] - prevMean[j])) / weight;
            }
        }
    }
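The same update translates directly into Python, which makes it easy to check against a classical batch calculation. Below is a sketch for a 2-D stream (the variable names are mine, not from the Java class above) - note how one factor uses the updated mean and the other the previous mean, exactly as in the inner loop above:

```python
import random

random.seed(0)
points = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]

i = 0
mean_x = mean_y = 0.0
cov = 0.0
for x, y in points:
    i += 1
    prev_mean_y = mean_y
    mean_x = ((i - 1) * mean_x + x) / i
    mean_y = ((i - 1) * mean_y + y) / i
    # One factor uses the updated mean_x, the other the previous mean_y.
    cov = ((i - 1) * cov + (x - mean_x) * (y - prev_mean_y)) / i

# Classical batch calculation over all points at once.
bx = sum(x for x, _ in points) / len(points)
by = sum(y for _, y in points) / len(points)
batch_cov = sum((x - bx) * (y - by) for x, y in points) / len(points)

assert abs(cov - batch_cov) < 1e-9
```

With this form of the update the recursion reproduces the classical population covariance to floating-point precision; the sample covariance with Bessel’s correction ($i-1$ in the denominator) would differ by a factor of $i/(i-1)$.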