— statistics, data-science — 1 min read
If you ever have to process an indeterminate length stream of data, it's useful to be able to compute some basic statistics as you go. The key phrase there is as you go - we could easily compute a mean of an arbitrary number of samples $N$
$$ \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i $$
but that means we have to keep $N$ examples in memory! That's $O(N)$ space complexity, which for very a large $N$ is unacceptable.
Fortunately, we can compute the recursive mean given
Then all we need to do is as follows. (Note that the use of $i-1$ is Bessel's correction)
$$ \bar{x}_i = \frac{(i-1) \times \bar{x}_{i-1} + x_i}{i} $$
Starting with $\bar{x}_1 = 0$. Note that we "start" from $1$ to avoid division by zero. Think of it as counting the number of observations so far. We can't calculate anything until we've seen at least one anyway. Simple python example:
1import random; 2mean_prev = 0;3i = 0;45for x in [random.uniform(-10,10) for _ in range(10)]:6 i += 17 mean = ((i-1) * mean_prev + x) / i;8 mean_prev = mean;9 print("[{:d}] x={:.2f}, mean={:.2f}".format(i, x, mean))
1[1] x=-0.55, mean=-0.552[2] x=-4.28, mean=-2.413[3] x=9.66, mean=1.614[4] x=7.60, mean=3.115[5] x=-4.35, mean=1.626[6] x=8.14, mean=2.707[7] x=-8.69, mean=1.088[8] x=-6.55, mean=0.129[9] x=7.41, mean=0.9310[10] x=6.21, mean=1.46
To calculate the variance, we also need the sum of the squares of the of the data points from $0..i$.
$$ \sigma^2_i = \frac{\sum x^2_i}{i} - \bar{x}^2_i $$
Then the standard deviation is simply the square root of the variance.
$$ \sigma_i = \sqrt{\sigma^2_i} $$
It is simple to add this into our python sample:
1import random;2import math;3mean_prev = 0;4sum_sq = 0;5i = 0;67for x in [random.uniform(-10,10) for _ in range(10)]:8 i += 19 mean = ((i-1) * mean_prev + x) / i;10 sum_sq += (x-mean)**2;11 var = sum_sq / i;12 std = math.sqrt(var)13 mean_prev = mean;1415 print("[{:d}] x={:.2f}, mean={:.2f}, std={:.2f}".format(i, x, mean, std))
1[1] x=-9.89, mean=-9.89, std=0.002[2] x=4.05, mean=-2.92, std=4.933[3] x=1.69, mean=-1.38, std=4.404[4] x=-2.57, mean=-1.68, std=3.835[5] x=1.94, mean=-0.96, std=3.676[6] x=6.77, mean=0.33, std=4.267[7] x=9.95, mean=1.71, std=5.028[8] x=-7.99, mean=0.49, std=5.579[9] x=-5.04, mean=-0.12, std=5.5110[10] x=-6.69, mean=-0.78, std=5.55
And that's all there is to calculating the most common stream statistics.
Something I've also had to deal with in multivariate streams is a recursive estimate of the covariance matrix. Fortunately we can do this using the means we have already calculated. I repeat that this provides an estimation of the sample covariance - you will obviously calculate a different covariance matrix if you took all the points and calculated it classically. But if you can fit all points in memory then what's the point of doing this?
$$ C_i(x,y) = C_{i-1} (x,y) \cdot (i-1) + \frac{(x_i - \bar{x}_i)(y_i - \bar{y}_{i-1})}{i} $$
Here is an example Java implementation from a stream clusterer I am writing:
1private double[] totals;2 private double[] mean;3 private double[][] cov;4 private int weight;56... (Rest of class redacted) ...78 /**9 * Adds the example to the cluster by updating the cluster statistics.10 * @param example The next stream example to be added to the cluster.11 */12 public void add(double[] example) {13 weight++;1415 double[] prevMean = Arrays.copyOf(mean, mean.length);1617 for(int i=0;i<example.length;i++) {18 totals[i] += example[i];19 mean[i] = totals[i] / weight;2021 for(int j=0;j<example.length;j++) {22 cov[i][j] = (cov[i][j] * (weight-1) + (example[i] - mean[i]) * (example[j] - prevMean[j]))/weight;23 }24 }25 }