— statistics, data-science — 1 min read
I get quite a lot of enquiries asking for a code example of the titular paper:
Kuncheva, Ludmila I., and William J. Faithfull. "PCA feature extraction for change detection in multidimensional unlabeled data." IEEE Transactions on Neural Networks and Learning Systems 25.1 (2013): 69-80.
So, without further ado, here is a Jupyter notebook hosted on Colab which demonstrates the concept. The core of the Python script (sans the SPLL implementation, which can be found here) is included below in case Google decides to kill Colab at some point in the future.
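For readers who just want to run something end to end, here is a minimal sketch of the SPLL statistic as I understand it from the paper: cluster the reference window with k-means, pool a within-cluster covariance, and score the second window by the mean minimum squared Mahalanobis distance to the cluster means, referred to a chi-squared distribution. This is *not* the linked implementation — the cluster count, the regularisation term, and the two-tailed 0.05 test are my assumptions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans

def SPLL(W1, W2, n_clusters=3):
    """Sketch of the semi-parametric log-likelihood (SPLL) change statistic.

    Returns (change, p, statistic), matching the call signature used below.
    Assumptions: 3 clusters, pooled covariance, two-tailed test at 0.05.
    """
    n, d = W1.shape
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(W1)
    means = km.cluster_centers_

    # Pooled within-cluster covariance of the reference window
    cov = np.zeros((d, d))
    for c in range(n_clusters):
        diff = W1[km.labels_ == c] - means[c]
        cov += diff.T @ diff
    cov /= n
    cov += 1e-6 * np.eye(d)  # regularise so the inverse exists (assumption)
    inv = np.linalg.inv(cov)

    # Minimum squared Mahalanobis distance from each W2 point to a cluster mean
    dists = np.array([min((x - m) @ inv @ (x - m) for m in means) for x in W2])
    stat = dists.mean()

    # Under "no change" the squared distances follow roughly chi^2 with d dof
    p = min(chi2.cdf(stat, d), 1 - chi2.cdf(stat, d))
    return bool(p < 0.05), p, stat
```

The statistic sits near its chi-squared expectation when both windows come from the same distribution, and inflates when they do not — which is what the sliding-window loop below exploits.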
1from sklearn.decomposition import PCA2import numpy as np3import numpy.random as rnd4import matplotlib.pyplot as plt56# Make a stream with a change halfway along7def make_stream(stream_length, features):8 a = rnd.randn(stream_length // 2, 10)9 10 b = np.hstack((rnd.randn(stream_length // 2,5), rnd.randn(stream_length // 2, 5) * 50))11 stream = np.vstack((a, b))12 13 # just adding a bit of artificial variance to the data to avoid empty clusters14 stream *= rnd.randn(features)15 return stream1617stream = make_stream(500, 10)1819window_size = 5020length = len(stream) - (2 * window_size)2122statistic_raw = []23statistic_pca = []2425for i in range((window_size * 2), length):26 27 W1 = stream[i:i+window_size, :]28 W2 = stream[1+i+window_size:1+i+(2*window_size), :]29 30 # This would be much faster with incremental PCA, but this is just PoC...31 pca = PCA()32 pca.fit(W1)33 34 # Keep the least variant features35 feature_indices = np.where(pca.explained_variance_ratio_ < 0.05)[0]36 37 if np.size(feature_indices) < 2:38 # Guard against empty clusters...39 print("WARN: ({:.2f}) Skipping SPLL as it would fail with empty cluster...".format(i / length))40 st_raw = 041 st_pca = 042 else:43 # Transform with W1's coefficients, only keep the least variant features44 W1_PCA = pca.transform(W1)[:, feature_indices.tolist()]45 W2_PCA = pca.transform(W2)[:, feature_indices.tolist()]46 47 change_raw, _, st_raw = SPLL(W1, W2)48 change_pca, _, st_pca = SPLL(W1_PCA, W2_PCA)49 50 if i % 100 == 0: 51 print(".", end="")52 53 statistic_raw.append(st_raw)54 statistic_pca.append(st_pca)55 56# We see that the feature extraction is beneficial: PCA change score is higher 57# before the change and lower after.58plt.plot(np.array(statistic_pca) - np.array(statistic_raw))