— statistics, data-science — 1 min read
I get quite a lot of enquiries asking for a code example of the titular paper:
Kuncheva, Ludmila I., and William J. Faithfull. "PCA feature extraction for change detection in multidimensional unlabeled data." IEEE Transactions on Neural Networks and Learning Systems 25.1 (2013): 69-80.
So, without further ado, here is a Jupyter notebook hosted on Colab which demonstrates the concept. The core of the Python script (sans the SPLL implementation, which can be found here) is included below in case Google decides to kill Colab at some point in the future.
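For readers who just want to run something end to end, here is a minimal sketch of the SPLL statistic as I understand it from the paper: cluster the reference window with k-means, pool a within-cluster covariance, and score the second window by the mean minimum squared Mahalanobis distance to the cluster means, referred to a chi-squared distribution. This is *not* the linked implementation — the cluster count, the regularisation term, and the two-tailed 0.05 test are my assumptions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans

def SPLL(W1, W2, n_clusters=3):
    """Sketch of the semi-parametric log-likelihood (SPLL) change statistic.

    Returns (change, p, statistic), matching the call signature used below.
    Assumptions: 3 clusters, pooled covariance, two-tailed test at 0.05.
    """
    n, d = W1.shape
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(W1)
    means = km.cluster_centers_

    # Pooled within-cluster covariance of the reference window
    cov = np.zeros((d, d))
    for c in range(n_clusters):
        diff = W1[km.labels_ == c] - means[c]
        cov += diff.T @ diff
    cov /= n
    cov += 1e-6 * np.eye(d)  # regularise so the inverse exists (assumption)
    inv = np.linalg.inv(cov)

    # Minimum squared Mahalanobis distance from each W2 point to a cluster mean
    dists = np.array([min((x - m) @ inv @ (x - m) for m in means) for x in W2])
    stat = dists.mean()

    # Under "no change" the squared distances follow roughly chi^2 with d dof
    p = min(chi2.cdf(stat, d), 1 - chi2.cdf(stat, d))
    return bool(p < 0.05), p, stat
```

The statistic sits near its chi-squared expectation when both windows come from the same distribution, and inflates when they do not — which is what the sliding-window loop below exploits.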
1from sklearn.decomposition import PCA2import numpy as np3import numpy.random as rnd4import matplotlib.pyplot as plt56# Make a stream with a change halfway along7def make_stream(stream_length, features):8 a = rnd.randn(stream_length // 2, 10)9 10 b = np.hstack((rnd.randn(stream_length // 2,5), rnd.randn(stream_length // 2, 5) * 50))11 stream = np.vstack((a, b))12 13 # just adding a bit of artificial variance to the data to avoid empty clusters14 stream *= rnd.randn(features)15 return stream1617stream = make_stream(500, 10)1819window_size = 5020length = len(stream) - (2 * window_size)2122statistic_raw = []23statistic_pca = []2425for i in range((window_size * 2), length):26 27 W1 = stream[i:i+window_size, :]28 W2 = stream[1+i+window_size:1+i+(2*window_size), :]29 30 # This would be much faster with incremental PCA, but this is just PoC...31 pca = PCA()32 pca.fit(W1)33 34 # Keep the least variant features35 feature_indices = np.where(pca.explained_variance_ratio_ < 0.05)[0]36 37 if np.size(feature_indices) < 2:38 # Guard against empty clusters...39 print("WARN: ({:.2f}) Skipping SPLL as it would fail with empty cluster...".format(i / length))40 st_raw = 041 st_pca = 042 else:43 # Transform with W1's coefficients, only keep the least variant features44 W1_PCA = pca.transform(W1)[:, feature_indices.tolist()]45 W2_PCA = pca.transform(W2)[:, feature_indices.tolist()]46 47 change_raw, _, st_raw = SPLL(W1, W2)48 change_pca, _, st_pca = SPLL(W1_PCA, W2_PCA)49 50 if i % 100 == 0: 51 print(".", end="")52 53 statistic_raw.append(st_raw)54 statistic_pca.append(st_pca)55 56# We see that the feature extraction is beneficial: PCA change score is higher 57# before the change and lower after.58plt.plot(np.array(statistic_pca) - np.array(statistic_raw))