ML 115: Dimensionality Reduction (20 pts extra)

What You Need

A computer with a web browser and a Google account (to use Google Colab).

Purpose

To practice dimensionality reduction techniques, including PCA.

Using Google Colab

In a browser, go to
https://colab.research.google.com/
If you see a blue "Sign In" button at the top right, click it and log into a Google account.

From the menu, click File, "New notebook".

Counting Edge Points

Execute these commands to count the points that fall within 0.01 of an edge in hypercubes of various dimensions:
import random

random.seed(999)   # fixed seed so everyone gets the same counts
print("Dimensions \t Edge points out of 1000")
for dimensions in [2, 3, 10, 30, 100, 300, 1000, 3000, 10000]:
  edge_points = 0
  for p in range(1000):            # test 1000 random points per dimensionality
    on_edge = 0
    for x in range(dimensions):
      if random.random() < 0.01:   # this coordinate is within 0.01 of the lower edge
        on_edge = 1
    edge_points += on_edge
  print(dimensions, "\t\t", edge_points)
As shown below, all 1000 of the tested points are on the edge for dimensions of 1000 and higher. That's expected: each coordinate has a 1% chance of landing in the slice the code checks, so the chance that a point misses it in all d dimensions is 0.99^d, which is under 0.005% when d = 1000.

Flag ML 115.1: 200 Dimensions (5 pts)

Repeat the process above, but use 3000 samples and test only 200 dimensions.

Leave the random.seed at 999, to ensure that you get the expected flag value.

The flag is covered by a green rectangle in the image below.
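If it helps, here is a minimal sketch of the modification; only the sample count and the dimension count change, and the seed stays at 999:
import random

random.seed(999)            # keep the seed at 999 so the count matches the expected flag
edge_points = 0
for p in range(3000):       # 3000 sample points instead of 1000
  on_edge = 0
  for x in range(200):      # 200 dimensions
    if random.random() < 0.01:
      on_edge = 1
  edge_points += on_edge
print("Edge points out of 3000:", edge_points)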

Distance Between Points

Execute these commands to find the average distance between random points in hypercubes of various dimensions:
import random, math

random.seed(999)   # fixed seed so everyone gets the same distances
print("Dimensions \t Avg Distance Between Points")
for dimensions in [2, 3, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
  total_distance = 0
  for p in range(100):                 # 100 random pairs of points per dimensionality
    dist2 = 0                          # squared Euclidean distance between one pair
    for x in range(dimensions):
      dist2 += (random.random() - random.random())**2
    total_distance += math.sqrt(dist2)
  formatted_dimensions = "{:,}".format(dimensions)
  print(formatted_dimensions.rjust(9), "\t\t", "{:.2f}".format(total_distance/100.0))
As shown below, the distance grows with dimensionality.
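As a quick sanity check (an added note, not part of the original steps): for two independent uniform random numbers, the expected squared difference is 1/6, so in d dimensions the average distance should be roughly the square root of d/6. This short snippet prints that estimate for the same list of dimensions:
import math

# rough theoretical estimate: average distance is about sqrt(d/6)
for d in [2, 3, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
  print("{:>9,} \t {:.2f}".format(d, math.sqrt(d / 6)))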

Flag ML 115.2: 500,000 Dimensions (5 pts)

Repeat the process above, but use 300 pairs of points and test only 50,000 dimensions.

Make sure you divide the total distance by 300 to get the average distance.

Leave the random.seed at 999, to ensure that you get the expected flag value.

The flag is covered by a green rectangle in the image below.
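If it helps, here is a minimal sketch of the modification, assuming (as the text above says) 300 pairs of points and 50,000 dimensions, with the seed left at 999:
import random, math

random.seed(999)               # keep the seed at 999 so the result matches the expected flag
total_distance = 0
for p in range(300):           # 300 pairs of points instead of 100
  dist2 = 0
  for x in range(50_000):      # 50,000 dimensions
    dist2 += (random.random() - random.random())**2
  total_distance += math.sqrt(dist2)
print("{:.2f}".format(total_distance / 300.0))   # divide by 300 to get the average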

Principal Component Analysis (PCA)

Importing the MNIST Images

Execute these commands to import the handwritten digit images and display some of them:
from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]

import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

plt.figure(figsize=(4, 4))
for idx, image_data in enumerate(X_train[:16]):
    plt.subplot(4, 4, idx + 1)
    plot_digit(image_data)
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
The first few images appear, as shown below.

Choosing the Right Number of Dimensions

These images are 28x28 pixels in grayscale, so there are 784 dimensions. We'll reduce that number using PCA.

Execute these commands to find the number of dimensions needed to fit 95% of the variance:

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)   # cumulative explained variance
d = np.argmax(cumsum >= 0.95) + 1                   # smallest number of components reaching 95%
print(d)
As shown below, the answer is 154 dimensions.
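As a side note, scikit-learn can make this selection for you: passing a fraction between 0 and 1 as n_components tells PCA to keep just enough components to explain that share of the variance. A small sketch of the shortcut:
# equivalent shortcut: let PCA pick the number of components itself
pca_95 = PCA(n_components=0.95)
X_reduced_95 = pca_95.fit_transform(X_train)
print(pca_95.n_components_)    # should also print 154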

Flag ML 115.3: 99% of the Variance (5 pts)

Repeat the process above, but explain 99% of the variance.

The flag is the number of dimensions required.
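A minimal sketch of the change, reusing the cumsum array computed above:
d99 = np.argmax(cumsum >= 0.99) + 1   # smallest number of components reaching 99%
print(d99)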

Using PCA with a Random Forest

Execute these commands to model the MNIST images by first reducing the dimensions with PCA, then classifying using a random forest.

This model uses RandomizedSearchCV to find a good combination of hyperparameters for both PCA and the random forest classifier. The only two hyperparameters tested are pca__n_components and randomforestclassifier__n_estimators.

To speed things up, we use only the first 1,000 training images and perform only 10 search iterations.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# pipeline: reduce dimensions with PCA, then classify with a random forest
clf = make_pipeline(PCA(random_state=42),
                    RandomForestClassifier(random_state=42))
param_distrib = {
    "pca__n_components": np.arange(10, 80),
    "randomforestclassifier__n_estimators": np.arange(50, 500)
}
rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3,
                                random_state=42)
rnd_search.fit(X_train[:1000], y_train[:1000])

print(rnd_search.best_params_)

y_pred = rnd_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)   # renamed so it doesn't shadow the imported function
print(accuracy)
As shown below, the best fit uses 465 estimators and only 23 dimensions, and achieves an accuracy over 85%.
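If you'd rather pull those numbers out of the search object than read them off the printout, best_params_ is an ordinary dictionary keyed by the pipeline step names used above. A small sketch:
# the keys come from make_pipeline's lower-cased step names
best_components = rnd_search.best_params_["pca__n_components"]
best_estimators = rnd_search.best_params_["randomforestclassifier__n_estimators"]
print(best_components, best_estimators)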

PCA for Compression

To see the effect of reducing the original 784 dimensions down to 23, execute this code:
pca = PCA(n_components=23)
X_reduced = pca.fit_transform(X_train)           # project down to 23 dimensions
X_recovered = pca.inverse_transform(X_reduced)   # map back to 784 dimensions

plt.figure(figsize=(7, 4))
for idx, X in enumerate((X_train[::2100], X_recovered[::2100])):
    plt.subplot(1, 2, idx + 1)
    plt.title(["Original", "Compressed"][idx])
    for row in range(5):
        for col in range(5):
            plt.imshow(X[row * 5 + col].reshape(28, 28), cmap="binary",
                       vmin=0, vmax=255, extent=(row, row + 1, col, col + 1))
            plt.axis([0, 5, 0, 5])
            plt.axis("off")
plt.show()
As shown below, there's quite a lot of distortion:
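As an optional check (not required for the flags), you can see how much of the total variance those 23 components keep by summing their explained variance ratios:
print(pca.explained_variance_ratio_.sum())   # fraction of the variance retained by 23 components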

Flag ML 115.4: 3000 Training Samples (5 pts)

Repeat the process above, but change n_iter from 10 to 3, and change the number of training samples used from 1000 to 3000.

The flag is the number of dimensions used, covered by a green rectangle in the image below.
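If it helps, here is a minimal sketch of the two changes; everything else (the pipeline, param_distrib, and random_state) stays the same as above:
rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=3, cv=3,   # 3 iterations instead of 10
                                random_state=42)
rnd_search.fit(X_train[:3000], y_train[:3000])                        # 3000 training samples
print(rnd_search.best_params_)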

References

Chapter 8 - Dimensionality Reduction

Posted 10-6-23
Video added 10-21-23