ML 113: Decision Trees (15 pts extra)

What You Need

A web browser and a Google account, to use Google Colab.

Purpose

To practice building decision tree classifiers and regressors with scikit-learn.

Using Google Colab

In a browser, go to
https://colab.research.google.com/
If you see a blue "Sign In" button at the top right, click it and log into a Google account.

From the menu, click File, "New notebook".

Preparing a Dataset

Execute these commands to create a dataset containing measurements of iris flowers:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target

for i in range(5):
  print(X_iris[i], y_iris[i])
print()
print(y_iris.value_counts())
As shown below, the data consists of 150 instances, 50 in each of the three categories.
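
If you're not sure what the numeric labels 0, 1, and 2 mean, this optional check (using the iris object already in scope) prints the mapping from label to species name:
# Optional: map each numeric label to its species name
for label, name in enumerate(iris.target_names):
    print(label, name)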

Classifying With a Decision Tree

Execute these commands to create a decision tree to fit the iris dataset:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)

from pathlib import Path

IMAGES_PATH = Path() / "images" / "decision_trees"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=str(IMAGES_PATH / "iris_tree.dot"),  # path differs in the book
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
from graphviz import Source

Source.from_file(IMAGES_PATH / "iris_tree.dot")
As shown below, you see a decision tree with five nodes and depth 2.

Notice that the "gini" impurity decreases with each decision.
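
To see where those numbers come from, here is a minimal sketch (assuming y_iris is still in scope) that recomputes the root node's Gini impurity by hand; Gini impurity is 1 minus the sum of the squared class proportions:
import numpy as np

# Gini impurity = 1 - sum of squared class proportions
counts = np.bincount(y_iris)         # [50, 50, 50] at the root node
proportions = counts / counts.sum()
print(f"{1 - np.sum(proportions ** 2):.3f}")  # 0.667, matching the root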

Displaying Predictions

Execute these commands to make a chart showing the training data and the model's predictions:
import numpy as np
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['Red', 'Green', 'DarkMagenta'])

plt.figure(figsize=(8, 4))

lengths, widths = np.meshgrid(np.linspace(0, 7.2, 100), np.linspace(0, 3, 100))
X_iris_all = np.c_[lengths.ravel(), widths.ravel()]
y_pred = tree_clf.predict(X_iris_all).reshape(lengths.shape)
plt.contourf(lengths, widths, y_pred, alpha=0.3, cmap=custom_cmap)
for idx, (name, style) in enumerate(zip(iris.target_names, ("yo", "bs", "g^"))):
    plt.plot(X_iris[:, 0][y_iris == idx], X_iris[:, 1][y_iris == idx],
             style, label=f"Iris {name}")

plt.legend()
plt.show()
As shown below, the model separates the three classes fairly well. The colors approximately match those in the decision tree chart above.
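
As an optional check (assuming tree_clf is still fitted from above), you can classify a single made-up flower; predict returns the class index, and predict_proba returns the class proportions of the training instances in the leaf the sample falls into:
# A made-up flower: petal length 5.0 cm, petal width 1.5 cm
sample = [[5.0, 1.5]]
print(tree_clf.predict(sample))        # predicted class index
print(tree_clf.predict_proba(sample))  # class proportions in that leaf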

Flag ML 113.1: Depth 3 (5 pts)

Repeat the process above, but change max_depth in the second block of code from 2 to 3.
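
Only the constructor call changes; the modified classifier would be created like this:
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_iris, y_iris)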

The flag is the "gini" impurity of one of the leaf nodes, covered by a green rectangle in the image below.

Preparing a Dataset

Execute these commands to create a dataset with a curve and noise:
import numpy as np
import matplotlib.pyplot as plt
import math

np.random.seed(42)
x_data = np.linspace(-10, 10, num=400)
y_data = 0.1*x_data*np.cos(x_data) + 0.1*np.random.normal(size=400)

plt.scatter(x_data, y_data)
plt.grid()
plt.show()

As shown below, the data shows a curve plus noise.

Regression With a Decision Tree

Execute these commands to create a decision tree to fit the curve:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(x_data.reshape(-1,1), y_data)

from pathlib import Path

IMAGES_PATH = Path() / "images" / "decision_trees"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

from sklearn.tree import export_graphviz

export_graphviz(
    tree_reg,
    out_file=str(IMAGES_PATH / "curve_tree.dot"),  # path differs in the book
    rounded=True,
    filled=True
)
from graphviz import Source

Source.from_file(IMAGES_PATH / "curve_tree.dot")
As shown below, you see a decision tree with seven nodes and depth 2.

Notice that the "squared_error" decreases with each decision.
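
To see where those numbers come from, here is a minimal sketch (assuming y_data is still in scope) that recomputes the root node's value and squared_error by hand; for a regression tree, a node's value is the mean of its training targets, and its squared_error is their mean squared deviation from that mean:
import numpy as np

# Root node statistics: all 400 training targets
node_value = y_data.mean()
squared_error = np.mean((y_data - node_value) ** 2)
print(f"value={node_value:.3f}  squared_error={squared_error:.3f}")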

Displaying Predictions

Execute these commands to make a chart showing the training data and the model's predictions:
pred_left = 0 		# Just for the flag

def plot_regression_predictions(tree_reg, X, y, axes=[-10, 10, -1.5, 1.5]):
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel("$x_1$")
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")
    
    global pred_left
    pred_left = y_pred[0]

plot_regression_predictions(tree_reg, x_data, y_data)

print(f"{pred_left:0.5f}")
print()

plt.legend()
plt.show()
As shown below, the model fits the data only crudely: a depth-2 tree has just four leaves, so the prediction is a step function with four levels.

The number at the top is the Y value of the prediction for the leftmost point.
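
A regression tree predicts the mean target value of the training samples in each leaf. As an optional check (assuming tree_reg and the data are still in scope), this sketch uses the tree's apply method to find the leaf containing the leftmost point and recomputes its mean by hand:
X = x_data.reshape(-1, 1)
leftmost_leaf = tree_reg.apply(X[:1])[0]        # leaf id for x = -10
leaf_mask = tree_reg.apply(X) == leftmost_leaf  # training samples in that leaf
print(f"{y_data[leaf_mask].mean():0.5f}")       # should match pred_left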

Flag ML 113.2: Depth 6 (5 pts)

Repeat the process above, but change max_depth in the second block of code from 2 to 6.

The flag is covered by a green rectangle in the image below.

Flag ML 113.3: Unlimited Depth (5 pts)

Repeat the process above, but remove the max_depth argument from the second block of code.

The model severely overfits the data.
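
One way to see the overfitting numerically is this sketch (assuming x_data and y_data from above), which compares leaf counts and training R^2 scores across depths; with no depth limit, the tree grows a leaf for nearly every point and memorizes the noise:
from sklearn.tree import DecisionTreeRegressor

X = x_data.reshape(-1, 1)
for depth in (2, 6, None):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X, y_data)
    print(f"max_depth={depth}: {reg.get_n_leaves()} leaves, "
          f"training R^2 = {reg.score(X, y_data):.3f}")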

The flag is covered by a green rectangle in the image below.

References

Chapter 6 -- Decision Trees, from Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Posted 9-20-23
Typo fixed 9-30-23
Video added 10-21-23
Random seed added 12-12-23
Minor text correction 12-12-23