A machine learning (ML) model is rarely ready to be launched into production without tuning. Like bindings on a ski or the knobs and levers in an aircraft cockpit, catastrophe can ensue for those who venture out into the open expanses of AI without all the proper settings baked in prior to launch. That’s why hyperparameter tuning – the science of choosing all the right settings for ML – is a core competency of the data science team at States Title.

But picking the correct hyperparameter settings for a particular ML problem is tricky. Unlike the device settings we’re used to in everyday life – such as turning a volume knob up one notch – even small changes in hyperparameters can produce large and initially counterintuitive changes in model performance.

Fortunately, there’s been a lot of research into “hyperparameter optimization” algorithms that automatically search for the best set of hyperparameters after a little configuration. These algorithms are widely accessible through robust implementations in various Python packages. For example, `hyperopt` is a widely used package that allows data scientists to utilize several powerful algorithms for hyperparameter optimization simply by defining an objective function and declaring a search space.

Relying on these tools, however, can turn hyperparameter selection into a “black box” that makes it hard to explain why a particular set of hyperparameters works best for a problem. One way to overcome the “black box” nature of these algorithms is to visualize the history of hyperparameter settings that were tried during hyperparameter optimization to help identify underlying trends in hyperparameter settings that work well.

In this post, we’ll demonstrate how to create effective interactive visualizations that allow us to understand the relationship between the hyperparameter settings tried during hyperparameter optimization and the resulting model performance. Part 1 of this post will set up a simple hyperparameter optimization example with hyperopt. In part 2, we’ll show how to use `plotly` to create an interactive visualization of the data generated by the hyperparameter optimization in part 1.

This post assumes that the reader is familiar with the concept of hyperparameter optimization. Additionally, although we will create an example hyperparameter optimization to generate the data that we will visualize, we will not go into great detail about this optimization since the intent of this post is not to be a tutorial on hyperopt; check out the hyperopt documentation for a complete tutorial.

For conciseness, code examples will be presented assuming that all necessary imports have already been run. For reference, here is the full set of imports that are needed for the code examples:

from functools import partial
from pprint import pprint

import numpy as np
import pandas as pd
from hyperopt import fmin, hp, space_eval, tpe, STATUS_OK, Trials
from hyperopt.pyll import scope, stochastic
from plotly import express as px
from plotly import graph_objects as go
from plotly import offline as pyo
from sklearn.datasets import load_boston
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
from sklearn.utils import check_random_state

pyo.init_notebook_mode()

Before we can dive into visualization with Plotly, we need to generate some hyperparameter optimization data from hyperopt for us to visualize. There are four key steps that we need to follow to set up a hyperparameter optimization with hyperopt:

- Choosing and loading a dataset
- Declaring the hyperparameter search space
- Defining the objective function
- Running the hyperparameter optimization

We’ll provide a brief description along with example code for each of these steps but we won’t go into great detail to justify specific choices since the goal of this hyperparameter optimization is just to generate data for us to visualize. To run the example notebook, please follow the instructions in the file “README.md” in this directory. Additionally, a static (HTML) version of the notebook with output from running the notebook is available here.

We’ll use the UCI Boston dataset as our example dataset for our hyperparameter optimization.

The features for the UCI Boston dataset are various neighborhood characteristics and the target is the median value of homes in that neighborhood. Scikit-learn provides a convenient wrapper function named “load_boston”. We’ll use this function to load the dataset into a Pandas dataframe as follows:

MEDIAN_HOME_VALUE = "median_home_value"

# Load the boston dataset using sklearn's helper function
boston_dataset = load_boston()

# Convert the data into a Pandas dataframe
data = np.concatenate(
    [boston_dataset["data"], boston_dataset["target"][..., np.newaxis]],
    axis=1,
)
features, target = boston_dataset["feature_names"], MEDIAN_HOME_VALUE
columns = np.concatenate([features, [target]])
boston_dataset_df = pd.DataFrame(data, columns=columns)

Here’s what the first five rows of the dataset look like:

Since the target is continuous for this dataset, we will want to compare a couple of different regression models. We’ll set up the hyperparameter optimization to compare two types of models: random forest regressor and gradient boosting regressor. The random forest regressor will allow hyperopt to tune the number of trees and the max depth of each tree. The gradient boosting regressor will allow hyperopt to tune the learning rate, in addition to the number of trees and max depth of each tree. The following dictionary declares this hyperparameter search space in the format that’s expected by hyperopt:

# Define constant strings that we will use as keys in the "search space" dictionary below. Note that
# the convention I use throughout is to represent the characters in the string with a variable that
# matches that string except that the characters in the variable are capitalized. This convention
# allows us to easily interpret what these variables mean when we encounter them in the code. For
# example, we know that the variable `MODEL` contains the string "model". Following this pattern of
# representing strings with variables allows me to avoid typos when repeatedly using the same string
# throughout the code since typos in variable names will be caught as errors by my linter.
GRADIENT_BOOSTING_REGRESSOR = "gradient_boosting_regressor"
KWARGS = "kwargs"
LEARNING_RATE = "learning_rate"
LINEAR_REGRESSION = "linear_regression"
MAX_DEPTH = "max_depth"
MODEL = "model"
MODEL_CHOICE = "model_choice"
NORMALIZE = "normalize"
N_ESTIMATORS = "n_estimators"
RANDOM_FOREST_REGRESSOR = "random_forest_regressor"
RANDOM_STATE = "random_state"

# Declare the search space for the random forest regressor model.
random_forest_regressor = {
    MODEL: RANDOM_FOREST_REGRESSOR,
    # I define the model parameters as a separate dictionary so that we can feed the parameters into
    # the `__init__` of the model with dictionary unpacking. See the `sample_to_model` function
    # that's defined alongside the objective function to see this in action.
    KWARGS: {
        N_ESTIMATORS: scope.int(
            hp.quniform(f"{RANDOM_FOREST_REGRESSOR}__{N_ESTIMATORS}", 50, 150, 1)
        ),
        MAX_DEPTH: scope.int(
            hp.quniform(f"{RANDOM_FOREST_REGRESSOR}__{MAX_DEPTH}", 2, 12, 1)
        ),
        RANDOM_STATE: 0,
    },
}

# Declare the search space for the gradient boosting regressor model, following the same structure
# as the random forest regressor search space.
gradient_boosting_regressor = {
    MODEL: GRADIENT_BOOSTING_REGRESSOR,
    KWARGS: {
        LEARNING_RATE: scope.float(
            hp.uniform(f"{GRADIENT_BOOSTING_REGRESSOR}__{LEARNING_RATE}", 0.01, 0.15)
        ),  # lower learning rate
        N_ESTIMATORS: scope.int(
            hp.quniform(f"{GRADIENT_BOOSTING_REGRESSOR}__{N_ESTIMATORS}", 50, 150, 1)
        ),
        MAX_DEPTH: scope.int(
            hp.quniform(f"{GRADIENT_BOOSTING_REGRESSOR}__{MAX_DEPTH}", 2, 12, 1)
        ),
        RANDOM_STATE: 0,
    },
}

# Combine both model search spaces with a top level "choice" between the two models to get the final
# search space.
space = {
    MODEL_CHOICE: hp.choice(
        MODEL_CHOICE,
        [random_forest_regressor, gradient_boosting_regressor],
    )
}

For the objective function, we’ll compute the mean squared error for each instance of the dataset via a ten-fold cross validation. We’ll report the average mean squared error across folds as the loss. The following code defines this objective:

# Define a few additional variables to represent strings. Note that this code expects that we have
# access to all variables that we previously defined in the "search space" code snippet.
LOSS = "loss"
STATUS = "status"

# Mapping from string name to model class definition object that we'll use to create an initialized
# version of a model from a sample generated from the search space by hyperopt.
MODELS = {
    GRADIENT_BOOSTING_REGRESSOR: GradientBoostingRegressor,
    RANDOM_FOREST_REGRESSOR: RandomForestRegressor,
}

# Create a scoring function that we'll use in our objective.
mse_scorer = make_scorer(mean_squared_error)

# Helper function that converts from a sample generated by hyperopt to an initialized model. Note
# that because we split the model type and model keyword-arguments into separate key-value pairs in
# the search space declaration, we are able to use dictionary unpacking to create an initialized
# version of the model.
def sample_to_model(sample):
    kwargs = sample[MODEL_CHOICE][KWARGS]
    return MODELS[sample[MODEL_CHOICE][MODEL]](**kwargs)

# Define the objective function for hyperopt. We'll fix the `dataset_df`, `features`, and `target`
# arguments with `functools.partial` to create the version of this function that we will supply as
# an argument to `fmin`.
def objective(sample, dataset_df, features, target):
    model = sample_to_model(sample)
    rng = check_random_state(0)
    # Handle randomization by shuffling when creating folds. In reality, we probably want a better
    # strategy for managing randomization than the fixed `RandomState` instance generated above.
    cv = KFold(n_splits=10, random_state=rng, shuffle=True)
    # Calculate the mean squared error for each fold. Since `n_splits` is 10, `mse` is an array of
    # size 10 with each element representing the average mean squared error for a fold.
    mse = cross_val_score(
        model,
        dataset_df.loc[:, features],
        dataset_df.loc[:, target],
        scoring=mse_scorer,
        cv=cv,
    )
    # Return the average of the mean squared error across all folds.
    return {LOSS: np.mean(mse), STATUS: STATUS_OK}

We’ll run the hyperparameter optimization for one thousand trials by calling the `fmin` function. Importantly, we’ll provide an instance of a `Trials` object in which hyperopt will record the hyperparameter settings of each iteration of the hyperparameter optimization. We will pull the data for the visualization from this `Trials` instance. Running the following code executes the hyperparameter optimization:

# Since we defined our objective function to be generic in terms of the dataset, we need to use
# `partial` from the `functools` module to "fix" the `dataset_df`, `features`, and `target`
# arguments to the values that we want for this example so that we have an objective function that
# takes in only one argument as assumed by the `hyperopt` interface.
boston_objective = partial(
    objective,
    dataset_df=boston_dataset_df,
    features=features,
    target=MEDIAN_HOME_VALUE,
)

# `hyperopt` tracks the results of each iteration in this `Trials` object. We'll be collecting the
# data that we will use for visualization from this object.
trials = Trials()
rng = check_random_state(0)  # reproducibility!

# `fmin` searches for hyperparameters that "minimize" our objective, the mean squared error, and
# returns the "best" set of hyperparameters.
best = fmin(boston_objective, space, tpe.suggest, 1000, trials=trials, rstate=rng)

Hyperopt records the history of hyperparameter settings that are tried during hyperparameter optimization in the instance of the `Trials` object that we provided as an argument to our call to `fmin`. After the optimization completes, we can inspect the `trials` variable to see what settings hyperopt selected for the first five trials as follows:

>>> pprint([t for t in trials][:5])
[{'book_time': datetime.datetime(2020, 10, 26, 6, 41, 56, 177000),
  'exp_key': None,
  'misc': {'cmd': ('domain_attachment', 'FMinIter_Domain'),
           'idxs': {'gradient_boosting_regressor__learning_rate': [],
                    'gradient_boosting_regressor__max_depth': [],
                    'gradient_boosting_regressor__n_estimators': [],
                    'model_choice': [0],
                    'random_forest_regressor__max_depth': [0],
                    'random_forest_regressor__n_estimators': [0]},
           'tid': 0,
           'vals': {'gradient_boosting_regressor__learning_rate': [],
                    'gradient_boosting_regressor__max_depth': [],
                    'gradient_boosting_regressor__n_estimators': [],
                    'model_choice': [0],
                    'random_forest_regressor__max_depth': [5.0],
                    'random_forest_regressor__n_estimators': [90.0]},
           'workdir': None},
  'owner': None,
  'refresh_time': datetime.datetime(2020, 10, 26, 6, 41, 58, 323000),
  'result': {'loss': 16.359897953574603, 'status': 'ok'},
  'spec': None,
  'state': 2,
  'tid': 0,
  'version': 0},
 ...]

As you can see, the `trials` object is essentially a list of dictionaries where each dictionary contains detailed data about one iteration of the hyperparameter optimization. This is not a particularly easy format to manipulate, so we’ll convert the relevant bits of data into a `Pandas` dataframe where each row of the dataframe contains information for one trial:

# This is a simple helper function that allows us to fill in `np.nan` when a particular
# hyperparameter is not relevant to a particular trial.
def unpack(x):
    if x:
        return x[0]
    return np.nan

# We'll first turn each trial into a series and then stack those series together as a dataframe.
trials_df = pd.DataFrame([pd.Series(t["misc"]["vals"]).apply(unpack) for t in trials])

# Then we'll add other relevant bits of information to the correct rows and perform a couple of
# mappings for convenience.
trials_df["loss"] = [t["result"]["loss"] for t in trials]
trials_df["trial_number"] = trials_df.index
trials_df[MODEL_CHOICE] = trials_df[MODEL_CHOICE].apply(
    lambda x: RANDOM_FOREST_REGRESSOR if x == 0 else GRADIENT_BOOSTING_REGRESSOR
)

Let’s take another look at the first five trials in this new format:

This is much more manageable than the list of dictionaries we had before.

One useful way to visualize the trial iterations is to plot the trial number vs. the loss to see whether hyperparameter optimization converged over time as we expect. Using Plotly’s high-level Express interface makes this easy; we simply call the `scatter` method on our dataframe and indicate which columns we want to use as x and y values:

# `px` is an alias for "express" that's created by following the convention of importing "express"
# by running `from plotly import express as px`.
fig = px.scatter(trials_df, x="trial_number", y="loss")

This generates the following plot:

One interesting feature of this plot is that there is a clear separation between the bottom row of dots with “loss” values in the range ten-to-twelve and the rest of the dots. We need more information to understand what causes this separation. One hypothesis is that the separation is caused by different model types; for example, the bottom row of dots might all be gradient boosting regressor models and the rest of the dots might all be random forest regressor models.

Let’s color the dots for each model type differently to see if this is the case by adding a “color” argument to the call of the `scatter` method as follows:

fig = px.scatter(trials_df, x="trial_number", y="loss", color=MODEL_CHOICE)

With this, we get the following plot:

Interestingly, we see that model type does not completely explain the gap between the bottom row of dots and the rest of the dots since gradient boosting regressor models show up in the rest of the dots as well.

We can add more depth to the information in our visualization by making it interactive, so that hovering over each point shows the set of hyperparameters that produced the loss for that point. At first, it looks like we should be able to achieve this by simply including a value for the “hover_data” argument of the `scatter` method. But since we only want to include the hyperparameters that are relevant to each point’s model type, we need to call the `update_traces` method after our call to `scatter` to add the hover data, as this allows us to filter which hyperparameters are shown for each point. Here is what that looks like:

def add_hover_data(fig, df, model_choice):
    # Filter to only columns that are relevant to the current model choice. Note that this relies
    # on the convention of including the model name in the hyperparameter name when we declare the
    # search space.
    cols = [col for col in df.columns if model_choice in col]
    fig.update_traces(
        # This specifies the data that we want to plot for the current model choice.
        customdata=df.loc[df[MODEL_CHOICE] == model_choice, cols + [MODEL_CHOICE]],
        hovertemplate="<br>".join(
            [
                f"{col.split('__')[1]}: %{{customdata[{i}]}}"
                for i, col in enumerate(cols)
            ]
        ),
        # We only apply the hover data for the current model choice.
        selector={"name": model_choice},
    )
    return fig

fig = px.scatter(trials_df, x="trial_number", y="loss", color=MODEL_CHOICE)

# We call the `add_hover_data` function once for each model type so that we can add different sets
# of hyperparameters as hover data for each model type.
fig = add_hover_data(fig, trials_df, RANDOM_FOREST_REGRESSOR)
fig = add_hover_data(fig, trials_df, GRADIENT_BOOSTING_REGRESSOR)

Finally, we get the following plot (**recommended:** view the interactive version of this plot here):

What can we tell with this additional detail? Well, hovering over the points corresponding to the best models on the bottom row reveals that the “max_depth” parameter is set to three for each of those points. Additionally, hovering over points outside of that row reveals that the “max_depth” parameter is set to values other than three, such as two, four, and five. This indicates that there may be something special about a “max_depth” of three for our dataset. For example, this may indicate that model performance is primarily driven by three of the features. We will want to further investigate why a “max_depth” of three works so well for our dataset, and we will likely want to set “max_depth” to three for the final model that we build and deploy.
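We can also sanity-check this kind of observation programmatically rather than by hovering alone. The snippet below is a self-contained sketch using a toy stand-in dataframe with made-up loss values (not the real `trials_df`), grouping trials by “max_depth” and comparing the average loss per depth:

```python
import pandas as pd

# Toy stand-in for `trials_df` with hypothetical loss values, for illustration only.
toy_trials_df = pd.DataFrame(
    {
        "gradient_boosting_regressor__max_depth": [2, 3, 3, 4, 5, 3, 2, 4],
        "loss": [14.1, 10.8, 11.2, 13.5, 15.0, 10.5, 14.8, 13.9],
    }
)

# Average loss for each max_depth setting; the depth with the lowest mean loss
# is the one that the hover inspection singled out.
mean_loss = toy_trials_df.groupby("gradient_boosting_regressor__max_depth")["loss"].mean()
best_depth = mean_loss.idxmin()
print(best_depth)  # 3 for this toy data
```

Running the same groupby on the real `trials_df` gives a quick numeric confirmation of what the interactive plot suggests visually.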

Another visualization that can improve our intuition about the hyperparameter settings is a contour plot of the “loss” values in terms of the hyperparameters. Contour plots are particularly powerful because they reveal how the interaction between different hyperparameter settings affects the loss. Generally, we will want to generate a separate contour plot for each pair of hyperparameters. To keep things simple in this case, we’ll fix the value of the “max_depth” hyperparameter to three and plot the “learning_rate” vs. “n_estimators” loss contour for that slice of the data. We can create this contour plot with Plotly by running the following:

# Since max_depth == 3 outperforms other settings, we'll filter to only look at that slice. This
# creates a boolean array that we will use to filter down to relevant rows in the `trials_df`
# dataframe.
max_depth_filter = (trials_df[MODEL_CHOICE] == GRADIENT_BOOSTING_REGRESSOR) & (
    trials_df["gradient_boosting_regressor__max_depth"] == 3
)

# Plotly express does not support contour plots, so we will use `graph_objects` instead.
# `go.Contour` automatically interpolates "z" values for our loss.
fig = go.Figure(
    data=go.Contour(
        z=trials_df.loc[max_depth_filter, "loss"],
        x=trials_df.loc[max_depth_filter, "gradient_boosting_regressor__learning_rate"],
        y=trials_df.loc[max_depth_filter, "gradient_boosting_regressor__n_estimators"],
        contours=dict(
            showlabels=True,  # show labels on contours
            labelfont=dict(size=12, color="white"),  # label font properties
        ),
        colorbar=dict(title="loss", titleside="right"),
        hovertemplate="loss: %{z}<br>learning_rate: %{x}<br>n_estimators: %{y}<extra></extra>",
    )
)
fig.update_layout(
    xaxis_title="learning_rate",
    yaxis_title="n_estimators",
    title={
        "text": "learning_rate vs. n_estimators | max_depth == 3",
        "xanchor": "center",
        "yanchor": "top",
        "x": 0.5,
    },
)

This results in the following plot (**recommended:** view the interactive version of this plot here):

One takeaway from this plot is that we may want to experiment with increasing the maximum value of the “n_estimators” hyperparameter, since the areas of lowest loss appear at the top of this plot.

In this post, we’ve covered how to convert the data contained in the `Trials` object into a Pandas dataframe so that we can easily analyze the history of hyperparameter settings. Once we have the data in a dataframe, we can easily create visualizations that help us build better intuition about why a particular set of hyperparameter settings works best. In particular, we’ve shown that adding depth through a simple interactive visualization with Plotly can reveal a number of interesting trends about the hyperparameter settings that work well for our problem. In addition, we’ve shown that contour plots are useful for indicating adjustments we may want to make to the hyperparameter search space.
