Source
- A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
- Hyperparameter tuning in Cloud Machine Learning Engine using Bayesian Optimization
- Hyperparameter tuning on Google Cloud Platform is now faster and smarter
Introduction
Hyperparameters in neural networks are important; they define the network structure and control how the model is updated, through variables such as the learning rate, optimization method, and loss function. It is therefore imperative to find an efficient way of locating the optimal set of hyperparameters.
Naive methods such as grid search are sluggish, as they crawl through the entire search space and therefore waste significant resources. By contrast, random search, which samples the search space randomly, mitigates this problem and is widely used in practice. A drawback of random search, however, is that it does not use information from prior experiments to select the next setting. This is particularly undesirable when the cost of running experiments is high and you want to make an educated decision about which experiment to run next.
Bayesian optimization is an extremely powerful technique when the mathematical form of the objective function is unknown or expensive to compute. The main idea is to maintain a probabilistic model of the objective (often called a surrogate function): starting from a prior over functions, we use the famous Bayes' theorem to compute a posterior distribution conditioned on the data observed so far, and then select good points to try next with respect to this posterior. As the posterior converges to the true objective, we can pinpoint the hyperparameter configuration that yields the highest return.
Much of the efficiency stems from the ability of Bayesian optimization to incorporate prior belief about the problem to help direct the sampling, and to trade off exploration and exploitation of the search space.
Implementation
Goal
The main goal is to optimize a function $f(x)$ (which we will define as the performance as a function of hyperparameter values, e.g. model accuracy) over a compact set $A$ (the search space of the said hyperparameters), written mathematically as:

$$x^* = \arg\max_{x \in A} f(x)$$
For example, if $f(x) = (1-x)e^x$, then $f'(x) = -xe^x$, which vanishes only at $x = 0$; the maximum value $f(0) = 1$ therefore occurs at $x = 0$, and so the arg max is $0$.
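As a quick numeric sanity check, here is a minimal Python sketch (NumPy is an assumed dependency, and the interval $[-2, 2]$ is an arbitrary choice of compact set) that evaluates this toy objective on a dense grid and confirms the arg max:

```python
import numpy as np

# Toy objective from the example above: f(x) = (1 - x) * e^x
f = lambda x: (1 - x) * np.exp(x)

# Dense grid over an (assumed) compact search space A = [-2, 2]
xs = np.linspace(-2.0, 2.0, 100001)
ys = f(xs)

print(f"argmax ~ {xs[np.argmax(ys)]:.4f}, max ~ {ys.max():.4f}")  # ~0.0000, ~1.0000
```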
Exploration-exploitation
Exploration: Pick a point we know little about, which may yield a good or a bad result (the variance there is high).
Exploitation: Pick a well-acquainted point that has historically given fairly good results (the mean there is high), even though it may not be the best possible result available.
A Gaussian process (GP) is a distribution over functions specified by its mean function $m$ and covariance function $k$. At any query point, it returns the mean and variance of a normal distribution over the function's value there.
From the Gaussian process, we know that the predictive distribution (posterior) for the observed value at $x_{t+1}$, given the first $t$ observations $D_t$, is:

$$P(y_{t+1} \mid D_t, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\big)$$

where

$$\mu_t(x_{t+1}) = \mathbf{k}^T K^{-1} \mathbf{y}, \qquad \sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^T K^{-1} \mathbf{k}$$

$\mathbf{y}$ is the $t$-dimensional vector of observed values, and $K$ and $\mathbf{k}$ are:

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \cdots & k(x_t, x_t) \end{bmatrix}, \qquad \mathbf{k} = \begin{bmatrix} k(x_{t+1}, x_1) & k(x_{t+1}, x_2) & \cdots & k(x_{t+1}, x_t) \end{bmatrix}^T$$
A very popular choice for $k$ is the squared exponential function:

$$k(x_i, x_j) = \exp\!\left(-\tfrac{1}{2}\,\|x_i - x_j\|^2\right)$$
The choice of covariance function $k$ is crucial, and the above squared exponential kernel might be slightly naive. It may be beneficial to introduce other parameters in order to control behaviours such as kernel width and smoothness.
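To make these formulas concrete, here is a minimal NumPy sketch of the GP posterior at a set of query points, using the squared exponential kernel extended with a length-scale (kernel width) parameter as suggested above. The helper name `gp_posterior`, the length-scale default, and the jitter term are all illustrative assumptions, not part of any particular library:

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel, extended with a width (length-scale) parameter."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(x_query, X_obs, y_obs, length_scale=1.0, jitter=1e-8):
    """Posterior mean and variance at x_query given the observations D_t = (X_obs, y_obs)."""
    # K is the t x t kernel matrix over the observed points; jitter keeps it invertible.
    K = sq_exp_kernel(X_obs, X_obs, length_scale) + jitter * np.eye(len(X_obs))
    k = sq_exp_kernel(X_obs, x_query, length_scale)  # t x q cross-covariance vectors
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y_obs                         # mu(x)    = k^T K^{-1} y
    var = 1.0 - np.sum(k * (K_inv @ k), axis=0)      # sigma^2(x) = k(x, x) - k^T K^{-1} k
    return mu, np.maximum(var, 0.0)
```

Inverting $K$ directly keeps the sketch close to the formulas above; a practical implementation would use a Cholesky factorization instead. The later sketches in this section reuse this `gp_posterior` helper.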
Acquisition Function
Rather than sampling randomly, Bayesian optimization uses an acquisition function, which automatically trades off exploration and exploitation. At each iteration, the acquisition function is maximized to determine where to sample next from the objective function. The objective is then evaluated at the arg max of the acquisition function, the Gaussian process is updated with the returned value, and the process repeats.
Bayesian optimization uses evidence and prior knowledge to maximize the posterior at each step, so that each new evaluation decreases the distance between the true global maximum and the expected maximum.
Upper confidence bound
Perhaps the simplest acquisition function takes an optimistic view of each point. Given a parameter $\beta$, it assumes the value at the point will be $\beta$ standard deviations above the mean. Mathematically, it is

$$\alpha(x) = \mu(x) + \beta\sigma(x)$$
The equation would become $\alpha(x) = \mu(x) - \beta\sigma(x)$ for lower confidence bound if we were to find the minimum instead.
By varying $\beta$, we can encourage the algorithm to explore or exploit more. Note that the value for $\beta$ is left to the user.
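A minimal sketch of this acquisition, reusing the hypothetical `gp_posterior` helper from the earlier sketch; the default value of $\beta$ is an arbitrary assumption, since the source leaves it to the user:

```python
import numpy as np

def ucb(x_candidates, X_obs, y_obs, beta=2.0):
    """alpha(x) = mu(x) + beta * sigma(x); larger beta encourages more exploration."""
    mu, var = gp_posterior(x_candidates, X_obs, y_obs)  # hypothetical helper from above
    return mu + beta * np.sqrt(var)

# Next sample: the candidate that maximizes the acquisition function, e.g.
# x_next = x_candidates[np.argmax(ucb(x_candidates, X_obs, y_obs))]
```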
Probability of improvement
The main idea behind the probability of improvement acquisition function is that we pick the next point based on the maximum probability of improvement (MPI) with respect to the current maximum.
Note that $y^+$ is the same as $f(x^+)$, $u(x)$ is the acquisition function, and $P(x) = PI(x)$.
In the above figure, the maximum value observed so far, $y^+$, is at $x^+$. The area of the green shaded region gives the probability of improvement at $x_3$. The model predicts a negligible possibility of improvement at $x_1$ or $x_2$, whereas sampling at $x_3$ is likely to produce an improvement over $y^+$.
Formally,

$$PI(x) = P\big(f(x) \ge f(x^+)\big) = \Phi\!\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$$

where $\Phi$ is the standard normal cumulative distribution function (the probability that a random variable $X$ will take a value less than or equal to $x$).
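A minimal sketch of PI under the same assumptions as the earlier sketches (the hypothetical `gp_posterior` helper, SciPy for the normal CDF):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(x_candidates, X_obs, y_obs):
    """PI(x) = Phi((mu(x) - y+) / sigma(x)), where y+ is the best value observed so far."""
    mu, var = gp_posterior(x_candidates, X_obs, y_obs)  # hypothetical helper from above
    sigma = np.sqrt(var)
    y_best = y_obs.max()                                # y+ = f(x+)
    z = (mu - y_best) / np.maximum(sigma, 1e-12)        # guard against sigma = 0
    return norm.cdf(z)
```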
Expected Improvement
The acquisition function Google's Cloud ML Engine uses is called Expected Improvement. It measures the expected increase in the maximum objective value seen over all experiments, given the next point we pick. In contrast, MPI only takes into account the probability of improvement, not the magnitude of improvement, for the next point. Mathematically, as proposed by Mockus, it is

$$EI(x) = \mathbb{E}\big[\max(0,\, Y_{t+1} - y^+)\big]$$
where the expectation is taken over $Y_{t+1}$, which is distributed as the posterior objective value at $x_{t+1}$. In other words, if the objective function were drawn from our posterior distribution, then $EI(x)$ measures the expected increase in the best objective we have found during the entire tuning run if we select $x$ and then stop.
When using a Gaussian process, we can obtain the following analytical expression for EI:

$$EI(x) = \begin{cases} \big(\mu(x) - f(x^+)\big)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}$$

where

$$Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$$

and $\phi$ and $\Phi$ denote the probability density function (PDF) and cumulative distribution function (CDF) of the standard normal distribution:

$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad \Phi(z) = \int_{-\infty}^{z} \phi(t)\, dt$$
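A minimal sketch of the analytical EI under the same assumptions as the earlier sketches (the hypothetical `gp_posterior` helper, SciPy for the normal PDF/CDF):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(x_candidates, X_obs, y_obs):
    """EI(x) = (mu - y+) * Phi(Z) + sigma * phi(Z), with EI = 0 wherever sigma = 0."""
    mu, var = gp_posterior(x_candidates, X_obs, y_obs)  # hypothetical helper from above
    sigma = np.sqrt(var)
    y_best = y_obs.max()                                # y+ = f(x+)
    z = (mu - y_best) / np.maximum(sigma, 1e-12)
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 1e-12, ei, 0.0)            # sigma = 0 case of the formula
```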
Bayesian Optimization Algorithm
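Putting the pieces together, the loop is: fit the GP posterior to the observations so far, maximize the acquisition function to pick the next point, evaluate the objective there, and repeat. Below is a minimal Python sketch of this loop; the toy objective, the candidate grid, and the `expected_improvement` helper from the earlier sketches are illustrative assumptions, not Cloud ML Engine's actual implementation:

```python
import numpy as np

def bayesian_optimization(objective, candidates, n_init=3, n_iter=20, seed=0):
    """Generic BO loop: fit GP posterior, maximize acquisition, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    X_obs = rng.choice(candidates, size=n_init, replace=False)  # initial random design
    y_obs = np.array([objective(x) for x in X_obs])
    for _ in range(n_iter):
        acq = expected_improvement(candidates, X_obs, y_obs)  # sketch from above
        x_next = candidates[np.argmax(acq)]      # maximize the acquisition function
        X_obs = np.append(X_obs, x_next)         # evaluate the objective there
        y_obs = np.append(y_obs, objective(x_next))
    return X_obs[np.argmax(y_obs)], y_obs.max()

# Usage on the toy objective from the Goal section:
# x_best, y_best = bayesian_optimization(lambda x: (1 - x) * np.exp(x),
#                                        np.linspace(-2.0, 2.0, 401))
```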
Result
Interactive Bayesian optimization for material design
In the above figure, at each iteration the user selects, from the two images provided, the one that most resembles the target image.
It can be seen that $\arg\max_{EI}$ requires conspicuously fewer iterations to derive the correct material parameters.
Conclusion
Searching for hyperparameters can be made more efficient through the use of Bayesian optimization. Google's AI Platform hyperparameter tuning service also implements further under-the-hood features that improve quality of life during parameter searching, such as early stopping, where the system stops searching once it determines that a better configuration does not exist, and learning from previously visited parameter sets to speed up the algorithm when expanding the parameter space.
There are, of course, other open issues, such as handling the trade-off between exploration and exploitation in the acquisition function, or the fact that the acquisition function only looks one step ahead (although multi-step lookahead could in principle be implemented through recursion). There may also have been other advancements in recent years that warrant further investigation.