What is the “right” way to think about parameter tuning in machine learning? Contemporary machine learning models may have thousands of parameters that have to be hand-tuned by the engineer. Current techniques for automating this tuning come in two flavors:
(a) use derivative-free optimization to set the parameters. This reduces the problem to black-box optimization, which in general requires time exponential in the number of parameters. Current “best practice” using Gaussian Processes is often not much better than pure random search.
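To make the black-box framing concrete, here is a minimal sketch of pure random search, the baseline mentioned above. The objective `black_box` and the two hyperparameters (`lr`, `reg`) are hypothetical stand-ins for a real validation loss; only the search loop itself is the point.

```python
import random

def black_box(params):
    # Hypothetical objective: a stand-in for validation loss as a
    # function of two hyperparameters (learning rate, regularization).
    lr, reg = params["lr"], params["reg"]
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def random_search(objective, space, n_trials=100, seed=0):
    """Pure random search: sample each hyperparameter uniformly from
    its range and keep the best configuration seen so far."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

space = {"lr": (0.0, 1.0), "reg": (0.0, 0.1)}
best, loss = random_search(black_box, space)
```

The exponential cost is visible here: covering each of d parameters to a fixed resolution requires a number of samples that grows exponentially in d, and the search exploits no structure of the pipeline being tuned.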
(b) combine multiple models using statistical aggregation techniques. This reduces the problem to model selection. The resulting models are often large, unwieldy, and uninterpretable (like random forests).
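A minimal sketch of the aggregation flavor, assuming a toy regression task and a deliberately crude base model: bagging fits many models on bootstrap resamples and averages them. The size/interpretability complaint above is visible even here, since the "model" becomes a collection of fits rather than one object.

```python
import random
import statistics

# Toy data: y = 2x plus Gaussian noise (assumed purely for illustration).
rng = random.Random(0)
data = [(x / 10, 2 * x / 10 + rng.gauss(0, 0.1)) for x in range(50)]

def fit_mean_slope(sample):
    # A deliberately crude base model: estimate the slope as the
    # average of y/x over the sample (skipping x == 0).
    return statistics.mean(y / x for x, y in sample if x != 0)

def bagged_slopes(data, n_models=25):
    """Bagging: fit each base model on a bootstrap resample of the
    data, then aggregate the ensemble by averaging its predictions."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        models.append(fit_mean_slope(sample))
    return models

models = bagged_slopes(data)
ensemble_slope = statistics.mean(models)
```

Even in this toy case, explaining the ensemble's behavior requires reasoning about all 25 fits at once, which is the interpretability cost the text notes with random forests.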
Both methods ignore the structure of machine learning design, where pipelines are built stage-wise as a DAG, and both ignore concerns about the stability of the end-to-end pipeline. Are we simply thinking about this problem incorrectly? What other structure and modeling can we take advantage of when optimizing machine learning models end-to-end?
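The stage-wise DAG structure mentioned above can be sketched as follows. The stages, their dependency edges, and the per-stage configs are all hypothetical; the point is that each stage carries its own knobs, so the tuning space factors along the DAG rather than being one flat black box.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each stage is a function plus its own
# hyperparameter config; `deps` encodes the stage-wise DAG.
stages = {
    "load":      (lambda data, cfg: data, {}),
    "scale":     (lambda data, cfg: [x * cfg["factor"] for x in data],
                  {"factor": 0.5}),
    "threshold": (lambda data, cfg: [x for x in data if x > cfg["cut"]],
                  {"cut": 1.0}),
}
deps = {"scale": {"load"}, "threshold": {"scale"}}

def run_pipeline(data, stages, deps):
    """Execute stages in topological order. The per-stage configs are
    exactly the knobs an end-to-end tuner would have to search over."""
    for name in TopologicalSorter(deps).static_order():
        fn, cfg = stages[name]
        data = fn(data, cfg)
    return data

result = run_pipeline([1, 2, 3, 4, 5], stages, deps)
```

A tuner aware of this structure could, for instance, hold upstream stages fixed while searching a downstream stage's config, rather than resampling the full joint space on every trial.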