The Next Frontier of Advanced Analytics
Figure 1: Model Selection. (A) The current dominant approach: the analyst chooses one combination of
options for feature engineering (FE), algorithm selection (AS), and parameter tuning (PT); we call it a
Model Selection Triple (MST). She iterates by modifying the MST, e.g., altering a parameter, or dropping
a feature. (B) Our vision: she groups logically related MSTs, while the system optimizes the computations
and helps manage results across iterations. (C) MSTs act as a unifying abstraction (a “narrow waist”) for
a new class of analytics systems that we call Model Selection Management Systems (MSMS).
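To make the abstraction concrete, here is a minimal sketch (not from the paper) of how a Model Selection Triple might be represented as a plain data structure; the class and field names are illustrative assumptions, not part of any real MSMS.

```python
# Hypothetical sketch: an MST bundles one choice each for feature
# engineering (FE), algorithm selection (AS), and parameter tuning (PT).
from dataclasses import dataclass

@dataclass(frozen=True)
class MST:
    features: frozenset   # FE: the subset of features to use
    algorithm: str        # AS: e.g., "decision_tree"
    params: tuple         # PT: hyper-parameter settings

# One concrete MST from the running decision-tree example:
mst = MST(frozenset({"age", "income"}), "decision_tree", (("height", 5),))
print(mst.algorithm)  # decision_tree
```

Modifying a single field of such a triple (e.g., the height entry in `params`) corresponds to one iteration step in panel (A) of the figure.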
a fixed tree height (PT option). (2) Execution: the system executes the MST to build and evaluate the ML model, typically on top of a data management platform, e.g., an RDBMS or Spark. (3) Consumption: the analyst assesses the results to decide upon the MST for the next iteration, or stops the process. For example, if the tree is too big, she might reduce the height (PT option changed), or drop some features (FE option changed). Even such minor changes in MSTs can cause major changes in accuracy and interpretability, and it is generally hard to anticipate such effects. Thus, analysts evaluate several MSTs using an iterative process.

Alas, most existing ML systems (e.g., [2, 6, 8, 11, 19]) force analysts to explore one MST per iteration. This overburdens the analyst, since she has to perform more iterations. Also, since the system is ignorant of the relationships among the MSTs explored, opportunities to speed up Execution are lost, while Consumption becomes more manual, which causes more pain for analysts. Our vision aims to mitigate these issues by providing more systems support to improve both the effectiveness of analysts and the efficiency of the iterative process of model selection.

Automation Spectrum. A natural first step is to automate some of the analyst’s work by giving the system a more holistic view of the process. For example, the system can evaluate decision trees with all possible heights (PT options), or all subsets of features (FE options). Clearly, such naive brute-force automation might be prohibitively expensive, while the analyst’s expertise might be wastefully ignored. On the other hand, if the system hardcodes only a few sets of MSTs, analysts might deem it too restrictive to be useful. Ideally, what we want is the flexibility to cover a wide spectrum of automation, in which the analyst controls exactly how much automation she desires for her application. This would enable analysts to still use their expertise during Steering, but push much of the “heavy lifting” to the system during Execution and Consumption.

1.1 Our Vision: MSMS to Manage MSTs

We envision a unifying framework that enables analysts to easily explore a set of logically related MSTs per iteration rather than just one MST per iteration. The analyst’s expertise is useful in deciding which MSTs are grouped. Figure 1(B) illustrates our vision. In the figure, the analyst groups multiple values of tree height and feature subsets. As we will explain shortly, this ability to handle a logically related set of MSTs all at once is a simple but powerful unifying abstraction for a new class of analytics systems that aim to support the iterative process of model selection. We call such systems model selection management systems (MSMS).

Iteration in an MSMS. MSTs are too low-level for analysts to enumerate individually. Thus, an MSMS should provide a higher level of abstraction for specifying MSTs. A trivial way is to use for loops. But our vision goes deeper to exploit the full potential of our idea, drawing inspiration from the RDBMS philosophy of handling “queries.” By repurposing three key ideas from the database literature, an MSMS can make model selection significantly easier and faster. (1) Steering: an MSMS should offer a framework of declarative operations that enable analysts to easily group logically related MSTs. For example, the analyst can just “declare” the set of tree heights and feature subsets (projections). The
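The “trivial” for-loop enumeration mentioned above can be sketched as follows; this is a hedged illustration of the running decision-tree example, with made-up feature names and heights, not the interface of any real MSMS.

```python
# Enumerate one logically related group of MSTs: every non-empty
# feature subset (FE) crossed with every candidate tree height (PT),
# holding the algorithm (AS) fixed.
from itertools import combinations

features = ["age", "income", "zipcode"]   # illustrative feature names
heights = [3, 5, 7]                       # illustrative PT options

msts = []
for k in range(1, len(features) + 1):         # FE: subset sizes 1..3
    for subset in combinations(features, k):
        for h in heights:                     # PT: candidate heights
            msts.append((frozenset(subset), "decision_tree", h))

# 7 non-empty subsets x 3 heights = 21 MSTs in this one declared group
print(len(msts))  # 21
```

A declarative MSMS interface would let the analyst state the two sets (heights and feature projections) directly, leaving the system free to optimize how the 21 resulting models are trained and evaluated, rather than materializing the cross product by hand as above.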
Table 1: Major categories of ML systems surveyed, along with examples from both products and research.
It is possible for a system to belong to more than one category, since it could have multiple key goals.
on “model management”-related issues, which involve logistical tasks such as deployment and versioning. A few recent systems aim to tackle one or more of FE, AS, and PT – Columbus, MLBase, DeepDive, and AzureML. However, they either do not abstract the whole process of model selection as we do, or do not aim to support a wide portion of the automation spectrum. We have already discussed Columbus and MLBase in Section 1. DeepDive provides a declarative language to specify factor graph models and aims to make FE easier [17], but it does not address AS and PT. Automation of PT using massive parallelism has also been studied [7]. AzureML provides something similar, and it also aims to make it easier to manage ML workflows for algorithm selection [1]. All these projects provided the inspiration for our vision. We distill their lessons as well as our interactions with analysts into a unifying abstract framework. We also take the logical next step of envisioning a unified MSMS based on our framework to support FE, AS, and PT in an integrated fashion (Figure 1).

5. CONCLUSION

We argue that it is time for the data management community to look beyond just implementing ML algorithms efficiently and help improve the iterative process of model selection, which lies at the heart of using ML for data applications. Our unifying abstraction of model selection triples acts as a basis for designing a new class of analytics systems to manage model selection in a holistic and integrated fashion. By leveraging three key ideas from data management research – declarativity, optimization, and provenance – such model selection management systems could help make model selection easier and faster. This could be a promising direction for interesting and impactful research in data management, as well as its intersection with ML and HCI.

6. REFERENCES

[1] Microsoft Azure ML. studio.azureml.net.
[2] Oracle R Enterprise. www.oracle.com.
[3] SAS Report on Analytics. sas.com/reg/wp/corp/23876.
[4] M. Anderson et al. Brainwash: A Data System for Feature Engineering. In CIDR, 2013.
[5] P. Domingos. A Few Useful Things to Know about Machine Learning. CACM, 2012.
[6] X. Feng et al. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD, 2012.
[7] Y. Ganjisaffar et al. Distributed Tuning of Machine Learning Algorithms Using MapReduce Clusters. In LDMTA, 2011.
[8] A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce. In ICDE, 2011.
[9] I. Guyon et al. Feature Extraction: Foundations and Applications. Springer-Verlag, 2001.
[10] T. Hastie et al. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[11] J. Hellerstein et al. The MADlib Analytics Library or MAD Skills, the SQL. In VLDB, 2012.
[12] S. Kandel et al. Enterprise Data Analysis and Visualization: An Interview Study. IEEE TVCG, 2012.
[13] T. Kraska et al. MLbase: A Distributed Machine-learning System. In CIDR, 2013.
[14] A. Kumar et al. A Survey of the Existing Landscape of ML Systems. UW-Madison CS Tech. Rep. TR1827, 2015.
[15] A. Kumar et al. Learning Generalized Linear Models Over Normalized Data. In SIGMOD, 2015.
[16] A. Kumar et al. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In SIGMOD, 2016.
[17] C. Ré et al. Feature Engineering for Knowledge Base Construction. IEEE Data Engineering Bulletin, 2014.
[18] C. Zhang et al. Materialization Optimizations for Feature Selection Workloads. In SIGMOD, 2014.
[19] Y. Zhang et al. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.