Prescriptive Analytics With Docplex and Pandas: Hugues Juille

Prescriptive Analytics with
DOcplex and pandas

Hugues JUILLE
© 2016 IBM Corporation

Agenda
• What is Prescriptive Analytics?
• Why Python for Prescriptive Analytics?
• DOcplex: What is it?
• Using DOcplex for modelling an Optimization problem
• Using pandas for improved modelling capabilities

What is Prescriptive Analytics?
 Also known as: Decision Optimization
 Prescriptive analytics is about:
 recommending actions,
 based on desired outcomes,
 taking into account:
• specific scenarios,
• limited resources and
• knowledge of past and current events.
 This insight can help organizations make better decisions and have greater control of business outcomes.
[Figure: analytics value vs. difficulty, rising from Descriptive Analytics ("What happened?") through Predictive Analytics ("What will happen?") to Prescriptive Analytics ("How can we make it happen?")]
The Science of Better Decisions
Optimization helps businesses:
• create the best possible plans
• explore alternatives and understand trade-offs
• respond to changes in business operations
Typical questions and trade-offs:
• How to best allocate aircraft and crews?
• What to build, where and when?
• Inventory cost vs. customer satisfaction
• Risk vs. potential reward
• Cost vs. carbon emissions
How does Optimization work?

What is an optimization model?
An optimization model is composed of:
• Decision variables
• Constraints
• An objective function
Solving a model means:
Finding an assignment to decision variables that:
• minimizes or maximizes the objective function,
• subject to meeting all constraints
A Mathematical Programming model:
A Constraint Programming (CP) model:
• Based on higher level constructs:
• Discrete or interval variables
• Rich set of logical, arithmetic or (non-linear) functional constraints over variables
• Dedicated to combinatorial / scheduling problems
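For the Mathematical Programming case, the three ingredients listed above are commonly written in the textbook form of a linear program. This is a generic sketch, not taken from the slides; the symbols x, c, A and b are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{x} \quad    & c^{\top} x   && \text{(objective function)} \\
\text{s.t.} \quad & A x \le b    && \text{(constraints)} \\
                  & x \ge 0      && \text{(decision variables)}
\end{aligned}
```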


Modelling languages for Prescriptive Analytics
 Modelling languages for Prescriptive Analytics: AMPL, GAMS, OPL…
 Enable concise formulations close to mathematical language, with intensive use of matrix representations…
Input data definition
Decision variables: how much to produce for each product
$Production_p \ge 0$
Objective: maximize profit
$\max \sum_{p} Profit_p \times Production_p$
Constraints: demand for components cannot exceed stock
$\forall c, \quad \sum_{p} Demand_{p,c} \times Production_p \le Stock_c$
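To make the production formulation concrete, here is a minimal Python sketch that evaluates it on a tiny instance by exhaustive search. The product names, component names and all numbers are hypothetical, production levels are restricted to small integers so they can be enumerated, and enumeration is used only for illustration; it is not how a solver such as CPLEX works.

```python
from itertools import product as cartesian

# Hypothetical input data (not from the slides).
products = ["desk", "chair"]
components = ["wood", "steel"]
profit = {"desk": 30, "chair": 20}
# demand[p, c]: units of component c needed to build one unit of product p
demand = {("desk", "wood"): 4, ("desk", "steel"): 1,
          ("chair", "wood"): 2, ("chair", "steel"): 2}
stock = {"wood": 20, "steel": 10}


def total_profit(production):
    """Objective: sum over p of Profit_p * Production_p."""
    return sum(profit[p] * production[p] for p in products)


def is_feasible(production):
    """Constraints: for every c, sum over p of Demand_{p,c} * Production_p <= Stock_c."""
    return all(
        sum(demand[p, c] * production[p] for p in products) <= stock[c]
        for c in components
    )


def best_plan(max_units=10):
    """Enumerate integer plans with 0 <= Production_p <= max_units; keep the best feasible one."""
    best, best_value = None, float("-inf")
    for levels in cartesian(range(max_units + 1), repeat=len(products)):
        plan = dict(zip(products, levels))
        if is_feasible(plan) and total_profit(plan) > best_value:
            best, best_value = plan, total_profit(plan)
    return best, best_value


plan, value = best_plan()
print(plan, value)
```

The generator expressions in `total_profit` and `is_feasible` mirror the two summations term by term, which is the property a modelling language is meant to provide.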
Why Python for Prescriptive Analytics?
 Take advantage of Python expressiveness (generators, aggregators,
operator overloading, tuples…).
 Python capabilities make it a viable alternative to specialized modelling languages:
 A single language to create the constraints AND run the workflow.
 Standard libraries with abstract constructs to manipulate vectors, matrices, relational data models…
 Ecosystem, ease of use, proven robustness, data ingestion
 Workflow and mathematical description are part of the language; no memory management
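As an illustration of how operator overloading lets constraints read like mathematics, the following toy sketch builds linear expressions and constraints from ordinary Python operators. The classes `LinExpr`, `Var` and `Constraint` are hypothetical and written only for this example; they are not DOcplex classes.

```python
class LinExpr:
    """A linear expression stored as {variable name: coefficient} plus a constant."""

    def __init__(self, coefs=None, const=0.0):
        self.coefs = dict(coefs or {})
        self.const = const

    def __add__(self, other):
        other = as_expr(other)
        coefs = dict(self.coefs)
        for name, k in other.coefs.items():
            coefs[name] = coefs.get(name, 0.0) + k
        return LinExpr(coefs, self.const + other.const)

    __radd__ = __add__  # lets the built-in sum() start from 0

    def __mul__(self, k):
        # Only multiplication by a plain number is supported in this sketch.
        return LinExpr({n: c * k for n, c in self.coefs.items()}, self.const * k)

    __rmul__ = __mul__

    def __le__(self, rhs):
        return Constraint(self, as_expr(rhs))

    def value(self, assignment):
        return sum(c * assignment[n] for n, c in self.coefs.items()) + self.const


class Var(LinExpr):
    """A named variable: the expression 1.0 * name."""

    def __init__(self, name):
        super().__init__({name: 1.0})
        self.name = name


class Constraint:
    """Represents lhs <= rhs."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def is_satisfied(self, assignment):
        return self.lhs.value(assignment) <= self.rhs.value(assignment)


def as_expr(x):
    return x if isinstance(x, LinExpr) else LinExpr(const=x)


stores = ["s1", "s2", "s3"]
supply = {s: Var("supply_" + s) for s in stores}
# Generator + aggregator + overloaded operators build the constraint object.
ct = sum(2 * supply[s] for s in stores) <= 10
print(ct.is_satisfied({"supply_s1": 1, "supply_s2": 2, "supply_s3": 1}))
```

The expression `sum(...) <= 10` does not evaluate to a boolean; it returns an object describing the constraint, which is the same idea that lets a modelling library record constraints written in plain Python syntax.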

Why Python for Prescriptive Analytics?
 Core Python libraries for scientific computing
 Notebooks: a great technology for prototyping optimization models interactively
 Leverage Big Data tools, such as Apache Spark.


DOcplex: What is it? How to get it?
• Easily formulate your optimization models and solve them with the IBM Decision Optimization on the Cloud solve service or a local CPLEX solver (with no code change).
• Free solve capabilities make it easy to discover this new API: our free cloud trial and our new CPLEX Optimization Studio free Community Edition (aka COS CE). Either one requires only an email address.
• Available through the standard Python pip install, with no need to download anything else or contact anyone at IBM if you go full cloud.
• Search for docplex in your browser to find the docplex PyPI page or documentation.
Comprehensive documentation and resources
 All documentation and resources are available on-line
 Educational: examples / cookbooks for all levels of expertise:

Discovering IBM Decision Optimization technologies…
…Reference manuals for APIs
 Social: community / forums

DOcplex for optimization modelling (MP)
Import DOcplex MP package:
    from docplex.mp.model import Model
Create the container for your model:
    mdl = Model('Warehouse')
Define decision variables (individually or as collections, discrete or continuous):
    x = mdl.continuous_var(name='totDmd')
    supply_vars = mdl.binary_var_matrix(warehouses, stores, 'supply')
Define constraints over variables:
    mdl.add_constraint(supply_vars[w, s] <= open_vars[w])
    mdl.add_constraint(mdl.sum(supply_vars[w, s] for s in stores) <= w.capacity)
Define objective:
    mdl.minimize(total_opening_cost + total_supply_cost)
Solve the model using local CPLEX:
    mdl.solve()
or on the cloud:
    mdl.solve(url=SVC_URL, key=SVC_KEY)
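The fragments above can be assembled into a complete model-building function. The sketch below is one possible assembly, not the author's code: the warehouse and store names, all costs and capacities, the use of dictionaries instead of `w.capacity`, and the "each store has exactly one supplier" constraint are assumptions added to make the example self-contained. It requires the docplex package and only builds the model.

```python
def build_warehouse_model(warehouses, stores, capacity, opening_cost, supply_cost):
    """Build (but do not solve) a small warehouse-location model with DOcplex."""
    from docplex.mp.model import Model  # requires: pip install docplex

    mdl = Model(name="Warehouse")

    # open_vars[w] = 1 if warehouse w is opened
    open_vars = mdl.binary_var_dict(warehouses, name="open")
    # supply_vars[w, s] = 1 if warehouse w supplies store s
    supply_vars = mdl.binary_var_matrix(warehouses, stores, name="supply")

    # Assumption: each store is supplied by exactly one warehouse.
    for s in stores:
        mdl.add_constraint(mdl.sum(supply_vars[w, s] for w in warehouses) == 1)

    for w in warehouses:
        # A closed warehouse cannot supply any store.
        for s in stores:
            mdl.add_constraint(supply_vars[w, s] <= open_vars[w])
        # A warehouse cannot serve more stores than its capacity.
        mdl.add_constraint(mdl.sum(supply_vars[w, s] for s in stores) <= capacity[w])

    total_opening_cost = mdl.sum(opening_cost[w] * open_vars[w] for w in warehouses)
    total_supply_cost = mdl.sum(supply_cost[w, s] * supply_vars[w, s]
                                for w in warehouses for s in stores)
    mdl.minimize(total_opening_cost + total_supply_cost)
    return mdl, open_vars, supply_vars


# Hypothetical data for illustration.
warehouses = ["Bonn", "Paris"]
stores = ["s1", "s2", "s3"]
capacity = {"Bonn": 2, "Paris": 3}
opening_cost = {"Bonn": 20, "Paris": 35}
supply_cost = {(w, s): 10 + 5 * i + j
               for i, w in enumerate(warehouses)
               for j, s in enumerate(stores)}
```

Calling `mdl.solve()` on the returned model additionally needs a CPLEX runtime; when a solution is found, values can be read with `open_vars[w].solution_value`.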

DOcplex and Notebooks for Optimization

Installing DOcplex and configuring your credentials

Easy to download and parse json

Visualizing the input data






Slicing and Aggregate constructs
 Two important constructs to describe complex problems in a compact form:
 Slicing filters: select a subset of items in a multi-dimensional collection
 Aggregate:
• used in combination with slicing,
• build the actual mathematical expression
OPL:
    forall (l in leg_ids, we in weeks)
        leg_teu[l][we] == sum (tv in trans_vars : tv.l.leg_id == l && w[tv.date] == we)
                              trans[tv] * size[tv.eqc];
DOcplex:
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(trans[(tv.leg_id, tv.mot, tv.date, tv.eqc)] * size[tv.eqc]
                                            for tv in trans_vars_list
                                            if tv.leg_id == l and w[tv.date] == we))
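The two constructs can be tried out without a solver. In the sketch below, plain numbers stand in for the `trans` decision variables and the built-in `sum` stands in for `mdl.sum`; the records, the `w` date-to-week mapping and the `size` values are hypothetical sample data.

```python
from collections import namedtuple

Trans = namedtuple("Trans", ["leg_id", "mot", "date", "eqc", "qty"])

# Hypothetical transport records; qty stands in for a decision variable.
trans_list = [
    Trans("CDC-BOR", "Truck", "10/06/16", "DRY-20", 3),
    Trans("CDC-BOR", "Train", "10/06/16", "HIGH-40", 1),
    Trans("CHE-MAR", "Train", "10/06/16", "HIGH-40", 2),
    Trans("CDC-BOR", "Truck", "17/06/16", "DRY-20", 5),
]
w = {"10/06/16": 23, "17/06/16": 24}   # date -> week number
size = {"DRY-20": 1, "HIGH-40": 2}     # size per equipment type
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]

leg_teu = {
    (l, we): sum(tv.qty * size[tv.eqc]                       # aggregate
                 for tv in trans_list
                 if tv.leg_id == l and w[tv.date] == we)     # slicing filter
    for l in leg_ids
    for we in weeks
}
print(leg_teu)
```

Every (leg, week) pair scans the whole list, so the work grows with the number of pairs times the number of records.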

Performance considerations
 Runtime model generation should be as efficient as possible:
 it may be invoked thousands of times when running in production
 large models may involve millions of variables and constraints
 “naïve” translation of slicing/aggregate in Python can be very inefficient when nested loops are involved
 Use pandas for handling slicing on large collections
“pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”
 DOcplex can benefit from the following pandas features:
 Data organized in multi-indexed tables
 Efficient merge operations between tables
 Efficient indexing, filtering and grouping operations on tables
 DataFrame trans_df:
trans      eqc      leg_id   mot    date      week
@trans_01  DRY-20   CDC-BOR  Truck  10/06/16  23
@trans_02  HIGH-40  CHE-MAR  Train  10/06/16  23
…          …        …        …      …         …
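A small sketch of the three pandas features listed above, on a hypothetical table shaped like `trans_df`. Integer quantities stand in for the `trans` decision variables, and the `size_df` table and all values are invented for illustration.

```python
import pandas as pd

trans_df = pd.DataFrame({
    "trans":  [3, 1, 2, 5],
    "eqc":    ["DRY-20", "HIGH-40", "HIGH-40", "DRY-20"],
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR", "CDC-BOR"],
    "mot":    ["Truck", "Train", "Train", "Truck"],
    "week":   [23, 23, 23, 24],
})
size_df = pd.DataFrame({"eqc": ["DRY-20", "HIGH-40"], "teu": [1, 2]})

# Merge: attach the equipment size to every row (join on the shared "eqc" column).
merged = trans_df.merge(size_df, on="eqc", how="left")

# Filtering: a boolean mask selects one (leg, week) slice.
slice_df = merged[(merged["leg_id"] == "CDC-BOR") & (merged["week"] == 23)]

# Multi-indexed table: index rows by (leg_id, week) for direct lookups.
indexed = merged.set_index(["leg_id", "week"]).sort_index()
rows_for_key = indexed.loc[("CDC-BOR", 23)]

# Grouping: total size per (leg_id, week).
teu = (merged["trans"] * merged["teu"]).groupby(
    [merged["leg_id"], merged["week"]]).sum()
print(teu)
```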
 “naïve” slicing:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                                if t.leg_id == l and w[t.date] == we))
    --> Elapsed time: 5875 ms
 Slicing with pandas:
    trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
    for l in leg_ids:
        for we in weeks:
            slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                            for t in slice_df.itertuples()))
    --> Elapsed time: 4681 ms
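`SimpleTimer` is not defined in the slides. A minimal context manager with the same call shape could look like the sketch below; the class body, the meaning given to `print_details` and the `elapsed_ms` attribute are assumptions made for illustration.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for the timer used in the slides."""

    def __init__(self, label, print_details=True):
        self.label = label
        self.print_details = print_details
        self.elapsed_ms = None

    def __enter__(self):
        if self.print_details:
            print("Starting: %s" % self.label)
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        print("--> Elapsed time: %.0f ms" % self.elapsed_ms)
        return False  # never swallow exceptions raised inside the block


with SimpleTimer("TEU EQUATIONS-3", print_details=False) as timer:
    total = sum(i * i for i in range(100000))
```

Wrapping each formulation in the same timer makes the elapsed times directly comparable, provided the model and data are identical between runs.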
 Issue with this formulation:
    for l in leg_ids:
        for we in weeks:
            slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                            for t in slice_df.itertuples()))
 Slicing is calculated inside the nested loops

 cost of creating a pandas Data Frame is incurred at each iteration
 Much better strategy:

 Prepare the results of all slicing filters before entering the nested loops
 This can be done thanks to pandas’ groupby and aggregate operations
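A minimal sketch of that strategy, assuming a hypothetical `trans_df` in which integer quantities stand in for decision variables. A single `groupby` pass produces every (leg, week) slice, so the nested loops only perform dictionary lookups.

```python
import pandas as pd

# Hypothetical data; "trans" would hold decision variables in a real model.
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR", "CDC-BOR"],
    "week":   [23, 23, 23, 24],
    "eqc":    ["DRY-20", "HIGH-40", "HIGH-40", "DRY-20"],
    "trans":  [3, 1, 2, 5],
})
size = {"DRY-20": 1, "HIGH-40": 2}
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]

# One pass over the table: (leg_id, week) -> list of terms for that slice.
terms_by_key = {
    key: [t.trans * size[t.eqc] for t in group.itertuples()]
    for key, group in trans_df.groupby(["leg_id", "week"])
}

# The loops no longer filter the table; a missing key means an empty slice.
leg_teu = {
    (l, we): sum(terms_by_key.get((l, we), []))
    for l in leg_ids
    for we in weeks
}
print(leg_teu)
```

With decision variables in place of the integers, the built-in `sum` would be replaced by `mdl.sum` and each dictionary value would become the right-hand side of a constraint.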

 Prepare all results of slicing beforehand:
    trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
    legWeeksMultiIndex = pd.MultiIndex.from_product([leg_ids, weeks], names=["leg_id", "week"])
    legWeeksMultiIndex_df = pd.DataFrame(legWeeksMultiIndex.values.tolist(),
                                         columns=["leg_id", "week"])
    trans_full_df = legWeeksMultiIndex_df.merge(trans_df, how='left').fillna(0)
    trans_sum_grpby = trans_full_df[['leg_id', 'week', 'result']].groupby(['leg_id', 'week']).\
        aggregate(lambda x: mdl.sum(x.tolist()))
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])
    --> Elapsed time: 2323 ms
 Based on two pandas operations:

 groupby: split dataset into groups
 aggregate: perform a computation on the grouped data
 Re-writing using helper methods for generic patterns:
    trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
    trans_sum_grpby = for_cross_prod_sum_by([leg_ids, weeks], trans_df,
                                            ['leg_id', 'week'], 'result')
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])
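The slides do not show the body of `for_cross_prod_sum_by`. The following is one possible implementation, reconstructed from the explicit version (cross product, left merge, groupby, aggregate); the parameter names and the `sum_fn` argument are assumptions. With DOcplex one would presumably pass `sum_fn=mdl.sum`; here the built-in `sum` and numeric values are used so the sketch runs without a solver.

```python
from itertools import product

import pandas as pd


def for_cross_prod_sum_by(key_lists, df, key_cols, value_col, sum_fn=sum):
    """Sum `value_col` for every key combination, including those absent from `df`."""
    # All key combinations, e.g. every (leg_id, week) pair.
    full_df = pd.DataFrame(list(product(*key_lists)), columns=key_cols)
    # Left merge keeps combinations without data; their value becomes 0.
    merged = full_df.merge(df[key_cols + [value_col]], on=key_cols, how="left")
    merged[value_col] = merged[value_col].fillna(0)
    # One groupby/aggregate pass computes every sum.
    return merged.groupby(key_cols)[[value_col]].aggregate(
        lambda x: sum_fn(x.tolist()))


# Hypothetical data: "result" already holds trans * size for each row.
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR", "CDC-BOR"],
    "week":   [23, 23, 23, 24],
    "result": [3, 2, 4, 5],
})
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]

trans_sum_grpby = for_cross_prod_sum_by(
    [leg_ids, weeks], trans_df, ["leg_id", "week"], "result")
print(trans_sum_grpby)
```

The cross product matters because a (leg, week) pair with no transport rows still needs a constraint; the left merge gives it a sum of zero instead of dropping it.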
 To be compared with initial “naïve” slicing formulation:

    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                            if t.leg_id == l and w[t.date] == we))
 Performance vs readability trade-off

Conclusion
 Python is one of the most relevant tools to easily turn an idea into working code
when dealing with data-wrangling problems, and then visualize their results.
 The exact same code that has been written and tested in a notebook for loading
data, modelling an optimization problem, solving it… can readily be integrated
and executed in a deployed Python environment.
 DOcplex objective: facilitate the diffusion and use of optimization technologies
 DOcplex + pandas: an alternative to specialized modelling languages
 On-going effort for defining “best practices” and patterns to:
 address performance issues
 facilitate the formulation of models that are readable and maintainable

Thank you!
Questions/Answers

Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.

• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
