
Integrating Machine Learning into Data Analysis
and Plant Performance


by

Zachariah Keith Morey


B.S. Chemical Engineering, Brigham Young University (2015)

Submitted to the MIT Sloan School of Management and
Department of Mechanical Engineering
in partial fulfillment of the requirements for the degrees of
Master of Business Administration
and
Master of Science in Mechanical Engineering
in conjunction with the Leaders for Global Operations program
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2021
© Zachariah Keith Morey, 2021. All rights reserved.
The author hereby grants to MIT permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.

Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
MIT Sloan School of Management and
Department of Mechanical Engineering
May 14, 2021
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Arnold Barnett, Thesis Supervisor
George Eastman Professor of Management Science
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jung-Hoon Chun, Thesis Supervisor
Professor of Mechanical Engineering
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Maura Herson
Assistant Dean, MBA Program, MIT Sloan School of Management
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Nicolas Hadjiconstantinou
Chair, Mechanical Engineering Committee on Graduate Students
Integrating Machine Learning into Data Analysis and Plant
Performance
by
Zachariah Keith Morey

Submitted to the MIT Sloan School of Management and
Department of Mechanical Engineering
on May 14, 2021, in partial fulfillment of the
requirements for the degrees of
Master of Business Administration
and
Master of Science in Mechanical Engineering
in conjunction with the Leaders for Global Operations program

Abstract
In the current manufacturing environment, the push for high levels of plant perfor-
mance has led to scrutinizing, optimizing, and improving every step of the manufac-
turing process. While improvements are being made in physical and software tech-
nology that enable advancements like automated robots or additive manufacturing,
data management and analysis continues to be an area of opportunity. Challenges
with data analysis are exacerbated by the ever increasing influx of data from every
point of the product manufacturing process as well as the integration of that data
with legacy and novel equipment, software, and employee capabilities. Identifying
improvements in processing and utilizing data can contribute to a better understand-
ing of the data itself as well as insights to drive improved manufacturing and plant
performance. This thesis shows, drawing from a recent project at Nissan’s Canton,
Mississippi manufacturing facility and utilizing data from a global group of Nissan
manufacturing plants, that machine learning can be applied to plant performance
data to identify and prioritize metrics and to better understand the impact of those
metrics on overall plant performance.
Nissan already benchmarks plant performance between its manufacturing facilities
and uses that to drive improvement and investment opportunities. By examining the
data set used for that benchmarking analysis we gain an understanding of both how
plants have performed in recent history and what successful plants are doing that
contributes to better performance. We then run this data through a linear regression
model and an XGBoost machine learning model to compare how the machine learning
model performs relative to a standard linear regression. We show that while
both models perform well, the machine learning model outperforms the linear regres-
sion model. Specifically, the machine learning model achieves a 10% improvement on
R², with a value of .88, while the linear regression achieves an R² value of .80. In
addition, the machine learning model better handles missing data and shows that the
Design Standard Time Ratio and Delivery Scheduled Time Achievement Ratio are
metrics that need to be prioritized for better plant performance. This thesis argues
that while our project focused on a small benchmarking data set, machine learning
and its benefits can be applied more broadly to data from the manufacturing facilities.
We conclude by presenting some examples and opportunities for how a manufactur-
ing company like Nissan can set up its data, utilize models, and train employees to
take advantage of the growing knowledge base around data management, machine
learning, and plant performance.

Thesis Supervisor: Arnold Barnett


Title: George Eastman Professor of Management Science

Thesis Supervisor: Jung-Hoon Chun


Title: Professor of Mechanical Engineering

Acknowledgments

I owe an enormous thank you to my internship supervisor, Va’Shound Taylor. He
opened up many doors for me and gave me the freedom I needed to explore my ques-
tions and interests around people and data. Despite the enormous time requirements
on him with an unplanned plant shutdown and COVID-19 upsets, he made sure that I
was able to accomplish what I needed and wanted to do while staying flexible enough
for me to work remotely for part of the internship. I would also like to thank my
Advanced Manufacturing Team for welcoming me to their group and including me
in the challenges and opportunities they tackled. The Industrial Engineering team,
particularly Josh Hancock, Jon Dilmore, and Joseph Kerlin opened their doors to
me and spent a significant amount of time walking me through the plant, describing
processes and operations, and teaching me how they improve plant performance. I
learned so much from the many individuals at the Canton, Mississippi plant who
freely gave me their time and attention. Thank you!

I also owe a big thank you to my advisors, Professor Jung-Hoon Chun and Profes-
sor Arnie Barnett. They were both responsive to my questions, interested in finding
ways to truly help the Canton, MS plant, and flexible as I juggled internship and
family needs amidst the pandemic. Their encouragement and support motivated and
helped me immensely.

I am so grateful for my incredible LGO 2021 classmates. They are an incredible
group to have gone through the program with. When I first started considering the
LGO program many alumni told me that my classmates would be one of the most
rewarding things I took from my time at MIT. While I believed them, I had no idea
just how much of an impact and how meaningful my relationships with my classmates
would be. They have supported me and my family, encouraged us, loved us, and been
there for us. I am thrilled that our relationships won’t end once we graduate and look
forward to continuing to stay close to them through our careers and lives.

Finally, and most of all, I want to end with the greatest cheerleaders and champi-
ons I could dream for, my wife Hailey and three kids James, Thomas, and Millie. They
have uprooted themselves repeatedly for this LGO experience and have supported and
loved me throughout the ups and downs of it all. Their trust and excitement have
been humbling and motivate me to do and be better every day. I am truly grateful
that they joined me on this journey and look forward to supporting them as they
have so selflessly supported me. Thank you fam!

Contents

1 Introduction 11
1.1 Project Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Statement of Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Project Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Background on Current State 19


2.1 Automotive Manufacturing . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 The Nissan Company . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Literature Review 27
3.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Understanding Plant Data and Performance 35


4.1 Data at the Plant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Design Standard Time . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Design Standard Time Ratio . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Straight Through Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.5 Schedule Sequence Achievement Ratio . . . . . . . . . . . . . . . . . 44
4.6 Stock in Place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7 Stock in Transit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Stock in Place + Stock in Transit . . . . . . . . . . . . . . . . . . . . 46
4.9 Production Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.10 Plant Reporting Process . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.11 Summary of Data Gathered and Used . . . . . . . . . . . . . . . . . . 48

5 Models 51
5.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Understanding the Data Set . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 XGBoost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6 Model Discussion 69
6.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Machine Learning XGBoost . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Model Comparison and Wrap-up . . . . . . . . . . . . . . . . . . . . 73

7 Summary and Future Work 75


7.1 Summary of Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.1 Data Lake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.2 Additional Models . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.3 Data Analytics Training: . . . . . . . . . . . . . . . . . . . . 80
7.2.4 Prioritizing Metrics . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

A Python Code for Models 83

List of Figures

2-1 Diagram showing the key parts to the various industrial revolutions . 20
2-2 Nissan Production in its early years . . . . . . . . . . . . . . . . . . . 22
2-3 Managing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3-1 Fundamentals of Industry 4.0 and their interactions. Source, Agrawal


et al. “Industry 4.0: Reimagining Manufacturing Operations after
COVID-19.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3-2 Current and future application and opportunities for machine learning
in the manufacturing space. Source: Sharp et al. Journal of Manufac-
turing Systems 48 (2018) 170-179 . . . . . . . . . . . . . . . . . . . . 31

4-1 An example of a DSTR calculation on a manufacturing process . . . . 41


4-2 DSTR benchmarking example from 2006. Note that numbers are for
illustrative purposes only. . . . . . . . . . . . . . . . . . . . . . . . . 42
4-3 Shows 2018 Overall QCTP score rated on 0-5 scale for plants in net-
work. Plant names removed for anonymity and numbers for illustrative
purposes only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5-1 Relationship between global plant performance and number of vehicles


manufactured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5-2 Relationship between global plant performance and Final Straight Through
Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5-3 Relationship between global plant performance and Delivery Scheduled
Time Achievement Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . 55

5-4 Relationship between global plant performance and Design Standard
Time Ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5-5 Relationship between global plant performance and Stock in Place ratio. 57
5-6 Relationship between global plant performance and Stock in Place +
Stock in Transit ratio. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5-7 Performance of linear model in comparing predicted plant score vs
actual plant score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5-8 Illustration on how a boosting algorithm works. From https://www.datacamp.com/community
in-python, accessed 1/1/2021 . . . . . . . . . . . . . . . . . . . . . . 62
5-9 XGBoost Feature Importance chart showing number of times a metric
was used in regression tree analysis . . . . . . . . . . . . . . . . . . . 65
5-10 SHAP value feature importance chart showing the impact a metric has
in model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5-11 SHAP value feature impact chart showing the spread of impact a metric
has in model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7-1 Schematic showing a data lake setup and how it can be accessed and
utilized. From AWS https://aws.amazon.com/big-data/datalakes-and-
analytics/what-is-a-data-lake/ accessed 1/5/21 . . . . . . . . . . . . . 78

Chapter 1

Introduction

Machine Learning refers to the ability to use data and algorithms to build models that
make predictions based on that data. Critically, Machine Learning is set up in such
a way that the computer develops its own approach to producing the prediction
or solution without the need for specific human guidance. Unfortunately, due to the
widespread applications of machine learning and the complex nature behind it, many
misunderstandings surround it and its useful application in industry. As a result,
machine learning is often overpromised as a panacea to the struggles
and challenges that businesses and organizations face around the world. Still, as Ma-
chine Learning continues to mature with organizations developing in house machine
learning capabilities and a foundation to use it well, there are incredible opportu-
nities for insights and efficiencies to be gained through its use across industries and
organizations. In this thesis, we look at how Machine Learning can be applied in a
manufacturing setting, and specifically how it can be used to benchmark data from
various manufacturing plants to identify trends in optimal plant performance. While
the use case of Machine Learning in this thesis is a specific instance of benchmarking
plant data, we will generalize how the Machine Learning framework and fundamentals
can be used in broader application at a plant or organizational level.

1.1 Project Motivation

Manufacturing has gone through several critical developments in recent history. The
first industrial revolution, occurring in the late 1700s to mid 1800s, developed as
new manufacturing technologies and practices enabled a transition away from hand
manufacturing to machines and mass production. That growth in conjunction with
additional developments in the areas of chemicals, metallurgy, and energy use fun-
damentally changed society. From the way people worked to the products and op-
portunities available to the masses, much of the world quickly started adapting to
make use of what the industrial revolution provided. The second industrial revolution
started in the mid to late 1800s as steel development, petroleum development, and
eventually electrification kicked off another wave of growth and globalization. That
globalization continued into the latter half of the 20th century especially as compa-
nies and industries fine tuned their manufacturing processes enabling products to be
manufactured wherever was most convenient and economic for them.

Since that initial explosion in growth that came from globalization and technology
in the first and second industrial revolutions, much of the current continued growth
in industries is attributed to the information revolution where technologies like semi-
conductors and the integrated circuit have led to widespread computer and software
use. This technology and information revolution have enabled a wealth of data to
be collected on every aspect of life, business, and manufacturing. From the complex
manufacturing perspective, this access to data has provided insights and opportu-
nities to improve, but many companies struggle to take full advantage of the data
they gather because of the sheer amount of data that’s constantly gathered, the lack
of both human and software resources to accurately process that data, and the ever
changing technologies that add to the already existing unprocessed data mounds, or
struggle to interact with legacy systems.

1.2 Problem Statement

Society has long imagined a world in which data, analytics, and software seamlessly
combine to generate mammoth improvements in technology and the way we live our
lives. From classic science fiction works like Isaac Asimov’s Foundation series where
psychohistory allows society to plot a course to a better future, to current technology
giants promising artificial intelligent sidekicks that resolve most of life’s pain points,
many of these aspirations remain far from reality. While enormous progress has been
made in technology and software development, many companies today are struggling
to identify how to better use the technology and data available to them.
Manufacturing in general then is struggling with what data to get, how to prioritize
it, and how to use insights that can be gathered from that data to remain competitive
in an increasingly global world where margins are thin and innovation is critical to
survival. While there are plenty of leaders, companies, and technologies that promise
to help incumbents maintain competitiveness through a new product or service, many
companies are already struggling with limited resources and funding for the people,
technologies, or services that could help them manage their data and operations.
Limited resources and funding for new developments coupled with already massive
past investments into probes, sensors, databases, and other technology platforms have
put companies in a difficult position trying to understand what they have that could
work well and what they need in order to remain competitive in today’s complex
landscape.
This thesis aims to answer the following questions:

• Manufacturing plants are already collecting massive amounts of data about the
product, the employees, and the processes. Are there straightforward avenues
where Machine Learning can be applied to help a company identify opportuni-
ties to remain competitive?

• With those avenues identified, how can Machine Learning be used to help a
company effectively prioritize where to focus its limited resources?

• Are there immediate applications where those priorities can be used to gain
insights into the plant and actions that can be taken to make improvements?

1.3 Statement of Hypothesis

Our hypothesis in this project is that with the enormous amounts of data already
being collected at automotive manufacturing facilities, and specifically Nissan, Ma-
chine Learning can be utilized as a tool to gain insights into plant operations and
performance, enabling leadership to make better decisions for the plant, its employ-
ees, and the company overall. While Machine Learning is often touted as a tool to
revolutionize entire plants and companies, this thesis will focus more on its ability to
provide insights that enable improvements over time.
One of the major benefits of Machine Learning at a manufacturing plant level as
well as a broader company level is that many of the principles and frameworks for
machine learning can be applied at micro and macro levels. For example, machine
learning algorithms can be applied to a single manufacturing line looking at something
as specific as defect rates on a particular car panel. Machine learning can also be
applied at a plant or corporate level to comb through the many data streams that
exist at those levels and tease out insights into broader trends. While this thesis
project will specifically focus on a global benchmarking data set that encompasses
plant data from over 20 different manufacturing facilities across the globe, attention
will also be given to how these same principles can be applied at more specific levels
in the organization.
Identifying opportunities to better utilize the data already being collected at a
manufacturing plant or across a corporation is a critical next step in enabling com-
panies to successfully make improvements in this information rich era. While data
and insights come from simply sticking sensors onto as many pieces of equipment as
possible, teams and companies become overwhelmed with the ever growing volume of
data. Machine learning provides a simple and effective way to utilize technology to
parse through much of that data and help companies identify opportunities.

1.4 Project Approach

The first phase of the project involved extensive time spent at Nissan’s Canton, Mis-
sissippi manufacturing assembly plant to understand the process of vehicle assembly.
As cars have thousands of different parts, the manufacturing and assembly process
requires a well coordinated dance of thousands of employees, machines, and stations.
We spoke with many employees to better understand how they worked, how they
identified opportunities for improvements, what they did with mistakes, and what
they needed to continue to improve. Spending time on the manufacturing floor with
front line operators and engineers played a critical role in understanding the current
state of the Nissan facility. We also spent many hours with the industrial engineering
team to better understand how they reduced inefficiencies in the manufacturing pro-
cess and what steps needed to be taken to improve plant operations for the future.
In the automotive industry where margins on vehicles can be very small, it is vital
for a plant to continuously improve in order to remain competitive and a part of the
corporation’s portfolio.

Machine learning was not initially a part of our project scope. Two things led
us to it. The first was the general frustration we heard from operators, engineers,
and managers about the challenges they had dealing with data. With the number
of projects they work on, daily firefighting that needs to be done, and general ad-
ministrative work, very little time is available to comb through data. In addition, it
was often very difficult for employees to access data due to the many locations where
that data was stored and the general lack of knowledge around how to quickly and
efficiently pull or query that data. The second thing that led us to machine learning
was the closure of the plant due to the COVID-19 pandemic. Prior to the plant
closure in mid-March 2020, we were at the plant every day walking the lines and
speaking directly with many of the individuals that were working on plant efficiency
improvements. With the need to work remotely, we were still able to keep in contact
with many of the groups like Industrial Engineering and provide input to the ongoing
projects, but it quickly became evident that helping identify opportunities to better
utilize plant data would be an effective way to continue to contribute remotely.
Part of the hope of this thesis is to illustrate that there are various levels to machine
learning. Coming in as a group fairly new to machine learning, we looked for simple
ways to apply it. A great opportunity arose when we learned that there was a data
set benchmarking the Canton plant and all other global Nissan plants on a
variety of critical metrics. Plants were then ranked against each other based on these
metrics, and those rankings impacted the funding and vehicle allocations that
each plant received. Because Canton was taking on various initiatives to become more
competitive with its global peers, applying machine learning to this benchmarking
data set to identify which variables were the most important to prioritize seemed like
a great opportunity to both demonstrate an application for machine learning and
identify priorities for the plant.
The final steps then were applying various data analysis techniques and machine
learning techniques to identify what provided the most insight into plant performance
in the benchmarking data set. Once we had the machine learning results, we utilized
insights from that to influence a separate project we were leading around implement-
ing dashboard technology in the plant to better capture and visualize critical data
metrics. By combining these two projects, we were able to demonstrate two avenues
the plant could use to better manage the data it was collecting and reduce the amount
of time needed to process and share that data.

1.5 Thesis Structure

The thesis is organized as follows:


Chapter 2 - Background presents an overview of the current state of automo-
tive manufacturing with an additional look into data capture and aggregation in the
manufacturing space. This section also shares insight into the techniques being used
to handle data in the space particularly with regards to machine learning and how
automotive manufacturing is looking to implement such technologies.
Chapter 3 - Literature Review reviews the current academic literature sur-
rounding machine learning in manufacturing settings. Particular attention is paid
to research around using machine learning to improve plant performance or provide
insight into what metrics to prioritize to perform better. This section also looks
into proposed areas of development for big data in manufacturing and what steps
companies need to take to better utilize the technologies available to them.
Chapter 4 - Analysis of Data Usage and Application in a Plant Setting
looks at a deeper level into the various types of data collected in an automotive
manufacturing setting and how that data is used and applied to influence production
and decision making. Particular attention is paid to some of the challenges associated
with the current state of a manufacturing facility and the pain points that it presents.
Chapter 5 - Linear and Machine Learning Models on Benchmarking
Data describes an initial application of a regression analysis and machine learning
analysis on the global plant performance benchmarking data set. The purpose of this
section is to highlight what the two models are doing and compare their results.
Chapter 6 - Discussion on the Models looks at the lessons learned from the
regression and machine learning approaches. The intent is not to establish one over
the other but to highlight some of the benefits, drawbacks, and opportunities for
the models. In addition, it acknowledges the large and evolving space for additional
models and work to be done.
Chapter 7 - Summary and Future Work provides a final review of the project,
lessons learned, and opportunities for additional work in the future. We share ideas
around centralizing data acquisition to enable faster and broader data analysis, con-
sidering additional models, and providing data analytics training to more employees.
We conclude acknowledging the potential for Machine Learning and other data anal-
yses in manufacturing and encourage its study and application.

Chapter 2

Background on Current State

This chapter presents a short background on automotive manufacturing, the Nissan
company, and data in manufacturing. It gives a general description of the current
state of these topics to convey how improved data analytics and machine learning
integration can contribute to the understanding and performance of automotive man-
ufacturing as well as manufacturing more broadly.

2.1 Automotive Manufacturing

To better understand automotive manufacturing and where the industry is today, it
is helpful to first understand the Industrial Revolution. The Industrial Revolution
is often broken down into four distinct stages. Figure 2-1 highlights this breakdown.
The first stage of the Industrial Revolution included developments in the use of water
and steam to power machinery and transportation in the 18th century. The second
stage included the development of assembly lines and fuels that enabled electrification
around the beginning of the 20th century. The third stage developed during the second
half of the 20th century and included automation in manufacturing and associated
computer and electronic technology. The final stage, Industry 4.0, is often described
as a stage in which society is still in the middle of experiencing and includes the
development of cyber-physical systems made possible by the internet.
Each of these monumental shifts in technology and manufacturing capability has

Figure 2-1: Diagram showing the key parts to the various industrial revolutions
Source: Sarah Mclellan at blogs.brighton.uc.uk

significantly impacted the automotive industry. The automotive industry famously
implemented the production line with Ford’s Model T manufacturing starting in
1913 [2], which aligns with the Industry 2.0 shift. Automotive factories implement
significant automation from the body shop where welding of various metal frames
is done to painting where robots can consistently apply precise and even coats of
paint in line with the Industry 3.0 shift. While automation technologies continue to
be improved on and developed, some may be surprised that after decades of work
not all of the automotive manufacturing process is automated. While many areas
like welding or painting have been automated, human assembly remains a significant
player in many assembly operations especially when it comes to the Fit and Finish
areas with items like seats, screws, handholds, and wire harnesses. Because the
Internet of Things and the accompanying technology shift at automotive plants is still
underway for Industry 4.0, more will be discussed on this in the next chapter. Suffice
it to say though that Industry 4.0 has enormous opportunity to help automotive
manufacturers improve the performance of their facilities but the challenge for how
to implement Industry 4.0 in a cost effective and timely manner is significant.

While each of these Industrial stages played a critical role in manufacturing tech-
nology, we would be remiss not to mention the Toyota Production
System and lean manufacturing, which played significant roles in revolutionizing how
manufacturing is done. The Toyota Production System is based on the principles of
automation with a human touch and just in time manufacturing [19]. Both of these
principles highlight streamlining a process and eliminating waste while ensuring that
what is being manufactured is what the customer is looking for. Steven Spear studied
the Toyota Production System and identified that the system "creates a community
of scientists." In addition, Spear writes that the Toyota Production System encom-
passes four major rules:

• "Rule 1 - All work shall be highly specified as to content, sequence, timing, and
outcome.

• Rule 2 - Every customer-supplier connection must be direct, and there must be
an unambiguous yes-or-no way to send requests and receive responses.

• Rule 3 - The pathway for every product and service must be simple and direct.

• Rule 4 - Any improvement must be made in accordance with the scientific
method, under the guidance of a teacher, at the lowest possible level in the
organization" [18, p. 99]

These rules and the culture Toyota has created around the Toyota Production System
have enabled them to excel in vehicle manufacturing, and many of the principles and
practices of the Toyota Production System have been implemented in a wide variety
of industries.
While Toyota has its Toyota Production System, many companies create sim-
ilar production systems and tailor them to meet their needs. For example, Nissan has
created the Alliance Production Way, which has similar lean manufacturing ideas.
The Toyota Production System and other programs like it work with the technolo-
gies the Industrial Revolution provided to create constantly improving manufacturing
practices, better products, and more satisfied customers.

Figure 2-2: Nissan Production in its early years
Source: [16] data from pg. 7

2.2 The Nissan Company

We want to include a brief section on the history of Nissan as our work for this thesis
was done in partnership with them. Nissan has a rich and interesting history and
their products have enriched the lives of the customers they serve, the employees they
work with, and the communities they anchor.

Nissan, originally Nihon Sangyo, was founded by Yoshisuke Aikawa and started
on the Tokyo Stock Exchange as Nissan in 1933 [13]. Nissan quickly grew in its pro-
duction capabilities in its early years and saw large growth in demand for its vehicles.
See Figure 2-2 to see the high initial growth, especially domestically in Japan, and
subsequent slowdown during WWII when some of its manufacturing capacity was
used for wartime efforts.

After WWII, production quickly ramped back up under Katsuji Kawamata and
exceeded 100,000 vehicles in a year in 1956 [16, p. 11]. In the late 1950s, Nissan
started to look to external markets for growth opportunities as well as for markets
that would help it expand and achieve economies of scale. In 1958 Nissan debuted its
first vehicles in the US at the Imported Motor Car Show in Los Angeles. Speaking in
1966, Kawamata stated that expanding into the US helped give Nissan more prestige
while achieving price cuts through mass production of export cars (p. 13). Nissan
quickly saw the need to develop vehicles tailored for the American market as some
of their initial Datsun model cars struggled to keep up with American speeds on US
highways. Road and Track magazine reviewed the Datsun in 1958 and said,
"The performance of the Datsun is best described as melancholy" (p. 18). Despite
the poor initial reviews, interest in Datsun vehicles started growing.

Nissan entered the American market at a critical juncture in consumer taste and
demand. As cars became more affordable and more were purchased, Americans grad-
ually shifted away from the big and bulky vehicles of American auto manufacturers
and started favoring the smaller and more compact vehicles popularized by imported
manufacturers. In 1949 import car sales to the US only accounted for .25% of car
sales, but by 1959 it was 10.1% (p. 28). In 1960 Nissan Motor Corporation U.S.A. was
officially established to better control Nissan’s distribution and dealerships through-
out the country (p. 46). In 1983 Nissan built a manufacturing plant in Smyrna,
Tennessee with the largest manufacturing capacity of any car maker at 640,000
vehicles each year [12]. Nissan’s Canton, Mississippi facility followed in 2003 with a
450,000 annual vehicle capacity; it is where Nissan’s first full-size truck, the Nissan
Titan, was launched and assembled. Between the various vehicle and engine manu-
facturing sites, Nissan is capable of manufacturing 1 million vehicles and 1.4 million
engines each year [11]. Nissan has had to overcome significant challenges over the
decades. Yet its focus on long-term customer relationships, high quality vehicles at a
competitive price, and focus on constant improvement have helped it establish itself
in the US and around the world. Having spent a significant amount of time at its
plants in Tennessee and Mississippi, the impact that Nissan has on the thousands of
employees, customers, and community members it serves is impressive. Nissan will
continue to play an important role in automotive manufacturing and development.

2.3 Big Data

So far we have drilled down from a high level overview of automotive manufacturing
to Nissan history and current state. Now we look at big data in manufacturing.

Many companies are already generating much of the data they need. They have
sensors on most pieces of equipment, methods for tracking parts and quality, and
data across the entire value chain from suppliers to customers. A significant challenge
companies face is how to effectively manage AND utilize all of that information. A
McKinsey study from 2019 illustrates that many of the problems company leaders
used to have with new data infrastructure can be avoided today as systems do not need
to be replaced. Rather, data platforms today can be designed to sit above the existing
infrastructure and interact with them [7]. Examples of this include infrastructure like
a data lake which will be explored more later on in the thesis. Korbel et al. continue
by emphasizing that once data infrastructure is in place, data management becomes
a critical aspect to ensuring digital projects can be successful. See Figure 2-3 below
for their illustration of key steps to managing data.

Helu et al. [5] share the need for improved data management systems and pipelines
that enable better data capture in the manufacturing plant and sorting of that data
into more accessible formats. Helu encourages the exploration and use of improved
data pipeline architecture to capture machine and sensor data from a variety of ma-
chines and sensors by using software like Apache Kafka that then enable cohesive data
sorting and storage for database management and analytics. This allows companies
to better use the data they already have while also allowing for the deployment of
new data sources and processes within the highly varied operations and technologies
at any facility.

While most manufacturing sites are getting the data they need, a comprehensive
strategy for the understanding and management of that data is critical. Successful
strategies will need to incorporate not only up to date technologies and software, but
also appropriate systems of governance for processing and managing the data and
trainings for employees and managers to successfully utilize the associated data and

Figure 2-3: Managing data
Source: [7]

technologies. The benefits of pursuing and implementing better big data practices in
manufacturing are great and will continue to be explored throughout this thesis.
Summary: This chapter has covered some of the history and current state around
automotive manufacturing, the Nissan corporation, and big data. Understanding the
incredible development of technologies, processes, and employee skills throughout the
manufacturing environment is encouraging as continued ingenuity will be required to
take advantage of the data and technologies available today and in the future. These
sections provide a nice foundation for the rest of this thesis that looks into how to
better utilize data generated at automotive manufacturing facilities and some of the
promising machine learning technology that can be applied.

Chapter 3

Literature Review

While artificial intelligence and machine learning are often thought of in reference
to current software technology companies, utilizing AI and ML and improving the
generation and management of data has been a focus for the manufacturing space as
well. Even going back to the days of Henry Ford and his focus on eliminating waste,
better understanding the processes and performance of the manufacturing operation
has led manufacturing companies to continually look for better tools and applications.
In the following sections we survey some of the literature on artificial intelli-
gence and machine learning in the manufacturing space, including its history, current
state, and potential applications.

3.1 History
To better understand where machine learning and data use are in today’s manufacturing
world, it is important to look back at their history. Machine learning applications
have grown significantly in recent years as technology and computational power have
significantly increased the speed at which models can be run and reduced the cost of
computation. In the past servers had to be reserved for use and shared across a large
organization while today many algorithms can be run on personal computers.
Michael Sharp points out that machine learning saw initial use in the 1980s but
industrial adoption was not high because the methods were difficult to implement
and ahead of technology available at the time" [17]. In addition to the challenges
of implementing machine learning methods, the presence of equipment and buildings
not set up for machine learning and big data or industry 4.0 initiatives slowed its
roll out. While software companies have been able to quickly iterate on data and
models that all live on servers and computers, it has taken longer for manufacturing
companies that have machines and layouts running old technologies.

Still, the excitement and focus on machine learning and artificial intelligence in
manufacturing can be traced to the Industry 4.0 concept, which was first introduced
around 2011. Industry 4.0 presented a
cohesive vision for how data, analytics, humans, and technology could all be used
together to bring about the next revolution in manufacturing performance and ca-
pability. While Industry 4.0 first came to light at the industrial trade fair Hannover
Messe in 2011, Pfeiffer et al. point out that many of the concepts and ideas behind
Industry 4.0 surfaced during the financial recession around 2009. Major consulting
groups were researching and writing on factors contributing to the decline and cata-
lysts to recover from it. Pfeiffer mentions that "Ironically, the same consulting firms
that joined the vanguard recommending intentional deindustrialization now point to
the industrial sector not only as the core element of the value chain but also as the
essential prerequisite for the preservation of “high-quality services” in the national or
regional economy" [15].

Since the 2011 conference where Industry 4.0 was first mentioned, it has become
a catchphrase encompassing all kinds of technologies, practices, and approaches to
industry and manufacturing. In 2016 it was even a focus as a part of the World Eco-
nomic Forum agenda. Industry 4.0 has promised grand transformations for companies
and society, and while not all of its promises are as easily realized as many originally
hoped, companies and industries continue to strive to implement the technologies
around it.

3.2 Present

While principles of machine learning and artificial intelligence have been used for
decades, today’s manufacturing environment is seeing significant changes as Industry
4.0 gets deployed more and more broadly.

Sharp et al. (2018) point out that the volume of data produced by manufacturing
systems today continues "rapidly growing beyond the capabilities of traditional algo-
rithms, especially for users who want the most useful information from their data."
The number of data sources, high data storage capacities, and ever increasing sample
rates combine to create enormous and difficult to manage data sets. As Sharp et
al. (2018) continue on to say, "Collected information that is not able to be correctly
interpreted or made useful in a timely manner is rarely even so much as marginally
better than having not collected the data" [17]. Charalambous et al. (2019) mention
in their McKinsey study that while manufacturing was quick to adopt automation
and control systems, adoption has been less rapid in the actual production space.
This can create challenges for companies that depend on operators using their
historical experience and knowledge for decision making, which works well while
those employees are around but creates problems when they retire or when they
encounter abnormal or highly complex scenarios.
tional AI in manufacturing, Charalambous continues, "With respect to operational
improvement and dynamic adaptability, artificial intelligence can outperform conven-
tional decision-support technologies. Also, thanks to new, high-performance software
tools, processing power, and cheap memory, AI enables companies to cost-effectively
create and maintain their own algorithms and intellectual property in-house, which
is cheaper, more versatile, and more adaptive to constantly changing equipment and
market conditions. AI can fully automate complex tasks and provide consistent and
precise optimum set points in autopilot mode. It requires less manpower to main-
tain, and—equally important—it can be adjusted quickly when management revises
manufacturing strategy and production plans" [Charalambous et al.]

Industry 4.0 encompasses all of the pieces of a manufacturing company or en-

Figure 3-1: Fundamentals of Industry 4.0 and their interactions. Source, Agrawal et
al. “Industry 4.0: Reimagining Manufacturing Operations after COVID-19.”

terprise. From resource procurement to finished product delivery and reliability, it


envisions a better way to understand, monitor, and improve every aspect of the en-
tire process. Figure 3-1 shows the various pieces related to Industry 4.0. While many
studies and papers are written about each of these foundational technologies, it is
important to step back and see the whole picture and what techniques like machine
learning and artificial intelligence can do as a part of the bigger picture.

Agrawal et al. (2020) point out that with the COVID-19 pandemic, adoption of
Industry 4.0 technologies and practices continues to be asymmetric as some compa-
nies postpone plans to save cash while others double down to capture benefits
[1]. While a complete overhaul of factories, equipment, and technology is likely impos-
sible for many companies due to cost, timing, and other constraints, there are still
plentiful opportunities to apply principles of Industry 4.0 to prove out technologies
or computational processes. Figure 3-2 shows how machine learning is currently being used

Figure 3-2: Current and future application and opportunities for machine learning in
the manufacturing space. Source: Sharp et al. Journal of Manufacturing Systems 48
(2018) 170-179

in some manufacturing applications but there is significant opportunity to broaden


its use and impact across the entire manufacturing space. Thus, while widespread
5G technologies, for example, may still be several years out in broad manufacturing
space application, companies are taking steps today to implement technologies and
practices into their manufacturing environments.

3.3 Future

While steps towards a more intelligent manufacturing process and Industry 4.0 are
being made, additional progress is needed before many of the goals of Industry 4.0
can be realized. Li et al. (2017) point out that future work needs to encompass:

• Model-driven and collaborative manufacturing including the infrastructure nec-
essary to enable this

• Knowledge-driven efforts to support and enable manufacturing

• Human-machine-material efforts to enable equipment, processes, and practices
to better work together

• Autonomous intelligence application in manufacturing to handle systems and
processes without human interaction

Li recognizes that breakthroughs are still needed in order to make much of this
possible across the full life cycle of manufacturing [9].
Other challenges will need to be overcome as well. For example, Sharp et al. point out
the need to develop ways for new equipment and technology to communicate
with older equipment and technology that still works. We saw some of this impact at
the manufacturing facilities we visited where old equipment still worked well and was
not near end of life, but the data and its format were not conducive to easy automation
of data collection and analysis. A successful transition to a more connected and cloud
based manufacturing system will need to incorporate strategies to either easily update
or upgrade existing equipment or make the value proposition of new equipment and
technology clear and compelling enough to allow cash strapped companies to pivot
sooner rather than later.
While significant benefits await in updating the manufacturing facility to a more
connected state, manufacturing as a whole stands to benefit too from the insights and
learnings that come from connecting the upstream and downstream paths from the
manufacturing facility to the learnings that can be incorporated from suppliers and
customers. Jack et al. (2020) illustrate how advanced analytics of already existing
data pools enable improved visibility into market performance, gaps in customer
loyalty, and results of mitigation efforts on improving that loyalty [6]. This can
directly impact the manufacturing environment as orders can be streamlined and the
plant can be optimized to meet more accurate targets rather than highly fluctuating
forecasts.

The authors of [10] highlight that Industry 4.0 in the automotive space can be strengthened by:

• Encouraging better awareness of technology maturity, which can be done through
better training as well as knowledge sharing

• Identifying and sharing how the technology benefits employees, which will encour-
age wider adoption

• Government involvement in creating a better environment for technology
and standards

3.4 Conclusion
In this chapter we endeavored to share a sampling of literature surrounding machine
learning and artificial intelligence in manufacturing. We shared some of the work
looking into the past, present, and future of the space and highlighted how Industry
4.0 provides a helpful foundation for seeing the current understanding around data,
technology, and practices as well as the bigger picture vision and goals for where
the manufacturing space can grow. Like many industries, the manufacturing space
is seeing significant change as new technologies, algorithms/models, and methods
come together to identify improvements and efficiencies to improve products used all
around the world. While only a couple of decades ago many feared that industry and
manufacturing were in a decline, Industry 4.0 and emerging technologies emphasize
that incredible opportunities and growth await companies and groups that can harness
the data and technologies proliferating across the globe.

Chapter 4

Understanding Plant Data and Performance

In this chapter we explore some of the critical data collected at the plant level and
how it is reported on and analyzed in the decision making process. Examples of
collected data will be shown and some of the challenges associated with local plant
level and global level data gathering and decision making will be discussed as well.
Please note that all data used was generated for illustrative purposes and should not
be construed to reflect the actual plant and company performance.

4.1 Data at the Plant


In a large manufacturing facility that builds hundreds of thousands of vehicles each
year, enormous amounts of data are generated every second. Everything from the
color of the part installed to the employee that performed the installation is tracked.
While some of this data is monitored for deviations and alarms, like defects, much of
it gets stored just in case it is needed at a later date. With thousands of employees,
machines, and vehicles, the complexity of data gathering and utilization is a major
challenge.
One of the early challenges we saw during this project with plant data and report-
ing was the varied nature of what was producing the data. For example, there are a
large variety of machines that are performing jobs throughout the plant and each is
generating data as it works. With the manufacturing site having multiple production
lines, as well as backup or overflow machines, there are often multiple machines doing
similar work. However, one of the challenges with these machines is that they are
often purchased at different times and potentially from different manufacturers. This
means that the way in which they generate data, how it is recorded and stored, and
how easy it is to connect the machine to a centralized database all vary by machine.
In our discussions with line operators and facility engineers, we learned that some
machines are connected to databases that anyone can access at their computer, while
other machines required an employee to go to it and physically download the data to
a storage device in order to then use the data for analysis.
While there is a seemingly infinite amount of data generated at the plant, our
project focused on data that the company used in plant performance analysis and
benchmarking. The next several sections explain the data used in the plant bench-
marking.

4.2 Design Standard Time

The Design Standard Time, DST, is Nissan’s method for measuring work. It is the
optimum time required only to assemble the vehicle and does not account for any
other time factors involved such as part transportation, steps to get to a bin of parts,
time to place a screw, etc. The design standard times are calculated for every task
that is a part of vehicle assembly, and the aggregate design time that is required for a
given section of the line determines the number of workers that section gets. Another
way to look at this is that the greater the Design Standard Time required to perform
a given set of actions like dashboard assembly, the more employees are going to be
needed in that section to perform the work. Because there are many movements and
actions required to assemble a vehicle, it is important to note what operations are
within the scope of design standard time. Below is a list of actions within the scope
of design standard time:

• Assembly, adhesion, connections

• Parts handling, minimum walking

• Alignment and securing

• Fluid filling like gas or oil

Below is another list of actions that fall outside of the scope of design standard
time:

• Production aids

• Most walking, bending, stretching, squatting

• Material handling and double handling of a part. For example, kitting areas
where parts are pre-sorted and then taken to the line where they are rehandled

• Tote and bin handling

• Checks, repairs, inspections

• Equipment wait time

Comparing these lists reiterates that the design standard time is the optimum
time a vehicle could be assembled if any and all waste could be removed from the
vehicle assembly process.
In reality though, employees need to set down a tool like a screwdriver in order to
pick up and set screws. Steps need to be taken in order to move to the next vehicle
on the assembly line or retrieve the next part. Travel time is required for a vehicle
to get from one area to the next. So while the ideal design standard time is virtually
impossible to achieve, it provides a benchmark for how lean a process can be and
encourages groups like the industrial engineers, operators, and management teams to
find ways to reduce waste. As one final example, in the real world, the following steps
happen in assembling a part onto the vehicle:

• Step 1 - Operator walks along the manufacturing line

• Step 2 - Operator picks up tool

• Step 3 - Operator picks up part

• Step 4 - Operator walks to the unit

• Step 5 - Operator installs the part (example: places part into grooves on vehicle)

• Step 6 - Operator secures the part (example: uses screwdriver to secure part
into place)

All of those steps add up to the amount of time used to assemble that specific part.
In the DST world, the only steps that are used in calculating the design standard
time are:

• Step 1 - Operator picks up part

• Step 2 - Operator installs part

• Step 3 - Operator secures part

At the outset of our study, we were confused as to why DST seems to ignore
real world complications like taking steps to move the part from the storage bins to
the vehicle. However, DST looks at all the steps that are truly adding value to the
vehicle. In the words of one of the industrial engineers we worked with, DST looks at
the steps that the customer pays for. The customer is paying for parts to be installed
and secured, not for double handling of the part. While a manufacturing plant will
never achieve the calculated Design Standard Time, the Design Standard Time Ratio
helps us understand the gap between the Design Standard Time and actual assembly
time.
The industrial engineers have an entire manual detailing how DST should be
calculated for each part installation on the car. The standards account for the size
and location of parts and are also adjusted based on the size of the vehicle. For
example, the larger trucks have additional time allocated for locating and securing
a part since the vehicle is larger than, for example, a small car and requires slightly

more time for assembly. The DST is written in minutes and an example calculation
for a medium part is detailed below.

• Step 1 - Collect Medium Part = .05 minutes

• Step 2 - Locate Medium Part = .04 minutes

• Step 3 - Secure Part with Bolt = .05 minutes

– Four bolts = .05 minutes * 4 = .20 minutes

• Total DST = .29 minutes or 17.4 seconds

The drawings of the part installations with the accompanying DST calculations are
given to the operators, engineers, and managers to work with.
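
As a rough illustration of how the standard times aggregate, the sketch below recomputes
the medium-part example in Python. The step list and helper function are our own
illustration, not Nissan's DST tooling; the times come from the worked example above.

# Hypothetical standard times (minutes) following the medium-part example above.
MEDIUM_PART_STEPS = [
    ("collect medium part", 0.05, 1),
    ("locate medium part", 0.04, 1),
    ("secure part with bolt", 0.05, 4),  # four bolts
]

def total_dst(steps):
    """Sum (standard time * repetitions) over every step of an installation."""
    return sum(minutes * count for _, minutes, count in steps)

dst_minutes = total_dst(MEDIUM_PART_STEPS)
print(f"Total DST: {dst_minutes:.2f} minutes ({dst_minutes * 60:.1f} seconds)")
# -> Total DST: 0.29 minutes (17.4 seconds)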

4.3 Design Standard Time Ratio


As mentioned in section 4.2, the DST is an ideal that does not account for all the
extra steps or actions that might be taken to assemble a part on the vehicle. The
actual time that it takes to perform an installation is taken and compared to the DST
to create the Design Standard Time Ratio or DSTR. The equation for this is simply:

DSTR = AT / DST    (4.1)

where:
DSTR = Design Standard Time Ratio
AT = Actual Time or the amount of time taken to perform a task
DST = Design Standard Time
If a given process has two minutes of DST and it takes an operator four minutes
to do the work, then the DSTR would be two. Similarly if DST is two minutes and it
takes two minutes to perform the work but two workers are required, then there is a
total of four minutes of human work being done in the process and the DSTR comes
out to a ratio of two once again.
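
A minimal sketch of the DSTR calculation just described, including the case where
several operators work on the same process. The helper function is our own illustration
and assumes the actual time is reported per worker.

def dstr(actual_minutes_per_worker, dst_minutes, workers=1):
    """DSTR = total human minutes spent on the process / design standard time."""
    total_actual = actual_minutes_per_worker * workers
    return total_actual / dst_minutes

print(dstr(4.0, 2.0))              # one operator takes 4 minutes -> 2.0
print(dstr(2.0, 2.0, workers=2))   # two operators take 2 minutes each -> 2.0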

The DSTR is a critical benchmark in the plant and across Nissan plants. Because
all the DST is calculated following standards that are independent of the vehicle model
or plant location, the DSTR becomes a value that can allow for a quick understanding
of the efficiency of one plant compared to another. For example, if one plant’s trim
and chassis area has a DSTR of 3.0 and another plant’s trim and chassis area has a
DSTR of 1.9, the lower DSTR suggests a more efficient manufacturing line that is
either using fewer human resources to accomplish the work or doing the work more
quickly than the plant with higher DSTR. See Figure 4-1 for an example on how the
actual time versus design standard time are calculated for a given process step.

In Figure 4-1 it is important to note that some of the steps have no DST given
for them. In the second to last work element where the part is picked up again, no
additional DST is given for needing to pick up the part a second time. There are
a number of reasons why AT is often larger than DST. Everything from the size of
the vehicle to the location of the station parts and tools impact the length of time
that it takes to perform the required work. Occasionally employees may struggle to
install the part in an optimal manner due to fatigue, lack of training, or interference
from another part. Identifying areas where AT is significantly greater than DST can
provide insight into ways to improve the job processes to both help the employee and
reduce waste. To calculate the DSTR in this example, we note that the Total AT
taken in the process is .270 mins. The Total DST allocated for the process is .120
minutes. The DSTR then in this scenario is the the AT of .270 minutes divided by
the DST of .120 minutes which comes out to a ratio of 2.25.

While this is the DSTR for a given process in vehicle assembly, the total DSTR
can be aggregated across a section like doors, an area like Trim and Chassis, an entire
vehicle model like the truck, and across the full plant. In this case the Total Plant
DSTR would be the aggregate Total AT divided by the aggregate Total DST. It is
important to note that the Total AT includes all human time spent on the vehicle.
That means that the employees managing inventory and material handling as well as
those engaged in quality checks and assessments at the end of the line have their time
included in the Total AT. However, because those jobs are not doing work that directly
adds value to the vehicle, they have no DST allocated to them and only increase
the value of the numerator. Similarly, the denominator, Total DST, is a fixed value
calculated during the initial development and rollout of vehicle production. The
only way to improve the DSTR, then, is to find ways to reduce the actual time spent
on vehicle manufacturing and assembly.

Figure 4-1: An example of a DSTR calculation on a manufacturing process

Figure 4-2: DSTR benchmarking example from 2006. Note that numbers are for
illustrative purposes only.
Nissan’s global target for the DSTR is 2.0. Plants often break down their DSTR
by vehicle model or manufacturing line for comparison. Figure 4-2 shows an older
chart and how the company might compare various manufacturing sites and their
DSTR performance across vehicle models.

4.4 Straight Through Ratio

In our vehicle assembly process, vehicles travel down a manufacturing line one af-
ter another and go through specific sequences of work to assemble the vehicle. For

example, in the trim and chassis area, a vehicle may start out as an empty chassis
and go through sequential steps putting on insulation, wiring, airbags, carpet, liners,
and seats. The Straight Through Ratio, or STR, is a calculation that looks at the
number of vehicles that start the sequence of processes and go through all of them
without needing any rework. While it is extremely rare for a vehicle to need to be
removed directly from the line at a given processing step, there are specific areas
where a vehicle can be taken offline to address any problems before being reinserted
into the process. For example, after the paint shop there is a quality assessment area
where vehicles with paint defects can be isolated and touched up before moving on
to trim and chassis work.

The STR then is a data point that provides insight into how capable the process
is at assembling vehicles that don’t require additional offline work.

STR = OKV / TV    (4.2)

where:

STR is the Straight Through Ratio

OKV is the number of vehicles that pass inspection criteria without rework

TV is the total vehicle count

Similarly, the Final Straight Through Ratio (FSTR) is the same ratio as STR
but specifically looks at the vehicles that make it through to delivery.

FSTR = OKV / TVD    (4.3)
where:

FSTR is the Final Straight Through Ratio

OKV is the number of vehicles that pass inspection criteria without rework

TVD is the total vehicle count at the delivery checkpoint
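
Both ratios are simple counts taken at different checkpoints. A minimal sketch, with
hypothetical vehicle counts:

def straight_through_ratio(ok_vehicles, total_vehicles):
    """STR (or FSTR) = vehicles passing inspection without rework / vehicles counted."""
    return ok_vehicles / total_vehicles

# Hypothetical daily counts at two checkpoints.
str_value = straight_through_ratio(ok_vehicles=465, total_vehicles=500)   # end of line
fstr_value = straight_through_ratio(ok_vehicles=480, total_vehicles=495)  # delivery checkpoint
print(f"STR = {str_value:.3f}, FSTR = {fstr_value:.3f}")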

4.5 Schedule Sequence Achievement Ratio

Within the ever busy manufacturing plant with tens of thousands of parts, hundreds
of vehicles, and thousands of employees at any one time, the vast scheduling and
movement of machines, people, and vehicles is quite complex. Coupled with the
inflow of parts and materials from third-party vendors and the delivery of vehicles to
expectant dealerships and customers, timing everything appropriately is critical. One
of the principles Nissan follows is Douki Seisan, which means synchronization with
the customer, such as the vehicle owner. This synchronization needs to be with both
the pipeline for the development of the products as well as the pipeline from order
to delivery of the products. Because the manufacturing plant focuses on production,
we will focus on the second pipeline. The pipeline from order to delivery applies to
production activities which also include the suppliers and transportation of materials
and products.
By aiming for a system that synchronizes with what the customer is looking for,
Douki Seisan, the process aims to:

• Reduce the lead time from order to delivery

• Align production to a fixed schedule that is based on the real sales demand

Lead times can be reduced by ensuring that processes are communicating with
each other allowing for a reduction in inventory waiting and buffers between the
processes. In addition, as the Alliance Production Way, or APW, manual in which
these principles are codified states, “focus on Zero defects, zero breakdowns, and
reduction in set-up time through our adherence to Time and Sequence.” Similarly,
from APW, “the essence of Synchronization with the Customer (Douki-Seisan) is total
optimization of the entire process, from Customer Order to Delivery.”
The Scheduled Sequence Achievement Ratio, SSAR, is a metric used to quickly
understand how well vehicle production is able to hold to the strict schedule laid
out for manufacturing. Repeated problems that sidetrack vehicles from maintaining
their original sequence can be identified and addressed to make sure that the plant

is meeting the needs of the customer. The importance of this metric quickly became
apparent to us as many of the onsite vendors, such as those supplying vehicle seats or
dashboards, make their parts according to the sequence given to them. Should a
vehicle get pulled out of the line, the ripple effects would spread wide through the
plant and vendor facilities as they in turn have to change their sequence to match
the manufacturing line changes. Obviously there are safeguards in place to limit the
damage of those ripples or where a vehicle gets pulled offline, but the impact of not
maintaining a robust SSAR is high. The equation to calculate SSAR is:

SSAR = OKV / TVP    (4.4)

where:

SSAR is the Scheduled Sequence Achievement Ratio

OKV is the number of vehicles delivered in their scheduled sequence

TVP is the total vehicle count processed
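
A minimal sketch of the SSAR calculation. Here a vehicle counts as in sequence if it
passes the checkpoint in the position it was scheduled for, which is our own
simplification of the plant's actual matching rule; the vehicle IDs are hypothetical.

def ssar(scheduled_sequence, actual_sequence):
    """SSAR = vehicles that held their scheduled position / total vehicles processed."""
    in_sequence = sum(
        1 for scheduled, actual in zip(scheduled_sequence, actual_sequence)
        if scheduled == actual
    )
    return in_sequence / len(actual_sequence)

# Hypothetical vehicle IDs: two vehicles swapped after one was pulled offline briefly.
scheduled = ["V1", "V2", "V3", "V4", "V5"]
actual    = ["V1", "V2", "V4", "V3", "V5"]
print(f"SSAR = {ssar(scheduled, actual):.2f}")  # -> SSAR = 0.60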

4.6 Stock in Place

As described in section 4.5, one of the goals of synchronization with the customer is to
reduce the lead time in delivery to the customer and reduce inventory and buffers in
the manufacturing process. A way to track and benchmark a plant’s stock is through
the Stock in Place metric. It is a simple percentage expressing the number of
matching parts, or parts that meet the sourcing guide standards, on site and ready to be
used in the manufacturing and assembly process, over the number of Tier 1 alliance
parts. A lower Stock in Place value signifies that a plant is doing a better job at ‘just
in time’ delivery, as it has fewer parts sitting and waiting, which ultimately
reduces cost and waste.

4.7 Stock in Transit

Similar to the Stock in Place metric, Stock in Transit looks at all the stock, both
pre and post assembly, traveling to or away from the manufacturing site. A lower
Stock in Transit number means the plant is reducing cost and waste by optimizing
the goods being transported around the globe.

4.8 Stock in Place + Stock in Transit

Because the Stock in Place and Stock in Transit metrics are so similar, some histor-
ical benchmarking has been done between plants by simply adding the two metrics
together into a new total stock metric. Similar to the rationale for the values on their
own, a lower value signifies better inventory and logistics performance by the plant
and company.

4.9 Production Volume

Each of the plants across Nissan’s portfolio is given a manufacturing target to produce
to. This number is determined after Nissan headquarters analyzes customer
demand and trends and forecasts production needs from each facility. Thus the
actual production from a facility is not necessarily a maximization of its capacity.
While Nissan plants were not independently setting their own production volumes,
their day to day performance impacts the volume they are given. For example, if a
plant finds a way to significantly reduce the cost to produce each vehicle, it becomes
more competitive for future production volume because it can do so more cheaply.
Similarly a plant that is struggling with cost or quality may see less future production
volume while it figures out how to improve. The vehicle production volume then, while
not a completely independent variable to the plant, is used as a benchmark between
plants to compare performance.

4.10 Plant Reporting Process
Every year the Nissan manufacturing plants go through a performance benchmark-
ing and ranking. Over the years the metrics used to compare the plants have
changed as better metrics are identified, but the principle of using the data to identify
performance targets as well as best practices has helped improve the performance
of the plants. It is important to note that the data we are using in this thesis is only
one of the many ways that plants are evaluated and benchmarked.
The way the ranking happens each year goes as follows:

• Each plant reports monthly data on the specified key metrics like SSAR

• Each plant is visited by the benchmarking team at least once during the year
where the team evaluates performance throughout the plant

• The benchmarking team analyzes the data and evaluations of each of the man-
ufacturing facilities

• The benchmarking team calculates a performance score for each site

• Benchmarking results are released

The key metrics being evaluated center around four main performance categories:

• Quality - including metrics like defect rate or recalls

• Cost - focusing on metrics related to the cost of manufacturing the vehicle

• Time - focusing on metrics related to the time required to manufacture the vehicle

• Production - focusing on metrics related to the quantity of vehicles manufactured

Results are released showing plant performance in individual categories like Cost
as well as an Overall QCTP category. Plants that have demonstrated significant
improvement are highlighted. After the results are released plants work on improving

throughout the following year and work together to identify which practices going well
at other facilities could be shared.

Figure 4-3 shows how each plant receives an overall score for its Quality, Cost,
Time, and Production performance and can compare how it does with all the other
manufacturing facilities in the Nissan network.

Figure 4-3: Shows 2018 Overall QCTP score rated on 0-5 scale for plants in network.
Plant names removed for anonymity and numbers for illustrative purposes only

4.11 Summary of Data Gathered and Used

The introduction to section 4.1 highlighted the vast data generated at the manu-
facturing plants. We then wrote about specific metrics gathered as a part of the
Alliance Benchmarking and plant performance ranking. Our project decided to focus
on that subset of data because it provided several years of history, is critical to
understanding plant performance and the associated investment and production
opportunities that become available to the plant, and was a data set well suited to
seeing how machine learning could help drive insights.
The data we looked at was from the Alliance Performance Ranking from the 2015

to 2018 years, a total of four years of data. The data is reported by the plants on
a monthly basis, but the overall plant performance scores are only given on a yearly
basis. Chapter 5 will look into how that timeline discrepancy impacts model per-
formance. In addition, not all years kept the same metrics. For example, a variable
called Production Straight Through Ratio was added in the 2017 and 2018 years to
provide insight into how well the Production area of the plants performed. Simi-
larly the SSAR was not added until 2017. Because of this, we ran models using only
the variables available across the full 2015-2018 range, as well as another model on the
2017-2018 data that incorporates the additional variables and examines their impact on the models.

Chapter 5

Models

In Chapter 4 we discussed some of the critical data associated with Nissan’s Alliance
Performance Ranking. Chapter 5 looks into how that data can be modeled to better
understand trends in plant performance. The end goal of this modeling is to:

• Identify a model that accurately represents past plant performance

• Identify a model that can be used to run scenarios and predictions around new
plant performance metrics

5.1 Approach

Throughout the duration of the Leaders for Global Operations project at Nissan’s
Canton location, the overarching goal was to identify opportunities to help improve
plant performance. While we worked on a number of different initiatives with various
groups, the idea of providing a quantitative look at plant performance was compelling.
After reviewing numerous data sets and reports, we identified that focusing on the
plant benchmarking and performance annual reports would be an ideal place to start
our analysis. This decision was based on the fact that there was reasonably accessible
historical data as well as the large impact, and subsequent plant focus, that the
rankings had on plant funding, investment, and growth.

Currently the plant uses the yearly performance ranking report to identify oppor-
tunities for improvement. Throughout the year the plant tracks the critical metrics
(listed in Chapter 4), works to address any that in the prior year needed improvement,
and addresses any that start to fall short of targets during the year. The teams that
work on these performance initiatives are cross-functional and include individuals fo-
cused full-time on these benchmark metrics as well as individuals that participate
part-time for specific work in their areas of expertise.
While a focus on better understanding the benchmarking and metrics was not
new to the plant, creating a model to take historical data from multiple plants and
understand the priority of the various metrics was new. The goal of the project then
was to create a usable model that could be easily updated with new data and that
could help plant management understand where to prioritize efforts.
After working with many of the employees involved with the benchmark reporting
and plant improvement opportunities, we determined that the data-driven approach
should utilize a model that accomplishes the following:

• Understand global plant performance

• Prioritize metrics most impactful on overall performance scores

• Provide accurate and robust correlation with historical data

• Allow for adaptability in the future

5.2 Understanding the Data Set


The historical plant data available for review covered the 2015-2018 years, a period
of four years. Evaluating the plant data from 2015-2018, there are several variables
that appear consistently over those years. Those variables are:

• Production Volume – Number of Vehicles produced

• Final Straight Through Ratio – Vehicles not taken off the line

• Delivery Scheduled Time Achievement Ratio – Vehicles reaching delivery point
within ±2 hours of scheduled arrival

• Stock in Place

• Stock in Place + Stock in Transit

• Design Standard Time Ratio – Actual time spent manufacturing vehicle versus
design time

For the purpose of the model all of these variables are considered the independent
variables while the plant score is the variable dependent on each of these independent
variables.
While all of the variables have variability across the plants, quick regression
plots show a linear correlation between each of the variables and the plant score. See
Figures 5-1 through 5-6.
Each of the plots shows the independent variable value and the associated plant
score. A solid regression line shows the best linear fit between the points and includes
a shaded region capturing the 95% confidence interval. Overall, a simple regression
line obviously does not do a good job of capturing all the data points, but it does
show general trends in line with what we expected from the data. For example, higher
stock/inventory values correlate to poorer overall plant scores, and a better straight
through ratio correlates to improved plant scores. Note that numbers used are for
illustrative purposes only.

Figure 5-1: Relationship between global plant performance and number of vehicles
manufactured.

Figure 5-2: Relationship between global plant performance and Final Straight
Through Ratio.

Figure 5-3: Relationship between global plant performance and Delivery Scheduled
Time Achievement Ratio.

Figure 5-4: Relationship between global plant performance and Design Standard
Time Ratio.

Figure 5-5: Relationship between global plant performance and Stock in Place ratio.

Figure 5-6: Relationship between global plant performance and Stock in Place +
Stock in Transit ratio.
5.3 Linear Regression

While there are a number of ways to run a linear regression on data such as that
generated in the plant performance rankings, we wanted to explore using machine
learning as our primary method for model creation and regression analysis. All the
machine learning analysis was done in Python 3 using Jupyter Lab notebooks. The
plant performance data was downloaded and stored in Excel documents. To run the
actual machine learning algorithms, the Python script loaded the Excel files and
then ran the analysis on the data. Additional information on the code used can be found in the
appendix.

Simply put, linear regression is a way to take one or more variables and model the
relationships between them. The utility of this comes from then being able to take
the model and predict new values (outputs) based on new data (inputs) that it has
not seen before. While general plots of data can help us understand the trend of data,
without a ‘line’ it is difficult to predict an exact value for what new data could imply.
As an example, the FSTR plot, Figure 5-2, shows that as a plant improves its ratio,
its plant score improves as well. However, once we create a linear regression using
that data, we have an ability to predict what a plant score will be given an FSTR
value. The accuracy of that linear regression and its prediction capability becomes
critical to its utility.

The linear regression analysis was done with the Scikit-learn (sklearn) library
created to use machine learning in Python. Sklearn has many algorithms that can be
used and we utilized it for our simple regressions and more complex machine learning.

The LinearRegression model in sklearn is an Ordinary Least Squares Linear Regression
that fits a linear model and minimizes the residual sum of squares between
the observed datapoints and the predicted results. It can return an R-squared score
to help understand the accuracy of the model.

To run the linear regression, we created a simple machine learning model that
split the plant data into a training and a test set. The linear regression model was
trained on the training set to fit a model and then run on the test set. The resulting

59
Figure 5-7: Performance of linear model in comparing predicted plant score vs actual
plant score.

score for how well the model performed in accurately predicting the plant scores in
the test set is shown in 5-7.
This simple graph in 5-7 illustrates the actual plant score values on the X-axis
and the predicted plant score values on the Y-axis. A perfect regression model would
have all points fall on the blue line. As can be seen, and from our discussion earlier on
the simple variable regressions, the linear model does a fairly good job at predicting
the correct plant scores. The R-squared value for this model was .80.
In addition to the R-squared value, the linear regression model also provides us
with the coefficients for the variables used in the regression. We can use these coeffi-
cients to write out the equation of the line used to model and fit the plant performance
data. That equation is:

PS = 1.29e-05·PV + 1.70e-02·FSTR + 6.25e-03·DSTAR − 1.35e-02·SIP
         − 5.83e-03·SIPSIT − 0.298·DSTR + 2.11    (5.1)

where:
PS is the plant score
PV is production volume
FSTR is the final straight through ratio
DSTAR is the delivery scheduled time achievement ratio
SIP is the stock in place ratio
SIPSIT is the stock in place + stock in transit ratio
DSTR is the design standard time ratio
While the sizes of the coefficients do not allow for a direct comparison of which
variables are the most impactful, the linear regression model and associated equation
allow for straightforward exploration of how changes in given variables impact the
overall plant score. For example, an improvement in PV or DSTR can be input
into the equation to see how it will impact the PS. In addition, before projects are
approved for capital expenditure, an evaluation can be done to see what impact their
anticipated benefit to one of the variables will have on the PS, and then projects can be
prioritized based on cost and PS impact.
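
Because Equation 5.1 is a simple linear combination, this kind of scenario analysis can
be done in a spreadsheet or in a few lines of code. The sketch below plugs the
coefficients from Equation 5.1 into a small function; the baseline metric values are
hypothetical and only illustrate the mechanics.

# Coefficients and intercept taken from Equation 5.1.
COEF = {"PV": 1.29e-05, "FSTR": 1.70e-02, "DSTAR": 6.25e-03,
        "SIP": -1.35e-02, "SIPSIT": -5.83e-03, "DSTR": -0.298}
INTERCEPT = 2.11

def plant_score(metrics):
    """Predicted plant score for a dict of metric values."""
    return INTERCEPT + sum(COEF[name] * value for name, value in metrics.items())

# Hypothetical baseline plant, then a scenario where DSTR improves from 2.4 to 2.1.
baseline = {"PV": 40_000, "FSTR": 90.0, "DSTAR": 85.0, "SIP": 3.0, "SIPSIT": 6.0, "DSTR": 2.4}
improved = dict(baseline, DSTR=2.1)
print(plant_score(baseline), plant_score(improved))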
One of the downsides to using a linear regression model with this data set is that
the data needs to go through extensive cleaning in order to be used in the model.
No missing cells can be included in the model and so the manufacturing plants with
missing data in a given month either need to manually be deleted for that month, or
a script needs to be written to go through and delete any entries with missing data.

5.4 XGBoost Model

Similar to the simple Linear Regression, XGBoost is a machine learning model that
takes a train and test split of the data set to create a model. The XGBoost model
trains itself on the training split of data and then we can evaluate how well it performs
against the test data set.
XGBoost is a nickname for a machine learning technique called Extreme Gradient
Boosting. It is a machine learning algorithm that works as a sequential technique.

Figure 5-8: Illustration on how a boosting algorithm works. From
https://www.datacamp.com/community/tutorials/xgboost-in-python, accessed
1/1/2021

That means that it improves its prediction accuracy by comparing the outcomes of
one test instance with those of previous tests. The outcomes of the model tests are given
a weight based on their results and the results of previous instances. Outcomes with
better prediction results are given a lower weight while the outcomes that perform
worse are given a higher weight. Figure 5-8 is a helpful example from DataCamp of
how this works.
Figure 5-8 shows the steps that the boosting algorithm goes through to identify
an optimal model. The goal of the classifier is to correctly classify the +’s and –‘s
spread throughout the box.

• The first box creates a vertical line and says anything to the left of that line
(D1) is a + and anything to the right of it is a -. This vertical split however
creates three incorrect classifications.

• The second box uses a classifier that focuses on a new split that gives weight to
correcting the three incorrectly classified points from the first box. It creates a
division at D2 and says anything to the left is + and anything to the right is -.
Again it incorrectly classifies three points.

• The third classifier now creates a horizontal split and gives more weight to the
incorrect points from Box 2. It also incorrectly classifies three points.

• The final Box 4 now uses a weighted combination of the weak classifiers from
the original three boxes. By using what it learned from the first three instances,
it does a good job of correctly classifying all the points in the box.

• In summary: a boosting algorithm takes a weak model and uses it to create
an understanding of feature importance in the data set. Using those ‘learnings’
from the errors in the weak model, it creates a new and better model. [14]

XGBoost uses tree ensembles for classification and regression. This is based on Classification
and Regression Tree, or CART, models. Essentially trees are built one after another
based on the features in the data set, such as DSTR, and the model works to reduce
the amount of error in the classifications as the next tree is built. Leventis talks
through the objective function being minimized by the algorithm to create a model
that most accurately represents the data fed into it [8]. In addition, the original paper
on XGBoost by Chen et al. details why XGBoost is able to succeed in many machine
learning challenges and is now widely used by data scientists [4].
The XGBoost model uses the same dataset as the linear regression. However,
a major upside to XGBoost is its ability to handle missing data.
So, for example, if a plant did not record a DSTR value for a given month or year
being analyzed, the model would still run without it and interpret around it. This is
extremely beneficial because occasionally a plant may have not reported a data fea-
ture in a given month, but it did report other data that can still be used in correlating
features with plant score.
After splitting the data on a 70% train and 30% test split and running the model,
the XGBoost model matches the training data set with 99% accuracy, and when the
model is applied to the test data set it achieves an 88.3% score.
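
A minimal sketch of this step, assuming the same hypothetical Excel export and feature
columns as the linear regression sketch earlier; the hyperparameters shown are
illustrative defaults, not the tuned values.

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Unlike the linear model, rows with missing metrics can be kept: XGBoost handles NaNs.
data = pd.read_excel("plant_performance_2015_2018.xlsx")  # hypothetical file name
features = ["PV", "FSTR", "DSTAR", "SIP", "SIPSIT", "DSTR"]
X, y = data[features], data["plant_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print("Train R^2:", model.score(X_train, y_train))
print("Test R^2:", model.score(X_test, y_test))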

After seeing the score we noted that the machine learning model XGBoost per-
formed slightly better than the linear multivariate regression. While both models
are useful in understanding overall plant performance across the Nissan network, the
XGBoost’s improved performance is promising and encourages future consideration
for data analysis across Nissan manufacturing.

One of the benefits of the XGBoost algorithm when compared to the linear regres-
sion model is that it can handle missing data sets. Unlike the linear regression model
that errors out if it finds missing data and requires treatment of the data beforehand,
XGBoost learns from the data what the best assumptions are to make for missing
data and incorporates that into the model. Chen et al. describe how XGBoost uses
sparsity-aware split finding to handle the problems of missing data values, frequent
zero values in the data, or side effects of feature engineering. Sparsity-aware split
finding essentially chooses a default direction as it builds out the tree when it comes
across a problem like missing data. However, the default direction is not a random
guess, rather it is learned from the data and so XGBoost chooses the optimal direc-
tion to go to build out the model based on what it learns from all the other data
available to it [4]. This is a powerful feature because it avoids what can happen in
other models where the missing data requires some valid data to be scrubbed in order
for the model to work. As an example, in our linear regression, if a plant failed to
report a DSTR value for a month, then the other values that the plant still reported
could not be used because the model needed to have data for every variable present.
XGBoost provides significant benefit in not just allowing all available data to be used,
but also interpreting what assumptions make the most sense for the missing data in
order to build out an accurate model.

One of the challenges of machine learning models when compared to simple linear
regressions is that they can be complicated if not impossible to understand. Many of
these models are often called blackboxes because of the inability to understand how
it arrived at its predictions. The next paragraphs demonstrate how XGBoost can be

Figure 5-9: XGBoost Feature Importance chart showing number of times a metric
was used in regression tree analysis

used to better understand how it is making its predictions and how to interpret the
results.

This first section discusses Feature Importance. We can gain some insight into
how XGBoost is weighing the various features, the independent variables, that we had
in our dataset by creating an F Score. The F score is simply a count of how many
times a feature is used to branch off of in the trees XGBoost grows. The more often
a feature is used, the more important it was in creating the model.
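
A chart like Figure 5-9 can be generated from a fitted model with the plotting helper
in the xgboost package; this sketch assumes model is the fitted XGBRegressor from
the earlier sketch.

import matplotlib.pyplot as plt
import xgboost as xgb

# 'weight' counts how many times each feature is used to split across all trees (the F score).
xgb.plot_importance(model, importance_type="weight")
plt.tight_layout()
plt.show()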

Figure 5-9 shows that in the model we ran on the 2015-2018 plant performance
data, the production volume was the most influential feature used to create tree
branches. Note however that this count is not the number of branches in the final
model, otherwise we would have an overfitting tree. Rather, the numbers show that as
the model created hundreds of trees and evaluated the error rates in them, Production
Volume was one of the most used features to branch off of in the trees. This chart
is helpful because a plant leader can look at this feature list and quickly understand
what factors are most important to prioritize. While all the features impact the

overall plant score, with limited time and resources some features can be focused on
due to the algorithm’s suggestion that they have a larger impact on the plant score.
While this feature importance chart is very helpful in understanding how the
machine learning algorithm works to come up with its optimized model and which
features were most used in developing that model, it does not completely capture the
impact those features have on the desired results. Because of that, we continued to
look for additional tools to help plant leadership better understand the results of the
model and make decisions with it.
Figures 5-10 and 5-11 use what are called Shapley values to help explain what
the machine learning model is doing and how to understand the importance of the
features used in the data set. With the Shapley Values, we can use what are called
SHAP, Shapley Additive exPlanations, to understand our model.
The SHAP values do three important things:

• Global interpretability – SHAP values can show how each feature impacts the
dependent variable. Rather than just a magnitude count like we saw in the chart
above, SHAP values can help us understand the positive or negative impact the
feature has on the dependent variable.

• Local interpretability – each instance of the model gets SHAP values which can
then be used to explore what is impacting the model over time.

• Application – SHAP values can be used across any tree-based models thus
improving their usefulness in conveying what the models are doing
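
Charts in the style of Figures 5-10 and 5-11 can be produced with the shap Python
package. A sketch, again assuming model and X_train from the earlier XGBoost sketch:

import shap

# TreeExplainer works with tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Bar chart of mean |SHAP value| per feature (Figure 5-10 style).
shap.summary_plot(shap_values, X_train, plot_type="bar")

# Per-observation dot plot showing direction and spread of impact (Figure 5-11 style).
shap.summary_plot(shap_values, X_train)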

Figure 5-10 looks similar to the feature importance chart from Figure 5-9; however, in-
stead of simply referring to the number of times a feature is being called on by the
model, it instead looks at the contribution that the features have on the model. Thus
the features that are closer to the top are the features that are having more of a
contribution to the model than the ones towards the bottom. Again, this can be
extremely helpful for management by allowing them to identify critical variables to
spend limited time and resources on improving or maintaining.

Figure 5-10: SHAP value feature importance chart showing the impact a metric has
in model.

Figure 5-11 is made up of all the data in the training set. Similar to the bar
chart in Figure 5-9, the features are ranked top to bottom by level of importance
on the model. Horizontally, the spread of the data points shows whether or not that
specific observation is contributing to a higher or lower prediction value for the model.
Overall, it can clearly be seen that DSTR and DSTAR have a greater impact on the
prediction than the other variables.

We found that the SHAP charts were a more intuitive and better method for
helping convey the importance of various features in the data set. While the model
itself provides a robust mechanism for predictive capabilities, the SHAP values and
charts help convey the importance of the various features. Thus, even if a specific
prediction of a plant score given a subset of inputs is not needed, a higher-level
understanding of which features are driving the greatest impact on plant score can
be influential in driving management focus and decision making.

Figure 5-11: SHAP value feature impact chart showing the spread of impact a metric
has in model.

It is important for us to note that, at the time of this thesis study, the plant
performance calculations done by the Alliance Benchmarking group to provide a
plant score for each plant across the Alliance Network were not shared generally with
plant employees. However, the Alliance Benchmarking group performing the analysis
obviously has the calculations taking individual plant performance on the Quality,
Cost, Time, and Production categories, as well as the observations from the plant visits, and
assigning an overall score. Thus, while this modeling demonstrates an ability to
accurately model plant performance data to the plant scores given by the Alliance
Benchmarking group, it could be said that this modeling would be unnecessary if
the Alliance Benchmarking group simply started sharing how they analyze and rank
all the plants with respect to each other. However, we strongly feel that there
is still significant benefit to this analysis because the machine learning framework
provides a foundation on which further studies and insights into plant performance
can be done. In other words, while the models below provide a short-term benefit of
better understanding how plant data correlates to overall plant performance scores,
the methodology behind the models can and should be used with additional data
sets to better understand the manufacturing plant and factors that contribute to or
inhibit better performance. Additional recommendations are provided in Chapters 6
and 7.

Chapter 6

Model Discussion

In this chapter we will look into the lessons learned from the regression and machine
learning approaches to the given data set. The goal of this chapter and the overall
thesis is not to say that one is always better than the other, but rather to highlight the
benefits of both and illustrate scenarios in which they can be used. After all, the best
approach to better understanding and utilizing data is to apply multiple techniques
to the data. In addition to the regression and machine learning model used in this
project, we acknowledge that there are MANY more analysis and machine learning
techniques that were not applied in this situation. Chapter 7 will include comments
around future work and next steps that could be considered to create a robust data
analytics operation as a part of a job or team.

6.1 Regression

This section will talk about the benefits and drawbacks of the regression analysis
used on the plant performance data. As was discussed in Chapter 5, the point of
performing a regression on a set of data is to understand the relationship between
variables. Sometimes this can simply be between two variables, like how gender might
impact height, but it can often involve many additional variables, like how the sale
price of a house is impacted by location, schools, crime rates, commute time to major
cities, and proximity to a park.

A linear regression is any relationship that can be described using a straight line
and follows the equation y = mx + c, where y is the dependent variable and depends
on the value of the independent variable x. An equation where a dependent variable
y depends on two variables would follow the form y = m1·x + m2·z + c, where y is
dependent on the independent variables x and z. While it is easy for us to visualize the
two-dimensional single-independent-variable relationship as well as the three-dimensional
two-independent-variable relationship, it gets much harder to simply draw out a
chart after that. Still, while it’s more difficult to put into a chart, regressions can
account for scenarios with many additional variables. This takes us into the world of
multivariate regressions where

y = β0 + β1·X1 + β2·X2 + β3·X3 + ... + βn·Xn    (6.1)

Advantages

After feeding the data into a regression model, the model provides the coefficients
for the various independent variables. Further analysis on the model can help us un-
derstand whether or not the independent variables are significant (have a meaningful
impact on the outcome of the dependent variable) and how well the equation does
in describing the data. As noted in Chapter 5, understanding how well the equation
describes the data is given by the R-squared value, which describes how much of the
variance in the actual data is explained by the model. The closer the R-squared value is to one, the
more accurate the model is at correctly calculating the actual data.

There are several major benefits to using a multiple regression model. The first
is that the multiple regression model can help provide insight into which variables
are providing more impact to the dependent variable. This insight can allow a plant
manager to encourage staff to focus limited time and resources on those variables that
are most impactful to the dependent variable being considered. Because our plant
scores only ranged from one to five, the coefficients for each of the variables were small.
In addition, it is important to consider what unit the variable is being measured in
For example, while the production volume coefficient was very low (.000013), the

production volumes are in the tens of thousands, so even though a single unit change
of another car being produced or not being produced has a minute impact on the
plant performance score, the overall value can still have a reasonable impact on the
score. So while the coefficient values can provide some insight into which variables
provide the greatest impact on the dependent variable, it is important to realize
that a multiple regression model can help one better understand which variables are
impactful, but may not fully explain how to prioritize those variables without further
analysis.

Another benefit to a multiple regression model is the interpretability of the model.
This benefit lies in both the model’s ability to help explain the impact of variables
on the dependent variable as noted above, and the ability to share the equation
arrived at to explain the data. As an example, an engineer could show the equation
on a board and show how changing one variable (like plant DSTR) while holding
all the other variables constant impacts the overall plant score. In addition, should
another engineer or manager want to perform similar scenario exercises, it is very
straightforward for them to get the equation, plug it into their own spreadsheet,
and start using it to explore the model. This interpretability and usability is a
very important benefit to multiple regression models and is often a reason why they
are used over more complex and less understandable models. While the multiple
regression model may sacrifice a little bit of accuracy, that may be a worthwhile
trade off if the ‘customer’ to whom the model will be delivered is able to understand
and better use the model.

Disadvantages

A major disadvantage to the multiple regression model is that it requires the data
to be cleaned before running the model. In simple data sets this can be straightforward
and easy to accomplish. However, in a major plant like the many that Nissan runs, it
can be a very difficult and time consuming job to get the data cleaned and formatted to
the point where it can be used. As was noted in Chapter 4, some plants reporting their
metrics would not always provide the requested data each and every month. Thus,
because the model needed data entries for each month in order to work appropriately,

those months in which a plant did not report an entry had to be deleted from the
data set. The unacceptable alternative to that, at least for this study, would have
been to delete the variable itself from the entire report. We looked into why a plant
might not report a value for a given month, and many times there were legitimate
reasons. For example, a plant that did not report a variable for a given month may
have had an atypical upset, or a machine that normally gathered the required data
may have broken.

6.2 Machine Learning XGBoost

One of the biggest drawbacks to machine learning techniques in data analysis is the
challenge of interpretability. Because it is difficult to understand what the model is
doing to arrive at its predictions, it is likewise difficult to explain those predictions
once they are received. While multiple regression models often have coefficients and an
equation that can be used to analyze various scenarios, machine learning algorithms
like XGBoost need to use the model itself to run scenarios and analyze the data. For
someone well versed in the model this often is not a problem at all. However, it can
be difficult to help someone else get up to speed to use the model and explore the
data. Techniques like the SHAP values discussed in the previous chapter can provide
helpful insights into the model, what it’s doing, and how different variables impact
the outcome, but again it does not necessarily make it easy for someone unfamiliar
with the model to pick it up and look into various scenarios.
While the interpretability is less obvious, the accuracy of the machine learning
model and its flexibility to handle data sets that may have problems like missing data
points can mean that it is still a better alternative.
Machine learning is also a growing field with constantly evolving models and
algorithms. XGBoost is a fairly recent technique but studies updating it and advanc-
ing different concepts are consistently being published. Thus, while a technique like
XGBoost is very useful, it is not as established as techniques like linear regression
and may be superseded by new and better techniques.

6.3 Model Comparison and Wrap-up

Overall, both models were successful in providing us with strong performance and
an ability to represent the data. Both models could successfully be applied to future
work and projects in the plant to help predict how improving the input variables
could impact the overall plant performance scores.

The linear regression model provides coefficients for the variables that allow for a
relatively simple equation to be built and shared. That equation is easy to understand
and can be iterated on in a simple excel document for scenario analysis. However,
the linear regression model struggled with missing data and does not perform quite
as well as the machine learning model.

The XGBoost machine learning model performed better but has less interpretabil-
ity than the simpler linear regression model. However, by incorporating SHAP values
and their associated charts, the XGBoost model and its results provided clearer un-
derstanding of variable impact. While the linear regression model can use Excel to
perform scenario analyses, XGBoost needs coding to perform similar predictions.
Thus while XGBoost provides superior results, it is more complicated initially to set
up and use.

The case of overall plant performance and the plant ranking done at Nissan man-
ufacturing facilities was a helpful place to explore the potential benefits of a machine
learning algorithm. The XGBoost model showed that better performance could be
achieved by using the model, however both the machine learning model and the multi-
ple regression model provided fairly accurate models. In other words, while XGBoost
could be said to have provided a more accurate model, the multiple regression model
still performed well enough that it could be used due to its better interpretability
and shareability. However, the point of this thesis is to demonstrate how a machine
learning model can be applied more generally, and the results of this model encour-
age further study into additional applications in the plant and across the company.
While this study used a convenient data set that provided values for plants around
the world, there are significant opportunities for application at a plant, regional, or

global level with data sets that were not available to our team and with which the
model could provide helpful insights. Some of those ideas and recommendations will
be looked at in the next chapter.

Chapter 7

Summary and Future Work

7.1 Summary of Project


Throughout our time at the Nissan manufacturing plant, our goal was to better
understand plant performance and identify how data being gathered could be used to
understand what impacted how well a plant performed. The plant we visited collected
enormous amounts of data, and when combined with the data being generated at
other manufacturing facilities around the world, proved to be a staggering amount of
information. Originally we had intended to collect a similar dataset from each of the
plants to analyze, but when we learned that Nissan already has a platform in place
to benchmark the plants and their performance we decided that that was a good
place to start. Despite all the data generated at each of the plants, only a relatively
small portion of that data was used for the plant benchmarking. That provided a
convenient dataset to apply our Linear Regression and XGBoost models to compare
performance. While both models were relatively accurate in their ability to model
plant performance successfully, the XGBoost model provided a small improvement in
accuracy while also needing less up front data cleaning. XGBoost when combined with
SHAP analysis provided insight into how the XGBoost model was performing and
which features were most impactful to the model. While both linear regression and
XGBoost were helpful in understanding global plant performance, the opportunities
that machine learning algorithms present for future plant analysis are promising.

7.2 Future Work

As we mentioned in previous chapters, the models we evaluated looked at a very
specific report that all the manufacturing plants provide for an overall performance
analysis and ranking that Nissan does each year. We chose this data due to our limited
time at the company and because it was a way to prove out a machine learning model
and its potential. While our results provided insight into the data set we were using,
we are enthusiastic about the potential to use machine learning in future projects
throughout the company. By nature of this project being a master’s thesis, there
is plenty of additional work and exploration that can be done to improve on the
work done here and identify future opportunities to implement it. Three areas that
we see immediate room for additional work and exploration are in the areas of data
storage and preparation, additional model usage and comparisons, and increased data
analytics training.

7.2.1 Data Lake

One of the early challenges we faced was around data acquisition and preparation for
model usage. The data we were looking at was not immediately ready to feed into
the linear or machine learning algorithms and so had to be manipulated and adjusted
in order to work. While data preparation is something that often needs to be done in
data analysis, it often introduces opportunities for mistakes and can be challenging to
replicate by others. In addition, knowing where to find appropriate data and having
confidence that the data found includes all the relevant data needed for the analysis
was challenging. A common discussion point we heard at the manufacturing facilities
centered around how hard it was to find the right data and access it. Antiquated ma-
chines, complicated databases, and permission rights could make it all but impossible
to gather and analyze data in a timely manner.
A potential solution to this challenge centers around the idea of a data lake. A
data lake is a centralized location where a manufacturing site or even company can
store any and all of its data. That data can be stored in whatever format it was

generated in without first needing to touch and restructure it. Once that data is in
the data lake it can be accessed, queried, and analyzed. See Figure 7-1 for a recent
schematic from Amazon’s AWS services that shows how a data lake can be accessed by
a number of different systems for use in everything from typical analytics to machine
learning and data movement.

The goal then of establishing a data lake would be to centralize all of the data
generated at a plant and enable straightforward access to the data by everyone from
hourly employees to engineers to managers. Teams could access the data for reporting,
to create dashboards, and to perform heavy analytics or machine learning. In fact,
by centralizing the data into a single source, data and relationship discovery can be
more easily performed which would allow employees to better understand trends and
discover new insights into the plant and its performance.

Another point that a comprehensive data lake brings up is applying machine learn-
ing to additional data sets. We looked at a simple data set that brought together
benchmarking data from plants around the world, but there are significant oppor-
tunities to explore other sets of data at global, regional, and local levels. Having a
much more comprehensive set of data available would allow Nissan to explore where
machine learning could make the greatest impact in finding insights and solutions to
problems. For example, the team responsible for the data used in the plant perfor-
mance analysis have gradually changed what data is included in the benchmarking
over the years. Stock in Place and Stock in Place + Stock in Transit were once used
as part of the analysis but are no longer included as other variables and metrics were
deemed more critical for measuring and tracking both from a performance perspec-
tive but also in relation to impact on plant performance. A comprehensive data lake
would enable machine learning and other analytics to comb through plant data to
determine what variables are most impactful on overall plant performance. Those
key metrics can then be included and relied on for benchmarking and performance
analysis.

One of the promises of Industry 4.0 is a revolution in data, insights, and produc-
tivity. Establishing a data lake will enable greater access to the data already being

generated and provide ample opportunity to explore what is truly going on at the
plant and identify opportunities and insights. Establishing and migrating data to the
data lake will be an intensive process, but the potential for the data lake is great.

Figure 7-1: Schematic showing a data lake setup and how it can be accessed and utilized.
From AWS https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/ accessed 1/5/21

7.2.2 Additional Models

Due to the time constraints of the project as well as the disruption caused by the
pandemic, our team focused primarily on understanding the company and its plant
performance data, and then ran linear and XGBoost analyses on it to understand
and compare results. While the learning and insights we saw were helpful, there is
still significantly more room for additional model exploration. Other models can be
applied to the plant performance data that we found, but there is also a wealth of
data at the local, regional, and global level that can be explored. Applying regression
and machine learning models to more and more data can help identify insights and
opportunities that our limited analysis did not consider. Below are a few additional
models that could be applied, evaluated, and compared.
LASSO: LASSO stands for Least Absolute Shrinkage and Selection Operator.
It is a type of linear regression that encourages models that use fewer parameters
and helps deal with issues like multicollinearity. This can be especially important in
sites like a manufacturing facility where we are not always sure whether or not our
predictor variables are truly independent. In other words, if we were to find that
Production Volume is also a good predictor of Stock in Place, then we would have
redundant information in our model potentially skewing our results. Lasso performs
regularization on the model which enables it to zero out and eliminate some of the
parameters going into the model resulting in a simpler end model. This feature
selection is a benefit to Lasso because it could help identify important metrics out of
the many that the plant is tracking. If we are working with a smaller data set and
do not want the model to eliminate any of the variables, a Ridge Regression model
is something to consider.
CART: CART stands for Classification and Regression Trees. These are essen-
tially decision trees that split on a specific variable like gender or age and branch

off of that variable. Classification trees refer to categorical variables while regression
trees refer to numerical variables. Another way to look at this is classification trees
are used when a variable needs to be placed into a specific category while regression is
used more for value prediction. Classification and regression trees are straightforward
to understand as they are essentially if-then statements that help lead to an outcome.
While XGBoost uses trees to generate its model, CART is a simple model that can
be used and compared against the other models being used.
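As a minimal sketch, a single regression tree could be fit and printed as nested if-then rules with scikit-learn; the DataFrame df, the Score target, and the max_depth value are illustrative assumptions rather than settings we validated.

# Minimal sketch of a single CART (regression tree). df, the 'Score' target,
# and max_depth=3 are illustrative assumptions.
from sklearn.tree import DecisionTreeRegressor, export_text

X = df.loc[:, df.columns != 'Score']
y = df['Score']

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# Printing the tree as nested if-then rules makes the splits easy to read.
print(export_text(tree, feature_names=list(X.columns)))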
Random Forest: A random forest model builds off of CART by combining an
ensemble of CART models. For regression, the random forest averages the predictions
from all of the individual trees; for classification, it selects the class chosen by the
majority of trees. Random forests work because each tree is trained on a different
randomized sample of the underlying data, and each tree considers only a subset of
the independent variables rather than all of them. While random forests can be more
accurate, they are also much more complex and can be difficult to interpret or explain.
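A corresponding sketch for a random forest regression is shown below; df, the Score target, and the number of trees are again assumptions for illustration.

# Minimal sketch of a random forest regression. df, the 'Score' target, and
# n_estimators=200 are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = df.loc[:, df.columns != 'Score']
y = df['Score']

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# Feature importances give a rough ranking of which metrics the trees rely on.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))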
While these are a few examples of additional models to consider, there is rarely one
model that works best in all scenarios. Ideally, data teams will be able to try out
various models, compare their performance, and select the best model for a given
data set and situation. Future work is needed to compare how these models perform
on the various data sets that become available for analysis.
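One way to make that comparison concrete is cross-validation. The sketch below scores several candidate models on the same data; the model list, parameter values, and five-fold split are illustrative assumptions, not a prescribed workflow.

# Minimal sketch of comparing candidate models with 5-fold cross-validation.
# df, the 'Score' target, and the specific models and parameters are
# illustrative assumptions.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

X = df.loc[:, df.columns != 'Score']
y = df['Score']

candidates = {
    'Linear': LinearRegression(),
    'LASSO': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=200, random_state=0),
    'XGBoost': XGBRegressor(verbosity=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")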

7.2.3 Data Analytics Training

As our team interacted with individuals and teams across multiple sites, we saw
a strong desire for improved training and opportunities around data analysis. We
were excited to see efforts in the supply chain organization to propose training and
standards that enable entire teams to become competent in up-to-date data analysis
practices.
To build on the previous two sections, Nissan has a wealth of data available for
analysis. There is too much data, and too much that changes, for it to be effective to
make only a few individuals or teams responsible for data intelligence and analytics.
The consulting company McKinsey has advocated for analytics academies within
companies that enable an AI-educated workforce. These analytics academies provide
a common vision and understanding across the organization, allow training to be
tailored to the company's specific needs, goals, and industry, and give employees
opportunities to connect with others in their organization whom they can learn from
and lean on.
The supply chain organization in Smyrna, Tennessee has already started down
this path and put people through a structured curriculum to help them understand
and start using various analytics techniques. While not everyone will become an
artificial intelligence or machine learning expert, having a greater number of employees
understand the same basic concepts of data analytics enables broader acceptance and
adoption of the ideas and practices needed to implement better data management.
The efforts of the supply chain group can be rolled out to other areas and organizations
throughout the company.

7.2.4 Prioritizing Metrics

While the prior recommendations all focus on how to make machine learning and
data analytics an integral part of plant operations in the future, actions can still be
taken based on the recommendations the machine learning model provided on the
plant performance data set. The model prioritized the DSTR and DSTAR metrics
as being most impactful to overall plant performance. The DSTR metric relates to
the time taken to manufacture the vehicle, and DSTAR relates to the quality and
dependability of the process. Because there are so many moving parts, personnel,
and data associated with these metrics, we recommend breaking down the data that
feed into them and applying machine learning models to that breakdown to identify
whether particular components of DSTR and DSTAR are more impactful to focus
on. We see this being applied immediately at the plant level, where data is easier
to obtain and analyze, but there is also value in having multiple sites like Canton
and Smyrna share data for analysis to determine whether broader global trends exist.
The models we ran demonstrated an initial prioritization of impactful metrics related
to plant performance, and those results can be built on to identify specific actions
that can be taken at a site or across the company to improve plant performance.
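As a rough illustration of that drill-down, the sketch below assumes a hypothetical DataFrame, dstr_detail, whose columns are sub-components of DSTR alongside the plant Score; the name and columns are invented placeholders that would need to be replaced with the plant's actual breakdown before this could be run.

# Hypothetical sketch: rank DSTR sub-components by SHAP importance.
# 'dstr_detail' and its columns are invented placeholders; substitute the
# plant's real sub-metric data and target.
import xgboost as xgb
import shap

X_sub = dstr_detail.loc[:, dstr_detail.columns != 'Score']
y_sub = dstr_detail['Score']

model = xgb.XGBRegressor(verbosity=0).fit(X_sub, y_sub)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sub)

# A bar plot of mean |SHAP| per sub-metric shows which parts of DSTR matter most.
shap.summary_plot(shap_values, X_sub, plot_type="bar")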

7.3 Summary
This chapter was not meant to be fully comprehensive, covering everything that can
or should be done with data analysis and machine learning in a plant. There are
many opportunities that we did not see, and bright individuals will find more than we
were able to in our time at Nissan. Our work showed that there are significant oppor-
tunities to apply and learn from advanced data techniques. By pursuing strategies
like a data lake to make more data easily accessible, using additional analytics
models, and increasing employee training, we believe Nissan, and any other organiza-
tion looking to improve its data analytics, will find itself better able to manage and
benefit from the data it produces each day.

Appendix A

Python Code for Models

This appendix provides a resource showing how we set up the code to generate the
linear and machine learning models. The intent in providing it is for it to be used
as a template to replicate and build off of what we did. We readily admit to not being
experts in Python or in machine learning and data analysis, so we encourage you to
find opportunities to improve on it. Overall, the intent of our work and project was
to explore how to better manage data and performance in a manufacturing setting
and to see if machine learning could contribute. We saw promising results and
encourage continued exploration of this space.
In the Python code below, comments are preceded by a # symbol.
This first section shows the Linear Regression code.

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

# This calls up the spreadsheet containing the data.
df = pd.read_excel("2015 - 2018 Data.xlsx", sheet_name="2015 - 2018 Yearly")

# This outputs a table showing the data.
df.head()

# Set a plotting style that helps with the visualizations.
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# This is an example scatter plot that helps show whether there is a general
# relationship between two variables. In this case the scatter plot puts
# Production Volume on the x-axis and Score on the y-axis. The rest of the
# code is formatting.
ax1 = df.plot(kind='scatter', x='Production Volume', y='Score',
              color='blue', alpha=0.5, figsize=(10, 7))
plt.legend(labels=['Production Volume', 'Score'])
plt.title('Relationship between Production Volume and Score', size=24)
plt.xlabel('Production Volume', size=18)
plt.ylabel('Score', size=18)

# This next section is another sequence of code that also draws a scatter plot
# but includes a regression line as well. It uses the seaborn library.
import seaborn as sns

fig = plt.figure(figsize=(10, 7))
sns.regplot(x=df['Production Volume'], y=df['Score'], color='blue', marker='+')

# Legend, title and labels.
plt.legend(labels=['Production Volume', 'Score'])
plt.title('Relationship between Production Volume and Score', size=24)
plt.xlabel('Production Volume', size=18)
plt.ylabel('Score', size=18)

# Everything above this was to help understand the data a little bit better.
# Now we need to prepare the data for use in our linear regression. The linear
# regression function doesn't handle missing data in rows and columns, so we
# need to go through and clean it up. This next set of code drops rows that
# have missing/blank data.
df['Score'].replace('', np.nan, inplace=True)
df.dropna(subset=['Score'], inplace=True)
df['FSTR'].replace('', np.nan, inplace=True)
df.dropna(subset=['FSTR'], inplace=True)
print(df)

# Now we use a package called sklearn that has some great regression functions
# included.
from sklearn.linear_model import LinearRegression

# Create a linear regression object.
lr_score = LinearRegression()

# Fit a simple one-variable linear regression of Score on DSTR.
lr_score.fit(df[['DSTR']], df['Score'])

# Get the slope and intercept of the line of best fit.
print(lr_score.intercept_)
# -224.49884070545772
print(lr_score.coef_)

# Now we drop the plant name and date columns as those aren't variables that go
# into the model (i.e., they aren't predictors of the dependent variable).
drop_columns = ['Date', 'Plant']  # these are the column names to drop
df = df.drop(drop_columns, axis=1)
df.head()

# Split into training and testing data.
from sklearn.model_selection import train_test_split

X = df.loc[:, df.columns != 'Score']
y = df.loc[:, df.columns == 'Score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)  # 70/30 train/test split

# This next sequence creates the model and outputs the R2 value and
# coefficients.
model = LinearRegression()
model.fit(X_train, y_train)
linearscore = model.score(X_test, y_test)
linearcoef = model.coef_
linearint = model.intercept_
print(linearscore)
print(linearcoef)
print(linearint)

# While we have the R2 value, we can also plot a chart that shows how well the
# model is fitting the data.

# Plot predicted vs. actual data.
y_pred = model.predict(X_test)
plt.plot(y_test, y_pred, '.')

# Plot a reference line; a perfect prediction would fall entirely on this line.
x = np.linspace(2, 5, 50)
y = x
plt.plot(x, y)
plt.show()

This next section shows the Machine Learning code.

# Similar to the linear regression, we're going to import the packages we need.
import time
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the Excel data sheet.
df = pd.read_excel("2015 - 2018 Data.xlsx", sheet_name="2015 - 2018 Yearly")
df.head()

# Drop plant names and dates.
drop_columns = ['Date', 'Plant']
df = df.drop(drop_columns, axis=1)
df.head()

# Split into training and testing data.
from sklearn.model_selection import train_test_split

X = df.loc[:, df.columns != 'Score']
y = df.loc[:, df.columns == 'Score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)  # 80/20 train/test split

# Name the model and call the XGBoost ML package.
xgbr = xgb.XGBRegressor(verbosity=0)

# Fit and create the model.
xgbr.fit(X_train, y_train)
preds = xgbr.predict(X_test)

# View how the model performs using the training and test data.
score = xgbr.score(X_train, y_train)
print("Training score:", score)

score = xgbr.score(X_test, y_test)
print("Test score:", score)

# Create a plot that shows how many times each feature (metric) was used in the
# tree growing.
xgb.plot_importance(xgbr)
plt.rcParams['figure.figsize'] = [5, 5]
plt.show()

# This next section demonstrates the SHAP features that help provide
# visualizations for the XGBoost machine learning model.
import shap
shap.initjs()

model = xgb.train({"learning_rate": 0.01}, xgb.DMatrix(X, label=y), 100)

# This small section of code was a workaround for the SHAP values not quite
# working right with my operating system. Yours may not need it.
model_bytearray = model.save_raw()[4:]

def myfun(self=None):
    return model_bytearray

model.save_raw = myfun
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Output a chart showing the impact of each metric for a single observation.
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])

# Visualize the training set predictions.
shap.force_plot(explainer.expected_value, shap_values, X)

# Summarize the effects of all the features.
shap.summary_plot(shap_values, X)

# Feature importance plot.
shap.summary_plot(shap_values, X, plot_type="bar")

