1. Gather your data
Proper data science automation begins with getting the data into the
system. With a data science platform like RapidMiner, it doesn’t matter
where the data is coming from: local sources like a file, remote sources
such as a database, or even cloud-based data sources. Once the data has
been loaded into the system, the next step is data preparation.
2. Prep your data
Raw data needs to be converted into a form that can be used in model
training and testing. Some of the typical steps in data preparation include:
• Data cleaning: The process of preparing data for analysis and
modeling by identifying and dealing with issues such as incorrect
formatting, incomplete entries, or other problems that make the data
difficult or impossible to use.
• Feature selection: Choosing the relevant variables that will have the
most impact on the model. Perhaps surprisingly, reducing the number
of variables the model uses can often increase its performance.
• Data transformation: Adapting the format of the raw source data so
it is better suited to the project.
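The three prep steps above can be sketched in plain Python on a toy dataset. The column names, the drop-incomplete-rows rule, and the min-max scaling are all illustrative assumptions for the sketch, not RapidMiner behavior:

```python
# Toy customer rows; None marks an incomplete entry.
rows = [
    {"age": 34, "income": 72000, "zip": "02139", "churned": 0},
    {"age": None, "income": 58000, "zip": "10001", "churned": 1},
    {"age": 45, "income": 91000, "zip": "94103", "churned": 0},
    {"age": 29, "income": 47000, "zip": "60601", "churned": 1},
]

# 1. Data cleaning: drop incomplete entries.
clean = [r for r in rows if all(v is not None for v in r.values())]

# 2. Feature selection: keep numeric predictors; drop the id-like "zip".
features = ["age", "income"]

# 3. Data transformation: min-max scale each selected feature to [0, 1].
def scale(col):
    vals = [r[col] for r in clean]
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) for v in vals]

X = list(zip(*(scale(f) for f in features)))  # one tuple per clean row
y = [r["churned"] for r in clean]             # the label column
```

A real pipeline would of course impute rather than drop rows when data is scarce, which is one of the automated choices described next.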
With the RapidMiner platform, these steps are easy, as many of them are
automated and others are guided. For data cleaning, automation can
perform normalizations and principal component analyses (PCAs) and deal
with problematic or missing values, among other actions.
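As a rough stand-in for two of those automated cleaning actions, here is mean-imputation of missing values followed by z-score normalization in plain Python (illustrative only, not the platform's actual implementation):

```python
import statistics

ages = [34, None, 45, 29, None, 52]  # None marks missing values

# Impute: replace each missing value with the mean of the known values.
known = [a for a in ages if a is not None]
fill = statistics.mean(known)
imputed = [a if a is not None else fill for a in ages]

# Normalize: z-score so the column has mean 0 and standard deviation 1.
mu = statistics.mean(imputed)
sigma = statistics.pstdev(imputed)
normalized = [(a - mu) / sigma for a in imputed]
```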
Because many of the data cleaning and transformation steps are highly
domain-specific, a data scientist's best approach, if they aren’t an expert
themselves, is to partner with subject matter experts in the domain. The
automated data cleaning fixes all sorts of problems and helps make the
data more suitable for optimal machine learning.
3. Build your model
The next step in the process is model building. This is where the rubber
meets the road for data scientists. With an automated data science
platform, the best model for the data is selected automatically. With
many platforms (including RapidMiner), you can override this choice and
try a different kind of model for your data.
Something noteworthy during this step is feature engineering. This is
when you use existing features or variables to identify and derive new
relationships in the data. RapidMiner's Auto Model can tackle this for you.
Auto Model uses a multi-objective optimization technique that generates
additional features from the existing ones. Automatic feature
engineering can sometimes introduce overfitting (adding more parameters
to the feature space than the data justifies), but Auto Model's
multi-objective approach guards against this by balancing the complexity
of your model against its performance as it works.
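The idea of trading fit against complexity can be sketched as follows. Deriving pairwise-product features and scoring each candidate by correlation minus a per-term complexity penalty is a toy stand-in for Auto Model's optimizer, not its real code:

```python
from itertools import combinations

base = {"age": [34, 45, 29], "income": [7.2, 9.1, 4.7]}
y = [0, 0, 1]

def corr(xs, ys):
    # Absolute Pearson correlation as a cheap "fit" score.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return abs(cov / (vx * vy))

# Derive candidate features: pairwise products of the base features.
candidates = dict(base)
for a, b in combinations(base, 2):
    candidates[f"{a}*{b}"] = [x * z for x, z in zip(base[a], base[b])]

# Score each candidate: fit minus a complexity cost per derived term.
penalty = 0.05
scored = {name: corr(vals, y) - penalty * name.count("*")
          for name, vals in candidates.items()}
best = max(scored, key=scored.get)
```

Here the penalty keeps a derived feature from winning unless it clearly outperforms the simpler base features it was built from.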
The RapidMiner platform presents users with the models it believes will
perform best with the data provided, in a way that humans can easily
understand, in a reasonable amount of time, while allowing for override if
the need is there. This combination of automated features makes it easy to
get a model up and running in no time, even if you aren't a data science
expert.
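The selection step itself amounts to training several candidates and keeping the one that scores best on held-out data. A minimal sketch, where the two "models" are deliberately tiny stand-ins for the real algorithms a platform would try:

```python
train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
holdout = [(0.3, 0), (0.7, 1), (0.8, 1)]

def majority_model(data):
    # Always predict the most common training label.
    labels = [y for _, y in data]
    label = max(set(labels), key=labels.count)
    return lambda x: label

def threshold_model(data):
    # Predict 1 when x is above the mean of the training inputs.
    cut = sum(x for x, _ in data) / len(data)
    return lambda x: int(x >= cut)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

candidates = {"majority": majority_model(train),
              "threshold": threshold_model(train)}
best_name = max(candidates, key=lambda n: accuracy(candidates[n], holdout))
```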
4. Push your model into production
Once the models are built and decided on, they have to go into production
where they begin interacting with the world in real time. Although models
are built on existing or historical data, once a model is in production, it
uses new data to generate predictions.
Often, models are deployed as active, meaning they produce the
predictions the business acts on. Alternatively, they can be deployed as
challengers, which lets the business compare candidate models against
one another, and against its current way of doing things, to see what
performs best.
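A minimal sketch of that active/challenger pattern, with two illustrative stand-in models: the active model's answer is served, while the challenger's answer is only logged for later comparison:

```python
active = lambda x: int(x > 0.5)      # the model the business acts on
challenger = lambda x: int(x > 0.4)  # scored silently on the same inputs

comparison_log = []

def serve(x):
    served = active(x)
    shadow = challenger(x)           # recorded, never returned
    comparison_log.append((x, served, shadow))
    return served

for x in (0.45, 0.9, 0.2):
    serve(x)

disagreements = sum(s != c for _, s, c in comparison_log)
```

Reviewing the log over time shows whether the challenger would have outperformed the active model, without ever exposing its predictions to users.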
5. Validate your models
You're almost done! But not quite. Even after the models have been
deployed and are producing predictions, they need to be validated on a
regular basis to ensure that they are performing well.
Sometimes models are built on a data set and, over time, their
performance degrades. This is often a result of external changes and can
be detected by looking at the so-called drift of the inputs, which refers to
changes in the inputs over time—specifically, between when a model was
trained and now.
Deep learning models are especially vulnerable to drift. With a data science
platform like RapidMiner, it is straightforward to identify and deal with drift
by retraining the model on updated data or taking other measures if
needed.
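One simple way to detect input drift is to compare the live distribution of a feature with its training distribution. In this sketch, a feature is flagged when its live mean has moved more than two training standard deviations (the threshold is an illustrative assumption):

```python
import statistics

train_ages = [31, 35, 29, 33, 30, 34, 32]  # inputs the model was trained on
live_ages = [44, 47, 43, 46, 45, 48, 44]   # inputs seen in production now

mu = statistics.mean(train_ages)
sigma = statistics.pstdev(train_ages)

shift = abs(statistics.mean(live_ages) - mu)
drifted = shift > 2 * sigma  # True here: retraining is warranted
```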
A standard part of deployment is model operations, or model ops. Model
ops is all about automating the maintenance around the deployed models.
This way, data scientists can be alerted if something is no longer working as
it should. Model ops also integrates the production models with an existing
IT infrastructure through APIs.
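The alerting side of model ops can be sketched as a recurring health check on recent live predictions. The accuracy floor and the alert message format are assumptions for illustration:

```python
ACCURACY_FLOor = None  # placeholder removed below
ACCURACY_FLOOR = 0.80  # assumed minimum acceptable live accuracy

def check_health(recent_predictions, recent_truth):
    # Compare recent live accuracy against the floor and flag degradation.
    correct = sum(p == t for p, t in zip(recent_predictions, recent_truth))
    accuracy = correct / len(recent_truth)
    if accuracy < ACCURACY_FLOOR:
        return f"ALERT: live accuracy {accuracy:.2f} below floor {ACCURACY_FLOOR}"
    return f"OK: live accuracy {accuracy:.2f}"
```

Wired to a scheduler and a notification channel, a check like this is what lets data scientists hear about a degrading model before the business does.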