1. Gather your data
Proper data science automation begins with getting the data into the
system. With a data science platform like RapidMiner, it doesn’t matter
where the data is coming from: local sources like a file, remote sources
such as a database, or even cloud-based data sources. Once the data has
been loaded into the system, the next step is data preparation.
2. Prep your data
Raw data needs to be converted into a form that can be used in model
training and testing. Some of the typical steps in data preparation include:
• Data cleaning: The process of preparing data for analysis and
modeling by identifying and dealing with issues such as incorrect
formatting, incomplete entries, or other problems that make the data
difficult or impossible to use.
• Feature selection: Choosing the relevant variables that will have the
most impact on the model. Perhaps surprisingly, reducing the number
of variables the model uses can often increase its performance.
• Data transformation: Adapting the format of the raw source data so
it is better suited to the project.
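The three prep steps above can be sketched in plain Python on a toy dataset. The column names, the drop-incomplete-rows rule, and the min-max scaling are all illustrative assumptions for the sketch, not RapidMiner behavior:

```python
# Toy customer rows; None marks an incomplete entry.
rows = [
    {"age": 34, "income": 72000, "zip": "02139", "churned": 0},
    {"age": None, "income": 58000, "zip": "10001", "churned": 1},
    {"age": 45, "income": 91000, "zip": "94103", "churned": 0},
    {"age": 29, "income": 47000, "zip": "60601", "churned": 1},
]

# 1. Data cleaning: drop incomplete entries.
clean = [r for r in rows if all(v is not None for v in r.values())]

# 2. Feature selection: keep numeric predictors; drop the id-like "zip".
features = ["age", "income"]

# 3. Data transformation: min-max scale each selected feature to [0, 1].
def scale(col):
    vals = [r[col] for r in clean]
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) for v in vals]

X = list(zip(*(scale(f) for f in features)))  # one tuple per clean row
y = [r["churned"] for r in clean]             # the label column
```

A real pipeline would of course impute rather than drop rows when data is scarce, which is one of the automated choices described next.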
With the RapidMiner platform, these steps are easy, as many of them are
automated and others are guided. For data cleaning, automation can
perform normalizations and principal component analyses (PCAs) and deal
with problematic or missing values, among other actions.
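As a rough stand-in for two of those automated cleaning actions, here is mean-imputation of missing values followed by z-score normalization in plain Python (illustrative only, not the platform's actual implementation):

```python
import statistics

ages = [34, None, 45, 29, None, 52]  # None marks missing values

# Impute: replace each missing value with the mean of the known values.
known = [a for a in ages if a is not None]
fill = statistics.mean(known)
imputed = [a if a is not None else fill for a in ages]

# Normalize: z-score so the column has mean 0 and standard deviation 1.
mu = statistics.mean(imputed)
sigma = statistics.pstdev(imputed)
normalized = [(a - mu) / sigma for a in imputed]
```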
Because many of the data cleaning and transformation steps are highly
domain-specific, a data scientist's best approach, if they aren’t an expert
themselves, is to partner with subject matter experts in the domain. The
automated data cleaning fixes all sorts of problems and helps make the
data more suitable for optimal machine learning.
3. Build your model
The next step in the process is model building. This is where the rubber
meets the road for data scientists. With an automated data science
platform, the best model for the data is selected automatically. With
many platforms (including RapidMiner), you can override this choice and
try a different kind of model for your data.
Something noteworthy during this step is feature engineering. This is
when you use existing features or variables to identify and derive new
relationships in the data. RapidMiner's Auto Model can tackle this for you.
Auto Model uses a multi-objective optimization technique that generates
additional features from the existing ones. Automatic feature
engineering can sometimes introduce overfitting (adding more parameters
to the feature space than the data justifies), but Auto Model's
multi-objective approach guards against this by balancing the complexity
of your model against its performance as it works.
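The idea of trading fit against complexity can be sketched as follows. Deriving pairwise-product features and scoring each candidate by correlation minus a per-term complexity penalty is a toy stand-in for Auto Model's optimizer, not its real code:

```python
from itertools import combinations

base = {"age": [34, 45, 29], "income": [7.2, 9.1, 4.7]}
y = [0, 0, 1]

def corr(xs, ys):
    # Absolute Pearson correlation as a cheap "fit" score.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return abs(cov / (vx * vy))

# Derive candidate features: pairwise products of the base features.
candidates = dict(base)
for a, b in combinations(base, 2):
    candidates[f"{a}*{b}"] = [x * z for x, z in zip(base[a], base[b])]

# Score each candidate: fit minus a complexity cost per derived term.
penalty = 0.05
scored = {name: corr(vals, y) - penalty * name.count("*")
          for name, vals in candidates.items()}
best = max(scored, key=scored.get)
```

Here the penalty keeps a derived feature from winning unless it clearly outperforms the simpler base features it was built from.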
The RapidMiner platform presents users with the models it believes will
perform best with the data provided, in a way that humans can easily
understand, in a reasonable amount of time, while allowing for override if
the need is there. This combination of automated features makes it easy to
get a model up and running in no time, even if you aren't a data science
expert.
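The selection step itself amounts to training several candidates and keeping the one that scores best on held-out data. A minimal sketch, where the two "models" are deliberately tiny stand-ins for the real algorithms a platform would try:

```python
train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
holdout = [(0.3, 0), (0.7, 1), (0.8, 1)]

def majority_model(data):
    # Always predict the most common training label.
    labels = [y for _, y in data]
    label = max(set(labels), key=labels.count)
    return lambda x: label

def threshold_model(data):
    # Predict 1 when x is above the mean of the training inputs.
    cut = sum(x for x, _ in data) / len(data)
    return lambda x: int(x >= cut)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

candidates = {"majority": majority_model(train),
              "threshold": threshold_model(train)}
best_name = max(candidates, key=lambda n: accuracy(candidates[n], holdout))
```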
4. Push your model into production
Once the models are built and decided on, they have to go into production
where they begin interacting with the world in real time. Although models
are built on existing or historical data, once a model is in production, it
uses new data to generate predictions.
Often, models are deployed as active, meaning they produce the
predictions the business acts on. Alternatively, they can be deployed as
challengers, which lets the business compare candidate models against
one another, and against its current way of doing things, to see what
performs best.
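A minimal sketch of that active/challenger pattern, with two illustrative stand-in models: the active model's answer is served, while the challenger's answer is only logged for later comparison:

```python
active = lambda x: int(x > 0.5)      # the model the business acts on
challenger = lambda x: int(x > 0.4)  # scored silently on the same inputs

comparison_log = []

def serve(x):
    served = active(x)
    shadow = challenger(x)           # recorded, never returned
    comparison_log.append((x, served, shadow))
    return served

for x in (0.45, 0.9, 0.2):
    serve(x)

disagreements = sum(s != c for _, s, c in comparison_log)
```

Reviewing the log over time shows whether the challenger would have outperformed the active model, without ever exposing its predictions to users.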
5. Validate your models
You're almost done! But not quite. Even after the models have been
deployed and are producing predictions, they need to be validated on a
regular basis to ensure that they are performing well.
Sometimes models are built on a data set and, over time, their
performance degrades. This is often a result of external changes and can
be detected by looking at the so-called drift of the inputs, which refers to
changes in the inputs over time—specifically, between when a model was
trained and now.
Deep learning models are especially vulnerable to drift. With a data science
platform like RapidMiner, it is straightforward to identify and deal with drift
by retraining the model on updated data or taking other measures if
needed.
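One simple way to detect input drift is to compare the live distribution of a feature with its training distribution. In this sketch, a feature is flagged when its live mean has moved more than two training standard deviations (the threshold is an illustrative assumption):

```python
import statistics

train_ages = [31, 35, 29, 33, 30, 34, 32]  # inputs the model was trained on
live_ages = [44, 47, 43, 46, 45, 48, 44]   # inputs seen in production now

mu = statistics.mean(train_ages)
sigma = statistics.pstdev(train_ages)

shift = abs(statistics.mean(live_ages) - mu)
drifted = shift > 2 * sigma  # True here: retraining is warranted
```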
A standard part of deployment is model operations, or model ops. Model
ops is all about automating the maintenance around the deployed models.
This way, data scientists can be alerted if something is no longer working as
it should. Model ops also integrates the production models with an existing
IT infrastructure through APIs.
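The alerting side of model ops can be sketched as a recurring health check on recent live predictions. The accuracy floor and the alert message format are assumptions for illustration:

```python
ACCURACY_FLOor = None  # placeholder removed below
ACCURACY_FLOOR = 0.80  # assumed minimum acceptable live accuracy

def check_health(recent_predictions, recent_truth):
    # Compare recent live accuracy against the floor and flag degradation.
    correct = sum(p == t for p, t in zip(recent_predictions, recent_truth))
    accuracy = correct / len(recent_truth)
    if accuracy < ACCURACY_FLOOR:
        return f"ALERT: live accuracy {accuracy:.2f} below floor {ACCURACY_FLOOR}"
    return f"OK: live accuracy {accuracy:.2f}"
```

Wired to a scheduler and a notification channel, a check like this is what lets data scientists hear about a degrading model before the business does.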