CENG3300 Lecture 3


Data Science for Molecular Engineering
Lecture 3
Data processing
ILOs
1. Define a machine learning problem;
2. Determine relevant data needed;
3. Explore the data;
4. Prepare data for better use.
Defining the problem
• The most important question, but often overlooked
• How will your solution be used?
• Are there alternative solutions? What are the pros and cons of each?
• What data are you going to use? Is the performance measurable?
• What assumptions have you made?
Defining the problem - example
“I will build a machine learning model to predict the toxicity of a
molecule.”
• The solution would be used to predict the toxicity of molecules, which can be
valuable in drug discovery and chemical safety assessment.
• Alternatives:
• I can look up toxicity in a handbook. Pros? Cons?
• I can analyze toxicity by functional groups. Pros? Cons?
• I will use a dataset of molecular structures and their toxicity information, like
binary biological response, or continuous bioactivity data.
• Assumptions:
• Toxicity is determined by molecular structure
• Dataset is available with acceptable quality
Defining the problem - example
“I will build a machine learning model to predict the best temperature
of a chemical reactor.”
• It can assist in process optimization, control, and automation. By accurately predicting
the best temperature, it can help improve the efficiency, yield, and safety of chemical
reactions.
• Alternatives:
• Traditional feedback control
• Physics-based models
• I will use a dataset containing historical operational data of the chemical reactor,
including inputs (such as reactant concentrations, flow rates, catalyst properties) and
corresponding temperature values.
• Assumptions:
• Each reaction scenario has a unique optimal temperature
• Reaction data are measured with high confidence
Defining the problem - example
“I will build a machine learning model to predict whether a medical
image contains cancer cells/tissues.”
• It can help improve the accuracy and efficiency of medical diagnosis.
• Alternatives:
• Expert manual inspection
• Pattern recognition tools
• I would need a dataset of medical images labeled as containing cancer
cells/tissues or not. The dataset should include a diverse set of images from
different patients, imaging modalities, and cancer types.
• Assumptions:
• Cancer cells can be determined from features in the medical images
• Cancer cells share some common features that differ from those of normal cells
What might make it difficult/inappropriate to
formulate the following data science problems?
• “I would like to build a machine learning model for predicting the molecular
weight of a molecule.”

• “I would like to build a machine learning model to predict what is the best drug
in the world.”

• “I would like to build a machine learning model to determine the optimal
conditions for cell culture growth and maintenance.”

• “I would like to build a machine learning model to determine the long-term
health effect of nuclear wastewater.”
Get the data
• 1. What data do you need? How much data do you need?
• 2. Do you have the space to store the data?
• 3. Is the data public? Or do you need authorization to obtain the
data?
• 4. Are there any legal obligations when using the data?
Data collection
• Existing database
• Web downloading
• Web scraping
• Purchase of commercial datasets
• Creating new database
• Computational simulation
• Experimental measurements
• Survey/questionnaire
Explore the data
• You manage the molecular dataset at a pharmaceutical company. One
day your manager tells you about a new database the company is
considering buying to support drug discovery and development. The
concern is whether this new database overlaps substantially with the
company’s existing database. Both databases are huge, ~10 million
molecular structures each, and the vendor has agreed to share 1% of
the data as a sample. What would you do to help with the decision
making?
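One way to start: if each molecule has a canonical identifier (e.g. a canonical SMILES string), the fraction of the vendor’s random 1% sample already present in-house estimates the overlap for the full database. A minimal sketch, with hypothetical structures:

```python
# Estimate overlap between the in-house database and the vendor's 1% sample,
# assuming each molecule is identified by a canonical SMILES string.
# All structures below are toy examples, not real database contents.
company_db = {"CCO", "C1=CC=CC=C1", "CC(=O)O", "O"}   # in-house structures
vendor_sample = ["CCO", "CC(=O)O", "CCN", "CCCl"]     # vendor's 1% sample

# Fraction of sampled molecules already in our database.
hits = sum(1 for smiles in vendor_sample if smiles in company_db)
overlap_fraction = hits / len(vendor_sample)
print(f"Estimated overlap: {overlap_fraction:.0%}")   # 2 of 4 -> 50%
```

If the sample is truly random, this fraction is an unbiased estimate of the overlap across all ~10 million structures; in practice, canonicalizing both sides first (e.g. with a cheminformatics toolkit) is what makes the set membership test meaningful.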
Explore the data
1. Create a copy! (backup, down-sampling)
2. Check the format of each variable and look for missing values
3. Calculate some descriptive statistics
4. Visualize the data
5. Look for correlations
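The checklist above maps directly onto a few pandas calls. A sketch on a toy dataset (the column names are hypothetical):

```python
import pandas as pd

# Toy molecular dataset; "mol_weight" and "logP" are hypothetical columns.
df = pd.DataFrame({
    "mol_weight": [46.07, 78.11, 60.05, None],
    "logP": [-0.31, 2.13, -0.17, 1.05],
})

print(df.dtypes)                    # step 2: format of each variable
print(df.isna().sum())              # step 2: missing values per column
print(df.describe())                # step 3: descriptive statistics
print(df.corr(numeric_only=True))   # step 5: pairwise correlations
```

Step 4 (visualization) would typically follow with `df.plot()` or a plotting library once the numbers look sane.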
Data exploration
• Descriptive statistics can help you get a feel for the distribution of
the data
• Visualizations, where applicable, can be very helpful
• Look for similarities and correlations
• Experiment with feature combinations (use domain knowledge
and intuition)
Visualization
[Figures: temporal trend, spatial distribution, correlation]
Process the data
• Varies based on data exploration and project objective
• Data cleaning
• Data truncation
• Raw data transformation
• Feature scaling
• Develop Python functions/objects for convenient processing of
additional future data
Feature combination/transformation
• Multiplication or division between features
• E.g. average population per square meter

• Transformation of features
• E.g. Temperature and conversion
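The division example above (average population per square meter) is a one-liner in pandas. A sketch with toy numbers:

```python
import pandas as pd

# Derive a ratio feature from two raw features (toy values).
df = pd.DataFrame({"population": [8000, 1200], "area_m2": [4000, 300]})
df["pop_per_m2"] = df["population"] / df["area_m2"]   # people per square meter
print(df["pop_per_m2"].tolist())   # [2.0, 4.0]
```

The same pattern applies to chemical features, e.g. dividing moles reacted by moles fed to get conversion.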
Data processing
• Real data can be messy
• Missing values
• Duplicate values
• Sometimes nontrivial, e.g., “H2O” and “water” represent the same thing
• Redundant features
• The “long tail”
• Multiple datasets
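For nontrivial duplicates like “H2O” vs. “water”, one common approach is to map known aliases to a canonical name before deduplicating. A sketch with a hypothetical alias table:

```python
import pandas as pd

# Map aliases to one canonical name, then drop exact duplicates.
aliases = {"H2O": "water"}   # hypothetical alias table
df = pd.DataFrame({"compound": ["water", "H2O", "ethanol"]})
df["compound"] = df["compound"].replace(aliases)
df = df.drop_duplicates(subset="compound")
print(df["compound"].tolist())   # ['water', 'ethanol']
```

In practice the alias table itself takes effort to build (lookup services, string normalization, domain knowledge); the code is the easy part.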
Data cleaning – missing values
• Missing values
• Machine learning models generally cannot work with missing values
• However, missing values are prevalent due to various causes
• For experimental data
• For simulation data
• For survey data
• For transcript data
Data cleaning – missing values
• What would you do?
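Two common answers, sketched with pandas on a toy column. Which one is appropriate depends on why the values are missing and how many there are:

```python
import pandas as pd

# Toy column with missing entries.
s = pd.Series([1.0, None, 3.0, None, 5.0])

dropped = s.dropna()          # option 1: discard incomplete records
filled = s.fillna(s.mean())   # option 2: impute with the mean (here 3.0)
print(dropped.tolist())       # [1.0, 3.0, 5.0]
print(filled.tolist())        # [1.0, 3.0, 3.0, 3.0, 5.0]
```

Dropping is safe when few rows are affected; imputation preserves dataset size but injects an assumption (here, that the mean is representative).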
Data truncation (optional)
• The long tail can contain highly noisy information that might not be
helpful for model development or understanding the data pattern;
• Data can be truncated based on physical knowledge;
• E.g. concentration data are typically ~1–10 mol/L; a record of 10000 mol/L is probably
erroneous
• Data can be truncated based on frequency, especially on “big datasets”
• Such truncation can substantially reduce the number of categories with little
loss of data
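Truncation by physical knowledge, using the concentration example above, is just a boolean mask:

```python
import pandas as pd

# Keep concentration records within a physically plausible range (~1-10 mol/L);
# 10000 mol/L is almost surely an erroneous record.
conc = pd.Series([2.5, 9.1, 10000.0, 4.2])   # mol/L, toy values
clean = conc[(conc >= 1.0) & (conc <= 10.0)]
print(clean.tolist())   # [2.5, 9.1, 4.2]
```

Frequency-based truncation works the same way, except the mask comes from `value_counts()` rather than physical bounds.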
Feature transformation
• Label encoding
• Text to label
• Can be sufficient for ordinal variables
• E.g. very bad – 1, bad – 2, neutral – 3, good – 4, excellent – 5;
• One-hot encoding

Why do we need one-hot encoding instead of just label encoding?
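A sketch contrasting the two encodings on toy data: label encoding suits ordinal variables, where the integer order is meaningful, while one-hot encoding avoids imposing a spurious order on nominal variables like solvent identity.

```python
import pandas as pd

# Label encoding for an ordinal variable (the order is real).
quality = pd.Series(["bad", "good", "excellent"])
order = {"very bad": 1, "bad": 2, "neutral": 3, "good": 4, "excellent": 5}
labels = quality.map(order)
print(labels.tolist())   # [2, 4, 5]

# One-hot encoding for a nominal variable (no inherent order).
solvent = pd.Series(["water", "ethanol", "water"])
onehot = pd.get_dummies(solvent)   # one binary column per category
print(onehot.columns.tolist())     # ['ethanol', 'water']
```

If solvents were label-encoded instead, a model would treat “water > ethanol” as a numeric fact, which is exactly the artifact one-hot encoding prevents.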
Feature scaling
• Machine learning models generally do not work well with features of
very different scales
• Why?
• Min-max scaling

• Standardization

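Both scalings are a few lines of NumPy. Min-max scaling maps each feature to [0, 1] via x' = (x − min)/(max − min); standardization gives zero mean and unit variance via z = (x − mean)/std. A sketch on toy data:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # toy feature values

# Min-max scaling: x' = (x - min) / (max - min), result in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: z = (x - mean) / std, result has mean 0 and std 1
standard = (x - x.mean()) / x.std()

print(minmax)     # endpoints map to 0.0 and 1.0
print(standard)   # symmetric around 0
```

Min-max is sensitive to outliers (one extreme value squashes everything else), while standardization assumes the spread is well summarized by the standard deviation; which to use depends on the data and the model.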