
Taylor Collard

17 May 2024
Outline

• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

2
Executive Summary
• Methodologies used:
○ API calls and Web Scraping
○ SQL queries
○ Python visualization libraries
○ Simple Linear Regression
○ K Nearest Neighbor
○ Decision Trees
○ Logistic Regression
○ Support Vector Machines

• Summary of all results
○ We collected data from the public SpaceX API and by scraping Wikipedia's SpaceX launch data
○ We identified statistical correlations and built a prediction model with 83% accuracy

3
Introduction
• SpaceX's advantage comes from reusing the first stage of its rockets, saving clients over 100 million dollars per launch
• SpaceX advertises Falcon 9 launches on its website at a cost of 62 million dollars; other providers charge upwards of 165 million dollars per launch
• Therefore, if we can predict whether the first stage will land, we can estimate the cost of SpaceX launches and use this information to our advantage.

• In the following analysis, we examine SpaceX launch data to determine statistically significant factors that help us build machine learning models to predict the success or failure of first-stage rocket landings.

4
Section 1

5
Methodology

Executive Summary
• Data collection methodology:
○ Data was collected from both the SpaceX API and by web scraping public data (Wikipedia) on SpaceX launches
• Perform data wrangling
○ Using Pandas and NumPy, nulls were removed or aggregated, launch sites and orbits identified, and landing outcomes integrated into the data
• Perform exploratory data analysis (EDA) using visualization and SQL
• Perform interactive visual analytics using Folium and Plotly Dash
• Perform predictive analysis using classification models
○ How to build, tune, and evaluate classification models

6
Data Collection – SpaceX API

• Defined functions to target certain columns of data, called the API, and converted the returned JSON into a Dataframe
• Flowchart: Import Libraries → Fetch Data from SpaceX API → Check Response Status → Parse JSON → Convert JSON Data to Dataframe → Clean Data → Save Data to CSV
• GitHub URL of the completed SpaceX API calls notebook included as an external reference and for peer review
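The boxed flow above can be sketched in Python. The v4 endpoint named in the comment is the real public SpaceX API, but the sample payload and the `to_rows` helper below are illustrative stand-ins for the notebook's code:

```python
import json

# Illustrative sketch of the API-collection step. In the notebook the payload
# would come from requests.get("https://api.spacexdata.com/v4/launches/past");
# here a small inline sample stands in so the flattening logic is self-contained.
SAMPLE = json.loads("""
[{"flight_number": 1, "date_utc": "2010-06-04T18:45:00.000Z",
  "success": true, "rocket": "falcon9"},
 {"flight_number": 2, "date_utc": "2012-05-22T07:44:00.000Z",
  "success": false, "rocket": "falcon9"}]
""")

def to_rows(launches, columns=("flight_number", "date_utc", "success")):
    """Keep only the targeted columns from each launch record."""
    return [{c: launch.get(c) for c in columns} for launch in launches]

rows = to_rows(SAMPLE)
# rows could now be loaded with pandas.DataFrame(rows) and saved via .to_csv()
```

Targeting columns before building the Dataframe keeps the CSV small, since the raw API records carry many nested fields the analysis never uses.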

7
Data Collection - Scraping

• Utilized an HTTP GET method to fetch the page, BeautifulSoup to parse the data as an object, and targeted searches to extract the data
• Flowchart: Import Libraries → Fetch Web Page → Check Response Status → Parse HTML Content → Extract Data → Store Data in Dataframe → Clean Data → Save Data to CSV
• GitHub URL of the completed web scraping notebook included as an external reference and for peer review
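A minimal sketch of the scraping step, assuming BeautifulSoup as in the notebook; the inline HTML snippet stands in for the Wikipedia launch-table page so the parsing logic is self-contained:

```python
from bs4 import BeautifulSoup

# Illustrative sketch of the scraping step; the notebook fetches the Wikipedia
# SpaceX launch page with an HTTP GET, but an inline snippet is used here.
HTML = """
<table>
  <tr><th>Launch site</th><th>Payload mass</th></tr>
  <tr><td>CCAFS SLC-40</td><td>4,700 kg</td></tr>
  <tr><td>KSC LC-39A</td><td>9,600 kg</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
# Targeted searches: header cells give column names, body cells give rows.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")[1:]]
```

`headers` and `rows` would then feed a Dataframe and the CSV save step from the flowchart.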

8
Data Wrangling

• Using Pandas and NumPy, nulls were removed or aggregated, launch sites and orbits identified, and landing outcomes integrated into the data
• Flowchart: Import Libraries → Load Data → Inspect Data → Clean Data → Transform Data → Feature Engineering → Save Cleaned Data
• GitHub URL of the completed data wrangling notebooks included as an external reference and for peer review
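The wrangling step can be sketched on a toy frame; column names such as `PayloadMass` and `Outcome` are illustrative stand-ins, not necessarily the notebook's exact schema:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the loaded launch data.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC-40", "CCAFS SLC-40", "KSC LC-39A"],
    "PayloadMass": [500.0, np.nan, 9600.0],
    "Outcome": ["False ASDS", "True ASDS", "True RTLS"],
})

# Aggregate out nulls: replace missing payload masses with the column mean.
df["PayloadMass"] = df["PayloadMass"].fillna(df["PayloadMass"].mean())

# Integrate the landing outcome as a binary target (1 = successful landing).
bad_outcomes = {"False ASDS", "False RTLS", "False Ocean", "None None"}
df["Class"] = (~df["Outcome"].isin(bad_outcomes)).astype(int)
```

The resulting `Class` column is the target the later classification models predict.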

9
EDA with Data Visualization

• Exploratory Data Analysis was conducted using:
○ categorical scatter plots to visualize success rates and the relationships of
■ Payload Mass (kg) vs. Flight Number
■ Launch Site vs. Flight Number
■ Payload Mass (kg) vs. Launch Site
○ A scatter plot for Flight Number vs. Orbit Type
○ A bar chart for Orbit Type vs. Success Rate
○ A line graph for Success Rate vs. Time
• GitHub URL of the completed EDA with data visualization notebook included as an external reference and for peer review
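One of the plots above might be sketched as follows with Matplotlib; the data points are illustrative, not the real launch records:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy data for a Payload Mass vs. Flight Number scatter, colored by outcome.
flight_number = [1, 2, 3, 4, 5, 6]
payload_mass = [500, 3200, 4700, 2300, 9600, 6100]
success = [0, 0, 1, 0, 1, 1]  # 1 = successful landing

fig, ax = plt.subplots()
scatter = ax.scatter(flight_number, payload_mass, c=success, cmap="coolwarm")
ax.set_xlabel("Flight Number")
ax.set_ylabel("Payload Mass (kg)")
ax.set_title("Payload Mass vs. Flight Number (color = landing success)")
fig.savefig("flight_vs_payload.png")
```

Coloring by the binary `Class` target is what turns an ordinary scatter into the categorical scatter plots listed above.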
10
EDA with SQL

• SQL queries were used to:
○ Create the initial table
○ Select distinct launch sites
○ Inspect payload mass by customer and by booster version
○ Find the first successful landing date
○ Discover boosters used for successful drone ship landings
○ Find specific failure locations by date
○ Rank all landing outcomes by count
• GitHub URL of the completed EDA with SQL notebook included as an external reference and for peer review
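A couple of these queries can be sketched against an in-memory SQLite table; the schema loosely mirrors SPACEXTABLE, but the rows are illustrative:

```python
import sqlite3

# In-memory stand-in for the notebook's SPACEXTABLE.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE SPACEXTABLE (
    Date TEXT, Launch_Site TEXT, Landing_Outcome TEXT)""")
con.executemany(
    "INSERT INTO SPACEXTABLE VALUES (?, ?, ?)",
    [("2010-06-04", "CCAFS LC-40", "Failure (parachute)"),
     ("2015-12-22", "CCAFS LC-40", "Success (ground pad)"),
     ("2017-02-19", "KSC LC-39A", "Success (drone ship)")])

# Distinct launch sites
sites = [r[0] for r in con.execute(
    "SELECT DISTINCT Launch_Site FROM SPACEXTABLE")]

# First successful landing date
first_success = con.execute(
    "SELECT MIN(Date) FROM SPACEXTABLE WHERE Landing_Outcome LIKE 'Success%'"
).fetchone()[0]
```

Because the dates are stored as ISO-8601 text, `MIN(Date)` sorts them correctly without any date parsing.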

11
Build an Interactive Map with Folium

• Folium map created with:
○ Circles and markers to denote labeled launch sites
○ Markers for landing results geolocated at the corresponding launch sites
■ Marker clusters used to declutter
○ Markers and lines used to mark and measure the distance from each launch site to the nearest coastline, railway, highway, and city
• GitHub URL of the completed Folium interactive map notebook included as an external reference and for peer review (use nbviewer.org to view the maps)

12
Build a Dashboard with Plotly Dash

• Pie charts were used to show success/failure rates per launch site, and the overall share of successful launches, in order to determine any statistical significance of launch location
• A categorical scatter plot was used to visualize the relationship between booster type, payload, and success rate across all locations
• GitHub URL of the completed Plotly Dash lab (code only) included as an external reference and for peer review
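The aggregation behind these pie charts reduces to a groupby; this sketch uses illustrative rows and the binary `Class` success column assumed throughout:

```python
import pandas as pd

# Toy launch records; the dashboard itself is built with Plotly Dash, but the
# per-site shares it visualizes come from simple groupby reductions like these.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC-40", "CCAFS SLC-40", "KSC LC-39A", "KSC LC-39A"],
    "Class": [1, 0, 1, 1],  # 1 = successful landing
})

# Each site's share of all successful launches (the overall pie chart).
share = df[df["Class"] == 1].groupby("LaunchSite").size() / df["Class"].sum()

# Per-site success rate (the pie shown after a dropdown selection).
rate = df.groupby("LaunchSite")["Class"].mean()
```

The distinction between the two reductions matters on the next slides: a site can hold the largest share of successes while still having a low success rate.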

13
Predictive Analysis (Classification)

• Target data (class/success) was assigned to Y, and the remaining data fit to X; both variables were split into training and testing sets
• Models were trained and evaluated using GridSearch and the fit and score functions defined in the supporting libraries
• Confusion matrices show false-negative and false-positive results
• Flowchart: Import Libraries → Load Data → Preprocess Data → Split Data → Train Model → Evaluate Model → Optimize Model → Save Model
• GitHub URL of the completed predictive analysis lab included as an external reference and for peer review
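The train/tune/evaluate flow above can be sketched on synthetic data; the notebook runs the same pattern across logistic regression, SVM, KNN, and decision trees:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the launch features and binary landing target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split Data: hold out a test set for the final score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Train + Optimize: GridSearchCV tunes hyperparameters via cross-validation.
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate: .score reports accuracy on the held-out test data.
accuracy = grid.score(X_test, y_test)
```

Swapping `LogisticRegression` for `SVC`, `KNeighborsClassifier`, or `DecisionTreeClassifier` (with their own parameter grids) reproduces the full model comparison.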

14
Results

• EDA
○ Year is most significantly related to success
○ No strong correlation between orbit type and success
○ Some correlation between launch site and success rate
■ However, more launches occurred at CCAFS SLC-40 early on, when failure was more common
■ The same location has more successes than other locations later on
• Interactive analytics demo in screenshots on later slides
• Predictive analysis results
○ Models showing ~83% accuracy on test data

15
Section 2
Flight Number vs. Launch Site

• Early reliance on CCAFS SLC-40
○ Apparently due to early experience rather than location (see later successes)
• More success over time is apparent at all sites

17
Payload vs. Launch Site

• Greater success rate at higher payloads
• Most failures are at lower payloads at CCAFS SLC-40
○ Early tests, or indicative of a future trend?

18
Success Rate vs. Orbit Type

• Clear illustration of orbit success rates
• Further analysis needed to determine if orbit is the primary factor

19
Flight Number vs. Orbit Type

• LEO, MEO: low success rate in the bar chart due to low launch counts
• GTO, PO, ISS: some recent failures

20
Payload vs. Orbit Type

• Some orbits with lower overall success rates are more successful at higher payloads

21
Launch Success Yearly Trend

• Most significant correlation
○ Significant increases in success over time
○ Needs updating with the last four years of data

22
All Launch Site Names

• ‘DISTINCT’ returns only unique occurrences

23
Launch Site Names Begin with 'CCA'

• ‘LIKE’ is a wildcard search tool that is extremely useful in cases where there is potential unknown variance or errors

24
Total Payload Mass

• Calculating the total payload carried by boosters for NASA
• A simple example of utilizing the SUM function

25
Average Payload Mass by F9 v1.1

• Calculation of the average payload mass carried by booster version F9 v1.1
○ There are potential interpretation differences here, as F9 v1.1 is also listed with B1011–B1017 suffixes

26
First Successful Ground Landing Date

• Use MIN for the earliest date, MAX for the latest date

27
Successful Drone Ship Landing with Payload between 4000 and 6000

• Utilizing ‘AND’ to filter for multiple conditions:
○ %sql SELECT Booster_Version, Landing_Outcome, PAYLOAD_MASS__KG_ FROM SPACEXTABLE WHERE Landing_Outcome is 'Success (drone ship)' AND PAYLOAD_MASS__KG_ > 4000 AND PAYLOAD_MASS__KG_ < 6000;

28
Total Number of Successful and Failure Mission Outcomes

• In this data all 98 missions are shown as successes (mission outcome, not landing outcome), leading to a potentially confusing query result, as failures are not shown as ‘0’ despite being included in the query

29
Boosters Carried Maximum Payload

• A subquery is utilized to filter all rows with the maximum payload; all are shown to be F9 B5 boosters
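The subquery pattern can be sketched with SQLite and toy rows standing in for SPACEXTABLE:

```python
import sqlite3

# Toy table; the real query filters SPACEXTABLE on MAX(PAYLOAD_MASS__KG_).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE SPACEXTABLE (Booster_Version TEXT, PAYLOAD_MASS__KG_ INT)")
con.executemany("INSERT INTO SPACEXTABLE VALUES (?, ?)",
                [("F9 v1.1 B1011", 4428),
                 ("F9 B5 B1048", 15600),
                 ("F9 B5 B1051", 15600)])

# The inner SELECT computes the maximum once; the outer SELECT keeps every
# row that matches it, so ties are all returned.
boosters = [r[0] for r in con.execute("""
    SELECT Booster_Version FROM SPACEXTABLE
    WHERE PAYLOAD_MASS__KG_ =
          (SELECT MAX(PAYLOAD_MASS__KG_) FROM SPACEXTABLE)""")]
```

A bare `MAX()` in the outer query would collapse the result to one row; the subquery is what lets every tied booster appear.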

30
2015 Launch Records

• Query formatting was changed here to fit on the slide
• Substrings were used because month names are not supported in SQLite
• These boosters from CCAFS LC-40 were the only two drone ship landing failures in 2015

31
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20

• Shown here is the ranked count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between 2010-06-04 and 2017-03-20, in descending order
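The ranking query can be sketched the same way with SQLite; the rows here are illustrative:

```python
import sqlite3

# Toy stand-in for SPACEXTABLE with a few landing outcomes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE SPACEXTABLE (Date TEXT, Landing_Outcome TEXT)")
con.executemany("INSERT INTO SPACEXTABLE VALUES (?, ?)",
                [("2012-05-22", "No attempt"),
                 ("2015-01-10", "Failure (drone ship)"),
                 ("2015-04-14", "Failure (drone ship)"),
                 ("2016-04-08", "Success (drone ship)")])

# Count each outcome inside the date window, most frequent first.
ranked = con.execute("""
    SELECT Landing_Outcome, COUNT(*) AS n
    FROM SPACEXTABLE
    WHERE Date BETWEEN '2010-06-04' AND '2017-03-20'
    GROUP BY Landing_Outcome
    ORDER BY n DESC""").fetchall()
```

`BETWEEN` works directly on the text dates because, as above, they are stored in ISO-8601 order.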

32
Section 3
Launch Locations

• Launch locations marked and labeled; Florida locations overlap due to proximity

34
Landing result marker clusters

• Marker Cluster elements show launch outcomes decluttered for visibility (green = successful landings, red = failed landings)
• Lat/long details are also integrated into the map to follow the user’s mouse

35
KSC-LC proximity calculations
• Kennedy Space Center Launch Complex 39A and nearest landmarks
○ Nearest city: Titusville, 16.3 km
○ Nearest ‘highway’: Kennedy Parkway North, 0.85 km
○ Nearest railway: 5.52 km
○ Nearest coastline: 6.78 km
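These distances come down to great-circle geometry; a haversine helper reproduces the Titusville figure from approximate coordinates (both coordinate pairs below are approximations, not values taken from the notebook):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius ≈ 6371 km

ksc_lc39a = (28.5733, -80.6469)   # approximate pad coordinates
titusville = (28.6123, -80.8076)  # approximate city center
distance = haversine_km(*ksc_lc39a, *titusville)  # roughly 16 km
```

The same helper, pointed at coastline, railway, and highway coordinates clicked on the Folium map, yields the other figures above.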

36
Section 4
Share of Successful Launches by Site

• The current view shows almost half of all successful landings come from CCAFS SLC-40
• When a specific launch site is chosen from the dropdown, its share of successes vs. failures is shown

38
CCAFS SLC-40 has highest success ratio

• Although CCAFS SLC-40 has almost half of all total successful landings, it still has only a 42.9% success rate

39
Payload vs. Success Rate (all sites)

• This slider view shows that, in the 1000–6000 kg range across all sites, the FT booster shows significantly higher success rates than the v1.1 booster
Return to slide 17
40
Section 5
Classification Accuracy
An example of the .score method

• All models had the same accuracy on the test data according to the score method, likely due to sample size

42
Classification Accuracy
Code used to cross-validate

• Despite all models performing the same with the .score method, cross-validation predicts SVM will be the most accurate model

43
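The cross-validation comparison might be sketched as follows on synthetic data; `SVC` stands in for the tuned SVM model, and the other three models would be scored the same way:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the launch features and binary landing target.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Five held-out-fold accuracies; the mean is a steadier estimate than a
# single .score call on one small test split.
scores = cross_val_score(SVC(), X, y, cv=5)
mean_accuracy = scores.mean()
```

With a test set this small, single-split `.score` values quantize to the same few numbers, which is why cross-validation is needed to separate the models at all.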
Confusion Matrix
• Confusion matrices shown for the Decision Tree model and for all other models
• All models performed approximately the same, with 15 accurate classifications on the test data and 3 inaccurate
• The decision tree model differed only by having one false negative rather than three false positives
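Reading false positives and negatives off a confusion matrix can be sketched with toy predictions that mirror the slide's pattern (15 correct, 3 false positives):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 18 test samples, of which three true failures are predicted
# as successes (false positives), matching the "all other models" matrix.
y_true = [0] * 9 + [1] * 9
y_pred = [0] * 6 + [1] * 3 + [1] * 9

cm = confusion_matrix(y_true, y_pred)
# Rows are the actual class, columns the predicted class.
tn, fp, fn, tp = cm.ravel()
```

For landing prediction, false positives are the costly cell: predicting a successful landing (and a cheap reflight) when the booster is actually lost.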
44
Conclusions

• The dataset is far too small to reliably discern between machine learning models
• Most of the statistical significance here seems to boil down to “over time SpaceX has improved its processes and has more successes”
• Further analysis is needed with more recent data

45
Appendix

• GitHub repository of IBM course labs
○ https://github.com/tjcolla/ibm_course/tree/main
• As a note, I use the free software Obsidian to take notes on the code used in the course, enabling markdown and inserting Python into the notes via
```python
Example code
```

46
