
Taylor Collard

17 May 2024
Outline

• Executive Summary
• Introduction
• Methodology
• Results
• Conclusion
• Appendix

2
Executive Summary
• Methodologies used:
○ API calls and Web Scraping
○ SQL queries
○ Python visualization libraries
○ Simple Linear Regression
○ K Nearest Neighbor
○ Decision Trees
○ Logistic Regression
○ Support Vector Machines

• Summary of all results
○ We collected data from the public SpaceX API and by scraping Wikipedia's SpaceX launch data
○ We identified statistical correlations and built a prediction model with 83% accuracy

3
Introduction
• SpaceX's advantage comes from reusing the first stage of its rockets, saving clients over 100 million dollars per launch
• SpaceX advertises Falcon 9 launches on its website at a cost of 62 million dollars; other providers charge upwards of 165 million dollars per launch
• Therefore, if we can predict whether the first stage will land, we can estimate the cost of SpaceX launches and use this information to our advantage.

• In the following analysis, we examine SpaceX launch data to determine statistically significant factors that help us build machine learning models to predict the success or failure of first-stage rocket landings.

4
Section 1

5
Methodology

Executive Summary
• Data collection methodology:
○ Data was collected from both the SpaceX API and by web scraping public data (Wikipedia) on SpaceX launches
• Perform data wrangling
○ Using Pandas and NumPy, nulls were removed or aggregated, launch sites and orbits identified, and landing outcomes integrated into the data
• Perform exploratory data analysis (EDA) using visualization and SQL
• Perform interactive visual analytics using Folium and Plotly Dash
• Perform predictive analysis using classification models
○ How to build, tune, and evaluate classification models

6
Data Collection – SpaceX API

• Defined functions to target certain columns of data, called the API, and converted the returned JSON into a Dataframe
• Flowchart: Import Libraries → Fetch Data from SpaceX API → Check Response Status → Parse JSON → Convert JSON Data to Dataframe → Clean Data → Save Data to CSV
• GitHub URL of the completed SpaceX API calls notebook included as an external reference and for peer review
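The boxed flow above can be sketched in Python. The v4 endpoint named in the comment is the real public SpaceX API, but the sample payload and the `to_rows` helper below are illustrative stand-ins for the notebook's code:

```python
import json

# Illustrative sketch of the API-collection step. In the notebook the payload
# would come from requests.get("https://api.spacexdata.com/v4/launches/past");
# here a small inline sample stands in so the flattening logic is self-contained.
SAMPLE = json.loads("""
[{"flight_number": 1, "date_utc": "2010-06-04T18:45:00.000Z",
  "success": true, "rocket": "falcon9"},
 {"flight_number": 2, "date_utc": "2012-05-22T07:44:00.000Z",
  "success": false, "rocket": "falcon9"}]
""")

def to_rows(launches, columns=("flight_number", "date_utc", "success")):
    """Keep only the targeted columns from each launch record."""
    return [{c: launch.get(c) for c in columns} for launch in launches]

rows = to_rows(SAMPLE)
# rows could now be loaded with pandas.DataFrame(rows) and saved via .to_csv()
```

Targeting columns before building the Dataframe keeps the CSV small, since the raw API records carry many nested fields the analysis never uses.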

7
Data Collection - Scraping

• Utilized an HTTP GET method to fetch the page, BeautifulSoup to parse the data as an object, and targeted searches to extract the data
• Flowchart: Import Libraries → Fetch Web Page → Check Response Status → Parse HTML Content → Extract Data → Store Data in Dataframe → Clean Data → Save Data to CSV
• GitHub URL of the completed web scraping notebook included as an external reference and for peer review
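A minimal sketch of the scraping step, assuming BeautifulSoup as in the notebook; the inline HTML snippet stands in for the Wikipedia launch-table page so the parsing logic is self-contained:

```python
from bs4 import BeautifulSoup

# Illustrative sketch of the scraping step; the notebook fetches the Wikipedia
# SpaceX launch page with an HTTP GET, but an inline snippet is used here.
HTML = """
<table>
  <tr><th>Launch site</th><th>Payload mass</th></tr>
  <tr><td>CCAFS SLC-40</td><td>4,700 kg</td></tr>
  <tr><td>KSC LC-39A</td><td>9,600 kg</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
# Targeted searches: header cells give column names, body cells give rows.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")[1:]]
```

`headers` and `rows` would then feed a Dataframe and the CSV save step from the flowchart.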

8
Data Wrangling

• Using Pandas and NumPy, nulls were removed or aggregated, launch sites and orbits identified, and landing outcomes integrated into the data
• Flowchart: Import Libraries → Load Data → Inspect Data → Clean Data → Transform Data → Feature Engineering → Save Cleaned Data
• GitHub URL of the completed data wrangling notebooks included as an external reference and for peer review
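The wrangling step can be sketched on a toy frame; column names such as `PayloadMass` and `Outcome` are illustrative stand-ins, not necessarily the notebook's exact schema:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the loaded launch data.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC-40", "CCAFS SLC-40", "KSC LC-39A"],
    "PayloadMass": [500.0, np.nan, 9600.0],
    "Outcome": ["False ASDS", "True ASDS", "True RTLS"],
})

# Aggregate out nulls: replace missing payload masses with the column mean.
df["PayloadMass"] = df["PayloadMass"].fillna(df["PayloadMass"].mean())

# Integrate the landing outcome as a binary target (1 = successful landing).
bad_outcomes = {"False ASDS", "False RTLS", "False Ocean", "None None"}
df["Class"] = (~df["Outcome"].isin(bad_outcomes)).astype(int)
```

The resulting `Class` column is the target the later classification models predict.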

9
EDA with Data Visualization

• Exploratory Data Analysis was conducted using:
○ categorical scatter plots to visualize success rates and the relationships of
■ Payload Mass (kg) vs. Flight Number
■ Launch Site vs. Flight Number
■ Payload Mass (kg) vs. Launch Site
○ A scatter plot for Flight Number vs. Orbit Type
○ A bar chart for Orbit Type vs. Success Rate
○ A line graph for Success Rate vs. Time
• GitHub URL of the completed EDA with data visualization notebook included as an external reference and for peer review
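One of the plots above might be sketched as follows with Matplotlib; the data points are illustrative, not the real launch records:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy data for a Payload Mass vs. Flight Number scatter, colored by outcome.
flight_number = [1, 2, 3, 4, 5, 6]
payload_mass = [500, 3200, 4700, 2300, 9600, 6100]
success = [0, 0, 1, 0, 1, 1]  # 1 = successful landing

fig, ax = plt.subplots()
scatter = ax.scatter(flight_number, payload_mass, c=success, cmap="coolwarm")
ax.set_xlabel("Flight Number")
ax.set_ylabel("Payload Mass (kg)")
ax.set_title("Payload Mass vs. Flight Number (color = landing success)")
fig.savefig("flight_vs_payload.png")
```

Coloring by the binary `Class` target is what turns an ordinary scatter into the categorical scatter plots listed above.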
10
EDA with SQL

• SQL queries were used to:
○ Create the initial table
○ Select distinct launch sites
○ Inspect payload mass by customer and by booster version
○ Find the first successful landing date
○ Discover boosters used for successful drone ship landings
○ Find specific failure locations by date
○ Rank all landing outcomes by count
• GitHub URL of the completed EDA with SQL notebook included as an external reference and for peer review
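A couple of these queries can be sketched against an in-memory SQLite table; the schema loosely mirrors SPACEXTABLE, but the rows are illustrative:

```python
import sqlite3

# In-memory stand-in for the notebook's SPACEXTABLE.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE SPACEXTABLE (
    Date TEXT, Launch_Site TEXT, Landing_Outcome TEXT)""")
con.executemany(
    "INSERT INTO SPACEXTABLE VALUES (?, ?, ?)",
    [("2010-06-04", "CCAFS LC-40", "Failure (parachute)"),
     ("2015-12-22", "CCAFS LC-40", "Success (ground pad)"),
     ("2017-02-19", "KSC LC-39A", "Success (drone ship)")])

# Distinct launch sites
sites = [r[0] for r in con.execute(
    "SELECT DISTINCT Launch_Site FROM SPACEXTABLE")]

# First successful landing date
first_success = con.execute(
    "SELECT MIN(Date) FROM SPACEXTABLE WHERE Landing_Outcome LIKE 'Success%'"
).fetchone()[0]
```

Because the dates are stored as ISO-8601 text, `MIN(Date)` sorts them correctly without any date parsing.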

11
Build an Interactive Map with Folium

• Folium map created with:
○ Circles and markers to denote labeled launch sites
○ Markers for landing results geolocated at the corresponding launch sites
■ Marker clusters used to declutter
○ Markers and lines used to mark and measure the distance from each launch site to the nearest coastline, railway, highway, and city
• GitHub URL of the completed Folium interactive map notebook included as an external reference and for peer review (use nbviewer.org to view the maps)

12
Build a Dashboard with Plotly Dash

• Pie charts were used to show success/failure rates per launch site, and the overall share of successful launches, in order to determine any statistical significance of launch location
• A categorical scatter plot was used to visualize the relationship between booster type, payload, and success rate across all locations
• GitHub URL of the completed Plotly Dash lab (code only) included as an external reference and for peer review
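The aggregation behind these pie charts reduces to a groupby; this sketch uses illustrative rows and the binary `Class` success column assumed throughout:

```python
import pandas as pd

# Toy launch records; the dashboard itself is built with Plotly Dash, but the
# per-site shares it visualizes come from simple groupby reductions like these.
df = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC-40", "CCAFS SLC-40", "KSC LC-39A", "KSC LC-39A"],
    "Class": [1, 0, 1, 1],  # 1 = successful landing
})

# Each site's share of all successful launches (the overall pie chart).
share = df[df["Class"] == 1].groupby("LaunchSite").size() / df["Class"].sum()

# Per-site success rate (the pie shown after a dropdown selection).
rate = df.groupby("LaunchSite")["Class"].mean()
```

The distinction between the two reductions matters on the next slides: a site can hold the largest share of successes while still having a low success rate.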

13
Predictive Analysis (Classification)

• Target data (class/success) was assigned to Y, and the remaining data fit to X; both variables were split into training and testing sets
• Models were trained and evaluated using GridSearch and the fit and score functions defined in the supporting libraries
• Confusion matrices show false-negative and false-positive results
• Flowchart: Import Libraries → Load Data → Preprocess Data → Split Data → Train Model → Evaluate Model → Optimize Model → Save Model
• GitHub URL of the completed predictive analysis lab included as an external reference and for peer review
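The train/tune/evaluate flow above can be sketched on synthetic data; the notebook runs the same pattern across logistic regression, SVM, KNN, and decision trees:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the launch features and binary landing target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split Data: hold out a test set for the final score.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Train + Optimize: GridSearchCV tunes hyperparameters via cross-validation.
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate: .score reports accuracy on the held-out test data.
accuracy = grid.score(X_test, y_test)
```

Swapping `LogisticRegression` for `SVC`, `KNeighborsClassifier`, or `DecisionTreeClassifier` (with their own parameter grids) reproduces the full model comparison.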

14
Results

• EDA
○ Year is most significantly related to success
○ No strong correlation between orbit type and success
○ Some correlation between launch site and success rate
■ However, more launches occurred at CCAFS SLC-40 early on, when failure was more common
■ The same location has more successes than other locations later on
• Interactive analytics demo in screenshots on later slides
• Predictive analysis results
○ Models showing ~83% accuracy on test data

15
Section 2
Flight Number vs. Launch Site

• Early reliance on CCAFS SLC-40
○ Apparently due to early experience rather than location (see later successes)
• More success over time is apparent at all sites

17
Payload vs. Launch Site

• Greater success rate at higher payloads
• Most failures are at lower payloads at CCAFS SLC-40
○ Early tests, or indicative of a future trend?

18
Success Rate vs. Orbit Type

• Clear illustration of orbit success rates
• Further analysis needed to determine if orbit is the primary factor

19
Flight Number vs. Orbit Type

• LEO, MEO: low success rate in the bar chart due to low launch counts
• GTO, PO, ISS: some recent failures

20
Payload vs. Orbit Type

• Some orbits with lower overall success rates are more successful at higher payloads

21
Launch Success Yearly Trend

• Most significant correlation
○ Significant increases in success over time
○ Needs updating with the last four years of data

22
All Launch Site Names

• ‘DISTINCT’ returns only unique occurrences

23
Launch Site Names Begin with 'CCA'

• ‘LIKE’ is a wildcard search tool that is extremely useful in cases where there is potential unknown variance or errors

24
Total Payload Mass

• Calculating the total payload carried by boosters for NASA
• A simple example of utilizing the SUM function

25
Average Payload Mass by F9 v1.1

• Calculation of the average payload mass carried by booster version F9 v1.1
○ There are potential interpretation differences here, as F9 v1.1 is also listed with B1011–B1017 suffixes

26
First Successful Ground Landing Date

• Use MIN for the earliest date, MAX for the latest date

27
Successful Drone Ship Landing with Payload between 4000 and 6000

• Utilizing ‘AND’ to filter for multiple conditions:
○ %sql SELECT Booster_Version, Landing_Outcome, PAYLOAD_MASS__KG_ FROM SPACEXTABLE WHERE Landing_Outcome is 'Success (drone ship)' AND PAYLOAD_MASS__KG_ > 4000 AND PAYLOAD_MASS__KG_ < 6000;

28
Total Number of Successful and Failure Mission Outcomes

• In this data all 98 missions are shown as successes (mission outcome, not landing outcome), leading to a potentially confusing query result, as failures are not shown as ‘0’ despite being included in the query

29
Boosters Carried Maximum Payload

• A subquery is utilized to filter all rows with the maximum payload; all are shown to be F9 B5 boosters
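The subquery pattern can be sketched with SQLite and toy rows standing in for SPACEXTABLE:

```python
import sqlite3

# Toy table; the real query filters SPACEXTABLE on MAX(PAYLOAD_MASS__KG_).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE SPACEXTABLE (Booster_Version TEXT, PAYLOAD_MASS__KG_ INT)")
con.executemany("INSERT INTO SPACEXTABLE VALUES (?, ?)",
                [("F9 v1.1 B1011", 4428),
                 ("F9 B5 B1048", 15600),
                 ("F9 B5 B1051", 15600)])

# The inner SELECT computes the maximum once; the outer SELECT keeps every
# row that matches it, so ties are all returned.
boosters = [r[0] for r in con.execute("""
    SELECT Booster_Version FROM SPACEXTABLE
    WHERE PAYLOAD_MASS__KG_ =
          (SELECT MAX(PAYLOAD_MASS__KG_) FROM SPACEXTABLE)""")]
```

A bare `MAX()` in the outer query would collapse the result to one row; the subquery is what lets every tied booster appear.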

30
2015 Launch Records

• Query formatting was changed here to fit on the slide
• Substrings were used because month names are not supported in SQLite
• These boosters from CCAFS LC-40 were the only two drone ship landing failures in 2015

31
Rank Landing Outcomes Between 2010-06-04 and 2017-03-20

• Shown here is the ranked count of landing outcomes (such as Failure (drone ship) or Success (ground pad)) between 2010-06-04 and 2017-03-20, in descending order
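The ranking query can be sketched the same way with SQLite; the rows here are illustrative:

```python
import sqlite3

# Toy stand-in for SPACEXTABLE with a few landing outcomes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE SPACEXTABLE (Date TEXT, Landing_Outcome TEXT)")
con.executemany("INSERT INTO SPACEXTABLE VALUES (?, ?)",
                [("2012-05-22", "No attempt"),
                 ("2015-01-10", "Failure (drone ship)"),
                 ("2015-04-14", "Failure (drone ship)"),
                 ("2016-04-08", "Success (drone ship)")])

# Count each outcome inside the date window, most frequent first.
ranked = con.execute("""
    SELECT Landing_Outcome, COUNT(*) AS n
    FROM SPACEXTABLE
    WHERE Date BETWEEN '2010-06-04' AND '2017-03-20'
    GROUP BY Landing_Outcome
    ORDER BY n DESC""").fetchall()
```

`BETWEEN` works directly on the text dates because, as above, they are stored in ISO-8601 order.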

32
Section 3
Launch Locations

• Launch locations marked and labeled; Florida locations overlap due to proximity

34
Landing result marker clusters

• Marker Cluster elements show launch outcomes decluttered for visibility (green = successful landings, red = failed landings)
• Lat/long details are also integrated into the map to follow the user’s mouse

35
KSC-LC proximity calculations
• Kennedy Space Center Launch Complex 39A and nearest landmarks
○ Nearest city: Titusville, 16.3 km
○ Nearest ‘highway’: Kennedy Parkway North, 0.85 km
○ Nearest railway: 5.52 km
○ Nearest coastline: 6.78 km
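These distances come down to great-circle geometry; a haversine helper reproduces the Titusville figure from approximate coordinates (both coordinate pairs below are approximations, not values taken from the notebook):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius ≈ 6371 km

ksc_lc39a = (28.5733, -80.6469)   # approximate pad coordinates
titusville = (28.6123, -80.8076)  # approximate city center
distance = haversine_km(*ksc_lc39a, *titusville)  # roughly 16 km
```

The same helper, pointed at coastline, railway, and highway coordinates clicked on the Folium map, yields the other figures above.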

36
Section 4
Share of Successful Launches by Site

• The current view shows almost half of all successful landings come from CCAFS SLC-40
• When a specific launch site is chosen from the dropdown, its share of successes vs. failures is shown

38
CCAFS SLC-40 has highest success ratio

• Although CCAFS SLC-40 has almost half of all total successful landings, it still has only a 42.9% success rate

39
Payload vs. Success Rate (all sites)

• This slider view shows that, in the 1000–6000 kg range across all sites, the FT booster shows significantly higher success rates than the v1.1 booster
Return to slide 17
40
Section 5
Classification Accuracy
An example of the .score method

• All models had the same accuracy on the test data according to the score method, likely due to sample size

42
Classification Accuracy
Code used to cross-validate

• Despite all models performing the same with the .score method, cross-validation predicts SVM will be the most accurate model

43
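The cross-validation comparison might be sketched as follows on synthetic data; `SVC` stands in for the tuned SVM model, and the other three models would be scored the same way:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the launch features and binary landing target.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Five held-out-fold accuracies; the mean is a steadier estimate than a
# single .score call on one small test split.
scores = cross_val_score(SVC(), X, y, cv=5)
mean_accuracy = scores.mean()
```

With a test set this small, single-split `.score` values quantize to the same few numbers, which is why cross-validation is needed to separate the models at all.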
Confusion Matrix
• Confusion matrices shown for the Decision Tree model and for all other models
• All models performed approximately the same, with 15 accurate classifications on the test data and 3 inaccurate
• The decision tree model differed only by having one false negative rather than three false positives
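Reading false positives and negatives off a confusion matrix can be sketched with toy predictions that mirror the slide's pattern (15 correct, 3 false positives):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 18 test samples, of which three true failures are predicted
# as successes (false positives), matching the "all other models" matrix.
y_true = [0] * 9 + [1] * 9
y_pred = [0] * 6 + [1] * 3 + [1] * 9

cm = confusion_matrix(y_true, y_pred)
# Rows are the actual class, columns the predicted class.
tn, fp, fn, tp = cm.ravel()
```

For landing prediction, false positives are the costly cell: predicting a successful landing (and a cheap reflight) when the booster is actually lost.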
44
Conclusions

• The dataset is far too small to reliably discern between machine learning models
• Most of the statistical significance here seems to boil down to “over time SpaceX has improved its processes and has more successes”
• Further analysis is needed with more recent data

45
Appendix

• GitHub repository of IBM course labs
○ https://github.com/tjcolla/ibm_course/tree/main
• As a note, I use the free software Obsidian to take notes on the code used in the course, enabling markdown and inserting Python into the notes via
```python
Example code
```

46
