Data Strategy Nov 6
B.RAMAMURTHY
New kinds of data from different sources (see p. 23 of the Data Science
book): tweets, geolocation, emails, blogs
Two major types: structured and unstructured data
Structured data: data collected and stored according to a
well-defined schema, e.g. real-time stock quotes
Unstructured data: messages from social media, news, talks,
books, letters, manuscripts, court documents, etc.
“Regardless of their differences, they work in tandem in any
effective big data operation. Companies wishing to make the most
of their data should use tools that utilize the benefits of both.”
We will discuss methods for analyzing both structured and
unstructured data
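As a rough illustration of the two types, the sketch below keeps a few stock quotes in a data frame with a fixed schema and a few free-text messages in a character vector. The symbols, prices, and messages are made-up values, not data from the course.

```r
# Structured data: every row follows the same schema (symbol, price, time).
# All values here are hypothetical.
quotes <- data.frame(
  symbol = c("AAA", "BBB", "AAA"),
  price  = c(101.5, 20.25, 102.0),
  time   = c("09:30", "09:30", "09:31"),
  stringsAsFactors = FALSE
)
# A schema makes direct queries easy, e.g. the average price per symbol.
avg_price <- tapply(quotes$price, quotes$symbol, mean)

# Unstructured data: free text with no predefined fields (hypothetical messages).
messages <- c("Great talk on big data today", "Big data needs big storage")
# Some structure has to be derived first, e.g. by splitting into lowercase words.
words <- unlist(strsplit(tolower(messages), "[^a-z]+"))
word_counts <- sort(table(words), decreasing = TRUE)

print(avg_price)
print(head(word_counts, 3))
```

The structured part can be queried directly by column name, while the text needs a preprocessing step (here a simple word split) before anything can be counted.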
[Figure: bar chart, "Top ten largest databases (2007)"; y-axis: terabytes (0 to 7000); x-axis: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
[Figure: bar chart, "Top ten largest databases (2007)", with Facebook added; y-axis: terabytes (0 to 6000); x-axis: LOC, CIA, Amazon, YouTube, ChoicePt, Sprint, Google, AT&T, NERSC, Climate, Facebook]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
Data integration
Metadata
Data modeling
Organizational roles and responsibilities
Performance and metrics
Security and privacy
Structured data management
Unstructured data management
Business intelligence
Data analysis and visualization
Tapping into social data
This course will provide training in emerging technologies, tools, environments
and APIs available for developing and implementing one or more of these
components.
              Q1      Q2      Q3      Q4      Q5      Total
Mean          16.7    13.9     9.6    18.5    13.7    72.4
Median        20.0    16.0     9.0    19.0    17.0    76.0
Mode          20.0    20.0    15.0    25.0    20.0    90.0

              Q1      Q2      Q3      Q4      Q5      Total
Mean ver1     16.0    14.2     9.6    19.4    14.0    73.2
Percent       80.1%   71.1%   64.0%   77.4%   70.2%   73.2%

              Q1      Q2      Q3      Q4      Q5      Total
Mean ver2     17.3    13.6     9.7    17.6    13.3    71.5
Percent       86.7%   67.8%   64.6%   70.3%   66.7%   71.5%
Question 1..5, total, mean, median, mode; mean ver1, mean ver2
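The summary statistics named above can be computed in R roughly as follows. This is a minimal sketch on hypothetical scores for a single question worth 20 points. The score values and variable names are assumptions for illustration only.

```r
# Hypothetical scores for one exam question worth 20 points (not course data).
q1 <- c(20, 18, 20, 12, 15, 20, 16)
max_points <- 20

q1_mean   <- mean(q1)
q1_median <- median(q1)

# Base R has no built-in statistical mode (mode() reports the storage type),
# so count each value and keep the most frequent one.
counts  <- table(q1)
q1_mode <- as.numeric(names(counts)[which.max(counts)])

# Mean as a percentage of the maximum, as in the percentage rows of the tables.
q1_pct <- round(100 * q1_mean / max_points, 1)

cat("mean:", round(q1_mean, 1), "median:", q1_median,
    "mode:", q1_mode, "percent of max:", q1_pct, "\n")
```

If several values tie for the highest count, which.max() returns only the first of them, so this sketch reports a single mode.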
Rich's Data Analytics Training 11 10/29/2022
Traditional approach 2: points vs #students
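# Read a CSV file chosen interactively; it is expected to have a "midterm" column
data2 <- read.csv(file.choose())
exam1 <- data2$midterm
# Histogram of the midterm scores (points vs. number of students)
hist(exam1, col = rainbow(8))
# One box plot per column of the data frame
boxplot(data2, col = rainbow(6))
boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "sienna"))
# $stats holds the whisker ends, hinges, and median of each box
fn <- boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "pink"))$stats
# Add horizontal grid lines only
grid(nx = NA, ny = NULL)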
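Because file.choose() needs an interactive session, the sketch below uses a small in-memory data frame to show what the $stats matrix of a box plot contains. The column names and scores are hypothetical stand-ins for the CSV file.

```r
# Hypothetical exam data standing in for the CSV file (not course data).
scores <- data.frame(
  midterm = c(55, 62, 70, 71, 75, 80, 88, 93),
  final   = c(50, 65, 68, 74, 79, 81, 85, 97)
)

# plot = FALSE returns the box plot statistics without drawing anything.
bp <- boxplot(scores, plot = FALSE)

# Rows of $stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# Columns follow the columns of the data frame.
stats <- bp$stats
colnames(stats) <- bp$names
print(stats)
```

Row 3 of the matrix is the median of each column, which is the line drawn inside each box.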
Projects:
R-Project
Code/Program Data
R Basics, fundamentals
The R language
Working with data
Statistics with R language
R syntax
R Control structures
R Objects
R formulas
Install and use packages
Quick overview and tutorial
Questions?
Regions of RStudio: (i) console, (ii) data, (iii) script, (iv) plots and
packages
Primary feature: a project is a collection of files (data, graphs, R scripts);
let's create a new project
R supports basic arithmetic (+, -, *, /) and variables
Vectors: a collection of elements of the same type; a very important data
structure
Creating a vector; changing a vector; factoring a vector
x<- c(1,4,9,19)
Calling a function: mean(x)
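The vector operations listed above might look like this in practice. It is a minimal sketch: the variable names other than x and the grade values are assumptions, and "factoring" is taken here to mean R's factor() function.

```r
x <- c(1, 4, 9, 19)        # create a numeric vector
total <- sum(x)            # 33
avg   <- mean(x)           # 8.25

x[2] <- 5                  # change one element by position
x <- c(x, 25)              # append an element
doubled <- x * 2           # arithmetic applies element by element

# Mixing types coerces every element to one common type (here character).
mixed <- c(1, "a", TRUE)

# A factor stores a vector as categories (levels); the grades are hypothetical.
grades <- factor(c("B", "A", "C", "A"))
print(levels(grades))      # "A" "B" "C"
print(table(grades))
```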
Missing data: NA (not available), NULL (absence of anything)
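z <- c(8, NA, 19)      # NA keeps its place, so z has 3 elements
znew <- na.omit(z)     # drops the NA, leaving 8 and 19
z <- c(8, NULL, 18)    # NULL is dropped by c(), so z has 2 elements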
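The difference between NA and NULL shows up in the vector length and in how functions treat the result. The sketch below is illustrative only. The names z_na, z_null, and the other result variables are assumptions, not names from the slides.

```r
z_na   <- c(8, NA, 19)     # NA is a placeholder and keeps its position
z_null <- c(8, NULL, 18)   # NULL is dropped when the vector is built

len_na   <- length(z_na)   # 3
len_null <- length(z_null) # 2

mean_raw   <- mean(z_na)               # NA: one missing value makes the result missing
mean_clean <- mean(z_na, na.rm = TRUE) # 13.5: NA values are skipped
kept <- as.numeric(na.omit(z_na))      # 8 19

print(is.na(z_na))                     # FALSE TRUE FALSE
```

Many summary functions accept na.rm = TRUE, which avoids building a cleaned copy with na.omit() first.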