
DCCS208(02) Korea University 2019 Fall

Introduction
to Big Data
Chapter 1 & 2 (Week 1)
Course overview & introduction
Asst. Prof. Minseok Seo
mins@korea.ac.kr
Course Overview
Introduction to Big Data 01
Contents

1. Course Overview
 Brief introduction of professor & course
 Object & Aim of the course
 Assignments & Quiz
 Evaluation

2. Introduction to Big Data


 Definition of Big Data
 Key techniques in Data Science
 Core technology of Informatics
Course information

Introduction to Big Data, DCCS208(02), Fall 2019.

 Lecture time: Wed. (6,7) and Thu. (6)

 Location: Wed. (7-310) and Thu. (7-315)

 Course type: Major elective

 Level: Junior / Senior

copyrightⓒ 2018 All rights reserved by Korea University 4 / 20


Definition of Big Data (Cont.)

[Figure: an elephant vs. a rat]

Which is bigger, an elephant or a rat?



Definition of Big Data (Cont.)

 What is Data?

Columns: Attributes (Dimensions; Features; Variables)
Rows: Objects (Samples; Individuals)

ID          Height   Weight   Age
Student 1   189 cm   81 kg    24
Student 2   210 cm   90 kg    26
Student 3   191 cm   92 kg    27
…           …        …        …
Student N   162 cm   71 kg    21
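
A minimal sketch of the same idea in Python (one of the two languages these slides name for Big Data analysis; the course itself uses R). The values come from the example table; the field names and the helper `shape` are illustrative assumptions. Each row is one object and each column is one attribute, so a dataset has a sample size n and a dimensionality p.

```python
# Rows are objects (samples); columns are attributes (features).
# Values are taken from the example table (height in cm, weight in kg).
students = [
    {"id": "Student 1", "height_cm": 189, "weight_kg": 81, "age": 24},
    {"id": "Student 2", "height_cm": 210, "weight_kg": 90, "age": 26},
    {"id": "Student 3", "height_cm": 191, "weight_kg": 92, "age": 27},
]


def shape(rows):
    """Return (n_objects, n_attributes); the ID column is not counted."""
    n = len(rows)
    p = len(rows[0]) - 1 if rows else 0
    return n, p


n, p = shape(students)
print(f"{n} objects x {p} attributes")
```

Here n counts the students listed and p counts the measured attributes (height, weight, age).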



Definition of Big Data (Cont.)

 In a narrow sense, Big Data refers only to a large sample size.

 In a broad sense, Big Data refers to both a large sample size and high dimensionality.



Definition of Big Data (Cont.)

 3V’s (Volume, Velocity, and Variety)



Definition of Big Data (Cont.)

 5V’s (Volume, Velocity, Variety, Veracity, and Value)

 Volume: Data size
 Velocity: Data production speed
 Variety: Data originating from various sources
 Veracity: Data accuracy (trustworthiness)
 Value: Data value



Relationship between Big-data & Data Science

 The amount of data and information is not directly correlated with knowledge generation.

 However, the demand for data scientists will keep growing.



Job market of Big data

Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham

It is time to prepare an academic course that trains data analysts in numbers commensurate with demand.



Object & Aim of the course

 Students who take this course can expect to learn:

Concept of Big Data
Basic skills in Data Science
Computational approaches for Big Data
Statistical approaches for Big Data
R programming
Visualization for Big Data



Course schedule (Before Mid-term exam)

Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, computer software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam



Course schedule (After Mid-term exam)

Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam



Two types of lectures per week

Day    Duration  Type
Wed.   2 hrs     Theory lecture
Thu.   1 hr      Hands-on lecture

The methods covered in the theory class will be practiced in the computer lab on Thursday.

 There are two representative computer languages for Big Data analysis: R and Python.

 R will be used in this class.

 No prior knowledge of the R language is required, because I plan to provide example code for students to practice with.

https://cran.r-project.org/



Exam, Quiz, and Homework

Midterm and Final exams


 There will be two exams.

 The exams will test your understanding of basic computational and statistical algorithms.

Quiz
 There will be two simple in-class quizzes to check students' progress in the course (one before and one after the midterm).

Homework
 There will be four assignments.

 Each will be a report on the theory and practice of data analysis learned in class.



Evaluation plan

[Pie chart: Midterm, Final, Quiz, Assignment, Attendance. Slice labels as extracted: 10%, 30%, 20%, 10%, 30%]

 Absolute grading system


Score ≥ 95, you will get A+
Score ≥ 90, you will get A
Score ≥ 85, you will get B+
and...



Textbook

 No Textbook

 This course will proceed based on the presentation slides.

 I will upload the presentation slides to Blackboard and my homepage.


Homepage: https://scholar.harvard.edu/msseo
Teaching >> Introduction to Big Data >> Related Materials

 Reference 1 (Kor. Version)


R for Practical Data Analysis
(online textbook and free)
http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
 Reference 2 (Eng. Version)
Introduction to Data Science by Rafael A. Irizarry, 2019.
(online textbook and free)
https://rafalab.github.io/dsbook/
 Reference 3 (Eng. Version)
R for Data Science by Hadley Wickham and Garrett Grolemund.
(online textbook and free)
https://r4ds.had.co.nz/



Contact information

 Prof. Minseok Seo


Location: 7-203
Tel: 044-860-1379
Email: mins@korea.ac.kr

 TA. Heechan Chae


Location: 7-328
Email: chay219@korea.ac.kr

 If you have any questions about the course please email me and I will reply as
soon as I see it.

 If you need to meet in person, please make an appointment by email first.

 Office hours: Mon 12:00 - 17:00 | Wed 10:00 - 13:00 | Thu 10:00 - 13:00.



End of
Orientation
Contents

1. Course Overview
 Brief introduction of professor & course
 Object & Aim of the course
 Assignments & Quiz
 Evaluation

2. Introduction to Big Data


 Concept of Big Data
 Key techniques in Data Science for Big data
Characteristics of Big Data
Recap: the concept of Big Data

 5V’s (Volume, Velocity, Variety, Veracity, and Value)

 Volume: Data size
 Velocity: Data production speed
 Variety: Data originating from various sources
 Veracity: Data accuracy (trustworthiness)
 Value: Data value



Petabyte era

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)

1000 PB = 1 exabyte (EB)
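
The conversions above can be checked with a short sketch in Python. It assumes decimal (SI) units, where each step is a factor of 1,000, as on the slide; the table `UNITS` and the function `convert` are illustrative names, not part of the course material.

```python
# Decimal (SI) data-size units: each step up is a factor of 1,000.
UNITS = {
    "B": 1,
    "kB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(value, from_unit, to_unit):
    """Convert a data size from one decimal unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "PB", "TB"))     # petabytes expressed in terabytes
print(convert(1000, "PB", "EB"))  # petabytes expressed in exabytes
```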

 [Company logo] transferred about 197 PB of data through its network each day (2018)

 [Company logo] processed about 24 petabytes daily (2009)

In fact, we can say that we have already entered the exabyte era.



How do you recognize if it's big data or not?

Computer Scientist

"My computer is low on memory for handling this data!!"
That is Big Data.

"No!!!! This data is over 2 TB. Where do I store it?????"
That is Big Data.

In short, if you are having so much trouble processing data on your computer that you start to panic, it is probably because of Big Data.
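
A rough way to anticipate the "low on memory" problem is to estimate the size of a numeric table before loading it. The sketch below is in Python and assumes every value is stored as an 8-byte double-precision number (R's numeric type also uses 8-byte doubles). The 16 GB RAM figure and the function names are hypothetical, chosen only for illustration.

```python
BYTES_PER_DOUBLE = 8  # IEEE 754 double precision


def matrix_bytes(n_rows, n_cols, bytes_per_value=BYTES_PER_DOUBLE):
    """Estimate the raw size in bytes of an n_rows x n_cols numeric table."""
    return n_rows * n_cols * bytes_per_value


def fits_in_memory(n_rows, n_cols, ram_bytes):
    """Return True if the raw table would fit in the given amount of RAM."""
    return matrix_bytes(n_rows, n_cols) <= ram_bytes


ram = 16 * 10**9  # hypothetical machine with 16 GB of RAM

# One million samples x 100 attributes: 8 * 10^8 bytes (0.8 GB).
print(fits_in_memory(1_000_000, 100, ram))

# One billion samples x 100 attributes: 8 * 10^11 bytes (0.8 TB).
print(fits_in_memory(1_000_000_000, 100, ram))
```

The estimate ignores overhead from the language runtime and from intermediate copies, so real memory use is typically higher.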



How do you recognize if it's big data or not?

Statistician

"When does this calculation end? I have only been waiting for 10 years ..."
That is Big Data.

"Dimensionality is too high!!!! I can’t build a statistical model using this data!!!"
That is Big Data.

In short, if you are having so much trouble analyzing data on your computer that you start to panic, it is probably because of Big Data.



Core technologies of Big Data era
IT technologies for resolving issues arising from Big Data

Software Hardware

Prescreening techniques

Data Visualization

Feature selection

Parallel processing

Cloud computing

Distributed processing

Difficulties arise in both hardware and software.

However, the software difficulties are ones that students can tackle.



Computational language for Big Data
R and Python


 There are two representative computer languages for Big Data analysis: R and Python.

 We will use the R programming language (free and relatively easy to learn) for the hands-on lectures.

 Let’s visit the R homepage:

https://cran.r-project.org/



Install R
(Step 1) Download the R installer



Install R
(Step 2) Download RStudio

 Download RStudio from https://www.rstudio.com/products/rstudio/download/



Install R
(Step 3) Install R and RStudio



What is R
 R is an interpreted computer language.

 For efficiency, R can interface with procedures written in C, C++, and other languages.

 System commands can be called from within R.

 R is used for data manipulation, statistics, and graphics.



R, S, and S-plus (History of R)
 S: an interactive environment for data analysis, developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers

 Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: “S-plus”. Implementation languages: C, Fortran.

 R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.

 Since 1997: an international “R-core” team of ca. 15 people with access to a common CVS archive.



What R does and does not do
 Possible
(1) Data handling and storage: numeric, textual
(2) Matrix algebra
(3) Hash tables and regular expressions
(4) High-level data analytic and statistical functions
(5) OOP (classes)
(6) Graphics
(7) Programming language: loops, branching, subroutines, etc.

 Not possible
(1) R is not a database, but it connects to DBMSs
(2) R has no GUI, but it connects to Java and Tcl/Tk
(3) R's interpreter can be very slow, but R allows calling your own C/C++ code
(4) R has no spreadsheet view of data, but it connects to Excel/MS Office
(5) R has no professional or commercial support

 But all R users around the world are potential developers (the power of collective intelligence).

 If you write a useful package, you can publish it almost immediately.

 As a result, the latest algorithms often become available in R sooner than in other programming languages.





End of Slide
