
Big Data Processing

Jiaul Paik
Lecture 1
Today’s topics

• Course Information and Logistics

• Introduction to Big Data Processing


Course Information
Teacher
• Jiaul Paik

• Email:
  • jiaul@cet.iitkgp.ac.in
  • jia.paik@gmail.com
Prerequisites (Must)

• Knowledge of Data Structures

• Knowledge of Algorithm Design

• Programming
  • Python is highly recommended

• All work will be done on a Linux system


Evaluation Policy

Type                     Number of times
Written Test             2
Programming Assignment   6
Major Topics of the Course
• Fundamentals of Hadoop
• Dealing with distributed data storage
• MapReduce programming with Hadoop
• Functional programming: Python & Scala
• Spark
  • Basics
  • Streaming data
  • Relational data
  • Graph data
• High-level language: Pig Latin
• Apache HBase
Programming Assignments
• Objectives
  • Become familiar with the basics of big data processing technologies
  • Gain experience with algorithm/program design for big data

• Submission
  • Through Moodle (link will be provided)
• Typical deadline
  • 10-15 days (depending on the complexity of the assignment)
Important Notes
Course Content
• This is a general-purpose practical course for anyone who wants to do something with ‘big data’

• The techniques you learn can be applied to any form of data that is ‘big’

• It requires a new kind of programming
  • NOT difficult, but a new programming style

• Thus, hands-on programming experience with modern big data systems is absolutely essential
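To give a flavour of the "new programming style", here is a minimal sketch of a word count written as separate map, shuffle, and reduce steps in plain Python. It is only an illustration and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made up for this example, not part of any framework's API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data is big", "data processing"]  # hypothetical input
    print(word_count(sample))
    # prints {'big': 2, 'data': 2, 'is': 1, 'processing': 1}
```

The point of the style is that `map_phase` and `reduce_phase` look at one line or one key at a time and share no state, which is what lets a cluster run many copies of them in parallel on different pieces of the data.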
Attendance
• There were 800 applications for this course

• I have selected one quarter of them, and you are one of them

Attendance is MANDATORY
Main Flavour of the Course
• This is a programming-heavy course

• It needs advanced programming skills

If you do not have good programming skills and knowledge of algorithm design, you will struggle
Assignments
• A very important part of the course (after all, this is a practical subject)

• All submissions will be through Moodle

• No deadline extensions for assignments

• Assignments are to be solved individually (no group work)

• Evaluation:
  • If your program does not run correctly, you will get ZERO credit (no excuses, please!)
What can you expect from this course?
1. Limitations of classical data processing systems

2. Basics of cluster computing for big data

3. Distributed storage for big data

4. Functional programming with Python and Scala

5. Hadoop internals and applications

6. Programming with Hadoop MapReduce

7. Spark internals and programming

8. Large-scale machine learning with Spark


Books
• Mining of Massive Datasets: Rajaraman and Ullman

• Data-Intensive Text Processing with MapReduce: Lin and Dyer

• Learning Spark: Karau, Konwinski, Wendell, and Zaharia

• Spark: The Definitive Guide: Chambers and Zaharia

• MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems: Miner and Shook

• Hadoop: The Definitive Guide: White
