LAB 4 - Data Preparation & Pre-Processing Practice - Part 3

Course: MACHINE LEARNING & APPLICATIONS
Instructor: Dr. Võ Thị Hồng Thắm
LAB 4
Building on your Lab 3 results, add the following items, adapting the hints to the dataset
you have chosen:
Discretize the data using binning / classification / clustering
Convert categorical data to numeric data
Data reduction: choose a suitable method (Wavelet/PCA/…) and experiment with it; survey
recent modern methods and experiment with them
Feature extraction: find out which field your problem (on the chosen data) belongs to, how
related works approach similar problems, especially the most recent ones, and which
advanced feature extraction and selection methods exist; then experiment with them
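As a starting point for the discretization task, the sketch below contrasts equal-width and equal-frequency binning with pandas. The `price` column and its values are made-up placeholders, not taken from any dataset in this handout; substitute a numeric column from your own data.

```python
import pandas as pd

# Hypothetical skewed numeric column; replace with a column from your own data.
df = pd.DataFrame({"price": [1, 2, 3, 4, 5, 6, 7, 50, 100]})
labels = ["low", "mid", "high"]

# Equal-width binning: 3 intervals of the same width over the value range.
df["price_width"] = pd.cut(df["price"], bins=3, labels=labels)

# Equal-frequency binning: 3 intervals holding about the same number of rows.
df["price_freq"] = pd.qcut(df["price"], q=3, labels=labels)

print(df)
print(df["price_width"].value_counts())
print(df["price_freq"].value_counts())
```

On skewed data like this, equal-width bins put most rows into a single bin, while equal-frequency bins stay balanced. Which one fits depends on the column's distribution.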
Save your work as a single file STT.ipynb (covering all datasets) and submit it within 2
days. STT is your student ordinal number; see the Zalo group.
--END OF LAB 4--
Reference datasets:
1| Common Crawl Corpus
Common Crawl is a corpus of web crawl data composed of over 25 billion web pages. For all
crawls since 2013, the data has been stored in the WARC file format and also contains metadata
(WAT) and text data (WET) extracts. The dataset can be used in natural language processing
(NLP) projects.
Get the data here.
2| Google Books Ngrams

Google Books Ngrams is a dataset containing the Google Books n-gram corpora. N-grams are
fixed-size tuples of items; in this dataset, the items are words extracted from the Google Books
corpus. The size of the dataset is 2.2 TB.
Get the data here.

3| Hourly Weather Surface – Brazil (Southeast region)
The Hourly Weather Surface – Brazil (Southeast region) dataset covers hourly weather data from
122 weather stations in the southeast region of Brazil. The dataset is 2 GB in size and records 17
climate parameters (continuous values) per station, including instant air temperature, relative
humidity of the air, instant dew point, and solar radiation, among others.
Get the data here.
4| Hotel Booking Demand

The Hotel Booking Demand dataset contains booking information for a city hotel and a resort
hotel. It includes information such as booking time, length of stay, number of adults,
children/babies, number of available parking spaces, among other things. This dataset is ideal for
anyone looking to practice their exploratory data analysis (EDA) or get started in building
predictive models.
Get the data here.
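For the categorical-to-numeric task, the hotel type (city hotel or resort hotel) is a natural categorical column. The sketch below shows two common encodings with pandas. The column names `hotel` and `adults` and all rows are illustrative assumptions, not real records from the dataset.

```python
import pandas as pd

# Placeholder rows; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "adults": [2, 1, 2, 3],
})

# Integer (label) encoding: each category is mapped to an integer code.
df["hotel_code"] = df["hotel"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["hotel"], prefix="hotel", dtype=int)
encoded = pd.concat([df, one_hot], axis=1)

print(encoded)
```

Integer codes impose an artificial order on the categories, so one-hot encoding is usually the safer choice for nominal columns fed to distance-based or linear models.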
5| Iris Species
The Iris Species dataset is the Iris Plant Database, which contains three classes of 50 instances
each, where each class refers to a type of iris plant. One class is linearly separable from the
other two, and the latter are not linearly separable from each other. The columns of this dataset
include Id, SepalLengthCm, PetalLengthCm, etc.
Get the data here.
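Because Iris is small and fully numeric, it is a convenient place to try the data-reduction task. The sketch below applies PCA with scikit-learn. It assumes the copy of Iris bundled with scikit-learn (which has the four measurements but no Id column) rather than the file behind the link above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scikit-learn ships its own copy of Iris: 150 rows, 4 numeric features.
X, y = load_iris(return_X_y=True)

# Standardize first so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports how much of the total variance each retained component keeps, which helps when deciding how many components to retain.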
6| New York City Airbnb Open Data

The New York City Airbnb Open Data is a public dataset that originates from Airbnb. It includes
the information needed to learn about hosts and geographical availability, along with the
metrics needed to make predictions and draw conclusions. The dataset describes the listing
activity and metrics in NYC, NY, for 2019.
Get the data here.
7| Slogan Dataset
The Slogan dataset can be used to analyse slogans of various organisations. It includes a list of
slogans in the form of company_name, company_slogan. The data has been acquired from
slogan-list.com, which contains more than 1000 pairs of “company, slogan” spread across 10+
categories.
Get the data here.
8| Taxi Trajectory Data

The Taxi Trajectory dataset provides a complete year (from 01/07/2013 to 30/06/2014) of the
trajectories for all the 442 taxis running in the city of Porto, Portugal. Each ride is assigned to
one of three categories: taxi central based, stand-based, or non-taxi central based. Each data
sample corresponds to one completed trip and contains a total of nine features.
Get the data here.
9| Temperature Readings: IoT Devices

The Temperature Readings: IoT Devices dataset contains temperature readings from IoT
devices installed outside and inside an anonymous room. The size of the data is 7 MB, and it
has 5 columns with 97605 rows. The dataset can be used for time-series analysis projects.
Get the data here.
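A temperature series is a reasonable candidate for wavelet-based data reduction. The sketch below implements a plain Haar transform in NumPy and keeps only the approximation coefficients. The series is synthetic (a sine wave plus noise), not real readings, and the helper names `haar_step` and `haar_reduce` are hypothetical; a dedicated wavelet library would be the usual choice in practice.

```python
import numpy as np

def haar_step(signal):
    """One level of the Haar wavelet transform (even-length input)."""
    pairs = signal.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)  # smoothed trend
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # local differences
    return approx, detail

def haar_reduce(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps."""
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx

# Synthetic stand-in for a temperature series (not real readings).
rng = np.random.default_rng(0)
t = np.arange(64)
temps = 25 + 3 * np.sin(2 * np.pi * t / 64) + rng.normal(0, 0.2, size=64)

reduced = haar_reduce(temps, levels=2)
print(len(temps), "->", len(reduced))
```

Each level halves the number of values, so the input length must be divisible by `2 ** levels`. Discarding the detail coefficients loses the fine-grained fluctuations and keeps the slower trend.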
10| Trending YouTube Video Statistics

The Trending YouTube Video Statistics dataset is a daily record of statistics for trending
YouTube videos, collected using the YouTube API. It includes several months (and counting) of
data, with up to 200 listed trending videos per day. Each region’s data is in a separate file. Data
includes the video title, channel title, publish time, tags, views, likes and dislikes, description,
and comment count.
Get the data here.
