Unit 3


Reading in data locally and from the web

 Reading data is the gateway for any data analysis.


 Data can be read from a local device or from the web.
 In R, “reading” or “loading” is the process of converting data (stored as plain text, a
database, HTML, etc.) into an object (e.g., a data frame) that R can work with.
 There are many ways to store data as well as many ways to read them.
 Different functions are available in R to import data from various file formats.
 While loading a data set into R, we need to tell R where those files live. The file could live
on your computer (local) or somewhere on the internet (remote).
 The place where the file lives on your computer is called the “path.”
 There are two kinds of paths: relative paths and absolute paths.
 A relative path gives the file's location with respect to the current working directory
(where you currently are on the computer).
 An absolute path gives the file's location with respect to the root (top-level) folder of
the computer’s file system.
 As per the figure,
o We are working in a file named worksheet_02.ipynb .
o If we want to read the .csv file named happiness_report.csv into R, we could do this
using either a relative or an absolute path.

Reading happiness_report.csv using a relative path


library(tidyverse)  # provides read_csv()
happy_data <- read_csv("data/happiness_report.csv")

Reading happiness_report.csv using an absolute path


happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")

 For remote files, a Uniform Resource Locator (URL), i.e. a web address, indicates the
location of the file or resource.
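
The difference between the two kinds of path can be checked from R itself. The following is a minimal sketch using base R only; the temporary folder and the file contents are invented for the example, and only the file name happiness_report.csv comes from the notes above.

```r
# Work inside a throwaway project folder so the example is self-contained.
project_dir <- file.path(tempdir(), "worksheet_02")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
old_wd <- setwd(project_dir)   # relative paths are resolved from the working directory

# Create a tiny stand-in for the real data file (contents are invented).
writeLines(c("country,score", "A,7.1", "B,6.4"), "data/happiness_report.csv")

relative_path <- "data/happiness_report.csv"
absolute_path <- normalizePath(relative_path)   # expands to the full path from the root

getwd()                       # the folder that relative paths start from
file.exists(relative_path)    # depends on the current working directory
file.exists(absolute_path)    # does not depend on the working directory

setwd(old_wd)                 # restore the original working directory
file.exists(absolute_path)    # the absolute path still points at the same file
```

A relative path keeps a project portable (it works wherever the project folder is copied), while an absolute path only works on a machine with that exact folder layout.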
Reading tabular data from a plain text file into R
 Use read_csv() to read in comma-separated values (.csv) files.
data <- read_csv("data/xyz.csv")

Here the data file is named “xyz.csv” and is stored in the “data” folder.

 Use read_tsv() to read in tab-separated values (.tsv) files.


data <- read_tsv("data/xyz.tsv")
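
To see the two functions side by side, here is a small sketch that assumes the readr package (part of the tidyverse) is installed. The file contents are invented and written to temporary files so the example does not depend on a data folder; read_delim() is included to show the delimiter being stated explicitly.

```r
library(readr)   # part of the tidyverse; provides read_csv(), read_tsv(), read_delim()

# Write two small files with invented contents so the example runs anywhere.
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "A,1.5", "B,2.5"), csv_file)
writeLines(c("name\tscore", "A\t1.5", "B\t2.5"), tsv_file)

csv_data <- read_csv(csv_file, show_col_types = FALSE)   # comma-separated
tsv_data <- read_tsv(tsv_file, show_col_types = FALSE)   # tab-separated

# read_delim() is the general form: the delimiter is given explicitly.
same_tsv <- read_delim(tsv_file, delim = "\t", show_col_types = FALSE)

csv_data
```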

Reading tabular data directly from a URL


 The read_csv(), read_tsv(), and read_delim() functions can also read data directly from
a Uniform Resource Locator (URL) that points to tabular data.
url <- "https://xxx.com/data/xyz.csv"
data <- read_csv(url)

Reading tabular data from a Microsoft Excel file


library(readxl)  # read_excel() comes from the readxl package
data <- read_excel("data/xyz.xlsx")
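
An Excel workbook can hold several sheets, so it is often useful to list them before reading one. The sketch below assumes the readxl package is installed and uses the sample workbook that ships with it (via readxl_example()) instead of the placeholder file above.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook installed with the package.
path <- readxl_example("datasets.xlsx")

sheets <- excel_sheets(path)                   # names of all sheets in the workbook
cars   <- read_excel(path, sheet = "mtcars")   # read one sheet by name

sheets
head(cars)
```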

Reading data from a database


 A relational database is a common form of data storage for large data sets or for
projects with multiple users.
 There are many relational database management systems, such as SQLite, MySQL,
PostgreSQL, Oracle, and many more.
 Reading data from a SQLite database
o A SQLite database is self-contained and usually stored and accessed locally.
o The data is usually stored in a file with a .db extension.
o To read data into R from a database, we first need to connect to the database.
o The dbConnect() function from the DBI (database interface) package is used to
connect to the database.
library(DBI)
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/xyz.db")
o Relational databases may have many tables. In order to retrieve data from a
database, we need to know the name of the table in which the data is stored.
o We can get the names of all the tables in the database using
the dbListTables function:
tables <- dbListTables(conn_lang_data)
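
The whole sequence (connect, list tables, read a table, disconnect) can be sketched as follows. This assumes the DBI and RSQLite packages are installed; the database file, the table name "lang", and its contents are all invented so the example is self-contained.

```r
library(DBI)

# Create a throwaway SQLite database file (hypothetical table "lang", invented data).
db_file <- tempfile(fileext = ".db")
conn <- dbConnect(RSQLite::SQLite(), db_file)

dbWriteTable(conn, "lang", data.frame(language = c("English", "French"),
                                      speakers = c(100, 50)))

tables    <- dbListTables(conn)               # names of the tables in the database
lang_data <- dbReadTable(conn, tables[1])     # read a whole table into a data frame

# A SQL query retrieves only the rows and columns that are needed.
big <- dbGetQuery(conn, "SELECT language FROM lang WHERE speakers > 60")

dbDisconnect(conn)                            # close the connection when finished
```

Querying with SQL instead of reading the whole table matters for large databases, where only a small part of the data may fit comfortably in memory.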

Obtaining data from the web using an API


 Data stored as plain text, spreadsheets, or comma- or tab-separated files can be accessed
from a web URL using one of the read_* functions from the tidyverse.
 Many websites now also provide an Application Programming Interface (API), which offers a
programmatic way to read a data set.
 This allows the website owner to control who has access to the data, what portion of the
data they have access to, and how much data they can access.
 We can also collect data programmatically, in the form of Hypertext Markup Language
(HTML) and Cascading Style Sheets (CSS) code, and process it to extract useful
information.
 HTML provides the basic structure of a site and CSS helps style the content.
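
As an illustration of processing HTML to extract information, here is a minimal sketch. It assumes the rvest package, which these notes do not name, and it parses a small invented page stored as a string so that no network access is needed.

```r
library(rvest)   # assumption: rvest is one common choice; the notes do not name a package

# A small invented page: HTML gives the structure, the CSS class marks the prices.
html <- '<html><head><style>.price { color: green; }</style></head>
  <body>
    <h1>Listings</h1>
    <p class="price">$800</p>
    <p class="price">$1,200</p>
    <p class="note">Prices are monthly.</p>
  </body></html>'

page   <- read_html(html)                    # parse the HTML structure
nodes  <- html_elements(page, "p.price")     # CSS selector: <p> tags with class "price"
prices <- html_text2(nodes)                  # extract the text inside those tags

prices
```

The same CSS selectors that style a page are what let us pick out the pieces of it we want.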
What is Tidy Data?
 In a data science project, tidying the data is a necessary step after importing it, so that
the data can be analyzed and the results communicated.

 Tidy datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
o Structure is the form and shape of the data. In statistics, most datasets are rectangular
data tables (data frames) made up of rows and columns.
o Semantics is the meaning of the dataset. A dataset is a collection of values,
either quantitative or qualitative. These values are organized in two ways:
by variable and by observation.
 Variables: all values that measure the same underlying attribute across units
 Observations: all values measured on the same unit across attributes
o The three rules of tidy data help simplify the concept and make it more intuitive:
 Each variable is a column
 Each observation is a row
 Each type of observational unit is a table

Messy Data
 Messy data is any kind of data that does not follow the above framework.
 To narrow it down, Hadley Wickham's paper “Tidy Data” lists five common problems of messy data:
o Column headers are values, not variable names.
o Multiple variables are stored in one column.
o Variables are stored in both rows and columns.
o Multiple types of observational units are stored in the same table.
o A single observational unit is stored in multiple tables.
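
Two of these problems can be illustrated with a short sketch. It assumes the tidyr package (part of the tidyverse) is installed; both small tables and their column names are invented for the example.

```r
library(tidyr)

# Problem: column headers are values. Here 2019 and 2020 are values of a "year" variable.
messy <- data.frame(country = c("A", "B"),
                    `2019` = c(10, 20),
                    `2020` = c(11, 22),
                    check.names = FALSE)

# pivot_longer() moves the header values into a column: one row per observation.
tidy <- pivot_longer(messy, cols = c(`2019`, `2020`),
                     names_to = "year", values_to = "value")

# Problem: multiple variables in one column. "group" packs sex and age range together.
packed   <- data.frame(group = c("m_0-14", "f_15-24"), cases = c(3, 5))
unpacked <- separate(packed, group, into = c("sex", "age"), sep = "_")

tidy
unpacked
```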

Why is Tidy Data important?


 If the data set is in a standardized framework, we spend less time on data cleaning and
wrangling and more time on solving the problem at hand.
 It is good practice to keep the data in a format that makes the analysis reproducible and easy for
others to understand.
 A more technical reason is that tidy data complements the tools R provides. Since R works
with vectors of values (R functions are vectorized by nature), we are able to apply those tools
naturally to tidy data.
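
A small base R sketch of what "vectorized" means for a tidy table, where each column is one variable stored as a vector. The data and column names are invented for the example.

```r
# Invented tidy data: one row per observation, one column per variable.
people <- data.frame(
  name   = c("A", "B", "C"),
  height = c(1.60, 1.75, 1.80),   # metres
  weight = c(55, 70, 90)          # kilograms
)

# Arithmetic is applied element by element to whole columns, with no explicit loop.
people$bmi <- people$weight / people$height^2

# Summary and filtering operations also take a whole column (a vector) as input.
mean(people$height)
people[people$bmi > 25, "name"]
```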
