ME204 - 💻 Lab 01 - Recap of Base R and Tidyverse Fundamentals


ME204: Data Engineering for the Social World


Data science offers exciting possibilities for social scientists, but it's often underestimated how much time is spent on data cleaning and pre-processing.
This course focuses on essential data handling skills, enabling you to extract insights from your data before applying complex algorithms.
By the end, you'll create visual dashboards that showcase your data-wrangling abilities and emphasize the importance of data cleaning in data science.

💻 Lab 01 - Recap of base R and tidyverse fundamentals

📋 LAB DIFFICULTY: 😁 "EASY" (assumes just basic experience with R)


🥅 Learning Objectives
Refresh your R skills
Compare and contrast base R and tidyverse solutions to the same problem
Practice loading, manipulating, and visualizing data using R and tidyverse
📋 Lab Tasks
In our first lab, you will work through practical exercises on loading, manipulating, and visualizing data using R and tidyverse. We’ll dive into a real dataset
called Tesco Grocery 1.0, curated by researchers from Nokia Bell Labs, King’s College London, the University of Turin, and Tesco Labs (Aiello et al. 2020).
You can find a detailed description of this dataset in the 📖 Data Dictionary: Tesco Grocery 1.0 webpage.
Now, let’s get started!
Part 1: ⚙️ Setup (15 minutes)
🎯 ACTION POINT
(Whenever you come across this text ‘🎯 ACTION POINT’, it means you have a set of tasks to complete)
Before you dive into coding, take a moment to complete the following steps:
1. Give a warm hello to your instructor! And don’t forget to high-five the two classmates sitting closest to you! 🙌
Yes, we’re serious about this one!
2. Ensure that R is installed on your computer.
3. If you haven’t already, we also suggest using an integrated development environment (IDE) like RStudio, which is available for free download.
4. Open RStudio and create two new R scripts.
Save the first script as lab01.R
Save the second as lab01-tidyverse.R.
Start writing your code in the first script. Then, after you have completed the base R exercises, copy and paste your code into the second script and modify it to
use tidyverse functions instead.
5. Head to the dataset page and download the file named Dec_lsoa_grocery.csv.
6. Save the file in the same folder as your R script.
Part 2: Let’s view our data! (20 minutes)
👩🏻‍🏫 TEACHING MOMENT
(Whenever you come across this text ‘👩🏻‍🏫 TEACHING MOMENT’, it means your instructor deserves your full attention)
Your instructor will load the dataset into R and name it df. She will run View(df) so you can all explore the dataset’s structure and variables together.
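A minimal sketch of the loading step, in both styles. So that the snippet runs anywhere, it first writes a tiny made-up CSV to a temporary path; the values and the choice of columns are invented for illustration, and toy_path is a hypothetical name. In the lab you would pass "Dec_lsoa_grocery.csv" instead, assuming the file sits in your working directory.

```r
# Toy stand-in for Dec_lsoa_grocery.csv (values are made up for illustration)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "area_id,population,alcohol,sugar",
  "E01004735,1500,0.30,9.5",
  "E01000001,1800,0.25,10.1"
), toy_path)

# Base R: read.csv() returns a data.frame
df <- read.csv(toy_path)

# tidyverse: readr::read_csv() returns a tibble
df_tidy <- readr::read_csv(toy_path, show_col_types = FALSE)

# Inspect the structure from the console;
# in RStudio, View(df) opens the spreadsheet-style viewer instead
str(df)
head(df_tidy)
```

Both calls read the same file; the main visible difference is the class of the result (a data.frame versus a tibble), which affects how the object prints.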
Your instructor will filter df to show only the row(s) corresponding to the region of London we are currently in. The LSOA code for the Aldwych area surrounding
LSE is E01004735.
She will show the base R and the tidyverse ways of doing this.
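The two approaches to row filtering might look like the sketch below. The data frame here is a toy stand-in with invented population values; only the area_id column name and the LSOA code E01004735 come from the lab text.

```r
library(dplyr)

# Toy data frame standing in for df (population values are made up)
df <- data.frame(
  area_id    = c("E01000001", "E01004735", "E01000002"),
  population = c(1800, 1500, 2100)
)

# Base R: logical indexing on rows; the empty column slot keeps every column
aldwych_base <- df[df$area_id == "E01004735", ]

# tidyverse: dplyr::filter() keeps the rows where the condition is TRUE
aldwych_tidy <- df %>% filter(area_id == "E01004735")

print(aldwych_base)
print(aldwych_tidy)
```

Both return the same single row. One small difference worth noticing: base R keeps the original row name, while filter() renumbers the rows.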
Your instructor will open the Open Geography portal, made available by the Office for National Statistics of the UK, to show you how you can highlight a region
on the map by its LSOA code. Keep a tab open on this page, as we will use it later in the lab.
🗣️ CLASSROOM-WIDE DISCUSSION: Why do you think the authors gave us the dataset in this format instead of, say, simply a list of all the products
purchased by customers in that area?
Part 3: Now you are the data analyst! (55 minutes)
Try to complete the action points below using base R first (type your solutions in lab01.R). After you’ve finished, convert your results to tidyverse (type them in
lab01-tidyverse.R). If you get stuck, ask your instructor for help.

Feel free to 👥 pair up with a classmate to work on the exercises together.


🎯 ACTION POINT
1. Subset the dataset to keep only the following columns:
The identifier column (area_id)
Columns with demographic data (population, age, area, etc.)
Columns that represent the average consumption of nutrients (check data dictionary for examples) across all LSOA regions – ignore the columns with
suffixes.
2. Identify the three regions with the highest average alcohol consumption and print them out. Also, determine the three regions with the lowest average
alcohol consumption. Repeat the process for sugar consumption.
Can you also find out where these regions are located?
3. Calculate the average and standard deviation of the population sizes across all LSOA regions. Save the results in a single data frame. Print out the data frame.
4. Choose two nutrients (carbs, sugar, fat, saturated fat, protein, or fibre) and create a scatterplot to visualize their relationship. What observations can you make?
Please note that for the base R solution, you should not use the ggplot2 package. You should use the plot() function instead.
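If you get stuck, the sketch below shows the general shape of each step on a toy data frame, side by side in base R and dplyr. It is not a model answer: the values are invented, and the column names (alcohol, sugar, fat, alcohol_std) are assumptions, so check the data dictionary for the real ones.

```r
library(dplyr)

# Toy data; values are made up and column names are assumptions
df <- data.frame(
  area_id     = c("A", "B", "C", "D", "E"),
  population  = c(1000, 1500, 2000, 2500, 3000),
  alcohol     = c(0.10, 0.50, 0.30, 0.20, 0.40),
  sugar       = c(9, 12, 10, 11, 8),
  fat         = c(8.5, 9.5, 9.0, 9.2, 8.0),
  alcohol_std = c(0.01, 0.02, 0.03, 0.02, 0.01)  # a suffixed column to drop
)

# 1. Keep only the columns of interest
keep    <- c("area_id", "population", "alcohol", "sugar", "fat")
df_base <- df[, keep]
df_tidy <- df %>% select(all_of(keep))

# 2. Three highest and three lowest regions by alcohol
top3_base    <- head(df_base[order(df_base$alcohol, decreasing = TRUE), ], 3)
bottom3_base <- head(df_base[order(df_base$alcohol), ], 3)
top3_tidy    <- df_tidy %>% arrange(desc(alcohol)) %>% head(3)
bottom3_tidy <- df_tidy %>% arrange(alcohol) %>% head(3)

# 3. Mean and standard deviation of population, in a single data frame
pop_base <- data.frame(mean_pop = mean(df_base$population),
                       sd_pop   = sd(df_base$population))
pop_tidy <- df_tidy %>%
  summarise(mean_pop = mean(population), sd_pop = sd(population))

# 4. Base R scatterplot of two nutrients (no ggplot2)
plot(df_base$sugar, df_base$fat,
     xlab = "sugar", ylab = "fat", main = "Sugar vs fat (toy data)")

print(top3_tidy)
print(pop_tidy)
```

For the sugar part of step 2, swapping alcohol for sugar in the ordering calls is enough. Note that both order() and arrange() sort ascending by default, which is why the "lowest" versions need no extra argument.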
👩🏻‍🏫 TEACHING MOMENT
Just before you wrap up, your instructor will assess everyone’s progress with base R and tidyverse. Make sure to jot down any areas that are still unclear to you after
the lab, as you’ll have the opportunity to discuss them in tomorrow’s lecture.
Additionally, she might ask you to complete a brief poll to gauge how easily you were able to produce the base R and tidyverse solutions.

References
Aiello, Luca Maria, Daniele Quercia, Rossano Schifanella, and Lucia Del Prete. 2020. “Tesco Grocery 1.0, a Large-Scale Dataset of Grocery Purchases in London.”
Scientific Data 7 (1): 57. https://doi.org/10.1038/s41597-020-0397-7.

Last modified: Monday, 10 July 2023, 7:42 PM
