
DATASETS JOIN

By: Sumurduc Teodora


Contents

I. Introduction
   A. General description of the project
   B. The purpose and objectives of the project
II. Configuration and Setup
   A. Environment Setup
   B. Library Installation
III. Downloading and Preparing Datasets
   A. Download Datasets
   B. Dataset Preparation
IV. Usage Instructions
   A. Running the Script or Application
   B. Interpreting the Results
V. File Structure and Main Modules
   A. File Structure
   B. Main Modules
VI. Functionality of Important Functions
VII. Identifying and Managing Common Data
VIII. Details of Dataset Reading, Filtering, and Processing
   A. Dataset Reading
   B. Data Filtering
   C. Data Processing
IX. Identifying and Managing Common Data
   A. Identification Process
   B. Eliminating Duplicate Entries
X. Writing Common Data to New File
   A. Writing Process
XI. Results
   A. Presentation of the obtained results
   B. Performance Evaluation
XII. The questions I asked myself
XIII. References
I. Introduction

A. General description of the project

This project involves the integration of three datasets obtained from different sources,
namely Facebook, Google, and company websites. The datasets contain information about
various companies and share common columns such as Category, Address (including
country and region), Phone, and Company Name.

B. The purpose and objectives of the project

The primary objective of this project is to create a consolidated dataset that combines
information from the Facebook, Google, and Website datasets. By merging these datasets,
the aim is to enhance the accuracy and completeness of the data, particularly in the
common columns of interest.

II. Configuration and Setup

A. Environment Setup

Ensure you have a Python environment installed on your system. You can download
Python from the official website (https://www.python.org/) and follow the installation
instructions for your operating system.

B. Library Installation

No third-party libraries are required: the script uses only the csv module, which is part
of the Python standard library. Should you extend the project with external packages, they
can be installed with pip, the package manager for Python, by running a command of the form:

pip install <package-name>
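
To confirm that the csv module is available in your environment, a quick check can be run
from the command line:

python -c "import csv; print('csv module is available')"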

III. Downloading and Preparing Datasets

A. Download Datasets

• Download the datasets required for the project from the provided link: Dataset Archive.
• Extract the contents of the archive to a location on your local machine.
B. Dataset Preparation

• Ensure that the dataset files are in CSV format and have the following filenames:
   o facebook_dataset.csv
   o google_dataset.csv
   o website_dataset.csv
• Place these dataset files in the same directory as your Python script or application.

IV. Usage Instructions

A. Running the Script or Application

• Open your preferred Python environment or code editor.
• Open the Python script file associated with the project or create a new one.
• Copy and paste the provided code into your Python script file.
• Save the script file.
• Run the script file using the Python interpreter. You can do this by navigating to the
directory containing the script file in your command-line interface and running the
command:

python your_script_file.py

• Alternatively, if you are using an integrated development environment (IDE) such as
PyCharm or Visual Studio Code, you can run the script directly from the IDE.

B. Interpreting the Results

• After running the script, the merged dataset containing the integrated information
from the Facebook, Google, and Website datasets will be generated.
• You can analyze the merged dataset to gain insights into the combined company
data, including categories, addresses, phone numbers, and company names.

Note
Ensure that you have the necessary permissions to access and modify files in the
directories where the datasets and script files are located. Additionally, make sure that
your Python environment is properly configured to run the script, and install any missing
dependencies as needed.

V. File Structure and Main Modules


A. File Structure

The project directory contains the following files:

• „facebook_dataset.csv”: Dataset file containing Facebook company data.
• „google_dataset.csv”: Dataset file containing Google company data.
• „website_dataset.csv”: Dataset file containing website company data.
• „merge_datasets.py”: Python script for merging and processing datasets.
• „common_dataset.csv”: Output file containing the merged dataset.

B. Main Modules

• „merge_datasets.py”: This is the main Python script responsible for merging the
datasets, filtering the data, and identifying common records. It contains functions for
reading CSV files, filtering columns, and processing data:

import csv

# Define the function to filter the desired columns
def filter_columns(header, rows, columns):
    # Locate the indices of the desired columns in the header
    domain_index = header.index(columns[0])
    name_index = header.index(columns[1])
    category_index = header.index(columns[2])
    city_index = header.index(columns[3])
    country_index = header.index(columns[4])
    region_index = header.index(columns[5])
    phone_index = header.index(columns[6])

    # Extract the desired columns from each row and return them
    return [[row[domain_index], row[name_index], row[category_index], row[city_index],
             row[country_index], row[region_index], row[phone_index]] for row in rows]

# Read data from CSV files
def read_csv_file(filename):
    with open(filename, 'r', newline='', encoding='utf-8') as File:
        # The website dataset is semicolon-delimited; the other files use commas
        if filename == 'website_dataset.csv':
            fileReader = csv.reader(File, delimiter=';')
            header = next(fileReader)
            header[0] = 'root_domain'  # normalize the name of the first column
        else:
            fileReader = csv.reader(File, delimiter=',')
            header = next(fileReader)

        # Read the data from the file and add it to a list
        cnt = 0
        rows = []
        for row in fileReader:
            # Two malformed records in google_dataset.csv are replaced with
            # hand-corrected lines (split(',') is a rough fix that ignores quoting)
            if cnt == 8820 and filename == 'google_dataset.csv':
                a = '"Ufimskoye Shosse, 4, Ufa, Republic of Bashkortostan, Russia, 450000","",ufa,ru,russia,"Makheyev\"",+78003333693,ru,"Ufa, Russia",+7 800 333-36-93,ba,bashkortostan republic,"Food store Ufa, Russia Permanently closed · +7 800 333-36-93",450000,maheev.ru'
                row = a.split(',')
            if cnt == 190542 and filename == 'google_dataset.csv':
                a = '"Ulitsa Zarubina, 150, Saratov, Saratov Oblast, Russia, 410005","",saratov,ru,russia,"Firmennyy Ofis \"Angstrem\",",+78452291768,ru,"",+7 845 229-17-68,sar,saratovskaya oblast,"Saratov, Russia · In Vash Dom · +7 845 229-17-68 Closed ⋅ Opens 10AM Delivery",410005,angstrem-mebel.ru'
                row = a.split(',')
            rows.append(row)
            cnt = cnt + 1

    return header, rows

# Define a function to filter and organize the data
def filter_data(rows):
    data = {}
    for row in rows:
        key = row[0]  # assuming the first element in each row is the key (e.g., the domain)
        data[key] = row[1:]  # store the rest of the elements as the value associated with the key
    return data

# Define the columns for each dataset
columns_website = ['root_domain', 'legal_name', 's_category', 'main_city', 'main_country',
                   'main_region', 'phone']
columns_google = ['domain', 'name', 'category', 'city', 'country_name', 'region_name', 'phone']
columns_facebook = ['domain', 'name', 'categories', 'city', 'country_name', 'region_name', 'phone']

# Read and filter data for each dataset
header_website, rows_website = read_csv_file('website_dataset.csv')
header_google, rows_google = read_csv_file('google_dataset.csv')
header_facebook, rows_facebook = read_csv_file('facebook_dataset.csv')

filtered_website = filter_columns(header_website, rows_website, columns_website)
filtered_google = filter_columns(header_google, rows_google, columns_google)
filtered_facebook = filter_columns(header_facebook, rows_facebook, columns_facebook)

# Organize filtered data into dictionaries
data_website = filter_data(filtered_website)
data_google = filter_data(filtered_google)
data_facebook = filter_data(filtered_facebook)

# Identify common domains
common_domains = set(data_website.keys()).intersection(set(data_google.keys())).intersection(set(data_facebook.keys()))

# Write common data to a new CSV file
with open('common_dataset.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(columns_website)  # write the header (the seven filtered column names)
    for domain in common_domains:
        writer.writerow([domain] + data_website[domain])  # add data associated with the domain
VI. Functionality of Important Functions

1. filter_columns(header, rows, columns):

• Description: This function filters the columns of the dataset based on the specified
column names.
• Parameters:
   o header: The header of the dataset containing column names.
   o rows: The rows of the dataset containing the actual data.
   o columns: A list of column names to be filtered.
• Returns: A list of filtered rows with only the specified columns.
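
For illustration, here is a hypothetical call using the columns_facebook list defined in the
script (the header, rows, and values below are invented for the example):

# Toy data: one extra column ('page_type') that the filter should drop
header = ['domain', 'name', 'categories', 'city', 'country_name', 'region_name', 'phone', 'page_type']
rows = [['acme.com', 'Acme SRL', 'retail', 'cluj-napoca', 'romania', 'cluj', '+40123456789', 'local']]
result = filter_columns(header, rows, columns_facebook)
# result == [['acme.com', 'Acme SRL', 'retail', 'cluj-napoca', 'romania', 'cluj', '+40123456789']]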

2. read_csv_file(filename):

• Description: This function reads a CSV file and extracts the header and rows of
data. It also handles special cases such as modifying the header for the website
dataset.
• Parameters:
   o filename: The name of the CSV file to be read.
• Returns: A tuple containing the header and rows of data extracted from the CSV file.
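
A typical call, assuming the dataset files sit in the working directory:

# header is a list of column names; rows is a list of records (lists of strings)
header_facebook, rows_facebook = read_csv_file('facebook_dataset.csv')
print(len(header_facebook), 'columns,', len(rows_facebook), 'rows')
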
3. filter_data(rows):

• Description: This function organizes the data rows into a dictionary structure where the
key is a unique identifier (e.g., domain) and the value is a list of attributes associated
with that identifier.
• Parameters:
   o rows: The rows of data to be filtered.
• Returns: A dictionary containing the filtered data.
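
For example, with a single already-filtered row (toy values):

# filter_data keys the data on the first element of each row (the domain)
filtered = [['acme.com', 'Acme SRL', 'retail', 'cluj-napoca', 'romania', 'cluj', '+40123456789']]
data = filter_data(filtered)
# data == {'acme.com': ['Acme SRL', 'retail', 'cluj-napoca', 'romania', 'cluj', '+40123456789']}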

VII. Identifying and Managing Common Data

• The script uses sets to identify common keys (e.g., domains) across the datasets
efficiently.
• Common domains are identified by taking the intersection of sets created from the
keys of each dataset.
• Duplicate domains are eliminated using the set data structure, which automatically
removes duplicates.
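
A small sketch of this technique with invented keys:

# The intersection keeps only the keys present in all three sets
keys_website = {'acme.com', 'example.com', 'shop.ro'}
keys_google = {'acme.com', 'shop.ro', 'other.com'}
keys_facebook = {'acme.com', 'shop.ro'}
common = keys_website & keys_google & keys_facebook
# common == {'acme.com', 'shop.ro'}
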
VIII. Details of Dataset Reading, Filtering, and Processing

A. Dataset Reading:

The script reads each dataset file („facebook_dataset.csv”, „google_dataset.csv”,
„website_dataset.csv”) using the „read_csv_file()” function, which extracts the header and
rows of data from the CSV files.

B. Data Filtering:

After reading each dataset, the script filters the relevant columns using the
„filter_columns()” function to extract only the columns of interest (e.g., domain, name,
category, city, etc.).

C. Data Processing:

The filtered data from each dataset is then converted into a structured format using the
„filter_data()” function, which creates a dictionary where each key corresponds to a unique
identifier (e.g., domain) and the associated value is a list of attributes.
Common data is identified by finding the intersection of the keys across the datasets,
and duplicate entries are eliminated using set operations.

IX. Identifying and Managing Common Data

A. Identification Process:
To identify common data across the datasets, the script uses sets to efficiently compare
the keys (e.g., domains) from each dataset.
Sets are created from the keys of each dataset (e.g., „data_website.keys()”,
„data_google.keys()”, „data_facebook.keys()”).
The common keys are determined by taking the intersection of these sets, using the
intersection() method (equivalently, the & operator).
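
The script itself chains the intersection() method; written with the & operator, the same
computation reads:

# Equivalent to the chained intersection() calls in merge_datasets.py
common_domains = set(data_website.keys()) & set(data_google.keys()) & set(data_facebook.keys())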

B. Eliminating Duplicate Entries:

To ensure that the final dataset does not contain duplicate entries, the script utilizes set
data structures.
After identifying the common keys, they are stored in a set („common_domains”) to
automatically eliminate duplicates.
This set contains only unique keys that are common across all three datasets.
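
A one-line illustration of this property:

# Building a set from a list silently drops repeated values
unique_domains = set(['acme.com', 'acme.com', 'shop.ro'])
# unique_domains == {'acme.com', 'shop.ro'}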

X. Writing Common Data to New File

A. Writing Process

Once the common data (common domains) is identified, the script proceeds to write this
data to a new CSV file named „common_dataset.csv”.
The file is opened in write mode ('w') and a „csv.writer” object is used to write data in
CSV format.
The header for the new CSV file is written first using the „writer.writerow()” function.
Then, for each domain in the set of common domains, the associated data from the
website dataset („data_website”) is written to the CSV file using „writer.writerow()”.
This process ensures that only data for domains present in all three datasets is written to
the output file; the attribute values themselves come from the website dataset.
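
As a quick sanity check (not part of the script itself), the output file can be read back to
count how many common domains were written:

import csv

# Count the data rows in the generated file (header excluded)
with open('common_dataset.csv', 'r', newline='', encoding='utf-8') as f:
    row_count = sum(1 for _ in csv.reader(f)) - 1
print(row_count, 'common domains written')
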
XI. Results

A. Presentation of the obtained results


B. Performance Evaluation

The script processes large datasets efficiently, handling hundreds of thousands of
records in a single pass over each file.
The identification and management of common data are also fast: dictionary and set
lookups take constant time on average, so computing the intersection of the three key
sets scales roughly linearly with the number of records.
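
If you want to measure this on your own machine, a simple timing sketch using the
standard library (assuming the definitions from „merge_datasets.py” are in scope) could
look like this:

import time

# Time how long it takes to load one of the datasets
start = time.perf_counter()
header_google, rows_google = read_csv_file('google_dataset.csv')
print(f'google_dataset.csv read in {time.perf_counter() - start:.2f} s')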

XII. The questions I asked myself

1. What are the common columns between all data sets?

The common columns are those containing the category, domain, city, country,
phone, name, and region. The data was filtered on these columns, which are also the
most important fields to check.

2. What is the delimiter that helps us divide each line into columns?

I first checked this in Excel. I saw that the file "website_dataset.csv" uses the
character ';', while the other files use a comma.
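
Alternatively, the delimiter can be detected programmatically with the standard library's
csv.Sniffer; a minimal sketch:

import csv

# Guess the dialect from the header line of the file
with open('website_dataset.csv', 'r', newline='', encoding='utf-8') as f:
    dialect = csv.Sniffer().sniff(f.readline())
print(dialect.delimiter)  # ';' for website_dataset.csv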

3. Which is the column that helps me filter the data best?

The column that helps me the most is the one containing the domain, which I used
as the key of the dictionary.

4. Which data structure rejects duplicates?

The data structure that keeps only the unique values is the set, and it helps me
reduce the complexity of the entire program.

XIII. References

• https://realpython.com/python-csv/
• https://www.digitalocean.com/community/tutorials/python-string-encode-decode
