Dataset Join
This project involves the integration of three datasets obtained from different sources,
namely Facebook, Google, and company websites. The datasets contain information about
various companies and share common columns such as Category, Address (including
country and region), Phone, and Company Names.
The primary objective of this project is to create a consolidated dataset that combines
information from the Facebook, Google, and Website datasets. By merging these datasets,
we aim to enhance the accuracy and completeness of the data, particularly in the
common columns of interest.
A. Environment Setup
Ensure you have a Python environment installed on your system. You can download
Python from the official website (https://www.python.org/) and follow the installation
instructions for your operating system.
B. Library Installation
The script relies only on Python's built-in csv module, which ships with the standard
library, so no third-party packages need to be installed. You can confirm that the module
is available by running the following command in your command-line interface:
python -c "import csv"
A. Download Datasets
Download the datasets required for the project from the provided link: Dataset
Archive.
Extract the contents of the archive to a location on your local machine.
B. Dataset Preparation
Ensure that the dataset files are in CSV format and have the following filenames:
o facebook_dataset.csv
o google_dataset.csv
o website_dataset.csv
Place these dataset files in the same directory as your Python script or application.
C. Running the Script
Run the main script from your command-line interface:
python merge_datasets.py
After running the script, the merged dataset containing the integrated information
from the Facebook, Google, and Website datasets will be generated as common_dataset.csv.
You can analyze the merged dataset to gain insights into the combined company
data, including categories, addresses, phone numbers, and company names.
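As a quick check, you can load the output and count the merged records. A minimal
sketch, assuming the common_dataset.csv output file produced by the script:

import csv

# Load the merged output produced by merge_datasets.py
with open('common_dataset.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)   # column names
    rows = list(reader)     # merged records

print(f"Columns: {header}")
print(f"{len(rows)} companies appear in all three sources")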
Note
Ensure that you have the necessary permissions to access and modify files in the
directories where the datasets and script files are located. Additionally, make sure that
your Python environment is properly configured to run the script and install any missing
dependencies as needed.
B. Main Modules
"merge_datasets.py": This is the main Python script responsible for merging the
datasets, filtering the data, and identifying common records. It contains functions for
reading CSV files, filtering columns, and processing data:
import csv

1. filter_columns(header, rows, columns):

def filter_columns(header, rows, columns):
    # Map each requested column name to its index in the header
    indices = [header.index(name) for name in columns]
    # Extract the desired columns from each row and return them
    return [[row[i] for i in indices] for row in rows]
Description: This function filters the columns of the dataset based on the specified
column names.
Parameters:
o header: The header of the dataset containing column names.
o rows: The rows of the dataset containing the actual data.
o columns: A list of column names to be filtered.
Returns: A list of filtered rows with only the specified columns.
2. read_csv_file(filename):
Description: This function reads a CSV file and extracts the header and rows of
data. It also handles special cases such as modifying the header for the website
dataset.
Parameters:
o filename: The name of the CSV file to be read.
Returns: A tuple containing the header and rows of data extracted from the CSV file.
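A minimal sketch of this function, assuming the delimiter behavior described in the
Results section (website_dataset.csv uses ';', the other files a comma); the header
modification for the website dataset mentioned above is not reproduced here:

def read_csv_file(filename):
    # website_dataset.csv is ';'-separated; the other files use commas
    delimiter = ';' if filename == 'website_dataset.csv' else ','
    with open(filename, newline='', encoding='utf-8', errors='replace') as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)  # first line holds the column names
        rows = list(reader)    # remaining lines hold the data
    return header, rows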
3. filter_data(rows):
Description: This function filters the data rows into a dictionary structure where the
key is a unique identifier (e.g., domain) and the value is a list of attributes associated
with that identifier.
Parameters:
o rows: The rows of data to be filtered.
Returns: A dictionary containing the filtered data.
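A sketch of this function, assuming the domain is the first column of each filtered
row (as in the filter_columns() output above):

def filter_data(rows):
    # Key each record by its domain; the remaining columns become
    # the list of attributes associated with that identifier
    data = {}
    for row in rows:
        domain, *attributes = row
        data[domain] = attributes
    return data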
The script uses sets to identify common keys (e.g., domains) across the datasets
efficiently.
Common domains are identified by taking the intersection of sets created from the
keys of each dataset.
Duplicate domains are eliminated using the set data structure, which automatically
removes duplicates.
VIII. Details of Dataset Reading, Filtering, and Processing
A. Dataset Reading:
Each of the three dataset files is read with the "read_csv_file()" function, which
returns the header and the data rows for that dataset.
B. Data Filtering:
After reading each dataset, the script filters the relevant columns using the
"filter_columns()" function to extract only the columns of interest (e.g., domain, name,
category, city, etc.), as in the example below.
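For example, reading and filtering the Facebook dataset might look like this; the
column list is the set of common columns named in the Results section:

header, rows = read_csv_file('facebook_dataset.csv')
columns = ['domain', 'name', 'category', 'city', 'country', 'region', 'phone']
facebook_rows = filter_columns(header, rows, columns)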
C. Data Processing:
The filtered data from each dataset is then converted into a structured format using the
"filter_data()" function, which creates a dictionary where each key corresponds to a unique
identifier (e.g., domain) and the associated value is a list of attributes.
Common data is identified by finding the intersection of the keys across the datasets,
and duplicate entries are eliminated using set operations.
A. Identification Process:
To identify common data across the datasets, the script uses sets to efficiently compare
the keys (e.g., domains) from each dataset.
Sets are created from the keys of each dataset (e.g., "data_website.keys()",
"data_google.keys()", "data_facebook.keys()").
The common keys are determined by taking the intersection of these sets using set
operations (& operator).
To ensure that the final dataset does not contain duplicate entries, the script utilizes set
data structures.
After identifying the common keys, they are stored in a set ("common_domains") to
automatically eliminate duplicates.
This set contains only unique keys that are common across all three datasets.
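Putting these steps together, the intersection can be computed in a single expression
using the variable names above:

# Keys present in all three dictionaries; the & operator intersects
# the sets and automatically discards duplicate entries
common_domains = (
    set(data_website.keys())
    & set(data_google.keys())
    & set(data_facebook.keys())
)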
A. Writing Process
Once the common data (common domains) is identified, the script proceeds to write this
data to a new CSV file named "common_dataset.csv".
The file is opened in write mode ('w') and a "csv.writer" object is used to write the
data in CSV format.
The header for the new CSV file is written first using the "writer.writerow()" function.
Then, for each domain in the set of common domains, the associated data from the
website dataset ("data_website") is written to the CSV file using "writer.writerow()".
This process ensures that only records whose domains appear in all three datasets are
written to the output file.
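A minimal sketch of this writing step, assuming the common columns listed in the
Results section and the common_domains and data_website variables defined above:

with open('common_dataset.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header row for the merged output file
    writer.writerow(['domain', 'name', 'category', 'city', 'country', 'region', 'phone'])
    for domain in common_domains:
        # Write the attributes stored for this domain in the website dataset
        writer.writerow([domain] + data_website[domain])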
XI. Results
1. Which columns are common across the datasets?
The common columns are: category, domain, city, country, phone, name, and region.
The data was filtered on these columns, as they are also the most important fields
to check.
2. What delimiter divides each line into columns?
I first checked this in Excel: the file "website_dataset.csv" uses the ';' character,
while the other files use a comma.
The column that helps the most is the one containing the domain, so I use it as the
key in the dictionary.
The data structure that keeps only unique values is the set, which helps reduce the
complexity of the entire program.