01Lb References and Articles
Database Systems
Administration and Management
(Foundations of transforming
and storing Big Data)
Algonquin College
20W_CST2200
Contents
Introduction
ETL Core Competencies
Creativity
Technology and tools
  Microsoft Excel
  Python
  PostgreSQL
  Geospatial related
  Talend Open Studio
General ETL & Data Science
Interesting sources of data
  Canadian Open Data (municipal and federal)
Useful Data Science and ETL tools
  R
  KNIME
  Microsoft VBA
  QGIS
  Windows Subsystem for Linux
Introduction
This document contains a collection of references and articles that may be helpful to your success in
CST2200. Content in this document is not examinable (will not appear on tests or exams) unless
specifically referred to in a slide or during a lecture.
As you work through these resources, be sure to keep notes on any issues you run into so we can talk
about them in class. Be sure to ask about anything here that sparks your curiosity.
BI analytics cannot be performed without data, but raw data is often next to useless. It needs
to be structured in such a way that it can be recognized and differentiated in order to be understood.
Only then can a business glean intelligence from its data.
At its root, the process of applying a controlled structure to data is known as ETL (extract, transform and
load). Having a solid foundation in ETL will significantly strengthen your analytical skills and abilities.
ETL Core Competencies
Modelling theory
- For those who are more tech-savvy, stronger skills here will allow you to differentiate yourself
from the growing crowd of ETL developers.
- Learning how to parameterize your ETL jobs can save time and reduce headaches. Using
parameters allows you to dynamically change certain aspects of your ETL job without altering
the job itself.
- There will be times when the ETL tools alone cannot do everything that is needed. Scripting
languages can aid with juggling files, directories, users, and permissions. Popular scripting
languages for ETL include Python, Perl, and Bash.
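The parameterization point above can be sketched in Python. This is a minimal illustration, not a prescribed course method: the function name run_etl, the column names, and the sample data are all hypothetical. The job logic stays fixed while the delimiter and the filter threshold are passed in as parameters.

```python
import csv
import io


def run_etl(source, target, delimiter=",", min_amount=0.0):
    """Extract rows from `source`, transform them, and load them into `target`.

    The delimiter and threshold are parameters, so the same job can be
    reused on different inputs without editing the job itself.
    """
    reader = csv.DictReader(source, delimiter=delimiter)
    writer = csv.DictWriter(target, fieldnames=["name", "amount"], lineterminator="\n")
    writer.writeheader()
    loaded = 0
    for row in reader:
        amount = float(row["amount"])  # transform: text -> number
        if amount >= min_amount:       # filter controlled by a parameter
            writer.writerow(
                {"name": row["name"].strip().title(), "amount": f"{amount:.2f}"}
            )
            loaded += 1
    return loaded


# Hypothetical sample data standing in for a real source file.
raw = "name;amount\n alice ;10.5\nbob;2\n"
out = io.StringIO()
count = run_etl(io.StringIO(raw), out, delimiter=";", min_amount=5)
print(count)
print(out.getvalue())
```

Running the same function with a different delimiter or min_amount changes what is loaded without touching the job's code, which is the benefit the bullet describes.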
Creativity
- The ability to transcend traditional ideas, rules, patterns, relationships, to create meaningful
new ideas, forms, methods, interpretations.1
- Strong creative skills can make ETL work much more satisfying, yet creativity is often overlooked as a
core competency. This is another area where you can differentiate yourself from the crowd of
ETL developers.
Continuous learning
- To keep your career moving, don’t forget to watch trends in the market. Knowing when to adopt
new methods must be balanced with resources such as time and money. Technology tends to be
much less frustrating when you understand the inner workings.
- Respecting data stewardship is another key competency. Being able to maintain data integrity
and security (from acquisition to dissemination) will require all of the above competencies.
- Consider ethical uses of data; lossy vs. lossless compression; data encryption; dropping
unneeded data; maintaining chains of ownership; laws governing retention and distribution of
data; international regulations about privacy and access to information; on-premises vs. cloud
storage. These (and more) all come into play in the design and implementation of any data
solution that feeds BI efforts.
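One common way to check that data has not changed between acquisition and dissemination is to compare checksums. The sketch below uses SHA-256 from Python's standard library. It covers only one narrow aspect of data integrity, and the helper name sha256_of and the sample payloads are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload standing in for an extracted file's contents.
acquired = b"id,name\n1,Alice\n"
checksum_at_acquisition = sha256_of(acquired)

# Later, before dissemination, recompute the checksum and compare.
delivered = b"id,name\n1,Alice\n"
tampered = b"id,name\n1,Alicia\n"

print(sha256_of(delivered) == checksum_at_acquisition)  # unchanged data matches
print(sha256_of(tampered) == checksum_at_acquisition)   # altered data does not
```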
1 https://www.dictionary.com/browse/creativity
Creativity
The ability to transcend traditional ideas, rules, patterns, relationships, or the like, and to create
meaningful new ideas, forms, methods, interpretations.2
Some of the messages in the following videos are subtle, but they all have a common element of
creativity.
2 https://www.dictionary.com/browse/creativity
Technology and tools
- Python: A programming language and environment that serves many functions, from scripting
(executing other programs in a controlled way) to full applications with user interfaces.
- Microsoft Excel: This is much more than a number cruncher. Of course, Excel can be used to
perform actual analytical work, but that’s for a different course. We’ll be using it to help us move
data through the transform stage. One-off transformations of fewer than 1 million rows work well in
Excel.
- UNIX (Ubuntu): Ubuntu is one of many offerings in the world of UNIX-like operating systems. We will
learn to use typical UNIX tools like sed, awk, tail and grep.
- PostgreSQL: Most of our database work will be completed with PostgreSQL.
- VMware: We will be leveraging virtual machines to install required software. This way we can
reduce the clutter on our host computers.
- Talend Open Studio for Data Integration: A free, open-source tool that simplifies the loading,
extraction, transformation and processing of large and diverse data sets.
- Data Visualization Tools: Tableau and Power BI will be used to validate our ETL efforts. We will
connect to data at various stages of ETL (raw, CSV, Excel, Database) to help us understand
limitations prevalent at each of those stages.
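The Excel bullet above suggests keeping one-off transformations under about 1 million rows. A modern Excel worksheet holds at most 1,048,576 rows, so a quick row count before opening a CSV file can save time. The following is a rough sketch; the function name fits_in_excel and the sample data are hypothetical.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in modern .xlsx workbooks


def fits_in_excel(csv_file, has_header=True):
    """Return (data_row_count, fits) for an open CSV file object."""
    rows = sum(1 for _ in csv.reader(csv_file))
    # The header occupies a worksheet row too, so compare the full count.
    data_rows = rows - 1 if has_header and rows else rows
    return data_rows, rows <= EXCEL_MAX_ROWS


# Hypothetical sample standing in for a real file on disk.
sample = io.StringIO("id,city\n1,Ottawa\n2,Toronto\n")
data_rows, fits = fits_in_excel(sample)
print(data_rows, fits)
```

For a real file, pass an object opened with open(path, newline="") so the csv module handles line endings itself.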
Microsoft Excel
500 Excel Formula Examples
https://exceljet.net/formulas
Intermediate level
Lynda – Excel Data Visualization Part 1: Mastering 20+ Charts and Graphs
https://www.lynda.com/Excel-tutorials/Excel-Data-Visualization-Part-1-Mastering-20-Charts-Graphs/791339-2.html?org=algonquincollege.com
Advanced
Python
Python Libraries that are interesting
SQLite (a single-user SQL database, great for embedding SQL into Python without a full RDBMS)
https://www.sqlite.org
http://www.sqlitetutorial.net/sqlite-python
https://sqlitebrowser.org
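As a minimal sketch of embedding SQL in Python without a full RDBMS, the example below uses the sqlite3 module from Python's standard library. The table name, column names, and values are hypothetical and serve only as illustration.

```python
import sqlite3

# ":memory:" creates a throwaway database; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (product TEXT PRIMARY KEY, quantity INTEGER)")

# Hypothetical rows standing in for transformed data that is ready to load.
rows = [("widget", 30), ("gadget", 5), ("bolt", 120)]
conn.executemany("INSERT INTO stock (product, quantity) VALUES (?, ?)", rows)
conn.commit()

# Parameter placeholders (?) keep values separate from the SQL text.
well_stocked = conn.execute(
    "SELECT product FROM stock WHERE quantity >= ? ORDER BY product", (30,)
).fetchall()
print(well_stocked)
conn.close()
```

The same connect, execute, and fetch pattern carries over to other databases through their own Python drivers, although connection details and placeholder syntax differ by driver.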
PostgreSQL
PostgreSQL primary website
https://www.postgresql.org/
PostgreSQL Tutorial
http://www.postgresqltutorial.com/
Learn PostgreSQL
https://www.tutorialspoint.com/postgresql/
DB Designer Fork
https://sourceforge.net/projects/dbdesigner-fork
System Architect
https://www.codebydesign.com
Geospatial related
Intro to Python GIS
https://automating-gis-processes.github.io/CSC18/course-info/Installing_Anacondas_GIS.html
https://automating-gis-processes.github.io/CSC18/lessons/L2/projections.html
Talend Open Studio
Components list
https://www.talendforge.org/components/index.php
Install walkthrough.
https://youtu.be/MtR-o0asWRU
R
R is a programming language for statistical computing. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and macOS. It is considered a
mainstream tool within data science communities.
R is flexible enough that you can find it being used as an ETL tool in addition to statistical modelling.
KNIME
KNIME is an open-source data analytics platform. It integrates components through a modular data
pipelining concept. Modeling, data analysis and visualization can be performed with little programming.
To some extent, KNIME can be considered a SAS alternative.
- KNIME: https://www.knime.com/
- Wikipedia KNIME: https://en.wikipedia.org/wiki/KNIME
Microsoft VBA
VBA (Visual Basic for Applications) is the programming language of Excel and other MS Office programs.
If, for example, you have tasks in Microsoft Excel that you do repeatedly, you can record a macro to
automate those tasks. These macros are written in VBA.
VBA code can link most of the MS Office suite. The most relevant for us are Excel and Access. VBA code
normally can only run within a host application, rather than as a standalone program. VBA can, however,
control one application from another using OLE Automation.
While VBA skills are not too useful when creating corporate ETL solutions, they can certainly make a
difference in your personal productivity (testing and preparing data).
Should you run out and become an expert VBA programmer? No, not really; Python and R skills are
more practical.
QGIS
QGIS is a free and open-source geographic information system application that supports viewing, editing, and
analysis of geospatial data.
QGIS integrates with other open-source GIS packages, including PostGIS, GRASS GIS, and MapServer.
Plugins can be written in Python or C++ to extend QGIS's capabilities. Plugins can geocode using the
Google Geocoding API, perform geoprocessing functions similar to those of the standard tools found in
ArcGIS, and interface with PostgreSQL/PostGIS, SpatiaLite and MySQL databases.
Windows Subsystem for Linux
The Windows Subsystem for Linux (WSL) is a new Windows 10 feature that enables you to run native
Linux command-line tools directly on Windows, alongside your traditional Windows desktop and
modern store apps.
This is primarily a tool for developers -- especially web developers and those who work on or with open
source projects. This allows those who want/need to use Bash, common Linux tools (sed, awk, etc.) and
many Linux-first tools (Ruby, Python, etc.) to use their toolchain on Windows.
WSL provides an application called Bash.exe that, when started, opens a Windows console running the
Bash shell. Using Bash, you can run command-line Linux tools and apps. For example, type lsb_release -a
and hit enter; you’ll see details of the Linux distro currently running.
https://docs.microsoft.com/en-us/windows/wsl/faq
https://docs.microsoft.com/en-us/learn/modules/get-started-with-windows-subsystem-for-linux/