
Data Science tools

Dr. M. Dhurgadevi
Associate Professor
Sri Krishna College of Technology
Coimbatore
Data science tools
Data science tools are used to dig into raw,
complex data (structured or unstructured)
and to process, extract, and analyze it in
order to uncover valuable insights. They
draw on techniques from statistics,
computer science, predictive modeling and
analysis, and deep learning.
Statistical analysis techniques
 Probability and Statistics
 Distribution
 Regression analysis
 Descriptive statistics
 Inferential statistics
 Non-Parametric statistics
 Hypothesis testing
 Linear Regression
 Logistic Regression
 Neural Networks
 K-Means clustering
 Decision Trees
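Two of the techniques listed above, descriptive statistics and linear regression, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `fit_line` helper and the hours/scores data are hypothetical and chosen only for the example.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 64, 69]

# Descriptive statistics summarize the sample.
print("mean score:", mean(scores))
print("sample std dev:", round(stdev(scores), 2))

# Linear regression models the relationship between the two variables.
slope, intercept = fit_line(hours, scores)
print(f"score ~ {slope:.2f} * hours + {intercept:.2f}")
```

In practice, libraries such as NumPy, SciPy, or scikit-learn would normally replace the hand-written fit.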
1. Data Collection Tools

Semantria
 Semantria is a cloud-based tool that extracts information
from text by analyzing its content and sentiment. It is a
high-end NLP (natural language processing) tool that can
detect the sentiment attached to specific elements based on
the language used (sounds like magic? No, it is science!).
Trackur
 Trackur is another tool that collects data, especially from
social media platforms, by tracking feedback on brands and
products. It also performs sentiment analysis. As a
monitoring tool, it can be of great value to marketing
companies.
 Many other applications offer similar text and semantic
analysis and content management, e.g., OpenText and Opinion Crawl.
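The idea behind sentiment analysis can be sketched with a simple word-list approach. This is only an illustrative assumption, not how Semantria or Trackur work internally; the `POSITIVE` and `NEGATIVE` lexicons and the sample reviews are hypothetical.

```python
import re

# Hypothetical mini-lexicons; real tools use far larger vocabularies and models.
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Great phone, I love the camera",
    "Battery life is poor and the screen is broken",
    "It arrived on Tuesday",
]
for review in reviews:
    print(label(review), "-", review)
```

A word-count approach like this ignores negation and context ("not good"), which is one reason production tools rely on trained NLP models.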
2. Data Storage Tools
 These tools store huge amounts of data, typically spread across shared
computers, and let users interact with it. They provide a platform that
unites servers so that data can be accessed easily.
Apache Hadoop
 Hadoop is a software framework for storing and processing very large
volumes of data. It provides a layered structure that distributes data
storage across clusters of computers so that big data is easy to process.
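The idea of distributing storage across a cluster can be sketched as splitting data into blocks and placing copies of each block on several machines. This is a simplified illustration, not Hadoop's actual placement policy or API; the block size, node names, and round-robin rule are assumptions made for the example.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block index to `replicas` nodes, round-robin.

    Copies land on distinct nodes as long as replicas <= len(nodes).
    """
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replicas)]
        for b in range(num_blocks)
    }


data = b"a large file would go here"      # stand-in for a big dataset
blocks = split_into_blocks(data, block_size=8)
nodes = ["node-1", "node-2", "node-3"]    # hypothetical cluster
placement = place_blocks(len(blocks), nodes)

for index, holders in placement.items():
    print(index, blocks[index], holders)
```

Because every block lives on more than one node, the data stays available if a single machine fails, and different nodes can process different blocks in parallel.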
Apache Cassandra
 Cassandra is a free, open-source database. It uses CQL (Cassandra Query
Language), which resembles SQL, to communicate with the database. It
provides high availability of data stored across multiple servers.
MongoDB
 MongoDB is a document-oriented database that is also free to use. It is
available on multiple platforms such as Windows, Solaris, and Linux. It is
easy to learn and reliable.
 Similar data storage platforms are CouchDB, Apache Ignite, and Oracle
NoSQL Database.
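What "document-oriented" means can be shown with a toy in-memory collection in which each record is a free-form document rather than a fixed table row. The `DocumentCollection` class below is hypothetical and is not the MongoDB API; it only illustrates the data model.

```python
import json


class DocumentCollection:
    """Toy in-memory collection; NOT the MongoDB API, only the idea behind it."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Round-trip through JSON to store an independent copy of the document.
        self._docs.append(json.loads(json.dumps(doc)))

    def find(self, **criteria):
        """Return documents whose top-level fields equal all given criteria."""
        return [
            d for d in self._docs
            if all(d.get(key) == value for key, value in criteria.items())
        ]


users = DocumentCollection()
users.insert({"name": "Asha", "city": "Coimbatore", "skills": ["R", "Python"]})
users.insert({"name": "Ravi", "city": "Chennai"})  # no "skills" field: shapes may differ

print(users.find(city="Coimbatore"))
```

Documents in the same collection do not need identical fields, which is the main contrast with rows in a relational table.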
3. Data Extraction Tools
 Data extraction tools, also known as web scraping tools,
automatically extract information and data from websites.
The following tools can be used for data extraction.
Octoparse
 Octoparse is a web scraping tool available in free and paid versions.
It outputs data as structured spreadsheets that are readable and easy
to use in further operations. It can extract phone numbers, IP
addresses, and email IDs, along with other data, from websites.
Content Grabber
 Content Grabber is also a web scraping tool, but it comes with advanced
features such as debugging and error handling. It can extract data from
almost any website and provide structured output in the user's
preferred format.
 Similar tools are Mozenda, Pentaho, and import.io.
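The core of web scraping, pulling structured values such as email addresses out of page markup, can be sketched with Python's standard library. The page content, the `TextExtractor` class, and the simplified email pattern are assumptions for illustration; a real scraper would first download the page and should respect the site's terms of use.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical page content; a real scraper would download this first.
PAGE = """
<html><body>
  <h1>Contact</h1>
  <p>Sales: sales@example.com</p>
  <p>Support: support@example.com, phone 555-0100</p>
</body></html>
"""

parser = TextExtractor()
parser.feed(PAGE)
text = " ".join(parser.chunks)

# Simplified pattern: good enough for a demo, not a full email validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = EMAIL.findall(text)
print(emails)
```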
4. Data Cleaning / Refining Tools
 Integrated with databases, data cleaning tools save time by
searching, sorting, and filtering the data that analysts will use.
The refined data is relevant and easier to work with (Blei and
Smyth, 2017).
DataCleaner
 DataCleaner works with Hadoop data stores and is a very
powerful data indexing tool. It improves data quality by
finding duplicates and merging them into one record. It can
also find missing values, patterns, and specific data groups.
OpenRefine
 OpenRefine deals with messy data. It cleans the data before
transforming it into another form, and it provides fast and
easy access to the data.
 Similar data cleaning tools are MapReduce, RapidMiner, and
Talend.
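Two common cleaning steps described above, merging duplicates and handling missing values, can be sketched in plain Python. The record layout and the rule "two records are duplicates when their normalized emails match" are assumptions for this example, not the behavior of any specific tool.

```python
def normalize(record):
    """Trim whitespace and lowercase the email so equal records compare equal."""
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }


def clean(records):
    """Drop records with a missing email and collapse duplicates by email."""
    seen = {}
    for record in map(normalize, records):
        if not record["email"]:
            continue  # missing value: skip the record
        seen.setdefault(record["email"], record)  # keep the first occurrence
    return list(seen.values())


raw = [
    {"name": " Asha ", "email": "asha@example.com"},
    {"name": "Asha", "email": "ASHA@example.com "},
    {"name": "Ravi", "email": ""},
    {"name": "Meena", "email": "meena@example.com"},
]
cleaned = clean(raw)
print(cleaned)
```

Whether to drop, flag, or fill in records with missing values depends on the analysis; dropping them here is simply the shortest option to show.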
5. Data Analysis Tools
 Data analysis tools not only analyze data but also perform operations on it. They
inspect the data and apply data modeling to draw out useful, conclusive information
that supports decision-making for a given problem or query.
R
 R is a widely used programming language for statistical computing and graphics.
It runs on various platforms, including Windows, macOS, and Linux, and is popular
among data analysts, statisticians, and researchers.
Apache Spark
 Apache Spark is a powerful analytics engine that provides real-time analysis,
processing data in mini- and micro-batches as well as streams. It is productive
because it supports highly interactive workflows.
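The micro-batch idea mentioned for Apache Spark can be illustrated without Spark itself: incoming records are grouped into small batches, and each batch updates a running result. The sketch below is plain Python, not Spark's API; the `micro_batches` helper and the event values are hypothetical.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size items until the stream is exhausted."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [3, 7, 1, 9, 4, 6, 2]   # stand-in for records arriving over time
running_total = 0
totals = []
for batch in micro_batches(events, batch_size=3):
    running_total += sum(batch)  # process one small batch at a time
    totals.append(running_total)

print(totals)
```

Smaller batches give fresher results at the cost of more per-batch overhead, which is the basic trade-off in this style of stream processing.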
Python
 Python is a powerful, high-level programming language that has been around for
quite a while. It was originally used mainly for application development, but it now
has a rich set of tools aimed at data science. Its output can be saved as CSV files
and used in spreadsheets.
 Similar data analysis tools are Apache Storm, SAS, Flink, and Hive.
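Saving Python results as CSV for use in a spreadsheet can be sketched with the standard `csv` module. The rows below are hypothetical, and an in-memory buffer stands in for a file on disk so that the example is self-contained.

```python
import csv
import io

rows = [
    {"tool": "R", "category": "analysis"},
    {"tool": "Apache Spark", "category": "analysis"},
    {"tool": "OpenRefine", "category": "cleaning"},
]

# Write the rows as CSV text (use open("tools.csv", "w", newline="") for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["tool", "category"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back into a list of dictionaries.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```

The resulting text opens directly in spreadsheet applications, with the first line used as the column headers.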
6. Data Visualization Tools
 Data visualization tools present data graphically for clear insight. Many visualization tools combine the
functions discussed in the previous sections and can also support data extraction and analysis alongside
visualization.
Python
 Python, as mentioned above, is a powerful, general-purpose programming language that also supports data
visualization. It offers a wide range of graphics libraries for representing many kinds of data
graphically.
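As a dependency-free stand-in for Python's graphics libraries (Matplotlib is a common choice), the sketch below draws a text-mode bar chart. The `bar_chart` helper and the usage figures are hypothetical and serve only to show how values map to bar lengths.

```python
def bar_chart(counts, width=20):
    """Return a text bar chart, scaling the largest value to `width` characters."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical figures, chosen only to make the bars easy to compare.
usage = {"Python": 40, "R": 20, "Tableau": 10}
chart = bar_chart(usage)
print(chart)
```

A graphical library applies the same scaling step, then renders the bars as shapes with axes and labels instead of characters.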
Tableau
 With a very large customer base, Tableau has been referred to by Forbes as the grandmaster of all
visualization software. It is commercial software (not open source) that can be integrated with databases, is
easy to use, and provides interactive data visualization in the form of bars, charts, and maps.
Orange
 Orange is an open-source data visualization tool that supports data extraction, data analysis, and machine
learning. It does not require programming; instead, it has an interactive, user-friendly graphical user
interface that displays data in the form of bar charts, networks, heat maps, scatter plots, and trees.
Google Fusion Tables
 Google Fusion Tables was a web service from Google that non-programmers could easily use for collecting
data; Google retired the service in December 2019. Users could upload data as CSV files and save them too. It
looked much like an Excel spreadsheet and allowed editing, with changes shown in the visualizations in real
time. It displayed data in the form of pie charts, bars, timelines, line plots, and scatter plots, and it
allowed data tables to be linked to websites. Users could also create a map based on their data, modify its
coloring, and share it.
 Other popular data visualization tools include Datawrapper, Qlik, and Gephi, which accept CSV files as data
input. Of these, Gephi is open source, while Qlik is a commercial product.
Data Scientist-Key tools
