Chapter 2 - Data ScienceOR Ing
Introduction
Data science is now one of the most influential topics in industry.
Companies and enterprises are focusing heavily on acquiring data science
talent, which in turn creates more roles in the data science industry.
Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured, and unstructured data.
Example: the data involved in buying a box of cereal from a store or
supermarket.
Data Science vs Data scientist
Data science is defined as the extraction of actionable
knowledge directly from data through a process of
discovery, hypothesis formulation, and hypothesis analysis.
It is a process of effectively producing, or helping to produce,
some tool, method, or other product that derives intelligence
from very large datasets.
A data scientist (a job title) is a person engaged in a systematic activity to acquire
knowledge from data.
In a more restricted sense, a data scientist may refer to an individual who applies the scientific
method to existing data.
Processed and filtered data are handed to data scientists, who feed them into various
analytics programs and machine learning models, together with statistical methods, to generate
data that is then used in predictive analysis and other fields.
The scientific method requires data in order to begin iterating toward a more convincing hypothesis.
Science doesn't exist without data.
A data scientist possesses a strong quantitative background in statistics and linear algebra,
as well as programming knowledge with a focus on data warehousing, mining, and modeling to
build and analyze algorithms.
Data vs. Information
Data
Data analytics (DA) is the process of examining data sets to draw conclusions about
the information they contain, increasingly with the help of specialized systems and
software.
From a data analytics point of view, it is important to understand that there are
three common data types or structures:
Structured (clearly defined and represented)
Unstructured (often qualitative, and often difficult to organize and enter into a database)
Semi-structured (includes elements of both)
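The three data structures listed above can be illustrated with a minimal Python sketch. The sample values (a cereal purchase, echoing the earlier example) and all variable names are assumptions made for illustration; they are not taken from the slides.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical sample).
structured_text = "item,price,quantity\ncereal,3.50,1\nmilk,1.20,2\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured_text = '{"item": "cereal", "price": 3.5, "tags": ["breakfast"]}'
record = json.loads(semi_structured_text)

# Unstructured: free text with no predefined fields; here we can only count words.
unstructured_text = "Bought a box of cereal today, the store was crowded."
word_count = len(unstructured_text.split())

print(rows[0]["item"], record["tags"], word_count)
```

Note that the structured rows can be queried by column name right away, while the free text needs further processing before it can be analyzed.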
Veracity: can we trust the data and its source? How accurate is it?
Clustered Computing
Cluster computing supports High Performance Distributed Computing (HPDC): computers networked
as a group work as a single entity to achieve a common goal.
Clustering approaches that have been identified include HPC (high-performance computing),
IaaS (infrastructure as a service), and PaaS (platform as a service); these are more expensive and
more difficult to set up and maintain than a single computer.
In HPDC environments, parallel and/or distributed computing techniques are applied to the solution
of computationally intensive applications across networks of computers.
A “computer cluster” is a set of connected computers working
together.
The cluster represents one system, and the objective is to improve performance.
The computers are generally connected in a LAN (Local Area Network).
When such a cluster of computers works together to perform tasks and gives the
impression of being a single entity, it is called “cluster computing”.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of
benefits:
Resource Pooling (grouping resources together):
Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are
also extremely important. Processing large datasets requires large amounts of all three of these
resources.
Object pooling is a technique that keeps a group of objects (called the pool) in memory
(grouping objects in different categories).
Whenever a new object needs to be created, the pool is checked first; if an object is available, it is
reused. This provides reuse of objects and system resources and improves the scalability of the
program.
This resembles cloud resource pooling, where resources such as storage, memory, processing, and bandwidth are pooled.
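The object pooling idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Connection` class stands in for any object that is expensive to create, and both `Connection` and `ObjectPool` are hypothetical names, not part of any clustering product.

```python
class Connection:
    """Stand-in for an object that is expensive to create (hypothetical)."""
    created = 0  # counts how many objects were actually constructed

    def __init__(self):
        Connection.created += 1
        self.id = Connection.created


class ObjectPool:
    """Keeps released objects in memory so they can be reused."""

    def __init__(self, factory):
        self._factory = factory
        self._idle = []

    def acquire(self):
        # Check the pool first; create a new object only if none is idle.
        if self._idle:
            return self._idle.pop()
        return self._factory()

    def release(self, obj):
        # Return the object to the pool instead of discarding it.
        self._idle.append(obj)


pool = ObjectPool(Connection)
first = pool.acquire()
pool.release(first)
second = pool.acquire()  # reuses the released object
third = pool.acquire()   # pool is empty, so a new object is created
print(second is first, Connection.created)
```

The sketch is single-threaded; a pool shared between threads would additionally need locking around `acquire` and `release`.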
High Availability: In computing, availability describes the period of time during which a
service is available, as well as the time a system requires to respond to a user's request.
High availability is a quality of a system or component that assures a high level of operational
performance for a given period of time.
Clusters can provide varying levels of fault tolerance and availability guarantees to prevent
hardware or software failures from affecting access to data and processing. This becomes
increasingly important as real-time analytics grows in importance.
Easy Scalability (expanding by adding resources to the system): Clusters make it easy to scale
horizontally by adding machines to the group. This means the system can react to
changes in resource requirements without expanding the physical resources on any single machine.
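The availability benefit of a cluster can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that nodes fail independently and that any single working node can serve requests; real clusters rarely satisfy both assumptions exactly. The function name and the 0.99 figure are hypothetical.

```python
def cluster_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` replicas is up.

    Assumes independent failures and that one working replica is enough.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


single = cluster_availability(0.99, 1)  # one machine on its own
triple = cluster_availability(0.99, 3)  # three redundant machines
print(f"{single:.6f} {triple:.6f}")
```

Under these assumptions, adding machines raises availability without making any single machine more reliable, which is the same horizontal approach used for scaling.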
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with big data easier. It allows for the
distributed processing of large datasets across clusters of computers using simple programming models. (It stores and processes
large datasets ranging from gigabytes to petabytes, and its ecosystem includes multiple components that support big
data processing.)
The four key characteristics of Hadoop (a software library) are:
Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
Scalable: It is easily scalable both horizontally and vertically. A few extra nodes help in scaling up the framework.
Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it
later.
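One programming model commonly associated with Hadoop is MapReduce. The sketch below imitates its map, shuffle, and reduce steps in plain Python to show the idea of a "simple programming model"; it does not use the Hadoop API, and the function names and input chunks are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the grouped values to get one count per word.
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, split into chunks as if stored on different machines.
chunks = [["big data is big"], ["data is everywhere"]]
pairs = [pair for chunk in chunks for line in chunk for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster the map step would run on the machines that hold each chunk, and only the grouped pairs would travel over the network.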
Hadoop and its Ecosystem
Determines who has authority and control over data and how these data assets may be used.
End of chapter 2