Chapter 2

CHAPTER 2
Data
CONT.
Introduction
 Data science is now one of the most influential topics in technology and business.
 Companies and enterprises are investing heavily in recruiting data science talent, which in turn creates more viable roles in the data science industry.
 Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Example: the data involved in buying a box of cereal from a store or supermarket.

Data Science vs Data scientist
 Data science is defined as the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis analysis.
 It is a process of effectively producing, or helping to produce, some tool, method, or other product that derives intelligence from very large datasets.
Data Science vs Data scientist
 "Data scientist" is a job title for a person who engages in a systematic activity to acquire knowledge from data.
 In a more restricted sense, a data scientist is an individual who applies the scientific method to existing data.
 Data scientists perform research toward a more comprehensive understanding of products, systems, or nature, including the physical, mathematical, and social realms (environments or spheres).

Role of a Data Scientist
 Develop advanced skills in analyzing large amounts of data, in data mining, and in programming.
 Receive processed and filtered data and feed it to various analytics programs and machine learning models, together with statistical methods, to generate data that is then used in predictive analysis and other fields.
 Explore the data for hidden patterns in order to obtain proper insights.

Data Science
 The scientific method requires data in order to begin iterating toward a more convincing hypothesis.
 Science does not exist without data.
 A data scientist possesses:
 A strong quantitative background in statistics and linear algebra
 Programming knowledge focused on data warehousing, mining, and modeling, used to build and analyze algorithms
Data vs. Information
 Data
 Can be defined as a representation of facts, concepts, or instructions in a formalized manner that is suitable for communication, interpretation, or processing by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Information
 The processed data on which decisions and actions are based.
 Information is interpreted data: it is created from organized, structured, and processed data in a particular context.
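The distinction can be illustrated with a minimal sketch: a list of raw readings is the data, and the summary derived from it is the information a decision could be based on. The readings, the threshold, and the function name `summarize` are made up for illustration only.

```python
from statistics import mean

# Raw data: unprocessed facts and figures (hypothetical readings in degrees C).
readings = [21.5, 22.0, 35.5, 36.0, 22.5]

def summarize(values, threshold):
    """Turn raw values into information that a decision can be based on."""
    return {
        "count": len(values),
        "average": round(mean(values), 2),
        "above_threshold": sum(1 for v in values if v > threshold),
    }

info = summarize(readings, threshold=30.0)
print(info)
```

The list on its own says little; the summary answers a question ("how many readings were too high?") in a particular context.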
Data Processing Cycle
 Data processing is the conversion of raw data into meaningful information through a process.
 Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation.
 The process includes activities such as data entry (input), calculation (processing), output, and storage.
 Input is the task in which verified data is coded or converted into machine-readable form so that it can be processed by a computer. Data entry is done with a keyboard, digitizer, or scanner, or by importing data from an existing source.
Data Processing Cycle
 Processing is the stage in which the data is subjected to various means and methods of manipulation. It is the point at which a computer program is being executed; the running process contains the program code and its current activity.
 Output and interpretation is the stage in which processed information is transmitted to the user. Output is presented in various formats, such as a printed report, audio, video, or an on-screen display.
 Storage is the last stage in the data processing cycle, in which data, instructions, and information are held for future use. Its importance is that it allows quick access to and retrieval of the processed information, so that it can be passed directly to the next stage when needed.
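The four stages can be sketched as four small functions. This is only an illustration under simple assumptions: the input is a comma-separated string, the "storage" is an in-memory dictionary, and all function names are hypothetical.

```python
import json

def input_stage(raw_text):
    """Input: convert verified raw data into a machine-readable form."""
    return [int(token) for token in raw_text.split(",")]

def processing_stage(numbers):
    """Processing: manipulate the data to produce a result."""
    return {"total": sum(numbers), "largest": max(numbers)}

def output_stage(result):
    """Output: present the processed information to the user."""
    return f"total={result['total']}, largest={result['largest']}"

def storage_stage(result, store):
    """Storage: hold the information for future use (here, an in-memory dict)."""
    store["last_result"] = json.dumps(result)

store = {}
numbers = input_stage("4, 8, 15, 16, 23, 42")
result = processing_stage(numbers)
report = output_stage(result)
storage_stage(result, store)
print(report)
```

Because the result is stored, a later step can read `store["last_result"]` directly instead of repeating the input and processing stages.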
Data types
 A data type is a way to tell the compiler what kind of data (integer, character, float, etc.) is to be stored and, consequently, how much memory to allocate.
 Put another way, a data type tells the compiler that memory cell x may hold only bit values within some range y. The compiler will not allow anything outside that value range to be stored there.
 Common data types include:
 Integers (int): used to store whole numbers, mathematically known as integers
 Booleans (bool): used to represent values restricted to one of two options, true or false
 Characters (char): used to store a single character
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and numbers
 AA
 Formal language
 Logic in computer science
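As a rough illustration of the common data types, the sketch below shows each category as a Python value and then asks the standard `struct` module how many bytes a fixed-size C-style version of each type occupies. Python is dynamically typed and interpreted, so the type is attached to the value rather than declared to a compiler; the example values are arbitrary.

```python
import struct

# The same categories as Python values (Python infers the type from the value).
examples = {
    "int": 42,
    "bool": True,
    "char": "a",          # Python has no char type; a one-character str stands in
    "float": 3.14,
    "string": "room 101",
}
for name, value in examples.items():
    print(name, type(value).__name__, value)

# Bytes allocated for fixed-size C-style types ("<" selects struct's standard sizes).
sizes = {
    "int": struct.calcsize("<i"),
    "bool": struct.calcsize("<?"),
    "char": struct.calcsize("<c"),
    "float": struct.calcsize("<f"),
}
print(sizes)
```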
Data representation
 Types are an abstraction that lets us model things in categories; they are largely a mental construct.
 All computers represent data as nothing more than strings of ones and zeroes.
 For those ones and zeroes to convey any meaning, they need to be contextualized.
 Data types provide that context.
 E.g. 01100001
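A short sketch of how context changes the meaning of the same bit pattern: read as an unsigned integer, 01100001 is 97; read as a character code, it is the letter "a". The variable names are illustrative.

```python
bits = "01100001"

as_integer = int(bits, 2)                           # read the bits as an unsigned integer
as_character = chr(as_integer)                      # read the same value as a character code
as_byte = as_integer.to_bytes(1, byteorder="big")   # the raw single byte

print(as_integer, as_character, as_byte)
```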
Data types from Data Analytics perspective
 Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the help of specialized systems and software.
 From a data analytics point of view, it is important to understand that there are three common types of data, or data structures:
 Structured (clearly defined and represented)
 Unstructured (often qualitative, and often difficult to organize and enter into a database)
 Semi-structured (has characteristics of both)

Structured Data
 Structured data is data that adheres to a predefined data model and is therefore straightforward to analyze.
 Structured data covers all data that can be stored in an SQL database, in tables with rows and columns. It has relational keys and can easily be mapped into predesigned fields.
 Structured data is highly organized information that loads neatly into a relational database.
 Structured data is relatively simple to enter, store, query, and analyze, but it must be strictly defined in terms of field name and type.
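A minimal sketch of structured data, using Python's built-in `sqlite3` module with an in-memory database. The table name `purchases`, its columns, and the rows are invented for illustration; the point is that every field has a fixed name and type, which is what makes the query straightforward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    " id INTEGER PRIMARY KEY,"
    " item TEXT NOT NULL,"
    " quantity INTEGER NOT NULL,"
    " unit_price REAL NOT NULL)"
)
rows = [(1, "cereal", 2, 3.50), (2, "milk", 1, 1.20), (3, "cereal", 1, 3.50)]
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# Because the fields are strictly defined, aggregating them is a one-line query.
totals = conn.execute(
    "SELECT item, SUM(quantity * unit_price) FROM purchases"
    " GROUP BY item ORDER BY item"
).fetchall()
print(totals)
conn.close()
```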
Unstructured Data
 Unstructured data is information that either does not have a predefined data model or is not organized in a predefined manner.
 Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database.
 Most business interactions are, in fact, unstructured in nature.
 Today more than 80% of the data generated is unstructured.
 The fundamental challenge of unstructured data sources is that they are difficult for nontechnical business users and data analysts alike to unpack, understand, and prepare for analytic use.
Semi-structured Data
 Semi-structured data is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables, but that nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 Semi-structured data is information that does not reside in a relational database but that does have some organizational properties that make it easier to analyze.
 Examples of semi-structured data: CSV (comma-separated values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation) documents are semi-structured documents, and NoSQL databases are considered semi-structured.
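A small sketch of a semi-structured document, parsed with Python's standard `json` module. The keys act as the tags that separate semantic elements, nesting gives the hierarchy, and records are free to differ (only one order carries a `note`). The document content is made up for illustration.

```python
import json

document = """
{
  "customer": {"name": "Abebe", "city": "Adama"},
  "orders": [
    {"item": "cereal", "quantity": 2},
    {"item": "milk", "quantity": 1, "note": "low fat"}
  ]
}
"""
record = json.loads(document)

# Keys identify the fields; a missing field is simply absent rather than an error.
items = [order["item"] for order in record["orders"]]
notes = [order.get("note") for order in record["orders"]]
print(items, notes)
```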
Metadata – Data about Data
 Metadata is data about data: data that describes other data.
 It provides additional information about a specific set of data.
 Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.
 For example, author, date created, date modified, and file size are very basic document metadata.
 Being able to filter on that metadata makes it much easier for someone to locate a specific document.
 In the context of databases, metadata is information about tables, views, columns, arguments, etc.
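As an illustration of basic file metadata, the sketch below writes a small temporary file and then reads its size and modification time with `os.stat` from the Python standard library. The file name and content are placeholders.

```python
import os
import tempfile
from datetime import datetime, timezone

content = b"Chapter 2 notes"
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")
    with open(path, "wb") as handle:
        handle.write(content)

    # os.stat returns data about the file, not the data inside it.
    stats = os.stat(path)
    metadata = {
        "file_name": os.path.basename(path),
        "size_bytes": stats.st_size,
        "modified": datetime.fromtimestamp(stats.st_mtime, tz=timezone.utc),
    }
print(metadata["file_name"], metadata["size_bytes"])
```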
Data value chain
 The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 The steps are: data acquisition; data analysis (converting data into useful statistics, e.g. charts and maps); data curation or grouping (organizing and integrating data collected from various sources); data storage; and data usage.
 Data acquisition is the process of digitizing data from the world around us so that it can be displayed, analyzed, and stored in a computer. It covers the processes for bringing data created by a source outside the organization into the organization for production use.
Data value chain
 Data analysis is the process of evaluating data using analytical and logical reasoning to examine each component of the data provided. Data from various sources is gathered, reviewed, and then analyzed to form a finding or conclusion.
 Data analytics is the process of finding information in data in order to make a decision and subsequently act on it.
Data value chain
 Data curation is about managing data throughout its lifecycle. It includes collecting, organizing, cleaning, and much more. Data curators manage the data through its various stages and make it usable for data analysts and data scientists.
 Data storage is defined as a way of keeping information in memory or storage for use by a computer. An example of data storage is a folder for storing Microsoft Word documents.
 Data usage is the amount of data (images, movies, photos, videos, and other files) that you send, receive, download, and/or upload.
Big Data Definition
 There is no single standard definition.
 "Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
Big data
 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
 In other words, data in the range of hundreds of terabytes (TB) or petabytes (PB) falls under Big Data. 1 TB = 1000 GB and 1 PB = 1000 TB.
 However, the amount of data is not what matters most; what matters is what organizations do with the data.
 Big Data is analyzed for insights that lead to better decisions.
Big data
 Big Data is associated with the concept of the 3 Vs: volume, velocity, and variety. Big data is characterized by these three Vs and more:
 Volume: large amounts of data, up to zettabytes; massive datasets. 1 ZB = 1 trillion GB (10 to the power of 21 bytes, that is, a 1 followed by 21 zeros).
 Velocity: data is live-streaming or in motion.
 Variety: data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it?
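The unit conversions quoted for Big Data can be checked with a few lines of arithmetic. This sketch uses the decimal (SI) definitions, in which each step is a factor of 1000; the constant names are just labels.

```python
# Decimal (SI) storage units expressed in bytes.
GB = 10**9
TB = 1000 * GB
PB = 1000 * TB
ZB = 10**21

print(TB // GB)   # gigabytes in one terabyte
print(PB // TB)   # terabytes in one petabyte
print(ZB // GB)   # gigabytes in one zettabyte
print(ZB // PB)   # petabytes in one zettabyte
```

One zettabyte therefore equals 10^12 GB, which is the "1 trillion GB" figure, and one million petabytes.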
Clustered Computing
 Cluster computing supports High Performance Distributed Computing (HPDC): computers networked in a group work as a single entity to achieve a common goal.
 Approaches associated with clustering include HPC (high-performance computing), IaaS (infrastructure as a service), and PaaS (platform as a service). These are more expensive and more difficult to set up and maintain than a single computer.
 In HPDC environments, parallel and/or distributed computing techniques are applied to the solution of computationally intensive applications across networks of computers.
Clustered Computing
 A "computer cluster" basically refers to a set of connected computers working together.
 The cluster represents one system, and the objective is to improve performance.
 The computers are generally connected in a LAN (Local Area Network).
 So, when this cluster of computers works to perform some task and gives the impression of being a single entity, it is called "cluster computing".
Clustered Computing
 Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
 Resource pooling (grouping resources together):
 Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important. Processing large datasets requires large amounts of all three of these resources.
 Object pooling is a technique that stores a group of objects (called pool storage) in memory, grouped into categories.
 Whenever a new object needs to be created, the pool storage is checked first; if a suitable object is available, it is reused. This provides reusability of objects and system resources and improves the scalability of the program.
 This resembles cloud resource pooling, in which storage, memory, processing, bandwidth, and similar resources are pooled.
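The object pooling idea described above can be sketched in a few lines. This is a minimal single-threaded illustration, not a production pool: the `Connection` class stands in for any object that is expensive to create, and both class names are hypothetical.

```python
class Connection:
    """Hypothetical expensive-to-create object."""
    created = 0

    def __init__(self):
        Connection.created += 1
        self.id = Connection.created


class ObjectPool:
    def __init__(self, factory):
        self._factory = factory
        self._idle = []          # pool storage: objects waiting to be reused

    def acquire(self):
        # Check pool storage first; create a new object only if none is idle.
        if self._idle:
            return self._idle.pop()
        return self._factory()

    def release(self, obj):
        # Return the object to pool storage so a later acquire can reuse it.
        self._idle.append(obj)


pool = ObjectPool(Connection)
first = pool.acquire()
pool.release(first)
second = pool.acquire()      # reuses the released object
third = pool.acquire()       # pool is empty, so a new object is created
print(first is second, Connection.created)
```

Three acquisitions cost only two constructions here, which is the reuse benefit the pool is meant to provide.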
Clustered Computing
 High availability: In computing, the term availability describes the period of time during which a service is available, as well as the time a system needs to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.
 Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as the emphasis on real-time analytics grows.
 Easy scalability (expanding by adding resources to the system): Clusters make it easy to scale horizontally by adding machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources of any single machine.

Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with big data easier. It allows for the distributed processing of large datasets across clusters of computers using simple programming models. It stores and processes datasets ranging from gigabytes to petabytes, and its ecosystem includes multiple components that support big data processing.
 The four key characteristics of Hadoop (a software library) are:
 Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
 Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: It is easily scalable both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
 It is continuously growing to meet the needs of Big Data.
 It comprises the following components and many others:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL database (non-relational)
○ Mahout, Spark MLlib: Machine learning algorithm libraries
○ Solr, Lucene: Searching and indexing
○ ZooKeeper: Managing the cluster
○ Oozie: Job scheduling

Big Data Life Cycle with Hadoop
Ingesting data into the system

■ The first stage of Big Data processing is Ingest. Data is ingested, or transferred, into Hadoop from various sources such as relational databases, other systems, or local files. Sqoop, a transfer tool, moves data from an RDBMS to HDFS (Hadoop Distributed File System), whereas Flume transfers event data (information about changes that occur to data at a specific time).
Processing the data in storage
■ The second stage is Processing. In this stage, the data is stored and processed. The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
Big Data Life Cycle with Hadoop
Computing and analyzing data

• The third stage is Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala. Pig converts the data using map and reduce operations and then analyzes it. Hive is also based on map and reduce programming and is most suitable for structured data.
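The map and reduce programming model mentioned above can be sketched in plain Python with a word count. This is not the Hadoop, Pig, or Hive API, only an illustration of the idea on one machine; the function names and input lines are invented.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big clusters", "data drives decisions"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster, the map calls would run on the machines that hold each block of input and the reduce calls would run in parallel per key; the structure of the computation is the same.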
Big Data Life Cycle
Determines who has authority and control over data and how these data assets may be used.
THANKS
End of chapter 2
