Is data science only the stuff going on in companies like Google, Facebook, and the other tech firms? Why do many people refer to big data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as something that only takes place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous, they're well-nigh meaningless.

There is a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of thing for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never "big" until Google came along. Statisticians already feel that they are studying and working on the "science of data". That's their bread and butter. Maybe you, dear reader, are not a statistician and don't care, but imagine that, for the statistician, this feels a little like how identity theft might feel to you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it were simply statistics or machine learning in the context of the tech industry. People have said to us, "Anything that has to call itself a science isn't." Although there may be truth in that, it does not mean that the term "data science" represents nothing; what it represents may simply be more of a craft than a science.

Fig. 1: Architecture of big data and data science

There are many debates as to whether data science is a new field. Many argue that similar practices have long been used and branded as statistics, analytics, business intelligence, and so forth. In either case, data science is a very popular and prominent term used to describe many different data-related processes and techniques that will be discussed here. Big data, on the other hand, is relatively new in the sense that the amount of data collected, and the associated challenges, continue to require new and innovative hardware and techniques for handling it.

This article is meant to give the non-data scientist a solid overview of the many concepts and terms behind data science and big data. Related terms will be mentioned only at a very high level, and the reader is encouraged to explore the references and other resources for additional detail. Another post will follow that explores related technologies, algorithms, and methodologies in much greater detail.

Need of Data Science and Big Data

Need of Data Science: The main goal of data science is to discover patterns in data. It analyses data and draws conclusions from it using a variety of statistical approaches. A data scientist must examine the data extensively, from extraction through wrangling and pre-processing, and then make forecasts based on it. The data scientist's mission is to draw conclusions from data, and those conclusions help businesses make better decisions. Data is essential for driving progress in everything from business to the health industry, from science to our daily lives, and from marketing to research.
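The extract-wrangle-model-forecast loop described above can be made concrete with a short sketch. The following is a minimal example in Python, assuming a hypothetical file sales.csv with a numeric column named target; the file name, column names, and model choice are illustrative placeholders, not a prescribed method.

# Minimal data-science workflow sketch: extract -> wrangle -> model -> forecast.
# "sales.csv" and the "target" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("sales.csv")                       # 1. extraction: load the raw data
df = df.dropna(subset=["target"])                   # 2. wrangling: drop rows with no label
df = pd.get_dummies(df, drop_first=True)            #    encode any text columns as numbers
X, y = df.drop(columns=["target"]), df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)                         # 3. modelling: learn patterns from history

predictions = model.predict(X_test)                 # 4. forecasting: predict on unseen data
print("Mean absolute error:", mean_absolute_error(y_test, predictions))

In practice, most of a data scientist's time goes into step 2: cleaning and reshaping the data before any model is fitted.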
Computer science and information technology have taken over our lives, and they are progressing at such a rapid and diverse rate that the operational procedures used just a few years ago are now obsolete. Challenges and issues are in the same boat: in terms of complexity, the challenges and worries of the past for a given theme, ailment, or deficiency may not be the same today. To keep up with the difficulties of today and tomorrow, and to find answers to unresolved issues, every field of science and study, as well as every company, requires an updated set of operational systems and technologies.

Need of Big Data: The value of big data isn't determined solely by the amount of data available; its worth is determined by how you use it. By evaluating data from any source you can get answers that 1) streamline resource management, 2) increase operational efficiency, 3) optimise product development, 4) drive new revenue and growth prospects, and 5) enable smart decision making. When big data and high-performance analytics are combined, you can accomplish business-related tasks such as:
• Determining the root causes of failures, difficulties, and flaws in near-real time.
• Detecting anomalies faster and more accurately than the naked eye.
• Improving patient outcomes by transforming medical image data into insights as quickly as possible.
• Recalculating whole risk portfolios in minutes.
• Increasing the ability of deep learning models to accurately categorise and respond to changing variables.
• Detecting fraudulent activity before it has a negative impact on your company.

1.2 Application of Data Science: Data science applications did not take on their current role overnight. Thanks to faster computing and cheaper storage, we can now forecast outcomes in minutes that used to take many human hours to compute. A data scientist earns a remarkable $124,000 per year owing to the scarcity of qualified workers in this field, and demand for data science certifications is at an all-time high as a result. Here are ten applications that build on data science concepts and span a variety of domains:

Fraud and Risk Detection: Finance was one of the first industries to use data science. Every year, businesses were fed up with bad loans and losses. They did, however, have a lot of data that had been collected during the initial paperwork for loan approval, so they decided to bring in data scientists to help them recover from their losses. Over time, banking companies learned to divide and conquer this data through customer profiling, historical spending, and other critical indicators in order to assess risk and the likelihood of default. This also helped them promote banking products based on the purchasing power of their customers.
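The risk scoring described above can be sketched with a simple classifier. The following is a minimal illustration, assuming a hypothetical table of past loans with made-up columns (income, past_spending, loan_amount, defaulted); the data, column names, and model are placeholders, not any real bank's method.

# Toy default-risk scoring sketch; every value and column name here is hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.DataFrame({
    "income":        [25_000, 48_000, 90_000, 31_000, 75_000, 22_000],
    "past_spending": [18_000, 30_000, 40_000, 29_000, 35_000, 21_000],
    "loan_amount":   [10_000,  8_000, 15_000, 12_000,  9_000, 11_000],
    "defaulted":     [1, 0, 0, 1, 0, 1],          # 1 = the customer defaulted
})

X = loans[["income", "past_spending", "loan_amount"]]
y = loans["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    stratify=y, random_state=0)

# Fit a simple model and estimate each held-out customer's probability of default.
model = LogisticRegression().fit(X_train, y_train)
print(model.predict_proba(X_test)[:, 1])

A real bank would validate such a model far more carefully before letting it influence lending decisions; the point is only the shape of the workflow: historical records in, a default probability per customer out.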
Healthcare: Data science applications are very beneficial to the healthcare industry:

1. Medical Image Analysis: Procedures such as detecting tumours, artery stenosis, and organ delineation use a variety of approaches and frameworks, such as MapReduce, to identify appropriate parameters for tasks like lung texture classification. For solid texture classification they apply machine learning techniques such as support vector machines (SVM), content-based medical image indexing, and wavelet analysis.

2. Drug Development: The drug discovery process is complex and involves a wide range of professions. The best ideas are frequently constrained by billions of dollars of testing and enormous commitments of money and time; a formal submission takes an average of twelve years. From the first screening of therapeutic compounds through the prediction of the success rate based on biological parameters, data science applications and machine learning algorithms simplify and shorten this process, bringing a new viewpoint to each step. Instead of "lab experiments", these algorithms can predict how a compound will behave in the body using extensive computer modelling and simulations. The goal of computational drug discovery is to develop computer simulations in the form of a physiologically appropriate network, which makes it easier to anticipate future outcomes with high accuracy.

3. Genetics and Genomics: Through genetics and genomics research, data science applications also provide a higher level of treatment customisation. The goal is to discover specific biological links between genetics, illnesses, and treatment response in order to better understand the impact of DNA on our health. Data science tools enable the integration of various kinds of data with genomic data in disease research, allowing a better understanding of genetic influences on drug and disease response. As soon as we have reliable personal genome data, we will have a far better grasp of human DNA, and advanced genetic risk prediction will be a significant step toward more personalised care.

Internet Search: When you think about data science applications, this is usually the first thing that comes to mind. When we think of search, we immediately think of Google, right? However, there are other search engines as well, such as Yahoo, Bing, and Ask. All of these search engines (including Google) use data science techniques to deliver the best result for a searched query in a matter of seconds; consider that Google alone processes more than 20 petabytes of data per day.

Targeted Advertising: If you thought search was the most important data science application, consider the entire digital marketing spectrum. Data science algorithms are used to decide practically everything, from display banners on various websites to digital billboards at airports. This is why digital advertisements achieve a far higher CTR (click-through rate) than traditional advertisements: they can be tailored to a user's previous behaviour. It is also why one user may see advertisements for data science training programmes while another sees an advertisement for apparel in the same spot at the same time.

Website Recommendations: Aren't we all used to Amazon's suggestions for similar products? They not only help you locate suitable products among the billions available, they also enhance the user experience. Many businesses have aggressively employed recommendation engines to promote their products according to users' interests and the relevance of information. Internet companies such as Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many others use this technique to improve the user experience, and the recommendations are based on a user's previous behaviour and search results.
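One common way to build such a recommendation engine is item-to-item similarity: items that the same users have rated alike are considered similar, so liking one leads to a suggestion of the other. The sketch below is a minimal illustration with a made-up rating matrix; real recommender systems at the companies named above are far more elaborate.

# Tiny item-to-item recommendation sketch using cosine similarity; all data is made up.
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
items = ["laptop", "mouse", "novel", "bookmark"]

# Cosine similarity between item columns: items rated alike by the same users score high.
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)

# Recommend the item most similar to something the user already liked (here: the laptop).
liked = 0
scores = similarity[liked].copy()
scores[liked] = -1.0                       # never recommend the item itself
print("Because you liked", items[liked], "you may also like", items[int(np.argmax(scores))])

The same idea scales to millions of items by storing the ratings as a sparse matrix and using approximate nearest-neighbour search instead of the full similarity matrix.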
Advanced Image Recognition: You share a photograph on Facebook with your friends, and you start receiving suggestions to tag them. This automatic tag suggestion feature uses a face recognition algorithm. A recent Facebook post details the further progress it has made in this area, highlighting improvements in image recognition accuracy and capacity.

Speech Recognition: Google Voice, Siri, Cortana, and other speech recognition products are some of the best examples. Even if you are unable to type a message, your life does not come to a halt: using the speech recognition option, you simply say the message out loud and it is converted to text. You will notice, however, that speech recognition is not always accurate.

Airline Route Planning: The airline industry has been known to suffer significant losses all over the world. With the exception of a few aviation service providers, companies are struggling to maintain their occupancy ratios and operating profits. The problem has been worsened by the steep rise in air-fuel prices and the need to offer heavy discounts to customers. It wasn't long before airlines began using data science to pinpoint the most important areas for improvement. Thanks to data science, airlines can now:
• Calculate the likelihood of a flight delay.
• Decide which type of aircraft to buy.
• Decide whether to fly directly to the destination or make a stop along the way (for example, a flight from New Delhi to New York can take a direct route, or it can choose to stop over in another country).
• Run customer loyalty programmes effectively.

Gaming: Machine learning algorithms are increasingly used to create games that develop and upgrade themselves as the player progresses through the levels. In motion gaming, your opponent (the computer) also studies your previous moves and adjusts its game accordingly. EA Sports, Zynga, Sony, Nintendo, and Activision Blizzard have all used data science to take gaming to the next level.

Augmented Reality: This is the last of the data science applications, and it appears to have the greatest potential for the future. Augmented reality refers to technology that overlays computer-generated content on a user's view of the real world.

1.3 Data Explosion:
• In parallel with the expansion in the service offerings of IT companies, there is growth in another environment: the data environment. The volume of data is practically exploding by the day, and the data that is available is becoming increasingly unstructured. Statistics from IDC projected that global data would grow by up to 44 times by 2020, amounting to a massive 35.2 zettabytes (a zettabyte, ZB, is a billion terabytes).
• These factors, coupled with the need for real-time data, constitute the "big data" environment. How can organizations stay afloat in the big data environment? How can they manage this copious amount of data?
• I believe a three-tier approach to managing big data would be key: the first tier to handle structured data, the second involving appliances for real-time processing, and the third for analysing unstructured content. Can this structure be tailored to your organization?
• No matter what the approach, organizations need to create a cost-effective method that gives structure to big data. According to a report by McKinsey & Company, accurate interpretation of big data can improve retail operating margins by as much as 60%. This is where information management comes in.
• Information management is vital for summarising data into a manageable and understandable form. It is also needed to extract useful and relevant data from the large pool that is available and to standardize it. With information management, data can be standardized into a fixed form, and standardized data can be used to find underlying patterns and trends.
• Statistics say that the United States alone could face a shortage of 140,000 to 190,000 people with the requisite analytical and decision-making skills by 2018.
• Organizations are now looking for partners in effective information management, to form mutually beneficial, long-term arrangements.
• The challenge before the armed forces is to develop tools that enable the extraction of relevant information from the data for mission planning and intelligence gathering. For that, the armed forces need data scientists like never before.
• Big data describes a massive volume of both structured and unstructured data, so large that it is difficult to process using traditional database and software techniques. While the term refers to the volume of data, it also covers the technology, tools, processes, and storage facilities required to handle such large amounts of data.

1.4 V's of Big Data: In recent years, the "3 V's" of big data have been extended to the "5 V's", which are also known as the characteristics of big data and are as follows:

1) Volume
• Volume refers to the amount of data generated through websites, portals, and online applications. Especially for B2C companies, volume encompasses all the available data that is out there and needs to be assessed for relevance.
• Volume defines the data infrastructure capability of an organization's storage, management, and delivery of data to end users and applications. It focuses on planning current and future storage capacity, particularly as it relates to velocity, but also on reaping the optimal benefits of the storage infrastructure already in place.
• Volume is the V most associated with big data because, well, volume can be big. What we're talking about here are quantities of data that reach almost incomprehensible proportions.
• Facebook, for example, stores photographs. That statement doesn't begin to boggle the mind until you realize that Facebook has more users than China has people, and each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.
• Try to wrap your head around 250 billion images. Or try this one: as far back as 2016, Facebook had 2.5 trillion posts. Seriously, that is a number so big it is pretty much impossible to picture.
• So, in the world of big data, when we start talking about volume, we're talking about insanely large amounts of data. As we move forward, we're going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.
• How much will it add up? Consider this: Gartner, Cisco, and Intel estimate there will be somewhere between 20 and 200 billion connected IoT devices (no, they don't agree, surprise!); the numbers are huge no matter what. But it's not just the quantity of devices.
• Consider how much data is coming off each one. I have a temperature sensor in my garage. Even with a one-minute level of granularity (one measurement a minute), that is still 525,950 data points in a year, and that's just one sensor. In a factory with a thousand sensors, you're looking at half a billion data points per year for the temperature alone.
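The back-of-the-envelope arithmetic above is easy to check; the snippet below simply redoes the sensor sums (one reading a minute, then a thousand sensors) so the scale is explicit.

# Rough sensor-volume arithmetic from the example above.
# One reading per minute gives 60 * 24 * 365 = 525,600 readings in a 365-day year
# (roughly the 525,950 quoted in the text once leap days are averaged in).
readings_per_sensor_per_year = 60 * 24 * 365
factory_sensors = 1_000
factory_readings_per_year = readings_per_sensor_per_year * factory_sensors

print(f"one sensor:    {readings_per_sensor_per_year:,} readings/year")
print(f"1,000 sensors: {factory_readings_per_year:,} readings/year (about half a billion)")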
2) Velocity
• Velocity refers to the speed at which data is being generated. Staying with our social media example: every day, 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and around 5 billion searches are performed on Google.
• This is like a nuclear data explosion. Big data helps a company absorb this explosion, accept the incoming flow of data, and at the same time process it fast enough that it does not create bottlenecks.
• 250 billion images may seem like a lot, but if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a drop in the bucket in a few months.
• Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day; it has to ingest them all, process them, file them, and somehow, later, be able to retrieve them.
• Here's another example. Say you're running a marketing campaign and you want to know how the folks "out there" are feeling about your brand right now. How would you do it? One way would be to license some Twitter data from Gnip (acquired by Twitter) to grab a constant stream of tweets and subject them to sentiment analysis.
• That feed of Twitter data is often called "the firehose", because so much data is being produced that it feels like being at the business end of a fire hose.
• Here's another velocity example: packet analysis for cyber security. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into the corporate network.

3) Variety
• Variety refers to structured, semi-structured, and unstructured data types; a short illustration of all three appears after this list.
• It can also refer to variety of sources: the influx of data from new sources both inside and outside an organisation, which may be structured, semi-structured, or unstructured.
• Structured data is data organised according to a predefined model, the kind that fits neatly into the rows and columns of a relational database.
• Semi-structured data is data that is only partly organised; it does not follow the traditional tabular data structure. Log files are a typical example of this type of data.
• Unstructured data is data that has not been organised at all. It usually refers to data that does not fit cleanly into a relational database's standard row-and-column structure. Text, pictures, videos, and so on are examples of unstructured data, which cannot be stored as rows and columns.
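A concrete way to see the difference between the three data types is to look at how the same event might arrive in each form. The snippet below is a small, hypothetical illustration; the field names and values are invented for the example.

import json

# Structured: fixed fields, ready for the rows and columns of a relational table.
structured_row = ("2024-05-01", "customer-42", 19.99)       # (date, customer_id, amount)

# Semi-structured: self-describing but flexible, e.g. the JSON found in logs and APIs.
semi_structured = json.loads('{"ts": "2024-05-01T10:15:00", "user": {"id": 42}, "items": ["book"]}')

# Unstructured: free text (or images, audio, video) with no predefined fields at all.
unstructured = "Customer 42 called to say the book arrived late but the refund was handled quickly."

print(structured_row[1], semi_structured["user"]["id"], len(unstructured.split()), "words")

Relational databases handle the first form comfortably; coping with the other two at scale is a large part of what pushes organisations toward big data tooling.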
4) Veracity
• Veracity refers to inconsistencies and uncertainty in the data: available data can become messy, and quality and accuracy are difficult to control.
• Because of the many data dimensions arising from multiple distinct data types and sources, big data is also volatile.
• For example, a large amount of data can cause confusion, while a smaller amount may convey only half the picture or incomplete information.

5) Value
• After the four V's above, there is one more V to consider: value. Most data has no value in itself and is useless to the organization until it is converted into something beneficial.
• Data is of no utility or relevance in and of itself; it must be turned into something useful in order to extract information. As a result, value can be considered the most essential of the five V's.

1.5 Relationship between Data Science and Information Science: The finding of knowledge or actionable information in data is what data science is all about. The design of procedures for storing and retrieving information is known as information science.

Data Science vs Information Science:
Data science and information science are two separate but related fields. Data science is rooted mainly in computer science and mathematics, whereas information science draws on library science, cognitive science, and communications. Data science is used for business tasks such as strategy formation, decision making, and operational processes, and it covers artificial intelligence, analytics, predictive analytics, and algorithm design, among other topics. Information science is employed in domains such as knowledge management, data management, and interaction design.

Definition:
• Data Science – the finding of knowledge or actionable information in data.
• Information Science – the design of procedures for storing and retrieving information.
1.6 Business Intelligence versus Data Science:
Data Science: Data science is a field in which data is mined for information and knowledge using a variety of scientific methods, algorithms, and processes. It can thus be characterized as a collection of mathematical tools, algorithms, statistics, and machine learning techniques that are used to uncover hidden patterns and insights in data to aid decision making. Data science deals with both structured and unstructured data, and it is related to both data mining and big data. Data science means studying historical trends and then using the findings to reshape current trends and forecast future ones.

Business Intelligence: Business intelligence (BI) is a combination of technology, tools, and processes that businesses use to analyse business data. It is mostly used to transform raw data into useful information that can then be used to make business decisions and take profitable actions. It is concerned with the analysis of structured and unstructured data in order to open up new and profitable business opportunities. It favours fact-based decision making over assumption-based decision making, and as a result it has a direct impact on a company's business decisions. Business intelligence tools improve a company's prospects of entering a new market and aid in the analysis of marketing activities.

The following points compare and contrast data science and business intelligence, factor by factor:

Concept:
• Data Science – a discipline that employs mathematics, statistics, and other methods to uncover hidden patterns in data.
• Business Intelligence – a collection of technologies, applications, and processes that businesses employ to analyse business data.
Focus:
• Data Science – centred on the future.
• Business Intelligence – concentrates on both the past and the present.
Data:
• Data Science – can handle both structured and unstructured data.
• Business Intelligence – primarily works with structured data.
Flexibility:
• Data Science – more adaptable, since data sources can be added as needed.
• Business Intelligence – less flexible, because data sources must be planned ahead of time.
Method:
• Data Science – employs the scientific process.
• Business Intelligence – employs the analytic method.
Complexity:
• Data Science – more sophisticated in comparison to business intelligence.
• Business Intelligence – a lot easier when compared to data science.
Expertise:
• Data Science – its area of competence is the data scientist.
• Business Intelligence – its area of specialisation is the business user.
Questions:
• Data Science – addresses the questions of what will happen and what might happen.
• Business Intelligence – is concerned with the question of what occurred.
Tools:
• Data Science – SAS, BigML, MATLAB, Excel, and other programs are among its tools.
• Business Intelligence – InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, and other solutions are among its tools.