SESSION 3: SIMPLIFYING DATA ACQUISITION

In this session:
• Significance of Data
• Type of Data used in AI Projects
• Data Acquisition

3.1 INTRODUCTION

The second stage of the AI Project Cycle, after Problem Scoping, is Data Acquisition, where the data required for the project is acquired in specific forms and formats. To work on a problem and produce outcomes, an AI project must be fed correct data in the right form. In this session, we shall discuss Data Acquisition for an AI project.

3.2 SIGNIFICANCE OF DATA

An AI project is an (artificially) intelligent project that is capable of making decisions or performing some intelligent tasks. Data plays a crucial role in making an AI project behave intelligently, because the AI project is trained using data to behave in a specific way.

To build an AI system, you need to source large amounts of data and create datasets for training, testing and evaluation, and then deployment of the AI project. This process is repeated through several rounds of training, testing and evaluation until the desired outcome is achieved, and data plays an important role at each step.

• For Training, previously existing data with specific outcomes is fed into an AI system and the system is trained using this data.
  Existing/old data: Certificates, charts, files, historical data etc. are often referred to as existing/old data.

• Then Evaluation takes place, where validation data, which is new data, is fed to evaluate the working of the system. Validation data provides the first test against unseen data, to check how well the AI model makes predictions/decisions based on the new data.
  New data: Freshly collected data for the purpose of evaluation is called the new data.
• Then Testing happens, when the AI system is fed with some data whose outcomes are known beforehand, and the outcomes produced by the AI system are compared with the expected outcomes to test whether the system is working efficiently or not. Testing data once again validates that the developed AI model can make accurate predictions/decisions.

Data is an integral part of an AI project and plays an important role in its training, testing and evaluation until it starts producing the desired outcome.

ARTIFICIAL INTELLIGENCE-X

For example, if you are developing an AI project that predicts the relationship between training days and medals won, you need to have access to past data that consists of the information Days Trained and Medals Won. It should be a lot of data, using which the AI project will be trained first and, with repeated training, the AI project will start predicting the outcome.

If this data is not available, or is too little in quantity, or is incorrect or invalid, the project won't be able to predict correctly. For an AI project to work efficiently, it is important that authentic and ample data is made available. In short, the data should be quality data.

3.2.1 Quality Data Characteristics

As data is crucial for the success of any AI project, it is important to ensure that it is quality data. Quality data has these characteristics:

• Relevance (i.e., Do you really need this information?)
• Accuracy (i.e., Is the information correct in every detail?)
• Completeness (i.e., How full (comprehensive) is the information?)
• Timeliness (i.e., How up-to-date is the information?)
• Reliability (i.e., Does the information contradict other trusted resources?)
• Validity (i.e., Is the information compliant with requirements?)

The following figure lists the characteristics of quality data:

[Figure: Characteristics of quality data]
Accuracy: Is the data accurate as per timeliness and real data?
Relevance: Is it relevant to the intended purpose and is it really needed?
Completeness: Is the data comprehensive? Has invalid and incomplete data been removed? Has missing data been handled properly?
Timeliness: Has the data been collected quickly after the event and frequently enough? Is it up-to-date?
Validity: Is the data compliant with requirements?

3.3 TYPE OF DATA USED IN AI PROJECTS

Artificial Intelligence (AI) projects are required to process the vast amounts of data produced as a result of the increase of Internet-based technologies in areas such as stock exchanges and financial services, industry and manufacturing, telecommunications and transport, healthcare, academia and so forth.

Data of AI systems broadly belongs to one of the following two categories:

(i) Structured Data
Structured data has a purposely designed, pre-defined structure as per some data model, such as simple 2D spreadsheet arrays or complex relational databases. Structured data has well-defined relationships among its elements. For example, in an AI system for determining the relationship between inflation rate and average saving, the data (inflation rate and average saving) has a predefined structure.

(ii) Unstructured Data
Unstructured data is data that is not organised according to any pre-existing data model. Unstructured data is unprocessed and is often generated by machine-led systems, for example, social media posts, surveillance camera footage or satellite imagery. Unstructured data can have its own internal structure, which may not fit in some well-defined format. For example, in an AI system for analysing the most popular social media posts, the data (social-media-post) does not have a predefined structure; it can be text or video or a link or an image or even some other undefined structure.

Table 3.1 lists the differences between structured and unstructured data.

Table 3.1: Structured vs. Unstructured Data

| S.No. | Structured Data | Unstructured Data |
| 1. | It conforms to some existing data model. | It does not conform to any pre-existing data model. |
| 2. | It has well-defined relationships among its elements. | It has undefined relationships among its elements. |
| 3. | It does not require extra pre-processing before being analysed or searched. | It requires more pre-processing before being analysed or searched. |

Note: More sophisticated AI systems are required to extract meaningful insights from unstructured data. Unsupervised learning is one technique used to gain insight in this area, whereby patterns and relations are identified in unlabelled and unstructured input data.

Data Features

Both structured and unstructured data have certain data features. Data features refer to the type of data you want to collect. For example, for the AI system used for predicting average saving, the data features could be salary/income, inflation rate, average spending, average saving etc. Similarly, for an AI system analysing social media posts, the data features required would be social-media-post, platform, time-posted etc.

UNIT II: AI Project Cycle

3.4 DATA ACQUISITION

Data acquisition begins when you acquire the required data in quality form. Before actual data acquisition happens, some preparation is required, in which the following questions are answered:

- What are the data features needed?
- Where can you get the data?
- How frequently do you have to collect the data?
- What happens if you don't have enough data?
- What kind of analysis needs to be done?
- How will it be validated?
- How does the analysis inform the action?

You have already read and understood these questions in your previous class. In case you need to reread them, you can refer to Session 4, Unit 2 (Part B) - Problem Scoping - Explore and Understand Data, Class 9 AI book by Sumita Arora.

3.4.1 Identifying Data Requirements

After answering all the questions mentioned above, you can finalise the data requirement by doing the following:

(i) Group together the relevant data features in logically related structures.
(ii) Be clear about the relationship of data in and outside the logical data structure.
(iii) Use consistent and standardised terminology and format.

3.4.2 Finding Reliable Data Sources

There can be many different sources from which data may be collected. The most commonly used data sources are discussed below.

1. Interview
It is one of the most effective sources of data gathering. In this method, an analyst talks to the users and clients who know about the system, its functions and flaws.
Interview: An interview refers to a one-on-one conversation between an analyst and the users and clients to find out about the systems, its functions, shortcomings and flaws.

2. Survey
In surveys, first the goal of the survey is ascertained and thereafter the questionnaires are drafted accordingly. Then, using these, the responses of all users and stakeholders are documented.

3. Observation
Under the observation method, the responsible person observes the team in a real working environment, gets clues about the required data and its form, and subsequently documents the observation.

4. Application Programming Interface (API)
API is a specialised technique in which a specific type of data is collected through the use of a programming interface. For example, using a social media program's interface, data such as people's most preferred game, most liked post, most used time etc. may be gathered.
API: An API refers to Application Programming Interface that works behind a popular software program or game to collect a specific type of data pertaining to users' way of using that program.

5. Web Scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. A web scraper is a specialised tool designed to carry out web scraping.
Web Scraping: Web scraping refers to a data collection technique using a tool called a web scraper that extracts data from websites.

6. Sensors
Sensors or electronic sensors can measure various parameters such as weather, humidity, body temperature, blood pressure, heart beat, weight and many more. For instance, modern medical diagnosis and wearables like Fitbit and Apple Watch make good use of sensors. The Internet of Things (IoT) cannot function without sensors.
Sensors: Sensors are mini devices that can collect data about an environment or a body or a specific task.

7. Cameras
Cameras, because of their video recording and image capturing features, have proven to be good data collection tools in various situations such as traffic rules violations and automatic detection of flaws in the design and outlook of products, places, buildings etc.
Cameras: The method of data collection using cameras is a way to collect data graphically or in video form about the look, design or action as per the requirements.

8. The Internet
Searching the Internet for data as per one's requirements is a commonly used technique. However, you should not take data directly from the Internet, for the following two reasons:

(i) The data might not be authentic, as its accuracy cannot be proved. Studies have shown that more than half of the data on the Internet comes from unreliable sources or is inaccurate.
(ii) Even if the data is reliable, it cannot be directly taken if it is copyright protected, because of IPR (Intellectual Property Rights).

Thus, you can take data from the Internet only after ensuring the following two things:

(i) The source of data is authentic and reliable.
(ii) The data has been licensed for public use through licenses like Creative Commons, Copyleft and other open-source licenses, or through personally obtained permission from the copyright owner.

There are some open-sourced websites hosted by the government, which host various types of data and information (such as data.gov.in, india.gov.in, mospi.nic.in/time-use-, censusindia.gov.in and many others). You can use such websites for taking data if it meets your requirements.

There are two terms based on who collects the data: Primary data and Secondary data.

Primary data is the type that you gather by yourself. It means you are actively involved in the sourcing of information.

Secondary data is all around us. It is easily accessible on the Internet and requires fewer resources to gather, unlike primary data. In this case, the original collection of the data has been done by someone else before it was uploaded to the Internet. Secondary data often comes in the form of search results.

3.4.3 Acquiring Data

After identifying the data requirements, the required data features, and appropriate and reliable data sources, the data is finally collected in the required form. That is, in data acquisition, data is understood, gathered, filtered, cleaned and finally stored in a data storage system.

Data Acquisition: Understanding, gathering, filtering and cleaning data as per the requirement of the system, so as to train it with the collected data.

The purpose of data acquisition is to collect the relevant, authentic, quality data required to train the AI system.

Activity
LIST DOWN THE WAYS OF ACQUIRING DATA FOR YOUR AI PROJECT (Face Mask Detection)

Carefully go through the problem statement of your AI project and its required data features. Now, using the methods of collecting data discussed above, find out the most suitable ways of collecting data as per your chosen AI project.
Solution
As our AI project is Face Mask Detection, which will capture the images of moving people and identify those who are without masks, we need to collect lots of images of:

• People wearing masks
• People without masks
• People against complex backgrounds
• People having different colour and face structure
• People of all ages
• People of all genders
• People of various cultural backgrounds, and similar

The Internet can be a good source of such images, and so can the government's public domain databases.

Check Point

1. To build an AI system, you need to source large amounts of data and create datasets for ____.
   (a) Training (b) Testing (c) Evaluation (d) All of these
2. ____ is crucial for the success of any AI project and is central to all activities and phases.
3. Which of the following is not a characteristic of quality data?
   (a) Accuracy
4. Data having a predefined structure and well-defined relationships among its elements is called ____ data.
   (a) Complete (b) Structured (c) Unstructured (d) Valid
5. Data not fitting in any specific format and having undefined relationships is called ____ data.
   (a) Complete (b) Structured (c) Unstructured (d) Valid
6. The type of data, having a name, collected for an AI project is called ____.
   (a) Relationship (b) Data Feature (c) Structured Data (d) Unstructured Data
7. During Data Acquisition, feeding previous data into the machine is called ____. [CBSE 2021-22 (Term 1)]
   (a) Training Data (b) Predicting Data (c) Testing Data (d) Evaluating Data
8. ____ involves collecting data from various authentic sources such as reliable websites, observations, surveys. [CBSE 2021-22 (Term 1)]
   (a) Data Acquisition (b) Data Evaluation (c) Data Testing (d) Data Modelling
9. In ____, unseen validation data is used to check how well the AI model makes predictions/decisions.
   (a) Testing (b) Training (c) Evaluation (d) Deployment
10. In ____, data with known outcomes is used to validate that the AI model can make accurate predictions/decisions.
   (a) Testing (b) Training (c) Evaluation (d) Deployment
11. The ____ data is collected by the users themselves and not sourced from other sources like the Internet.
   (a) Primary (b) Secondary (c) Structured (d) Unstructured
12. The ____ data is sourced from other sources like the Internet and not collected by the users themselves.
   (a) Primary (b) Secondary (c) Structured (d) Unstructured
13. ____ is a process of using automated bots to crawl through the Internet and extract data.
   (a) Interviews (b) Web scraping (c) Surveys (d) Sensors
14. ____ help you gather data directly from the mouth of people who reply to some open-ended questions.
   (a) Interviews (b) Web scraping (c) Surveys (d) Sensors
15. A ____ uses a set of standardised questions surrounding a specific topic to ask people about their opinions, attitudes, or behaviour towards that topic.
   (a) Interview (b) Web scraping (c) Survey (d) Sensor
16. The ____ are electronic devices used to collect data which can be measured by devices.
   (a) Interviews (b) Web scraping (c) Surveys (d) Sensors
17. To collect images or video data, which of the following methods of data collection will be appropriate?
   (a) Sensors (b) Internet (c) Cameras (d) All of these

Competency Based Questions

18. You have been given a task to develop an AI system that allows the owner of a vehicle to get timely information about the services done. Based on the input received, the AI system should detect what other parameters need service, when the next service is due, what part needs replacement, and so on. The data related to vehicle parameters and their need of service is to be collected from vehicle drivers and vehicle workshop staff. Which of the following would be an appropriate way of collecting data?
   (a) Questionnaire (b) Interview (c) Sensor (d) Internet
19. Latika has successfully started her start-up that works with many village women and sells handmade products online. Many other websites are also doing similar business. Latika monitors the prices of competitor websites regularly. To check the pricing of their competitors' products and services, ____ is the best way to collect price data.
   (a) Questionnaire (b) Internet (c) Web Scraping (d) Survey
20. Maryam belongs to a family of farmers. To help her family, she wants to develop a device/system that will help in detecting and targeting weeds. When touched over crops, it can detect weeds by texture and size. This will help to prevent over-application of herbicides, which will eventually prevent high levels of toxins in food. For this, what will be the best source of data so as to identify weeds from crops?
   (a) Internet (b) Web Scraping (c) Camera (d) Sensors

LET US REVISE

• Quality data has these characteristics: accuracy, relevance, reliability, timeliness and completeness.
• The data used for AI systems can be structured as well as unstructured data.
• Structured data conforms to some predefined data model and has well-defined relationships among its elements.
• Unstructured data does not conform to any pre-existing data model and has undefined relationships among its elements.
• Data Acquisition refers to processes, methods or systems that are used to collect information related to a certain theme or objective, to document or analyse some phenomenon.
• Data must be gathered from reliable sources.
• The most commonly used data sources are Interview, Survey, Observation, API, Web Scraping, Sensors, Cameras, the Internet, Problem reports etc.
• An interview refers to a one-on-one conversation between an analyst and the users and clients to find out about the systems, its functions, shortcomings and flaws.
• A survey refers to a study of the opinions, responses etc. of a group of stakeholders.
• The observation method refers to human or mechanical observation.
• An API refers to Application Programming Interface that works behind a popular software program or game to collect a specific type of data pertaining to users' way of using that program.
• Web scraping refers to a data collection technique using a tool called a web scraper that extracts data from websites.
• Sensors are mini devices that can collect data about an environment or a body or a specific task.
• The method of data collection using cameras is a way to collect data graphically or in video form about the look, design or action as per the requirements.

Solution Time

1. What is a dataset? What is its other name?
Ans. A dataset is a collection of data objects related to a common theme or project. A dataset is also known as a database.

2. What are data features? Give examples.
Ans. A data feature is an individual measurable property or characteristic of a data object being recorded or stored, e.g., colour, mileage and power can be considered as features of a car.

3. Describe the meaning of these data characteristics: Accuracy, Completeness, Reliability, Relevance, and Timeliness.
Ans.
| Characteristic | How it's measured |
| Accuracy | Is the information correct in every detail? |
| Completeness | How comprehensive is the information? |
| Reliability | Does the information contradict other trusted resources? |
| Relevance | Do you really need this information? |
| Timeliness | How up-to-date is the information? |

4. What is Data Acquisition?
Ans. Data Acquisition refers to processes, methods or systems that are used to collect information related to a certain theme or objective, to document or analyse some phenomenon.

5. List some commonly used sources of data collection.
Ans. The most commonly used data sources are Interview, Survey, Observation, API, Web Scraping, Sensors, Cameras, the Internet, Problem reports etc.
6. List some examples of situations where these data sources would be the most suitable:
(i) Sensors (ii) Cameras (iii) Interview (iv) Problem Reports (v) Survey
Ans.
(i) For situations where environmental or human body parameters are to be measured, such as the temperature of a furnace, the level of a tank, body temperature, blood pressure or heart-beat of humans, and so on.
(ii) For situations where images or video inputs are required to register data, such as traffic violations, correctness of design of a product etc.
(iii) For situations where personal knowledge, information or experience of stakeholders is required, such as the taste of a food item or the experience after trying an adventure sport etc.
(iv) For situations where authentic problems are to be registered, such as the problems in electronic gadgets or the causes of diseases in the human body etc.
(v) For situations where responses to a set of questions are required from stakeholders. For example, before installing a fountain in the local park of a residents' colony, a survey may be conducted among its residents to know about their choices.

GLOSSARY

Dataset: A collection of data objects related to a common situation, theme or topic
Database: A data set
Data Feature: An individual measurable property or characteristic of a data object being recorded or stored
Web Scraping: A data collection technique using a tool that extracts data from websites
Data Acquisition: Processes, methods or systems used to collect information related to a theme or objective
Survey: A study of the opinions, responses, etc. of a group of stakeholders
Interview: A one-on-one conversation between an analyst and the users and clients to find out about the systems, its functions, shortcomings and flaws
Structured Data: Data having a predefined structure and well-defined relationships among its elements
Unstructured Data: Data without a structure and with undefined relationships among its elements

Assignment

1. How do you determine data features for your data?
2. What are the qualities of data?
3. How do you choose sources of data?
4. What is Data Acquisition? Why is it important?
5. Why should data be acquired from reliable sources?
6. List some of the most commonly used data sources.
7. Discuss briefly the following methods of data collection:
   (a) Interview (b) Survey (c) Observation (d) API (e) Web scraping (f) Sensors (g) Cameras (h) The Internet (i) Problem reports
8. Give examples of situations where the following ways of collecting data would be the most suitable:
   (a) Interview (b) Survey (c) Observation (d) API (e) Web scraping (f) Sensors (g) Cameras (h) The Internet (i) Problem reports
9. Why should the Internet be avoided as a data collection source?
10. What must be ensured before taking data from the Internet?
11. List some government sites that can be used for data collection.
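
For readers who want to try the ideas of this session on a computer, the filtering and cleaning step of Section 3.4.3 and the training, evaluation and testing datasets of Section 3.2 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records for the Days Trained / Medals Won example are made up, the rule for a usable record (both features present and not negative) is hypothetical, and the 60/20/20 split is one common choice, not something prescribed in this session.

```python
import random

# Hypothetical records for the Days Trained / Medals Won example.
# None marks a missing value; a negative number is treated as invalid.
records = [
    {"days_trained": 120, "medals_won": 3},
    {"days_trained": 90, "medals_won": 2},
    {"days_trained": 200, "medals_won": 5},
    {"days_trained": None, "medals_won": 1},   # incomplete record
    {"days_trained": 60, "medals_won": 1},
    {"days_trained": 150, "medals_won": 4},
    {"days_trained": 30, "medals_won": 0},
    {"days_trained": 75, "medals_won": -2},    # invalid record
    {"days_trained": 180, "medals_won": 4},
    {"days_trained": 45, "medals_won": 1},
    {"days_trained": 110, "medals_won": 3},
    {"days_trained": 95, "medals_won": 2},
]


def is_quality_record(record):
    """Keep a record only if both features are present and not negative."""
    for feature in ("days_trained", "medals_won"):
        value = record.get(feature)
        if value is None or value < 0:
            return False
    return True


def split_dataset(rows, train_percent=60, validation_percent=20, seed=42):
    """Shuffle a copy of the rows and cut it into three datasets.

    The remaining rows (here 20 percent) become the testing dataset.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    train_end = len(shuffled) * train_percent // 100
    validation_end = train_end + len(shuffled) * validation_percent // 100
    training = shuffled[:train_end]
    validation = shuffled[train_end:validation_end]
    testing = shuffled[validation_end:]
    return training, validation, testing


# Filtering and cleaning: drop incomplete and invalid records.
clean_records = [r for r in records if is_quality_record(r)]

# Create the datasets for training, evaluation (validation) and testing.
training, validation, testing = split_dataset(clean_records)

print(len(clean_records), len(training), len(validation), len(testing))
# prints: 10 6 2 2
```

Shuffling before splitting is a design choice in this sketch: it keeps the order in which the data was collected from deciding which records land in which dataset, and the fixed seed makes the split repeatable.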
