
DWH Assignments and Project

Important Note: You can download the data for the project and assignments from https://www.dropbox.com/s/7s1sj1m1hzzmaqe/DATAFILES%20for%20project.rar

Read these pages in their entirety before starting the assignment/project.

You have been provided with the following two items:
1. Data obtained from the source systems.
2. Business questions our DWH needs to answer (as a separate section later in this document).

Assignment I (Due on 10th March 2014)

This phase involves data cleansing and transformations. All the cleansing/transformation work must be done inside the RDBMS (MS SQL Server or MySQL). Please do not underestimate this phase, and start early, as the data given to you has many problems and needs a lot of cleansing work. You need to identify the anomalies in the data and use your creativity and innovation to eliminate them. I am providing you with a brief overview of the methodology you will be following. You may fill in the gaps yourselves. Of course, I will also be there to guide you.

First of all, load all the data arriving from the source into a staging table. The schema for the staging tables will be similar to the source data; the only addition is the tehsil name. The format for the source data in all cases is as follows:
1. District Name
2. Mouza Name
3. Farmer Name, Father Name
4. Area
5. Variety of Crop
6. Sowing Date
7. Visit Date
8. Pest Population 1
9. Pest Population 2
10. Pest Population 3
11. Pest Population 4
12. Pest Population 5
13. Pest Population 6
14. Pest Population 7
15. Pest Population 8
16. Pest Population 9
17. Pest Population 10
18. Pest Population 11
19. Pest Population 12
20. Pesticide Used
21. Pesticide Spray Date
22. Pesticide Dosage
23. CLCV (Disease)
24. Plant Height

Data Profiling and Analytics

Once you have the data in the staging area inside the DBMS, perform data profiling for all the fields. By data profiling I mean the following statistics:
1. No. of unique values (for each column)
2. No. of nulls (for each column)
3. Invalid values (for each column; you have to use your knowledge and understanding)
4. Total no. of farmers
5. Total no. of pests vs. predators
6. Total no. of people who have taken more than one variety in a season
7. Average no. of farmers using a particular pesticide

8. Average no. of farmers in each Mouza

Now run the business questions you have been provided with. Document the results of the profiling and the business questions, along with the SQL you have used for the purpose. Please note that at this stage, both for profiling and business questions, you will have to write separate SQL for each tehsil, since the schema is different for each tehsil and the data has not been integrated yet.

Data Cleansing

After the profiling, you should have identified the anomalies and data cleansing issues. Here I am providing some examples:
1. Separate first and last names, both for the farmer and the father. Standardize the first and last names. (Hint: Find all the unique names in the data and create a lookup table with two columns, i.e. correct_name, variation etc. Use the lookup table to update the name fields with standardized names. Never hardcode names in your SQL.) In some cases, you will find the format as farmerName s/o fatherName s/o. In this case you have to remove the last s/o, as it is a typo.
2. In some cases where you have to find out the plant height, pesticide usage etc., the units may be missing. Identify them. If you are able to remedy them, congrats, you are already thinking the way a DWH developer should.
3. Some dates are not valid.

If you identify and report any additional interesting exceptions/anomalies, you will get EXTRA CREDIT for them.

Data Profiling and Analytics (2nd Time)

Perform data profiling with the cleaned data and document the results. Run the business questions and report the results and the running times. NOTE: This time, both for profiling and business queries, you must be using the same SQL for all the four campuses. However, the queries will be run separately, of course.

Here Phase I is over. Please remember that whatever scripts/SQL you code to do the cleansing or querying must be given in the report.

More Tips
1. The more data quality defects you identify and correct, the more points you get.
2. Be careful with the anomalies in the date values.
3. Do not forget to use the INSERT/SELECT and CREATE TABLE AS SQL constructs. They will be of great help during this phase.
4. You will also have to learn and use identity columns, but on your own.
5. Also consider the use of derived tables. They will help you simplify and cut down your SQL.
6. Use SQL for running the cleansing/transformation SQL. Programs are allowed in C# (Windows or Web Application) only.

Assignment II (Due on 20th March 2014)

Integration

This phase starts with the integration of data. Create one large table by gluing the pest scouting tables together. Now run the business questions and record the results, the time taken and the total space used.

Normalization

Subsequently, normalize the large table into 3NF. Again run the business questions and record the results, the time taken and the total space used. Give a comparative analysis of the flat table approach and the normalized structure for (i) query time, (ii) results and (iii) space used.

Assignment III (Due on 30th March 2014)

Using the normalized tables in 3NF from Phase II, identify which denormalization technique to use such that the maximum number of queries benefit, and quantify that benefit in terms of space and time.

Extra Credit

Make a star schema. Run the business questions; report the running times and results.

Submission

At the end of each phase, you will submit the MS Word documentation. Also, all the required tables should exist in your user spaces. The report name should be like 48_49_50_DWHAssignmentNo1.doc (change the assignment no. for each assignment). Email your assignments to: dr.kamran.teaching@gmail.com. The subject of the email should be the same as the file name.

Business Questions to Answer

Following is the list of business questions we want our DWH to answer. Each question is assigned a frequency of execution (given in brackets). The DWH should be tuned according to the frequency of execution of the queries. By tuning we mean indexing choices and other factors that can affect performance. Obviously, one would not like to improve the performance of the queries that are used infrequently at the cost of those queries that are used frequently.

1. Which group of pesticides is effective against a certain group of pests?
2. What is the effect of predators on pest population?
3. What is the effect of pesticides on the population of pests and predators?
4. Which pests have been dominant in the last X years?
5. Which pesticides are commonly used?
6. What are the major varieties being sown in different agro-ecological zones?
7. What is the effect on pest population as regards the sowing date?
8. What is the ratio of increase in pesticide usage in the last few years?
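As a rough illustration of the profiling statistics (unique values and nulls per column) and of running and timing one business question (question 5), the sketch below uses Python's built-in sqlite3 module as a stand-in for MS SQL Server or MySQL. The table name staging, the column names, the sample rows and the helper functions are all assumptions made for this example; they are not part of the provided data files, and the SQL would need adapting to the actual staging schema of each tehsil.

```python
import sqlite3
import time

# Hypothetical sample rows: (district, farmer, pesticide, pest population 1).
# The real rows come from the course data files, not from this list.
ROWS = [
    ("DistrictX", "Ali", "PesticideA", 3),
    ("DistrictX", "Ali", "PesticideA", 5),
    ("DistrictX", "Bashir", None, 2),
    ("DistrictX", "Chand", "PesticideB", 4),
    ("DistrictX", "Dawood", "PesticideA", None),
]

# Assumed column names; only these may be profiled.
COLUMNS = ("district_name", "farmer_name", "pesticide_used", "pest_population1")


def build_staging(conn):
    """Create and fill a small staging table (assumed schema)."""
    conn.execute(
        "CREATE TABLE staging (district_name TEXT, farmer_name TEXT, "
        "pesticide_used TEXT, pest_population1 INTEGER)"
    )
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", ROWS)


def profile_column(conn, column):
    """Return (number of unique non-null values, number of nulls)."""
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column}")
    sql = (
        f"SELECT COUNT(DISTINCT {column}), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM staging"
    )
    return conn.execute(sql).fetchone()


def commonly_used_pesticides(conn):
    """Question 5: rank pesticides by number of distinct farmers; also time the query."""
    sql = """
        SELECT pesticide_used, COUNT(DISTINCT farmer_name) AS farmers
        FROM staging
        WHERE pesticide_used IS NOT NULL
        GROUP BY pesticide_used
        ORDER BY farmers DESC, pesticide_used
    """
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_staging(connection)
    for name in COLUMNS:
        print(name, profile_column(connection, name))
    print(commonly_used_pesticides(connection))
```

Counting distinct farmers rather than rows is one possible reading of "commonly used"; counting visits or sprayed area would be equally defensible, and the choice should be stated in the report.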

DWH Project

OLAP Tool

Scope:
Many organizations analyze their business-critical data using Online Analytical Processing (OLAP) technology. OLAP-based data mining provides a way to query multidimensional data sets and drill down into the data to find patterns. A cube contains a set of attributes called dimensions, which roughly correspond to database fields, except that they also contain a hierarchical collection of levels. For example, an annual time dimension may be subdivided into levels of quarters, months, and weeks. A cube also contains a collection of measures, which are the actual data values and are typically numeric. For example, a retail cube will allow you to view unit sales (measure) according to store location (dimension) and time of year (dimension).

Tasks:
This project is divided into different deliverables that will give you an insight into the development of OLAP. You will use the data provided to you in the lab project in the relevant deliverables. All the deliverables have different weights and due dates. All due dates are hard, and late submission gets you zero credit in that deliverable. There will be relative grading in all deliverables, so the best one gets the highest grade and all others will be graded relative to it.

Deliverable 1: Cube Generation (Weight: 30%) Due Date: 10/04/2014 before 11:59 p.m.
You will have to generate the aggregates for the pest scouting data. Aggregates can be generated with the help of any program written by you that may use nested queries in nested loops to generate aggregates that are then written to text files, or you can use SQL Server 2000 (or later) Analysis Services to illustrate this process, as discussed in the theory class, or you can use third-party tools or the CUBE clause.

Deliverable 2: Pivot Table Generation (Weight: 30%) Due Date: 17/04/2014 before 11:59 p.m.
You have to generate pivot tables that show the aggregated data. You can use any utility like Microsoft Office Web Components (OWC), RadarCube (a powerful API designed to create true OLAP applications) etc. You have to provide functionality for roll up and drill down, e.g. we can check aggregates on the basis of year as well as on the basis of month.
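The "nested queries in nested loops" option for cube generation, and the year/month roll up and drill down, could look roughly like the sketch below. It is only an illustration under stated assumptions: the table scouting, its columns, the sample rows and the function names are hypothetical, and Python's built-in sqlite3 module stands in for the course RDBMS. One GROUP BY query is issued per subset of the dimensions, and each result (cuboid) is kept in a dictionary instead of being written to a text file.

```python
import sqlite3
from itertools import combinations

# Assumed dimension columns of a hypothetical integrated scouting table.
DIMENSIONS = ("visit_year", "visit_month", "district_name")

# Hypothetical sample rows: (year, month, district, pest population).
SAMPLE = [
    (2013, 7, "DistrictX", 4),
    (2013, 7, "DistrictX", 6),
    (2013, 8, "DistrictX", 2),
    (2014, 7, "DistrictY", 5),
]


def load(conn):
    """Create and fill the hypothetical scouting table."""
    conn.execute(
        "CREATE TABLE scouting (visit_year INTEGER, visit_month INTEGER, "
        "district_name TEXT, pest_population INTEGER)"
    )
    conn.executemany("INSERT INTO scouting VALUES (?, ?, ?, ?)", SAMPLE)


def build_cube(conn):
    """Run one GROUP BY per subset of DIMENSIONS and collect every cuboid.

    Returns {dims: {dimension values: SUM(pest_population)}}.
    """
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            if dims:
                cols = ", ".join(dims)
                sql = (
                    f"SELECT {cols}, SUM(pest_population) "
                    f"FROM scouting GROUP BY {cols}"
                )
            else:
                # Empty subset: the grand total over all rows.
                sql = "SELECT SUM(pest_population) FROM scouting"
            cube[dims] = {tuple(row[:-1]): row[-1] for row in conn.execute(sql)}
    return cube


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load(connection)
    cube = build_cube(connection)
    print("roll up (by year):", cube[("visit_year",)])
    print("drill down (by year and month):", cube[("visit_year", "visit_month")])
```

With the cuboids precomputed, rolling up from month to year is just a lookup in a coarser cuboid. On SQL Server, the CUBE clause mentioned in Deliverable 1 can produce the same groupings in a single statement, which may be preferable once the number of dimensions grows.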

Deliverable 3: Graph Generation (Weight: 30%) Due Date: 24/04/2014 before 11:59 p.m.
You have to create graphs on the aggregates. Graphs can be generated with the help of JavaScript, or you can use any third-party tools. Graphs must be dynamic: as we drill down or roll up, the graphs also change with the data.

Deliverable 4: Report & Integrated Components (Weight: 10%) Due Date: 01/05/2014 11:59 p.m.
In this deliverable you have to finally submit the project with all the components integrated and a proper GUI. You also have to submit a report which includes:
Class diagram.
Variable, type and description.
Function, type and description.
Class description.
Detail Design.
High Level Design.
Report of using the tool on the data entered, with findings, if any.

For your assistance, a sample OLAP Tool is shown below. You can build yours according to your own style.

Extra Credit:
Extra credit will be given to any group that does an innovative task other than those specified above, e.g. you can provide functionality for drill down or roll up by clicking on the graph, or compression of the cube.

Notes:
Plan your work on a daily basis so that you do not miss any deadline.
All deadlines are hard, and there is simply no credit for late submissions.
The groups for the project will be 2 persons per group.

Submission Guidelines:
Submit your work in a zip file named like D1_Roll#_DWH.zip, which corresponds to Deliverable 1 submitted by Group 1.
Mail your work to dr.kamran.teaching@gmail.com.
The subject line of the mail should be: D1_Roll#_DWH2014
Any zip file that contains a virus or is corrupted will not be graded in any case. You can Cc the submission to yourself to see whether the file is delivered fine or not.
Properly follow these guidelines, as they carry marks.

Important
1. Please don't try to cheat. I know all the tactics students use. Cheating will not be tolerated under any circumstances.
2. Do not allow others to copy your work. You are totally responsible for managing the access rights on your database objects so that others do not steal your work.
3. Your work will be evaluated on quality and the extent to which you have utilized the knowledge that has been imparted in the lectures.
4. There will be a kind of relative marking.
5. Use MS Excel charts/graphs for comparison purposes, and for reporting the running times of your queries.
6. Different individuals within a single group may get uneven grades based on their contributions and level of understanding. So stay prepared for any type of evaluation methodology.

Well, guys, that's all. Remember, you can take as much help as you want. Remember, it's not necessary that you complete a hundred percent of the project. The thing is your learning, and I will see how much you have learned.
