Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Deriving insights from Data The Right way

Colin Shearer once remarked Find me something interesting in my data is a question from hell. A lot of literature is being published today about Big Data and Predictive analytics. In this mad rush of finding insights from data, many times organization forget the basic paradigm Analysis should be guided by business goals. Huh, enough of Gyan!! Let me walk the talk. In this blog, I wish to demonstrate a simple exercise of deriving value from the data for an Automobile business. The intended audience could be any one ,who brain cells starts to play soccer on hearing words like Big Data, R,Analytics,Insights, in fact anything which makes sense out of data. Nerds, Geeks and Dodo like me, all can benefit from this blog. By the time you reach the end of blog and manage to be awake, here is what you will learn:

First thing first Be clear of what do you want to dig from the Data mine? Identify the attributes(independent), which are useful and how do they relate to the Objective of your analysis Once the above point is answered, get a sense of the data What is the type of data attributes, nature, sample values etc Is the data workable are there duplicate values, missing values? If so clean the data Once the data is cleansed, Try to fit a model between the independent variables(means) and the dependent variable( the objective or the end)

Validate the model, make it accurate so that the values predicted are close to actual data; Run Diagnostics to validate that model is well-built. You may use a smaller test data( subset of the actual data set) set to validate your model, before applying the model on the actual dataset

Automobile industry in India is going through a down turn due to sluggish economy. Pricing a new car in todays scenario is one of the most challenging problems for any car manufacturer. So, the Car manufacturer has collected some the data and wishes to use it to predict the price of new car model or may be, perform a price correction of the existing models, which have lower sales, possibly due to pricing issue. Problem statement: Given the available data, can predictive analytics be used to establish relationships between how the various features of car impact the Car price and more importantly how strong is this relationship? A snap shot of the Automobile price and various attributes data is shown below:

Data Dictionary of the Automobile data:


Price: Suggested retail price of the car Mileage: Number of miles the car has been driven Make: Manufacturer of the car such as Saturn, Pontiac, and Chevrolet Model: Specific models for each car manufacturer such as Ion, Vibe, Cavalier Trim (of car): Specific type of car model such as SE Sedan 4D, Quad Coupe 2D Type: Body type such as sedan, coupe, etc. Cylinder: Number of cylinders in the engine

Liter: A more specific measure of engine size Doors: Number of doors Cruise: Indicator variable representing whether the car has cruise control (1 = cruise) Sound: Indicator variable representing whether the car has upgraded speakers (1 = upgraded) Leather: Indicator variable representing whether the car has leather seats (1 = leather)

I have used here open source platform Language R to perform the analysis. I would be happy to share the source code.(mail me @ shrivastav.gaurav@hotmail.com) Data Exploration: So the first step is to get a sense of how data looks. Here is snapshot of the data:

This is how data looks like. There are 804 rows of data with various columns like Price, Mileage, Make and so on. Make has sample value like Buick and so on.

This does not gives a clear picture of how Mileage governs the Car Price, but a dense region in the graph is indicative of the fact that several Cars have close Price~Mileage relationship. In fact the scatter in

some way suggests that apart from mileage, there are some other related variables/attributes ( like number of Cylinders, Size of engine in Liters and so on), which also influence the price of the Car.

The above graph is indicative of increasing price with increase in number of Cylinders ( With 4 cylinders the price range is from 1000 to 4000, with 6 cylinders the Price range of cars is from 2000 to 4500 and so on). This below graph, is again suggestive of increase in Price with increase in the Engine size (in Liter).

Now connecting the dots, I have just taken a few attributes from the data set to see their relationship with Price and also how they are interrelated.

A Data Scientists needs to similar Data exploration to better understand the nuisance of the data set before getting in to further analysis. Here we added a few more dummy columns to the data set. These dummy columns are required as there are a few attributes which are Boolean in nature or are discrete with numeric values ( 1,2,3..). Using such attributes for further analysis, requires transforming them in to dummy variables to so that 0, 1, 2 can be converted in to discrete values, which if not done, will led to bias in the results. Mathematically 1 is greater than 0, and 2 is greater than 1, but for our analysis case both 0 , 1 or 2 are just discrete cases where 1 does not have more significance than0; they are just different values of an attribute like Apples and oranges in case of fruits and not small Apple, Big Apple and even Bigger Apple . Data Cleansing: Fortunately there were not any missing values in the data, otherwise the missing values have to be plugged in one of the many methods. Easiest is to ignore the data tuples, which have missing values or using average of the remaining values or some more scientific method based on the need. If there are any duplicate records, even they have to be removed. Fitting the Model: Looking at the data, the Price and Mileage are numeric values of high order, where as all other attributes like Model, trim, type etc also have discrete numeric values, but not of same order as Price and Mileage. Hence instead of linear model, a Logarithmic model will be a better fit.

Here is the sample code in R Language to generate the model.

The result seems to show some correlation between the various attributes, which have some impact on the Car price.

R-Squared value close to 1 is indicative of the accuracy of the model. However at 95% significance level, we see only a few variable (attributes like Mileage, Cylinder and so on) to be significant. So we have to revisit the model.

So the model can be described as: Price = Function ( Mileage, Cylinder, Doors and some specific make like Buick, Type like Convertible, Hatchback and so on) Now the result indicates that we have a better model, with all the attributes having significant (even greater than at 95% Significance Level) impact on the car Price.

Model Validation: To test our model on how well it can predict the Price of the car, we plot the Actual price of the care (available in the Data set) against the Predicted price (derived from our model).The Green line is the Actual Price and the Red line is the Predicted Price.

The model seems to be doing good job in predicting the price!! There are some other tests which are performed to check the soundness of the model. I am just putting here just a few of them for the readers benefit ( to avoid headache;) )

As seen in graph below, the distribution above and below the line is quite the same, implying that the model is free from Hetroscedasticity (Tongue twister indeed!!) Error Term variance is constant.

The below graph checks for Normality, Independence, Linearity and Homoscedasticity [ again a tongue twister ] I do not intend to go in to depth of these graphs as is the subject is quite dry and honestly speaking, I too am learning the tricks of this trade!! Last but not the least, if these test run fine and give good result on the smaller test set, you may run this on the much bigger actual data set to realize the outcome of your model. Nevertheless, for those who have started to snooze, its time to say Enjoy your Sleep!

You might also like