The document discusses different decision tree models and techniques. It covers:
1. Decision trees involve splitting decisions into nodes and branches to model relationships between variables. Common techniques include CHAID, CART, and Random Forest.
2. CHAID performs multi-level splits of categorical and continuous data to discover relationships between variables. CART classifies objects and predicts outcomes by selecting important variables.
3. Random Forest involves growing many decision trees and aggregating their predictions to reduce overfitting, while pruning a single tree back to the optimal level of complexity is covered under CART.
Take a decision sitting on a tree

What is a Decision Tree?
I. We arrive at a decision by dividing our decisions into nodes and branches.
II. It is a supervised model.
III. Decision trees can play the roles of both linear and logistic regression, i.e. they can do regression as well as classification (see the sketch below).
IV. We will discuss:
   I. CHAID
   II. CART
   III. Random Forest
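A minimal sketch of point III, assuming the rpart package (an R implementation of CART) and R's built-in iris and mtcars data sets:

## One classification tree and one regression tree with rpart.
library(rpart)

# Classification tree: categorical outcome, analogous to logistic regression
class_fit <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: continuous outcome, analogous to linear regression
reg_fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

print(class_fit)  # text view of the nodes and branches
print(reg_fit)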
Measures of Decision Trees

Gini Impurity Measure (the standard two-class form, for a node with class proportions p_B and p_G):

Gini = 1 - p_B^2 - p_G^2

Measures of Decision Trees (Contd.)

Deviance:

-2 (n_B log p_B + n_G log p_G)

where n_B and n_G are the class counts in the node and p_B and p_G the corresponding proportions.
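A small sketch computing both measures from the formulas above, for a hypothetical node containing 30 observations of one class and 70 of the other:

## Gini impurity and deviance for a binary node.
node_measures <- function(n_B, n_G) {
  n   <- n_B + n_G
  p_B <- n_B / n
  p_G <- n_G / n
  c(gini     = 1 - p_B^2 - p_G^2,
    deviance = -2 * (n_B * log(p_B) + n_G * log(p_G)))
}

node_measures(30, 70)  # gini = 0.42; a pure node would need the usual
                       # convention 0 * log(0) = 0 to avoid NaN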
CHAID (Chi-square Automatic Interaction Detector)

• Used to discover the relationship between variables.
• Performs multi-level splits.
• Nominal, ordinal, and continuous data can be used; continuous predictors are split into categories with approximately equal numbers of observations.
• Creates all possible cross tabulations for each categorical predictor until the best outcome is achieved and no further splitting can be performed.
• Well suited for large data sets.
• Commonly used for marketing segmentation.
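A hedged sketch of a CHAID fit, assuming the CHAID package (hosted on R-Forge rather than CRAN); it requires every variable to be a factor, so continuous predictors are binned first:

## install.packages("CHAID", repos = "http://R-Forge.R-project.org")
library(CHAID)

seg <- iris
seg$Sepal.Length <- cut(seg$Sepal.Length, breaks = 3)  # bin continuous predictors
seg$Sepal.Width  <- cut(seg$Sepal.Width,  breaks = 3)

fit <- chaid(Species ~ Sepal.Length + Sepal.Width, data = seg,
             control = chaid_control(minsplit = 20))
plot(fit)  # multi-level (multi-way) splits, unlike CART's binary splits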
CHAID (Contd.)

• We can visually see the relationships between the split variables and the associated factors within the tree.
• The development of the decision, or classification, tree starts with identifying the target (dependent) variable, which is considered the root.
• The target is split into two or more categories, called the initial or parent nodes, and the nodes are then split into child nodes using statistical algorithms.
Benefit of CHAID

• Unlike regression analysis, the CHAID technique does not require the data to be normally distributed.
Merging in CHAID

• If the dependent variable is continuous, the F test is used. The F test checks whether the variances of two groups are equal; in R it is done with the var.test command (see the sketch after this list).
• If the dependent variable is categorical, the chi-square test is used.
• Each pair of predictor categories is assessed to determine which pair is least significantly different with respect to the dependent variable, and such pairs are merged.
• Because of these merging steps, a Bonferroni-adjusted p-value is calculated for the merged cross tabulation.
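A sketch of the base-R tests the merging step relies on; the two simulated samples and the example p-values are hypothetical:

## F test, chi-square test and Bonferroni adjustment in base R.
set.seed(1)
x <- rnorm(50, sd = 1.0)  # dependent variable within one predictor category
y <- rnorm(50, sd = 1.4)  # ... within another category

var.test(x, y)            # F test: are the two variances equal?

tab <- table(mtcars$cyl, mtcars$am)
chisq.test(tab)           # chi-square test on a cross tabulation

p.adjust(c(0.01, 0.04, 0.20), method = "bonferroni")  # adjust merge p-values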
CHAID Components

• Root Node: contains the dependent, or target, variable.
• Parent Node: the algorithm splits the target variable into two or more categories; these categories are called parent, or initial, nodes.
• Child Node: independent-variable categories that come below the parent categories in the CHAID analysis tree.
• Terminal Node: the last categories of the CHAID analysis tree. The category that is a major influence on the dependent variable comes first and the less important categories come last; hence the name terminal node.
CART (Classification and Regression Trees)

• Classifies objects and predicts outcomes by selecting, from a large number of variables, the ones most important in determining the outcome variable.
• CART analysis is a form of binary recursive partitioning (see the sketch below).
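A sketch of both points, assuming rpart and its bundled kyphosis data set: every split is binary, and the fitted object ranks the variables by importance.

## Binary recursive partitioning plus variable importance with rpart.
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
print(fit)               # each internal node splits into exactly two branches
fit$variable.importance  # predictors ranked by their contribution to the splits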
Strengths of CART

• No distributional assumptions are required.
• No assumption of homogeneity.
• The explanatory variables can be a mixture of categorical, interval, and continuous.
• Especially good for high-dimensional and large data sets; produces useful results from a few important variables.
• Unaffected by outliers, collinearities, or heteroscedasticity.
Weakness of CART

• Not based on a probabilistic model, so it comes with no confidence interval.
The best part of CART

• Sophisticated methods of dealing with missing values.
• CART does not drop cases with missing values.
• It follows the concept of SURROGATE SPLITS (see the sketch below):
  – Define a measure of similarity between two splits.
  – If the best split s is on some variable, find the split s' on each other variable that is most similar to s; then the second most similar, and so on.
  – If a case is missing the split variable, it is sent down the tree using the surrogate.
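A sketch of surrogate splits, assuming rpart (whose control parameters maxsurrogate and usesurrogate govern this behaviour) and R's built-in airquality data, which contains missing values:

## Surrogate splits in rpart: cases with a missing split variable are
## routed using the most similar split on another variable.
library(rpart)

fit <- rpart(Ozone ~ ., data = airquality,
             control = rpart.control(maxsurrogate = 5,  # keep up to 5 surrogates
                                     usesurrogate = 2)) # use surrogates, then the
                                                        # majority direction
summary(fit)  # lists the primary and surrogate splits at each node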
Steps of Tree Building

1. Start by splitting a variable at all of its split points; the sample splits into two binary nodes at each split point.
2. Select the best split of the variable in terms of the reduction in impurity.
3. Repeat steps 1 and 2 for all variables at the root node.
4. Rank all of the best splits and select the variable that achieves the highest purity at the root.
Steps of Tree Building (Contd.)

5. Assign classes to the nodes according to the rule that minimises misclassification errors.
6. Repeat steps 1-5 for each non-terminal node.
7. Grow a very large tree Tmax until all terminal nodes are either small, pure, or contain identical measurement vectors.
8. Prune and choose the final tree using cross validation (see the sketch below).
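A sketch of steps 7 and 8, assuming rpart: cp = 0 lets the tree grow to something close to Tmax, and the built-in cross-validation table is used to choose the final tree.

## Grow a very large tree, then choose the final tree by cross validation.
library(rpart)

big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

printcp(big)  # cross-validated error (xerror) for each subtree size
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
final   <- prune(big, cp = best_cp)  # the tree chosen by cross validation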
What happens in pruning??

• CART lets the tree grow to its full extent, then prunes it back.
• The idea is to find the point at which the validation error begins to rise.
• Successively smaller trees are generated by pruning leaves.
• At each pruning stage, multiple trees are possible.
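A small visual companion to the previous sketch, again assuming rpart: plotcp charts the cross-validated error against tree size, so the point where the validation error begins to rise can be read off directly.

## Where does the validation error begin to rise?
library(rpart)

big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))
big$cptable  # the nested sequence of successively smaller pruned trees
plotcp(big)  # pick the cp value just before xerror starts to climb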
What do I do before PRUNING?

• We will cover the pruning concept going ahead when