Hierarchical Clustering As a Novel Solution to the Notorious Multicollinearity Problem in Observational Causal Inference

Yufei Wu, Zhiying Gu, Alex Deng
Airbnb, Inc.
yufei.wu@airbnb.com, zhiying.gu@airbnb.com, alex.deng@airbnb.com
the issue of multicollinearity, we propose a novel approach that leverages the variations in correlation patterns across different geographic locations. The objective is to restructure the data in a way that significantly reduces the multicollinearity problem. Our proposed method involves utilizing hierarchical clustering to group geographic locations based on their correlation patterns. The key aspect of this approach lies in defining the distance metric used in the clustering algorithm.

In our methodology, we define the distance between two DMAs as the sum of channel-specific distances. Each channel-specific distance measures the similarity in the cross-channel correlation between the two geographic locations. By incorporating this distance metric into the hierarchical clustering process, we can effectively group the DMAs in a manner that minimizes multicollinearity across channels. This innovative approach allows us to transform the data structure, mitigating the challenges posed by multicollinearity and providing a more robust foundation for further analysis. By adopting this methodology, we can improve our understanding of the causal relationships between channels and accurately attribute their impact on sales or other relevant outcomes. We will demonstrate this improvement with both descriptive evidence and regression results.

The remainder of this paper is organized as follows. In Section 2, we describe the Marketing Mix Modelling (MMM) problem formulation and the related work. In Section 3, we present the data properties that motivate our approach to reduce multicollinearity. In Section 4, we introduce hierarchical clustering and the distance metric designed specifically to address the multicollinearity problem. In Section 5, we show how this novel method improves results, and in Section 6, we briefly discuss other applications that can utilize this methodology.

2 BAYESIAN STRUCTURAL MODEL FORMULATION

This paper focuses on providing an innovative and intuitive solution to the notorious multicollinearity problem. To demonstrate the effectiveness of our approach, we apply it to a specific application, Bayesian Marketing Mix Modeling (MMM), built upon Jin et al. [10]. MMM is a widely applied method in the industry for estimating the performance of various marketing channels in a holistic manner. It takes into account factors such as seasonality, trend (representing organic demand), and the mix of different marketing channels when forecasting sales.¹

2.1 Marketing Mix Model Setup

We model sales as a non-linear function of seasonality and the advertisement impressions of each channel with a Bayesian model. Let $g$ denote a DMA and $t = 1, \ldots, T$ denote time (we use weekly data).

$$y_{g,t} = \mu_t \lambda + \mathrm{seasonality}_{g,t} + \alpha Z_{g,t} + \sum_{k=1}^{K} \beta_k \, \mathrm{adstock}(x_{k,g,t}) + \epsilon_{g,t}$$

There are $K$ media channels and $G$ DMAs. $x_{k,g,t}$ is the impression of channel $k$ at week $t$. Let $y_{g,t}$ be the response variable at week $t$, which could be sales or log-transformed sales. We include $\mu_t \lambda$, $\mathrm{seasonality}_{g,t}$, and contemporaneous correlation with covariates $Z_{g,t}$ to capture the evolution of organic sales over time. $\mathrm{adstock}(x_{g,t})$ is the transformed impressions that captures: (1) diminishing returns; (2) the lag of the effect; (3) the carryover effect of the impressions. Ng et al. [13] use a different formulation which only estimates the saturation effect. In this paper, we adapt Google's proposed shape formulation of marketing effects, which is more flexible [10]. The $\mathrm{adstock}$ function can be defined as:

$$\mathrm{adstock}_{k,g,t} = \frac{\sum_{l=0}^{L} \tau_k^{(l-\theta_k)^2} \, x_{k,g,t-l}^{\rho}}{\sum_{l=0}^{L} \tau_k^{(l-\theta_k)^2}}$$

$\rho \in (0, 1]$ captures (potentially diminishing) returns to scale; $\tau \in (0, 1)$ governs the carryover rate over time; and $\theta$ indicates the lagged peak effect.

2.2 Account for Confounding Factors

While it is not the main focus of this paper, we take into account confounding factors when modeling trend and seasonality in order to properly capture organic demand. When modeling trend and seasonality, it is crucial to strike the right balance between flexibility and strictness – excessive flexibility may lead to overfitting, whereas overly rigid parametric formulations can result in a poor fit of the model. In this marketing use case, we can easily overfit a model that performs poorly out of sample because we have a high-dimensional parameter space – we have to estimate four parameters per channel. Keeping this tradeoff in mind, in addition to including an exponential trend and sinusoidal seasonality following Jin et al. [10], we also include as an additional covariate an index of Google Search query volume for travel and accommodation brands excluding Airbnb ($Z_{g,t}$ in the above equation), to capture confounding factors that affect organic demand contemporaneously.

3 DATA PROPERTIES

3.1 Pre-Process Data

We take two important steps to pre-process the data in preparation for descriptive analysis and modeling. First, we normalize bookings, channel impressions, and the covariate to establish a common scale across DMAs of different sizes. This will make it easier (A) to interpret the impact of a certain level of marketing activity; and (B) to model the common trend and seasonality later. Second, we decompose channel impressions into the common trend and seasonality and residual variation across DMAs, so we can focus on correlation in the residual variation next.

3.2 Characteristics of DMA-Level Data

There is a decent level of correlation between marketing channels and DMAs, even after eliminating the common trend and seasonality across DMAs for each channel. As Figure 1 shows, the variation in residual impressions is moderately to strongly correlated across the five channels.²

Such correlation is more pronounced across some DMAs than other DMAs. Figure 2 compares two sets of DMAs – DMAs in the

¹ While experimentation can be used to measure the performance of some channels, it is not always feasible due to practical constraints.
² We anonymize the channels as A, B, C, D, and E.
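The adstock transform in Section 2.1 can be sketched in a few lines of NumPy. This is a minimal reading of the formula, not the authors' implementation; the default parameter values and the choice to treat pre-sample impressions as zero are assumptions made here for illustration.

```python
import numpy as np

def adstock(x, rho=0.8, tau=0.5, theta=1.0, L=4):
    """Adstock transform: diminishing returns (rho), carryover (tau),
    and a lagged peak (theta) over a window of L lags.

    The weight on lag l is tau**((l - theta)**2), which peaks at
    l = theta since tau < 1. Pre-sample impressions are treated as
    zero (an assumption for the series start)."""
    x = np.asarray(x, dtype=float)
    lags = np.arange(L + 1)
    w = tau ** ((lags - theta) ** 2)   # carryover weights over lags 0..L
    xp = x ** rho                      # diminishing returns to scale
    out = np.zeros_like(xp)
    for t in range(len(x)):
        idx = t - lags
        valid = idx >= 0               # drop lags before the sample starts
        out[t] = np.dot(w[valid], xp[idx[valid]]) / w.sum()
    return out
```

For a constant impression level $x$, the output converges to $x^{\rho}$ once $t \geq L$, because the available lag weights then sum to the full normalizer.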
Causal Inference and Machine Learning in Practice, August 2023, Long Beach, CA

Figure 3: Dendrogram Illustrating Hierarchical Clustering of DMAs

Figure 4: Heat Maps of Channel Residual Impressions Across DMAs Over Time ((a) DMAs in Cluster 1; (b) DMAs in Cluster 4)

Figure 4 exemplifies such patterns using one small cluster (Cluster 1) and one larger cluster (Cluster 4). Each chart illustrates the variation in channel residual impressions over time (the horizontal axis) and across DMAs (the vertical axis). Within each cluster, the color patterns over time are quite similar across DMAs for most channels, reflecting moderate to high correlation.

5.2 Descriptive Evidence Shows Clustering Mitigates Multicollinearity

Descriptive evidence confirms our intuition that hierarchical clustering can effectively mitigate collinearity. By grouping moderately to highly correlated DMAs into the same cluster, we have significantly reduced correlation in residual channel impressions. As Figure 5 visualizes, correlation decreased generally across channels, by 8% to 43%.

Further, as Figure 6 demonstrates, clustering preserves variation in channel impressions both (A) within clusters over time and (B) across clusters within the same time period. This is promising for separately identifying the impact of different channels using panel data at the cluster-week level. When testing alternative clustering strategies, we consider the reduction in correlation and the preservation of variation as two important criteria.

5.3 Regression Results Confirm Clustering Alleviates Multicollinearity

Panel linear regression analysis affirms the effectiveness of this hierarchical clustering method in mitigating collinearity and facilitating the separate identification of the impact of different marketing channels. After clustering, channel coefficient estimates are no longer subject to the problem of flipped signs when included together with other channels, and instead produce mostly intuitive results.

Table 1 summarizes panel linear regression results before and after clustering, with geo (DMA or cluster) fixed effects and week fixed effects included throughout the different specifications.⁴ As Column 1 summarizes, most channel coefficients are negative if we use DMA-level data, while they would be positive if included individually. After clustering, in Column 2, the results become mostly intuitive – all four lower-funnel channels have positive estimated impact on sales (three of which are significant at 0.001

⁴ All coefficient estimates have been scaled by a constant.
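The clustering behind Figure 3 groups DMAs by the similarity of their cross-channel correlation patterns. The paper's exact channel-specific distance is not fully specified in this excerpt, so the sketch below uses one plausible reading, stated here as an assumption: each DMA is summarized by the cross-channel correlations of its residual impressions, and the distance between two DMAs is the sum of absolute differences over channel pairs. The function names and the use of average linkage are likewise illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dma_distance_matrix(resid):
    """resid: array of shape (G, K, T) -- residual impressions for
    G DMAs, K channels, T weeks (common trend/seasonality removed).

    Each DMA is summarized by its K x K cross-channel correlation
    matrix; DMA-to-DMA distance sums absolute differences over the
    distinct channel pairs (one plausible metric, an assumption)."""
    G, K, T = resid.shape
    corr = np.array([np.corrcoef(resid[g]) for g in range(G)])  # (G, K, K)
    iu = np.triu_indices(K, k=1)
    feats = corr[:, iu[0], iu[1]]          # (G, K*(K-1)/2) pair correlations
    return np.abs(feats[:, None, :] - feats[None, :, :]).sum(axis=2)

def cluster_dmas(resid, n_clusters):
    """Hierarchically cluster DMAs on the correlation-pattern distance."""
    D = dma_distance_matrix(resid)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

DMAs whose channels co-move in similar ways end up in the same cluster; aggregating within clusters then yields the cluster-week panel used in Section 5.3.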
Figure 5: Clustering Reduces Cross-Channel Correlation

Figure 6: Variation in Residual Channel Impressions Across Clusters Over Time

Table 1: Panel Linear Regression Summary

              DMA Level    Cluster Level    Cluster Level, Weighted
channel_a     5.570***     5.041***         5.227***
              (0.075)      (0.149)          (0.146)
channel_b     0.055***     0.128***         0.097***
              (0.014)      (0.028)          (0.027)
channel_c     0.003        0.043***         0.038***
              (0.004)      (0.008)          (0.007)
channel_d     0.007        0.007            0.009
              (0.005)      (0.010)          (0.010)
channel_e     0.011***     0.023**          0.025***
              (0.003)      (0.008)          (0.007)
Num.Obs.      35 700       7140             7140
R2            0.862        0.946            0.946
R2 Adj.       0.861        0.944            0.944
AIC           113 008.3    34 358.7         48 912.9
BIC           114 492.8    35 561.6         50 115.8
RMSE          1.17         2.62             7.27
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
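Table 1's specifications include geo and week fixed effects. As a minimal sketch of how such a two-way fixed-effects (within) estimator works, the code below demeans each variable by geo and by week and then runs OLS on the transformed data. Sequential demeaning is exact for balanced panels; this is an illustration of the technique, not the authors' estimation code, and all function names are hypothetical.

```python
import numpy as np

def within_transform(v, geo, week):
    """Two-way within transformation: subtract geo means, then week
    means of the geo-demeaned series. For a balanced panel this equals
    v - geo mean - week mean + grand mean."""
    out = np.asarray(v, dtype=float).copy()
    for ids in (np.asarray(geo), np.asarray(week)):
        means = np.bincount(ids, weights=out) / np.bincount(ids)
        out -= means[ids]
    return out

def panel_fe_ols(y, X, geo, week):
    """OLS of within-transformed y on within-transformed columns of X,
    equivalent to including geo and week fixed effects."""
    yt = within_transform(y, geo, week)
    Xt = np.column_stack([within_transform(X[:, j], geo, week)
                          for j in range(X.shape[1])])
    beta, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return beta
```

Because the geo and week effects are removed before the regression, the channel coefficients are identified only from variation within geos and within weeks, which is why preserving that variation after clustering (Section 5.2) matters.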
level). The remaining channel with a negative coefficient is upper-funnel, where we expect extreme difficulty in detecting a lower-funnel impact. Finally, these findings are robust to weighting the clusters based on the natural log of their baseline size.

5.4 Bayesian Model Results Using Cluster-Level Data

Now that both descriptive evidence and frequentist regression analysis affirm that our hierarchical clustering approach effectively mitigates collinearity, we move forward to estimate the Bayesian model described in Section 2 using data at the cluster level. Similar to the frequentist regressions, the Bayesian model also produces intuitive results, even with uninformative priors. Figure 7 visualizes the posterior distributions of the channel-specific impact parameters ($\beta$) and carryover rate parameters ($\tau$). Consistent with the frequentist regression results earlier, we estimate a higher impact for channels A and B than for the other channels.⁵ Furthermore, the carryover estimates are also intuitive and consistent with our previous learning and knowledge about the different channels. For example, we expect some carryover for Channels C, D, and E, but not for Channels A and B, and it is reassuring that the estimates confirm that understanding even though we are using uninformative priors.⁶

6 DISCUSSION OF BROADER APPLICATIONS

We have focused on an application in marketing mix modeling to demonstrate how and why hierarchical clustering can mitigate multicollinearity. However, this method is not constrained to the setting of marketing and instead is generally applicable to observational causal inference problems featuring multicollinearity. In our example, marketing data properties motivated us to cluster geographic units based on correlation in marketing activities. In other settings, one can decide which dimensions and criteria to use for clustering based on relevant data properties. The dimension along which to cluster data does not always have to be geographic. Furthermore, in some settings, natural clusters might already exist for one to consider. For example, in the context of Customer Service, we would like to understand how each of a customer's interactions with our support agents contributes to long-term retention. However, oftentimes these interaction experience metrics are highly correlated, such

⁵ Note that the impact parameter estimates here should be interpreted differently from the frequentist panel linear regressions above, because the Bayesian structural model also estimates the parameters that transform the impressions for each channel into adstock based on the lag, carryover, and shape parameters. But we can still make broad comparisons of the impact parameters across channels, taking into account the other parameters.
⁶ It is expected that the variance is high for the estimates of Channel E, such as the carryover parameter. Unlike the other channels, Channel E is upper-funnel, so it is especially difficult to estimate its impact on lower-funnel conversions.