
Hello everyone, welcome to the lecture on instance segmentation. Let's start by defining the problem. In the last lecture we saw what the task of semantic segmentation is: essentially, we want to label every pixel, including the background, with a semantic class. So we want to label a pixel as sky, grass, or road, but also as objects that, at the time of doing semantic segmentation, we are not really counting. Essentially, all the pixels that come from different instances of the same class are labeled with the same label in the task of semantic segmentation. You see, for example, here that we are labeling all the pixels belonging to these three cars with the same label, which is the label "car". So in semantic segmentation, the objects that can be counted, like cars or people, are treated in exactly the same way as the objects that cannot be counted, like sky, grass, or road.
Now, for the task of instance segmentation, we want to go one step further. Essentially, we no longer focus on labeling pixels that come from uncountable objects like the sky, grass, or road that we were discussing before, and focus only on segmenting the pixels that come from instances of classes of objects that we can actually count, for example cars or people. The idea here is that we do want to differentiate the pixels that form one car, one instance of a car, from the pixels that form another instance of another car. So essentially, instead of assigning the same label to all the pixels that form these three cars, we now assign different labels to, for example, the yellow car, the first instance, and this blue-green car, which will be a second instance. So not only do we want to find the semantic class of the object, we also want to find which instance of that class the object is. Do the yellow pixels here belong to the same instance as the blue-green pixels there, or not?
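To make the distinction concrete, here is a toy numpy illustration (the arrays are made up for this sketch, not taken from the lecture's figures): a semantic map gives all three cars the same label, while an instance map keeps them apart.

```python
import numpy as np

# One image row containing three cars (semantic class id 1) on background (0).
semantic = np.array([0, 1, 1, 0, 1, 1, 0, 1, 1, 0])

# Instance segmentation additionally tells the cars apart:
# each car's pixels carry a distinct instance id.
instance = np.array([0, 1, 1, 0, 2, 2, 0, 3, 3, 0])

# The semantic map sees a single "car" label ...
print(np.unique(semantic[semantic > 0]))  # [1]
# ... while the instance map recovers three separate cars.
print(np.unique(instance[instance > 0]))  # [1 2 3]
```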
So: differentiate between semantic classes, and differentiate between instances. We can go about doing instance segmentation in two ways. The first way is to use the knowledge that we have from object detection and start with a series of proposals. Essentially, we would start with, for example, object proposals, and then in a second step we would assign a semantic class to each of these proposals. Of course, in the first step we do the actual instance separation with the proposals, and in the second step we do the semantic part, and probably also the segmentation part. So this is one way to go. The other way is what I call the FCN-based method, which essentially starts from the semantic segmentation map. Here, for example, I see an example where all of these instances are labeled as "person", and so in a second step what I need to do is separate the instances inside this semantic label.
Let's focus first on the second type of methods. Of course, the great advantage of FCN-based methods is that they start from a semantic map, and we already saw in the last lecture how to do this: it was usually done with fully convolutional networks that are able to act on any image size. This is why FCN-based methods are actually so powerful: they start from an already pretty good semantic segmentation. Once you have the semantic segmentation, your goal is to separate the instances within each of the classes. There are three methods that I would recommend you read outside of the lecture to get more depth on how they actually perform instance segmentation, and I will talk about the second method just briefly, to give the intuition of how to go from edges to instances with MultiCut.
A lot of the methods that want to find instances inside a semantic map use the concept of clustering. You can imagine that you have a set of pixels that represent the class "person", and within these pixels, what I want to do is cluster the pixels that actually belong to one instance. In this case, what this method proposes is to start from the input image, perform semantic segmentation, which gives you the per-pixel semantic class scores, and then perform an image partition that separates the image into small sets, for example superpixels: groups of pixels that share certain characteristics, for example a smooth transition in color space. Once you have this separation, what you want to do is group these superpixels together into instances. These two branches act in a kind of parallel way: the left branch performs semantic segmentation, while the right branch performs the partition of the image, usually using very low-level features such as edge detection. From this you obtain the superpixels, which will then be the units that you group together into instances.
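As a deliberately crude sketch of this grouping step (a loud simplification: it assumes instances of the same class never touch, which real methods such as MultiCut do not assume, since they solve a partition problem over superpixels and edge evidence), connected components of a class mask already separate non-touching instances:

```python
import numpy as np

def connected_components(mask):
    """Label 4-connected regions of a binary mask (0 = background)."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue  # pixel already assigned to an instance
        current += 1
        stack = [start]
        while stack:  # flood-fill one instance
            y, x = stack.pop()
            if (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]
                    and mask[y, x] and not labels[y, x]):
                labels[y, x] = current
                stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels

# A "person" class mask containing two spatially separate people.
person = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
])
instances = connected_components(person)
print(instances.max())  # 2: two separate instances recovered
```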
I'm not going to go into more detail on these methods, because we want to focus this lecture on proposal-based methods, and this is because so far they have shown much better performance. So what are proposal-based methods? Well, we already know how to obtain bounding boxes, how to do object detection, and this is essentially what proposal-based methods leverage. If you already know how to separate different instances of the same semantic class, different objects, different cars for example, with object detection, why not use that as a condition, as your input essentially, and then try to find the segmentation mask within the bounding box? Of course, it is much easier to find the appropriate segmentation mask if you already have a bounding box and you know that the pixels of your instance can only be found inside it, than if you start from the whole image and just look at all the pixels.
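A minimal sketch of why the box helps (the per-pixel scores and box coordinates below are hypothetical): the detection restricts where mask pixels may be searched for.

```python
import numpy as np

# Hypothetical per-pixel foreground scores over a whole image.
rng = np.random.default_rng(0)
scores = rng.random((400, 400))

# A detection box (x1, y1, x2, y2) coming from an object detector.
x1, y1, x2, y2 = 120, 80, 220, 180

# Without the box, every pixel of the image is a mask candidate.
full_mask = scores > 0.5

# With the box, instance pixels can only lie inside it:
# the search space shrinks from 400*400 to 100*100 pixels.
instance_mask = np.zeros_like(full_mask)
instance_mask[y1:y2, x1:x2] = full_mask[y1:y2, x1:x2]

print(instance_mask.sum() <= full_mask.sum())  # True
```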
There are two proposal-based methods that I want to mention, and I also propose that you read the follow-up works or the previous works, just to see how the methods evolved: how one starts with one approach and then evolves it to obtain better and better results. In this case, we will discuss the ECCV 2014 paper SDS, and you can read the follow-up work presented at CVPR 2015; after this, we will focus on Multi-task Network Cascades, and you can check the previous work, which was done a year before.
So let's start with an overview of SDS. SDS presents a very simple concept in which proposals are used as a starting point not only for bounding-box prediction but also directly for region prediction, that is, segmentation or mask prediction. The idea here is to start from a set of proposals, which you obtain with any algorithm that you like, and then perform a feature extraction that is good for bounding-box and also mask prediction. Of course, you have two separate CNNs here: one that predicts the bounding box, the other that predicts the region. The idea is that by combining these two sources of information you can perform region classification, and then use other methods for region refinement. The idea to keep in mind here is that there is a separate head for the box prediction and a separate head for the mask prediction.
Multi-task Network Cascades, on the other hand, proposes a slightly more complex approach. The idea is again to start from these regions of interest, these proposals, which you first convert into mask instances and later refine into categorized instances, therefore assigning a class to each of these instances. In this case, we also start from these proposals, and we compute everything by just looking at the region of interest, which is pooled, which is warped, so that we can nicely work with it to create our mask instances and our categorized instances.
840
841 211
842 00:10:07,679 --> 00:10:15,989
843 categorized instances now one question
844
845 212
846 00:10:13,049 --> 00:10:19,679
847 that one might ask yourselves is why
848
849 213
850 00:10:15,990 --> 00:10:21,659
851 should I constrain my method to a fixed
852
853 214
854 00:10:19,679 --> 00:10:23,539
855 set of proposals which might be
856
857 215
858 00:10:21,659 --> 00:10:26,338
859 incorrect might contain some
860
861 216
862 00:10:23,539 --> 00:10:29,279
863 imprecisions or why should I constrain
864
865 217
866 00:10:26,339 --> 00:10:31,740
867 myself to a semantic segmentation map
868
869 218
870 00:10:29,279 --> 00:10:33,720
871 ideally what you would want is to
872
873 219
874 00:10:31,740 --> 00:10:35,759
875 leverage the best of both works to
876
877 220
878 00:10:33,720 --> 00:10:38,160
879 leverage the proposals which give me a
880
881 221
882 00:10:35,759 --> 00:10:40,740
883 lot of information on instances and to
884
885 222
886 00:10:38,159 --> 00:10:43,019
887 also leverage the semantic maps and not
888
889 223
890 00:10:40,740 --> 00:10:46,230
891 start from one and then try to direct
892
893 224
894 00:10:43,019 --> 00:10:50,639
895 the other so this is essentially how we
896
897 225
898 00:10:46,230 --> 00:10:53,730
899 come to one of the most famous methods
900
901 226
902 00:10:50,639 --> 00:10:56,309
903 for instant segmentation masks are CNN
904
905 227
906 00:10:53,730 --> 00:10:58,589
907 which derives from the work of fast and
908
909 228
910 00:10:56,309 --> 00:11:03,509
911 faster CNN which we saw in previous
912
913 229
914 00:10:58,589 --> 00:11:06,089
915 lectures so in Moscow CNN we essentially
In Mask R-CNN, we essentially start from the Faster R-CNN architecture that we already know. We have our famous image of the penguin, which is processed by the CNN to perform feature extraction, and from this we have, at the bottom, the region proposal network, which proposes the regions of interest, the areas inside the image that are worth looking at. Then Faster R-CNN had the bounding-box regression head, which refined these bounding boxes and actually allowed you to obtain boxes that fit tightly to the object, and also the classification head, which tells you whether there is a penguin or a cat in the image. This is the basic architecture that Mask R-CNN starts from, and the main idea is to add a third head: a head, very much based on fully convolutional networks, that performs instance segmentation. So you take kind of the best of both worlds: you take the power of Faster R-CNN, the proposals and the detection power, and you take the power of fully convolutional networks to perform semantic segmentation.
This is another depiction of what I mean by this combination, Faster R-CNN plus the FCN-like mask head. We have the Faster R-CNN architecture depicted on the left; we have our regions of interest, which are going to be pooled with an operation that is very similar to RoI pooling but is now slightly adapted, and we will talk about this. So in this case we don't have RoI pooling, but an operation called RoIAlign; the idea, however, is again to convert any bounding-box size to a fixed representation, so that we can then predict the class and box, and, with a series of further convolutions, also predict the mask. The mask loss is essentially going to be a binary cross-entropy, per pixel, for the K semantic classes, so we are going to directly try to predict the semantic class for that particular instance.
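That per-pixel, per-class binary loss can be sketched in numpy as follows (toy sizes; the key property, as in Mask R-CNN, is that only the ground-truth class channel is penalized, so classes do not compete inside the mask head):

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """Per-pixel sigmoid cross-entropy on the ground-truth class channel
    of a (K, H, W) stack of per-class binary mask logits."""
    logits = mask_logits[gt_class]          # (H, W) slice for the GT class
    probs = 1.0 / (1.0 + np.exp(-logits))   # independent per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(probs + eps)
            + (1 - gt_mask) * np.log(1 - probs + eps))
    return bce.mean()

# One RoI: K=3 classes, 4x4 mask resolution (toy sizes).
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(3, 4, 4))
gt_mask = (rng.random((4, 4)) > 0.5).astype(float)

loss = mask_loss(mask_logits, gt_mask, gt_class=1)
print(loss)  # scalar; only channel 1 influenced it
```

Because the loss reads only the ground-truth channel, changing the logits of any other class channel leaves it untouched, which is exactly what decouples "which class" (the classification head's job) from "which pixels" (the mask head's job).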
Now, of course, the idea is that the whole instance problem has already been solved by Faster R-CNN: Faster R-CNN already gives us the separation between objects, and therefore the mask head can focus entirely on finding out the semantic class of this instance. The mask head, as I said, is a series of convolutions; it is a fully convolutional network, an FCN. The idea is that you take your feature representation, which still contains some spatial information (this is the representation before the classification head and the bounding-box regression head of Faster R-CNN), and you essentially apply a series of convolution operations to process it, until we have a nice output in which we can represent the semantic classes. The power of Mask R-CNN is that most of the features are shared, so most of the computation is shared: we are really adding only a few operations on top of Faster R-CNN in order to produce the segmentation results.
But let's now look at the two tasks of detection and segmentation: can we actually use the same operations inside a neural network to perform both? In the case of detection, we essentially want to do object classification: once you have a proposal, you want to take that box and perform classification. Is this a cat, is this a dog, or is this not an object at all? So you actually require invariant representations, and in particular you require translation invariance: wherever my penguin is inside the image, I still want to classify it as a penguin. Therefore, I need translation-invariant representations to perform detection.
Now let's add the segmentation problem, which is slightly different. For every translated object I need a translated mask, and for every scaled object I need a scaled mask; therefore, I am going to require equivariant representations. Also, in semantic segmentation small objects are less important, because they have fewer pixels and therefore count less in the loss function, but for instance segmentation all objects, no matter their size, are equally important, same as for object detection. So I am going to need slightly different representations; I am going to need to make more changes to Faster R-CNN in order to have a network that gives me equivariant representations instead of one that gives me invariant representations.
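The distinction can be checked numerically with a 1-D toy example (made-up signal and kernel): a convolution shifts its output along with the input, while a global max pool gives the same answer wherever the object sits.

```python
import numpy as np

signal = np.array([0., 0., 1., 2., 1., 0., 0., 0.])  # "object" at index 3
shifted = np.roll(signal, 2)                         # same object, translated
kernel = np.array([1., 2., 1.])

def conv(x):
    return np.convolve(x, kernel, mode="same")

# Equivariance: shifting the input shifts the output by the same amount.
print(np.allclose(conv(shifted), np.roll(conv(signal), 2)))  # True

# Invariance: a global max pool discards position entirely,
# fine for classification but useless for predicting a mask.
print(conv(signal).max() == conv(shifted).max())             # True
```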
1372
1373 344
1374 00:16:14,120 --> 00:16:20,639
1375 so what kind of operations are inside
1376
1377 345
1378 00:16:17,039 --> 00:16:22,769
1379 faster CNN and what kind of operations
1380
1381 346
1382 00:16:20,639 --> 00:16:25,350
1383 are equi variant and therefore I can
1384
1385 347
1386 00:16:22,769 --> 00:16:27,750
1387 keep and what others are invariant and
1388
1389 348
1390 00:16:25,350 --> 00:16:30,720
1391 therefore I need to change in order to
1392
1393 349
1394 00:16:27,750 --> 00:16:32,429
1395 create masks our CNN so let's look at
1396
1397 350
1398 00:16:30,720 --> 00:16:34,980
1399 the first type of operation the first
1400
1401 351
1402 00:16:32,429 --> 00:16:36,299
1403 type is the feature extraction which is
1404
1405 352
1406 00:16:34,980 --> 00:16:38,730
1407 performed using a series of
1408
1409 353
1410 00:16:36,299 --> 00:16:41,039
1411 convolutional layers we all know that
1412
1413 354
1414 00:16:38,730 --> 00:16:43,500
1415 these operations are equal variance or
1416
1417 355
1418 00:16:41,039 --> 00:16:45,778
1419 no problem there and the same goes for
1420
1421 356
1422 00:16:43,500 --> 00:16:47,850
1423 the segmentation head right it's a fully
1424
1425 357
1426 00:16:45,778 --> 00:16:50,850
1427 convolutional network so these
1428
1429 358
1430 00:16:47,850 --> 00:16:53,399
1431 operations are also equivalent but we
1432
1433 359
1434 00:16:50,850 --> 00:16:57,990
1435 have one problem in the middle with
1436
1437 360
1438 00:16:53,399 --> 00:17:01,470
1439 faster CNN so remember that we had the
1440
1441 361
1442 00:16:57,990 --> 00:17:04,949
1443 operation of Roy pooling and we had a
1444
1445 362
1446 00:17:01,470 --> 00:17:07,970
1447 series of fully connected layers and all
1448
1449 363
1450 00:17:04,949 --> 00:17:10,860
1451 of these operations essentially give
1452
1453 364
1454 00:17:07,970 --> 00:17:12,959
1455 invariance to the representation so
1456
1457 365
1458 00:17:10,859 --> 00:17:18,058
1459 these are not operations that we can
1460
1461 366
1462 00:17:12,959 --> 00:17:20,459
1463 keep for Mask R-CNN so remember how
1464
1465 367
1466 00:17:18,058 --> 00:17:23,068
1467 the region of interest pooling operation
1468
1469 368
1470 00:17:20,459 --> 00:17:25,740
1471 was working so if you remember we had
1472
1473 369
1474 00:17:23,068 --> 00:17:28,500
1475 this proposal that is on top of my
1476
1477 370
1478 00:17:25,740 --> 00:17:31,319
1479 penguin I perform feature extraction
1480
1481 371
1482 00:17:28,500 --> 00:17:34,049
1483 with my CNN and now from my feature map
1484
1485 372
1486 00:17:31,319 --> 00:17:37,079
1487 I'm only interested in looking at the
1488
1489 373
1490 00:17:34,049 --> 00:17:41,250
1491 region of my green proposal so what I do
1492
1493 374
1494 00:17:37,079 --> 00:17:45,178
1495 is I put on top of this proposal an H by
1496
1497 375
1498 00:17:41,250 --> 00:17:47,460
1499 W grid and then I perform pooling for
1500
1501 376
1502 00:17:45,179 --> 00:17:49,200
1503 each of these regions in order to obtain
1504
1505 377
1506 00:17:47,460 --> 00:17:53,819
1507 a map that is H by
1508
1509 378
1510 00:17:49,200 --> 00:17:56,308
1511 W by C so essentially what I do is I
1512
1513 379
1514 00:17:53,819 --> 00:17:59,788
1515 bring all proposal representation to a
1516
1517 380
1518 00:17:56,308 --> 00:18:03,509
1519 fixed spatial size of H by W with this
1520
1521 381
1522 00:17:59,788 --> 00:18:05,849
1523 pooling operation so what is the exact
1524
1525 382
1526 00:18:03,509 --> 00:18:09,239
1527 problem with this pooling operation let us
1528
1529 383
1530 00:18:05,849 --> 00:18:12,298
1531 look at the example with specific sizes
1532
1533 384
1534 00:18:09,239 --> 00:18:15,389
1535 so let's assume that I have my 400 by
1536
1537 385
1538 00:18:12,298 --> 00:18:19,648
1539 400 image and I have my bounding box
1540
1541 386
1542 00:18:15,388 --> 00:18:22,378
1543 which is 300 by 150 now once I pass it
1544
1545 387
1546 00:18:19,648 --> 00:18:24,898
1547 through a CNN of course my feature map
1548
1549 388
1550 00:18:22,378 --> 00:18:28,168
1551 has been reduced in spatial size and now
1552
1553 389
1554 00:18:24,898 --> 00:18:29,758
1555 has more channels we will just represent
1556
1557 390
1558 00:18:28,169 --> 00:18:31,350
1559 the number of channels as C because
1560
1561 391
1562 00:18:29,759 --> 00:18:33,659
1563 we're not interested in how many
1564
1565 392
1566 00:18:31,349 --> 00:18:37,558
1567 channels we have but we're interested in
1568
1569 393
1570 00:18:33,659 --> 00:18:42,179
1571 the spatial size which is now 65 by 65
1572
1573 394
1574 00:18:37,558 --> 00:18:44,428
1575 coming down from 400 by 400 so this of
1576
1577 395
1578 00:18:42,179 --> 00:18:47,190
1579 course means that my bounding box has
1580
1581 396
1582 00:18:44,429 --> 00:18:50,038
1583 also been scaled down and for example
1584
1585 397
1586 00:18:47,190 --> 00:18:57,659
1587 the height which was before 300 pixels
1588
1589 398
1590 00:18:50,038 --> 00:19:00,778
1591 is now 48 point 75 pixels so now let's
1592
1593 399
1594 00:18:57,659 --> 00:19:04,739
1595 imagine that I take my grid I take my H
1596
1597 400
1598 00:19:00,778 --> 00:19:09,739
1599 by W grid and I put it on top of my box
1600
1601 401
1602 00:19:04,739 --> 00:19:13,829
1603 which has a height of 48 point 75 pixels
1604
1605 402
1606 00:19:09,739 --> 00:19:17,700
1607 now what ends up happening is that I have
1608
1609 403
1610 00:19:13,829 --> 00:19:18,269
1611 to choose how many pixels to take into
1612
1613 404
1614 00:19:17,700 --> 00:19:21,659
1615 my grid
1616
1617 405
1618 00:19:18,269 --> 00:19:25,048
1619 I cannot take 48 point 75 pixels and
1620
1621 406
1622 00:19:21,659 --> 00:19:28,169
1623 divide it by the number of bins that I
1624
1625 407
1626 00:19:25,048 --> 00:19:29,849
1627 need when I put the grid on top so I
1628
1629 408
1630 00:19:28,169 --> 00:19:33,090
1631 need to make a choice and I for example
1632
1633 409
1634 00:19:29,849 --> 00:19:35,519
1635 make the choice of 48 right so this is
1636
1637 410
1638 00:19:33,089 --> 00:19:40,019
1639 the first quantization effect that we're
1640
1641 411
1642 00:19:35,519 --> 00:19:42,089
1643 going to see so first in the output
1644
1645 412
1646 00:19:40,019 --> 00:19:45,028
1647 we're going to have this quantization
1648
1649 413
1650 00:19:42,089 --> 00:19:49,230
1651 effect reflected because now my bounding
1652
1653 414
1654 00:19:45,028 --> 00:19:51,569
1655 box is not truly 300 pixels high but
1656
1657 415
1658 00:19:49,230 --> 00:19:53,069
1659 it's much less due to the quantization
1660
1661 416
1662 00:19:51,569 --> 00:19:57,269
1663 and due to the fact that instead of
1664
1665 417
1666 00:19:53,069 --> 00:20:01,230
1667 taking 48 point 75 I took 48 as the box
1668
1669 418
1670 00:19:57,269 --> 00:20:02,940
1671 height so of course you can think that
1672
1673 419
1674 00:20:01,230 --> 00:20:04,980
1675 this is not really suitable
1676
1677 420
1678 00:20:02,940 --> 00:20:08,910
1679 when you want to extract pixel wise
1680
1681 421
1682 00:20:04,980 --> 00:20:11,819
1683 precise masks if I'm already having a
1684
1685 422
1686 00:20:08,910 --> 00:20:14,430
1687 quantization problem only when I predict
1688
1689 423
1690 00:20:11,819 --> 00:20:17,309
1691 the bounding box if I predict pixel wise
1692
1693 424
1694 00:20:14,430 --> 00:20:20,820
1695 precise mask I'm going to lose a lot of
1696
1697 425
1698 00:20:17,309 --> 00:20:22,950
1699 the mask only with this operation so it
1700
1701 426
1702 00:20:20,819 --> 00:20:31,409
1703 is clear that RoI pooling is not
1704
1705 427
1706 00:20:22,950 --> 00:20:33,930
1707 suitable for Mask R-CNN so the idea
1708
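The quantization just described can be reproduced in a few lines (a minimal sketch using the lecture's numbers; the function name and floor-based rounding are illustrative assumptions):

```python
import math

# Sketch of the quantization in RoI pooling, using the lecture's numbers:
# a 400x400 image whose feature map is 65x65, and a box 300 pixels high.
def pooled_box_height(box_h, image_size=400, feature_size=65):
    exact = box_h * feature_size / image_size   # 300 * 65 / 400 = 48.75 cells
    quantized = math.floor(exact)               # RoI pooling snaps to whole cells -> 48
    return exact, quantized

exact, quantized = pooled_box_height(300)
```

The 0.75 of a cell thrown away here corresponds to several pixels in the original image, which is exactly the error the lecture says is too large for pixel-wise masks.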
1709 428
1710 00:20:31,410 --> 00:20:36,480
1711 of Mask R-CNN is then to essentially
1712
1713 429
1714 00:20:33,930 --> 00:20:39,120
1715 exchange all the invariant operations
1716
1717 430
1718 00:20:36,480 --> 00:20:41,069
1719 with operations that are equivariant
1720
1721 431
1722 00:20:39,119 --> 00:20:43,679
1723 and in this case they're going to
1724
1725 432
1726 00:20:41,069 --> 00:20:46,169
1727 exchange the ROI pooling with an
1728
1729 433
1730 00:20:43,680 --> 00:20:49,620
1731 equivariant operation which they call RoI
1732
1733 434
1734 00:20:46,170 --> 00:20:51,630
1735 Align so one of the goals of RoI
1736
1737 435
1738 00:20:49,619 --> 00:20:54,209
1739 Align is to actually erase those
1740
1741 436
1742 00:20:51,630 --> 00:20:57,180
1743 quantization effects so if you look at
1744
1745 437
1746 00:20:54,210 --> 00:21:00,990
1747 the example before where we had our 300
1748
1749 438
1750 00:20:57,180 --> 00:21:04,400
1751 pixel box that was converted to a box
1752
1753 439
1754 00:21:00,990 --> 00:21:08,370
1755 height of 48 point 75 in our feature map
1756
1757 440
1758 00:21:04,400 --> 00:21:11,519
1759 before we had to choose for example 48
1760
1761 441
1762 00:21:08,369 --> 00:21:13,559
1763 in order to perform our RoI pooling but
1764
1765 442
1766 00:21:11,519 --> 00:21:17,220
1767 the idea now is that we're going to be
1768
1769 443
1770 00:21:13,559 --> 00:21:20,009
1771 able to choose exactly 48 point 75 and
1772
1773 444
1774 00:21:17,220 --> 00:21:22,470
1775 still perform this region of interest
1776
1777 445
1778 00:21:20,009 --> 00:21:26,430
1779 pooling operation which is now called
1780
1781 446
1782 00:21:22,470 --> 00:21:29,250
1783 RoI Align so let's look at this
1784
1785 447
1786 00:21:26,430 --> 00:21:32,039
1787 example where we have our feature map
1788
1789 448
1790 00:21:29,250 --> 00:21:34,609
1791 here on the left and our bounding box
1792
1793 449
1794 00:21:32,039 --> 00:21:39,319
1795 which is depicted in this salmon color
1796
1797 450
1798 00:21:34,609 --> 00:21:43,559
1799 now from this bounding box this ROI this
1800
1801 451
1802 00:21:39,319 --> 00:21:46,049
1803 proposal we want to get as output from
1804
1805 452
1806 00:21:43,559 --> 00:21:48,149
1807 the RoI Align operation a fixed
1808
1809 453
1810 00:21:46,049 --> 00:21:52,470
1811 dimensional representation which in our
1812
1813 454
1814 00:21:48,150 --> 00:21:55,170
1815 case is a 2 by 2 representation so
1816
1817 455
1818 00:21:52,470 --> 00:21:58,710
1819 essentially we need to fill these four
1820
1821 456
1822 00:21:55,170 --> 00:22:02,430
1823 positions we need to obtain this one
1824
1825 457
1826 00:21:58,710 --> 00:22:05,039
1827 value for each of these positions but of
1828
1829 458
1830 00:22:02,430 --> 00:22:08,430
1831 course for each of these positions we
1832
1833 459
1834 00:22:05,039 --> 00:22:12,180
1835 actually have all of this area to cover
1836
1837 460
1838 00:22:08,430 --> 00:22:15,180
1839 and kind of distill into this only one
1840
1841 461
1842 00:22:12,180 --> 00:22:16,110
1843 number and of course instead of doing
1844
1845 462
1846 00:22:15,180 --> 00:22:18,690
1847 any quantization
1848
1849 463
1850 00:22:16,109 --> 00:22:22,349
1851 here what we say is we want to take into
1852
1853 464
1854 00:22:18,690 --> 00:22:24,900
1855 account really the values that can be
1856
1857 465
1858 00:22:22,349 --> 00:22:27,240
1859 found in this representation in the
1860
1861 466
1862 00:22:24,900 --> 00:22:31,320
1863 exact position where they are meant to
1864
1865 467
1866 00:22:27,240 --> 00:22:34,529
1867 be without any quantization effects so
1868
1869 468
1870 00:22:31,319 --> 00:22:37,048
1871 essentially what we do is we sample each
1872
1873 469
1874 00:22:34,529 --> 00:22:41,129
1875 of these units each of these output
1876
1877 470
1878 00:22:37,048 --> 00:22:44,730
1879 regions into four so for example four
1880
1881 471
1882 00:22:41,130 --> 00:22:46,830
1883 times now these points are going to be
1884
1885 472
1886 00:22:44,730 --> 00:22:49,919
1887 the grid points for the bilinear
1888
1889 473
1890 00:22:46,829 --> 00:22:53,460
1891 interpolation so essentially what I'm
1892
1893 474
1894 00:22:49,919 --> 00:22:56,700
1895 going to do is I'm going to take the
1896
1897 475
1898 00:22:53,460 --> 00:22:59,058
1899 pixel values that I do have from my
1900
1901 476
1902 00:22:56,700 --> 00:23:03,090
1903 feature map right my feature map
1904
1905 477
1906 00:22:59,058 --> 00:23:06,210
1907 contains values at each of these corners
1908
1909 478
1910 00:23:03,089 --> 00:23:08,970
1911 here so of course there is no value that
1912
1913 479
1914 00:23:06,210 --> 00:23:11,600
1915 represents this position here but what I
1916
1917 480
1918 00:23:08,970 --> 00:23:15,660
1919 can do is I can do bilinear
1920
1921 481
1922 00:23:11,599 --> 00:23:19,109
1923 interpolation between these four corners
1924
1925 482
1926 00:23:15,660 --> 00:23:21,150
1927 of my orange box and bilinear
1928
1929 483
1930 00:23:19,109 --> 00:23:23,519
1931 interpolation actually means that of
1932
1933 484
1934 00:23:21,150 --> 00:23:26,540
1935 course this value is going to be much
1936
1937 485
1938 00:23:23,519 --> 00:23:29,910
1939 more influenced by the bottom left
1940
1941 486
1942 00:23:26,539 --> 00:23:32,490
1943 corner of this orange box because it is
1944
1945 487
1946 00:23:29,910 --> 00:23:33,990
1947 very close to that corner there but
1948
1949 488
1950 00:23:32,490 --> 00:23:36,900
1951 you're still going to take into account
1952
1953 489
1954 00:23:33,990 --> 00:23:40,440
1955 the values that can be found on the
1956
1957 490
1958 00:23:36,900 --> 00:23:42,660
1959 other corners of my orange box so
1960
1961 491
1962 00:23:40,440 --> 00:23:46,470
1963 through bilinear interpolation I can
1964
1965 492
1966 00:23:42,660 --> 00:23:50,548
1967 essentially create a true value for this
1968
1969 493
1970 00:23:46,470 --> 00:23:52,350
1971 blue point for these grid points and now
1972
1973 494
1974 00:23:50,548 --> 00:23:54,869
1975 what I can do is I can essentially
1976
1977 495
1978 00:23:52,349 --> 00:23:57,240
1979 condense this information through for
1980
1981 496
1982 00:23:54,869 --> 00:24:02,909
1983 example max pooling in order to obtain
1984
1985 497
1986 00:23:57,240 --> 00:24:04,919
1987 one output value for the RoI Align so
1988
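The bilinear interpolation at the heart of RoI Align can be sketched as follows (a minimal illustration; the four sampling positions and the final max over samples are simplified assumptions, not the exact paper implementation):

```python
import numpy as np

def bilinear(feature, y, x):
    """Sample a 2-D feature map at the fractional position (y, x) by
    bilinearly interpolating the four surrounding grid values."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feature[y0, x0]
            + (1 - wy) * wx * feature[y0, x1]
            + wy * (1 - wx) * feature[y1, x0]
            + wy * wx * feature[y1, x1])

f = np.array([[0.0, 1.0],
              [2.0, 3.0]])
# Each RoI Align output cell samples a few such fractional points
# (four in this sketch) and then pools them, e.g. with a max:
samples = [bilinear(f, 0.25, 0.25), bilinear(f, 0.25, 0.75),
           bilinear(f, 0.75, 0.25), bilinear(f, 0.75, 0.75)]
cell_value = max(samples)
```

Note that the point closest to the bottom-right corner is dominated by that corner's value, matching the weighting the lecture describes.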
1989 498
1990 00:24:02,910 --> 00:24:07,440
1991 this essentially avoids all the
1992
1993 499
1994 00:24:04,919 --> 00:24:09,690
1995 quantization effects and really takes
1996
1997 500
1998 00:24:07,440 --> 00:24:12,390
1999 into account the actual value in the
2000
2001 501
2002 00:24:09,690 --> 00:24:14,759
2003 actual position of the feature map and
2004
2005 502
2006 00:24:12,390 --> 00:24:16,650
2007 not a quantized value which is not
2008
2009 503
2010 00:24:14,759 --> 00:24:19,769
2011 accurate enough for segmentation
2012
2013 504
2014 00:24:16,650 --> 00:24:22,440
2015 prediction and we can see that actually
2016
2017 505
2018 00:24:19,769 --> 00:24:23,970
2019 Mask R-CNN works really well so the
2020
2021 506
2022 00:24:22,440 --> 00:24:27,419
2023 qualitative results are really
2024
2025 507
2026 00:24:23,970 --> 00:24:29,850
2027 impressive we get a lot of objects small
2028
2029 508
2030 00:24:27,419 --> 00:24:33,570
2031 and big they are quite well segmented
2032
2033 509
2034 00:24:29,849 --> 00:24:36,349
2035 semantic classes are correct and we can
2036
2037 510
2038 00:24:33,569 --> 00:24:39,599
2039 actually use the same network to segment
2040
2041 511
2042 00:24:36,349 --> 00:24:50,219
2043 lots of object categories from person to
2044
2045 512
2046 00:24:39,599 --> 00:24:52,259
2047 motorbike to cup to donut and the nice
2048
2049 513
2050 00:24:50,220 --> 00:24:55,259
2051 thing about Mask R-CNN is that it's a
2052
2053 514
2054 00:24:52,259 --> 00:24:57,480
2055 quite flexible architecture right so you
2056
2057 515
2058 00:24:55,259 --> 00:25:01,529
2059 can also for example extend it to predict
2060
2061 516
2062 00:24:57,480 --> 00:25:04,099
2063 body joints so the idea is that you can
2064
2065 517
2066 00:25:01,529 --> 00:25:08,009
2067 actually model a key point location as a
2068
2069 518
2070 00:25:04,099 --> 00:25:10,919
2071 one-hot mask and adapt Mask R-CNN to
2072
2073 519
2074 00:25:08,009 --> 00:25:14,609
2075 predict K masks which are in the end
2076
2077 520
2078 00:25:10,920 --> 00:25:16,860
2079 only one pixel so every joint is going
2080
2081 521
2082 00:25:14,609 --> 00:25:20,359
2083 to be represented as a mask and you're
2084
2085 522
2086 00:25:16,859 --> 00:25:23,299
2087 going to predict a mask for K joints and
2088
2089 523
2090 00:25:20,359 --> 00:25:26,699
2091 each of the masks is going to represent
2092
2093 524
2094 00:25:23,299 --> 00:25:29,609
2095 left shoulder right elbow and all the
2096
2097 525
2098 00:25:26,700 --> 00:25:32,069
2099 other joints in the body and so
2100
2101 526
2102 00:25:29,609 --> 00:25:34,229
2103 essentially by just slightly changing
2104
2105 527
2106 00:25:32,069 --> 00:25:36,899
2107 the meaning of the representation you
2108
2109 528
2110 00:25:34,230 --> 00:25:39,900
2111 can use the same operations and you can
2112
2113 529
2114 00:25:36,900 --> 00:25:44,730
2115 take advantage of the RoI Align power
2116
2117 530
2118 00:25:39,900 --> 00:25:47,250
2119 to actually predict precise body joints
2120
2121 531
2122 00:25:44,730 --> 00:25:54,000
2123 in this case and of course the full
2124
2125 532
2126 00:25:47,250 --> 00:25:56,069
2127 skeleton as is depicted here so now the
2128
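The keypoint-as-mask idea can be illustrated in a few lines (a sketch; the 56x56 mask resolution and the helper name are assumptions for illustration):

```python
import numpy as np

def joint_to_onehot_mask(y, x, size=56):
    """Encode one body joint at pixel (y, x) as a one-hot mask:
    a single foreground pixel, everything else background."""
    mask = np.zeros((size, size), dtype=np.float32)
    mask[y, x] = 1.0
    return mask

# K joints -> K one-hot masks, one per joint
# (left shoulder, right elbow, and so on)
joints = [(10, 20), (30, 40)]
masks = np.stack([joint_to_onehot_mask(y, x) for y, x in joints])
```

Training then proceeds exactly like mask prediction, only the target mask contains a single positive pixel per joint.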
2129 533
2130 00:25:54,000 --> 00:25:57,660
2131 question is can we actually do better
2132
2133 534
2134 00:25:56,069 --> 00:26:00,450
2135 right so there are works that have been
2136
2137 535
2138 00:25:57,660 --> 00:26:02,550
2139 building on top of Mask R-CNN and
2140
2141 536
2142 00:26:00,450 --> 00:26:03,360
2143 trying to improve the accuracy of mask
2144
2145 537
2146 00:26:02,549 --> 00:26:06,269
2147 R-CNN
2148
2149 538
2150 00:26:03,359 --> 00:26:09,419
2151 so of course one problem is that the
2152
2153 539
2154 00:26:06,269 --> 00:26:12,720
2155 mask quality score is computed as the
2156
2157 540
2158 00:26:09,420 --> 00:26:15,320
2159 confidence score for the bounding box so
2160
2161 541
2162 00:26:12,720 --> 00:26:17,789
2163 essentially if your bounding box
2164
2165 542
2166 00:26:15,319 --> 00:26:21,210
2167 doesn't have a high confidence score
2168
2169 543
2170 00:26:17,789 --> 00:26:26,220
2171 then your mask quality is also going to
2172
2173 544
2174 00:26:21,210 --> 00:26:29,039
2175 suffer and remember that the mask loss
2176
2177 545
2178 00:26:26,220 --> 00:26:31,259
2179 just evaluates if the pixels have the
2180
2181 546
2182 00:26:29,039 --> 00:26:33,839
2183 correct semantic class not the correct
2184
2185 547
2186 00:26:31,259 --> 00:26:37,910
2187 instance so for example in this case
2188
2189 548
2190 00:26:33,839 --> 00:26:42,059
2191 where we have three persons if all of
2192
2193 549
2194 00:26:37,910 --> 00:26:42,590
2195 the pixels inside the orange bounding
2196
2197 550
2198 00:26:42,059 --> 00:26:43,690
2199 box
2200
2201 551
2202 00:26:42,589 --> 00:26:47,079
2203 were
2204
2205 552
2206 00:26:43,690 --> 00:26:49,240
2207 classified as a person then it doesn't
2208
2209 553
2210 00:26:47,079 --> 00:26:52,210
2211 really matter if this is the purple
2212
2213 554
2214 00:26:49,240 --> 00:26:53,528
2215 person or the orange person they you see
2216
2217 555
2218 00:26:52,210 --> 00:26:57,038
2219 that there are pixels from both
2220
2221 556
2222 00:26:53,528 --> 00:27:00,878
2223 instances falling inside the orange box
2224
2225 557
2226 00:26:57,038 --> 00:27:03,429
2227 this is not really reflected inside the
2228
2229 558
2230 00:27:00,878 --> 00:27:06,128
2231 mask loss whether these two instances
2232
2233 559
2234 00:27:03,429 --> 00:27:08,230
2235 inside the same box should actually be
2236
2237 560
2238 00:27:06,128 --> 00:27:11,349
2239 the same or not the only thing you're
2240
2241 561
2242 00:27:08,230 --> 00:27:15,149
2243 interested in in there is that all of
2244
2245 562
2246 00:27:11,349 --> 00:27:17,769
2247 these pixels are classified as a person
2248
2249 563
2250 00:27:15,148 --> 00:27:19,989
2251 so of course it's a problem in this
2252
2253 564
2254 00:27:17,769 --> 00:27:22,839
2255 particular case where you have two
2256
2257 565
2258 00:27:19,990 --> 00:27:25,388
2259 instances or more of the same class in
2260
2261 566
2262 00:27:22,839 --> 00:27:31,569
2263 this case the class person inside one
2264
2265 567
2266 00:27:25,388 --> 00:27:33,219
2267 bounding box so the idea here is that if
2268
2269 568
2270 00:27:31,569 --> 00:27:34,269
2271 you are actually predicting instance
2272
2273 569
2274 00:27:33,220 --> 00:27:37,298
2275 segmentations
2276
2277 570
2278 00:27:34,269 --> 00:27:40,210
2279 but the only way the actual instance is
2280
2281 571
2282 00:27:37,298 --> 00:27:43,028
2283 evaluated is through the Box loss then
2284
2285 572
2286 00:27:40,210 --> 00:27:48,308
2287 you are losing some mask quality in
2288
2289 573
2290 00:27:43,028 --> 00:27:51,669
2291 there so this is why in a subsequent CVPR
2292
2293 574
2294 00:27:48,308 --> 00:27:54,190
2295 2019 paper there were other authors that
2296
2297 575
2298 00:27:51,669 --> 00:27:57,700
2299 actually proposed what's called mask
2300
2301 576
2302 00:27:54,190 --> 00:28:00,940
2303 Scoring R-CNN which essentially added a
2304
2305 577
2306 00:27:57,700 --> 00:28:03,788
2307 mask intersection over Union head so the
2308
2309 578
2310 00:28:00,940 --> 00:28:06,220
2311 idea here is that you actually want to
2312
2313 579
2314 00:28:03,788 --> 00:28:08,259
2315 measure the intersection of a union
2316
2317 580
2318 00:28:06,220 --> 00:28:10,899
2319 between the predicted mask and the
2320
2321 581
2322 00:28:08,259 --> 00:28:13,599
2323 ground truth mask so you want to have a
2324
2325 582
2326 00:28:10,898 --> 00:28:15,969
2327 loss that is really acting on the
2328
2329 583
2330 00:28:13,599 --> 00:28:17,740
2331 instance level on the mask instance
2332
2333 584
2334 00:28:15,970 --> 00:28:23,919
2335 level and not only on the bounding box
2336
2337 585
2338 00:28:17,740 --> 00:28:25,929
2339 instance level now typically the mask
2340
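The mask intersection over union that this extra head is trained to predict can be computed directly from binary masks (a minimal sketch; the function name and toy masks are illustrative):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union between two boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

pred = np.zeros((4, 4), dtype=bool)
pred[:2, :2] = True          # predicted mask covers 4 pixels
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :] = True             # ground-truth mask covers 8 pixels
iou = mask_iou(pred, gt)     # intersection 4, union 8
```

Unlike the box confidence, this score directly penalizes a mask that leaks onto a neighbouring instance of the same class.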
2341 586
2342 00:28:23,919 --> 00:28:28,360
2343 Scoring R-CNN gives actually lower
2344
2345 587
2346 00:28:25,929 --> 00:28:30,850
2347 confidence scores than Mask R-CNN
2348
2349 588
2350 00:28:28,359 --> 00:28:34,119
2351 because of course the masks are not
2352
2353 589
2354 00:28:30,849 --> 00:28:36,579
2355 perfect but this tiny modification
2356
2357 590
2358 00:28:34,119 --> 00:28:40,418
2359 actually achieves much better results so
2360
2361 591
2362 00:28:36,579 --> 00:28:42,849
2363 just by having the proper loss this is
2364
2365 592
2366 00:28:40,419 --> 00:28:46,090
2367 something that really changes how your
2368
2369 593
2370 00:28:42,849 --> 00:28:48,158
2371 neural network is going to train and
2372
2373 594
2374 00:28:46,089 --> 00:28:52,000
2375 what kind of outputs it's going to
2376
2377 595
2378 00:28:48,159 --> 00:28:55,210
2379 predict so of course Mask R-CNN
2380
2381 596
2382 00:28:52,000 --> 00:28:57,380
2383 derives from Faster R-CNN which is a two
2384
2385 597
2386 00:28:55,210 --> 00:28:59,840
2387 stage detector and we
2388
2389 598
2390 00:28:57,380 --> 00:29:02,720
2391 saw that besides two-stage detectors we
2392
2393 599
2394 00:28:59,839 --> 00:29:05,449
2395 also had one stage detectors which were
2396
2397 600
2398 00:29:02,720 --> 00:29:09,670
2399 actually faster so now the question is
2400
2401 601
2402 00:29:05,450 --> 00:29:13,009
2403 can I also apply this one stage two stage
2404
2405 602
2406 00:29:09,670 --> 00:29:16,370
2407 concept to masks so essentially can I
2408
2409 603
2410 00:29:13,009 --> 00:29:21,230
2411 have one stage instance segmentation
2412
2413 604
2414 00:29:16,369 --> 00:29:23,149
2415 methods so as I said in detectors we had
2416
2417 605
2418 00:29:21,230 --> 00:29:26,269
2419 this difference between one stage
2420
2421 606
2422 00:29:23,150 --> 00:29:29,870
2423 methods like YOLO that were faster but
2424
2425 607
2426 00:29:26,269 --> 00:29:31,759
2427 had lower performance versus
2428
2429 608
2430 00:29:29,869 --> 00:29:34,879
2431 two stage detectors like for example
2432
2433 609
2434 00:29:31,759 --> 00:29:36,529
2435 Faster R-CNN and all variants which are of
2436
2437 610
2438 00:29:34,880 --> 00:29:40,850
2439 course slower but have a higher
2440
2441 611
2442 00:29:36,529 --> 00:29:43,519
2443 performance so now we saw as a two stage
2444
2445 612
2446 00:29:40,849 --> 00:29:46,429
2447 instance segmentation method Mask
2448
2449 613
2450 00:29:43,519 --> 00:29:48,769
2451 R-CNN which we also expect is slower
2452
2453 614
2454 00:29:46,430 --> 00:29:51,860
2455 but has actually a higher performance
2456
2457 615
2458 00:29:48,769 --> 00:29:56,990
2459 than the one stage counterpart which is
2460
2461 616
2462 00:29:51,859 --> 00:29:58,659
2463 actually YOLACT so even though masks
2464
2465 617
2466 00:29:56,990 --> 00:30:01,279
2467 are actually a very meaningful
2468
2469 618
2470 00:29:58,660 --> 00:30:05,120
2471 representation it's not so easy to
2472
2473 619
2474 00:30:01,279 --> 00:30:08,930
2475 extend YOLO to predict masks instead of
2476
2477 620
2478 00:30:05,119 --> 00:30:12,979
2479 boxes and this was already found by the
2480
2481 621
2482 00:30:08,930 --> 00:30:16,880
2483 by the creator of Yolo so it wasn't
2484
2485 622
2486 00:30:12,980 --> 00:30:19,460
2487 until 2019 that we got YOLACT
2488
2489 623
2490 00:30:16,880 --> 00:30:22,250
2491 appearing at a conference YOLACT
2492
2493 624
2494 00:30:19,460 --> 00:30:25,880
2495 stands for you only look at coefficients
2496
2497 625
2498 00:30:22,250 --> 00:30:29,000
2499 and the idea there is actually not so
2500
2501 626
2502 00:30:25,880 --> 00:30:32,090
2503 straightforward so to go from boxes to
2504
2505 627
2506 00:30:29,000 --> 00:30:34,250
2507 masks you need to actually design the
2508
2509 628
2510 00:30:32,089 --> 00:30:36,529
2511 network carefully so what they proposed
2512
2513 629
2514 00:30:34,250 --> 00:30:39,880
2515 to do was this pipeline which we'll
2516
2517 630
2518 00:30:36,529 --> 00:30:43,069
2519 analyze a bit more in detail here today
2520
2521 631
2522 00:30:39,880 --> 00:30:45,680
2523 so as a first step you actually need a
2524
2525 632
2526 00:30:43,069 --> 00:30:48,289
2527 network head that actually generates
2528
2529 633
2530 00:30:45,680 --> 00:30:50,090
2531 what they call mask prototypes so
2532
2533 634
2534 00:30:48,289 --> 00:30:55,250
2535 possible segmentations
2536
2537 635
2538 00:30:50,089 --> 00:30:58,309
2539 for a particular bounding box then you
2540
2541 636
2542 00:30:55,250 --> 00:31:00,559
2543 can generate the mask coefficients with
2544
2545 637
2546 00:30:58,309 --> 00:31:03,589
2547 another network called the prediction
2548
2549 638
2550 00:31:00,559 --> 00:31:06,889
2551 head which actually evaluates each of the
2552
2553 639
2554 00:31:03,589 --> 00:31:09,199
2555 mask prototypes and finally you will
2556
2557 640
2558 00:31:06,890 --> 00:31:10,759
2559 have a third step in which will combine
2560
2561 641
2562 00:31:09,200 --> 00:31:12,919
2563 the mask prototypes and
2564
2565 642
2566 00:31:10,759 --> 00:31:19,159
2567 mask coefficients to actually generate
2568
2569 643
2570 00:31:12,919 --> 00:31:21,799
2571 the final instance segmentation so let's
2572
2573 644
2574 00:31:19,159 --> 00:31:24,349
2575 see the architecture in a bit more
2576
2577 645
2578 00:31:21,798 --> 00:31:27,259
2579 detail we have the backbone which is
2580
2581 646
2582 00:31:24,348 --> 00:31:29,178
2583 actually ResNet-101 and the features
2584
2585 647
2586 00:31:27,259 --> 00:31:31,578
2587 are computed at different scales so we
2588
2589 648
2590 00:31:29,179 --> 00:31:34,519
2591 also have this feature pyramid which is
2592
2593 649
2594 00:31:31,578 --> 00:31:37,098
2595 now pretty much present in all object
2596
2597 650
2598 00:31:34,519 --> 00:31:40,788
2599 detectors and instance
2600
2601 651
2602 00:31:37,098 --> 00:31:43,278
2603 segmentation methods we then have the
2604
2605 652
2606 00:31:40,788 --> 00:31:46,429
2607 ProtoNet which is responsible for
2608
2609 653
2610 00:31:43,278 --> 00:31:49,398
2611 generating K prototype masks and
2612
2613 654
2614 00:31:46,429 --> 00:31:51,619
2615 actually this K has no relationship with
2616
2617 655
2618 00:31:49,398 --> 00:31:53,988
2619 the number of semantic classes but it's
2620
2621 656
2622 00:31:51,618 --> 00:31:57,769
2623 rather a hyper parameter so you generate
2624
2625 657
2626 00:31:53,989 --> 00:32:01,219
2627 a fixed number of prototype masks which
2628
2629 658
2630 00:31:57,769 --> 00:32:05,199
2631 you can somehow relate to the anchors
2632
2633 659
2634 00:32:01,219 --> 00:32:08,028
2635 that we had in bounding box detection
2636
2637 660
2638 00:32:05,199 --> 00:32:10,068
2639 the architecture of ProtoNet is
2640
2641 661
2642 00:32:08,028 --> 00:32:12,229
2643 actually a fully convolutional network
2644
2645 662
2646 00:32:10,068 --> 00:32:14,898
2647 which consists of a series of 3x3
2648
2649 663
2650 00:32:12,229 --> 00:32:17,719
2651 convolutions and then a final 1x1
2652
2653 664
2654 00:32:14,898 --> 00:32:19,698
2655 convolution that simply converts the
2656
2657 665
2658 00:32:17,719 --> 00:32:22,009
2659 number of channels into these K
2660
2661 666
2662 00:32:19,699 --> 00:32:24,769
2663 predictions that we actually need to
2664
2665 667
2666 00:32:22,009 --> 00:32:27,469
2667 have and in each of these channels at
2668
2669 668
2670 00:32:24,769 --> 00:32:31,899
2671 the end on this feature map of
2672
2673 669
2674 00:32:27,469 --> 00:32:35,359
2675 138 by 138 by K we'll have actually K
2676
2677 670
2678 00:32:31,898 --> 00:32:37,698
2679 possible mask prototypes so this is
2680
2681 671
2682 00:32:35,358 --> 00:32:40,249
2683 actually very similar to the mask branch
2684
2685 672
2686 00:32:37,699 --> 00:32:43,459
2687 in Mask R-CNN but there is no loss
2688
2689 673
2690 00:32:40,249 --> 00:32:48,588
2691 function applied at this stage so this
2692
2693 674
2694 00:32:43,459 --> 00:32:50,599
2695 is actually a very crucial difference so
2696
2697 675
2698 00:32:48,588 --> 00:32:52,848
2699 once this ProtoNet has generated the mask
2700
2701 676
2702 00:32:50,598 --> 00:32:55,098
2703 prototypes we have another network that
2704
2705 677
2706 00:32:52,848 --> 00:32:58,818
2707 predicts a coefficient for every
2708
2709 678
2710 00:32:55,098 --> 00:33:03,769
2711 predicted mask and just judges how
2712
2713 679
2714 00:32:58,818 --> 00:33:06,408
2715 reliable the mask is so the mask
2716
2717 680
2718 00:33:03,769 --> 00:33:09,888
2719 coefficients are essentially intertwined
2720
2721 681
2722 00:33:06,409 --> 00:33:12,829
2723 also with the Box predictions so
2724
2725 682
2726 00:33:09,888 --> 00:33:15,588
2727 essentially we have a series of anchor
2728
2729 683
2730 00:33:12,828 --> 00:33:18,158
2731 boxes and one can
2732
2733 684
2734 00:33:15,588 --> 00:33:22,009
2735 predict the class of the anchor box a
2736
2737 685
2738 00:33:18,159 --> 00:33:24,120
2739 regression to actually fit this anchor
2740
2741 686
2742 00:33:22,009 --> 00:33:27,450
2743 box tightly to the object
2744
2745 687
2746 00:33:24,119 --> 00:33:30,479
2747 and then the mask coefficients one per
2748
2749 688
2750 00:33:27,450 --> 00:33:33,630
2751 prototype mask and per anchor so note
2752
2753 689
2754 00:33:30,480 --> 00:33:36,569
2755 here that we have W by H and then
2756
2757 690
2758 00:33:33,630 --> 00:33:39,150
2759 multiplied by K which is the
2760
2761 691
2762 00:33:36,569 --> 00:33:42,500
2763 number of prototype masks and multiplied
2764
2765 692
2766 00:33:39,150 --> 00:33:46,800
2767 by A which is the number of anchor boxes
2768
2769 693
2770 00:33:42,500 --> 00:33:50,009
2771 so we have these K coefficients per
2772
2773 694
2774 00:33:46,799 --> 00:33:53,490
2775 anchor and these are actually the ones
2776
2777 695
2778 00:33:50,009 --> 00:33:55,769
2779 that define the mask and note actually
2780
2781 696
2782 00:33:53,490 --> 00:34:00,809
2783 how the network is similar to RetinaNet
2784
2785 697
2786 00:33:55,769 --> 00:34:02,400
2787 but a little bit shallower so
2788
2789 698
2790 00:34:00,809 --> 00:34:03,960
2791 essentially now the question is how do
2792
2793 699
2794 00:34:02,400 --> 00:34:06,540
2795 you actually generate the mask from
2796
2797 700
2798 00:34:03,960 --> 00:34:08,820
2799 these mask prototypes and these mask
2800
2801 701
2802 00:34:06,539 --> 00:34:11,070
2803 coefficients so essentially what you do
2804
2805 702
2806 00:34:08,820 --> 00:34:14,250
2807 is a linear combination between the mask
2808
2809 703
2810 00:34:11,070 --> 00:34:16,470
2811 coefficients and the mask prototypes so
2812
2813 704
2814 00:34:14,250 --> 00:34:19,139
2815 you're going to predict a mask as this
2816
2817 705
2818 00:34:16,469 --> 00:34:23,369
2819 linear combination and then pass through
2820
2821 706
2822 00:34:19,139 --> 00:34:26,849
2823 a non-linearity so in this case P is the
2824
2825 707
2826 00:34:23,369 --> 00:34:28,529
2827 H by W by K feature map the output of
2828
2829 708
2830 00:34:26,849 --> 00:34:32,549
2831 proto net which is essentially the
2832
2833 709
2834 00:34:28,530 --> 00:34:35,100
2835 matrix of prototyped masks see is an N
2836
2837 710
2838 00:34:32,550 --> 00:34:39,450
2839 by K matrix of mask coefficients that
2840
2841 711
2842 00:34:35,099 --> 00:34:41,759
2843 have survived NMS also remember that the
2844
2845 712
2846 00:34:39,449 --> 00:34:44,309
2847 coefficients are predicted together with
2848
2849 713
2850 00:34:41,760 --> 00:34:47,640
2851 the anchor boxes so the NMS can be
2852
2853 714
2854 00:34:44,309 --> 00:34:50,880
2855 applied on that side and then Sigma is a
2856
2857 715
2858 00:34:47,639 --> 00:34:54,239
2859 non-linearity so you can see here an
2860
2861 716
2862 00:34:50,880 --> 00:34:58,050
2863 example of how you actually construct
2864
2865 717
2866 00:34:54,239 --> 00:35:01,169
2867 the final mask by assembling these sort
2868
2869 718
2870 00:34:58,050 --> 00:35:03,360
2871 of pieces of masks so you have here on
2872
2873 719
2874 00:35:01,170 --> 00:35:05,340
2875 the top side
2876
2877 720
2878 00:35:03,360 --> 00:35:07,200
2879 first of all these images are the
2880
2881 721
2882 00:35:05,340 --> 00:35:09,180
2883 prototypes and these are the
2884
2885 722
2886 00:35:07,199 --> 00:35:12,679
2887 coefficients positive coefficient
2888
2889 723
2890 00:35:09,179 --> 00:35:16,169
2891 negative coefficient and then you have
2892
2893 724
2894 00:35:12,679 --> 00:35:18,569
2895 for example the mask that has this
2896
2897 725
2898 00:35:16,170 --> 00:35:23,420
2899 person with the racket here there's only
2900
2901 726
2902 00:35:18,570 --> 00:35:26,340
2903 the person then we have a negative
2904
2905 727
2906 00:35:23,420 --> 00:35:29,550
2907 weight on the prototype that actually
2908
2909 728
2910 00:35:26,340 --> 00:35:32,039
2911 represents the racket and what it
2912
2913 729
2914 00:35:29,550 --> 00:35:34,410
2915 essentially gives us is this separation
2916
2917 730
2918 00:35:32,039 --> 00:35:36,539
2919 between the two objects so the
2920
2921 731
2922 00:35:34,409 --> 00:35:36,869
2923 separation between the human the tennis
2924
2925 732
2926 00:35:36,539 --> 00:35:39,269
2927 player
2928
2929 733
2930 00:35:36,869 --> 00:35:42,389
2931 here and the racket which is actually
2932
2933 734
2934 00:35:39,269 --> 00:35:45,389
2935 another object therefore with this
2936
2937 735
2938 00:35:42,389 --> 00:35:48,210
2939 assembly of mask prototypes we can
2940
2941 736
2942 00:35:45,389 --> 00:35:51,409
2943 actually obtain the two
2944
2945 737
2946 00:35:48,210 --> 00:35:54,869
2947 instance segmentations
2948
2949 738
2950 00:35:51,409 --> 00:35:57,179
2951 separately one is the tennis player at the
2952
2953 739
2954 00:35:54,869 --> 00:36:00,929
2955 top and the other is the racket at the
2956
2957 740
2958 00:35:57,179 --> 00:36:04,289
2959 bottom and you can see that these are in
2960
2961 741
2962 00:36:00,929 --> 00:36:07,379
2963 fact the same prototypes that we use for
2964
2965 742
2966 00:36:04,289 --> 00:36:09,630
2967 both detections and it's just the
2968
2969 743
2970 00:36:07,380 --> 00:36:13,079
2971 coefficients whether they're positive or
2972
2973 744
2974 00:36:09,630 --> 00:36:16,829
2975 negative that actually define the final
2976
2977 745
2978 00:36:13,079 --> 00:36:21,619
2979 mask and thanks to these coefficients we
2980
2981 746
2982 00:36:16,829 --> 00:36:24,659
2983 can separate the person from the racket
2984
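The assembly step just described — a linear combination of the K prototypes weighted by each detection's K coefficients, passed through a sigmoid — can be sketched in pure Python (toy shapes and values; P, C and sigma follow the lecture's notation, but this is an illustration, not the YOLACT implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def assemble_masks(P, C):
    """P: H x W x K prototype map; C: N x K mask coefficients (one row
    per detection surviving NMS). Returns N masks of shape H x W where
    each entry is sigma(sum_k P[h][w][k] * C[n][k])."""
    H, W, K = len(P), len(P[0]), len(P[0][0])
    return [[[sigmoid(sum(P[h][w][k] * coeffs[k] for k in range(K)))
              for w in range(W)]
             for h in range(H)]
            for coeffs in C]

# Toy example: 2x2 image, K=2 prototypes, one detection that weights
# prototype 0 positively and prototype 1 negatively.
P = [[[4.0, 0.0], [0.0, 4.0]],
     [[4.0, 0.0], [0.0, 4.0]]]
C = [[1.0, -1.0]]
mask = assemble_masks(P, C)[0]
# Pixels where prototype 0 fires come out near 1 (kept), pixels where
# prototype 1 fires come out near 0 (suppressed by the negative weight).
```

This is exactly the person-versus-racket effect from the slide: a negative coefficient on the racket prototype erases the racket from the person's mask.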
2985 747
2986 00:36:21,619 --> 00:36:26,400
2987 now for the mask loss in YOLACT we're
2988
2989 748
2990 00:36:24,659 --> 00:36:28,889
2991 going to have a simple cross entropy
2992
2993 749
2994 00:36:26,400 --> 00:36:31,740
2995 between the assembled mask and the ground
2996
2997 750
2998 00:36:28,889 --> 00:36:33,210
2999 truth in addition to the standard losses
3000
3001 751
3002 00:36:31,739 --> 00:36:36,239
3003 which are the regression for the
3004
3005 752
3006 00:36:33,210 --> 00:36:40,320
3007 bounding box and the classification for
3008
3009 753
3010 00:36:36,239 --> 00:36:42,179
3011 the actual semantic class and the
3012
3013 754
3014 00:36:40,320 --> 00:36:45,330
3015 results are actually pretty good so we
3016
3017 755
3018 00:36:42,179 --> 00:36:48,179
3019 can see that we can segment a lot of
3020
3021 756
3022 00:36:45,329 --> 00:36:51,420
3023 semantic classes a lot of instances with
3024
3025 757
3026 00:36:48,179 --> 00:36:53,519
3027 quite a large degree of occlusion so see
3028
3029 758
3030 00:36:51,420 --> 00:36:58,500
3031 for example these two persons being
3032
3033 759
3034 00:36:53,519 --> 00:37:01,039
3035 separated correctly and the method is of
3036
3037 760
3038 00:36:58,500 --> 00:37:04,800
3039 course fast because it's a one stage
3040
3041 761
3042 00:37:01,039 --> 00:37:07,170
3043 segmentation method now for large
3044
3045 762
3046 00:37:04,800 --> 00:37:09,900
3047 objects the quality of the masks is even
3048
3049 763
3050 00:37:07,170 --> 00:37:12,539
3051 better than those of two-stage detectors
3052
3053 764
3054 00:37:09,900 --> 00:37:14,700
3055 but of course the quality is a little
3056
3057 765
3058 00:37:12,539 --> 00:37:17,190
3059 bit reduced for small objects so for
3060
3061 766
3062 00:37:14,699 --> 00:37:22,589
3063 small objects the quality is not as good
3064
3065 767
3066 00:37:17,190 --> 00:37:26,460
3067 as for Mask R-CNN now of course the
3068
3069 768
3070 00:37:22,590 --> 00:37:29,460
3071 main selling point right is the FPS so
3072
3073 769
3074 00:37:26,460 --> 00:37:32,820
3075 the frames per second that YOLACT is
3076
3077 770
3078 00:37:29,460 --> 00:37:35,220
3079 able to process and of course compared
3080
3081 771
3082 00:37:32,820 --> 00:37:38,640
3083 to Mask R-CNN we can see that
3084
3085 772
3086 00:37:35,219 --> 00:37:40,889
3087 YOLACT is actually much faster it's a
3088
3089 773
3090 00:37:38,639 --> 00:37:43,109
3091 little bit less accurate but of course
3092
3093 774
3094 00:37:40,889 --> 00:37:46,469
3095 it depends for what kind of application
3096
3097 775
3098 00:37:43,110 --> 00:37:47,910
3099 you want to use it if
3100
3101 776
3102 00:37:46,469 --> 00:37:50,489
3103 you want to have some segmentation
3104
3105 777
3106 00:37:47,909 --> 00:37:53,009
3107 output in real time then you have to
3108
3109 778
3110 00:37:50,489 --> 00:37:56,159
3111 use YOLACT this is the only method up
3112
3113 779
3114 00:37:53,010 --> 00:37:58,710
3115 to this point up to 2019
3116
3117 780
3118 00:37:56,159 --> 00:38:01,469
3119 that was actually reaching these
3120
3121 781
3122 00:37:58,710 --> 00:38:04,500
3123 real-time capabilities so of course if
3124
3125 782
3126 00:38:01,469 --> 00:38:06,599
3127 you need the real time then you might be
3128
3129 783
3130 00:38:04,500 --> 00:38:12,239
3131 able to tolerate this little bit of a
3132
3133 784
3134 00:38:06,599 --> 00:38:14,039
3135 drop in precision so then of course
3136
3137 785
3138 00:38:12,239 --> 00:38:16,229
3139 there are improvements on top of
3140
3141 786
3142 00:38:14,039 --> 00:38:18,269
3143 YOLACT right so the authors have been
3144
3145 787
3146 00:38:16,230 --> 00:38:20,789
3147 actually active in developing YOLACT
3148
3149 788
3150 00:38:18,269 --> 00:38:24,300
3151 plus plus and you can read the paper
3152
3153 789
3154 00:38:20,789 --> 00:38:26,099
3155 there there is actually a specially
3156
3157 790
3158 00:38:24,300 --> 00:38:27,510
3159 designed version of non maximum
3160
3161 791
3162 00:38:26,099 --> 00:38:30,230
3163 suppression in order to make the
3164
3165 792
3166 00:38:27,510 --> 00:38:32,790
3167 procedure even faster and an auxiliary
3168
3169 793
3170 00:38:30,230 --> 00:38:35,400
3171 semantic segmentation loss
3172
3173 794
3174 00:38:32,789 --> 00:38:38,250
3175 function which is performed on the final
3176
3177 795
3178 00:38:35,400 --> 00:38:43,380
3179 features of the FPN so trying to bring a
3180
3181 796
3182 00:38:38,250 --> 00:38:45,989
3183 little bit this idea of the FCN concept
3184
3185 797
3186 00:38:43,380 --> 00:38:48,450
3187 that actually looked at all the image in
3188
3189 798
3190 00:38:45,989 --> 00:38:55,169
3191 order to predict a semantic segmentation
3192
3193 799
3194 00:38:48,449 --> 00:38:57,480
3195 result okay so we have seen semantic
3196
3197 800
3198 00:38:55,170 --> 00:38:59,608
3199 segmentation in the previous lecture we
3200
3201 801
3202 00:38:57,480 --> 00:39:03,630
3203 have seen instance segmentation in this
3204
3205 802
3206 00:38:59,608 --> 00:39:07,469
3207 lecture and now the idea is why not
3208
3209 803
3210 00:39:03,630 --> 00:39:11,190
3211 combining both concepts so remember that
3212
3213 804
3214 00:39:07,469 --> 00:39:13,799
3215 semantic segmentation labels each of the
3216
3217 805
3218 00:39:11,190 --> 00:39:16,800
3219 pixels in the image with a unique
3220
3221 806
3222 00:39:13,800 --> 00:39:18,810
3223 semantic label and this means that all
3224
3225 807
3226 00:39:16,800 --> 00:39:21,450
3227 cars are going to have the same label
3228
3229 808
3230 00:39:18,809 --> 00:39:23,969
3231 car and it's only through instance
3232
3233 809
3234 00:39:21,449 --> 00:39:26,879
3235 segmentation that we can separate the
3236
3237 810
3238 00:39:23,969 --> 00:39:28,559
3239 different cars within this label but the
3240
3241 811
3242 00:39:26,880 --> 00:39:31,920
3243 problem with instance segmentation is
3244
3245 812
3246 00:39:28,559 --> 00:39:33,750
3247 that it does not give us any label for
3248
3249 813
3250 00:39:31,920 --> 00:39:37,500
3251 the things that are not countable for
3252
3253 814
3254 00:39:33,750 --> 00:39:40,199
3255 example the grass the sky or the road so
3256
3257 815
3258 00:39:37,500 --> 00:39:43,889
3259 the task that panoptic segmentation is
3260
3261 816
3262 00:39:40,199 --> 00:39:46,589
3263 trying to resolve is actually the task
3264
3265 817
3266 00:39:43,889 --> 00:39:49,139
3267 of combining semantic segmentation and
3268
3269 818
3270 00:39:46,590 --> 00:39:51,960
3271 instance segmentation therefore each
3272
3273 819
3274 00:39:49,139 --> 00:39:54,960
3275 pixel needs to receive a label a
3276
3277 820
3278 00:39:51,960 --> 00:39:57,690
3279 semantic label and at the same time if
3280
3281 821
3282 00:39:54,960 --> 00:40:00,449
3283 possible if the object is countable it
3284
3285 822
3286 00:39:57,690 --> 00:40:04,200
3287 needs to receive essentially an instance
3288
3289 823
3290 00:40:00,449 --> 00:40:08,069
3291 label so for semantic
3292
3293 824
3294 00:40:04,199 --> 00:40:10,980
3295 segmentation we have the FCN-like methods
3296
3297 825
3298 00:40:08,070 --> 00:40:14,220
3299 for instance segmentation we have Mask
3300
3301 826
3302 00:40:10,980 --> 00:40:17,070
3303 R-CNN and YOLACT and other derived
3304
3305 827
3306 00:40:14,219 --> 00:40:19,679
3307 methods and for panoptic segmentation we
3308
3309 828
3310 00:40:17,070 --> 00:40:21,660
3311 have what is called UPSNet of
3312
3313 829
3314 00:40:19,679 --> 00:40:24,598
3315 course now we have many more methods
3316
3317 830
3318 00:40:21,659 --> 00:40:27,210
3319 out there that are trying to solve the
3320
3321 831
3322 00:40:24,599 --> 00:40:29,130
3323 panoptic segmentation task but this is
3324
3325 832
3326 00:40:27,210 --> 00:40:31,940
3327 one of the first methods that actually
3328
3329 833
3330 00:40:29,130 --> 00:40:35,460
3331 tackled the task of panoptic segmentation
3332
3333 834
3334 00:40:31,940 --> 00:40:38,338
3335 so in panoptic segmentation we have to
3336
3337 835
3338 00:40:35,460 --> 00:40:41,190
3339 predict both the labels for uncountable
3340
3341 836
3342 00:40:38,338 --> 00:40:45,088
3343 objects which we call stuff in computer
3344
3345 837
3346 00:40:41,190 --> 00:40:46,349
3347 vision so things like sky road grass etc
3348
3349 838
3350 00:40:45,088 --> 00:40:48,059
3351 for which you cannot really
3352
3353 839
3354 00:40:46,349 --> 00:40:51,088
3355 differentiate between different
3356
3357 840
3358 00:40:48,059 --> 00:40:54,630
3359 instances and the labels are usually
3360
3361 841
3362 00:40:51,088 --> 00:40:56,519
3363 obtained with networks similar to the
3364
3365 842
3366 00:40:54,630 --> 00:40:58,619
3367 fully convolutional networks that we
3368
3369 843
3370 00:40:56,519 --> 00:41:01,829
3371 have seen for the semantic segmentation
3372
3373 844
3374 00:40:58,619 --> 00:41:05,880
3375 tasks and on the other hand one also has
3376
3377 845
3378 00:41:01,829 --> 00:41:08,130
3379 to label all the objects which belong to
3380
3381 846
3382 00:41:05,880 --> 00:41:09,990
3383 the countable classes
3384
3385 847
3386 00:41:08,130 --> 00:41:12,680
3387 so these countable objects which are
3388
3389 848
3390 00:41:09,989 --> 00:41:15,629
3391 called things in computer vision cars
3392
3393 849
3394 00:41:12,679 --> 00:41:17,279
3395 pedestrians for which we actually
3396
3397 850
3398 00:41:15,630 --> 00:41:20,400
3399 also have to differentiate between
3400
3401 851
3402 00:41:17,280 --> 00:41:22,440
3403 pixels coming from different instances
3404
3405 852
3406 00:41:20,400 --> 00:41:25,730
3407 of the same class so differentiate
3408
3409 853
3410 00:41:22,440 --> 00:41:29,010
3411 between car one car two and car three
3412
3413 854
3414 00:41:25,730 --> 00:41:32,280
3415 now of course if we just tackle the task
3416
3417 855
3418 00:41:29,010 --> 00:41:35,250
3419 with an FCN and a Mask R-CNN separately
3420
3421 856
3422 00:41:32,280 --> 00:41:38,010
3423 some pixels might get classified as
3424
3425 857
3426 00:41:35,250 --> 00:41:39,780
3427 stuff from the FCN network and at the
3428
3429 858
3430 00:41:38,010 --> 00:41:42,390
3431 same time they might be classified as
3432
3433 859
3434 00:41:39,780 --> 00:41:45,810
3435 instances of some class from Mask
3436
3437 860
3438 00:41:42,389 --> 00:41:48,719
3439 R-CNN so if we just kind of put together
3440
3441 861
3442 00:41:45,809 --> 00:41:51,809
3443 an FCN and a Mask R-CNN for both tasks we might
3444
3445 862
3446 00:41:48,719 --> 00:41:53,879
3447 have conflicting results so the solution
3448
3449 863
3450 00:41:51,809 --> 00:41:56,519
3451 that they proposed in this CVPR '19 paper
3452
3453 864
3454 00:41:53,880 --> 00:41:58,769
3455 is very simple a parameter-free
3456
3457 865
3458 00:41:56,519 --> 00:42:00,960
3459 panoptic head which actually combines
3460
3461 866
3462 00:41:58,769 --> 00:42:03,630
3463 the information from the FCN and the
3464
3465 867
3466 00:42:00,960 --> 00:42:05,699
3467 Mask R-CNN so essentially what we want
3468
3469 868
3470 00:42:03,630 --> 00:42:08,690
3471 to do is we want to create this network
3472
3473 869
3474 00:42:05,699 --> 00:42:10,679
3475 that combines both the stuff
3476
3477 870
3478 00:42:08,690 --> 00:42:13,858
3479 classification as well as the things
3480
3481 871
3482 00:42:10,679 --> 00:42:15,598
3483 classification and the network
3484
3485 872
3486 00:42:13,858 --> 00:42:17,969
3487 architecture looks like this so we have
3488
3489 873
3490 00:42:15,599 --> 00:42:19,950
3491 a set of shared features right we don't
3492
3493 874
3494 00:42:17,969 --> 00:42:22,319
3495 have two completely separate networks
3496
3497 875
3498 00:42:19,949 --> 00:42:26,279
3499 but we actually have this set of shared
3500
3501 876
3502 00:42:22,320 --> 00:42:28,590
3503 features with also feature pyramids then
3504
3505 877
3506 00:42:26,280 --> 00:42:30,599
3507 we have the semantic head on top which
3508
3509 878
3510 00:42:28,590 --> 00:42:32,940
3511 is the one that gives us the semantic
3512
3513 879
3514 00:42:30,599 --> 00:42:35,070
3515 segmentation and the instance head at
3516
3517 880
3518 00:42:32,940 --> 00:42:37,800
3519 the bottom which is essentially Mask
3520
3521 881
3522 00:42:35,070 --> 00:42:39,300
3523 R-CNN inspired and in the end we're
3524
3525 882
3526 00:42:37,800 --> 00:42:41,039
3527 going to have this panoptic head which
3528
3529 883
3530 00:42:39,300 --> 00:42:43,800
3531 actually puts the information together
3532
3533 884
3534 00:42:41,039 --> 00:42:46,108
3535 and puts the semantic logits and the
3536
3537 885
3538 00:42:43,800 --> 00:42:49,530
3539 instance logits together to create what
3540
3541 886
3542 00:42:46,108 --> 00:42:51,719
3543 they call the panoptic logits so let's
3544
3545 887
3546 00:42:49,530 --> 00:42:54,000
3547 take another look at this semantic head
3548
3549 888
3550 00:42:51,719 --> 00:42:57,329
3551 because it has some interesting design
3552
3553 889
3554 00:42:54,000 --> 00:42:59,400
3555 choices so in particular the semantic
3556
3557 890
3558 00:42:57,329 --> 00:43:02,789
3559 head the fully convolutional network
3560
3561 891
3562 00:42:59,400 --> 00:43:05,700
3563 that outputs the semantic logits or the
3564
3565 892
3566 00:43:02,789 --> 00:43:07,230
3567 semantic segmentation map and the new
3568
3569 893
3570 00:43:05,699 --> 00:43:11,069
3571 thing that they use in this architecture
3572
3573 894
3574 00:43:07,230 --> 00:43:12,719
3575 are deformable convolutions so before
3576
3577 895
3578 00:43:11,070 --> 00:43:15,390
3579 introducing the concept we're gonna
3580
3581 896
3582 00:43:12,719 --> 00:43:17,189
3583 recall the dilated convolutions because
3584
3585 897
3586 00:43:15,389 --> 00:43:20,309
3587 these two types of convolutions are
3588
3589 898
3590 00:43:17,190 --> 00:43:22,889
3591 actually very similar so if you remember
3592
3593 899
3594 00:43:20,309 --> 00:43:25,789
3595 a normal convolution which would be a
3596
3597 900
3598 00:43:22,889 --> 00:43:30,088
3599 convolution with dilation parameter one
3600
3601 901
3602 00:43:25,789 --> 00:43:35,460
3603 is the operation depicted here
3604
3605 902
3606 00:43:30,088 --> 00:43:37,170
3607 in this image a so in this case we
3608
3609 903
3610 00:43:35,460 --> 00:43:39,809
3611 have a three by three convolutional
3612
3613 904
3614 00:43:37,170 --> 00:43:42,630
3615 filter dilation one which means a normal
3616
3617 905
3618 00:43:39,809 --> 00:43:45,539
3619 convolution hence the receptive field is
3620
3621 906
3622 00:43:42,630 --> 00:43:47,730
3623 three by three with the same number of
3624
3625 907
3626 00:43:45,539 --> 00:43:50,400
3627 parameters but now with a dilation of
3628
3629 908
3630 00:43:47,730 --> 00:43:52,500
3631 two what we can get is essentially a
3632
3633 909
3634 00:43:50,400 --> 00:43:55,410
3635 receptive field of seven by seven and
3636
3637 910
3638 00:43:52,500 --> 00:43:57,869
3639 how the dilated convolution achieves
3640
3641 911
3642 00:43:55,409 --> 00:44:00,088
3643 this is by essentially spreading these
3644
3645 912
3646 00:43:57,869 --> 00:44:02,338
3647 weights and not applying them to
3648
3649 913
3650 00:44:00,088 --> 00:44:06,960
3651 neighboring pixels but applying them to
3652
3653 914
3654 00:44:02,338 --> 00:44:09,480
3655 pixels separated by two similarly if the
3656
3657 915
3658 00:44:06,960 --> 00:44:11,639
3659 dilation parameter is four then each
3660
3661 916
3662 00:44:09,480 --> 00:44:13,949
3663 element produced by it has a receptive
3664
3665 917
3666 00:44:11,639 --> 00:44:17,279
3667 field of fifteen by fifteen but we still
3668
3669 918
3670 00:44:13,949 --> 00:44:18,779
3671 have these nine parameters to learn so
3672
3673 919
3674 00:44:17,280 --> 00:44:22,589
3675 we increase the receptive field without
3676
3677 920
3678 00:44:18,780 --> 00:44:24,690
3679 increasing the number of parameters now
3680
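The numbers quoted here — a 3x3 receptive field, then 7x7, then 15x15, always with the same nine weights — come from stacking 3x3 convolutions with dilations 1, 2 and 4; the growth is easy to verify with a small framework-free helper (a sketch under that stacking assumption):

```python
def receptive_fields(kernel=3, dilations=(1, 2, 4)):
    """Receptive field after each layer of a stack of stride-1
    convolutions: a layer with dilation d widens the field by
    d * (kernel - 1) pixels."""
    rf, fields = 1, []
    for d in dilations:
        rf += d * (kernel - 1)
        fields.append(rf)
    return fields

print(receptive_fields())  # [3, 7, 15]
```

Doubling the dilation at each layer therefore grows the receptive field exponentially while the parameter count stays fixed at nine weights per layer.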
3681 921
3682 00:44:22,588 --> 00:44:27,599
3683 the deformable convolution is a very similar
3684
3685 922
3686 00:44:24,690 --> 00:44:31,349
3687 concept but now we're not going to have
3688
3689 923
3690 00:44:27,599 --> 00:44:31,740
3691 this kind of stacked dilation right in
3692
3693 924
3694 00:44:31,349 --> 00:44:35,430
3695 which
3696
3697 925
3698 00:44:31,739 --> 00:44:38,009
3699 you just spread the weights but
3700
3701 926
3702 00:44:35,429 --> 00:44:40,559
3703 always in the same way so what the
3704
3705 927
3706 00:44:38,010 --> 00:44:43,200
3707 deformable convolution actually proposes
3708
3709 928
3710 00:44:40,559 --> 00:44:45,960
3711 to do is kind of a generalization of
3712
3713 929
3714 00:44:43,199 --> 00:44:47,609
3715 dilated convolutions and in this case
3716
3717 930
3718 00:44:45,960 --> 00:44:51,000
3719 what you want to do is you want to also
3720
3721 931
3722 00:44:47,610 --> 00:44:53,910
3723 learn the offset so essentially I have
3724
3725 932
3726 00:44:51,000 --> 00:44:56,099
3727 here my weights of my 3x3 kernel
3728
3729 933
3730 00:44:53,909 --> 00:44:58,199
3731 depicted in green and now I'm going to
3732
3733 934
3734 00:44:56,099 --> 00:45:01,170
3735 learn essentially where to send them
3736
3737 935
3738 00:44:58,199 --> 00:45:04,469
3739 where to get that information to be
3740
3741 936
3742 00:45:01,170 --> 00:45:07,230
3743 multiplied by the weight so of course
3744
3745 937
3746 00:45:04,469 --> 00:45:12,809
3747 dilated convolutions are a special case
3748
3749 938
3750 00:45:07,230 --> 00:45:15,389
3751 of deformable convolutions so in this
3752
3753 939
3754 00:45:12,809 --> 00:45:18,570
3755 case what we need to do is to get our
3756
3757 940
3758 00:45:15,389 --> 00:45:21,420
3759 input feature map and with a conv layer
3760
3761 941
3762 00:45:18,570 --> 00:45:24,269
3763 we actually learn what is called the
3764
3765 942
3766 00:45:21,420 --> 00:45:27,389
3767 offset field which actually tells us
3768
3769 943
3770 00:45:24,269 --> 00:45:29,909
3771 where to send these weights where to
3772
3773 944
3774 00:45:27,389 --> 00:45:32,368
3775 multiply this weight or by which pixel
3776
3777 945
3778 00:45:29,909 --> 00:45:34,679
3779 to multiply these weights in order to
3780
3781 946
3782 00:45:32,369 --> 00:45:37,680
3783 create one pixel of the output feature
3784
3785 947
3786 00:45:34,679 --> 00:45:39,750
3787 map and you have here the formulation
3788
3789 948
3790 00:45:37,679 --> 00:45:41,669
3791 for regular convolution and deformable
3792
3793 949
3794 00:45:39,750 --> 00:45:46,739
3795 convolution in case you want to take a
3796
3797 950
3798 00:45:41,670 --> 00:45:48,980
3799 look so of course with respect to
3800
3801 951
3802 00:45:46,739 --> 00:45:51,329
3803 standard convolutions
3804
3805 952
3806 00:45:48,980 --> 00:45:54,539
3807 deformable convolutions are much more
3808
3809 953
3810 00:45:51,329 --> 00:45:56,549
3811 flexible so you can see here a couple of
3812
3813 954
3814 00:45:54,539 --> 00:45:59,880
3815 convolutional layers that are acting on
3816
3817 955
3818 00:45:56,550 --> 00:46:02,580
3819 this on this image and how they create
3820
3821 956
3822 00:45:59,880 --> 00:46:05,490
3823 one pixel in the output space at the top
3824
3825 957
3826 00:46:02,579 --> 00:46:06,920
3827 and of course the deformable convolution
3828
3829 958
3830 00:46:05,489 --> 00:46:10,079
3831 will pick the values at different
3832
3833 959
3834 00:46:06,920 --> 00:46:13,200
3835 locations in order to compute the
3836
3837 960
3838 00:46:10,079 --> 00:46:15,659
3839 actual convolution operation and these
3840
3841 961
3842 00:46:13,199 --> 00:46:18,480
3843 locations will be conditioned on the
3844
3845 962
3846 00:46:15,659 --> 00:46:20,339
3847 input image therefore you can imagine
3848
3849 963
3850 00:46:18,480 --> 00:46:22,740
3851 that if you actually have an object with
3852
3853 964
3854 00:46:20,340 --> 00:46:24,960
3855 a lot of fine structures you're going
3856
3857 965
3858 00:46:22,739 --> 00:46:27,569
3859 to really place your weights in precise
3860
3861 966
3862 00:46:24,960 --> 00:46:30,869
3863 location in order to get the most useful
3864
3865 967
3866 00:46:27,570 --> 00:46:33,720
3867 information and not just spread the
3868
3869 968
3870 00:46:30,869 --> 00:46:36,050
3871 values like in normal convolutions so
3872
3873 969
3874 00:46:33,719 --> 00:46:39,750
3875 this is actually a very interesting
3876
3877 970
3878 00:46:36,050 --> 00:46:41,039
3879 operation for segmentation outputs for
3880
3881 971
3882 00:46:39,750 --> 00:46:45,050
3883 when you actually need to have these
3884
3885 972
3886 00:46:41,039 --> 00:46:45,050
3887 pixel wise accurate outputs
3888
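As a rough illustration of the sampling rule — each kernel weight w(p_n) is applied at p_0 + p_n + delta_p_n instead of on the fixed grid — here is a single-pixel, single-channel sketch in pure Python (nearest-neighbour sampling for brevity; the actual deformable convolution uses bilinear interpolation, and all names here are illustrative):

```python
def deformable_sample(x, weights, offsets, p0):
    """One output value of a deformable convolution at location p0.
    x: 2D input; weights: dict p_n -> w(p_n); offsets: dict p_n ->
    learned (dy, dx). A regular convolution is the special case where
    every offset is (0, 0)."""
    H, W = len(x), len(x[0])
    out = 0.0
    for (py, px), w in weights.items():
        dy, dx = offsets.get((py, px), (0.0, 0.0))
        # sample at p0 + p_n + delta_p_n, clamped to the image
        r = min(max(int(round(p0[0] + py + dy)), 0), H - 1)
        c = min(max(int(round(p0[1] + px + dx)), 0), W - 1)
        out += w * x[r][c]
    return out

# 3x3 averaging kernel on a 5x5 ramp image x[r][c] = 5r + c
x = [[float(5 * r + c) for c in range(5)] for r in range(5)]
w = {(dy, dx): 1.0 / 9.0 for dy in (-1, 0, 1) for dx in (-1, 0, 1)}
plain = deformable_sample(x, w, {}, (2, 2))             # regular conv
bent = deformable_sample(x, w, {(0, 0): (1.0, 0.0)}, (2, 2))
# Shifting just the centre tap one row down changes where that weight
# is applied, so bent differs from plain.
```

In the real layer these offsets are themselves the output of a convolution over the input, which is what makes the sampling pattern input-dependent.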
3889 973
3890 00:46:45,119 --> 00:46:51,240
3891 so let's go now into the panoptic head so
3892
3893 974
3894 00:46:47,639 --> 00:46:53,670
3895 what this panoptic head does is put together the mask
3896
3897 975
3898 00:46:51,239 --> 00:46:57,358
3899 output and the semantic information
3900
3901 976
3902 00:46:53,670 --> 00:47:00,329
3903 output so we have at the top the mask
3904
3905 977
3906 00:46:57,358 --> 00:47:02,670
3907 what they call logits from the instance
3908
3909 978
3910 00:47:00,329 --> 00:47:05,490
3911 head right this comes from a Mask
3912
3913 979
3914 00:47:02,670 --> 00:47:08,490
3915 R-CNN type of head that tells you where
3916
3917 980
3918 00:47:05,489 --> 00:47:11,518
3919 these instances are located and what is
3920
3921 981
3922 00:47:08,489 --> 00:47:13,919
3923 the extent of them the object logits
3924
3925 982
3926 00:47:11,518 --> 00:47:16,229
3927 come from the semantic head and tell you
3928
3929 983
3930 00:47:13,920 --> 00:47:18,809
3931 what is the probability that this pixel
3932
3933 984
3934 00:47:16,230 --> 00:47:21,809
3935 belongs to the class car the class
3936
3937 985
3938 00:47:18,809 --> 00:47:23,970
3939 person etc etc and then you had the
3940
3941 986
3942 00:47:21,809 --> 00:47:27,420
3943 stuff logits which is exactly the same
3944
3945 987
3946 00:47:23,969 --> 00:47:30,929
3947 but for the classes which are not
3948
3949 988
3950 00:47:27,420 --> 00:47:32,940
3951 countable like sky for example or road now for
3952
3953 989
3954 00:47:30,929 --> 00:47:34,710
3955 the stuff classes one needs to do
3956
3957 990
3958 00:47:32,940 --> 00:47:36,798
3959 nothing right there are no instances
3960
3961 991
3962 00:47:34,710 --> 00:47:39,659
3963 there so one needs to do really nothing
3964
3965 992
3966 00:47:36,798 --> 00:47:43,230
3967 so the stuff logits can be evaluated
3968
3969 993
3970 00:47:39,659 --> 00:47:45,598
3971 directly but the objects actually need
3972
3973 994
3974 00:47:43,230 --> 00:47:47,818
3975 to be masked by their instance right
3976
3977 995
3978 00:47:45,599 --> 00:47:50,778
3979 look at this example where we have this
3980
3981 996
3982 00:47:47,818 --> 00:47:55,920
3983 instance of I think this is a car and
3984
3985 997
3986 00:47:50,778 --> 00:47:58,679
3987 actually all of like the extent of the
3988
3989 998
3990 00:47:55,920 --> 00:48:01,739
3991 image depicts more things than just this
3992
3993 999
3994 00:47:58,679 --> 00:48:05,578
3995 car but in the end since we're going to
3996
3997 1000
3998 00:48:01,739 --> 00:48:08,880
3999 have this mask logit this instance logit
4000
4001 1001
4002 00:48:05,579 --> 00:48:12,509
4003 that actually is going to evaluate only
4004
4005 1002
4006 00:48:08,880 --> 00:48:14,579
4007 this part of the semantic head right the
4008
4009 1003
4010 00:48:12,509 --> 00:48:18,028
4011 rest doesn't really apply because we
4012
4013 1004
4014 00:48:14,579 --> 00:48:20,489
4015 know that the object is bounded by this
4016
4017 1005
4018 00:48:18,028 --> 00:48:23,400
4019 bounding box therefore we are only
4020
4021 1006
4022 00:48:20,489 --> 00:48:28,619
4023 interested in having a good prediction
4024
4025 1007
4026 00:48:23,400 --> 00:48:31,950
4027 inside this instance box that is coming
4028
4029 1008
4030 00:48:28,619 --> 00:48:33,660
4031 from the instance head therefore what
4032
4033 1009
4034 00:48:31,949 --> 00:48:35,278
4035 we're going to do is we're going to
4036
4037 1010
4038 00:48:33,659 --> 00:48:38,818
4039 perform this operation where we
4040
4041 1011
4042 00:48:35,278 --> 00:48:41,099
4043 essentially cut what we're interested in
4044
4045 1012
4046 00:48:38,818 --> 00:48:43,079
4047 what is actually inside the box
4048
4049 1013
4050 00:48:41,099 --> 00:48:46,769
4051 that we're interested in we cut the
4052
4053 1014
4054 00:48:43,079 --> 00:48:49,769
4055 semantic map with the instance map and
4056
4057 1015
4058 00:48:46,768 --> 00:48:52,469
4059 then this is what we're going to use for
4060
4061 1016
4062 00:48:49,768 --> 00:48:55,318
4063 our instance logits which are depicted
4064
4065 1017
4066 00:48:52,469 --> 00:48:58,618
4067 here in green and the rest of the image
4068
4069 1018
4070 00:48:55,318 --> 00:49:02,579
4071 the rest that doesn't fit into an
4072
4073 1019
4074 00:48:58,619 --> 00:49:04,680
4075 instance is going to end up in this last
4076
4077 1020
4078 00:49:02,579 --> 00:49:09,539
4079 channel here which is actually the
4080
4081 1021
4082 00:49:04,679 --> 00:49:12,358
4083 channel unknown and so we're going to
4084
4085 1022
4086 00:49:09,539 --> 00:49:16,349
4087 perform a softmax over the panoptic logits
4088
4089 1023
4090 00:49:12,358 --> 00:49:19,469
4091 and the key here is that if the maximum
4092
4093 1024
4094 00:49:16,349 --> 00:49:22,048
4095 value falls into the first stuff
4096
4097 1025
4098 00:49:19,469 --> 00:49:26,730
4099 channels then it belongs to one of the
4100
4101 1026
4102 00:49:22,048 --> 00:49:29,278
4103 stuff classes right otherwise the index
4104
4105 1027
4106 00:49:26,730 --> 00:49:32,460
4107 of the maximum value tells us the
4108
4109 1028
4110 00:49:29,278 --> 00:49:34,559
4111 instance ID that the pixel belongs to
4112
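The decision rule described here — concatenate the stuff logits, the unknown logit, and the masked instance logits for each pixel, then take the maximum — might look like this for a single pixel (a toy sketch, not the UPSNet code; since softmax is monotonic, the argmax can be read off the logits directly):

```python
def panoptic_decision(stuff_logits, unknown_logit, instance_logits):
    """Per-pixel decision over the concatenated panoptic logits.
    Returns ('stuff', class_index), ('unknown', None) or
    ('instance', instance_id)."""
    logits = list(stuff_logits) + [unknown_logit] + list(instance_logits)
    best = max(range(len(logits)), key=lambda i: logits[i])
    n_stuff = len(stuff_logits)
    if best < n_stuff:
        return ("stuff", best)          # one of the stuff classes
    if best == n_stuff:
        return ("unknown", None)        # the extra unknown channel
    return ("instance", best - n_stuff - 1)

# Two stuff classes (say sky, road) and two instances: this pixel's
# strongest evidence is instance 1, so there is no conflict to resolve.
print(panoptic_decision([0.1, 0.3], -1.0, [0.2, 2.5]))  # ('instance', 1)
```

Because a single argmax is taken over all channels at once, a pixel can never be simultaneously stuff and part of an instance.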
4113 1029
4114 00:49:32,460 --> 00:49:36,568
4115 and like this there are no conflicts
4116
4117 1030
4118 00:49:34,559 --> 00:49:39,569
4119 right because we take the maximum over
4120
4121 1031
4122 00:49:36,568 --> 00:49:42,058
4123 all the channels over the instance or
4124
4125 1032
4126 00:49:39,568 --> 00:49:44,460
4127 the unknown and over the stuff so we
4128
4129 1033
4130 00:49:42,059 --> 00:49:47,789
4131 have to make a decision between these
4132
4133 1034
4134 00:49:44,460 --> 00:49:51,539
4135 three and the last thing that we need to
4136
4137 1035
4138 00:49:47,789 --> 00:49:54,509
4139 know is actually how to use this unknown
4140
4141 1036
4142 00:49:51,539 --> 00:49:57,749
4143 trust this last class which doesn't
4144
4145 1037
4146 00:49:54,509 --> 00:49:59,309
4147 belong to instances or to stuff and this
4148
4149 1038
4150 00:49:57,748 --> 00:50:01,318
4151 is something that I would actually
4152
4153 1039
4154 00:49:59,309 --> 00:50:03,839
4155 recommend you to read how it exactly
4156
4157 1040
4158 00:50:01,318 --> 00:50:07,230
4159 works and what are the details on how to
4160
4161 1041
4162 00:50:03,838 --> 00:50:11,969
4163 use this unknown class in the CVPR 2019
4164
4165 1042
4166 00:50:07,230 --> 00:50:14,960
4167 paper excellent so we can now move to
4168
4169 1043
4170 00:50:11,969 --> 00:50:17,219
4171 the metrics right we know how to measure
4172
4173 1044
4174 00:50:14,960 --> 00:50:19,710
4175 semantic segmentation we know how to
4176
4177 1045
4178 00:50:17,219 --> 00:50:22,998
4179 measure object detection but how do we
4180
4181 1046
4182 00:50:19,710 --> 00:50:26,759
4183 measure panoptic segmentation quality
4184
4185 1047
4186 00:50:22,998 --> 00:50:30,238
4187 now the panoptic quality measure contains
4188
4189 1048
4190 00:50:26,759 --> 00:50:32,789
4191 two parts the first term
4192
4193 1049
4194 00:50:30,239 --> 00:50:35,548
4195 measures the segmentation quality
4196
4197 1050
4198 00:50:32,789 --> 00:50:38,460
4199 meaning how close the predicted segments
4200
4201 1051
4202 00:50:35,548 --> 00:50:40,170
4203 are to the ground truth segments and you
4204
4205 1052
4206 00:50:38,460 --> 00:50:43,679
4207 can see that this is measured with the
4208
4209 1053
4210 00:50:40,170 --> 00:50:47,130
4211 iou the intersection over union of the
4212
4213 1054
4214 00:50:43,679 --> 00:50:50,118
4215 two true positive masks so essentially
4216
4217 1055
4218 00:50:47,130 --> 00:50:53,190
4219 not taking into account bad predictions
4220
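Written out, the panoptic quality for one class is PQ = (sum of IoU over TP) / (|TP| + 0.5|FP| + 0.5|FN|), which factors into exactly these two parts: segmentation quality (the mean IoU of the true positives) times recognition quality (an F1-style detection term). A small sketch, assuming predictions have already been matched to the ground truth:

```python
def panoptic_quality(tp_ious, n_fp, n_fn):
    """tp_ious: IoU of each matched (true-positive) segment pair.
    PQ = sum(tp_ious) / (|TP| + 0.5 * |FP| + 0.5 * |FN|)."""
    n_tp = len(tp_ious)
    denom = n_tp + 0.5 * n_fp + 0.5 * n_fn
    return sum(tp_ious) / denom if denom > 0 else 0.0

# Three matched segments, one missed ground-truth instance (FN) and
# one spurious prediction (FP): SQ = 2.4/3 = 0.8, RQ = 3/4 = 0.75,
# so pq comes out as 0.6 (up to float rounding).
pq = panoptic_quality([0.9, 0.8, 0.7], n_fp=1, n_fn=1)
```

Missed and spurious segments only enter through the denominator, which is why they are described next in terms of false negatives and false positives.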
4221 1056
4222 00:50:50,119 --> 00:50:55,650
4223 so this is just one part of the panoptic
4224
4225 1057
4226 00:50:53,190 --> 00:50:58,349
4227 quality measure the part of the
4228
4229 1058
4230 00:50:55,650 --> 00:51:01,108
4231 segmentation quality and now we go
4232
4233 1059
4234 00:50:58,349 --> 00:51:04,289
4235 towards measuring the actual recognition
4236
4237 1060
4238 00:51:01,108 --> 00:51:06,719
4239 quality so the same as for detection we
4240
4241 1061
4242 00:51:04,289 --> 00:51:09,390
4243 actually want to know if we are missing
4244
4245 1062
4246 00:51:06,719 --> 00:51:12,538
4247 any instances which would lead to false
4248
4249 1063
4250 00:51:09,389 --> 00:51:14,759
negatives or we are predicting more
4252
4253 1064
4254 00:51:12,539 --> 00:51:17,699
4255 instances which would lead to false
4256
4257 1065
4258 00:51:14,759 --> 00:51:20,159
4259 positives so we have in this case an
4260
4261 1066
4262 00:51:17,699 --> 00:51:23,130
4263 example of this ground truth where we
4264
4265 1067
4266 00:51:20,159 --> 00:51:26,909
4267 have three persons three instances we
4268
4269 1068
4270 00:51:23,130 --> 00:51:28,798
4271 have a dog and in our prediction the sky
4272
4273 1069
4274 00:51:26,909 --> 00:51:31,379
4275 and the grass are predicted more or less
4276
4277 1070
4278 00:51:28,798 --> 00:51:34,228
4279 correctly but there is one person
4280
4281 1071
4282 00:51:31,380 --> 00:51:37,709
4283 missing and the dog is actually miss
4284
4285 1072
4286 00:51:34,228 --> 00:51:39,629
4287 classified as a person so in this case
4288
4289 1073
4290 00:51:37,708 --> 00:51:41,219
4291 we would have several true positives
4292
4293 1074
4294 00:51:39,630 --> 00:51:43,979
right we'd have this light brown
4296
4297 1075
4298 00:51:41,219 --> 00:51:46,318
4299 person matched we'd have the orange
4300
4301 1076
4302 00:51:43,978 --> 00:51:48,899
4303 person matched but this dark brown
4304
4305 1077
4306 00:51:46,318 --> 00:51:50,938
4307 person is actually not found in your
4308
4309 1078
4310 00:51:48,900 --> 00:51:54,059
4311 prediction therefore it is a false
4312
4313 1079
4314 00:51:50,938 --> 00:51:57,298
4315 negative and at the same time we have a
4316
4317 1080
4318 00:51:54,059 --> 00:51:59,819
4319 false positive from this person which is
4320
4321 1081
4322 00:51:57,298 --> 00:52:03,239
4323 actually in the wrong class so the
4324
4325 1082
4326 00:51:59,818 --> 00:52:06,028
4327 prediction is the wrong class so this
4328
4329 1083
4330 00:52:03,239 --> 00:52:08,818
whole false positive and true
4332
4333 1084
4334 00:52:06,028 --> 00:52:11,338
4335 positive computation needs to be done in
4336
4337 1085
4338 00:52:08,818 --> 00:52:12,838
4339 a similar way as we did for detection so
4340
4341 1086
4342 00:52:11,338 --> 00:52:15,298
4343 essentially there needs to be a way to
4344
4345 1087
4346 00:52:12,838 --> 00:52:17,489
4347 match ground truth and predictions and
4348
4349 1088
4350 00:52:15,298 --> 00:52:20,518
4351 in this case we actually have to do
4352
4353 1089
4354 00:52:17,489 --> 00:52:22,739
4355 segment matching so the segment is
4356
4357 1090
4358 00:52:20,518 --> 00:52:25,858
actually matched if the intersection

1091
00:52:22,739 --> 00:52:28,048
over union is above 0.5 and as with
4364
4365 1092
4366 00:52:25,858 --> 00:52:32,938
4367 detections no pixel can actually belong
4368
4369 1093
4370 00:52:28,048 --> 00:52:36,929
4371 to two predicted segments so in this
4372
4373 1094
4374 00:52:32,938 --> 00:52:39,958
4375 case where we have one cat but we
4376
4377 1095
4378 00:52:36,929 --> 00:52:42,298
4379 actually have two predictions we compute
4380
4381 1096
4382 00:52:39,958 --> 00:52:44,308
4383 the intersection over union of the two
4384
4385 1097
4386 00:52:42,298 --> 00:52:46,018
4387 predictions and we find that the blue
4388
4389 1098
4390 00:52:44,309 --> 00:52:49,048
prediction has an intersection over union

1099
00:52:46,018 --> 00:52:51,988
of 0.6 therefore is considered as true
4396
4397 1100
4398 00:52:49,048 --> 00:52:53,849
4399 positive but the top part of the cat is
4400
4401 1101
4402 00:52:51,989 --> 00:52:55,918
considered as a false positive in pink
4404
4405 1102
4406 00:52:53,849 --> 00:53:00,778
because the intersection over union is
4408
4409 1103
4410 00:52:55,918 --> 00:53:05,449
4411 0.4 and of course no two segments can
4412
4413 1104
4414 00:53:00,778 --> 00:53:05,449
4415 actually belong to the same ground truth
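The matching and scoring rules described in the cues above can be sketched in a few lines of NumPy (a minimal illustration of PQ = SQ x RQ with the IoU > 0.5 matching rule; the mask format and function names are my own, not the official panopticapi code):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def panoptic_quality(preds, gts):
    """preds/gts: lists of (class_id, boolean mask) segments.
    A prediction matches a ground-truth segment of the same
    class iff IoU > 0.5; that threshold makes matches unique."""
    matched_gt = set()
    tp_ious = []
    fp = 0
    for pc, pm in preds:
        best = None
        for i, (gc, gm) in enumerate(gts):
            if i in matched_gt or gc != pc:
                continue
            v = iou(pm, gm)
            if v > 0.5:           # at most one match can exceed 0.5
                best = (i, v)
                break
        if best is None:
            fp += 1               # wrong class or IoU <= 0.5
        else:
            matched_gt.add(best[0])
            tp_ious.append(best[1])
    fn = len(gts) - len(matched_gt)
    tp = len(tp_ious)
    if tp + fp + fn == 0:
        return 1.0                # nothing to match
    sq = sum(tp_ious) / tp if tp else 0.0    # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)     # recognition quality
    return sq * rq                           # PQ = SQ * RQ
```

In the lecture's example (one person matched exactly, the dog misclassified as a person, one person missed) this gives one TP with IoU 1, one FP and one FN, so PQ = 1 x 1/(1 + 0.5 + 0.5) = 0.5.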
4416
4417 1105
4418 00:53:06,650 --> 00:53:11,818
now qualitative results for panoptic
4420
4421 1106
4422 00:53:09,688 --> 00:53:15,719
4423 segmentation actually look pretty good
4424
4425 1107
4426 00:53:11,818 --> 00:53:19,619
4427 so it's quite nice what computer vision
4428
4429 1108
4430 00:53:15,719 --> 00:53:23,039
can do nowadays and we can predict stuff
4432
4433 1109
4434 00:53:19,619 --> 00:53:25,910
4435 classes without the instance IDs as well
4436
4437 1110
4438 00:53:23,039 --> 00:53:28,190
as things classes
4440
4441 1111
4442 00:53:25,909 --> 00:53:31,399
with the correct semantic label and also
4444
4445 1112
4446 00:53:28,190 --> 00:53:33,409
4447 by separating the different instances so
4448
4449 1113
4450 00:53:31,400 --> 00:53:35,358
4451 for example here we have the separation
4452
4453 1114
4454 00:53:33,409 --> 00:53:43,368
of the different classes into
4456
4457 1115
4458 00:53:35,358 --> 00:53:46,728
4459 different instances okay so finally I
4460
4461 1116
4462 00:53:43,369 --> 00:53:48,949
4463 will present a third way of doing
4464
4465 1117
4466 00:53:46,728 --> 00:53:51,739
instance segmentation and this is a way
4468
4469 1118
4470 00:53:48,949 --> 00:53:55,489
4471 that I particularly like and it is
4472
4473 1119
4474 00:53:51,739 --> 00:54:00,440
inspired by what we used to do a little
4476
4477 1120
4478 00:53:55,489 --> 00:54:04,009
bit before CNNs came in so all the R-CNN

1121
00:54:00,440 --> 00:54:08,568
methods all the methods in the R-CNN

1122
00:54:04,009 --> 00:54:12,108
family or even deformable part
4488
4489 1123
4490 00:54:08,568 --> 00:54:15,170
4491 models use a sliding window approach for
4492
4493 1124
4494 00:54:12,108 --> 00:54:17,509
detection so essentially the basic idea
4496
4497 1125
4498 00:54:15,170 --> 00:54:21,979
we have seen is to densely enumerate box
4500
4501 1126
4502 00:54:17,509 --> 00:54:23,900
4503 proposals and then classify them so this
4504
4505 1127
4506 00:54:21,978 --> 00:54:25,998
4507 is a successful paradigm we have seen
4508
4509 1128
4510 00:54:23,900 --> 00:54:28,579
it's well engineered it achieves SOTA
4512
4513 1129
4514 00:54:25,998 --> 00:54:31,598
4515 results and most of the state-of-the-art
4516
4517 1130
4518 00:54:28,579 --> 00:54:36,380
4519 methods are still based on this paradigm
4520
4521 1131
4522 00:54:31,599 --> 00:54:39,469
nonetheless before DPM before R-CNN we
4524
4525 1132
4526 00:54:36,380 --> 00:54:41,599
4527 used to do detection as voting or let's
4528
4529 1133
4530 00:54:39,469 --> 00:54:44,028
4531 say one of the paradigms that existed
4532
4533 1134
4534 00:54:41,599 --> 00:54:47,959
for detection was the one of voting
4536
4537 1135
4538 00:54:44,028 --> 00:54:51,018
4539 and this was way before we had actual
4540
4541 1136
4542 00:54:47,958 --> 00:54:54,440
4543 convolutional neural networks so what do
4544
4545 1137
4546 00:54:51,018 --> 00:54:57,649
I mean by Hough voting so in this case we
4548
4549 1138
4550 00:54:54,440 --> 00:55:00,170
4551 can see the very simple example in which
4552
4553 1139
4554 00:54:57,650 --> 00:55:03,380
4555 we want to detect analytical shapes for
4556
4557 1140
4558 00:55:00,170 --> 00:55:06,469
example lines as peaks in a dual
4560
4561 1141
4562 00:55:03,380 --> 00:55:08,329
4563 parametric space so essentially what we
4564
4565 1142
4566 00:55:06,469 --> 00:55:11,630
4567 would have is each pixel would cast
4568
4569 1143
4570 00:55:08,329 --> 00:55:13,489
a vote in this dual space and then we
4572
4573 1144
4574 00:55:11,630 --> 00:55:16,009
4575 will detect the peaks in the dual space
4576
4577 1145
4578 00:55:13,489 --> 00:55:18,940
4579 and kind of back project them to the
4580
4581 1146
4582 00:55:16,009 --> 00:55:22,219
image space to detect for example a line
4584
4585 1147
4586 00:55:18,940 --> 00:55:25,369
so let me put it into a visual
4588
4589 1148
4590 00:55:22,219 --> 00:55:28,940
4591 example so we want to do line detection
4592
4593 1149
4594 00:55:25,369 --> 00:55:31,219
4595 and all we have in order to detect a
4596
4597 1150
4598 00:55:28,940 --> 00:55:33,650
4599 line are different points that are
4600
4601 1151
4602 00:55:31,219 --> 00:55:35,809
4603 placed on top of a line so essentially
4604
4605 1152
4606 00:55:33,650 --> 00:55:38,660
what we would want to do is to fit a line
4608
4609 1153
4610 00:55:35,809 --> 00:55:39,350
4611 through these points now what we can do
4612
4613 1154
4614 00:55:38,659 --> 00:55:41,809
is have
4616
4617 1155
4618 00:55:39,349 --> 00:55:46,190
4619 each point in the image space for
4620
4621 1156
4622 00:55:41,809 --> 00:55:50,259
4623 example in this case x0 y0 actually cast
4624
4625 1157
4626 00:55:46,190 --> 00:55:53,929
a vote into the Hough parameter space and
4628
4629 1158
4630 00:55:50,260 --> 00:55:57,560
4631 this vote actually takes the form of a
4632
4633 1159
4634 00:55:53,929 --> 00:56:00,859
4635 line that crosses that point so it is
4636
4637 1160
4638 00:55:57,559 --> 00:56:06,079
4639 important to note that this line which
4640
4641 1161
4642 00:56:00,860 --> 00:56:09,289
4643 is parametrized by M and B is always
4644
4645 1162
4646 00:56:06,079 --> 00:56:11,179
4647 going to cross the point x0 y0 so
4648
4649 1163
4650 00:56:09,289 --> 00:56:13,989
4651 essentially this represents all the
4652
4653 1164
4654 00:56:11,179 --> 00:56:18,579
lines that are going to cross that point
4656
4657 1165
4658 00:56:13,989 --> 00:56:21,229
4659 now if you take another point x1 y1
4660
4661 1166
4662 00:56:18,579 --> 00:56:23,299
we're going to cast another vote we're
4664
4665 1167
4666 00:56:21,230 --> 00:56:26,329
going to cast another line in the Hough
4668
4669 1168
4670 00:56:23,300 --> 00:56:29,030
4671 parameter space and so for all the
4672
4673 1169
4674 00:56:26,329 --> 00:56:31,610
4675 points of this line that we're trying to
4676
4677 1170
4678 00:56:29,030 --> 00:56:33,350
4679 find out we're going to cast votes in
4680
4681 1171
4682 00:56:31,610 --> 00:56:35,300
4683 the parameter space we're going to cast
4684
4685 1172
4686 00:56:33,349 --> 00:56:38,210
4687 all of these lines in this parameter
4688
4689 1173
4690 00:56:35,300 --> 00:56:39,980
4691 space and then what we're going to do is
4692
4693 1174
4694 00:56:38,210 --> 00:56:42,380
4695 we're going to go to the parameter space
4696
4697 1175
4698 00:56:39,980 --> 00:56:45,349
to the Hough parameter space and we're
4700
4701 1176
4702 00:56:42,380 --> 00:56:47,840
going to read out the maxima from this
4704
4705 1177
4706 00:56:45,349 --> 00:56:50,420
4707 parameter space and in this case the
4708
4709 1178
4710 00:56:47,840 --> 00:56:52,460
maximum is going to be this point here
4712
4713 1179
4714 00:56:50,420 --> 00:56:55,010
4715 where all the lines are going to cross
4716
4717 1180
4718 00:56:52,460 --> 00:56:59,780
4719 and this point here where all the lines
4720
4721 1181
4722 00:56:55,010 --> 00:57:03,170
4723 cross is a point represented by a value
4724
4725 1182
4726 00:56:59,780 --> 00:57:06,260
of M and a value of B that represent the
4728
4729 1183
4730 00:57:03,170 --> 00:57:12,680
4731 line that actually best fits all of
4732
4733 1184
4734 00:57:06,260 --> 00:57:14,720
these points that have cast votes so
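The (m, b) voting just described can be sketched as a tiny accumulator (a toy illustration; real Hough implementations such as OpenCV's use the (rho, theta) parameterization to handle vertical lines, and the bin grids here are made up):

```python
import numpy as np

def hough_lines(points, m_bins, b_bins):
    """Each point (x, y) votes for every (m, b) cell satisfying
    y = m*x + b, i.e. the dual-space line b = -x*m + y; the
    accumulator peak is the best-fit line through the points."""
    acc = np.zeros((len(m_bins), len(b_bins)), dtype=int)
    half = (b_bins[1] - b_bins[0]) / 2
    for x, y in points:
        for i, m in enumerate(m_bins):
            b = y - m * x                       # dual-space vote
            j = np.argmin(np.abs(b_bins - b))   # nearest b cell
            if abs(b_bins[j] - b) <= half:      # inside the grid
                acc[i, j] += 1
    i, j = np.unravel_index(acc.argmax(), acc.shape)
    return m_bins[i], b_bins[j], acc

# points lying on the line y = 2x + 1
pts = [(0, 1), (1, 3), (2, 5), (3, 7)]
m, b, acc = hough_lines(pts, np.linspace(-3, 3, 13), np.linspace(-5, 5, 21))
```

All four dual-space lines intersect in one cell, so the accumulator peak sits at (m, b) = (2, 1) with four votes.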
4736
4737 1185
4738 00:57:12,679 --> 00:57:17,299
let's see how we can actually use this
4740
4741 1186
4742 00:57:14,719 --> 00:57:19,969
to actually perform object detection not
4744
4745 1187
4746 00:57:17,300 --> 00:57:23,630
4747 only line detection but object detection
4748
4749 1188
4750 00:57:19,969 --> 00:57:26,329
4751 in the form of voting so the idea is
4752
4753 1189
4754 00:57:23,630 --> 00:57:29,150
4755 that objects are going to be detected as
4756
4757 1190
4758 00:57:26,329 --> 00:57:32,989
4759 consistent configuration of observed
4760
4761 1191
4762 00:57:29,150 --> 00:57:37,010
4763 parts so in this case we have a car and
4764
4765 1192
4766 00:57:32,989 --> 00:57:39,709
4767 we know that a car has two wheels and we
4768
4769 1193
4770 00:57:37,010 --> 00:57:41,990
4771 know that these two wheels are always
4772
4773 1194
4774 00:57:39,710 --> 00:57:44,570
4775 roughly in the same position with
4776
4777 1195
4778 00:57:41,989 --> 00:57:48,319
4779 respect for example to the center of the
4780
4781 1196
4782 00:57:44,570 --> 00:57:51,590
4783 car so the rough idea would be to use
4784
4785 1197
4786 00:57:48,320 --> 00:57:53,019
4787 the wheel patch and whenever this wheel
4788
4789 1198
4790 00:57:51,590 --> 00:57:56,318
4791 patch is detected
4792
4793 1199
4794 00:57:53,018 --> 00:57:59,648
4795 I'm going to cast a vote for the center
4796
4797 1200
4798 00:57:56,318 --> 00:58:02,108
4799 of the car and I know that the center of
4800
4801 1201
4802 00:57:59,648 --> 00:58:03,909
4803 the car is always going to have the same
4804
4805 1202
4806 00:58:02,108 --> 00:58:06,909
4807 relationship with respect to the wheel
4808
4809 1203
4810 00:58:03,909 --> 00:58:09,999
4811 more or less therefore I'm going to cast
4812
4813 1204
4814 00:58:06,909 --> 00:58:11,829
votes from the wheels of the car the
4816
4817 1205
4818 00:58:09,998 --> 00:58:12,879
4819 window of the car at the back of the car
4820
4821 1206
4822 00:58:11,829 --> 00:58:15,130
4823 at the front of the car
4824
4825 1207
4826 00:58:12,880 --> 00:58:17,739
4827 they're all gonna cast votes to the
4828
4829 1208
4830 00:58:15,130 --> 00:58:20,108
4831 center of the car and by detecting these
4832
4833 1209
4834 00:58:17,739 --> 00:58:24,969
peaks I'm going to be able to know if
4836
4837 1210
4838 00:58:20,108 --> 00:58:26,708
4839 there was indeed a car there or not so
4840
4841 1211
4842 00:58:24,969 --> 00:58:28,809
4843 let's look at this in more detail how
4844
4845 1212
4846 00:58:26,708 --> 00:58:31,958
4847 can we actually train a method to
4848
4849 1213
4850 00:58:28,809 --> 00:58:35,140
perform this kind of voting so what I'm going
4852
4853 1214
4854 00:58:31,958 --> 00:58:38,048
to do is first extract some
4856
4857 1215
4858 00:58:35,139 --> 00:58:41,588
4859 features from the image so again this is
4860
4861 1216
4862 00:58:38,048 --> 00:58:45,998
before CNNs so what we used to use in that

1217
00:58:41,588 --> 00:58:48,998
period were keypoint detection methods

1218
00:58:45,998 --> 00:58:52,268
interest keypoint detection methods for

1219
00:58:48,998 --> 00:58:54,548
example SIFT or SURF and this basically
4876
4877 1220
4878 00:58:52,268 --> 00:58:57,968
4879 extracted interesting points from the
4880
4881 1221
4882 00:58:54,548 --> 00:59:00,759
4883 image salient points from the image now
4884
4885 1222
4886 00:58:57,969 --> 00:59:04,208
4887 from this point what we did is we placed
4888
4889 1223
4890 00:59:00,759 --> 00:59:07,539
4891 a patch centered around this interest
4892
4893 1224
4894 00:59:04,208 --> 00:59:10,298
4895 point and this patch had to cast a vote
4896
4897 1225
4898 00:59:07,539 --> 00:59:12,880
4899 for the center of the object so of
4900
4901 1226
4902 00:59:10,298 --> 00:59:14,798
4903 course we took the interest points that
4904
4905 1227
4906 00:59:12,880 --> 00:59:16,479
4907 were on top of the object and these were
4908
4909 1228
4910 00:59:14,798 --> 00:59:19,809
4911 casting a vote for the center of the
4912
4913 1229
4914 00:59:16,478 --> 00:59:21,638
4915 object which we had as ground truth so
4916
4917 1230
4918 00:59:19,809 --> 00:59:24,429
4919 this was our training procedure right
4920
4921 1231
4922 00:59:21,639 --> 00:59:28,689
each patch had to learn how to vote
4924
4925 1232
4926 00:59:24,429 --> 00:59:31,479
for a center point and at test time what

1233
00:59:28,688 --> 00:59:34,618
we would do is get the original image

1234
00:59:31,478 --> 00:59:39,428
compute these interest points
4936
4937 1235
4938 00:59:34,619 --> 00:59:42,608
4939 find the similarity between the interest
4940
4941 1236
4942 00:59:39,429 --> 00:59:44,648
point and the codebook entries so
4944
4945 1237
4946 00:59:42,608 --> 00:59:48,608
4947 essentially find the most similar
4948
4949 1238
4950 00:59:44,648 --> 00:59:52,118
4951 patches from our training set and then
4952
4953 1239
4954 00:59:48,608 --> 00:59:55,259
4955 use the votes from the training set in
4956
4957 1240
4958 00:59:52,119 --> 00:59:57,969
4959 order to vote for the center of the car
4960
4961 1241
4962 00:59:55,259 --> 01:00:00,219
4963 now this is an interesting example for
4964
4965 1242
4966 00:59:57,969 --> 01:00:02,289
4967 the car here right because the front
4968
4969 1243
4970 01:00:00,219 --> 01:00:05,420
4971 wheel and the back wheel have very very
4972
4973 1244
4974 01:00:02,289 --> 01:00:07,760
4975 similar appearance so it is very likely
4976
4977 1245
4978 01:00:05,420 --> 01:00:10,608
4979 that they're both going to vote or
4980
4981 1246
4982 01:00:07,760 --> 01:00:13,400
4983 they're going to vote with the same
4984
4985 1247
4986 01:00:10,608 --> 01:00:16,670
4987 likelihood for the center of the car but
4988
4989 1248
4990 01:00:13,400 --> 01:00:19,099
4991 also for let's say the symmetrical
4992
4993 1249
4994 01:00:16,670 --> 01:00:21,519
4995 position so in this case this patch
4996
4997 1250
4998 01:00:19,099 --> 01:00:24,048
4999 would vote for the center but also for
5000
5001 1251
5002 01:00:21,519 --> 01:00:26,358
5003 this position here which would be the
5004
5005 1252
5006 01:00:24,048 --> 01:00:28,789
5007 center if this was the back wheel and
5008
5009 1253
5010 01:00:26,358 --> 01:00:30,619
5011 not the front wheel and the same happens
5012
5013 1254
5014 01:00:28,789 --> 01:00:32,900
for the back wheel which votes for the
5016
5017 1255
5018 01:00:30,619 --> 01:00:33,818
5019 real center but also for this point here
5020
5021 1256
5022 01:00:32,900 --> 01:00:38,298
5023 in the back
5024
5025 1257
5026 01:00:33,818 --> 01:00:41,358
5027 now the key idea here is that we're
5028
5029 1258
5030 01:00:38,298 --> 01:00:43,219
5031 going to cast a lot of votes right so
5032
5033 1259
5034 01:00:41,358 --> 01:00:44,960
the windows are going to vote the front
5036
5037 1260
5038 01:00:43,219 --> 01:00:47,239
5039 of the car is going to vote the door is
5040
5041 1261
5042 01:00:44,960 --> 01:00:49,369
5043 going to vote and you're gonna have
5044
5045 1262
5046 01:00:47,239 --> 01:00:52,519
5047 votes all over the place but there is
5048
5049 1263
5050 01:00:49,369 --> 01:00:56,809
5051 going to be a concentration of votes in
5052
5053 1264
5054 01:00:52,519 --> 01:00:59,269
5055 the center of the actual object so in
5056
5057 1265
5058 01:00:56,809 --> 01:01:01,670
this Hough parametric space we can find
5060
5061 1266
5062 01:00:59,269 --> 01:01:04,940
5063 this peak of votes right where the votes
5064
5065 1267
5066 01:01:01,670 --> 01:01:07,818
5067 really concentrate now once we have done
5068
5069 1268
5070 01:01:04,940 --> 01:01:10,880
5071 this then we can do segmentation like
5072
5073 1269
5074 01:01:07,818 --> 01:01:14,179
5075 this because we can go back to the image
5076
5077 1270
5078 01:01:10,880 --> 01:01:17,240
5079 space and we can say let's look at all
5080
5081 1271
5082 01:01:14,179 --> 01:01:21,548
5083 the patches that voted for this position
5084
5085 1272
5086 01:01:17,239 --> 01:01:25,548
5087 here now we gather all of these patches
5088
5089 1273
5090 01:01:21,548 --> 01:01:28,068
5091 we perform some further processing and
5092
5093 1274
5094 01:01:25,548 --> 01:01:35,358
5095 now we have a rough segmentation of the
5096
5097 1275
5098 01:01:28,068 --> 01:01:35,960
5099 object so this was the method back in
5100
5101 1276
5102 01:01:35,358 --> 01:01:39,429
5103 the past
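That patch-to-center voting with back-projection can be sketched as a toy generalized Hough transform (the part positions and learned offsets below are invented for the example; this is not the original implicit shape model code):

```python
import numpy as np

def vote_for_centers(parts, offsets, shape):
    """parts: (x, y) positions where a part (e.g. a wheel) fired;
    offsets: the (dx, dy) displacements to the object center that
    this part type learned at training time. Returns the vote
    accumulator and, per cell, the parts that voted there."""
    acc = np.zeros(shape, dtype=int)
    voters = {}
    for p in parts:
        for dx, dy in offsets:
            cx, cy = p[0] + dx, p[1] + dy
            if 0 <= cx < shape[0] and 0 <= cy < shape[1]:
                acc[cx, cy] += 1                      # cast the vote
                voters.setdefault((cx, cy), []).append(p)
    return acc, voters

# two wheels; each votes both for the front-wheel hypothesis and
# the symmetric back-wheel hypothesis, as in the lecture's example
wheels = [(2, 5), (8, 5)]
offsets = [(3, -2), (-3, -2)]       # learned wheel-to-center shifts
acc, voters = vote_for_centers(wheels, offsets, (12, 12))
peak = np.unravel_index(acc.argmax(), acc.shape)
supporting_parts = voters[peak]     # back-projection: rough support
```

The votes scatter, but only the true center collects two votes; back-projecting the voters of that peak recovers which parts (and hence roughly which pixels) belong to the object.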
5104
5105 1277
5106 01:01:35,960 --> 01:01:44,059
no CNNs here so now we come back to
5108
5109 1278
5110 01:01:39,429 --> 01:01:47,659
5111 2020 and we're going to present a paper
5112
5113 1279
5114 01:01:44,059 --> 01:01:51,200
called pixel consensus voting for
5116
5117 1280
5118 01:01:47,659 --> 01:01:54,588
panoptic segmentation which was
5120
5121 1281
5122 01:01:51,199 --> 01:01:58,399
published actually at CVPR 2020 so really
5124
5125 1282
5126 01:01:54,588 --> 01:02:02,328
5127 recent research happening that merges
5128
5129 1283
5130 01:01:58,400 --> 01:02:04,670
5131 this concept of voting with the modern
5132
5133 1284
5134 01:02:02,329 --> 01:02:08,660
5135 CNN and the power of CNN feature
5136
5137 1285
5138 01:02:04,670 --> 01:02:11,869
5139 extraction so the overview of this
5140
5141 1286
5142 01:02:08,659 --> 01:02:15,649
5143 method is we're going to use of course
5144
5145 1287
5146 01:02:11,869 --> 01:02:18,900
our FPN backbone to extract features
5148
5149 1288
5150 01:02:15,650 --> 01:02:21,568
we are going to have a semantic

1289
01:02:18,900 --> 01:02:24,838
segmentation branch at the top so the same
5156
5157 1290
5158 01:02:21,568 --> 01:02:27,630
5159 that we have been seeing so far and the
5160
5161 1291
5162 01:02:24,838 --> 01:02:30,778
5163 interesting part is this part here so
5164
5165 1292
5166 01:02:27,630 --> 01:02:34,140
5167 it's this second branch and it's the
5168
5169 1293
5170 01:02:30,778 --> 01:02:36,809
5171 instance voting branch and it actually
5172
5173 1294
5174 01:02:34,139 --> 01:02:39,750
5175 predicts for every pixel whether the
5176
5177 1295
5178 01:02:36,809 --> 01:02:42,839
5179 pixel is part of an instance mask and if
5180
5181 1296
5182 01:02:39,750 --> 01:02:46,528
5183 so the relative location of the instance
5184
5185 1297
5186 01:02:42,838 --> 01:02:51,298
mask centroid so the same idea as the paper
5188
5189 1298
5190 01:02:46,528 --> 01:02:53,909
that actually used SIFT to extract
5192
5193 1299
5194 01:02:51,298 --> 01:02:55,679
5195 meaningful patches of course this is a
5196
5197 1300
5198 01:02:53,909 --> 01:02:58,170
5199 much more powerful representation
5200
5201 1301
5202 01:02:55,679 --> 01:03:01,828
5203 because the authors what they proposed
5204
5205 1302
5206 01:02:58,170 --> 01:03:05,010
to do was to put this instance voting
5208
5209 1303
5210 01:03:01,829 --> 01:03:07,200
branch or to code it into operations
5212
5213 1304
5214 01:03:05,010 --> 01:03:09,750
5215 that you can fully back propagate
5216
5217 1305
5218 01:03:07,199 --> 01:03:14,278
5219 through and therefore you can train this
5220
5221 1306
5222 01:03:09,750 --> 01:03:17,309
whole thing end to end so in a nutshell
5224
5225 1307
5226 01:03:14,278 --> 01:03:19,710
5227 how does this method work well first of
5228
5229 1308
5230 01:03:17,309 --> 01:03:22,048
5231 all we need to make a decision for each
5232
5233 1309
5234 01:03:19,710 --> 01:03:24,449
pixel and what we're going to do is
5236
5237 1310
5238 01:03:22,048 --> 01:03:26,519
5239 we're going to discretize the regions
5240
5241 1311
5242 01:03:24,449 --> 01:03:28,169
5243 around each pixel right the pixel has a
5244
5245 1312
5246 01:03:26,519 --> 01:03:30,119
5247 neighborhood we're interested in looking
5248
5249 1313
5250 01:03:28,170 --> 01:03:33,358
5251 at the neighborhood in order to make a
5252
5253 1314
5254 01:03:30,119 --> 01:03:36,750
5255 decision about the centroid right so
5256
5257 1315
5258 01:03:33,358 --> 01:03:38,940
5259 each pixel has to vote for the centroid
5260
5261 1316
5262 01:03:36,750 --> 01:03:40,949
5263 of the object that it belongs to so it
5264
5265 1317
5266 01:03:38,940 --> 01:03:45,210
5267 needs to have an idea of what is going
5268
5269 1318
5270 01:03:40,949 --> 01:03:48,179
on around it now every pixel is going to
5272
5273 1319
5274 01:03:45,210 --> 01:03:51,179
vote for its centroid if it belongs
5276
5277 1320
5278 01:03:48,179 --> 01:03:55,409
5279 to the category stuff so essentially no
5280
5281 1321
5282 01:03:51,179 --> 01:03:57,868
instance like road or sky or grass then
5284
5285 1322
5286 01:03:55,409 --> 01:04:01,558
5287 you're going to vote essentially for an
5288
5289 1323
5290 01:03:57,869 --> 01:04:04,500
5291 extra class which is no centroid but the
5292
5293 1324
5294 01:04:01,559 --> 01:04:06,269
5295 main idea is that every pixel is going
5296
5297 1325
5298 01:04:04,500 --> 01:04:09,088
to vote for a centroid
5300
5301 1326
5302 01:04:06,269 --> 01:04:11,940
5303 if the centroid is located in this area
5304
5305 1327
5306 01:04:09,088 --> 01:04:14,400
5307 here if the center is not located in
5308
5309 1328
5310 01:04:11,940 --> 01:04:19,200
5311 this area here this is ignored for
5312
5313 1329
5314 01:04:14,400 --> 01:04:21,240
5315 training now in a third step we're going
5316
5317 1330
5318 01:04:19,199 --> 01:04:24,419
to have this vote aggregation right same
5320
5321 1331
5322 01:04:21,239 --> 01:04:30,929
as we saw in the Hough space this vote
5324
5325 1332
5326 01:04:24,420 --> 01:04:32,380
aggregation of each pixel and this is
5328
5329 1333
5330 01:04:30,929 --> 01:04:35,009
basically cast
5332
5333 1334
5334 01:04:32,380 --> 01:04:38,079
into this accumulator space right and
5336
5337 1335
5338 01:04:35,009 --> 01:04:42,670
this casting is very nicely formulated
5340
5341 1336
5342 01:04:38,079 --> 01:04:44,710
as a dilated transpose convolution in a
5344
5345 1337
5346 01:04:42,670 --> 01:04:46,990
5347 fourth step we're going to detect these
5348
5349 1338
5350 01:04:44,710 --> 01:04:48,730
objects as these peaks in this case we
5352
5353 1339
5354 01:04:46,989 --> 01:04:51,578
5355 have these three objects these three
5356
5357 1340
5358 01:04:48,730 --> 01:04:54,730
5359 peaks and finally we're going to do
5360
5361 1341
5362 01:04:51,579 --> 01:04:56,829
5363 again a back projection of the peaks so
5364
5365 1342
5366 01:04:54,730 --> 01:04:59,199
5367 same as we presented for the method
5368
5369 1343
5370 01:04:56,829 --> 01:05:01,599
5371 before you look at who voted for that
5372
5373 1344
5374 01:04:59,199 --> 01:05:03,909
center you go back to the image space
5376
5377 1345
5378 01:05:01,599 --> 01:05:05,588
5379 and you can obtain the masks which are
5380
5381 1346
5382 01:05:03,909 --> 01:05:08,558
5383 all the pixels that voted for that
5384
5385 1347
5386 01:05:05,588 --> 01:05:10,659
center and the category information the
5388
5389 1348
5390 01:05:08,559 --> 01:05:15,160
5391 semantic information is provided by the
5392
5393 1349
5394 01:05:10,659 --> 01:05:17,199
parallel semantic segmentation head okay
5396
5397 1350
5398 01:05:15,159 --> 01:05:18,969
5399 so now the interesting thing is how to
5400
5401 1351
5402 01:05:17,199 --> 01:05:21,759
5403 implement this into a neural network
5404
5405 1352
5406 01:05:18,969 --> 01:05:24,548
5407 right so what the authors proposed to do
5408
5409 1353
5410 01:05:21,759 --> 01:05:27,818
5411 is to have what they call a voting
5412
5413 1354
5414 01:05:24,548 --> 01:05:30,670
5415 lookup table so first of all we need to
5416
5417 1355
5418 01:05:27,818 --> 01:05:32,980
5419 discretize the region around the pixel
5420
5421 1356
5422 01:05:30,670 --> 01:05:35,889
5423 right I am a pixel I need to cast a vote
5424
5425 1357
5426 01:05:32,980 --> 01:05:38,079
5427 for my centroid and I need to know where
5428
5429 1358
5430 01:05:35,889 --> 01:05:40,268
5431 to cast this vote so the first thing
5432
5433 1359
5434 01:05:38,079 --> 01:05:43,720
5435 that I'm going to do is I'm going to
5436
5437 1360
5438 01:05:40,268 --> 01:05:46,000
place this voting filter centered around
5440
5441 1361
5442 01:05:43,719 --> 01:05:48,939
the pixel that has to cast a vote and
5444
5445 1362
5446 01:05:46,000 --> 01:05:52,900
what this voting filter does is it
5448
5449 1363
5450 01:05:48,940 --> 01:05:55,809
converts these M by N cells right this
5452
5453 1364
5454 01:05:52,900 --> 01:05:59,139
square centered on this pixel into 17
5456
5457 1365
5458 01:05:55,809 --> 01:06:03,519
5459 indices so essentially I can cast a vote
5460
5461 1366
5462 01:05:59,139 --> 01:06:06,400
5463 for 17 positions note that there's of
5464
5465 1367
5466 01:06:03,518 --> 01:06:09,368
5467 course much more resolution closer to
5468
5469 1368
5470 01:06:06,400 --> 01:06:11,650
to the pixel and much less
5472
5473 1369
5474 01:06:09,369 --> 01:06:15,759
resolution as we go further and further away
5476
5477 1370
5478 01:06:11,650 --> 01:06:18,940
5479 from the object now in this case if I am
5480
5481 1371
5482 01:06:15,759 --> 01:06:21,880
5483 this instance mask and I'm this pixel in
5484
5485 1372
5486 01:06:18,940 --> 01:06:25,869
the instance mask my center is the red
5488
5489 1373
5490 01:06:21,880 --> 01:06:28,778
5491 square so I basically need to cast the
5492
5493 1374
5494 01:06:25,869 --> 01:06:33,190
5495 vote for the center which is going to be
5496
5497 1375
5498 01:06:28,778 --> 01:06:35,739
at position 16 so as the blue pixel I'm
5500
5501 1376
5502 01:06:33,190 --> 01:06:39,548
5503 going to cast the vote which is actually
5504
5505 1377
5506 01:06:35,739 --> 01:06:41,949
5507 the value 16 and thanks to the voting
5508
5509 1378
5510 01:06:39,548 --> 01:06:45,509
5511 filter I know exactly what this value
5512
5513 1379
5514 01:06:41,949 --> 01:06:45,509
5515 means in the image space
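A voting-filter style discretization — fine cells next to the pixel, coarser cells further out, 17 indices in total — could look like this (a hypothetical layout chosen only to illustrate the idea; the paper's actual voting filter uses its own fixed grid):

```python
def offset_to_index(dx, dy, rings=(1, 3, 7)):
    """Map a pixel-to-centroid offset to a discrete vote index.
    Cells are finer near the pixel: the inner 3x3 region keeps
    all 9 exact cells, each outer ring only keeps 4 coarse
    quadrants, for 9 + 4 + 4 = 17 indices in total.
    (Hypothetical layout for illustration, not the paper's.)"""
    r = max(abs(dx), abs(dy))            # Chebyshev distance
    if r <= rings[0]:                    # fine 3x3 region: cells 0..8
        return (dy + 1) * 3 + (dx + 1)
    for k, radius in enumerate(rings[1:], start=1):
        if r <= radius:                  # coarse ring: 4 quadrants
            quad = (dy >= 0) * 2 + (dx >= 0)
            return 9 + (k - 1) * 4 + quad    # cells 9..16
    return None                          # too far: ignored in training
```

A pixel whose centroid falls in a far cell thus casts a coarse vote (one of the outer indices, like the lecture's index 16), while nearby centroids get pixel-accurate votes, and centroids outside the filter are ignored during training.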
5516
5517 1380
5518 01:06:46,250 --> 01:06:51,230
now at inference the instance voting
5520
5521 1381
5522 01:06:48,769 --> 01:06:54,500
5523 branch actually provides a tensor and
5524
5525 1382
5526 01:06:51,230 --> 01:06:58,010
5527 this tensor has size H by W so image
5528
5529 1383
5530 01:06:54,500 --> 01:07:01,179
5531 size and the number of channels is k
5532
5533 1384
5534 01:06:58,010 --> 01:07:06,080
plus 1 so essentially K positions
5536
5537 1385
5538 01:07:01,179 --> 01:07:10,159
5539 remember we had 17 indices here so 17
5540
5541 1386
5542 01:07:06,079 --> 01:07:12,858
5543 positions that I can vote for plus 1
5544
5545 1387
5546 01:07:10,159 --> 01:07:16,309
which is basically one class for all
5548
5549 1388
5550 01:07:12,858 --> 01:07:20,179
5551 the classes that are not countable so
5552
5553 1389
5554 01:07:16,309 --> 01:07:23,090
5555 the sky the grass or the road and now
5556
5557 1390
5558 01:07:20,179 --> 01:07:25,940
the idea is that I want to accumulate
5560
5561 1391
5562 01:07:23,090 --> 01:07:30,260
5563 the votes in my accumulator space and
5564
5565 1392
5566 01:07:25,940 --> 01:07:32,780
5567 how do I actually want to do that well
5568
5569 1393
5570 01:07:30,260 --> 01:07:35,750
remember again our example of the blue
5572
5573 1394
5574 01:07:32,780 --> 01:07:38,300
pixel we get a vote for index 16 with
5576
5577 1395
5578 01:07:35,750 --> 01:07:41,119
5579 very high probability right probability
5580
5581 1396
5582 01:07:38,300 --> 01:07:44,150
is 0.9 and this comes from a softmax
5584
5585 1397
5586 01:07:41,119 --> 01:07:47,329
5587 output with these 17 classifications
5588
5589 1398
5590 01:07:44,150 --> 01:07:49,820
plus 1 17 classes sorry plus 1 where
5592
5593 1399
5594 01:07:47,329 --> 01:07:53,420
5595 each class is one of these positions in
5596
5597 1400
5598 01:07:49,820 --> 01:07:55,369
5599 the voting filter now what I need to do
5600
5601 1401
5602 01:07:53,420 --> 01:07:59,869
is I basically need to transfer this
5604
5605 1402
5606 01:07:55,369 --> 01:08:01,789
5607 0.9 value to the cell number 16 so I'm
5608
5609 1403
5610 01:07:59,869 --> 01:08:04,309
5611 going to do this with a dilated
5612
5613 1404
5614 01:08:01,789 --> 01:08:07,880
5615 transpose convolution right I'm going to
5616
5617 1405
5618 01:08:04,309 --> 01:08:09,619
5619 place the value in there and then the
5620
5621 1406
5622 01:08:07,880 --> 01:08:11,690
other operation I need to do in this
5624
5625 1407
5626 01:08:09,619 --> 01:08:15,320
5627 particular case is to evenly distribute
5628
5629 1408
5630 01:08:11,690 --> 01:08:18,980
5631 this value among the pixels and for this
5632
5633 1409
5634 01:08:15,320 --> 01:08:22,969
5635 I'm going to do average poly so both of
5636
5637 1410
5638 01:08:18,979 --> 01:08:25,968
5639 these operations are very familiar for
5640
5641 1411
5642 01:08:22,969 --> 01:08:27,920
5643 for deep learning people for the
5644
5645 1412
5646 01:08:25,969 --> 01:08:30,289
5647 convolutional neural networks with no
5648
5649 1413
5650 01:08:27,920 --> 01:08:32,569
5651 transpose convolution in this case it's
5652
5653 1414
5654 01:08:30,289 --> 01:08:34,250
5655 a fixed dilated convolution and we know
5656
5657 1415
5658 01:08:32,569 --> 01:08:36,589
5659 pulling in this case it's average
5660
5661 1416
5662 01:08:34,250 --> 01:08:39,439
5663 pooling so we can actually map all of
5664
5665 1417
5666 01:08:36,588 --> 01:08:43,359
5667 these voting operations into essentially
5668
5669 1418
5670 01:08:39,439 --> 01:08:46,159
5671 convolutional neural network operations
5672
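[Editor's note: the placement step just described — scattering each pixel's vote probability into the accumulator cell that its channel points at, which is exactly what a transposed convolution with a fixed, non-learned kernel computes — can be sketched in a few lines. This is an illustrative toy with invented sizes, not the authors' implementation:]

```python
import numpy as np

# Tiny 4x4 image with a 3x3 voting neighborhood: each of the 9
# channels corresponds to one relative center position.
H, W = 4, 4
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

# Per-pixel vote probabilities over the relative positions
# (softmax outputs in the real model; here one confident pixel).
votes = np.zeros((H, W, len(offsets)))
votes[2, 2, 0] = 0.9  # pixel (2,2) votes "my center is one up, one left"

# Fixed transposed convolution = scatter each probability into the
# cell its channel points at; the one-hot kernel selects that cell.
accumulator = np.zeros((H, W))
for y in range(H):
    for x in range(W):
        for k, (dy, dx) in enumerate(offsets):
            cy, cx = y + dy, x + dx
            if 0 <= cy < H and 0 <= cx < W:
                accumulator[cy, cx] += votes[y, x, k]

print(accumulator[1, 1])  # 0.9 landed one cell up-left of (2,2)
```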
1419
01:08:43,359 --> 01:08:48,500
now these transpose convolutions they

1420
01:08:46,159 --> 01:08:51,229
need to take this single value in

1421
01:08:48,500 --> 01:08:53,509
the input and by multiplying it with a

1422
01:08:51,229 --> 01:08:56,389
kernel they will distribute this value

1423
01:08:53,509 --> 01:08:58,670
in the output map now this kernel is

1424
01:08:56,390 --> 01:08:59,850
actually going to define the amount of

1425
01:08:58,670 --> 01:09:02,880
the input value

1426
01:08:59,850 --> 01:09:05,160
that is being distributed and of course

1427
01:09:02,880 --> 01:09:07,199
when we talk about transpose convolution

1428
01:09:05,159 --> 01:09:10,889
in general we talk about learned transpose

1429
01:09:07,199 --> 01:09:14,010
convolution however for this particular

1430
01:09:10,890 --> 01:09:16,109
purpose of vote aggregation we actually

1431
01:09:14,010 --> 01:09:19,380
fix the kernel parameters and these

1432
01:09:16,109 --> 01:09:22,950
kernel parameters are this one-hot

1433
01:09:19,380 --> 01:09:25,230
encoding across each channel that marks

1434
01:09:22,949 --> 01:09:29,039
the target location so of course we know

1435
01:09:25,229 --> 01:09:31,829
exactly where to place this vote so this

1436
01:09:29,039 --> 01:09:33,630
is going to be a fixed operation we're

1437
01:09:31,829 --> 01:09:37,859
not going to have learnable parameters

1438
01:09:33,630 --> 01:09:40,380
in there now this is a more detailed

1439
01:09:37,859 --> 01:09:42,750
example of the voting and of the

1440
01:09:40,380 --> 01:09:47,130
implementation and you're welcome to

1441
01:09:42,750 --> 01:09:50,039
take a look at it at home now for object

1442
01:09:47,130 --> 01:09:53,340
detection what happens then this is an

1443
01:09:50,039 --> 01:09:56,489
example where we have several

1444
01:09:53,340 --> 01:09:59,369
objects motorbikes a person we cast these

1445
01:09:56,489 --> 01:10:01,969
votes and now we detect these peaks in

1446
01:09:59,369 --> 01:10:04,970
the heat map and these peaks essentially

1447
01:10:01,970 --> 01:10:08,010
determine the consensus between

1448
01:10:04,970 --> 01:10:10,170
different pixels in the image in the

1449
01:10:08,010 --> 01:10:13,110
instance that have all voted for the

1450
01:10:10,170 --> 01:10:14,789
same center so by simply thresholding

1451
01:10:13,109 --> 01:10:17,789
and doing some connected component

1452
01:10:14,789 --> 01:10:22,170
analysis we can detect the center for

1453
01:10:17,789 --> 01:10:24,269
all of these objects now we need to do

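[Editor's note: the thresholding plus connected-component step just described can be sketched as follows. Heat-map values and sizes are invented; a real implementation would typically use a library routine such as `scipy.ndimage.label` instead of the hand-rolled flood fill below:]

```python
import numpy as np
from collections import deque

# Toy accumulator heat map: two clusters of votes.
heat = np.zeros((6, 8))
heat[1, 1] = heat[1, 2] = 0.8  # votes for object A's center
heat[4, 6] = 0.7               # votes for object B's center

mask = heat > 0.5              # threshold the accumulator
labels = np.zeros(mask.shape, dtype=int)
current = 0
for seed in zip(*np.nonzero(mask)):
    if labels[seed]:
        continue
    current += 1               # new connected component, grown by BFS
    queue = deque([seed])
    labels[seed] = current
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                    and mask[ny, nx] and not labels[ny, nx]):
                labels[ny, nx] = current
                queue.append((ny, nx))

# One center (component centroid) per detected object.
centers = [tuple(np.mean(np.argwhere(labels == c), axis=0))
           for c in range(1, current + 1)]
print(centers)
```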
1454
01:10:22,170 --> 01:10:27,750
the back projection now we need to

1455
01:10:24,270 --> 01:10:29,760
localize the mask of these objects so

1456
01:10:27,750 --> 01:10:33,770
for every peak we need to determine

1457
01:10:29,760 --> 01:10:37,710
which pixels voted for this center and

1458
01:10:33,770 --> 01:10:41,780
therefore favored this region to be the

1459
01:10:37,710 --> 01:10:44,609
center above all other possibilities and

1460
01:10:41,779 --> 01:10:47,039
you see actually that by doing this back

1461
01:10:44,609 --> 01:10:49,829
projection we get fairly good results

1462
01:10:47,039 --> 01:10:52,680
right I mean look at this instance here

1463
01:10:49,829 --> 01:10:55,409
by just looking at the pixels that voted

1464
01:10:52,680 --> 01:11:00,030
for the center we already have quite a

1465
01:10:55,409 --> 01:11:02,639
good segmentation of this person so in

1466
01:11:00,029 --> 01:11:04,649
order to determine which pixel

1467
01:11:02,640 --> 01:11:07,770
could have voted for a specific object

1468
01:11:04,649 --> 01:11:09,809
center the authors propose to use what

1469
01:11:07,770 --> 01:11:12,750
they call a query filter which is

1470
01:11:09,810 --> 01:11:13,409
essentially a spatial inversion of the

1471
01:11:12,750 --> 01:11:17,219
voting

1472
01:11:13,408 --> 01:11:20,460
filter so see how it is horizontally and

1473
01:11:17,219 --> 01:11:24,329
vertically flipped so the question is if

1474
01:11:20,460 --> 01:11:27,389
when I did the voting I voted for pixel

1475
01:11:24,329 --> 01:11:30,539
eight position eight to be my center

1476
01:11:27,389 --> 01:11:33,029
now during back projection I look at

1477
01:11:30,539 --> 01:11:36,149
this pixel and I say well the

1478
01:11:33,029 --> 01:11:38,849
bottom-left pixel right here where

1479
01:11:36,149 --> 01:11:42,388
this area is should have actually voted for

1480
01:11:38,850 --> 01:11:44,880
eight if I'm the instance center right

1481
01:11:42,389 --> 01:11:48,000
this is exactly the opposite operation

1482
01:11:44,880 --> 01:11:50,880
and essentially applying this

1483
01:11:48,000 --> 01:11:52,948
query filter is how you can do this

1484
01:11:50,880 --> 01:11:55,739
vote aggregation this back

1485
01:11:52,948 --> 01:11:58,908
projection where you actually find out

1486
01:11:55,738 --> 01:12:02,638
which pixels voted for you as a center

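[Editor's note: the query-filter idea — flipping the voting neighborhood horizontally and vertically so that, for a detected center, we ask each surrounding pixel "did your vote point at me?" — can be sketched like this. It is a toy under assumed sizes, not the paper's code:]

```python
import numpy as np

H, W = 4, 4
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

# Per-pixel vote probabilities; two pixels voted for center (1,1).
votes = np.zeros((H, W, len(offsets)))
votes[2, 2, 0] = 0.9  # channel 0 = (-1,-1): (2,2) voted for (1,1)
votes[1, 2, 3] = 0.8  # channel 3 = (0,-1):  (1,2) voted for (1,1)

def back_project(center, votes, thresh=0.5):
    """Pixels that voted for `center`: for offset (dy,dx) we query
    pixel (cy-dy, cx-dx) — the spatially flipped filter — about
    its channel (dy,dx)."""
    cy, cx = center
    mask = np.zeros(votes.shape[:2], dtype=bool)
    for k, (dy, dx) in enumerate(offsets):
        y, x = cy - dy, cx - dx
        if 0 <= y < votes.shape[0] and 0 <= x < votes.shape[1]:
            mask[y, x] |= votes[y, x, k] > thresh
    return mask

mask = back_project((1, 1), votes)
print(np.argwhere(mask))  # the two pixels that voted for center (1,1)
```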
1487
01:11:58,908 --> 01:12:05,908
and the qualitative results of this

1488
01:12:02,639 --> 01:12:08,460
voting scheme are also really good so

1489
01:12:05,908 --> 01:12:10,888
you can see here for example this crazy

1490
01:12:08,460 --> 01:12:13,560
image with all the teddy bears that are

1491
01:12:10,889 --> 01:12:16,199
correctly classified and also the

1492
01:12:13,560 --> 01:12:20,670
instances of them are actually really

1493
01:12:16,198 --> 01:12:23,369
really impressive so you might actually

1494
01:12:20,670 --> 01:12:25,050
wonder why do we need to go through all

1495
01:12:23,369 --> 01:12:28,198
the trouble of doing semantic

1496
01:12:25,050 --> 01:12:31,520
segmentation instance segmentation and in

1497
01:12:28,198 --> 01:12:35,789
the end panoptic segmentation right so

1498
01:12:31,520 --> 01:12:38,370
the idea is that we want to use a camera

1499
01:12:35,789 --> 01:12:41,279
in computer vision to understand the

1500
01:12:38,369 --> 01:12:43,559
scene around us and the ultimate scene

1501
01:12:41,279 --> 01:12:46,019
interpretation is to know exactly what

1502
01:12:43,560 --> 01:12:48,389
every pixel represents so we want to

1503
01:12:46,020 --> 01:12:51,409
find individual objects we want to find

1504
01:12:48,389 --> 01:12:53,969
surfaces and for example for robots

1505
01:12:51,408 --> 01:12:57,029
surfaces are really important right we

1506
01:12:53,969 --> 01:12:59,489
need to actually allow robots to

1507
01:12:57,029 --> 01:13:01,590
understand what are drivable surfaces for

1508
01:12:59,488 --> 01:13:04,319
example road and what are non-drivable

1509
01:13:01,590 --> 01:13:06,869
surfaces we need to allow the robot to

1510
01:13:04,319 --> 01:13:10,250
understand the type of objects the type

1511
01:13:06,869 --> 01:13:13,079
of obstacles that it can find and also

1512
01:13:10,250 --> 01:13:15,359
finally and this is not done through

1513
01:13:13,079 --> 01:13:17,488
panoptic segmentation but more through

1514
01:13:15,359 --> 01:13:20,009
tracking and through trajectory

1515
01:13:17,488 --> 01:13:24,178
prediction as we will see we also need

1516
01:13:20,010 --> 01:13:26,969
to allow robots to understand or to

1517
01:13:24,179 --> 01:13:29,069
predict the intent of the agents in

1518
01:13:26,969 --> 01:13:32,038
the vicinity for example whether a person

1519
01:13:29,069 --> 01:13:33,118
is going to cross the path of the

1520
01:13:32,038 --> 01:13:36,118
robot or not

1521
01:13:33,118 --> 01:13:38,670
so understanding the scene around us

1522
01:13:36,118 --> 01:13:42,448
through panoptic segmentation is one of

1523
01:13:38,670 --> 01:13:48,269
the pillars actually of mobile robot

1524
01:13:42,448 --> 01:13:50,578
vision thank you very much for following

1525
01:13:48,269 --> 01:13:54,380
this lecture on instance segmentation

1526
01:13:50,578 --> 01:13:54,380
stay tuned for the next lecture