The document discusses instance segmentation, which aims to segment pixels belonging to individual countable objects like cars or people, rather than just labeling pixels with the same semantic class. It describes two approaches: 1) Using object proposals to first separate instances and then classify semantically, or 2) Starting from a semantic segmentation map and then separating instances within each class. It focuses on the second approach, noting methods that use clustering or multi-cuts on edges to separate object instances within a semantic segmentation.
Hello everyone, welcome to the lecture on instance segmentation. Let's start by defining the problem. In the last lecture we saw what the task of semantic segmentation is: essentially, we want to label every pixel, including the background, into a semantic class. So we want to label a pixel as sky, grass, or road, but also as belonging to objects that, at the time of doing semantic segmentation, we are not really counting. Essentially, all the pixels that are coming from different instances of the same class are labeled with the same label in the task of semantic segmentation. You see, for example, here that we are labeling all the pixels belonging to these three cars with the same label, which is the label "car". So in semantic segmentation, the objects that can be counted, like cars or people, are treated in exactly the same way as the objects that cannot be counted, like sky, grass, or road. Now, for the task of instance segmentation, we want to go one step further.
Essentially, we don't focus on labeling pixels that are coming from uncountable objects like the sky, grass, or road that we were discussing before, and focus only on segmenting the pixels that are coming from instances of classes of objects that we can actually count, for example cars or people. The idea here is that we do want to differentiate the pixels that form one car, one instance of a car, from the pixels that form another instance of another car. So essentially, instead of assigning the same label to all the pixels that form these three cars, we now assign different labels to, for example, the yellow car, the first instance, and this blue-green car, which will be a second instance. So not only do we want to find the semantic class of the object, we also want to find which instance that object is within that class.
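As a small aside, the labeling difference described above can be sketched in a few lines of Python. The run-based instance assignment below is a toy illustration on a 1-D row of pixels that I made up for this note; it is not part of any of the methods discussed in the lecture:

```python
# Toy illustration: semantic segmentation gives one class label per pixel,
# so both cars share the label 1 ("car"); instance segmentation additionally
# assigns a distinct instance id per pixel.

# A tiny 1x8 "image row": 0 = road, 1 = car. Two cars separated by road.
semantic = [0, 1, 1, 0, 0, 1, 1, 0]

def instances_from_semantic(row, cls):
    """Assign a distinct instance id to each maximal run of `cls` pixels.
    Pixels not belonging to `cls` get id 0 (background)."""
    ids, current, prev_in = [], 0, False
    for label in row:
        in_cls = (label == cls)
        if in_cls and not prev_in:
            current += 1          # a new run of `cls` starts: new instance
        ids.append(current if in_cls else 0)
        prev_in = in_cls
    return ids

print(instances_from_semantic(semantic, cls=1))
# -> [0, 1, 1, 0, 0, 2, 2, 0]: same class, two different instance ids
```

In 2-D this run-based trick obviously breaks down, which is exactly why the methods below exist.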
Is this one instance? Do the yellow pixels here belong to the same instance as the blue-green pixels there, or not? So: differentiate between semantic classes, and differentiate between instances.

We can go about doing instance segmentation in two ways. The first way is to use the knowledge that we have from object detection and start with a series of proposals. Essentially, we would start with, for example, object proposals, and then in a second step we would assign a semantic class to each of these proposals. Of course, in the first step we do the actual instance separation with the proposals, and in the second step we do the semantics part and probably also the segmentation part. That is one way to go. The other way is what I call the FCN-based method, which essentially starts from the semantic segmentation map.
Here, for example, you see an example where all of these instances are labeled as person, and so in a second step what I need to do is separate the instances inside this semantic label. Let's focus first on the second type of method. Of course, the great advantage of FCN-based methods is that they start from a semantic map. We saw in the last lecture already how to do this, and we saw that it was usually done with fully convolutional networks that are able to act on any image size. This is why FCN-based methods are actually so powerful: they start from an already pretty good semantic segmentation. Once you get the semantic segmentation, your goal is to separate the instances within each of the classes, and there are three methods that I would recommend you read outside of the lecture to get more depth on how they actually perform instance segmentation.
I will talk about the second method just briefly, to give the intuition of how to go from edges to instances with multicuts. A lot of the methods that want to find instances inside a semantic map use the concept of clustering. You can imagine that you have a set of pixels that represent the class person, and within these pixels what I want to do is cluster the pixels that actually belong to one instance. In this case, what this method proposes is to start from the input image, perform semantic segmentation, which gives you the per-pixel semantic class scores, and then perform an image partition that separates the image into small sets, for example superpixels: groups of pixels that show certain characteristics, for example a smooth transition in color space.
Once you have this separation, what you want to do is put these superpixels together into instances. So these two branches act in a kind of parallel way: the left branch performs semantic segmentation, the right branch performs the separation of the image, and usually this is done using very low-level features, like for example edge detection. From this you obtain the superpixels, which will then be the units that you put together into instances. I'm not going to go into more detail on these methods, because we want to focus this lecture on proposal-based methods; so far they have shown much better performance. So what are proposal-based methods? Well, we already know how to obtain bounding boxes and how to do object detection.
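To make the clustering intuition concrete, here is a deliberately naive sketch of my own (not the multicut method itself): treat each connected component of a binary class mask as one instance. Real methods need edge or multicut reasoning precisely because this naive grouping merges instances that touch each other:

```python
from collections import deque

def connected_components(mask):
    """4-connected component labeling on a binary 2-D mask.
    Returns a grid of instance ids; 0 means background."""
    h, w = len(mask), len(mask[0])
    ids = [[0] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and ids[sy][sx] == 0:
                next_id += 1                    # found an unlabeled component
                ids[sy][sx] = next_id
                queue = deque([(sy, sx)])
                while queue:                    # flood-fill the component
                    y, x = queue.popleft()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and ids[ny][nx] == 0:
                            ids[ny][nx] = next_id
                            queue.append((ny, nx))
    return ids

# Two separate blobs of "person" pixels become two instance ids.
person_mask = [[1, 1, 0, 1],
               [0, 0, 0, 1]]
print(connected_components(person_mask))
# -> [[1, 1, 0, 2], [0, 0, 0, 2]]
```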
This is essentially what proposal-based methods are leveraging. If you already know how to separate different instances of the same semantic class, different objects, different cars for example, with object detection, why not use it as a condition, as your input essentially, and then try to find the segmentation mask within the bounding box? Of course, it is much easier if you already have a bounding box and you know that the pixels of your instance can only be found inside this bounding box; it is much easier to find the appropriate segmentation mask than if you start from the whole image and just look at all the pixels. There are two proposal-based methods that I want to mention, and I also propose that you read the follow-up or previous works, just to see how the methods evolved: how one starts with one approach and then evolves this approach to obtain better and better results.
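The point that a bounding box constrains where the instance pixels can be is easy to write down. In this sketch, `score` is a hypothetical per-pixel foreground score and the 0.5 threshold is just illustrative; everything outside the box is forced to background, so the mask prediction only has to decide inside the box:

```python
def mask_in_box(score, box, thresh=0.5):
    """Binarize a per-pixel foreground score, but only inside the proposal.
    box = (y0, x0, y1, x1), half-open ranges."""
    y0, x0, y1, x1 = box
    h, w = len(score), len(score[0])
    return [[1 if (y0 <= y < y1 and x0 <= x < x1 and score[y][x] > thresh) else 0
             for x in range(w)]
            for y in range(h)]

score = [[0.9, 0.2, 0.8],
         [0.7, 0.6, 0.9]]
# The high score at (1, 2) is outside the box, so it is ignored.
print(mask_in_box(score, box=(0, 0, 2, 2)))
# -> [[1, 0, 0], [1, 1, 0]]
```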
In this case we will discuss the ECCV 2014 paper SDS, and you can read the follow-up work presented at CVPR 2015. After this we will focus on Multi-task Network Cascades, and you can check the previous work, which was done a year before. So let's start with an overview of SDS. SDS presents a very simple concept in which proposals are used as a starting point not only for bounding box prediction but also directly for region, or segmentation, prediction, so essentially mask prediction. The idea here is to start from a set of proposals, which you obtain with any algorithm that you like, and then perform a feature extraction that is good for bounding box prediction and also for mask prediction. Of course, you have two separate CNNs here.
One predicts the bounding box, the other predicts the region, and the idea is that by combining these two sources of information you can perform region classification and then use other methods for region refinement. The idea to keep in mind here is that there is a separate head for the box prediction and a separate head for the mask prediction.

Multi-task Network Cascades, on the other hand, proposes a slightly more complex approach. The idea is again to start from regions of interest, these proposals, which you are first going to convert into mask instances and later refine into categorized instances, therefore assigning a class to each of these instances. So in this case we also start from these proposals, and we compute everything by just looking at the region of interest, which is pooled, or rather warped.
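The structural difference between the two designs can be caricatured with toy functions; every name and number here is made up for illustration and is not from either paper. SDS runs a box head and a region head in parallel on shared features, while MNC chains its stages so each consumes the previous stage's output:

```python
def shared_features(roi):
    # Stand-in for the shared CNN feature extraction.
    return sum(roi)

def box_branch(f):
    return f * 2          # toy "box/classification score"

def mask_branch(f):
    return f + 1          # toy "mask score"

def sds_style(roi):
    """SDS-style: two heads run in parallel on the same features."""
    f = shared_features(roi)
    return box_branch(f), mask_branch(f)

def mnc_style(roi):
    """MNC-style cascade: each stage consumes the previous stage's output."""
    f = shared_features(roi)      # stage 1: RoI features
    mask = mask_branch(f)         # stage 2: mask from stage-1 output
    category = box_branch(mask)   # stage 3: categorize the masked instance
    return mask, category

print(sds_style([1, 2]), mnc_style([1, 2]))
# -> (6, 4) (4, 8): same inputs, different wiring
```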
We can work nicely with it to create our mask instances and our categorized instances. Now, one question that you might ask yourselves is: why should I constrain my method to a fixed set of proposals, which might be incorrect or contain some imprecisions, and why should I constrain myself to a semantic segmentation map? Ideally, what you would want is to leverage the best of both worlds: to leverage the proposals, which give me a lot of information on instances, and to also leverage the semantic maps, and not start from one and then try to derive the other. This is essentially how we come to one of the most famous methods for instance segmentation, Mask R-CNN, which derives from the work of Fast and Faster R-CNN that we saw in previous lectures. In Mask R-CNN we essentially start from the Faster R-CNN architecture that we already know.
We have our famous image of the penguin, which is processed by the CNN to perform feature extraction, and from this we have at the bottom the region proposal network, which proposes these regions of interest, these areas inside the image that are worth looking at. Then Faster R-CNN had the bounding box regression head, which refined these bounding boxes and allowed you to obtain boxes that fit tightly to the object, and also the classification head, which tells you whether there is a penguin or a cat in the image. This is the basic architecture that Mask R-CNN starts from, and the main idea is to add a third head: a head which is very much based on fully convolutional networks, to perform instance segmentation. So you take kind of the best of both worlds: you take the power of Faster R-CNN, the proposals and the detection power, and you take the power of fully convolutional networks to perform semantic segmentation.
So this is another depiction of what I mean by this combination: Faster R-CNN plus the FCN-like mask head. We have the Faster R-CNN architecture depicted on the left, and we have our regions of interest that are going to be pooled with an operation that is very similar to RoI pooling but is now slightly adapted, and we will talk about this. In this case we don't have RoI pooling but an operation called RoIAlign; the idea is again to convert any bounding box size to a fixed representation, so that we can then predict the class and box, and with a series of further convolutions we can also predict the mask. The mask loss is essentially going to be a binary cross-entropy, per pixel, for the K semantic classes.
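The per-pixel binary mask loss just mentioned can be sketched as follows; the key point is that the mask head predicts one mask per class and only the channel of the ground-truth class is penalized, with a binary cross-entropy per pixel rather than a softmax competition across classes. All numbers below are made up:

```python
import math

def mask_loss(pred_masks, gt_class, gt_mask):
    """Per-pixel binary cross-entropy on the ground-truth class channel only.
    pred_masks[k][i]: predicted foreground probability of class k at pixel i."""
    probs = pred_masks[gt_class]     # other class channels contribute nothing
    total = 0.0
    for p, t in zip(probs, gt_mask):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(gt_mask)

# Two classes, four pixels; ground truth is class 1 with mask [1, 1, 0, 0].
pred = [[0.5, 0.5, 0.5, 0.5],        # class-0 channel: ignored here
        [0.9, 0.8, 0.2, 0.1]]       # class-1 channel: penalized per pixel
print(round(mask_loss(pred, gt_class=1, gt_mask=[1, 1, 0, 0]), 4))
# -> 0.1643
```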
We are going to directly try to predict the semantic class for that particular instance. Now, of course, the idea is that the whole instance problem has already been solved by Faster R-CNN: Faster R-CNN already gives us this separation between objects, and therefore the mask head can focus entirely on finding out the semantic class of this instance. The mask head, as I said, is a series of convolutions; it is a fully convolutional network, an FCN. The idea is that you take your feature representation that still contains some spatial information, and this is the representation before we actually have the classification head and the bounding box regression head of Faster R-CNN. We take this representation and essentially do a series of convolution operations to process it until we have a nice output in which we can represent the semantic classes.
So the power of Mask R-CNN is that most of the features are shared, most of the computation is shared, so we are really adding only a few operations on top of Faster R-CNN in order to produce the segmentation results. But let's now look at the two tasks of detection and segmentation: can we actually use the same operations inside a neural network to perform detection and segmentation? In the case of detection, we essentially want to do object classification. Once you have a proposal, you want to take that box and perform classification: is this a cat, is this a dog, or is this not an object at all? So you actually require invariant representations, and in particular you require translation invariance: wherever my penguin is inside the image, I still want to classify it as a penguin.
Therefore, I need translation-invariant representations to perform detection. Let's now add the segmentation problem. The segmentation problem is slightly different: for every translated object I need a translated mask, and for every scaled object I need a scaled mask; therefore, I am going to require equivariant representations. Also, in semantic segmentation the small objects are less important, because they have fewer pixels and therefore count less in the loss function, but for instance segmentation all objects, no matter the size, are equally important, the same as for object detection. So I am going to need slightly different representations; I am going to need to make more changes to Faster R-CNN in order to have an equivariant network instead of a network that gives me invariant representations.
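The invariance versus equivariance distinction can be checked numerically with a 1-D toy example of my own: a (valid) convolution shifts its output when the input shifts, while global max pooling does not change at all:

```python
def conv1d(x, k):
    """Valid 1-D cross-correlation: output length len(x) - len(k) + 1."""
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]

def global_max_pool(x):
    return max(x)

signal  = [0, 0, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 3, 1, 0]     # same pattern, translated by two positions
kernel  = [1, 2]

a = conv1d(signal, kernel)           # [0, 6, 5, 1, 0, 0]
b = conv1d(shifted, kernel)          # [0, 0, 0, 6, 5, 1]

# Equivariance: the conv output is translated by the same two positions.
print(b[2:] == a[:len(b) - 2])                               # -> True
# Invariance: global pooling gives the same answer for both inputs.
print(global_max_pool(signal) == global_max_pool(shifted))   # -> True
```

Good for deciding "is there a penguin anywhere?", bad for saying "which pixels are penguin?", which is exactly the tension described above.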
So, what kinds of operations are inside Faster R-CNN? Which operations are equivariant, and therefore ones I can keep, and which are invariant, and therefore ones I need to change, in order to create Mask R-CNN? Let's look at the first type of operation: the feature extraction, which is performed using a series of convolutional layers. We all know that these operations are equivariant, so no problem there, and the same goes for the segmentation head; it's a fully convolutional network, so these operations are also equivariant. But we have one problem in the middle with Faster R-CNN. Remember that we had the operation of RoI pooling and a series of fully connected layers, and all of these operations essentially give invariance to the representation.
representation, so these are not operations that we can keep for Mask R-CNN. Remember how the region-of-interest pooling operation was working: we had this proposal that is on top of my penguin, I perform feature extraction with my CNN, and now from my feature map I'm only interested in looking at the region of my green proposal. So what I do is put an H by W grid on top of this proposal and then perform pooling for each of these cells in order to obtain a map that is H by W by C. Essentially, I bring every proposal representation to a fixed spatial size of H by W with this pooling operation. So what exactly is the problem with this pooling operation? Let us look at an example with specific sizes. Let's assume that I have my 400 by 400 image and I have my bounding box
which is 300 by 150. Once I pass it through a CNN, my feature map has of course been reduced in spatial size and now has more channels. We will just represent the number of channels as C, because we're not interested in how many channels we have; we're interested in the spatial size, which is now 65 by 65, coming down from 400 by 400. This of course means that my bounding box has also been scaled down, and for example the height, which was 300 pixels before, is now 48.75 pixels. Now let's imagine that I take my H by W grid and put it on top of my box, which has a height of 48.75 pixels. What ends up happening is that I have to choose how many pixels to take into my grid: I cannot take 48.75 pixels and divide it by the number of bins that I need when I put the grid on top, so I
need to make a choice, and I, for example, make the choice of 48. This is the first quantization effect that we're going to see. In the output we're going to have this quantization effect reflected, because now my bounding box is not truly 300 pixels high but much less, due to the quantization and due to the fact that instead of taking 48.75 I took 48 as the box height. You can see that this is not really suitable when you want to extract pixel-wise precise masks: if I already have a quantization problem just when I predict the bounding box, then when I predict a pixel-wise precise mask I'm going to lose a lot of the mask with this operation alone. So it is clear that RoI pooling is not suitable for Mask R-CNN. The idea of Mask R-CNN is then to exchange all the invariant operations with equivariant ones.
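To make that quantization concrete, here is a small sketch of the arithmetic from the lecture's example (400x400 image, 65x65 feature map, a 300-pixel-high box); the rounding choice is illustrative, not the exact code of any Faster R-CNN implementation.

```python
# Sketch of the RoI pooling quantization from the lecture example.

def quantized_box_height(box_h_px, image_size, feature_size):
    """Scale a box height to feature-map coordinates, then snap to an int."""
    scale = feature_size / image_size          # 65 / 400 = 0.1625
    exact = box_h_px * scale                   # 300 * 0.1625 = 48.75
    quantized = int(exact)                     # first quantization: 48
    return exact, quantized

exact, quantized = quantized_box_height(300, 400, 65)
print(exact, quantized)            # 48.75 48
# Back in image coordinates the box is now 48 / 0.1625 ≈ 295.4 pixels
# high instead of 300 -- pixels are lost before any mask is predicted.
print(quantized / (65 / 400))
```

A second, analogous rounding happens when the 48 rows are split into the H bins of the grid, which is why the effect compounds.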
In this case, they exchange RoI pooling with an equivariant operation which they call RoI Align. One of the goals of RoI Align is to erase those quantization effects. If you look at the example from before, where our 300-pixel box was converted to a box height of 48.75 in the feature map, we previously had to choose, for example, 48 in order to perform RoI pooling. The idea now is that we're going to be able to choose exactly 48.75 and still perform this region-of-interest pooling operation, which is now called RoI Align. So let's look at this example, where we have our feature map here on the left and our bounding box, which is depicted in this salmon color. From this bounding box, this RoI, this proposal, we want to get as
output from the RoI Align operation a fixed-dimensional representation, which in our case is a 2 by 2 representation. Essentially, we need to fill these four positions; we need to obtain one value for each of them. But of course, for each of these positions we actually have all of this area to cover and somehow distill into this single number. Instead of doing any quantization here, what we say is that we want to take into account the values that can be found in this representation at the exact positions where they are meant to be, without any quantization effects. So what we do is sample each of these units, each of these output regions, at, for example, four points. These points are going to be the grid points for the bilinear interpolation, so
essentially what I'm going to do is take the pixel values that I do have from my feature map. My feature map contains values at each of these corners here, and of course there is no value that represents this position here. But what I can do is bilinear interpolation between these four corners of my orange box, and bilinear interpolation means that this value is going to be much more influenced by the bottom-left corner of the orange box, because it is very close to that corner, but you still take into account the values found at the other corners of the orange box. So through bilinear interpolation I can essentially create a true value for this blue point, for these grid points.
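The bilinear interpolation step just described can be sketched for a single sampling point as follows; the function and variable names are my own, not from any particular Mask R-CNN implementation.

```python
# Minimal sketch of the bilinear interpolation inside RoI Align.

def bilinear(feature, y, x):
    """Interpolate a value at fractional coordinates (y, x) from the
    four surrounding integer grid corners of a 2-D feature map."""
    y0, x0 = int(y), int(x)          # top-left corner
    y1, x1 = y0 + 1, x0 + 1          # bottom-right corner
    dy, dx = y - y0, x - x0          # fractional offsets in [0, 1)
    # Each corner is weighted by the area of the opposite sub-rectangle,
    # so the nearest corner influences the result the most.
    return (feature[y0][x0] * (1 - dy) * (1 - dx)
            + feature[y0][x1] * (1 - dy) * dx
            + feature[y1][x0] * dy * (1 - dx)
            + feature[y1][x1] * dy * dx)

fmap = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear(fmap, 0.5, 0.5))   # 1.5, the average of all four corners
print(bilinear(fmap, 0.9, 0.1))   # dominated by the nearby 2.0 corner
```

No coordinate is ever rounded, which is exactly what removes the quantization of RoI pooling.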
Now what I can do is condense this information through, for example, max pooling in order to obtain one output value for RoI Align. This essentially avoids all the quantization effects and really takes into account the actual value at the actual position of the feature map, not a quantized value, which is not accurate enough for segmentation prediction. And we can see that Mask R-CNN actually works really well: the qualitative results are really impressive, we get a lot of objects, small and big, they are quite well segmented, the semantic classes are correct, and we can use the same network to segment lots of object categories, from person to motorbike to cup to donut. The nice thing about Mask R-CNN is that it's quite a flexible architecture, so you can also, for example, extend it to predict body joints. The idea is that you can
actually model a keypoint location as a one-hot mask and adapt Mask R-CNN to predict K masks, which in the end are only one pixel each. So every joint is going to be represented as a mask, you're going to predict a mask for each of K joints, and each of the masks is going to represent the left shoulder, the right elbow, and all the other joints in the body. Essentially, by just slightly changing the meaning of the representation, you can use the same operations and take advantage of the power of RoI Align to predict precise body joints, and of course the full skeleton, as is depicted here. So now the question is: can we actually do better? There are works that have been building on top of Mask R-CNN, trying to improve its accuracy. Of course, one problem is that the
mask quality score is computed as the confidence score of the bounding box. Essentially, if your bounding box doesn't have a high confidence score, then your mask quality score is also going to suffer. And remember that the mask loss just evaluates whether the pixels have the correct semantic class, not the correct instance. For example, in this case where we have three persons, if all of the pixels inside the orange bounding box were classified as a person, then it doesn't really matter whether this is the purple person or the orange person; you see that there are pixels from both instances falling inside the orange box. Whether these two instances inside the same box should actually be the same or not is not reflected in the mask loss; the only thing you're
interested in there is that all of these pixels are classified as a person. So of course it's a problem in this particular case, where you have two or more instances of the same class, here the class person, inside one bounding box. The idea is that if you are actually predicting instance segmentations, but the only way the actual instance is evaluated is through the box loss, then you are losing some mask quality. This is why, in a subsequent CVPR 2019 paper, other authors proposed what's called Mask Scoring R-CNN, which essentially adds a mask intersection-over-union head. The idea is that you want to measure the intersection over union between the predicted mask and the ground-truth mask, so you want a loss that is really acting on the mask instance level, not only on the bounding-box instance level.
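The quantity that this extra head is trained to predict is just the IoU between two binary masks; a plain-Python sketch for illustration:

```python
# Intersection over union between a predicted and a ground-truth
# binary mask, both given as flat 0/1 lists.

def mask_iou(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))   # pixels in both masks
    union = sum(p | g for p, g in zip(pred, gt))   # pixels in either mask
    return inter / union if union else 0.0

# A toy 1x8 "mask": the prediction overlaps the ground truth in 4 of
# the 8 pixels covered by their union.
pred = [1, 1, 1, 1, 1, 1, 0, 0]
gt   = [0, 0, 1, 1, 1, 1, 1, 1]
print(mask_iou(pred, gt))   # 4 / 8 = 0.5
```

Unlike the per-pixel classification loss, this number drops when the predicted mask bleeds into a neighboring instance of the same class.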
Now, typically, Mask Scoring R-CNN gives lower confidence scores than Mask R-CNN, because of course the masks are not perfect, but this tiny modification actually achieves much better results. Just having the proper loss really changes how your neural network is going to train and what kind of outputs it's going to predict. Of course, Mask R-CNN derives from Faster R-CNN, which is a two-stage detector, and we saw that besides two-stage detectors we also had one-stage detectors, which were actually faster. So now the question is: can I also apply this one-stage versus two-stage concept to masks? Can I have one-stage instance segmentation methods? As I said, in detection we had
this difference between one-stage methods like YOLO, which were faster but had lower performance, versus two-stage detectors like Faster R-CNN and all its variants, which are of course slower but have higher performance. Now we saw Mask R-CNN as a two-stage instance segmentation method, which, as we would expect, is slower but actually has higher performance than its one-stage counterpart, which is YOLACT. Even though masks are a very meaningful representation, it's not so easy to extend YOLO to predict masks instead of boxes, and this was already noted by the creator of YOLO. So it wasn't until 2019 that we got YOLACT, appearing at a conference. YOLACT stands for You Only Look At CoefficienTs, and the idea there is actually not so straightforward: to go from boxes to
masks, you need to design the network carefully. What they proposed is this pipeline, which we'll analyze in a bit more detail here today. As a first step, you need a network head that generates what they call mask prototypes, so possible segmentations for a particular bounding box. Then you generate the mask coefficients with another network, called the prediction head, which evaluates each of the mask prototypes. And finally you have a third step that combines the mask prototypes and mask coefficients to generate the final instance segmentation. So let's see the architecture in a bit more detail. We have the backbone, which is actually ResNet-101, and the features are computed at different scales, so we also have this feature pyramid, which is
now pretty much present in all object detectors and instance segmentation methods. We then have the ProtoNet, which is responsible for generating K prototype masks; this K has no relationship with the number of semantic classes but is rather a hyperparameter. So you generate a fixed number of prototype masks, which you can somehow relate to the anchors that we had in bounding-box detection. The architecture of ProtoNet is a fully convolutional network, which consists of a series of 3x3 convolutions and then a final 1x1 convolution that simply converts the number of channels into these K predictions that we need. In each of the channels at the end, on this feature map of 138 by 138 by K, we'll have one of the K possible mask prototypes. So this is
actually very similar to the mask branch in Mask R-CNN, but there is no loss function applied at this stage, and this is a very crucial difference. Once this ProtoNet has generated the mask prototypes, we have another network that predicts a coefficient for every prototype mask and judges how reliable that mask is. The mask coefficients are essentially intertwined with the box predictions: we have a series of anchor boxes, and for each anchor box one predicts its class, a regression to fit the anchor box tightly to the object, and then the mask coefficients, one per prototype mask and per anchor. Note here that we have W by H, multiplied by K, which is the number of prototype masks, and multiplied by A, which is the number of anchor boxes.
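As a quick sanity check on those shapes, here is a back-of-the-envelope count of the prediction-head outputs per feature map, assuming A anchors per location, K prototype coefficients per anchor, and some number of semantic classes; the concrete numbers below are just an illustrative choice, not YOLACT's actual configuration.

```python
# Counting the prediction-head outputs for a W x H feature map.

def head_outputs(W, H, A, K, n_classes):
    classes = W * H * A * n_classes   # one class-score vector per anchor
    boxes   = W * H * A * 4           # 4 box-regression offsets per anchor
    coeffs  = W * H * A * K           # K mask coefficients per anchor
    return classes, boxes, coeffs

# e.g. a 5x5 feature map with 3 anchors, 32 prototypes, 80 classes
print(head_outputs(5, 5, 3, 32, 80))   # (6000, 300, 2400)
```

The mask coefficients are just one more per-anchor output alongside the usual class and box predictions, which is why NMS on the boxes also selects the coefficients.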
So we have these K coefficients per anchor, and these are actually the ones that define the mask. Note how the network is similar to RetinaNet, but a little shallower. So now the question is: how do you actually generate the mask from these mask prototypes and mask coefficients? Essentially, you take a linear combination of the mask prototypes weighted by the mask coefficients and pass it through a nonlinearity, so you predict the masks as M = sigma(P C^T). In this case P is the H by W by K feature map output by ProtoNet, which is essentially the matrix of prototype masks; C is an N by K matrix of the mask coefficients that have survived NMS (remember that the coefficients are predicted together with the anchor boxes, so NMS can be applied on that side); and sigma is a nonlinearity.
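A minimal sketch of this assembly for a single detection, assuming a sigmoid for the nonlinearity and tiny toy 2x2 prototypes; the names are mine, not from the YOLACT code.

```python
# YOLACT-style mask assembly: each output pixel is a coefficient-weighted
# sum of the prototype masks, squashed element-wise by a sigmoid.

import math

def assemble_mask(prototypes, coeffs):
    """prototypes: list of K HxW masks; coeffs: list of K floats."""
    H, W = len(prototypes[0]), len(prototypes[0][0])
    out = [[0.0] * W for _ in range(H)]
    for proto, c in zip(prototypes, coeffs):
        for y in range(H):
            for x in range(W):
                out[y][x] += c * proto[y][x]
    # sigmoid nonlinearity
    return [[1 / (1 + math.exp(-v)) for v in row] for row in out]

# K = 2 prototypes: one fires on the left column, one on the right.
protos = [[[5.0, 0.0], [5.0, 0.0]],
          [[0.0, 5.0], [0.0, 5.0]]]
# A positive coefficient keeps the first prototype and a negative one
# suppresses the second -- like keeping the player and subtracting
# the racket in the lecture's example.
mask = assemble_mask(protos, [1.0, -1.0])
print(mask)   # left column near 1, right column near 0
```

Flipping the signs of the coefficients would select the racket instead, from the very same prototypes.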
You can see here an example of how you actually construct the final mask by assembling these pieces of masks. Here at the top, first of all, these images are the prototypes, and these are the coefficients, positive or negative. You have, for example, a prototype that contains this person with the racket, and one with only the person; then we put a negative weight on the prototype that represents the racket, and what that essentially gives us is the separation between the two objects, between the human, the tennis player here, and the racket, which is actually another object. Therefore, with this assembly of mask prototypes, we can actually obtain the two
737 2946 00:35:48,210 --> 00:35:54,869 2947 objects the two incense segmentations 2948 2949 738 2950 00:35:51,409 --> 00:35:57,179 2951 separate one is the tennis player at the 2952 2953 739 2954 00:35:54,869 --> 00:36:00,929 2955 top and the other is the racket at the 2956 2957 740 2958 00:35:57,179 --> 00:36:04,289 2959 bottom and you can see that these are in 2960 2961 741 2962 00:36:00,929 --> 00:36:07,379 2963 fact the same prototypes that we use for 2964 2965 742 2966 00:36:04,289 --> 00:36:09,630 2967 both detections and adjust about the 2968 2969 743 2970 00:36:07,380 --> 00:36:13,079 2971 coefficients whether they're positive or 2972 2973 744 2974 00:36:09,630 --> 00:36:16,829 2975 negative that actually define the final 2976 2977 745 2978 00:36:13,079 --> 00:36:21,619 2979 mask and thanks to these coefficients we 2980 2981 746 2982 00:36:16,829 --> 00:36:24,659 2983 can separate the person from the racket 2984 2985 747 2986 00:36:21,619 --> 00:36:26,400 2987 now the bounding box for Yola we're 2988 2989 748 2990 00:36:24,659 --> 00:36:28,889 2991 going to have a simple cross entropy 2992 2993 749 2994 00:36:26,400 --> 00:36:31,740 2995 between the assemble mask and the ground 2996 2997 750 2998 00:36:28,889 --> 00:36:33,210 2999 truth in addition to the standard losses 3000 3001 751 3002 00:36:31,739 --> 00:36:36,239 3003 which are the regression for the 3004 3005 752 3006 00:36:33,210 --> 00:36:40,320 3007 bounding box and the classification for 3008 3009 753 3010 00:36:36,239 --> 00:36:42,179 3011 the actual semantic glass and the 3012 3013 754 3014 00:36:40,320 --> 00:36:45,330 3015 results are actually pretty good so we 3016 3017 755 3018 00:36:42,179 --> 00:36:48,179 3019 can see that we can segment a lot of 3020 3021 756 3022 00:36:45,329 --> 00:36:51,420 3023 semantic glasses lot of instances with 3024 3025 757 3026 00:36:48,179 --> 00:36:53,519 3027 quite a large degree of occlusion so see 3028 3029 758 3030 00:36:51,420 --> 00:36:58,500 3031 for 
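As a rough illustration of this assembly step (a minimal numpy sketch, not the actual YOLACT code; the toy prototypes, coefficient values, and the 0.5 threshold are made up for the example), the linear combination of prototypes followed by the sigma non-linearity looks like this:

```python
import numpy as np

def assemble_mask(prototypes, coefficients):
    """Linearly combine prototype masks with per-instance coefficients,
    then squash with a sigmoid (the non-linearity sigma from the slides).

    prototypes:   (H, W, k) array of k prototype masks
    coefficients: (k,) array of positive/negative mixing weights
    """
    linear = prototypes @ coefficients        # (H, W) linear combination
    return 1.0 / (1.0 + np.exp(-linear))      # sigmoid non-linearity

# Toy example: two 4x4 prototypes, a "person" blob and a "racket" blob.
person = np.zeros((4, 4)); person[1:4, 0:2] = 1.0
racket = np.zeros((4, 4)); racket[0:2, 2:4] = 1.0
prototypes = np.stack([person, racket], axis=-1)    # (4, 4, 2)

# Positive weight on the person prototype, negative on the racket
# prototype: the racket region is suppressed, only the person survives.
mask = assemble_mask(prototypes, np.array([4.0, -4.0]))
print((mask > 0.5).astype(int))
```

A positive coefficient keeps a prototype's region and a negative one suppresses it, which is how the racket gets removed from the tennis player's mask in the slide.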
See for example these two persons being separated correctly. And the method is of course fast, because it's a one-stage segmentation method. Now, for large objects the quality of the masks is even better than those of two-stage detectors, but of course the quality is a little bit reduced for small objects; for small objects the quality is not as good as for Mask R-CNN. Now of course the main selling point, right, is the FPS, the frames per second that YOLACT is able to process. Compared to Mask R-CNN we can see that YOLACT is actually much faster. It's a little bit less accurate, but of course it depends on what kind of application you want to use it for: if you want to have some segmentation output in real time, then you have to use YOLACT; this is the only method up to this point, up to 2019, that was actually reaching these real-time capabilities. So of course, if you need the real time, then you might be able to sustain this little bit of a drop in precision.

Then of course there are improvements on top of YOLACT. The authors have actually been active in developing YOLACT++, and you can read the paper: there is a specially designed version of non-maximum suppression in order to make the procedure even faster, and an auxiliary semantic segmentation loss function which is applied on the final features of the FPN, trying to bring in a little bit of this FCN idea of looking at the whole image in order to predict a semantic segmentation result.

Okay, so we have seen semantic segmentation in the previous lecture, and we have seen instance segmentation in this
lecture, and now the idea is: why not combine both concepts? So remember that semantic segmentation labels each of the pixels in the image with a unique semantic label, and this means that all cars are going to have the same label, car; it's only through instance segmentation that we can separate the different cars within this label. But the problem with instance segmentation is that it does not give us any label for the things that are not countable, for example the grass, the sky or the road. So the task that panoptic segmentation is trying to solve is actually the task of combining semantic segmentation and instance segmentation: each pixel needs to receive a semantic label, and at the same time, if possible, if the object is countable, it needs to receive essentially an instance label.

So for semantic segmentation we have the FCN-like methods; for instance segmentation we have Mask R-CNN and YOLACT and other derived methods; and for panoptic segmentation we have what is called UPSNet. Of course, by now there are many more methods trying to solve the panoptic segmentation task, but this is one of the first methods that actually tackled it. So in panoptic segmentation we have to predict the labels for uncountable objects, which we call stuff in computer vision (things like sky, road, grass, etc., for which you cannot really differentiate between different instances), and whose labels are usually obtained with networks similar to the fully convolutional networks that we have seen for the semantic segmentation task. And on the other hand one also has
to label all the objects which belong to the countable classes. These countable objects, which are called things in computer vision (cars, pedestrians), are the ones for which we also have to differentiate between pixels coming from different instances of the same class, so differentiate between car one, car two and car three. Now of course, if we just tackle the task with an FCN and a Mask R-CNN separately, some pixels might get classified as stuff by the FCN network and at the same time be classified as instances of some class by Mask R-CNN. So if we just put an FCN and a Mask R-CNN together for both tasks, we might have conflicting results. The solution that they proposed in this CVPR 2019 paper is very simple: a parameter-free panoptic head, which actually combines the information from the FCN and the Mask R-CNN.

So essentially, what we want to do is create this network that combines both the stuff classification as well as the things classification, and the network architecture looks like this. We have a set of shared features; we don't have two completely separate networks, but rather this set of shared features, with feature pyramids as well. Then we have the semantic head on top, which is the one that gives us the semantic segmentation, and the instance head at the bottom, which is essentially Mask R-CNN inspired. And in the end we're going to have this panoptic head, which actually puts the information together: it puts the semantic logits and the instance logits together to create what they call the panoptic logits.

Let's take another look at this semantic head, because it has some interesting design
choices. In particular, the semantic head is the fully convolutional network that outputs the semantic logits, or the semantic segmentation map, and the new thing that they use in this architecture is deformable convolutions. Before introducing that concept, we're going to recall dilated convolutions, because these two types of convolutions are actually very similar. If you remember, a normal convolution, which would be a convolution with dilation parameter one, would be the operation depicted here in image (a): in this case we have a three-by-three convolutional filter with dilation one, which means a normal convolution, hence the receptive field is three by three. With the same number of parameters, but now stacking a filter with a dilation of two on top, what we can get is essentially a receptive field of seven by seven, and how the dilated convolution achieves this is by essentially spreading the weights: not applying them to neighboring pixels, but applying them to pixels separated by two. Similarly, if the dilation parameter is four, then each element produced has a receptive field of fifteen by fifteen, but we still have these nine parameters to learn; so we increase the receptive field without increasing the number of parameters.

Now, the deformable convolution is a very similar concept, but we're not going to have this kind of stacked dilation, in which you just spread the weights but always in the same way. What the deformable convolution actually proposes is a kind of generalization of dilated convolutions: in this case you also want to learn the offsets.
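The receptive-field numbers quoted for the dilated convolutions (3, 7, 15) follow a simple recurrence for a stack of stride-1 convolutions: each layer adds (kernel - 1) × dilation to the receptive field. A quick sketch to check them:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (one side) of a stack of stride-1 convolutions,
    one dilation value per layer: each layer adds (kernel-1)*dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field([1]))        # 3  -> a plain 3x3 convolution
print(receptive_field([1, 2]))     # 7  -> the 7x7 example from the slide
print(receptive_field([1, 2, 4]))  # 15 -> the 15x15 example
```

Doubling the dilation at each layer thus grows the receptive field exponentially while every layer keeps the same nine parameters.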
So essentially, I have here the weights of my 3x3 kernel, depicted in green, and now I'm going to learn where to send them: where to get the information that is to be multiplied by each weight. Of course, dilated convolutions are then a special case of deformable convolutions. In this case, what we need to do is take our input feature map and, with a conv layer, actually learn what is called the offset field, which tells us where to send these weights, or by which pixel to multiply these weights, in order to create one pixel of the output feature map. You have here the formulation for the regular convolution and the deformable convolution, in case you want to take a look.

So of course, with respect to standard convolutions, deformable convolutions are much more flexible. You can see here a couple of convolutional layers acting on this image and how they create one pixel in the output space at the top. The deformable convolution will pick the values at different locations in order to compute the actual convolution operation, and these locations will be conditioned on the input image. Therefore, you can imagine that if you have an object with a lot of fine structures, you're going to place your weights at precise locations in order to get the most useful information, and not just spread the values like in normal convolutions. So this is actually a very interesting operation for segmentation outputs, for when you need to have these pixel-wise accurate outputs.
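To make the sampling idea concrete, here is a deliberately naive single-pixel sketch (my own illustration, not the paper's formulation: it rounds offsets to the nearest pixel, whereas real deformable convolutions use bilinear interpolation for fractional offsets, and the offsets here are supplied by hand rather than predicted by a conv layer):

```python
import numpy as np

def deformable_conv_pixel(feat, weights, offsets, y, x):
    """One output pixel of a naive single-channel deformable 3x3
    convolution: each kernel tap samples at its regular grid position
    plus a (dy, dx) offset, rounded to the nearest pixel.

    feat:    (H, W) input feature map
    weights: (3, 3) kernel
    offsets: (3, 3, 2) offset per tap
    """
    H, W = feat.shape
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i, j]
            sy = int(round(float(y + (i - 1) + dy)))
            sx = int(round(float(x + (j - 1) + dx)))
            if 0 <= sy < H and 0 <= sx < W:   # zero padding outside
                out += weights[i, j] * feat[sy, sx]
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))

# Zero offsets recover the regular 3x3 convolution (sum of centre 3x3).
zero = np.zeros((3, 3, 2))
print(deformable_conv_pixel(feat, w, zero, 2, 2))    # 108.0

# A uniform offset of (0, +1) shifts every tap one pixel to the right.
shift = np.zeros((3, 3, 2)); shift[:, :, 1] = 1.0
print(deformable_conv_pixel(feat, w, shift, 2, 2))   # 117.0

# Offsets equal to each tap's relative grid position give dilation 2:
# dilated convolutions are the special case offsets = (d-1) * position.
grid = np.stack(np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij"), axis=-1)
print(deformable_conv_pixel(feat, w, grid.astype(float), 2, 2))
```

In the real operator a separate conv layer predicts one (dy, dx) pair per tap and per output location, so the sampling pattern adapts to the image content.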
So let's now go into the panoptic head. What does this panoptic head do? It puts together the mask output and the semantic information output. We have at the top the mask logits, as they call them, from the instance head: these come from a Mask R-CNN type of head that tells you where the instances are located and what their extent is. The object logits come from the semantic head and tell you what the probability is that a pixel belongs to the class car, the class person, etc. And then you have the stuff logits, which are exactly the same but for the classes which are not countable, like sky, for example, or road.

Now, for the stuff classes one needs to do nothing, right? There are no instances there, so the stuff logits can be evaluated directly. But the object logits actually need to be masked by their instance. Look at this example, where we have this instance of, I think, a car: the full extent of the image depicts more things than just this car, but in the end we're going to have this mask logit, this instance logit, that is going to evaluate only this part of the semantic head. The rest doesn't really apply, because we know that the object is bounded by this bounding box; therefore we are only interested in having a good prediction inside this instance box coming from the instance head. So what we're going to do is perform this operation where we essentially cut out what we're interested in, what is actually inside the box: we cut the semantic map with the instance map, and this is what we're going to use for our instance logits, which are depicted here in green. The rest of the image, the part that doesn't fit into any instance, is going to end up in this last channel here, which is actually the channel "unknown".

So we're going to perform a softmax over the panoptic logits, and the key here is that if the maximum value falls into the first stuff channels, then the pixel belongs to one of the stuff classes; otherwise, the index of the maximum value tells us the instance ID that the pixel belongs to. And like this there are no conflicts, right? Because we take the maximum over all the channels, over the instances, the unknown and the stuff, so we have to make a decision between these three. The last thing that we need to know is actually how to use this unknown class, this last class which doesn't belong to instances or to stuff, and this is something that I would actually recommend you to read up on: how it exactly works.
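The channel-stacking and per-pixel maximum just described can be sketched in a few lines (my own simplification: it takes the already-masked instance logits as given, and uses a plain argmax, which selects the same channel as the softmax-then-maximum described above):

```python
import numpy as np

def panoptic_fuse(stuff_logits, instance_logits, unknown_logit):
    """Parameter-free panoptic fusion, sketched: stack the stuff channels,
    the per-instance channels (already cut by their instance masks), and
    the unknown channel, then take a per-pixel argmax.

    stuff_logits:    (S, H, W)  one channel per stuff class
    instance_logits: (N, H, W)  one channel per detected instance
    unknown_logit:   (H, W)

    Returns per-pixel channel indices: 0..S-1 means a stuff class,
    S..S+N-1 means instance ID (index - S), and S+N means unknown.
    """
    panoptic = np.concatenate(
        [stuff_logits, instance_logits, unknown_logit[None]], axis=0)
    return panoptic.argmax(axis=0)

# Toy 1x2 image: pixel 0 is clearly stuff class 0 (say, sky),
# pixel 1 is clearly instance 1. All values are made up.
stuff = np.array([[[5.0, 0.0]], [[1.0, 0.0]]])    # S=2, H=1, W=2
inst  = np.array([[[0.0, 1.0]], [[0.0, 6.0]]])    # N=2
unk   = np.array([[0.5, 0.5]])
print(panoptic_fuse(stuff, inst, unk))            # [[0 3]] -> sky, instance 1
```

Because every pixel makes a single decision over stuff, instance, and unknown channels at once, the conflicting-prediction problem between the two heads disappears.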
The details of how exactly this unknown class is used are in the CVPR 2019 paper. Excellent, so we can now move to the metrics. We know how to measure semantic segmentation, we know how to measure object detection, but how do we measure panoptic segmentation quality? Now, the panoptic quality measure contains two parts. The first term measures the segmentation quality, meaning how close the predicted segments are to the ground-truth segments, and you can see that this is measured with the IoU, the intersection over union, of the matched true positive masks, so essentially not taking bad predictions into account. This is just one part of the panoptic quality measure, the segmentation quality part, and now we move towards measuring the actual recognition quality.

The same as for detection, we want to know if we are missing any instances, which would lead to false negatives, or if we are predicting extra instances, which would lead to false positives. We have in this case an example of a ground truth where we have three persons, three instances, and we have a dog; in our prediction the sky and the grass are predicted more or less correctly, but there is one person missing, and the dog is actually misclassified as a person. So in this case we would have several true positives: we would have this light brown person matched, we would have the orange person matched, but this dark brown person is not found in the prediction, therefore it is a false negative; and at the same time we have a false positive from this person which is actually predicted as the wrong class.

All this false positive and true positive computation needs to be done in a similar way as we did for detection, so essentially there needs to be a way to match ground truth and predictions, and in this case we actually have to do segment matching. A segment is matched if the intersection over union is above 0.5, and, as with detections, no pixel can belong to two predicted segments. So in this case, where we have one cat but actually two predictions, we compute the intersection over union of the two predictions with the ground truth, and we find that the blue prediction has an intersection over union of 0.6, therefore it is considered a true positive; but the top part of the cat, in pink, is considered a false positive, because its intersection over union is 0.4. And of course, no two segments can actually belong to the same ground truth.
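The two parts described here multiply into the panoptic quality (PQ) metric. A small sketch (computed over a single class, with hypothetical IoU values standing in for the person example above):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic quality from matched segments (each with IoU > 0.5),
    split into segmentation quality (SQ, the mean IoU over true
    positives) and recognition quality (RQ, an F1-style detection term).
    Returns (PQ, SQ, RQ)."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                    # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)   # recognition quality
    return sq * rq, sq, rq

# Lecture-style example: two persons matched (IoUs 0.8 and 0.7),
# one person missed (a false negative), and the dog predicted as a
# person (a false positive).
pq, sq, rq = panoptic_quality([0.8, 0.7], num_fp=1, num_fn=1)
print(round(sq, 3), round(rq, 3), round(pq, 3))    # 0.75 0.667 0.5
```

Note that in the benchmark the metric is computed per class and then averaged; this sketch shows the per-class computation only.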
Now, qualitative results for panoptic segmentation actually look pretty good, so it's quite nice what computer vision can do nowadays: we can predict stuff classes, which have no instance IDs, as well as thing classes, with the correct semantic label and with the different instances separated. For example, here we have the separation of the different glasses into different instances.

Okay, so finally I will present a third way of doing instance segmentation, and this is a way that I particularly like: it is inspired by what we used to do a little bit before CNNs came in. All the R-CNN methods, all the methods in the R-CNN family, and even deformable part models, use a sliding-window approach for detection: the basic idea we have seen is to densely enumerate box proposals and then classify them. This is a successful paradigm, as we have seen: it is well engineered, it achieves state-of-the-art results, and most of the state-of-the-art methods are still based on it. Nonetheless, before deep networks, before R-CNN, we used to do detection by voting, or let's say that one of the paradigms that existed for detection was the voting one, and this was way before we had actual convolutional neural networks. So what do I mean by Hough voting? In this case we can see a very simple example in which we want to detect analytical shapes, for example lines, as peaks in a dual parametric space. Essentially, each pixel casts a vote in this dual space, and then we detect the peaks in the dual space and back-project them to the image space to detect, for example, a line. So let me put this
into a visual example. We want to do line detection, and all we have in order to detect a line are different points that lie on the line; essentially, what we want to do is fit a line through these points. What we can do is have each point in the image space, for example (x0, y0), cast a vote into the Hough parameter space, and this vote takes the form of a line that passes through that point. It is important to note that this line, which is parametrized by m and b, is always going to pass through the point (x0, y0), so essentially it represents all the lines that could pass through that point. Now, if you take another point (x1, y1), it is going to cast another vote, another line, in the Hough parameter space, and so all the points of the line that we're trying to find are going to cast votes: we cast all of these lines into the Hough parameter space, and then we go to that space and read out the maxima. In this case the maximum is going to be this point here where all the lines cross, and this point where all the lines cross is given by a value of m and a value of b that define the line that best fits all the points that have cast votes. So let's see how we can use this to perform not only line detection but object detection in the form of voting.
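As a toy illustration of this dual-space voting, here is a sketch using the (m, b) parametrization from the slides, with ranges and bin counts I chose arbitrarily (real implementations usually prefer the (rho, theta) form, which also handles vertical lines):

```python
import numpy as np

def hough_line(points, m_range=(-5, 5), b_range=(-10, 10), bins=200):
    """Each image point (x, y) votes for all parameters (m, b) with
    y = m*x + b, i.e. it traces the line b = y - m*x in the dual space."""
    ms = np.linspace(*m_range, bins)
    bs = np.linspace(*b_range, bins)
    acc = np.zeros((bins, bins), dtype=int)
    for x, y in points:
        b_vals = y - ms * x                      # one b per sampled slope m
        idx = np.searchsorted(bs, b_vals)        # bin each b value
        ok = (idx >= 0) & (idx < bins)
        acc[np.arange(bins)[ok], idx[ok]] += 1   # cast the votes
    # the peak of the accumulator is the best-fitting line
    i, j = np.unravel_index(acc.argmax(), acc.shape)
    return ms[i], bs[j]
```

For points sampled from y = 2x + 1, the accumulator peak lands near (m, b) = (2, 1), up to the bin resolution.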
The idea is that objects are detected as consistent configurations of observed parts. In this case we have a car, and we know that a car has two wheels, and we know that these two wheels are always roughly in the same position with respect to, for example, the center of the car. So the rough idea would be to use the wheel patch, and whenever this wheel patch is detected, cast a vote for the center of the car: I know that the center of the car always has more or less the same spatial relationship to the wheel, therefore I'm going to cast votes from the wheels of the car, the window of the car, the back of the car, the front of the car. They are all going to cast votes for the center of the car, and by detecting the peaks I'm going to be able to tell whether there was indeed a car there or not. So let's look at this in more detail: how
can we actually train a method to perform this kind of voting? What I'm going to do is first extract some features from the image. Again, this is before CNNs, so what we used in that period were interest-point detection methods, keypoint detectors, for example SIFT or SURF, and these basically extracted interesting, salient points from the image. From each of these points we placed a patch centered around the interest point, and this patch had to cast a vote for the center of the object. Of course, we took the interest points that were on top of the object, and these cast a vote for the center of the object, which we had as ground truth. This was our training procedure: each patch had to learn how to vote for a center point. At test time, we would take the original image, compute these interest points, find the similarity between each interest point and the codebook entries, essentially finding the most similar patches from our training set, and then use the votes from the training set in order to vote for the center of the car. Now, this is an interesting example for the car here, because the front wheel and the back wheel have a very similar appearance, so it is very likely that they are both going to vote with the same likelihood for the center of the car but also for, let's say, the symmetrical position. In this case this patch would vote for the center but also for this position here, which would be the center if this were the back wheel and not the front wheel; and the same happens for the back wheel, which votes for the
real center but also for this point here toward the back. Now, the key idea is that we're going to cast a lot of votes: the windows are going to vote, the front of the car is going to vote, the door is going to vote, and you're going to have votes all over the place, but there is going to be a concentration of votes at the center of the actual object. So in this Hough-like parametric space we can find this peak of votes, where the votes really concentrate. And once we have done this, we can even do segmentation, because we can go back to the image space and say: let's look at all the patches that voted for this position here. We gather all of these patches, perform some further processing, and now we have a rough segmentation of the object.
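The voting step of such a codebook approach can be caricatured in a few lines (a toy sketch, not the actual SIFT-plus-codebook pipeline; the labels, offsets, and coordinates are invented for illustration):

```python
import numpy as np

# Toy codebook: each entry pairs a patch type with the offset(s), learned at
# training time, from that patch to the object centre. The wheel entry has two
# offsets because front and back wheels look alike (the ambiguity above).
CODEBOOK = {
    "wheel":  [(0, +30), (0, -30)],
    "window": [(-20, 0)],
}

def cast_votes(detections, shape=(100, 100)):
    """detections: list of (patch_label, (row, col)) interest points.
    Every matching codebook entry casts one vote for a candidate centre."""
    acc = np.zeros(shape, dtype=int)
    for label, (r, c) in detections:
        for dr, dc in CODEBOOK.get(label, []):
            rr, cc = r + dr, c + dc
            if 0 <= rr < shape[0] and 0 <= cc < shape[1]:
                acc[rr, cc] += 1
    return acc
```

With two wheels and a window detected around a car centred at (50, 50), each wheel also casts one spurious vote, but only the true centre collects the consensus of all three parts.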
So that was the method back in the past, no CNNs there. Now we come back to 2020, and we're going to present a paper called Pixel Consensus Voting for Panoptic Segmentation, which was actually published at CVPR 2020, so really recent research, and it merges this concept of voting with modern CNNs and the power of CNN feature extraction. The overview of this method: we're going to use, of course, our FPN backbone to extract features; we're going to have a semantic segmentation branch at the top, the same as we have been seeing so far; and the interesting part is this second branch, the instance voting branch, which predicts for every pixel whether the pixel is part of an instance mask and, if so, the relative location of the instance mask centroid. So it is the same idea as the method
that used SIFT to extract meaningful patches. Of course, this is a much more powerful representation, because what the authors proposed was to code this instance voting branch into operations that you can fully backpropagate through, and therefore you can train the whole thing end to end. So, in a nutshell, how does this method work? First of all, we need to make a decision for each pixel, and what we're going to do is discretize the region around each pixel: the pixel has a neighborhood, and we're interested in looking at the neighborhood in order to make a decision about the centroid. Each pixel has to vote for the centroid of the object that it belongs to, so it needs to have an idea of what is going on around it. Now, every pixel is going to vote for
its centroid. If it belongs to a stuff category, so no instances, for example road or sky or grass, then it votes for an extra class, which is "no centroid". But the main idea is that every pixel votes for a centroid if the centroid is located in this area here; if the centroid is not located in this area, the pixel is ignored for training. Now, in a third step, we have vote aggregation, the same as we saw in the Hough space: the vote of each pixel is cast into this accumulator space, and this casting is very nicely formulated as a dilated transpose convolution. In a fourth step, we detect objects as peaks in the accumulator; in this case we have three objects, three peaks. And in the final step we do a back-projection of the peaks.
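These steps can be sketched with dense per-pixel offsets instead of the paper's discretized voting filter (a simplification of my own; the real method predicts one of K quantized cells and aggregates with a fixed dilated transpose convolution):

```python
import numpy as np

def accumulate_and_backproject(offsets, thing_mask, min_votes=3):
    """offsets: (H, W, 2) integer (dy, dx) from each pixel to its predicted
    instance centroid; thing_mask: (H, W) bool, True for countable pixels.
    Returns the vote accumulator and one mask per detected peak."""
    H, W = thing_mask.shape
    acc = np.zeros((H, W), dtype=int)
    target = np.full((H, W, 2), -1)            # where each pixel's vote landed
    for y, x in zip(*np.nonzero(thing_mask)):
        cy, cx = y + offsets[y, x, 0], x + offsets[y, x, 1]
        if 0 <= cy < H and 0 <= cx < W:
            acc[cy, cx] += 1
            target[y, x] = (cy, cx)
    masks = []                                  # back-projection per peak
    for cy, cx in zip(*np.nonzero(acc >= min_votes)):
        masks.append(np.all(target == (cy, cx), axis=-1))
    return acc, masks
```

The instance mask falls out for free: it is simply the set of pixels whose vote landed on the peak, which is exactly the back-projection described above.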
Same as we presented for the method before: you look at who voted for that center, you go back to the image space, and you obtain the masks, which are all the pixels that voted for that center; the category information, the semantic information, is provided by the parallel semantic segmentation head. Okay, so now the interesting thing is how to implement this in a neural network. What the authors proposed is to have what they call a voting lookup table. First of all, we need to discretize the region around the pixel: I am a pixel, I need to cast a vote for my centroid, and I need to know where to cast this vote. So the first thing I'm going to do is place this voting filter centered around the pixel that has to cast the vote, and what this
voting filter does is convert this m-by-n grid of cells, this square centered on the pixel, into 17 indices. So essentially I can cast a vote for 17 positions; note that there is, of course, much more resolution close to the pixel and much less resolution as we go further away. Now, in this case, if I am this instance mask and I'm this pixel in the instance mask, my center is the red square, so I basically need to cast a vote for the center, which is going to be at position 16. So I, the blue pixel, am going to cast a vote whose value is 16, and thanks to the voting filter I know exactly what this value means in the image space. Now, at inference, the instance voting branch provides a tensor, and this tensor has size H by W, the image size, and the number of channels is K
plus 1: essentially K positions, remember we had 17 indices here, so 17 positions that I can vote for, plus 1, which is basically for all the classes that are not countable, so the sky, the grass, or the road. And now the idea is that I want to accumulate the votes in my accumulator space. How do I actually do that? Well, remember again our example with the blue pixel: we get a vote for index 16 with very high probability, say the probability is 0.9, and this comes from a softmax output over these 17 classes plus 1, where each class is one of the positions in the voting filter. Now, what I need to do is basically transfer this 0.9 value to cell number 16, and I'm going to do this with a dilated transpose convolution: I'm going to
place the value in there, and then the other thing I need to do, in this particular case, is to evenly distribute this value among the pixels of the cell, and for this I'm going to do average pooling. Both of these operations are very familiar to deep learning people, to people who know convolutional neural networks: the transpose convolution, in this case a fixed dilated one, and pooling, in this case average pooling. So we can actually map all of these voting operations onto standard convolutional-network operations. Now, these transpose convolutions take a single value in the input and, by multiplying it with a kernel, distribute this value over the output map; the kernel defines the amount of the input value that is being distributed.
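The fixed-kernel aggregation can be demonstrated in miniature (a sketch of my own for a 3-by-3, undilated neighbourhood with 9 vote cells; the actual method uses a larger, dilated filter):

```python
import numpy as np

K = 9                                   # 3x3 neighbourhood -> 9 vote cells
kernel = np.zeros((K, 3, 3))
for k in range(K):
    kernel[k, k // 3, k % 3] = 1.0      # one-hot kernel: channel k marks cell k

def aggregate(probs, kernel):
    """Transpose convolution (stride 1, no dilation) with a fixed kernel.
    probs: (K, H, W) softmax scores over vote positions. With the one-hot
    kernel this simply places each channel's probability at its offset."""
    Kc, H, W = probs.shape
    kh, kw = kernel.shape[1:]
    acc = np.zeros((H + kh - 1, W + kw - 1))
    for k in range(Kc):
        for dy in range(kh):
            for dx in range(kw):
                acc[dy:dy + H, dx:dx + W] += kernel[k, dy, dx] * probs[k]
    return acc
```

Because the kernels are fixed one-hot tensors, the operation is linear in the input and gradients still flow through the softmax scores, which is what makes the voting trainable end to end.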
Of course, when we talk about transpose convolutions in general, we talk about learned transpose convolutions; however, for this particular purpose of vote aggregation, we actually fix the kernel parameters, and these kernel parameters are a one-hot encoding across each channel that marks the target location. We know exactly where to place each vote, so this is going to be a fixed operation: we're not going to have learnable parameters in there. Now, there is a more detailed example of the voting implementation, and you're welcome to take a look at it at home. So what happens then for object detection? This is an example where we have several objects, motorbikes, a person; we cast these votes and then we detect the peaks in the heat map, and these peaks essentially determine the consensus between
different pixels in the image, within the instance, that have all voted for the same center. So by simply thresholding and doing some connected-component analysis, we can detect the center of all of these objects. Now we need to do the back-projection: we need to localize the masks of these objects, so for every peak we need to determine which pixels voted for this center and therefore favored this region being the center over all other possibilities. And you see that by doing this back-projection we get fairly good results: I mean, look at this instance here; just by looking at the pixels that voted for the center, we already have quite a good segmentation of this person. So, in order to determine which pixel could have voted for a specific object center, the authors propose to use what they call a query filter, which is
essentially a spatial inversion of the voting filter; see how it is horizontally and vertically flipped. So the question is: if, when I did the voting, I voted for position eight to be my center, then during back projection I look at this pixel and ask, well, should the bottom-left pixel over here in this area actually have voted for eight if I am the instance center? This is exactly the opposite operation, and applying this query filter is how you can do this back projection, where you actually find out which pixels voted for you as a center.

The qualitative results of this voting scheme are also really good. You can see here, for example, this crazy image with all the teddy bears that are correctly classified,
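The spatial flip behind the query filter is easy to see with a tiny fixed kernel. This sketch is my own illustration, not the paper's code: a one-hot voting kernel that casts its vote one step up-left, and its flipped counterpart, which looks one step down-right from a candidate center, i.e. back at the pixel that would have voted there.

```python
import numpy as np

# A fixed one-hot voting kernel: a pixel using this kernel casts its
# vote one step up and one step left of itself.
vote_kernel = np.zeros((3, 3))
vote_kernel[0, 0] = 1.0

# The query filter is the same kernel flipped horizontally and
# vertically. Centered on a candidate instance center, its hot entry
# now points one step down and one step right: exactly the pixel
# whose vote would have landed on this center.
query_filter = vote_kernel[::-1, ::-1]
print(np.argwhere(query_filter == 1.0))  # hot entry at the opposite corner
```

Correlating the vote heat map (or the per-pixel vote channels) with this flipped kernel is the "opposite operation" the lecture describes: it gathers, for each center, the pixels that voted for it.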
and the instances of them are actually really impressive.

So you might wonder why we need to go through all the trouble of doing semantic segmentation, instance segmentation, and in the end panoptic segmentation. The idea is that we want to use a camera in computer vision to understand the scene around us, and the ultimate scene interpretation is to know exactly what every pixel represents. We want to find individual objects, and we want to find surfaces. For robots, surfaces are really important: we need to allow robots to understand which are drivable surfaces, for example road, and which are non-drivable surfaces. We need to allow the robot to understand the types of objects and obstacles that it can encounter. And finally, and this is not done through
panoptic segmentation but rather through tracking and trajectory prediction, as we will see, we also need to allow robots to understand, or to predict, the intent of the agents in their vicinity, for example whether a person is going to cross the path of the robot or not.

So understanding the scene around us through panoptic segmentation is one of the pillars of mobile robot vision. Thank you very much for following this lecture on instance segmentation, and stay tuned for the next lecture.