Halide for HVX
An Introduction to the Halide Programming Language
Halide
A new DSL for image processing and computational photography.
• Halide programs (pipelines) consist of two major components:
  • Algorithm
  • Schedule
• The algorithm specifies what will be computed at a pixel
• The schedule specifies how the computation will be organized
Typical Optimization Techniques for Image Processing

[Figure: loop-nest diagrams in which blue boxes represent memory reads and yellow boxes represent memory writes]
Scheduling for performance (Halide vs C++ comparison)

Halide algorithm:

    // horizontal blur – Algorithm
    f(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y)) / 3;

Equivalent C++:

    for (y = min_row; y < max_row; ++y) {
      for (x = min_col; x <= max_col; ++x) {
        f(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y)) / 3;
      }
    }

Tiling

Halide schedule:

    f.tile(x, y, xi, yi, 128, 4);

Equivalent C++:

    for (y = min_row; y < max_row/4; ++y) {
      for (x = min_col; x <= max_col/128; ++x) {
        for (yi = 0; yi < 4; ++yi) {
          for (xi = 0; xi < 128; ++xi) {
            x_ = x*128 + xi;
            y_ = y*4 + yi;
            f(x_, y_) = (input(x_-1, y_) + input(x_, y_) + input(x_+1, y_)) / 3;
          }
        }
      }
    }
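The naive and tiled loop nests above can be cross-checked in plain C++. The following is a sketch, not Halide-generated code: the 256x8 image size, the clamped border handling, and the function names are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

// Naive horizontal 3-tap blur: the "algorithm" loop nest.
std::vector<int> blur_naive(const std::vector<int>& in, int W, int H) {
    auto at = [&](int x, int y) {                  // clamp reads at the borders
        x = x < 0 ? 0 : (x >= W ? W - 1 : x);
        return in[y * W + x];
    };
    std::vector<int> out(W * H);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            out[y * W + x] = (at(x - 1, y) + at(x, y) + at(x + 1, y)) / 3;
    return out;
}

// The same computation after tile(x, y, xi, yi, 128, 4):
// outer loops walk the tiles, inner loops walk pixels within a tile.
std::vector<int> blur_tiled(const std::vector<int>& in, int W, int H) {
    auto at = [&](int x, int y) {
        x = x < 0 ? 0 : (x >= W ? W - 1 : x);
        return in[y * W + x];
    };
    std::vector<int> out(W * H);
    for (int y = 0; y < H / 4; ++y)
        for (int x = 0; x < W / 128; ++x)
            for (int yi = 0; yi < 4; ++yi)
                for (int xi = 0; xi < 128; ++xi) {
                    int x_ = x * 128 + xi, y_ = y * 4 + yi;
                    out[y_ * W + x_] =
                        (at(x_ - 1, y_) + at(x_, y_) + at(x_ + 1, y_)) / 3;
                }
    return out;
}
```

Both functions compute every pixel with the same expression; tiling only changes the order in which pixels are visited, which improves locality when tiles fit in cache.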
Scheduling for performance (Halide vs C++ comparison)

Tiling & unrolling

Halide schedule:

    f.tile(x, y, xi, yi, 128, 4)
     .unroll(yi);

Equivalent C++ (the yi loop is replaced by four copies of its body):

    for (y = min_row; y < max_row/4; ++y) {
      for (x = min_col; x <= max_col/128; ++x) {
        for (xi = 0; xi < 128; ++xi) {
          x_ = x*128 + xi;
          y_ = y*4 + 0;
          f(x_, y_) = (input(x_-1, y_) + input(x_, y_) + input(x_+1, y_)) / 3;
        }
        for (xi = 0; xi < 128; ++xi) {
          x_ = x*128 + xi;
          y_ = y*4 + 1;
          f(x_, y_) = (input(x_-1, y_) + input(x_, y_) + input(x_+1, y_)) / 3;
        }
        for (xi = 0; xi < 128; ++xi) {
          x_ = x*128 + xi;
          y_ = y*4 + 2;
          f(x_, y_) = (input(x_-1, y_) + input(x_, y_) + input(x_+1, y_)) / 3;
        }
        for (xi = 0; xi < 128; ++xi) {
          x_ = x*128 + xi;
          y_ = y*4 + 3;
          f(x_, y_) = (input(x_-1, y_) + input(x_, y_) + input(x_+1, y_)) / 3;
        }
      }
    }

• Benefits of unrolling:
  • Reduces branching overhead
  • Increases instruction-level parallelism (ILP), which is especially advantageous on a VLIW processor like Hexagon (think packets!)
  • Exposes dependences across loop iterations, such as read-after-write (RAW) dependences between two iterations of a loop
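The ILP benefit of unrolling can be sketched with a simple reduction in plain C++ (the array and its size are hypothetical; this is not from the slides' blur pipeline). Unrolling by 4 into independent accumulators removes the single loop-carried dependence, so a wide machine can schedule the adds in parallel:

```cpp
#include <cassert>
#include <cstddef>

// Rolled reduction: one accumulator, so every iteration depends on the last.
long sum_rolled(const int* a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Unrolled by 4: four independent accumulators reduce branch overhead and
// expose ILP that a VLIW processor like Hexagon can pack into one packet.
long sum_unrolled4(const int* a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // remainder when n % 4 != 0
    return s0 + s1 + s2 + s3;
}
```

Halide's unroll(yi) applies the same transformation to the yi loop of the blur, which is why the slide shows four copies of the loop body.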
Scheduling for performance (Halide vs C++ comparison)

Tiling, unrolling and vectorization

Halide schedule:

    f.tile(x, y, xi, yi, 128, 4)
     .unroll(yi)
     .vectorize(xi);

Equivalent C++: the same tiled and unrolled loop nest as above, but with each xi loop now executed as vector operations. This is not possible to express in plain C++ without an auto-vectorizer.
Halide provides many more scheduling directives for greater control over the organization of computation. The Halide user guide and the tutorials hosted at halide-lang.org are the best resources for learning more about these scheduling directives.
Summary
Scheduling a Stage (1)
Reordering Dimensions
• Changes loop order
• Order is defined from innermost to outermost loop
• Usage: output.reorder(dimensions from innermost -> outermost), e.g. .reorder(y, x)

Default traversal (x innermost, row by row):

    1 2 3
    4 5 6
    7 8 9

After .reorder(y, x) (y innermost, column by column):

    1 4 7
    2 5 8
    3 6 9
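The two traversal orders above can be sketched in plain C++ by recording the (x, y) coordinates in the order they are visited (the 3x3 extent and function names are assumptions for the example):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Default loop nest: x innermost, so the stage is traversed row by row.
std::vector<std::pair<int, int>> visit_x_innermost(int W, int H) {
    std::vector<std::pair<int, int>> order;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            order.push_back({x, y});
    return order;
}

// After reorder(y, x): y becomes innermost, traversal goes column by column.
std::vector<std::pair<int, int>> visit_y_innermost(int W, int H) {
    std::vector<std::pair<int, int>> order;
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H; ++y)
            order.push_back({x, y});
    return order;
}
```

For a 3x3 stage, the second visit is (1, 0) in the default order but (0, 1) after the reorder, matching the two grids above.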
Scheduling a Stage (2)
Reordering Dimensions (continued)

[Figure: loop nest and traversal order after reorder(y, x)]
Scheduling a Stage (7)
Tiling (continued)

    .tile(x, y, xi, yi, 2, 2)

The outer loops (xo, yo) iterate over the tiles; the inner loops (xi, yi) iterate inside each tile. For a 4x4 image split into 2x2 tiles, pixels are visited in this order:

    1  2  5  6
    3  4  7  8
    9 10 13 14
    11 12 15 16

    output.parallel(y, work_per_thread)

[Figure: with output.parallel(y, work_per_thread), bands of rows are distributed across parallel threads 1-4]
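The tile traversal order above can be reproduced in plain C++ by labeling each pixel with the step at which the tiled loop nest reaches it (the function name and the row-major output layout are assumptions for the example):

```cpp
#include <cassert>
#include <vector>

// Label each pixel of a WxH image with the step at which
// tile(x, y, xi, yi, TX, TY) visits it: the outer loops (yo, xo) walk the
// grid of tiles, the inner loops (yi, xi) walk pixels within one tile.
std::vector<int> tile_visit_order(int W, int H, int TX, int TY) {
    std::vector<int> step(W * H);
    int n = 0;
    for (int yo = 0; yo < H / TY; ++yo)
        for (int xo = 0; xo < W / TX; ++xo)
            for (int yi = 0; yi < TY; ++yi)
                for (int xi = 0; xi < TX; ++xi)
                    step[(yo * TY + yi) * W + (xo * TX + xi)] = ++n;
    return step;
}
```

Running this for a 4x4 image with 2x2 tiles reproduces the grid on the slide: the first tile is finished (steps 1-4) before the second tile (steps 5-8) begins.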
Interleaving Stages (1)
Producer Consumer Relationship
• Consider the Halide pipeline: (skipping obvious details)
Interleaving Stages (2)
Compute Inline
• Calculate the producer as and when required by the consumer
• Directly replace the producer expression with its value
• By default, all stages are inlined
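A plain-C++ sketch makes the cost of inlining concrete. The pipeline here is hypothetical (producer(x, y) = x + y, consumer(x, y) = producer(x, y) + producer(x, y+1)); it is not from the slides. Because inlining substitutes the producer expression at each use, the producer is evaluated twice per consumer pixel:

```cpp
#include <cassert>

// Hypothetical two-stage pipeline:
//   producer(x, y) = x + y
//   consumer(x, y) = producer(x, y) + producer(x, y + 1)
static int producer_evals = 0;

int producer(int x, int y) {
    ++producer_evals;          // count every evaluation of the producer
    return x + y;
}

// Inlined schedule: the producer is re-evaluated at each of its two use
// sites, so a WxH consumer output costs 2*W*H producer evaluations.
int consumer_inline_evals(int W, int H) {
    producer_evals = 0;
    int checksum = 0;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            checksum += producer(x, y) + producer(x, y + 1);
    (void)checksum;
    return producer_evals;
}
```

For a cheap producer like this one, re-evaluation is fine and the locality is excellent; the trade-off changes when the producer is expensive, which is what compute_root addresses.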
Interleaving Stages (3)
Compute Root
• Calculate the entire producer before starting the consumer
• Usage: producer.compute_root()
• producer_buffer is a Halide-created intermediate buffer
• Inlining is good for simple stages with little data reuse; compute_root is good for a complex producer function, or when multiple consumers consume the values
Interleaving Stages (5)
.compute_at (slide 1)
• Place the computation of producer inside specified loop level of consumer
• Usage: producer.compute_at(consumer, y)
Interleaving Stages (6)
.compute_at (continued)
• producer.compute_at(consumer, y);
• Locality (bad -> good): compute_root < compute_at(consumer, y) < inline
• Redundant computation (least -> most): compute_root < compute_at(consumer, y) < inline
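The middle ground can also be sketched in plain C++, again with the hypothetical pipeline producer(x, y) = x + y, consumer(x, y) = producer(x, y) + producer(x, y+1). With producer.compute_at(consumer, y), the two producer rows each consumer row needs are computed into a small per-iteration buffer:

```cpp
#include <cassert>
#include <vector>

// compute_at(consumer, y) sketch: inside each iteration of the consumer's
// y loop, compute just the producer rows y and y+1 into a 2-row buffer.
// Locality beats compute_root (2 rows live at a time, not the whole image),
// at the cost of recomputing row y+1 when the next iteration needs it again.
int consumer_compute_at_evals(int W, int H, std::vector<int>& out) {
    int evals = 0;
    out.assign(W * H, 0);
    for (int y = 0; y < H; ++y) {
        std::vector<int> rows(2 * W);              // producer rows y and y+1
        for (int r = 0; r < 2; ++r)
            for (int x = 0; x < W; ++x) {
                rows[r * W + x] = x + (y + r);     // producer(x, y + r)
                ++evals;
            }
        for (int x = 0; x < W; ++x)
            out[y * W + x] = rows[x] + rows[W + x];
    }
    return evals;   // 2 * W * H in this sketch: some redundancy, good locality
}
```

This sits between the two extremes on the slide: less redundant work than recomputing at every use site would cost for an expensive producer, and far better locality than a whole-image producer_buffer.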
[Figure: excerpt of the hand-written HVX intrinsics implementation of the 3x3 convolution, with HVX_Vector declarations and Q6_Ww_vrmpyacc_WwWubRbI accumulate calls inside the pixel loop; writing it requires deep knowledge of the HVX architecture and microarchitecture]
Halide on Hexagon with HVX
3x3 convolution: Halide vs HVX intrinsics

                        Halide                  HVX intrinsics
    Performance         0.109 cycles/pixel      0.114 cycles/pixel
    Implementation      Algorithm + schedule    Hand-optimized, complex loop nest

The Halide workflow: write the algorithm once, test, and freeze the algorithm after tests pass. Then alter schedules to explore optimizations and improve performance. Never change the algorithm again!
Advantages of using Halide vs Intrinsics or Assembly

Development time and cost
• Significantly more time is needed to author assembly/intrinsics compared to a Halide program

Expertise
• Assembly/intrinsics require a nuanced understanding of the processor architecture and microarchitecture, which only scales to a few programmers
• Halide allows a significantly larger set of developers to use HVX
• Halide allows programmers to focus on the algorithm and not on processor details

Scalability
• Assembly and intrinsics programming does not scale to very large code bases

Automatic optimization
• Halide allows optimizations across kernels for the entire image pipeline; it is not possible to optimize across libraries of assembly/intrinsics
• Halide allows programmers to easily tune a program for performance

Portability
• With Halide, no code changes are required for new versions of Hexagon
• The compiler will automatically take advantage of new hardware features and instructions
Halide HVX v60 Performance on Hexagon Simulator
Halide performance vs. hand-optimized intrinsic benchmarks (higher is better; 100% equals intrinsic performance)
Testing on simulator with perfect cache behaviour

[Bar chart: across nine benchmarks, Halide reaches 86.71%, 95.16%, 95.88%, 95.90%, 100.40%, 101.12%, 104.51%, 109.06%, and 125.94% of hand-optimized intrinsic performance]
Thank you
For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Nothing in these materials is an offer to sell any of the components or devices referenced herein.
©2017 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names
may be trademarks or registered trademarks of their respective owners.
References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business,
QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated,
operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all
of its product and services businesses, including its semiconductor business, QCT.