Extending Hardware Transactional Memory Capacity via Rollback-Only Transactions and Suspend/Resume
POWER8-TM
Shady Issa, Pascal Felber, Alexander Matveev, Paolo Romano
Transactional Memory
• alternative paradigm for parallel programming
• easy to use
Hardware Transactional Memory
• Intel and IBM processors
• best effort
Capacity Limitations
[Figure: HTM-SGL throughput (Tx/s) vs. transaction size, with a breakdown by abort type (legend: ROT capacity, ROT conflicts, Lock aborts, HTM capacity, HTM non-tx, HTM tx).]
• capacity aborts: activation of the fallback path
POWER8-TM
• hardware/software co-design
• suspend/resume
• ROTs
Rollback-only Transaction
• lightweight transaction type
• not serialisable
ROTs
Initially X=0
Thread 1                Thread 2
Begin ROT
                        Begin ROT
read X (returns 0)
                        X=1
                        End ROT
read X (returns 1)
• the WAR conflict is not detected: Thread 1 reads an inconsistent value
ROTs
Initially X=0
Thread 1                Thread 2
Begin ROT
                        Begin ROT
read X (returns 0)
                        X=1
read X (returns 0)
                        End ROT
• RAW: Thread 1 reads a consistent value; the new value can only appear at Thread 2's End ROT
ROTs
Initially X=0
Thread 1                Thread 2
Begin ROT
                        Begin ROT
read X
                        X=1
                        wait for concurrent ROTs (non-transactionally)
read X
                        End ROT
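The ROT timelines above can be replayed in a small Python model. This is a toy sketch, not P8TM code: the class `RotMemory`, its method names, and the way the wait is modelled (a snapshot of the ROTs still running when a writer reaches End ROT) are all illustrative assumptions. The model only captures two properties taken from the slides: ROT writes stay invisible until End ROT, and ROT reads are not tracked.

```python
class RotMemory:
    """Toy model: ROT writes are buffered until commit; reads are untracked."""

    def __init__(self, **initial):
        self.mem = dict(initial)
        self.active = set()   # thread ids currently running a ROT body
        self.buffers = {}     # thread id -> pending (uncommitted) writes
        self.waiting = {}     # thread id -> ROTs it still has to wait for

    def begin(self, tid):
        self.active.add(tid)
        self.buffers[tid] = {}

    def read(self, tid, addr):
        # A ROT sees its own pending writes, otherwise committed memory.
        return self.buffers[tid].get(addr, self.mem[addr])

    def write(self, tid, addr, value):
        self.buffers[tid][addr] = value

    def try_end(self, tid, quiesce):
        """Try to commit. With quiesce=True, return False while any ROT that
        was running when tid first reached End ROT is still running."""
        self.active.discard(tid)
        if quiesce:
            pending = self.waiting.setdefault(tid, set(self.active))
            pending &= self.active          # in-place: drop finished ROTs
            if pending:
                return False
            del self.waiting[tid]
        self.mem.update(self.buffers.pop(tid))
        return True


def run(quiesce):
    m = RotMemory(X=0)
    m.begin(1)
    m.begin(2)
    first = m.read(1, "X")
    m.write(2, "X", 1)
    committed = m.try_end(2, quiesce)       # Thread 2 reaches End ROT
    second = m.read(1, "X")
    m.try_end(1, quiesce)                   # Thread 1 finishes
    if not committed:
        committed = m.try_end(2, quiesce)   # Thread 2 retries after the wait
    return first, second, m.mem["X"], committed
```

In this model, `run(False)` reproduces the inconsistent pair of reads (0, then 1), while `run(True)` gives Thread 1 the value 0 both times and lets X=1 appear only once Thread 1 has finished.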
ROTs
Initially X=0, Y=0
Thread 1                Thread 2
Begin ROT
                        Begin ROT
read X
                        X=1
                        read Y
Y=1
• WAR conflicts on both X and Y
• Thread 1 observes X=0, Y=1; Thread 2 observes X=1, Y=0
Touch-to-Validate (T2V)
Initially X=0, Y=0
Thread 1                Thread 2
Begin ROT
                        Begin ROT
read X
                        write X
                        read Y
write Y
re-read X               re-read Y
End ROT                 End ROT
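The T2V timeline can be sketched in the same toy style. Everything here is an assumption made for illustration: the class `RotT2V`, the software read-set list, and in particular the conflict rule, namely that loading a location written by another still-active ROT is detected and aborts that writer. The slides only show that each ROT re-reads what it has read before End ROT.

```python
class RotT2V:
    """Toy model: writes of active ROTs are tracked, reads are not."""

    def __init__(self, **initial):
        self.mem = dict(initial)
        self.writes = {}      # tid -> buffered writes of an active ROT
        self.reads = {}       # tid -> software read-set log (addresses)
        self.aborted = set()

    def begin(self, tid):
        self.writes[tid] = {}
        self.reads[tid] = []

    def _touch(self, tid, addr):
        # Assumed rule: a load that hits another active ROT's write set
        # aborts that writer and discards its buffered writes.
        for other in list(self.writes):
            if other != tid and addr in self.writes[other]:
                del self.writes[other]
                self.aborted.add(other)
        return self.writes.get(tid, {}).get(addr, self.mem[addr])

    def read(self, tid, addr):
        self.reads[tid].append(addr)    # software instrumentation of the read
        return self._touch(tid, addr)

    def write(self, tid, addr, value):
        if tid in self.writes:          # ignored once the ROT was aborted
            self.writes[tid][addr] = value

    def end(self, tid, t2v):
        if tid in self.aborted:
            return False
        if t2v:
            for addr in self.reads[tid]:    # touch-to-validate: re-read
                self._touch(tid, addr)
        self.mem.update(self.writes.pop(tid))
        return True


def write_skew(t2v):
    m = RotT2V(X=0, Y=0)
    m.begin(1)
    m.begin(2)
    x_seen = m.read(1, "X")
    m.write(2, "X", 1)
    y_seen = m.read(2, "Y")
    m.write(1, "Y", 1)
    ok1 = m.end(1, t2v)
    ok2 = m.end(2, t2v)
    return x_seen, y_seen, ok1, ok2, m.mem
```

Without the re-read, both ROTs commit although each has read a value the other overwrote. With the re-read, Thread 1's touch of X hits Thread 2's write set, so in this model only Thread 1 commits.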
TMCAM
Begin HTM
read A
read B
read C
read D
write E
End HTM
• TMCAM: 64 entries
• entries 1 to 5 hold &A, &B, &C, &D, &E
Read-set Tracking
Begin ROT
read A, store &A
read B, store &B
read C, store &C
read D, store &D
write E
End ROT
• TMCAM: 64 entries
• entry 1 tracks the line storing &A, &B, &C, &D; entry 2 tracks &E
• 8 bytes per stored address, 128 bytes per tracked cache line
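The numbers on the two slides above (64 entries, 128-byte lines, 8-byte addresses) give a quick capacity estimate. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions that are not stated in the slides: every read and write touches a distinct cache line, and the software read-set log is contiguous and line-aligned. The function names are hypothetical.

```python
import math

TMCAM_ENTRIES = 64   # tracked cache lines (from the slides)
LINE_BYTES = 128     # bytes per tracked cache line (from the slides)
ADDR_BYTES = 8       # bytes per logged address (from the slides)


def htm_entries(reads, writes):
    """Plain HTM: every distinct line read or written takes one entry."""
    return reads + writes


def rot_entries(reads, writes):
    """ROT with a software read-set log: only written lines are tracked.
    The log is itself written, packing 128 // 8 = 16 addresses per line."""
    addresses_per_line = LINE_BYTES // ADDR_BYTES
    return math.ceil(reads / addresses_per_line) + writes


def fits(entries):
    return entries <= TMCAM_ENTRIES
```

For the slide's example (4 reads, 1 write) this gives 5 entries under HTM and 2 under a ROT. For 100 reads and 4 writes, HTM would need 104 entries and overflow, while the ROT needs 7 + 4 = 11.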
HTM
HTM + ROT
Initially X=0, Y=0
Thread 1 (HTM)          Thread 2 (ROT)
Begin HTM
                        Begin ROT
read X
Y=1                     X=1
End HTM
                        End ROT
• HTM is protected by H/W

Thread 1 (HTM)          Thread 2 (ROT)
Begin HTM
                        Begin ROT
read X
                        read Y (returns 0)
Y=1
End HTM                 read Y, T2V (returns 0)
                        End ROT
• using S/R in Thread 1, the T2V re-read in Thread 2 returns a consistent value
• HTM is protected by H/W
Uninstrumented Read-only
• no bounds on Tx size
POWER8-TM
• read-only Tx: w/o instrumentation
• update Tx: HTM, ROT, GL
• for small transactions, some paths are overkill
• for large transactions, some paths are useless
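The path selection on this slide can be sketched as a simple fallback chain. This is a minimal illustration, not the P8TM implementation: `run_transaction`, the `attempt` callback, the retry budget of 3, and the assumption that the read-only and global-lock paths always complete are all hypothetical. "URO" is used here for the uninstrumented read-only path, which is presumably what the legend in the evaluation plots refers to.

```python
def run_transaction(is_read_only, attempt, retries=3):
    """Pick an execution path in the order shown on the slide.

    attempt(path) is a hypothetical callback that runs the transaction
    body on the given path and returns True on commit, False on abort.
    """
    if is_read_only:
        attempt("URO")       # uninstrumented read-only: assumed to complete
        return "URO"
    for path in ("HTM", "ROT"):
        for _ in range(retries):
            if attempt(path):
                return path
    attempt("GL")            # global lock: assumed to always succeed
    return "GL"
```

A transaction that always aborts in HTM (for example, for capacity) but fits in a ROT would be tried three times in HTM and then commit on its first ROT attempt.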
Self-tuning
• lightweight, online reinforcement learning
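The evaluation labels the self-tuning variant P8TMUCB, which suggests an upper-confidence-bound bandit. The sketch below shows UCB1, one common variant; whether P8TM uses exactly this formula is an assumption, as are the class name `Ucb1`, the arm names, and the idea of feeding it a normalised throughput as reward.

```python
import math


class Ucb1:
    """UCB1 bandit: pick the arm with the best mean reward plus bonus."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        for arm in self.arms:            # play every arm once first
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(n) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        # reward: e.g. throughput measured in the last interval, scaled to [0, 1]
        self.counts[arm] += 1
        self.totals[arm] += reward
```

Used with two hypothetical arms such as "rots_on" and "rots_off", the learner keeps sampling both but spends most intervals in whichever configuration has delivered the higher reward so far.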
Evaluation: Vacation
[Figure: throughput (10^5 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) vs. number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TMUCB, HTM-SGL and HyNoRec; 10 physical cores.]
• >3x
• committing in h/w
Evaluation: SSCA2
[Figure: throughput (10^6 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) vs. number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TMUCB, HTM-SGL and HyNoRec.]
• small Txs
• UCB disables ROTs
Conclusion
• POWER8-TM exploits ROTs and suspend/resume to extend the capacity of HTM
[Figure: abort rate (%) for HTM-SGL by abort type (HTM non-tx, Lock aborts, ROT capacity).]
Evaluation: Vacation
[Figure: throughput (10^5 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) vs. number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TMUCB, HTM-SGL, HyNoRec, NoRec and HERWL; 10 physical cores.]
• >3x
• committing in h/w
Evaluation: SSCA2
[Figure: throughput (10^6 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) vs. number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TMUCB, HTM-SGL, HyNoRec, NoRec and HERWL.]
• small Txs
• UCB disables ROTs