
Extending Hardware Transactional Memory Capacity
via Rollback-Only Transactions and Suspend/Resume

POWER8-TM

Shady Issa, Pascal Felber, Alexander Matveev, Paolo Romano

Transactional Memory
• alternative paradigm for parallel programming

• easy to use

• potential of fine-grained locking performance


withdraw(account, value) {
    // everything inside the block executes atomically
    __transaction {
        if (account.balance > value) {
            account.balance -= value;
            return account.balance;
        } else {
            return -1;
        }
    }
}
Transactional memory implementation

Hardware Transactional Memory
• Intel and IBM processors

• implemented in the cache coherence protocol

• cache line granularity

• best effort

• S/W fallback is needed
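
Because the hardware is best effort, a transaction may never commit in HTM, so some software path has to guarantee progress. The sketch below simulates the common "retry a few times, then take a single global lock" pattern. The names `execute`, `attempt_htm` and `make_capacity_limited_htm`, the retry budget and the 64-line capacity are assumptions made for illustration; real code would use the processor's HTM instructions and would also have the hardware path check the lock, which is omitted here.

```python
import threading


def execute(body, attempt_htm, lock, max_retries=3):
    """Run body through attempt_htm up to max_retries times, then under the lock.

    attempt_htm(body) is a hypothetical hook that returns (True, result) on
    commit and (False, None) on abort.
    """
    for _ in range(max_retries):
        committed, result = attempt_htm(body)
        if committed:
            return "HTM", result
    with lock:  # single global lock (SGL) fallback: always makes progress
        return "SGL", body()


def make_capacity_limited_htm(footprint_lines, capacity_lines=64):
    """Simulated hardware that aborts when the footprint exceeds its capacity."""
    def attempt(body):
        if footprint_lines > capacity_lines:
            return False, None
        return True, body()
    return attempt


lock = threading.Lock()
small = execute(lambda: 1 + 1, make_capacity_limited_htm(10), lock)
large = execute(lambda: 1 + 1, make_capacity_limited_htm(500), lock)
```

In this toy model the small transaction commits on the hardware path, while the large one exhausts its retries and runs serially under the lock.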

Capacity Limitations

[Figure: throughput (10^6 Tx/s) and abort rate (%) versus transaction size for HTM-SGL. Abort breakdown in the legend: HTM tx, HTM non-tx, HTM capacity, lock aborts, ROT conflicts, ROT capacity.]

• capacity aborts

• activation of the fallback path

POWER8-TM
• hardware/software co-design

• utilises specific features available in POWER8 to support execution of larger transactions:

  • suspend/resume

  • rollback-only transactions (ROTs)

Rollback-only Transaction
• lightweight transaction type

• updates are applied atomically

• does not track the reads

• theoretically infinite read-set

• not serialisable
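
The properties above can be illustrated with a toy model. The sketch below is not the hardware mechanism: `SimulatedROT` is a hypothetical class that buffers writes and publishes them together at commit, while reads go straight to shared memory without being recorded. Conflict detection between writers is deliberately left out.

```python
class SimulatedROT:
    """Toy model of a rollback-only transaction over a shared dict."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def read(self, key):
        # Own writes are visible; everything else is read untracked.
        return self.write_buffer.get(key, self.memory[key])

    def write(self, key, value):
        self.write_buffer[key] = value

    def commit(self):
        # All buffered updates become visible together.
        self.memory.update(self.write_buffer)
        self.write_buffer = {}


memory = {"X": 0}
t1, t2 = SimulatedROT(memory), SimulatedROT(memory)

first = t1.read("X")           # t1 reads before t2 commits
t2.write("X", 1)
before_commit = t1.read("X")   # t2's write is still buffered
t2.commit()
second = t1.read("X")          # t1 now observes t2's committed write
```

Since nothing records that `t1` read X, nothing stops `t2` from committing in between, and `t1` observes two different values of X within one transaction.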

ROTs
Initially X=0.

Thread 1                 Thread 2
Begin ROT                Begin ROT
read X (returns 0)
                         X=1           (WAR)
                         End ROT
read X (returns 1: inconsistent value)

ROTs
Initially X=0.

Thread 1                 Thread 2
Begin ROT                Begin ROT
read X (returns 0)
                         X=1
read X (returns 0)                     (RAW: consistent)
                         End ROT       (new value can only appear now)

ROTs
Initially X=0.

Thread 1                 Thread 2
Begin ROT                Begin ROT
read X
                         X=1
                         End ROT       (wait for concurrent ROTs non-transactionally)
read X

ROTs
Initially X=0, Y=0.

Thread 1                 Thread 2
Begin ROT                Begin ROT
read X                   X=1           (WAR on X)
Y=1                      read Y        (WAR on Y)
End ROT                  End ROT

Thread 1 sees X=0, Y=1; Thread 2 sees X=1, Y=0.

Touch-to-Validate

• core algorithm of P8TM

• to make concurrent execution of ROTs safe and


serialisable

• basic intuition: convert WAR to RAW
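
A minimal simulation of this intuition, under stated assumptions: each ROT logs the addresses it reads in software and, before committing, re-reads ("touches") them. If another ROT has a pending write to one of them, the re-read is a read-after-write on speculative data, which the hardware can detect. The `Memory` and `ROT` classes and the `speculative_owner` map are hypothetical stand-ins for that hardware behaviour. The sketch ignores write-write conflicts and simply rolls back the ROT whose validation fails, which need not be the transaction the hardware would abort.

```python
class Memory:
    def __init__(self, values):
        self.values = dict(values)
        self.speculative_owner = {}  # key -> ROT with an uncommitted write


class ROT:
    def __init__(self, memory):
        self.memory = memory
        self.read_log = []           # software read-set: addresses only
        self.write_buffer = {}

    def read(self, key):
        self.read_log.append(key)
        return self.write_buffer.get(key, self.memory.values[key])

    def write(self, key, value):
        self.write_buffer[key] = value
        self.memory.speculative_owner[key] = self

    def touch_to_validate(self):
        """Re-read the logged addresses; return False on a RAW conflict."""
        for key in self.read_log:
            owner = self.memory.speculative_owner.get(key)
            if owner is not None and owner is not self:
                return False
        return True

    def finish(self):
        """Commit if validation passes, else roll back. Returns True on commit."""
        ok = self.touch_to_validate()
        if ok:
            self.memory.values.update(self.write_buffer)
        for key in self.write_buffer:
            if self.memory.speculative_owner.get(key) is self:
                del self.memory.speculative_owner[key]
        self.write_buffer = {}
        return ok


mem = Memory({"X": 0, "Y": 0})
t1, t2 = ROT(mem), ROT(mem)
t1.read("X")
t2.write("X", 1)
t2.read("Y")
t1.write("Y", 1)
t1_committed = t1.finish()  # re-read of X meets t2's pending write
t2_committed = t2.finish()  # t1 was rolled back, so Y is no longer pending
```

In this interleaving only one of the two ROTs commits, so the two WAR dependencies can no longer both take effect.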

T2V
Initially X=0, Y=0.

Thread 1                 Thread 2
Begin ROT                Begin ROT
read X                   write X
write Y                  read Y
re-read X                re-read Y
End ROT                  End ROT

T2V

• needs to track only the addresses

• this must be done in software

• how can software outperform hardware?

TMCAM
Begin HTM
  read A
  read B
  read C
  read D
  write E
End HTM

TMCAM (64 entries), one accessed cache line per entry:
  1: &A
  2: &B
  3: &C
  4: &D
  5: &E
  ...
  64:

Read-set Tracking
Begin ROT
  read A    store &A
  read B    store &B
  read C    store &C
  read D    store &D
  write E
End ROT

TMCAM (64 entries, 128 bytes each):
  1: &A &B &C &D   (8 bytes per address)
  2: &E
  ...
  64:

• up to 16x larger read-set

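
The "up to 16x" figure follows from the sizes on the slide: a 128-byte line holds sixteen 8-byte addresses. The sketch below works through that arithmetic under a simplified model, which is an assumption rather than a description of the real hardware: every TMCAM entry not used by the write-set is available either for one read cache line (HTM) or for one line of packed read addresses (ROT with software tracking). The function names are illustrative.

```python
CACHE_LINE_BYTES = 128  # line size shown on the slide
ADDRESS_BYTES = 8       # size of one logged address, from the slide
TMCAM_ENTRIES = 64      # number of entries shown on the TMCAM slide

addresses_per_line = CACHE_LINE_BYTES // ADDRESS_BYTES


def max_reads_hardware(write_lines):
    """Read cache lines that fit when each read occupies one entry."""
    return TMCAM_ENTRIES - write_lines


def max_reads_software(write_lines):
    """Read addresses that fit when the free entries hold the packed log."""
    return (TMCAM_ENTRIES - write_lines) * addresses_per_line


hw = max_reads_hardware(write_lines=1)
sw = max_reads_software(write_lines=1)
ratio = sw // hw
```

With one written line, the model gives 63 tracked reads in hardware against 1008 logged addresses in software, a ratio of 16.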
HTM

• transactions may fit in HTM

• we need to avoid extra overheads of using ROTs

• try HTM first; if it overflows, fall back to ROT

• how can HTMs and ROTs run concurrently?

HTM + ROT
Initially X=0, Y=0.

A ROT write that conflicts with an HTM read (HTM is protected by H/W):

Thread 1 (HTM)           Thread 2 (ROT)
Begin HTM                Begin ROT
read X
Y=1                      X=1
End HTM
                         End ROT

An HTM commit between two reads of a ROT:

Thread 1 (HTM)           Thread 2 (ROT)
Begin HTM                Begin ROT
read X                   read Y (returns 0)
Y=1
End HTM
                         read Y (T2V; returns 1: inconsistent value)
                         End ROT

Delaying End HTM using S/R:

Thread 1 (HTM)           Thread 2 (ROT)
Begin HTM                Begin ROT
read X                   read Y (returns 0)
Y=1
                         read Y (T2V; returns 0: consistent value)
End HTM                  End ROT

Uninstrumented Read-only

• read-only transactions run without any instrumentation

• outside the context of HTM or ROT

• no bounds on Tx size

• HTMs and ROTs must wait for UROs
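
One way a committing transaction could tell which readers it must wait for is a per-thread counter that is odd while the thread is inside a read-only transaction. The sketch below shows that idea only. `ReaderRegistry` and its methods are hypothetical names, the scheme is a generic quiescence pattern rather than the mechanism used by POWER8-TM, and a real implementation would need atomic operations and memory fences.

```python
class ReaderRegistry:
    def __init__(self, n_threads):
        # Odd value: the thread is inside a URO. Even value: it is outside.
        self.epoch = [0] * n_threads

    def begin_read_only(self, tid):
        self.epoch[tid] += 1  # becomes odd

    def end_read_only(self, tid):
        self.epoch[tid] += 1  # becomes even

    def snapshot(self):
        return list(self.epoch)

    def still_blocking(self, snap):
        """Threads that were inside a URO at snapshot time and are still in it."""
        return [tid for tid, e in enumerate(snap)
                if e % 2 == 1 and self.epoch[tid] == e]


registry = ReaderRegistry(3)
registry.begin_read_only(1)       # thread 1 is reading
snap = registry.snapshot()        # a writer is about to commit
registry.begin_read_only(2)       # started after the snapshot
blocking_before = registry.still_blocking(snap)
registry.end_read_only(1)
blocking_after = registry.still_blocking(snap)
```

A committing HTM or ROT would take the snapshot and spin until `still_blocking(snap)` is empty. Readers that start after the snapshot are not waited for in this sketch.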

POWER8-TM
Transaction
• read-only Tx: w/o instrumentation
• update Tx: HTM -> ROT -> GL
• small Txs: the ROT step is overkill
• large Txs: the HTM step is useless

Self-tuning
• lightweight, online reinforcement learning

• determine execution path:

• HTM —> GL : small Txs

• ROT —> GL : large Txs

• HTM —> ROT —> GL : mixed workload
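
The choice between these three paths can be framed as a multi-armed bandit problem. The sketch below uses the textbook UCB1 rule as one possible instance. The reward values, the arm labels and the assumption that rewards are normalised to [0, 1] (for example, relative throughput over a sampling window) are illustrative and not taken from the slides.

```python
import math


class UCB1:
    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def select(self):
        for arm in self.arms:  # play every arm once before using the bound
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def score(arm):
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return self.mean_reward[arm] + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # Incremental update of the running mean reward.
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n


# Hypothetical workload of small transactions: the HTM-first path pays off most.
rewards = {"HTM->GL": 0.9, "ROT->GL": 0.2, "HTM->ROT->GL": 0.5}
bandit = UCB1(rewards)
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, rewards[arm])
most_played = max(bandit.counts, key=bandit.counts.get)
```

The exploration bonus shrinks as an arm is sampled more often, so the tuner keeps probing the other paths occasionally and can react if the workload changes.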

Evaluation: Vacation

[Figure: throughput (10^5 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) versus number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TM_UCB, HTM-SGL and HyNoRec. 10 physical cores.]

• >3x (callout on the throughput plot)

• committing in h/w (callout on the commit breakdown)

Evaluation: SSCA2

[Figure: throughput (10^6 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) versus number of threads (2, 4, 8, 16, 32, 64) for P8TM, P8TM_UCB, HTM-SGL and HyNoRec.]

• small Txs

• UCB disables ROTs

Conclusion
• POWER8-TM exploits ROTs and suspend/resume to overcome HTM capacity limitations

• TMCAM-aware read-set tracking was necessary

• Self-tuning was effective in adapting to different workloads

• POWER8-TM highlights the value of such hardware features, which can be used in innovative techniques to mitigate hardware limitations

Results: read-set tracking

[Figure: speedup w.r.t. HTM-SGL and abort rate (%) versus bucket length (20, 50, 100, 266, 800, 1333, 2666) for the TE, SE and SE++ variants. Panels are labelled by contention level: low contention, almost no contention, and one panel whose label is truncated to "contention". Abort breakdown in the legend: HTM tx, HTM non-tx, HTM capacity, lock aborts, ROT conflicts, ROT capacity.]

Evaluation: Vacation

[Figure: throughput (10^5 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) versus number of threads (2, 4, 8, 16, 32, 64). Throughput curves: NoRec, HyNoRec, HTM-SGL, P8TM_UCB, HERWL, P8TM. Commit breakdown shown for P8TM, P8TM_UCB, HTM-SGL and HyNoRec. 10 physical cores.]

• >3x (callout on the throughput plot)

• committing in h/w (callout on the commit breakdown)

Evaluation: SSCA2

[Figure: throughput (10^6 tx/s) and commit breakdown (%; URO, GL/STM, ROT, HTM) versus number of threads (2, 4, 8, 16, 32, 64). Throughput curves: NoRec, HyNoRec, HTM-SGL, P8TM_UCB, HERWL, P8TM. Commit breakdown shown for P8TM, P8TM_UCB, HTM-SGL and HyNoRec.]

• small Txs

• UCB disables ROTs

