Thread-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Multi-processors
 Shared-memory architectures
 Memory synchronization
Increasing Importance of
Multiprocessing
 Inability to exploit more ILP
 Power & silicon costs grow faster than performance
 Growing interest in
 High-end servers, as cloud computing & software as a
service grow in importance
 Data-intensive applications
 Lower interest in increasing desktop performance
 Better understanding of how to use multiprocessors
effectively
 Replicating a core is relatively easy compared to a
completely new design
Vectors, MMX, GPUs vs.
Multiprocessors
 Vectors, MMX, GPUs
 SIMD
 Multiprocessors
 MIMD
Multiprocessor Architecture
 MIMD multiprocessor with n processors
 Need n threads or processes to keep it fully utilized
 Thread-Level parallelism
 Uses MIMD model
 Have multiple program counters
 Targeted for tightly-coupled shared-memory
multiprocessors
 Communication among threads through shared
memory
Definition – Grain Size
 Amount of computation assigned to each thread
 Threads can be used for data-level parallelism,
but overheads may outweigh benefits
Symmetric Multiprocessors (SMP)
 Small number of cores
 Cores share a single memory
with uniform memory latency
 A.k.a. Uniform Memory
Access (UMA)
Distributed Shared Memory (DSM)
 Memory distributed among processors
 Processors connected via direct (switched) & non-direct
(multi-hop) interconnection networks
 A.k.a. Non-Uniform Memory Access/latency (NUMA)
Cache Coherence
[Figure: Cores A & B, each with a private cache, both connected to shared RAM]
Cache Coherence (Cont.)
 Processors may see different values through
their caches
 Caching shared data introduces new problems
 In a coherent memory system, a read of a data
item returns the most recently written value
 2 aspects
 Coherence
 Consistency
1. Coherence
 Defines behavior of reads & writes to the same
memory location
 What value can be returned by a read
 All reads by any processor must return most recently
written value
 Writes to same location by any 2 processors are seen
in same order by all processors
 Serialized writes to same location
2. Consistency
 Defines behavior of reads & writes with respect
to access to other memory locations
 When a written value will be returned by a read
 If a processor writes location x followed by
location y, any processor that sees new value of
y must also see new value of x
 Serializes writes to different locations
Enforcing Coherence
 Coherent caches provide
 Migration – movement of data
 Replication – multiple copies of data
 Cache coherence protocols
 Snooping
 Each core tracks sharing status of each of its blocks
 Distributed
 Directory based
 Sharing status of each block kept in a single location
 Centralized
Snooping Coherence Protocol
 Write invalidate protocol
 On write, invalidate all other cached copies
 Use bus itself to serialize
 Write can’t complete until bus access is obtained
 Concurrent writes?
 The one that obtains the bus first wins
 Most common implementation
Snooping Coherence Protocol (Cont.)
 Write update protocol
 A.k.a. Write broadcast protocol
 On write, update all copies
 Need more bandwidth
Snooping Coherence Protocol –
Implementation Techniques
 Locating an item when a read miss occurs
 Write-through cache
 Most recent copy is in memory
 Write-back cache
 Every processor snoops every address placed on shared bus
 If a processor finds it has a dirty block, updated block is sent to
requesting processor
 Cache lines marked as shared or
exclusive/modified
 Only writes to shared lines need an invalidate
broadcast
 After this, line is marked as exclusive
Snoopy Coherence Protocol – State
Transition Example
[Figure: state transition diagrams for the snooping protocol]
Snoopy Coherence Protocol – Issues
 Operations are not atomic
 e.g., detect miss, acquire bus, receive a response
 Applies to both reads & writes
 Creates possibility of deadlock & races
 Actual Snoopy Coherence Protocols are more
complicated
Coherence Protocols – Extensions
 Shared memory bus &
snooping bandwidth is
bottleneck for scaling
symmetric
multiprocessors
 Duplicating tags
 Place directory in
outermost cache
 Use crossbars or point-to-
point networks with
banked memory
Coherence Protocols – Example
 AMD Opteron
 Memory directly connected
to each multicore chip in
NUMA-like organization
 Implement coherence
protocol using point-to-
point links
 Use explicit
acknowledgements to
order operations
Source: www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html
Cache Coherence – Performance
 Coherence influences cache miss rate
 Coherence misses
 True sharing misses
 Write to shared block (transmission of invalidation)
 Read an invalidated block
 False sharing misses
 Read an unmodified word in an invalidated block
Performance Study – Commercial
Workload
[Figure: coherence miss behavior for a commercial workload]
Directory-Based Cache Coherence
 Sharing status of each physical memory block
kept in a single location
 Approaches
 Central directory for memory or common cache
 For Symmetric Multiprocessors (SMPs)
 Distributed directory
 For Distributed Shared Memory (DSM) systems
 Overcomes single point of contention in SMPs
Source: www.icsa.inf.ed.ac.uk/cgi-bin/hase/dir-cache-m.pl?cd-t.html,cd-f.html,menu1.html
Directory Protocols
 For each block, maintain state
 Shared
 1 or more nodes have the block cached, value in memory is
up-to-date
 Set of node IDs
 Uncached
 No node has a copy of the block
 Modified
 Exactly 1 node has a copy of the cache block, value in
memory is out-of-date
 Owner node ID
 Directory maintains block states & sends
invalidation messages
Directory Protocols (Cont.)
 For uncached block
 Read miss
 Requesting node is sent requested data & is made the only
sharing node, block is now shared
 Write miss
 Requesting node is sent requested data & becomes the sharing
node, block is now exclusive (modified)
 For shared block
 Read miss
 Requesting node is sent requested data from memory, node is
added to sharing set
 Write miss
 Requesting node is sent value, all nodes in sharing set are sent
invalidate messages, sharing set only contains requesting
node, block is now exclusive
Directory Protocols (Cont.)
 For exclusive block
 Read miss
 Owner is sent a data fetch message, block becomes shared,
owner sends data to directory, data written back to memory,
sharers set contains old owner & requestor
 Data write back
 Block becomes uncached, sharer set is empty
 Write miss
 Message is sent to old owner to invalidate & send value to
directory, requestor becomes new owner, block remains
exclusive
Synchronization Operations
 Basic building blocks
 Atomic exchange
 Swaps register with memory location
 Test-and-set
 Tests a value & sets it if the test passes
 Fetch-and-increment
 Reads original value from memory & increments it in memory
 Requires memory read & write in uninterruptable
instruction
 load linked/store conditional
 If contents of memory location specified by load linked are
changed before store conditional to same address, store
conditional fails
Implementing Locks Using Coherence
 Spin lock
 If no coherence
        DADDUI R2,R0,#1
lockit: EXCH   R2,0(R1)   ;atomic exchange
        BNEZ   R2,lockit  ;already locked?
 If coherence
lockit: LD     R2,0(R1)   ;load of lock
        BNEZ   R2,lockit  ;not available - spin
        DADDUI R2,R0,#1   ;load locked value
        EXCH   R2,0(R1)   ;swap
        BNEZ   R2,lockit  ;branch if lock wasn't 0
Implementing Locks – Advantages
 Reduces memory traffic
 During each iteration, current lock value can be read from local cache
 Exploits locality in accessing the lock
Summary
 Multi-processors
 Create new problems, e.g., cache coherency
 Aspects of cache coherence
 Coherence
 Consistency
 Shared-memory architectures
 Snooping
 Directory based
 Memory synchronization


Editor's Notes

  • #32 Spin lock: with no coherence, the lock lives in memory & each spin iteration reads it from memory. With coherence, the lock is cached, so the processor spins on its local copy until the value changes. Doing the EXCH on the first line would be worse, as every exchange attempt requires exclusive access to the cache block.