
Blackfin SMP "Like"

The SMP support in certain Blackfin processors is described as “SMP Like” rather than just “SMP” because the hardware lacks cache coherency. A true SMP system supports cache coherency in hardware.

On all “SMP Like” setups, cache coherency is maintained via software mechanisms. The result has a few significant implications:

  • caches must be in write through mode (no write back)
  • overhead when forcing coherency (via entire cache invalidates)
  • all threads of a process are restricted to the same core

On systems where the L1 SRAM cannot be accessed directly from another core (such as the BF561), dedicated L1 SRAM cannot be used in the kernel. Care must be taken when using L1 SRAM from userspace: make sure such applications have their affinity set to a specific core.

How to Enable SMP on BF561

Enabling SMP on the BF561 is straightforward. First, go into the kernel configuration and select:

Linux Kernel Configuration
  Blackfin Processor Options  --->
    CPU (BF561)
    [*] Symmetric multi-processing support

After you select SMP, the “Blackfin Kernel Optimizations” menu disappears, because the use of L1 memory is limited under SMP.

SMP is now enabled in the kernel configuration.

Ensure SMP Kernel is Working

Several places indicate that the SMP kernel is working.

  • Kernel booting information
root:/> dmesg | grep SMP
Linux version 2.6.28-rc2-ADI-2009R1-pre (ymm@gyang) (gcc version 4.1.2 (ADI svn)) #752 SMP Mon Dec 22 13:38:05 CST 2008
SMP: Total of 2 processors activated (1179.64 BogoMIPS).
  • CPU information
root:/> cat /proc/cpuinfo
processor       : 0
vendor_id       : Analog Devices
cpu family      : 0x27bb
model name      : ADSP-BF561 600(MHz CCLK) 100(MHz SCLK) (mpu off)
stepping        : 3
cpu MHz         : 600.000/100.000
bogomips        : 1179.64
Calibration     : 589824000 loops
cache size      : 16 KB(L1 icache) 32 KB(L1 dcache-wt) 0 KB(L2 cache)
dbank-A/B       : cache/cache
icache setup    : 4 Sub-banks/4 Ways, 32 Lines/Way
dcache setup    : 2 Super-banks/4 Sub-banks/2 Ways, 64 Lines/Way
SMP Dcache Flushes      : 31241

processor       : 1
vendor_id       : Analog Devices
cpu family      : 0x27bb
model name      : ADSP-BF561 600(MHz CCLK) 100(MHz SCLK) (mpu off)
stepping        : 3
cpu MHz         : 600.000/100.000
bogomips        : 1179.64
Calibration     : 589824000 loops
cache size      : 16 KB(L1 icache) 32 KB(L1 dcache-wt) 0 KB(L2 cache)
dbank-A/B       : cache/cache
icache setup    : 4 Sub-banks/4 Ways, 32 Lines/Way
dcache setup    : 2 Super-banks/4 Sub-banks/2 Ways, 64 Lines/Way
SMP Dcache Flushes      : 28537

L2 SRAM         : 128KB
board name      : ADI BF561-EZKIT
board memory    : 65536 kB (0x00000000 -> 0x04000000)
kernel memory   : 57336 kB (0x00001000 -> 0x037ff000)
  • Interrupt information
root:/> cat /proc/interrupts
 35:          0          0      INTN  BFIN_UART_RX
 36:        835        145      INTN  BFIN_UART_TX
 42:      14163      14086      INTN  Blackfin Timer Tick
 69:        198        178      INTN  SMP interrupt
 82:          1          0      GPIO  eth0
Err:          0

The second and third columns are the interrupt counts on CoreA and CoreB, respectively.

BF561 Architecture

Each core has its own dedicated L1 data/instruction SRAM and cache, which cannot be accessed by the other core.

[Figure: bf561_diagram_smp.jpg]

While the BF561 has one atomic instruction (TESTSET), it has significant restrictions. Basically, it can only be used on L2 regions of memory. So all inter-core locks are stored in L2 memory.

Here is a comparison between the BF561 and a typical X86:

                             BF561             X86
Cache coherency              N/A               Cache coherency protocols
Atomic instruction           TESTSET           Lock# signal/Lock prefix
Local interrupt controller   CEC               LAPIC
System interrupt controller  SIC (SICA, SICB)  IOAPIC
Local timer                  Core timer        LAPIC timer
Peripheral timer             gptimer           HPET/8254 PIT
Inter-processor interrupt    SICB              LAPIC

Software Cache Coherency

Cache policy

  • Main memory - Write Through
  • Shared on-chip L2 SRAM - Not cacheable

The overhead of keeping write-back caches coherent in software is so high that it is impractical, so only write-through cache mode is supported.

Global Core Lock: protect atomic data

  • A special spin lock that lives in shared on-chip L2 SRAM
  • Functions: _get_core_lock/_put_core_lock
  • Parameter: address of atomic data

file: arch/blackfin/mach-bf561/atomic.S

All spin lock, atomic, and memory barrier operations need to obtain this Core Lock first so as to protect atomic data.
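To make the idea of a test-and-set Core Lock more concrete, here is a minimal user-space sketch in C. It is not the kernel's assembly from atomic.S: C11 atomic_flag stands in for TESTSET on a location in L2 SRAM, and the names core_lock, core_lock_owner, try_get_core_lock, get_core_lock and put_core_lock, as well as the cpu parameter, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock location kept in shared L2 SRAM. */
static atomic_flag core_lock = ATOMIC_FLAG_INIT;

/* Debug aid for this sketch: only read or written while holding the lock. */
static int core_lock_owner = -1;

/* One test-and-set attempt; returns true if the caller now owns the lock. */
static bool try_get_core_lock(int cpu)
{
    /* test_and_set returns the previous value: true means already taken. */
    if (atomic_flag_test_and_set_explicit(&core_lock, memory_order_acquire))
        return false;
    core_lock_owner = cpu;
    return true;
}

/* Spin until the lock is obtained. */
static void get_core_lock(int cpu)
{
    while (!try_get_core_lock(cpu))
        ; /* busy-wait: the other core holds the lock */
}

/* Release the lock so that the other core's next attempt can succeed. */
static void put_core_lock(int cpu)
{
    assert(core_lock_owner == cpu); /* only the holder may release */
    core_lock_owner = -1;
    atomic_flag_clear_explicit(&core_lock, memory_order_release);
}
```

A caller would bracket each access to shared atomic data with get_core_lock(cpu) and put_core_lock(cpu), keeping the protected section as short as possible because the other core spins in the meantime.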

Spin locks

We will cover the spin lock and unlock operations as all other spin lock operations can be trivially extrapolated from here.

A call to spin_lock() does:

  1. Get Core Lock
  2. Get spin lock
  3. Check the Core mask
    1. if this spin lock was held by another Core
      1. cache is probably out of sync, so invalidate all current Core caches
    2. if spin lock was not held by another Core, do nothing
  4. Put Core Lock

A call to spin_unlock() does:

  1. Get Core Lock
  2. Put spin lock
  3. Set Core Mask into spin lock
  4. Put Core Lock
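The two sequences above can be sketched as a small single-threaded simulation in C. This is an illustration under stated assumptions, not the kernel code: struct sim_spinlock, sim_spin_trylock, sim_spin_unlock and the cache_invalidates counters are hypothetical names, the Core Lock is modelled as a plain flag, a cache invalidate is modelled as a counter increment, and clearing the mask once it has been checked is an assumption. sim_spin_trylock models a single attempt; a real spin_lock() would retry until the attempt succeeds.

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_CORES = 2 };

/* Hypothetical lock layout: a lock bit plus a mask of cores that released it. */
struct sim_spinlock {
    bool locked;
    unsigned core_mask;
};

static bool core_lock_taken;                 /* models the global Core Lock */
static unsigned cache_invalidates[NR_CORES]; /* counts simulated invalidates */

static void get_core_lock(void) { assert(!core_lock_taken); core_lock_taken = true; }
static void put_core_lock(void) { assert(core_lock_taken); core_lock_taken = false; }

/* One attempt of spin_lock(); returns true if the lock was acquired. */
static bool sim_spin_trylock(struct sim_spinlock *l, int cpu)
{
    bool got = false;

    assert(cpu >= 0 && cpu < NR_CORES);
    get_core_lock();                      /* 1. get Core Lock */
    if (!l->locked) {
        l->locked = true;                 /* 2. get spin lock */
        if (l->core_mask & ~(1u << cpu))  /* 3. last held by another core? */
            cache_invalidates[cpu]++;     /*    yes: invalidate our caches */
        l->core_mask = 0;                 /*    assumption: mask is consumed */
        got = true;
    }
    put_core_lock();                      /* 4. put Core Lock */
    return got;
}

static void sim_spin_unlock(struct sim_spinlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CORES);
    get_core_lock();                      /* 1. get Core Lock */
    assert(l->locked);
    l->locked = false;                    /* 2. put spin lock */
    l->core_mask |= 1u << cpu;            /* 3. set Core Mask into spin lock */
    put_core_lock();                      /* 4. put Core Lock */
}
```

In this model, a core that takes the same lock twice in a row never invalidates its caches, while a lock that moves from one core to the other costs the new holder one invalidate.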

Atomic Operations

All atomic operations do:

  1. obtain Core Lock
  2. perform operation
  3. release Core Lock
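As a rough user-space illustration of this pattern, the sketch below wraps a plain read-modify-write in a lock. The names sim_atomic_t and sim_atomic_add_return are hypothetical, and a C11 atomic_flag stands in for the Core Lock in L2 SRAM; the kernel's real implementation is in assembly and differs in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the Core Lock kept in shared L2 SRAM. */
static atomic_flag core_lock = ATOMIC_FLAG_INIT;

static void get_core_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&core_lock, memory_order_acquire))
        ; /* spin until the other core releases the lock */
}

static void put_core_lock(void)
{
    atomic_flag_clear_explicit(&core_lock, memory_order_release);
}

/* Hypothetical counterpart of the kernel's atomic_t. */
typedef struct { int counter; } sim_atomic_t;

/* Add i to v and return the new value, all under the Core Lock. */
static int sim_atomic_add_return(int i, sim_atomic_t *v)
{
    int result;

    assert(v != NULL);
    get_core_lock();      /* 1. obtain Core Lock */
    v->counter += i;      /* 2. perform operation */
    result = v->counter;
    put_core_lock();      /* 3. release Core Lock */
    return result;
}
```

Other operations (subtract, set or clear bits, exchange) would follow the same three steps with a different body in step 2.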

Memory barrier

A global variable, barrier_mask, located in L2 SRAM, records whether a barrier operation has crossed cores. When it has, smp_rmb() and smp_mb() invalidate the entire data cache.

  • The write barrier, smp_wmb(), sets the other core's mask in barrier_mask.
  • The read barrier, smp_rmb(), checks whether the current core's mask is set in barrier_mask; if so, it invalidates the entire data cache.
  • The full (write/read) barrier, smp_mb(), first checks the mask and then marks it.
  • The data-dependency barrier, smp_read_barrier_depends(), is the same as smp_rmb().
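The barrier bookkeeping described above can be modelled with a few lines of C. This is a single-threaded sketch, not the kernel implementation: the sim_ prefixed function names and the dcache_invalidates counters are hypothetical, a data cache invalidate is modelled as a counter increment, the Core Lock is left out for brevity, and clearing a core's bit once it has been checked is an assumption.

```c
#include <assert.h>

enum { NR_CORES = 2, ALL_CORES = (1u << NR_CORES) - 1 };

static unsigned barrier_mask;                 /* would live in L2 SRAM */
static unsigned dcache_invalidates[NR_CORES]; /* counts simulated invalidates */

/* Write barrier: mark every other core as needing a resync. */
static void sim_smp_wmb(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CORES);
    barrier_mask |= ALL_CORES & ~(1u << cpu);
}

/* Read barrier: if this core was marked, invalidate its data cache. */
static void sim_smp_rmb(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CORES);
    if (barrier_mask & (1u << cpu)) {
        barrier_mask &= ~(1u << cpu);  /* assumption: the mark is consumed */
        dcache_invalidates[cpu]++;     /* stands in for a full invalidate */
    }
}

/* Full barrier: first check the mask, then mark it. */
static void sim_smp_mb(int cpu)
{
    sim_smp_rmb(cpu);
    sim_smp_wmb(cpu);
}
```

In this model a read barrier is cheap unless the other core has issued a write barrier since the last check, which is the case the mask is meant to detect.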

Other coherency points

  • cpu_relax() will call smp_mb()

SMP Performance

Whetstone test platform:

  • Hardware - BF561-Ezkit, Core clock 600MHz, System clock 100MHz, Silicon Rev. 0.3
  • Software - Kernel 2.6.26.5-ADI-2009R1-pre with the SMP patch applied

Command Line             UP           SMP
whetstone                15s          15s
whetstone & whetstone    30s & 30s    15~18s & 18~24s

  • Code Efficiency (WORST): (15+15)/(18+24)=0.714
  • Code Efficiency (BEST): (15+15)/(15+18)=0.909

Application Suggestions

  • SMP is a good choice for multitasking applications, such as VoIP or video encoders.
  • The system calls sched_setaffinity/sched_getaffinity are used to set and query the Core affinity.
  • The busybox utility taskset can set the affinity before executing another program.
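As an illustration of the affinity system calls mentioned above, here is a minimal sketch in C using the Linux sched_setaffinity() and sched_getaffinity() interfaces. The helper names pin_to_core and may_run_on_core are hypothetical, and _GNU_SOURCE is assumed to be defined before any header is included so that the CPU_* macros are visible.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/* Restrict the calling process to a single core.
 * Returns 0 on success, -1 on error (errno is set by the system call). */
static int pin_to_core(int core)
{
    cpu_set_t set;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = caller */
}

/* Returns true if the calling process is currently allowed on this core. */
static bool may_run_on_core(int core)
{
    cpu_set_t set;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return false;
    return CPU_ISSET(core, &set) != 0;
}
```

An application that uses L1 SRAM could call pin_to_core(0) or pin_to_core(1) early in main(), before it allocates or touches that memory, and check the return value before continuing.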

POSIX IPC Support

The SMP-like kernel supports POSIX IPC semaphores and message queues. Memory shared between two processes running simultaneously on different cores has cache coherency issues. If you intend to use POSIX shm, you can:

  • Bind the processes that share the memory to the same core.
  • Protect the shared memory operations using semaphores or messages.

If you do not need POSIX compliance, you can:

  • Statically allocate L2 memory in your application as the shared memory.