
Inter Core Communication Introduction

Parallel Processing refers to the concept of speeding up the execution of a program by dividing the program into multiple fragments that can execute simultaneously, each on its own processor. A program being executed across n processors might execute n times faster than it would using a single processor.

Traditionally, multiple processors were provided within a specially designed “parallel computer”; along these lines, Linux now supports SMP systems in which multiple processors share a single memory and bus interface within a single computer. It is also possible for a group of computers (for example, a group of PCs each running Linux) to be interconnected by a network to form a parallel-processing cluster. The third alternative for parallel computing using Linux is to use the multimedia instruction extensions (i.e., MMX) to operate in parallel on vectors of integer data. Finally, it is also possible to use a Linux system as a “host” for a specialized attached parallel processing compute engine. All these approaches are discussed in detail in the Parallel-Processing HOWTO, and the fourth (a specialized attached parallel processing compute engine) is the one described here.

This document describes some design/architecture ideas on how to make Core B easier to use within a Linux framework. The initial implementation can be found in the blackfin kernel and uClinux-dist source repositories under a folder named icc. If you would like to provide feedback, please do so on the forums.

The classical trade-off between system performance and ease of programming is one of the primary differentiators between general purpose operating systems (GPOS) and real-time operating systems (RTOS).

GPOSes tend to provide a higher degree of resource abstraction. This improves application portability and ease of development, and increases system robustness through software modularity and isolation of resources. This makes a GPOS ideal for addressing general purpose system components such as networking, user interface and display management.

However, this abstraction sacrifices the fine-grained control of system resources required to meet the performance goals of computationally intensive algorithms such as signal processing code. For this level of control, developers typically turn to a real-time operating system (RTOS), or program directly on bare metal.

Use Cases

There are various use cases for loading bare metal applications, or applications under a real-time OS, onto Core B and using it like a hardware accelerator.

Compute Accelerator

There are various tasks that normally run under the Linux kernel which can be accelerated by offloading them to Core B.

Video Accelerator

Running an optimized H.264, MPEG, or WMV video codec on Core B, with mplayer running on Core A.

  • mplayer runs in Linux on Core A
  • the Linux kernel manages 100% of the peripherals (including LCD)
  • H.264 decoder on core B
  • The two pass the raw bitstream, and decoded video through the CPU→DSP framework.

The decoder does nothing except decode the video stream into frame buffers. mplayer opens an H.264 bitstream from either a file on disk or a connection over the network.

Video

  • mplayer runs in Linux on Core A
  • the Linux kernel manages some of the peripherals (not including LCD)
  • H.264 decoder on core B
  • The mplayer passes the raw bitstream to the decoder, and decoded video is passed directly to the PPI from Core B.
  • The DSP code should negotiate a proper DMA/IRQ/GPIO/DRAM resource allocation with the Linux kernel through the blackfin DSP framework.

Crypto Accelerator

Crypto_API_(Linux) offers hardware acceleration support.

Real Time Task

There are times when the real-time performance offered by the Linux kernel or by ADEOS is not enough for the application. In those cases, you can still use a thin RTOS (VDK, uCos, etc.) on Core B, and Linux on Core A.

Shared Memory based Inter-core Communication Protocol

This section is intended to define the communication well enough that different implementations can successfully communicate.

Shared Memory

There are a fixed number of shared variables with sizes and addresses known to each processor.

There is a shared variable for the basic message queue. Protocols that use this queue may require additional shared variables or may require individual processors to have a pool of shareable memory from which buffers can be allocated.

If processors have different word sizes and address maps, then the addresses of shared buffers and the size of addressable units could differ, and the protocol would need to define a common address representation and addressable unit. We also define types of at least 16 bits and 32 bits whose size is larger than or equal to the smallest addressable unit.

We will use the data types:

typedef 'some unsigned integer' sm_unit_t;   // defined in specifics
typedef 'some unsigned integer' sm_uint16_t; // defined in specifics
typedef 'some unsigned integer' sm_uint32_t; // defined in specifics
typedef 'some integral type' sm_address_t;   // defined in specifics

Specifics for Blackfin

Both cores on BF561 and BF60x have the same address space and are byte addressed.

typedef uint8_t sm_unit_t;
typedef uint16_t sm_uint16_t;
typedef uint32_t sm_uint32_t;
typedef void *sm_address_t;

Cache policy

One of the assumptions of the MCAPI/ICC protocol is that the payload buffer received on one core is located in the memory region managed (owned) by the other core.

Core 0 should set up write-through CPLB entries for the memory region managed by core 1, so that an invalidate instruction on core 0 doesn't flush stale data from its cache back into an MCAPI payload buffer sent by core 1, or drop unrelated data in the same cache line near the MCAPI payload boundary. Whether the CPLB entries for the same memory region on core 1 are WT or WB doesn't matter, because core 1 should flush the MCAPI payload buffer before sending.

For example:

BF609 mem addr       Owner   Core0 cache   Core1 cache
0~0x3FFFFF           Core0   WB            WT
0x400000~0x800000    Core1   WT            WB
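
As a sketch, the flush-before-send / invalidate-before-read discipline might look like this; blackfin_dcache_flush_range() and blackfin_dcache_invalidate_range() are the cache helpers from the Linux Blackfin port, and buf/len are illustrative names:

/* Core 1 (owner of the buffer), before sending the payload address: */
blackfin_dcache_flush_range((unsigned long)buf, (unsigned long)buf + len);

/* Core 0, after receiving the message and before reading the payload: */
blackfin_dcache_invalidate_range((unsigned long)buf, (unsigned long)buf + len);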

Atomic access

The specific part defines a type which may be read and written atomically. Two operations are defined on the type: Read and Write, and it may hold values of sm_uint16_t.

// defined at specifics
typedef 'some type' sm_atomic_t;
sm_uint16_t sm_read_atomic(volatile sm_atomic_t *);
void sm_write_atomic(volatile sm_atomic_t *, sm_uint16_t);

Atomic means that if one core writes a variable and another core reads it, the value read is either the value before the write or the value after it and not some third value because the write had only half completed when the read occurred.

Atomic operations on a single core must also be ordered with respect to each other so the following logic holds.

// initial values
sm_atomic_t a = 0, b = 0;
on processor 0:
  sm_write_atomic(&a, 1);
  while (sm_read_atomic(&b) == 0)
    ;
on processor 1:
  sm_write_atomic(&b, 1);
  x = sm_read_atomic(&a);
  assert(x == 1);
  // because read(b) must follow write(a) on processor 0

Specifics for Blackfin

On BF561 and BF60x both cores use the same bus to L2 and the EBIU. L2 memory is 64 bits wide and memory attached to the EBIU is at least 16 bits wide. So an uncached 16-bit write to L2 or L3 is atomic.

typedef uint16_t sm_atomic_t;
inline void sm_write_atomic(volatile sm_atomic_t *a, sm_uint16_t v) {
  *a = v;
}
inline sm_uint16_t sm_read_atomic(volatile sm_atomic_t *a) {
  return *a;
}

Interrupts

Each processor must be able to raise interrupts on the other processor. We use one interrupt on each core which indicates some action is required. The interrupt handler works out what the action is from the channel state. So all modifications to shared data should be visible to both processors by the time the interrupt handler is entered.

The mechanism for initialising interrupt handlers and clearing the interrupt source is necessarily processor and environment specific. The initialisation sequence described below requires the interrupt to be initially masked, which is usually the case.

Specifics for Blackfin

CPU     master core ICC interrupt        slave core ICC interrupt
BF561   core supplemental interrupt 0    core supplemental interrupt 0
BF60x   SEC soft interrupt 0             SEC soft interrupt 1

The protocol does not define the core interrupt vectors used to handle these interrupts or whether they are shared with other interrupt sources, as this is a decision local to the environment running on the core.

Modifications to shared data are made visible to the other core before raising the interrupt by

  1. ensuring any cached writes are flushed from cache
  2. executing an SSYNC instruction to flush the write buffer.

The interrupted core is responsible for ensuring the initial reads of shared data are not from cache.

Message passing

The protocol is for two way communication between two processors. If there are more processors in the system then the protocol could be used for separate two way channels between each pair of processors.

There are four message queues. Two in each direction, one for high priority messages and the other for standard priority.

Message queues are circular buffers containing SM_MSGQ_LEN fixed size messages.

  typedef struct {
    sm_atomic_t sent;
    sm_atomic_t received;
    sm_msg_t buf[SM_MSGQ_LEN];
  } sm_msgq_t;

The size and content of sm_msg_t are defined in the Message Format section below.

SM_MSGQ_LEN is a constant. For efficiency it should be a power of two.

The message queue uses a lockless protocol. The sender always writes a message at sent % SM_MSGQ_LEN and then increments sent, and the receiver always reads from received % SM_MSGQ_LEN and then increments received. The number of messages in the queue is (sm_uint16_t)(sent - received). The counters are unsigned so, due to the wonders of modulo arithmetic, this is true even if received > sent because sent has wrapped round.

Before sending a message the sender checks that there is space available in the buffer. If space is available the sender writes the message to buf[sent % SM_MSGQ_LEN], increments sent, then raises the 'Action Required' interrupt on the other processor. If no space is available in the buffer the calling process must block.

The handler for the 'Action Required' interrupt causes a receiver for both high and standard priority queues to run. Whether the receivers execute within the handler or are just scheduled to run once it returns is environment dependent.

The receiver checks the number of messages in the buffer. If there are any it reads the message at buf[received % SM_MSGQ_LEN] and tries to deliver it. If successful it increments received and raises the 'Action Required' interrupt on the sending core.

The interrupt handler also checks whether space has become available in any queues that processes are blocked on.

The definition of a process, the mechanism for blocking a process, and the method of dealing with the race condition between the sender blocking and the receiver raising the interrupt is processor and environment specific and outside the scope of the protocol. For example in a bare metal environment there is only one “process” and it can block by spinning on a variable that is set by the interrupt handler whereas other environments would use operating system primitives.
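
A minimal sketch of the send and receive paths described above, using the sm_* atomic helpers defined earlier; sm_raise_interrupt() is a hypothetical stand-in for the environment-specific 'Action Required' IPI, and cache maintenance is omitted:

static int sm_msgq_send(sm_msgq_t *q, const sm_msg_t *m)
{
        sm_uint16_t sent = sm_read_atomic(&q->sent);
        sm_uint16_t received = sm_read_atomic(&q->received);

        if ((sm_uint16_t)(sent - received) >= SM_MSGQ_LEN)
                return -1;                        /* full: caller must block */

        q->buf[sent % SM_MSGQ_LEN] = *m;          /* write the slot first... */
        sm_write_atomic(&q->sent, sent + 1);      /* ...then publish it */
        sm_raise_interrupt();                     /* 'Action Required' on peer */
        return 0;
}

static int sm_msgq_recv(sm_msgq_t *q, sm_msg_t *m)
{
        sm_uint16_t sent = sm_read_atomic(&q->sent);
        sm_uint16_t received = sm_read_atomic(&q->received);

        if ((sm_uint16_t)(sent - received) == 0)
                return -1;                        /* empty */

        *m = q->buf[received % SM_MSGQ_LEN];
        sm_write_atomic(&q->received, received + 1);
        sm_raise_interrupt();                     /* tell sender space is free */
        return 0;
}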

A message channel between a pair of processors is composed of 4 message queues.

  typedef struct {
    volatile sm_msgq_t msgq[2][2];
  } sm_channel_t;

The processor with the smaller cpu_id receives on msgq[priority][0] and sends on msgq[priority][1]; the processor with the larger cpu_id does the opposite, receiving on msgq[priority][1] and sending on msgq[priority][0]. The message queue id can be identified from the current cpu, the destination cpu, and the cpu which sends the inter-processor interrupt.

  if (cur_cpuid == ipi_src_cpuid || cur_cpuid == des_cpuid)
      BUG();

  recv_msgq_id = cur_cpuid < ipi_src_cpuid ? 0 : 1;
  send_msgq_id = cur_cpuid < des_cpuid ? 1 : 0;

Each processor receives high priority messages on the queue msgq[0][recv_msgq_id], and standard priority messages on msgq[1][recv_msgq_id]; the send queues are selected the same way using send_msgq_id.

If there are N processors in the architecture, there should be a channel array of (N - 1) * N / 2 channels. The channel array exists at a known location in shared memory. The channel id can be identified from the current cpu and the remote cpu.

Message Channel ID Table

Processor ID   0    1    2    3
0              NA   0    1    2
1              NA   NA   3    4
2              NA   NA   NA   5
3              NA   NA   NA   NA
  #define CPU_NUM 4
  #define CHANNEL_NUM ((CPU_NUM - 1) * CPU_NUM / 2)

  sm_channel_t channels[CHANNEL_NUM];
  /* The table is upper-triangular: index it with the smaller cpu id first. */
  int8_t channel_table[CPU_NUM][CPU_NUM] = {
   {-1, 0, 1, 2},
   {-1,-1, 3, 4},
   {-1,-1,-1, 5},
   {-1,-1,-1,-1},
  };

  if (cur_cpuid < remote_cpuid)
      channel_id = channel_table[cur_cpuid][remote_cpuid];
  else
      channel_id = channel_table[remote_cpuid][cur_cpuid];
  if (channel_id < 0 || channel_id >= CHANNEL_NUM)
      BUG();
  channel = &channels[channel_id];

Each message queue is statically initialised with received and sent containing the value 0. When a processor starts running, the 'Action Required' interrupt is masked; before attempting to send the first message, a handler is installed and the interrupt is unmasked. A message queue can be written to before the receiver has initialised its interrupts. If it fills up, the 'Action Required' signal is raised but not serviced until the receiver unmasks the interrupt.

Specifics for Blackfin

A single block of four message queues is held in a shared variable at a known address. Its start address should be at a fixed position known to code running on all processors.

  #define MSGQ_START_ADDR	0xFEB00000	// in BF561 and BF60x L2 SRAM

  typedef struct {
    volatile sm_msgq_t msgq[2][2];
  } sm_channel_t;

  static sm_channel_t *sm_ch = (sm_channel_t *)MSGQ_START_ADDR;
  • Core A receives messages on sm_ch->msgq[priority][0]
  • Core B receives messages on sm_ch->msgq[priority][1]

Message Format

  typedef sm_uint16_t sm_endpoint_t;

  typedef struct {
    sm_endpoint_t dst_ep, src_ep;
    sm_uint32_t type;
    sm_uint32_t length;
    sm_address_t payload;
  } sm_msg_t;

The fields dst_ep and src_ep denote endpoints. The meaning of an endpoint is application dependent. The receiver should inspect the dst_ep field to decide how to process the message. The src_ep field indicates the sender, which may be meaningful to the receiving endpoint.

type is an unsigned 32-bit integer value that indicates a message type defined in one of the higher level protocols and is mainly interpreted by the endpoint.

The top eight bits of the value indicate the protocol and the low 24 bits the subtype.

  // compose type enumeration value from protocol & subtype
  #define SM_MSG_TYPE(protocol, subtype) (((protocol)<<24)|(subtype))

  // extract subtype from type enumeration value
  #define SM_MSG_SUBTYPE(type) ((type)&0xffffff)

  // extract protocol from type enumeration value
  #define SM_MSG_PROTOCOL(type) (((type)>>24)&0xff)

An endpoint may recognise more than one protocol. The receiver must know the protocols recognised by each endpoint. When dst_ep has the value 0xffff the message is broadcast to every endpoint which recognises the protocol encoded in the type field.

The meaning of length and payload is dependent on the value of type and interpreted by the endpoint.
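
For example, the fields of a packet-ready message might be filled in as follows (SM_PACKET_READY comes from the packet transfer protocol later in this document; the endpoint numbers and the buf pointer are illustrative):

  sm_msg_t msg = {
    .dst_ep  = 5,                          /* receiver's endpoint        */
    .src_ep  = 9,                          /* sender's endpoint          */
    .type    = SM_MSG_TYPE(SP_PACKET, 0),  /* SM_PACKET_READY            */
    .length  = 64,                         /* payload buffer length      */
    .payload = buf,                        /* buffer in sender's memory  */
  };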

Specifics for BF561

On BF561, message reads and writes to a queue in L2 are more efficient if 'sm_msg_t' is aligned on a 64-bit boundary. An aligned access should take 14 rather than 21 cycles.

The type declaration for the VisualDSP compiler should use pragma align:

typedef struct {
#pragma align 8
  ...
} sm_msg_t;

All Protocol Types

enum {
        SP_GENERAL = 0,
        SP_CORE_CONTROL,
        SP_TASK_MANAGER,
        SP_RES_MANAGER,
        SP_PACKET,
        SP_SESSION_PACKET,
        SP_SCALAR,
        SP_SESSION_SCALAR,
        SP_MAX,
};

Standard Message Types

All protocols should recognise the standard message types.

A couple of common error conditions are covered by standard messages.

These are sent with the same priority as the message to which they are responding.

SM_BAD_ENDPOINT

All endpoints should recognise the message:

SM_BAD_ENDPOINT = SM_MSG_TYPE(0, 0)

This may be sent in response to a message sent by this endpoint to indicate the dst_ep field was invalid.

The SM_BAD_ENDPOINT message has its src_ep field set to the invalid endpoint id, its length to 0, and its payload to the type value of the original message.

The SM_BAD_ENDPOINT message may not be sent in all environments. If endpoints can be created dynamically it may be more appropriate to queue the message until the endpoint is created.

SM_BAD_MSG

All endpoints should recognise and be able to send the message:

SM_BAD_MSG = SM_MSG_TYPE(0, 1)

When an endpoint receives a message with a type field it does not expect, it should return an SM_BAD_MSG message with the payload set to the type value it did not recognise and the length field set to 0.

The message queue layer should also return SM_BAD_MSG if a message with an invalid protocol value is sent to an endpoint. Either 0 or the endpoint's known protocol is a valid protocol value.

SM_QUERY_MSG

All endpoints should recognise and be able to send the message:

SM_QUERY_MSG = SM_MSG_TYPE(0, 2)

SM_QUERY_MSG and SM_QUERY_ACK_MSG messages are used to query remote endpoint status. The query message should set the dst_ep and type fields.

When an endpoint receives an SM_QUERY_MSG, it should return an SM_QUERY_ACK_MSG. The message queue layer should return SM_QUERY_NOEP_MSG if the endpoint hasn't been created.

SM_QUERY_ACK_MSG

All endpoints should recognise and be able to send the message:

SM_QUERY_ACK_MSG = SM_MSG_TYPE(0, 3)

The SM_QUERY_ACK_MSG message should set the src_ep and type fields.

SM_QUERY_NOEP_MSG

All endpoints should recognise and be able to send the message:

SM_QUERY_NOEP_MSG = SM_MSG_TYPE(0, 4)

SM_NOTIFY_EP_CREATE_MSG

All endpoints should recognise and be able to send the message:

SM_NOTIFY_EP_CREATE_MSG = SM_MSG_TYPE(0, 5)

When a new endpoint has been created, it should send an SM_NOTIFY_EP_CREATE_MSG notification to the remote message queue layer. SM_NOTIFY_EP_CREATE_MSG should set the src_ep field.

Communication Protocols

The communication protocols defined in the DSP bridge framework are as follows.

Protocol             type value   Protocol Name
SP_CORE_CONTROL      1            Core Control Protocol
SP_TASK_MANAGER      2            Task Manager Protocol
SP_RES_MANAGER       3            Resource Manager Protocol
SP_PACKET            4            Connectionless Packet Transfer Protocol
SP_SESSION_PACKET    5            Connection based Packet Transfer Protocol
SP_SCALAR            6            Connectionless Scalar Transfer Protocol
SP_SESSION_SCALAR    7            Connection based Scalar Transfer Protocol

Core Control Protocol

The core control protocol is a simple set of messages for controlling a slave core.

message          value                             sent by   meaning
SM_CORE_START    SM_MSG_TYPE(SP_CORE_CONTROL, 0)   Master    Change slave state from stopped to started
SM_CORE_STARTED  SM_MSG_TYPE(SP_CORE_CONTROL, 1)   Slave     In response to SM_CORE_START once started
SM_CORE_STOP     SM_MSG_TYPE(SP_CORE_CONTROL, 2)   Master    Change slave state from started to stopped
SM_CORE_STOPPED  SM_MSG_TYPE(SP_CORE_CONTROL, 3)   Slave     In response to SM_CORE_STOP once stopped
SM_CORE_RESET    SM_MSG_TYPE(SP_CORE_CONTROL, 4)   Master    Put slave in stopped state if not already stopped and reset state including PC
SM_CORE_RESETED  SM_MSG_TYPE(SP_CORE_CONTROL, 5)   Slave     In response to SM_CORE_RESET once reset
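
Expressed as an enum, in the same style as the resource manager protocol later in this section:

enum {
  SM_CORE_START = SM_MSG_TYPE(SP_CORE_CONTROL, 0),
  SM_CORE_STARTED,
  SM_CORE_STOP,
  SM_CORE_STOPPED,
  SM_CORE_RESET,
  SM_CORE_RESETED,
};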

All messages are sent with high priority.

Task Manager Protocol

The task manager protocol is a simple set of messages to run and kill tasks on the slave cores.

message          value                             sent by   meaning
SM_TASK_RUN      SM_MSG_TYPE(SP_TASK_MANAGER, 0)   Master    Ask the slave core to execute a task, given the function addresses and parameters of init and exit. The addresses and parameters are stored in a payload buffer allocated by the master.
SM_TASK_RUNNING  SM_MSG_TYPE(SP_TASK_MANAGER, 1)   Slave     In response to SM_TASK_RUN. The task id (or 0) is stored in the payload. The master can free the payload buffer after receiving this response.
SM_TASK_KILL     SM_MSG_TYPE(SP_TASK_MANAGER, 2)   Master    Ask the slave core to stop running the task whose id is given in the payload.
SM_TASK_KILLED   SM_MSG_TYPE(SP_TASK_MANAGER, 3)   Slave     In response to SM_TASK_KILL once the slave returns to idle. The task id (or 0) is stored in the payload.

All messages are sent with high priority.

Resource Manager Protocol

How the application uses this resource manager protocol depends on how the predefined shared resource partition is defined for all cores. A predefined static partition of shared resources may be more suitable for systems that don't need dynamic resource allocation and freeing. Different implementations can make their own decision.

message                  value                            sent by   meaning
SM_RES_MGR_REQUEST       SM_MSG_TYPE(SP_RES_MANAGER, 0)   slave     Request shared resources
SM_RES_MGR_REQUEST_OK    SM_MSG_TYPE(SP_RES_MANAGER, 1)   master    Request succeeds for all resources in the slave's request list
SM_RES_MGR_REQUEST_FAIL  SM_MSG_TYPE(SP_RES_MANAGER, 2)   master    Request fails for at least one resource in the slave's request list
SM_RES_MGR_FREE          SM_MSG_TYPE(SP_RES_MANAGER, 3)   slave     Free reserved resources
SM_RES_MGR_FREE_DONE     SM_MSG_TYPE(SP_RES_MANAGER, 4)   master    Free done
SM_RES_MGR_EXPIRE        SM_MSG_TYPE(SP_RES_MANAGER, 5)   master    Ask the slave to stop using the resources
SM_RES_MGR_EXPIRE_DONE   SM_MSG_TYPE(SP_RES_MANAGER, 6)   slave     In response to SM_RES_MGR_EXPIRE
SM_RES_MGR_LIST          SM_MSG_TYPE(SP_RES_MANAGER, 7)   slave     Request a list of all shared resources of a type, no payload
SM_RES_MGR_LIST_OK       SM_MSG_TYPE(SP_RES_MANAGER, 8)   master    Reply with a list of all available shared resources of a resource type in the payload buffer
SM_RES_MGR_LIST_DONE     SM_MSG_TYPE(SP_RES_MANAGER, 9)   slave     Finished accessing the list buffer

The same payload address should be returned in all reply messages, while the list message has no payload. All messages are sent with normal priority.

enum {
  SM_RES_MGR_REQUEST =  SM_MSG_TYPE(SP_RES_MANAGER, 0),
  SM_RES_MGR_REQUEST_OK,
  SM_RES_MGR_REQUEST_FAIL,
  SM_RES_MGR_FREE,
  SM_RES_MGR_FREE_DONE,
  SM_RES_MGR_EXPIRE,
  SM_RES_MGR_EXPIRE_DONE,
  SM_RES_MGR_LIST,
  SM_RES_MGR_LIST_OK,
  SM_RES_MGR_LIST_DONE,
  SM_RES_MGR_MAX,
};

The resource manager service should bind to endpoint 0 on each processor. Slave applications and OSes should always request all types of shared resources from this endpoint on the master OS.

// resource manager service endpoint
#define EP_RESMGR_SERVICE	0

The ID of a shared resource is unique among all kinds of resources. The upper 4 bits indicate the type of the shared resource, while the remaining 12 bits are the index within the given type group. There can be at most 16 (2^4) types, of which only 4 are defined so far. For each type, there can be at most 4096 (2^12) individual resources. The SM_RES_MGR messages use payload to pass the resource ID, and use length to pass the address of a 32-bit resource description data structure if the resource type is RESMGR_TYPE_PERIPHERAL.

// resource types
enum {
  RESMGR_TYPE_PERIPHERAL = 0,
  RESMGR_TYPE_GPIO,
  RESMGR_TYPE_SYS_IRQ,
  RESMGR_TYPE_DMA,
  RESMGR_TYPE_MAX,
};

#define RES_TYPE_OFFSET  12
#define RES_TYPE_MASK    0xF
#define RES_SUBID_MASK   0xFFF

// compose resource id from resource type & sub id
#define RESMGR_ID(type, subid) (((type) << RES_TYPE_OFFSET) | ((subid) & RES_SUBID_MASK))

// extract resource subid from resource id
#define RESMGR_SUBID(id)       ((id) & RES_SUBID_MASK)

// extract resource type from resource id
#define RESMGR_TYPE(id)        (((id) >> RES_TYPE_OFFSET) & RES_TYPE_MASK)
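
As a worked example, using the definitions above (RESMGR_TYPE_GPIO == 1):

/* (1 << 12) | 40 == 0x1028 */
uint16_t id = RESMGR_ID(RESMGR_TYPE_GPIO, 40);
/* RESMGR_TYPE(id) == RESMGR_TYPE_GPIO and RESMGR_SUBID(id) == 40 */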

The resource description data address should be put in the length field of the message, in the following format.

typedef struct {
  uint8_t label[32];            // resource device owner name
  uint16_t count;               // number of resources in the array below
  uint32_t resources_array;     // address of the resource ID array
} resources_t;

Resource manager APIs declaration:

int sm_request_resource(uint32_t dst_cpu, uint32_t resource_id, resources_t *data);
int sm_free_resource(uint32_t dst_cpu, uint32_t resource_id, resources_t *data);

peripherals type

For the peripherals type, the peripheral name and list are passed via the resource description data.

Example to request/free the peripherals type

unsigned short bfin_peripheral_list[] = {P_SPI1_SCK, P_SPI1_MISO, P_SPI1_MOSI, 0};
resources_t bfin_peri_res = {
        .label = "bfin-spi1",
};

bfin_peri_res.count = 3;
bfin_peri_res.resources_array = (uint32_t)bfin_peripheral_list;

COREB_DEBUG(1, "request resource id %s\n", bfin_peri_res.label);
ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_PERIPHERAL, 0), &bfin_peri_res);
if (ret) {
        COREB_DEBUG(1, "request peri resource failed\n");
}

ret = sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_PERIPHERAL, 0), &bfin_peri_res);
if (ret) {
        COREB_DEBUG(1, "free peri resource failed\n");
}

GPIO, IRQ and DMA type

The generic map of the GPIOs, system IRQs and DMA channels to their ID should be defined for each arch. The resource sequence in the HRM can be one reference for the generic map.

Specifics for BF561

GPIO ID GPIO in bf561 HRM
0 PF0
1 PF1
...
47 PF47
System IRQ ID System IRQ in bf561 HRM
0 PLL_WAKEUP
1 DMA1_ERROR
2 DMA2_ERROR
3 IMDMA_ERROR
4 PPI0_ERROR
5 PPI1_ERROR
6 SPORT0_ERROR
7 SPORT1_ERROR
8 SPI0_ERROR
9 UART0_ERROR
10 RESERVED
11 DMA1_CH0
12 DMA1_CH1
13 DMA1_CH2
14 DMA1_CH3
15 DMA1_CH4
16 DMA1_CH5
17 DMA1_CH6
18 DMA1_CH7
19 DMA1_CH8
20 DMA1_CH9
21 DMA1_CH10
22 DMA1_CH11
23 DMA2_CH0
24 DMA2_CH1
25 DMA2_CH2
26 DMA2_CH3
27 DMA2_CH4
28 DMA2_CH5
29 DMA2_CH6
30 DMA2_CH7
31 DMA2_CH8
32 DMA2_CH9
33 DMA2_CH10
34 DMA2_CH11
35 TIMER0
36 TIMER1
37 TIMER2
38 TIMER3
39 TIMER4
40 TIMER5
41 TIMER6
42 TIMER7
43 TIMER8
44 TIMER9
45 TIMER10
46 TIMER11
47 PF0_PF15_A
48 PF0_PF15_B
49 PF16_PF31_A
50 PF16_PF31_B
51 PF32_PF47_A
52 PF32_PF47_B
53 DMA1_MDMA_STREAM0
54 DMA1_MDMA_STREAM1
55 DMA2_MDMA_STREAM0
56 DMA2_MDMA_STREAM1
57 IMDMA_STREAM0
58 IMDMA_STREAM1
59 WATCHDOG
60 RESERVED
61 RESERVED
62 RESERVED (SUPPLE_0 is reserved by the DSP bridge framework)
63 SUPPLE_1
DMA ID DMA in bf561 HRM
0 DMA1_PPI0
1 DMA1_PPI1
2 RESERVED
3 RESERVED
4 RESERVED
5 RESERVED
6 RESERVED
7 RESERVED
8 RESERVED
9 RESERVED
10 RESERVED
11 RESERVED
12 DMA1_MEM_STREAM0_DES
13 DMA1_MEM_STREAM0_SRC
14 DMA1_MEM_STREAM1_DES
15 DMA1_MEM_STREAM1_SRC
16 DMA2_SPORT0_RX
17 DMA2_SPORT0_TX
18 DMA2_SPORT1_RX
19 DMA2_SPORT1_TX
20 DMA2_SPI0
21 DMA2_UART0_RX
22 DMA2_UART0_TX
23 RESERVED
24 RESERVED
25 RESERVED
26 RESERVED
27 RESERVED
28 DMA2_MEM_STREAM0_DES
29 DMA2_MEM_STREAM0_SRC
30 DMA2_MEM_STREAM1_DES
31 DMA2_MEM_STREAM1_SRC
32 IMDMA_MEM_STREAM0_DES
33 IMDMA_MEM_STREAM0_SRC
34 IMDMA_MEM_STREAM1_DES
35 IMDMA_MEM_STREAM1_SRC

Specifics for BF609

GPIO ID GPIO in bf609 HRM
0 GPIO0
1 GPIO1
...
112 GPIO112
System IRQ ID System IRQ in bf609 HRM
0 IRQ_SEC_ERR
1 IRQ_CGU_EVT
2 IRQ_WATCH0
3 IRQ_WATCH1
4 IRQ_L2CTL0_ECC_ERR
5 IRQ_L2CTL0_ECC_WARN
6 IRQ_C0_DBL_FAULT
7 IRQ_C1_DBL_FAULT
8 IRQ_C0_HW_ERR
9 IRQ_C1_HW_ERR
10 IRQ_C0_NMI_L1_PARITY_ERR
11 IRQ_C1_NMI_L1_PARITY_ERR
12 IRQ_TIMER0
13 IRQ_TIMER1
14 IRQ_TIMER2
15 IRQ_TIMER3
16 IRQ_TIMER4
17 IRQ_TIMER5
18 IRQ_TIMER6
19 IRQ_TIMER7
20 IRQ_TIMER_STAT
21 IRQ_PINT0
22 IRQ_PINT1
23 IRQ_PINT2
24 IRQ_PINT3
25 IRQ_PINT4
26 IRQ_PINT5
27 IRQ_CNT
28 IRQ_PWM0_TRIP
29 IRQ_PWM0_SYNC
30 IRQ_PWM1_TRIP
31 IRQ_PWM1_SYNC
32 IRQ_TWI0
33 IRQ_TWI1
34 IRQ_SOFT0
35 IRQ_SOFT1
36 IRQ_SOFT2
37 IRQ_SOFT3
38 IRQ_ACM_EVT_MISS
39 IRQ_ACM_EVT_COMPLETE
40 IRQ_CAN0_RX
41 IRQ_CAN0_TX
42 IRQ_CAN0_STAT
43 IRQ_SPORT0_TX
44 IRQ_SPORT0_TX_STAT
45 IRQ_SPORT0_RX
46 IRQ_SPORT0_RX_STAT
47 IRQ_SPORT1_TX
48 IRQ_SPORT1_TX_STAT
49 IRQ_SPORT1_RX
50 IRQ_SPORT1_RX_STAT
51 IRQ_SPORT2_TX
52 IRQ_SPORT2_TX_STAT
53 IRQ_SPORT2_RX
54 IRQ_SPORT2_RX_STAT
55 IRQ_SPI0_TX
56 IRQ_SPI0_RX
57 IRQ_SPI0_STAT
58 IRQ_SPI1_TX
59 IRQ_SPI1_RX
60 IRQ_SPI1_STAT
61 IRQ_RSI
62 IRQ_RSI_INT0
63 IRQ_RSI_INT1
64 IRQ_SDU
65 DMA12 Data Reserved
66 Reserved
67 Reserved
68 IRQ_EMAC0_STAT
69 EMAC0 Power Reserved
70 IRQ_EMAC1_STAT
71 EMAC1 Power Reserved
72 IRQ_LP0
73 IRQ_LP0_STAT
74 IRQ_LP1
75 IRQ_LP1_STAT
76 IRQ_LP2
77 IRQ_LP2_STAT
78 IRQ_LP3
79 IRQ_LP3_STAT
80 IRQ_UART0_TX
81 IRQ_UART0_RX
82 IRQ_UART0_STAT
83 IRQ_UART1_TX
84 IRQ_UART1_RX
85 IRQ_UART1_STAT
86 IRQ_MDMA0_SRC_CRC0
87 IRQ_MDMA0_DEST_CRC0/ IRQ_MDMAS0
88 IRQ_CRC0_DCNTEXP
89 IRQ_CRC0_ERR
90 IRQ_MDMA1_SRC_CRC1
91 IRQ_MDMA1_DEST_CRC1/IRQ_MDMAS1
92 IRQ_CRC1_DCNTEXP
93 IRQ_CRC1_ERR
94 IRQ_MDMA2_SRC
95 IRQ_MDMA2_DEST/IRQ_MDMAS2
96 IRQ_MDMA3_SRC
97 IRQ_MDMA3_DEST/IRQ_MDMAS3
98 IRQ_EPPI0_CH0
99 IRQ_EPPI0_CH1
100 IRQ_EPPI0_STAT
101 IRQ_EPPI2_CH0
102 IRQ_EPPI2_CH1
103 IRQ_EPPI2_STAT
104 IRQ_EPPI1_CH0
105 IRQ_EPPI1_CH1
106 IRQ_EPPI1_STAT
107 IRQ_PIXC_CH0
108 IRQ_PIXC_CH1
109 IRQ_PIXC_CH2
110 IRQ_PIXC_STAT
111 IRQ_PVP_CPDOB
112 IRQ_PVP_CPDOC
113 IRQ_PVP_CPSTAT
114 IRQ_PVP_CPCI
115 IRQ_PVP_STAT0
116 IRQ_PVP_MPDO
117 IRQ_PVP_MPDI
118 IRQ_PVP_MPSTAT
119 IRQ_PVP_MPCI
120 IRQ_PVP_CPDOA
121 IRQ_PVP_STAT1
122 IRQ_USB_STAT
123 IRQ_USB_DMA
124 IRQ_TRU_INT0
125 IRQ_TRU_INT1
126 IRQ_TRU_INT2
127 IRQ_TRU_INT3
128 IRQ_DMAC0_ERROR
129 IRQ_CGU0_ERROR
130 Reserved
131 IRQ_DPM
132 Reserved
133 IRQ_SWU0
134 IRQ_SWU1
135 IRQ_SWU2
136 IRQ_SWU3
137 IRQ_SWU4
138 IRQ_SWU5
139 IRQ_SWU6
DMA ID DMA in bf609 HRM
0 CH_SPORT0_TX
1 CH_SPORT0_RX
2 CH_SPORT1_TX
3 CH_SPORT1_RX
4 CH_SPORT2_TX
5 CH_SPORT2_RX
6 CH_SPI0_TX
7 CH_SPI0_RX
8 CH_SPI1_TX
9 CH_SPI1_RX
10 CH_RSI
11 CH_SDU
13 CH_LP0
14 CH_LP1
15 CH_LP2
16 CH_LP3
17 CH_UART0_TX
18 CH_UART0_RX
19 CH_UART1_TX
20 CH_UART1_RX
21 CH_MEM_STREAM0_SRC_CRC0/CH_MEM_STREAM0_SRC
22 CH_MEM_STREAM0_DEST_CRC0/CH_MEM_STREAM0_DEST
23 CH_MEM_STREAM1_SRC_CRC1/CH_MEM_STREAM1_SRC
24 CH_MEM_STREAM1_DEST_CRC1/CH_MEM_STREAM1_DEST
25 CH_MEM_STREAM2_SRC
26 CH_MEM_STREAM2_DEST
27 CH_MEM_STREAM3_SRC
28 CH_MEM_STREAM3_DEST
29 CH_EPPI0_CH0
30 CH_EPPI0_CH1
31 CH_EPPI2_CH0
32 CH_EPPI2_CH1
33 CH_EPPI1_CH0
34 CH_EPPI1_CH1
35 CH_PIXC_CH0
36 CH_PIXC_CH1
37 CH_PIXC_CH2
38 CH_PVP_CPDOB
39 CH_PVP_CPDOC
40 CH_PVP_CPSTAT
41 CH_PVP_CPCI
42 CH_PVP_MPDO
43 CH_PVP_MPDI
44 CH_PVP_MPSTAT
45 CH_PVP_MPCI
46 CH_PVP_CPDOA

An example to request/free other resource types

ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_GPIO, 40), 0);
if (ret) 
    COREB_DEBUG(1, "request resource failed\n");
    
ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_SYS_IRQ, 52), 0);
if (ret) 
    COREB_DEBUG(1, "request resource failed\n");

ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_DMA, 20), 0);
if (ret) 
    COREB_DEBUG(1, "request resource failed\n");

sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_GPIO, 40), 0);
sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_SYS_IRQ, 52), 0);
sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_DMA, 20), 0);

Packet Transfer Protocol

The packet transfer protocol transfers data between processors via locally allocated buffers. It is layered on top of the message protocol described above. Each processor should be able to access the other processor's local memory pool via a proper CPLB configuration.

This protocol is connectionless. An endpoint registered on one processor may receive packets sent from any src_ep on the other processors.

To send a packet, the packet protocol (sketched below):

  1. Allocates a buffer of the packet size from the local memory management system.
  2. Prepares the packet data in this buffer and flushes its data cache.
  3. Sends an SM_PACKET_READY message with the buffer address and length to the given endpoint on the other processor.
  4. Queues this packet buffer into a sent packet list.
  5. After receiving the SM_PACKET_CONSUMED message, finds the buffer in the sent packet list according to the received address and frees it back to the local memory management system.
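
A minimal sketch of those steps, assuming hypothetical helper names (sm_alloc, sm_flush_dcache, sm_send_msg, sent_list_add) for the environment-specific pieces:

static int packet_send(sm_endpoint_t dst_ep, sm_endpoint_t src_ep,
                       const void *data, sm_uint32_t len)
{
        void *pkt = sm_alloc(len);                /* step 1: local buffer */
        if (!pkt)
                return -1;
        memcpy(pkt, data, len);                   /* step 2: fill it ...  */
        sm_flush_dcache(pkt, len);                /* ... and flush dcache */

        sm_msg_t m = {
                .dst_ep  = dst_ep,
                .src_ep  = src_ep,
                .type    = SM_MSG_TYPE(SP_PACKET, 0),   /* SM_PACKET_READY */
                .length  = len,
                .payload = pkt,
        };
        sm_send_msg(&m);                          /* step 3: notify peer  */
        sent_list_add(pkt);                       /* step 4: remember it  */

        /* step 5 runs later, in the SM_PACKET_CONSUMED handler:
         * sent_list_remove(pkt); sm_free(pkt); */
        return 0;
}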

To receive a packet in ICC for a bare metal application:

  1. Receive a message of type SM_PACKET_READY with the packet address and length in the message interrupt handler.
  2. Notify the message loop to dispatch it to the application which binds to the same endpoint as dst_ep in the message.
  3. The application can process the sender's buffer directly, or allocate a local buffer from the local memory pool and copy the data for future use.
  4. After returning from the application's dispatch callback, invalidate the data cache of the sender's buffer and send SM_PACKET_CONSUMED with the same payload address back to the sender.

To receive a packet in ICC for an OS:

  1. Receive a message of type SM_PACKET_READY with the packet address and length in the message interrupt handler.
  2. Allocate a local buffer from the OS.
  3. Copy the packet data from the sender's buffer to the local buffer.
  4. Append the local buffer to a received packet list indexed by the dst_ep in the message.
  5. Invalidate the data cache of the sender's buffer and send SM_PACKET_CONSUMED with the same payload address back to the sender.
  6. Dispatch the message to the application which binds to the same endpoint as dst_ep in the message.

Messages that deliver packets are sent with normal priority.

message type         value                       meaning
SM_PACKET_READY      SM_MSG_TYPE(SP_PACKET, 0)   The sender allocates memory for the packet. len = packet length; payload = address of the buffer allocated by the sender.
SM_PACKET_CONSUMED   SM_MSG_TYPE(SP_PACKET, 1)   The receiver has finished processing the arriving packet and the sender can free its memory. len = packet length; payload = address of the buffer allocated by the sender.
SM_PACKET_ERROR      SM_MSG_TYPE(SP_PACKET, 2)   Signal an error; the payload field is an error code, len = 0. Both sides free local buffers in the received packet list.
SM_PACKET_ERROR_ACK  SM_MSG_TYPE(SP_PACKET, 3)   In response to a received ERROR.

An endpoint is reserved for broadcast packets.

/*
 * The protocol layer should dispatch packets with dst_ep 0xFFFF to all receivers.
 * Receivers should not bind to this endpoint.
 */
#define EP_PACKET_BROADCAST    0xFFFF

An endpoint is reserved for the debug information service.

/*
 * The debug information service should bind to endpoint 0 on each processor.
 * Senders should not bind to this endpoint.
 */
#define EP_PACKET_DEBUG_INFO   0

Session Packet Transfer Protocol

The session packet transfer protocol establishes a connection between two endpoints on different processors to transfer data via locally allocated buffers. It is layered on top of the message protocol. Each processor should be able to access the other processor's local memory pool via a proper CPLB configuration.

In this protocol, a connection must be established before packets can be delivered. The server should bind to a listening endpoint in advance. After receiving a connection request message, the server creates a session with an endpoint pair consisting of the src_ep in the connection request and a new free local endpoint. Then the application can deliver packets over this session, while the server goes back to monitoring the listening endpoint. The session is closed only after a connection close request and its ACK have been received by either party.

Broadcast data is not supported in this protocol.

Messages for the session packet protocol are sent with normal priority.

message type                    value                               meaning
SM_SESSION_PACKET_CONNECT       SM_MSG_TYPE(SP_SESSION_PACKET, 0)   After allocating a new session and binding to a local endpoint, the client sends a connection request to the server.
SM_SESSION_PACKET_CONNECT_ACK   SM_MSG_TYPE(SP_SESSION_PACKET, 1)   The server allocates a new session and responds to the connection request. After the client receives the ACK, it considers the connection established and starts to transfer data over this session. No payload.
SM_SESSION_PACKET_CONNECT_DONE  SM_MSG_TYPE(SP_SESSION_PACKET, 2)   The client sends connection-established status back to the server after receiving the ACK and before real data transfer. No payload. After the server receives DONE, it considers the connection established and wakes up the application or thread to do data transfer on the new session.
SM_SESSION_PACKET_ACTIVE        SM_MSG_TYPE(SP_SESSION_PACKET, 3)   The client sends this message at a minute-level interval and waits for the ACK to keep the connection active after the connection succeeds. No payload.
SM_SESSION_PACKET_ACTIVE_ACK    SM_MSG_TYPE(SP_SESSION_PACKET, 4)   The server should answer the active tick message to keep the connection active. No payload.
SM_SESSION_PACKET_CLOSE         SM_MSG_TYPE(SP_SESSION_PACKET, 5)   Either party in the session can send a connection close request to the other. No payload. After receiving CLOSE, free the session.
SM_SESSION_PACKET_CLOSE_ACK     SM_MSG_TYPE(SP_SESSION_PACKET, 6)   Response to the connection close request. No payload. After receiving the ACK, free the session.
SM_SESSION_PACKET_READY         SM_MSG_TYPE(SP_SESSION_PACKET, 7)   The sender allocates memory for the packet. len = packet length; payload = address of the buffer allocated by the sender.
SM_SESSION_PACKET_CONSUMED      SM_MSG_TYPE(SP_SESSION_PACKET, 8)   The receiver has finished processing the arriving packet and the sender can free its memory. len = packet length; payload = address of the buffer allocated by the sender.
SM_SESSION_PACKET_ERROR         SM_MSG_TYPE(SP_SESSION_PACKET, 9)   Signal an error; the payload field is an error code, len = 0. Both sides free local buffers in the connection's received data list.
SM_SESSION_PACKET_ERROR_ACK     SM_MSG_TYPE(SP_SESSION_PACKET, 10)  In response to a received ERROR.

To enable the session packet protocol without a standard socket stack, you need at least a simple stack library (API) to:

  1. create and free a session which binds to a local endpoint/cpuid pair
  2. listen on a server session
  3. initiate a connection request to a remote endpoint/cpuid pair and bind the session to this pair
  4. accept a connection and allocate a new session which binds to the service endpoint and the remote endpoint/cpuid pair in the request
  5. read and write data via this session

This library may differ on cores with different DSP bridge implementations.

Scalar Transfer Protocol

Scalar transfer provides an efficient method to transmit scalars (8-bit, 16-bit, 32-bit and 64-bit variants) between endpoints. It is layered on top of the message protocol. The packet protocols pass a reference to locally allocated buffers through the ICC message (payload, length); to transmit scalars efficiently, the payload and length fields of the ICC sm_msg are instead used to carry two 32-bit scalar values directly.

message type value
SM_SCALAR_READY_8 SM_MSG_TYPE(SP_SCALAR, 0)
SM_SCALAR_READY_16 SM_MSG_TYPE(SP_SCALAR, 1)
SM_SCALAR_READY_32 SM_MSG_TYPE(SP_SCALAR, 2)
SM_SCALAR_READY_64 SM_MSG_TYPE(SP_SCALAR, 3)
SM_SCALAR_CONSUMED SM_MSG_TYPE(SP_SCALAR, 4)
SM_SCALAR_ERROR SM_MSG_TYPE(SP_SCALAR, 5)
SM_SCALAR_ERROR_ACK SM_MSG_TYPE(SP_SCALAR, 6)

Session Scalar Transfer Protocol

Like scalar transfer, session scalar transfer also transmits scalars (8-bit, 16-bit, 32-bit and 64-bit variants) between endpoints. It is layered on top of the message protocol. In this protocol, a connection must be established before scalar data can be delivered.

message type value
SM_SESSION_SCALAR_READY_8 SM_MSG_TYPE(SP_SESSION_SCALAR, 0)
SM_SESSION_SCALAR_READY_16 SM_MSG_TYPE(SP_SESSION_SCALAR, 1)
SM_SESSION_SCALAR_READY_32 SM_MSG_TYPE(SP_SESSION_SCALAR, 2)
SM_SESSION_SCALAR_READY_64 SM_MSG_TYPE(SP_SESSION_SCALAR, 3)
SM_SESSION_SCALAR_CONSUMED SM_MSG_TYPE(SP_SESSION_SCALAR, 4)
SM_SESSION_SCALAR_ERROR SM_MSG_TYPE(SP_SESSION_SCALAR, 5)
SM_SESSION_SCALAR_ERROR_ACK SM_MSG_TYPE(SP_SESSION_SCALAR, 6)
SM_SESSION_SCALAR_CONNECT SM_MSG_TYPE(SP_SESSION_SCALAR, 7)
SM_SESSION_SCALAR_CONNECT_ACK SM_MSG_TYPE(SP_SESSION_SCALAR, 8)
SM_SESSION_SCALAR_CONNECT_DONE SM_MSG_TYPE(SP_SESSION_SCALAR, 9)
SM_SESSION_SCALAR_ACTIVE SM_MSG_TYPE(SP_SESSION_SCALAR, 10)
SM_SESSION_SCALAR_ACTIVE_ACK SM_MSG_TYPE(SP_SESSION_SCALAR, 11)
SM_SESSION_SCALAR_CLOSE SM_MSG_TYPE(SP_SESSION_SCALAR, 12)
SM_SESSION_SCALAR_CLOSE_ACK SM_MSG_TYPE(SP_SESSION_SCALAR, 13)

Inter-core communication Framework design for Linux and bare metal application

This section describes a framework to be implemented on Linux that will use the above communication protocols.

Design Goal

The design goal is to be able to control Core B in as generic a way as possible (from userspace and kernel) to load/start/stop/reload any potential acceleration or RTOS task that a user may want to run.

To accomplish this, we lean on the OSI network model, which we review here, to provide a little context.

The OSI model was developed by the International Organization for Standardization (ISO) as a guideline for developing standards to enable the interconnection of dissimilar computing devices. It is important to understand that the OSI model is not itself a communication standard. In other words, it is not an agreed-on method that governs how data is sent and received; it is only a guideline for developing such standards.

It would be difficult to overstate the importance of the OSI model. Virtually all vendors and users of products which must communicate over the network understand how important it is that their products adhere to and fully support the networking standards this model has generated.

When a vendor's products adhere to the standards the OSI model has generated, connecting those products to other vendors' products is relatively simple. Conversely, the further a vendor departs from those standards, the more difficult it becomes to connect that vendor's products to those of other vendors.

In addition, if a vendor were to depart from the communication standards the model has engendered, software development efforts would be very difficult because the vendor would have to build every part of all necessary software, rather than being able to build on the existing work of other vendors.

In the “Core B” scenario, the implications are the same. By providing standard communication methods, and allowing people to build on these standard methods, interoperability improves while development costs fall.

Layer 7:Application Layer

  • Defines interface to user processes for communication and data transfer in network
  • Provides standardized services such as virtual terminal, file and job transfer and operations
  • In the Core B model - this is a Linux application on Core A talking to an RTOS application on Core B via Linux standard methods

Layer 6:Presentation Layer

  • Masks the differences of data formats between dissimilar systems
  • Specifies architecture-independent data transfer format
  • Encodes and decodes data; Encrypts and decrypts data; Compresses and decompresses data
  • In the Core B model - this is responsible for

Layer 5:Session Layer

  • Manages user sessions and dialogues
  • Controls establishment and termination of logic links between users
  • Reports upper layer errors
  • In the Core B model - this is responsible for

Layer 4:Transport Layer

  • Manages end-to-end message delivery in network
  • Provides reliable and sequential packet delivery through error recovery and flow control mechanisms
  • Provides connectionless oriented packet delivery
  • In the Core B model - this is responsible for

Layer 3:Network Layer

  • Determines how data are transferred between network devices
  • Routes packets according to unique network device addresses
  • Provides flow and congestion control to prevent network resource depletion
  • In the Core B model - this is responsible for

Layer 2:Data Link Layer

  • Defines procedures for operating the communication links
  • Frames packets
  • Detects and corrects packet transmission errors
  • In the Core B model - this is the mechanics of using layer 1 so that both cores can pass data back and forth without losing it.

Layer 1:Physical Layer

  • Defines physical means of sending data over network devices
  • Interfaces between network medium and devices
  • Defines optical, electrical and mechanical characteristics
  • In the Core B model - this is the physical addresses of common memory (and cache flushing if necessary), mailboxes, interrupts, and other things necessary to pass data to/from Core A and Core B.

Summary

The basic communication framework is intended to allow both message and stream based communication in either a synchronous or an asynchronous way. A simple API is defined and libraries are provided for:

  • Linux userspace applications
  • Linux kernel modules
  • and Core B code.

At this time, only the layer 1 to layer 3 protocols are defined -- anything higher than just passing raw data back and forth is implementation and user application dependent.

  • < Layer 5 ~ 7 > user data, user defined commands, statistics counters
  • < Layer 3 ~ 4 > Packet, connection packet, data stream, core control, resource manager in local runtime allocated memory
  • < Layer 2 > Message queue at memory of a fixed known address
  • < Layer 1 > Inter-processor interrupt and shared memory

User Interface

There are two kinds of interfaces available, for the Linux application and for the bare metal application.

Linux User Interface

From the view of a Linux user, the ICC is a device driver that controls the DSP devices and bridges the programs running on the DSPs and the Linux user applications. The program running on a DSP is a non-relocatable ELF binary. It can be loaded by the icc driver at the request of the Linux user application.

The kernel icc driver builds a packet list for each registered endpoint. Packets from the DSP side are copied and added to this list, waiting for the user application to fetch them.

If the DSP device is opened in non-blocking mode, poll it with the select system call, or register for the SIG_DSP_PACKET_ARRIVE signal, and perform the actual message receive operation in the application.
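
A hedged sketch of the polling path; fd is the opened /dev/icc descriptor and pkt a struct sm_packet as in the examples below, the rest is standard POSIX:

fd_set rfds;
FD_ZERO(&rfds);
FD_SET(fd, &rfds);
/* block until the icc driver signals a readable packet */
if (select(fd + 1, &rfds, NULL, NULL, NULL) > 0 && FD_ISSET(fd, &rfds))
        ioctl(fd, CMD_SM_RECV, &pkt);   /* fetch the queued packet */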

Control of DSP device

DSP bridge ioctl commands are executed through the combined efforts of the main CPU and the DSP device.

  • CMD_DSP_LOAD - Load ELF non-relocatable binary to the reserved memory of a specified DSP device.
  • CMD_DSP_START - Wake up DSP device and make it execute the user binary from start address.
  • CMD_DSP_STOP - Stop DSP device to execute the user binary and make it sleep in idle loop.
  • CMD_DSP_RESET - Reinitialize the DSP device resources and make it sleep in idle loop.

char *pathname;   /* path to the DSP ELF binary to load */

ioctl(fd, CMD_DSP_LOAD, pathname);
ioctl(fd, CMD_DSP_START, NULL);
ioctl(fd, CMD_DSP_STOP, NULL);
ioctl(fd, CMD_DSP_RESET, NULL);

Connectionless Packet Communication

The network layer interface transfers buffers among the Linux user application, the kernel driver, and the program running on the DSP core.

/*
 * remote_ep - destination endpoint in a send operation; the local endpoint which the receiver binds to
 * local_ep  - the sender's endpoint in a send operation; should be 0 in a receive operation
 * buf_len   - indicates the actual data size to send or the size that has been received
 * type      - packet protocol type, connectionless or connection packet: SP_PACKET or SP_SESSION_PACKET
 */
struct sm_packet {
        sm_uint32_t session_idx;
        sm_uint32_t local_ep;
        sm_uint32_t remote_ep;
        sm_uint32_t type;
        sm_uint32_t dst_cpu;
        sm_uint32_t src_cpu;
        sm_uint32_t buf_len;
        void *buf;
};

ioctl commands:

  • CMD_SM_CREATE - create a local endpoint per the packet definition.
  • CMD_SM_SEND - send a packet to the destination cpu.
  • CMD_SM_RECV - receive a packet from a local endpoint.
  • CMD_SM_CONNECT - connect to a remote endpoint, paired with the local endpoint.
  • CMD_SM_SHUTDOWN - shut down a local session, disconnect and free all its resources.

For example:

        struct sm_packet pkt;
        char buf[64] = "1234567890abcdef";
        memset(&pkt, 0, sizeof(struct sm_packet));

        pkt.local_ep = 9;
        pkt.remote_ep = 5;
        pkt.type = SP_PACKET;
        pkt.dst_cpu = 1;
        pkt.buf_len = 16;
        pkt.buf = buf;
        ioctl(fd, CMD_SM_CREATE, &pkt);
        ioctl(fd, CMD_SM_SEND, &pkt);
        ioctl(fd, CMD_SM_SHUTDOWN, &pkt);

Connection based Packet Communication

        struct sm_packet pkt;
        char buf[64] = "1234567890abcdef";
        memset(&pkt, 0, sizeof(struct sm_packet));
        pkt.local_ep = 9;
        pkt.remote_ep = 6;

        pkt.type = SP_SESSION_PACKET;
        pkt.dst_cpu = 1;
        pkt.buf_len = 16;
        pkt.buf = buf;

        printf("sp packet %d\n", pkt.type);

        printf("begin create ep\n");
        ioctl(fd, CMD_SM_CREATE, &pkt);
        printf("finish create ep session index = %d\n", pkt.session_idx);

        ioctl(fd, CMD_SM_CONNECT, &pkt);

        ioctl(fd, CMD_SM_SEND, &pkt);
        ioctl(fd, CMD_SM_RECV, &pkt);
        /* get buffer from pkt.buf */ 
        ioctl(fd, CMD_SM_SHUTDOWN, &pkt);

DSP User Interface

DSP initialization

The icc device node is /dev/icc. When the Core B DSP binary is loaded by the icc driver, it starts each DSP and initializes the DSP's CPLB and event controller properly. An IPI interrupt is configured specifically for DSP bridge message and control notification. After initialization is done, the DSP devices sleep in an idle loop at IRQ level 15. This DSP initialization and idle loop code and data live in memory shared by all DSPs and the main CPU.

Application initialization

Each DSP application should implement two entry points (icc_task_init, icc_task_exit). icc_task_init is for the DSP application to register its endpoint and protocol based packet dispatch functions. The DSP runs this entry point in EVT7 mode when it is asked to start by a task run message. The DSP application should call icc_wait() to wait for any incoming messages, or register session handler callbacks via the registration API sm_register_session_handler(). After icc_task_init returns, the core drops to EVT15 and waits for new messages to handle. icc_task_exit is for the DSP application to finish running and exit with cleanup.

sample1
sm_uint32_t __icc_task_data session_index;
void  icc_task_init(int argc, char *argv[])
{
        struct sm_session *session;
        void *buf;
        int len;
        int ret;
        int src_ep, src_cpu;
        session_index = sm_create_session(LOCAL_SESSION, SP_PACKET);
        coreb_msg("%s() %s %s index %d\n", __func__, argv[0], argv[1], session_index);
        if (session_index >= 32)
                coreb_msg("create session failed\n");

        while (1) {
                coreb_msg("task loop\n");
                if (icc_wait()) {
                        ret = sm_recv_packet(session_index, &src_ep, &src_cpu, &buf, &len);
                        if (ret <= 0) {
                                coreb_msg("recv packet failed\n");
                                continue;
                        }
                        /* handle payload */
                        coreb_msg("processing msg %s\n", buf);
                        if (*(char *)buf  == '1') {
                                int len = 64;
                                int dst_ep = src_ep;
                                int dst_cpu = src_cpu;
                                void *send_buf = sm_send_request(len, session_index);
                                coreb_msg("coreb send buf %x\n", send_buf);
                                if (!send_buf) {
                                        coreb_msg("NO MEM\n");
                                        continue;
                                }
                                memset(send_buf, 0, len);
                                strcpy(send_buf, "finish");
                                sm_send_packet(session_index, dst_ep, dst_cpu, send_buf, len);
                        } else {
                                coreb_msg("msg payload %s \n", buf);
                        }

                        sm_recv_release(buf, len, session_index);
                }

        }

        coreb_msg("%s() end\n", __func__);
}

void  icc_task_exit(void)
{
        sm_destroy_session(session_index);
}

sample2
sm_uint32_t __icc_task_data index;

void icc_task_init(int argc, char *argv[])
{
        struct sm_session *session;
        index = sm_create_session(LOCAL_SESSION, SP_PACKET);
        coreb_msg("%s() %s %s index %d\n", __func__, argv[0], argv[1], index);
        if (index >= 32)
                coreb_msg("create session failed\n");

        session = &coreb_info.icc_info.sessions_table[index];
        sm_register_session_handler(index, default_session_handle);
        coreb_msg("%s() end\n", __func__);
}

void icc_task_exit(void)
{
        sm_destroy_session(index);
}

int default_session_handle(struct sm_message *msg, struct sm_session *session)
{
        void *buf;
        sm_uint32_t len;
        int ret;
        coreb_msg(" %s session %d msg %s \n",__func__, session->local_ep, msg->payload);
        coreb_msg("dst %d dstep %d, src %d, srcep %d\n", msg->dst, msg->dst_ep, msg->src, msg->src_ep);

        ret = sm_recv_packet(index, &buf, len);
        if (ret <= 0) {
                coreb_msg("recv packet failed\n");
                return ret;
        }
        /* handle payload */
        coreb_msg("processing msg %s\n", buf);
        if (*(char *)buf == '1') {
                int len = 64;
                int dst_ep = msg->src_ep;
                int dst_cpu = msg->src;
                void *send_buf = sm_send_request(len, session);
                coreb_msg("coreb send buf %x\n", send_buf);
                if (!send_buf)
                        coreb_msg("NO MEM\n");
                memset(send_buf, 0, len);
                *(char *)send_buf = 'f';
                sm_send_packet(index, dst_ep, dst_cpu, send_buf, len);
        } else {
                coreb_msg("msg payload %s \n", buf);
        }

        sm_recv_release(buf, len, session);

        return 0;
}

packet transfer

  • int sm_send_packet(sm_uint32_t session_idx, sm_uint32_t dst_ep, sm_uint32_t dst_cpu, void *buf, sm_uint32_t len) - send a packet from the DSP application to the Linux side
  • int sm_recv_packet(sm_uint32_t session_idx, void **buf, sm_uint32_t len) - receive a packet from the ICC message queue into the DSP application

manage message buffer

  • void *sm_send_request(sm_uint32_t size, struct sm_session *session) - prepare a message buffer before sending a packet; the message buffer will be automatically freed after the message is handled
  • void sm_recv_release(void *addr, sm_uint32_t size, struct sm_session *session) - after the DSP application finishes handling the packet, call sm_recv_release to free the message buffer

Connectionless Packet Communication

The DSP application calls register_packet_dispatch_callback to register its packet dispatch function and the sender's cleanup function in the main entry point. The registered packet receive callback functions are invoked in EVT15 mode (IPEND = 0x8000) as well.

/*
 * endpoint - bind to a local endpoint to receive incoming packet.
 * src_cpuid - processor who sends the incoming packet.
 * src_enp - source endpoint of the incoming packet.
 * len - the length of the buffer.
 * packet - the buffer pointer.
 */
int sm_register_session_handler(sm_uint32_t session_idx,
                        void (*handle)(struct sm_message *message, struct sm_session *session))

Connection based Packet Communication

  • sm_connect_session(session_idx, remote_ep, dst_cpu); - connect the local endpoint to a remote endpoint on another processor
  • sm_close_session(session_idx, remote_ep, dst_cpu); - close a connection between two endpoints

After the session is connected, sending and receiving data works the same as packet transfer, via sm_send_packet() and sm_recv_packet().
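
A hedged sketch combining the calls above; the argument conventions follow sample2 and the declarations earlier, and 'session' is assumed to be the struct sm_session corresponding to 'idx':

/* connect, send one packet, then close the session */
sm_uint32_t idx = sm_create_session(LOCAL_SESSION, SP_SESSION_PACKET);
sm_connect_session(idx, remote_ep, dst_cpu);

void *buf = sm_send_request(64, session);   /* auto-freed after send */
strcpy(buf, "hello");
sm_send_packet(idx, remote_ep, dst_cpu, buf, 6);

sm_close_session(idx, remote_ep, dst_cpu);
sm_destroy_session(idx);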

A Simple Example

This example is based on network layer communication APIs defined for both Linux application and the bare metal DSP application.

Linux APP

Core 1 Bare Metal Application

Link the DSP application

The bare metal application should be linked with the DSP bridge library in order to interact with the Linux application properly. The offset address of the entry main() can be discovered by the DSP bridge kernel module when loading.

The compile command, when compiling on a Linux host, looks like:

$bfin-elf-gcc -T coreb.lds -mcpu=bf561 -D__DSP__ coreb.c dsp_bridge.a -o coreb.bin

The linker script coreb.lds looks like:

MEMORY
{
  MEM_L1_CODE :       ORIGIN = 0xFF600000, LENGTH = 0x4000
  MEM_L1_CODE_CACHE : ORIGIN = 0xFF610000, LENGTH = 0x4000
  MEM_L1_SCRATCH :    ORIGIN = 0xFF700000, LENGTH = 0x1000
  MEM_L1_DATA_B :     ORIGIN = 0xFF500000, LENGTH = 0x8000
  MEM_L1_DATA_A :     ORIGIN = 0xFF400000, LENGTH = 0x8000
  MEM_L2 :            ORIGIN = 0xFEB00000, LENGTH = 0x20000
}

OUTPUT_FORMAT("elf32-bfin", "elf32-bfin",
	      "elf32-bfin")
OUTPUT_ARCH(bfin)
ENTRY(_main)

SECTIONS
{
  .text_l1        :
  {
   /*
    * Here is the reserved jump instruction to jump to the Linux
    * dsp device driver core B init code.
    */
    . = . + 0x10;

    *(.l1.text)
  } >MEM_L1_CODE =0
  .text           :
  {
   /*
    * Here is the static shared message queues between core A and B.
    */
    . = . + 0x40;
    *(.text.*)
  } >MEM_L2 =0
  .l2             :
  {
    *(.l2 .l2.*)
  } >MEM_L2 =0

  .data_l1        :
  {
    *(.l1.data)
  } >MEM_L1_DATA_A =0
  .data           :
  {
    *(.data .data.*)
  } >MEM_L2
  .bss            :
  {
    __bss_start = .;
    *(.bss .bss.*)
    __bss_end = .;
  } >MEM_L2

  __stack_end = ORIGIN(MEM_L1_SCRATCH) + LENGTH(MEM_L1_SCRATCH);
}

Implementation approach

The following aspects are described:

  1. Initialize DSP CPLB, Event controller.
  2. Loading and controlling the DSP bare metal application via core control protocol.
  3. Dispatch packet via packet transfer protocol.
  4. ICC library for bare metal application.
  5. ICC Linux driver for Linux application.

Load and control bare metal application

The DSP bridge relies on endpoint 0 to control the DSP application status via the core control protocol. The message dispatch loop on the DSP core reacts to the core control commands.

enum {
  EP_CORE_CONTROL = 0,
};

The bare metal application is loaded into the core B memory space by a user-space loader. The main and dispatch entries in the application and its dsp_bridge library are figured out by the loader, which then informs the dsp_driver of these entry addresses.

Dispatch packet

The ICC session layer manages the user-space packet send/receive sessions.

The sm_session data structure:

struct sm_session {
        struct list_head rx_messages;   /* queue of received sm messages */
        struct list_head tx_messages;   /* queue of sm messages being sent */
        uint32_t        n_avail;        /* messages available on the rx queue */
        uint32_t        n_uncompleted;  /* sent messages not yet completed */
        uint32_t        local_ep;       /* local endpoint */
        uint32_t        remote_ep;      /* remote endpoint */
        uint32_t        type;           /* session/protocol type */
        pid_t           pid;            /* owning user-space process */
        uint32_t        flags;
        int (*handle)(struct sm_message *msg, struct sm_session *session);
        struct sm_proto *proto_ops;     /* protocol operations */
        uint32_t        queue_priority;
        wait_queue_head_t rx_wait;      /* receivers sleep here */
} __attribute__((__aligned__(4)));

If the ICC queue is full, a packet send blocks on the ICC queue's tx_wait wait queue until the tx queue has room again.

A packet receive blocks on the session's rx_wait queue if there is no message available; the IPI wakes up the ICC queue thread to handle the incoming message, which receives the message into a packet and wakes up the receiving process sleeping on the rx_wait queue.

The message_queue_thread kernel thread handles incoming messages; the remote IPI wakes this thread up.
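
This is the standard Linux wait-queue pattern. An illustrative sketch (the helper names are made up; the rx_wait and rx_messages fields are those of struct sm_session above):

#include <linux/list.h>
#include <linux/wait.h>

/* Receive side: sleep until a message is linked into rx_messages. */
static int sm_wait_for_message(struct sm_session *session)
{
        return wait_event_interruptible(session->rx_wait,
                        !list_empty(&session->rx_messages));
}

/* Called by message_queue_thread after linking an incoming message. */
static void sm_message_arrived(struct sm_session *session)
{
        wake_up(&session->rx_wait);
}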

On core running Linux

  • To send a packet from an application, the packet buffer is first allocated in user space and its pointer is passed to a kernel system call. The kernel allocates a buffer in kernel space and copies the user data in. The DSP bridge driver then appends a "packet ready" message carrying the packet address and length to the shared message queue in L2 memory and links the packet buffer to the sent list (see the queue sketch after this list).
  • When a "packet ready" message arrives, the DSP bridge driver allocates a kernel buffer and copies the packet in. It links this buffer to the receiving list of the registered endpoint and sends a "packet consumed" message with the received packet address back.
  • When a "packet consumed" message arrives, the DSP bridge driver frees the packet at the received address from the send list.
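
The shared message queue lives in L2 memory (the linker script above reserves the first 0x40 bytes of MEM_L2 for it). The following sketch of posting a "packet ready" message is illustrative only; the message layout, queue shape and helper names are assumptions, not the real ICC headers:

#include <stdint.h>

#define SM_MSGQ_LEN     16  /* hypothetical queue depth */
#define SM_PACKET_READY 1   /* hypothetical message type */

struct sm_msg {             /* hypothetical message layout */
        uint16_t type;
        uint16_t src_ep;
        uint16_t dst_ep;
        uint16_t pad;
        uint32_t length;
        uint32_t payload;   /* address of the packet buffer */
};

struct sm_msg_queue {       /* placed in shared L2 memory */
        uint32_t sent;
        uint32_t received;
        struct sm_msg messages[SM_MSGQ_LEN];
};

extern void notify_peer_core(void);   /* hypothetical: raises the IPI */

static void sm_post_packet_ready(struct sm_msg_queue *q,
                uint16_t src_ep, uint16_t dst_ep, void *buf, uint32_t len)
{
        struct sm_msg *m = &q->messages[q->sent % SM_MSGQ_LEN];

        m->type    = SM_PACKET_READY;
        m->src_ep  = src_ep;
        m->dst_ep  = dst_ep;
        m->length  = len;
        m->payload = (uint32_t)(uintptr_t)buf;
        q->sent++;

        notify_peer_core();
}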

On core running bare metal application

  • Sending a packet from a bare metal application is similar to sending under Linux, except that there is no packet copy between the application and the DSP bridge library, which is linked into the bare metal application. The library appends a "packet ready" message with the packet address to the message queue.
  • When the message IPI is triggered, the DSP core wakes up in the message dispatch loop of the DSP bridge driver. If the message does not belong to the core control protocol, it jumps to the entry of the upper-layer dispatch loop in the DSP bridge library. This dispatch loop handles all message protocols except the core control protocol; its symbol name is known to the DSP bridge driver and its address in the bare metal application is figured out at load time.
  • When a "packet ready" message arrives, the dispatch loop in the DSP bridge library looks up the callback registered for the message's destination endpoint and invokes it with the source endpoint, packet length and packet address. After the application finishes processing the packet, it sends a "packet consumed" message with the received packet address back. The application must not access the packet after it exits the callback (see the dispatch sketch after this list).
  • When a "packet consumed" message arrives, the dispatch loop in the DSP bridge library calls the application callback to free the packet at the received address.
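
An illustrative sketch of that library-side dispatch, reusing the hypothetical struct sm_msg from the queue sketch above; SM_PACKET_CONSUMED and the helpers are likewise made-up names:

#include <stdint.h>

#define SM_PACKET_CONSUMED 2   /* hypothetical, pairs with SM_PACKET_READY */

extern void deliver_to_endpoint(uint16_t dst_ep, uint16_t src_ep,
                                void *packet, uint32_t len);
extern void free_sent_packet(void *packet);

static void sm_dispatch_packet(struct sm_msg *m)
{
        switch (m->type) {
        case SM_PACKET_READY:
                /* Hand the packet to the callback registered for the
                 * destination endpoint; releasing the buffer there posts
                 * the "packet consumed" reply. */
                deliver_to_endpoint(m->dst_ep, m->src_ep,
                                    (void *)(uintptr_t)m->payload, m->length);
                break;
        case SM_PACKET_CONSUMED:
                /* Our peer is done with a packet we sent: unlink it from
                 * the sent list and free it. */
                free_sent_packet((void *)(uintptr_t)m->payload);
                break;
        }
}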

ICC bare metal library

The local system interrupt Supp0 and core interrupt EVT15 are always reserved for the DSP bridge library.

The following APIs are to be implemented in the DSP bridge library (possible prototypes are sketched after the list):

  1. memory allocation against local L1, L2 and DRAM
  2. registration of message callback handlers for the packet transfer and resource manager protocols
  3. socket-like interface (optional)
  4. registration of local exception, hardware error and core timer callback handlers (optional)
  5. mapping of a system interrupt to a given local core interrupt (optional)
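
Possible shapes for these entry points, numbered as in the list; these prototypes are purely hypothetical, and the shipped DSP bridge library headers are authoritative:

/* 1. memory allocation against local L1, L2 or DRAM */
void *icc_alloc(int pool, sm_uint32_t size);
void  icc_free(void *addr);

/* 2. message callback handlers for the packet transfer and
 *    resource manager protocols */
int icc_register_handler(int protocol,
                void (*cb)(struct sm_message *msg, struct sm_session *s));

/* 3. socket-like interface */
int icc_socket(int type);

/* 4. local exception, hardware error and core timer callbacks */
int icc_register_exception_cb(void (*cb)(int excause));
int icc_register_coretimer_cb(void (*cb)(void));

/* 5. map a system interrupt onto a given local core interrupt */
int icc_map_sys_irq(int sys_irq, int core_evt);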

Run and test ICC

The ICC framework is now enabled for both BF561 and BF609.

Steps to run and test the initial ICC implementation for Linux and bare metal can be found at test_icc.

Load ICC applications to slave core

  • Load bare metal apps to core B

You should follow the example under the icc_utils/ folder. The ICC stub (the main event loop) for core B must be loaded by the ICC utility from the Linux filesystem before any further core B ICC applications are loaded. This can be done in /etc/rc or any time later. You can also build the ICC stub and the core B applications into one ELF binary and load it in one step after the kernel boots.

  • Load RTOS to core B

You can either boot the RTOS the same way as the ICC stub, or boot it directly from the proper address in NOR flash with the help of u-boot or the kernel.

Debug ICC applications

GDB with gdbserver over Ethernet/UART is the only way to debug a Linux application on core A. For an RTOS on core B, it depends on the debugging tools available in that RTOS. For a bare metal application on core B, only a JTAG tool is applicable, such as GDB with gdbproxy over JTAG. To debug two ICC applications on the two cores, you have to run two debugging instances concurrently and step each application individually.

Process to debug ICC applications:

  1. Build the Linux distribution with the ICC driver and utility.
  2. Attach GDB and gdbproxy to core B.
  3. Boot Linux on core A.
  4. Build the ICC stub/apps and copy them to the Linux filesystem via Ethernet.
  5. Load the ICC stub and application to core B in Linux.
  6. Set breakpoints on core B and run.
  7. Load the Linux application under gdbserver and GDB.
  8. Set breakpoints in the Linux app and run.
  9. Debug.
  10. Revise the ICC app source code and go to step 3.

Wrap ICC framework by MCAPI

To better address the issue of proprietary Inter-Processor Communication (IPC), the Multicore Association (MCA) created an API-based standard called the Multicore Communications API (MCAPI). MCAPI is used in AMP configurations that require communication and synchronization between multiple operating system instances. MCAPI defines three fundamental communication types:

  1. messages - connectionless datagrams
  2. packet channels - connection-oriented, unidirectional, FIFO packet streams
  3. scalar channels - connection-oriented, unidirectional, FIFO streams of single scalar words

MCAPI overview

  • MCAPI Domains

An MCAPI domain comprises one or more MCAPI nodes in a multicore topology and is used for routing purposes. A potential use for domains is separation between different transports.

  • MCAPI Nodes

An MCAPI node is a logical abstraction that can be mapped to many entities, including but not limited to: a process, a thread, an instance of an operating system, a hardware accelerator, or a processor core.

  • MCAPI Endpoints

MCAPI endpoints are socket-like communication termination points.

  • MCAPI Channels

Channels provide point-to-point FIFO connections between a pair of endpoints. MCAPI channels are unidirectional.

MCAPI implementation concerns: link management

MCAPI implementation on top of ICC

The MCAPI specification defines both an API and its communication semantics. It does not define the link management, device model or wire protocol underneath it.

In our use case, we implement the MCAPI 2.0 APIs on top of the ICC protocol. The domain field in our implementation is used to separate different transport types (0 for the ICC protocol in our case). In the long term there will be transport types other than ICC, following new multicore architectures. The node ID is used to identify processor cores (e.g. 0 for core A and 1 for core B on BF561). The port ID maps to the ICC session used to communicate with the other end on the other core.
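
To make the mapping concrete, here is a minimal sketch of a message send from core A using the standard MCAPI 2.0 calls; the domain/node/port numbers follow the mapping above, and error handling is omitted:

#include <mcapi.h>

#define ICC_DOMAIN  0   /* domain 0 = ICC transport */
#define NODE_CORE_A 0
#define NODE_CORE_B 1
#define PORT_A      1   /* port IDs map to ICC sessions */
#define PORT_B      2

int main(void)
{
        mcapi_info_t info;
        mcapi_status_t status;
        mcapi_endpoint_t local, remote;
        char msg[] = "hello from core A";

        /* Join the topology as <domain 0, node 0 (core A)>. */
        mcapi_initialize(ICC_DOMAIN, NODE_CORE_A, NULL, NULL, &info, &status);

        /* Local endpoint <0, 0, PORT_A>; the remote endpoint is looked
         * up by its <domain, node, port> tuple. */
        local  = mcapi_endpoint_create(PORT_A, &status);
        remote = mcapi_endpoint_get(ICC_DOMAIN, NODE_CORE_B, PORT_B,
                                    MCAPI_TIMEOUT_INFINITE, &status);

        /* Connectionless MCAPI message, carried by an ICC session. */
        mcapi_msg_send(local, remote, msg, sizeof(msg), 1, &status);

        mcapi_finalize(&status);
        return 0;
}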

adi-mcapi-framework.jpg

  • MCAPI and transport layer

The MCAPI application interfaces initialize and finalize MCAPI, create endpoints, and manage MCAPI data communication between two endpoints. We can implement MCAPI on top of ICC by modifying the transport layer. A core A node and a core B node are statically created on core A and core B; each is a logical abstraction instance of a core node (or OS node). MCAPI ports are then implemented on top of ICC sessions, with each endpoint mapping to an ICC session. MCAPI endpoints, identified by a <domain_id, node_id, port_id> tuple, map to <node, session> in the ICC layer, so data delivery between a pair of MCAPI endpoints can be implemented on top of ICC.

  • resource management

The OS-specific resource management layer; it manages shared memory and semaphore synchronization.

  • ICC dev interface

The device node implementation on top of ICC.

  • ICC

The inter-core communication layer, based on shared memory and inter-core interrupts.

  • physical layer

The physical data delivery layer; it can be L2 shared memory, link ports, etc.

Run and test MCAPI 2.0

Steps to run and test the MCAPI 2.0 implementation for Linux and bare metal on BF561 or BF609 can be found at test_mcapi.

Other implementations