
Application Binary Interface

The Application Binary Interface (ABI) describes the low-level interface between an application program and the operating system, between an application and its libraries, or between component parts of the application. Normally ABIs cover details such as the calling convention (which controls how functions' arguments are passed and return values retrieved), the system call numbers and how an application should make system calls to the operating system, and in the case of a complete operating system ABI, the binary format of object files, program libraries and so on.

This page describes the calling conventions, or run-time model, in detail. The model applies to compiler-generated code and covers the layout of the stack, data access, and the call/entry sequence. The C/C++ run-time environment includes the conventions that C/C++ routines must follow to run on Blackfin processors with the GNU GCC compiler. Assembly routines linked to C/C++ routines must follow these conventions.

The GNU GCC compiler is named specifically: while every effort is made to stay compatible with other compilers and run-time models, this is sometimes not possible or practical, so this specification should be treated as specific to the GNU toolchain.

Several run-time models are acceptable on the Blackfin under the uClinux distribution. In order of increasing complexity:

  • Bare Metal ELF - For Applications which do not require the Linux operating system, like U-Boot
  • Linux ABI - For the Linux Operating System independent of executable format
  • FLAT under Linux - For Applications which run under the Linux Operating System
  • FDPIC/ELF under Linux - For Applications which have Position Independent Code and Dynamic Shared Objects, under the Linux Operating System

Data Storage Formats

Endianness

Both internal and external memory are accessed in little endian byte order. The following shows a data word stored in register R0 or in memory at addr. B0 refers to the least significant byte of the 32-bit word.

Data in Register
B3 B2 B1 B0
Data in Memory
B3 B2 B1 B0
addr+3 addr+2 addr+1 addr
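
The byte layout above can be sketched in portable C. This is only an illustration: the helper names store_le32 and load_le32 are hypothetical, and the code uses shifts so that it produces the Blackfin layout on any host, whatever the host's own byte order.

```c
#include <stdint.h>

/* Store a 32-bit word the way it appears in Blackfin memory:
 * B0 (least significant byte) at addr, B3 at addr+3. */
void store_le32(uint8_t *addr, uint32_t word)
{
    addr[0] = (uint8_t)(word & 0xFF);         /* B0 */
    addr[1] = (uint8_t)((word >> 8) & 0xFF);  /* B1 */
    addr[2] = (uint8_t)((word >> 16) & 0xFF); /* B2 */
    addr[3] = (uint8_t)((word >> 24) & 0xFF); /* B3 */
}

/* Rebuild the register value from the four bytes in memory. */
uint32_t load_le32(const uint8_t *addr)
{
    return (uint32_t)addr[0]
         | ((uint32_t)addr[1] << 8)
         | ((uint32_t)addr[2] << 16)
         | ((uint32_t)addr[3] << 24);
}
```

Storing 0x11223344 places 0x44 at addr and 0x11 at addr+3.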

The following shows 16- and 32-bit instructions stored in memory. The diagram shows 16-bit instructions stored in memory with the most significant byte of the instruction stored in the high address (byte B1 in addr+1) and the least significant byte in the low address (byte B0 in addr). The diagram also shows 32-bit instructions stored in memory. Note the most significant 16-bit half word of the instruction (bytes B3 and B2) is stored in the low addresses (addr+1 and addr), and the least significant half word (bytes B1 and B0) is stored in the high addresses (addr+3 and addr+2).

16-bit instruction
Instruct0
B1 B0
addr+1 addr
16-bit instructions
Instruct1 Instruct0
B1 B0 B1 B0
addr+3 addr+2 addr+1 addr
32-bit instructions
Instruct0
B1 B0 B3 B2
addr+3 addr+2 addr+1 addr
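
As a sketch of the instruction layout described above, the following C helpers reassemble an instruction from raw memory bytes. The function names are hypothetical and the code is only a model of the diagram, not part of any toolchain.

```c
#include <stdint.h>

/* A 16-bit instruction is stored little endian: B0 at addr, B1 at addr+1. */
uint16_t fetch_insn16(const uint8_t *addr)
{
    return (uint16_t)(addr[0] | (addr[1] << 8));
}

/* A 32-bit instruction is stored as two 16-bit half words.
 * The most significant half word (B3:B2) sits at the LOW addresses,
 * the least significant half word (B1:B0) at the HIGH addresses. */
uint32_t fetch_insn32(const uint8_t *addr)
{
    uint32_t high_half = fetch_insn16(addr);     /* B3:B2 at addr+1, addr   */
    uint32_t low_half  = fetch_insn16(addr + 2); /* B1:B0 at addr+3, addr+2 */
    return (high_half << 16) | low_half;
}
```

Note that this differs from a 32-bit data word, where B0 is at addr.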

Data Sizes

The sizes of intrinsic C/C++ data types are selected so that most C/C++ programs execute with hardware-native data types and, therefore, at high speed. All C/C++ run-time environments use the intrinsic C/C++ data types and data formats below:

Type | Bit Size | Number Representation | sizeof() returns | Representation in Memory | Representation in 32/64-bit Register
char | 8-bit signed | two's complement | 1 | sddd dddd | ssss ssss ssss ssss ssss ssss sddd dddd
unsigned char | 8-bit unsigned | unsigned magnitude | 1 | dddd dddd | 0000 0000 0000 0000 0000 0000 dddd dddd
short | 16-bit signed | two's complement | 2 | sddd dddd dddd dddd | ssss ssss ssss ssss sddd dddd dddd dddd
unsigned short | 16-bit unsigned | unsigned magnitude | 2 | dddd dddd dddd dddd | 0000 0000 0000 0000 dddd dddd dddd dddd
int | 32-bit signed | two's complement | 4 | sddd dddd dddd dddd dddd dddd dddd dddd | sddd dddd dddd dddd dddd dddd dddd dddd
unsigned int | 32-bit unsigned | unsigned magnitude | 4 | dddd dddd dddd dddd dddd dddd dddd dddd | dddd dddd dddd dddd dddd dddd dddd dddd
long | 32-bit signed | two's complement | 4 | sddd dddd dddd dddd dddd dddd dddd dddd | sddd dddd dddd dddd dddd dddd dddd dddd
unsigned long | 32-bit unsigned | unsigned magnitude | 4 | dddd dddd dddd dddd dddd dddd dddd dddd | dddd dddd dddd dddd dddd dddd dddd dddd
long long | 64-bit signed | two's complement | 8 | sddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd | sddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd
unsigned long long | 64-bit unsigned | unsigned magnitude | 8 | dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd | dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd dddd
pointer | 32-bit | address | 4 | dddd dddd dddd dddd dddd dddd dddd dddd | dddd dddd dddd dddd dddd dddd dddd dddd
function pointer | 32-bit | address | 4 | dddd dddd dddd dddd dddd dddd dddd dddd | dddd dddd dddd dddd dddd dddd dddd dddd
float | 32-bit | IEEE single-precision | 4 | seee eeee emmm mmmm mmmm mmmm mmmm mmmm | seee eeee emmm mmmm mmmm mmmm mmmm mmmm
double | 64-bit | IEEE double-precision | 8 | seee eeee eeee mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm | seee eeee eeee mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm
long double | 64-bit | IEEE double-precision | 8 | seee eeee eeee mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm | seee eeee eeee mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm mmmm
fract16 | 16-bit signed | 1.15 fract | 2 | s.ddd dddd dddd dddd s.ddd dddd dddd dddd | s.ddd dddd dddd dddd s.ddd dddd dddd dddd
fract32 | 32-bit signed | 1.31 fract | 4 | s.ddd dddd dddd dddd dddd dddd dddd dddd | s.ddd dddd dddd dddd dddd dddd dddd dddd

  • s = sign bit(s)
  • d = data bit(s)
  • e = exponent bit(s)
  • m = mantissa bit(s)
  • "." = decimal point by convention; however, a decimal point does not literally appear in the number.
  • Italics denotes data from a source other than adjacent bits.
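
The register column for the 8- and 16-bit types can be reproduced in portable C. The sketch below uses the fixed-width types from stdint.h as stand-ins for char and short, and the function names are hypothetical; it shows only how a narrow value is widened to a 32-bit register image.

```c
#include <stdint.h>

/* Signed types are sign-extended: the s bit fills the upper register bits. */
uint32_t reg_from_char(int8_t value)
{
    return (uint32_t)(int32_t)value;
}

uint32_t reg_from_short(int16_t value)
{
    return (uint32_t)(int32_t)value;
}

/* Unsigned types are zero-extended: the upper register bits are 0. */
uint32_t reg_from_uchar(uint8_t value)
{
    return (uint32_t)value;
}

uint32_t reg_from_ushort(uint16_t value)
{
    return (uint32_t)value;
}
```

For example, the signed char value -2 (0xFE in memory) becomes 0xFFFFFFFE in a register, while the unsigned char value 0xFE becomes 0x000000FE.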

The floating-point and 64-bit data types are implemented using software emulation, so they must be expected to run more slowly than hardware-supported native data types. The emulated data types are float, double, long double, long long and unsigned long long.

The fract16 and fract32 types are not actually intrinsic data types; they are typedefs to short and long, respectively. In C, you need to use built-in functions to do basic arithmetic (see the compiler's built-in functions). You cannot compute fract16*fract16 with the plain * operator and get the right result. This is being worked on; check out mainline for details. In C++, for fract data, the classes fract and shortfract define the basic arithmetic operators.
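
To see why the plain * operator gives the wrong answer for 1.15 data, consider the following portable sketch. It is not the compiler's built-in function: the names fract16_sketch, naive_mul and mul_q15_sketch are hypothetical, the result is truncated rather than rounded, and the right shift assumes an arithmetic shift of negative values (which GCC provides).

```c
#include <stdint.h>

/* Hypothetical stand-in for fract16, which is a typedef to short. */
typedef int16_t fract16_sketch;

/* What the plain C operator computes: an integer product of the raw
 * bit patterns. The result is in 2.30 format, not 1.15. */
int32_t naive_mul(fract16_sketch a, fract16_sketch b)
{
    return (int32_t)a * (int32_t)b;
}

/* A 1.15 x 1.15 -> 1.15 multiply: rescale by 2^15 and saturate.
 * Only -1.0 * -1.0 can overflow, giving +1.0, which 1.15 cannot hold. */
fract16_sketch mul_q15_sketch(fract16_sketch a, fract16_sketch b)
{
    int32_t scaled = naive_mul(a, b) >> 15;
    if (scaled > INT16_MAX)
        scaled = INT16_MAX;
    return (fract16_sketch)scaled;
}
```

With 0.5 represented as 0x4000, naive_mul returns 0x10000000, whereas the rescaled product is 0x2000 (0.25).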

Floating-Point Data Size

On Blackfin processors, the float data type is 32 bits, and the double data type default size is 32 bits. This size is chosen because it is the most efficient. The 64-bit long double data type is available if more precision is needed, although this is more costly because the type exceeds the data sizes supported natively by hardware.

In the C language, floating-point literal constants default to the double data type. When operations involve both float and double, the float operands are promoted to double and the operation is done at double size (64-bit). GNU GCC complies with the C language standard.
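
Because double arithmetic is emulated in software, an unsuffixed literal can silently make an expression more expensive. The following sketch (function names are hypothetical) uses sizeof to show the type the compiler picks for each expression; the operand of sizeof is not evaluated.

```c
#include <stddef.h>

/* 0.5 is a double literal, so f is promoted and the product is a double. */
size_t mixed_expr_size(float f)
{
    return sizeof(f * 0.5);
}

/* 0.5f is a float literal, so the product stays a float. */
size_t float_expr_size(float f)
{
    return sizeof(f * 0.5f);
}
```

Adding the f suffix to literals keeps such expressions at single precision.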

VDSP++ defaults the double type to a 32-bit data type, and so usually avoids the additional expense of these promotions. This does not, however, fully conform to the C and C++ standards, which require that the double type supports at least 10 digits of precision. This is an optional feature that can be controlled with a compiler flag.

Floating-Point Binary Formats

The Blackfin compiler supports IEEE and non-IEEE floating-point formats.

IEEE Floating-Point Format

By default, the Blackfin compiler provides floating point emulation using IEEE single- and double-precision formats. Single-precision IEEE format provides a 32-bit value, with 23 bits for mantissa, 8 bits for exponent, and 1 bit for sign. This format is used for the float data type.

Double-precision IEEE format provides a 64-bit value, with 52 bits for mantissa, 11 bits for exponent, and 1 bit for sign. This format is used for the long double data type, and for the double data type.
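
The single-precision field layout can be inspected with a few lines of C. This is a minimal sketch that assumes float is the 32-bit IEEE single-precision format described above; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct ieee_single_fields {
    uint32_t sign;     /* 1 bit */
    uint32_t exponent; /* 8 bits, biased by 127 */
    uint32_t mantissa; /* 23 bits, without the implicit leading 1 */
};

/* Split a float into its sign, exponent and mantissa fields. */
struct ieee_single_fields split_single(float value)
{
    uint32_t bits;
    struct ieee_single_fields fields;

    /* memcpy copies the bit pattern without violating aliasing rules. */
    memcpy(&bits, &value, sizeof bits);
    fields.sign = bits >> 31;
    fields.exponent = (bits >> 23) & 0xFF;
    fields.mantissa = bits & 0x7FFFFF;
    return fields;
}
```

For 1.0f this yields sign 0, exponent 127 and mantissa 0.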

Variants of IEEE Floating-Point Support

The Blackfin compiler supports two variants of IEEE floating-point support. These variants are implemented in terms of two alternative emulation libraries, selected at compile/link time. The two alternative emulation libraries are:

  • The default IEEE floating-point library. It is a strictly-conforming variant, which offers less performance, but includes all the input-checking that has been relaxed in the alternative library.
  • An alternative IEEE floating-point library. It is a high-performance variant, which relaxes some of the IEEE rules in the interests of performance. This library assumes that its inputs will be valid numbers, rather than Not-a-Number values. This library can also be selected explicitly via the -mfast-fp switch.

Symbol Prefixes

A user should be able to use C/C++ symbols (function or variable names) in assembly routines and use assembly symbols in C/C++ code. This section describes how to name and use C/C++ and assembly symbols.

The Blackfin C ABI stipulates that all symbols have an underscore prefix (_). To name an assembly symbol that corresponds to a C symbol, add an underscore prefix to the C symbol name when declaring the symbol in assembly. For example, the C symbol foo becomes the assembly symbol _foo. C++ global symbols are usually “mangled” to encode the additional type information. Declare C++ global symbols using extern "C" to disable the mangling.

To use a C/C++ function or variable in an assembly routine, declare it as global in the C program. Then in the assembly routine, you just use the symbol. The GNU assembler treats all undefined symbols as extern. The common .extern directive is not needed and is simply ignored.

To use an assembly function or variable in your C/C++ program, declare the symbol with the .global assembler directive in the assembly file and import the symbol by using extern in the C/C++ code.

ELF Run Time

C/C++ Run-Time Header and Startup Code

The C/C++ run-time (CRT) header is code that is executed after the processor jumps to the start address on reset. The CRT header sets the machine into a known state and calls _main.

Default CRT objects are provided for all platforms in the run-time libraries, and are linked against for all C/C++ projects.

The CRT ensures that when execution enters _main, the processor’s state obeys the C Application Binary Interface (ABI), and that global data declared by the application have been initialized as required by the C/C++ standards. It arranges things so that _main appears to be “just another function” invoked by the normal function invocation procedure. Not all applications require the same configuration. For example, C++ constructors are invoked only for applications that contain C++ code. The list of optional configuration items is long enough that determining whether to invoke each one in turn at run-time would be overly costly.

The CRT header is used for projects that use C, C++ and Assembly.

The list of operations performed by the CRT (startup code) can include (not necessarily in the following order):

  • Setting registers to known/required values
  • Disabling hardware loops
  • Disabling circular buffers
  • Setting up default event handlers and enabling interrupts
  • Initializing the Stack and Frame Pointers
  • Enabling the cycle counter
  • Configuring the memory ports used by the two DAGs
  • Setting the processor clock speed
  • Copying data from the flash memory to RAM
  • Initializing device drivers
  • Setting up memory protection and caches
  • Changing processor interrupt priority
  • Initializing profiling support
  • Invoking C++ constructors
  • Invoking _main, with supplied parameters
  • Invoking _exit on termination

What the CRT Does Not Do

The CRT does not initialize actual memory hardware. However, some properties of external SDRAM may be configured if the CRT contains code to customize clock and power settings. The initialization of the external SDRAM is left to the boot loader because it is possible (and even likely) that the CRT itself will need to be moved into external memory before being executed.

Dedicated Registers

The C/C++ run-time environment specifies a set of registers whose contents should not be changed except in specific defined circumstances. If these registers are changed, their values must be saved and restored. The dedicated register values must always be valid for every function call (especially for library calls) and for any possible interrupt.

Dedicated registers are:

  • SP (P6) - FP (P7) The SP (P6) and FP (P7) are the Stack Pointer and the Frame Pointer registers, respectively. The compiler requires that both are 4-byte aligned registers pointing to valid areas within the stack section.
  • L0 — L3 These registers define the lengths of the DAG’s circular buffers. The compiler makes use of the DAG registers, both in linear mode and in circular buffering mode. The compiler assumes that the Length registers are zero, both on entry to functions and on return from functions, and ensures this is the case when it generates calls or returns. Your application may modify the Length registers and make use of circular buffers, but you must ensure that the Length registers are appropriately reset when calling compiled functions, or returning to compiled functions. Interrupt handlers must store and restore the Length registers, if making use of DAG registers.

Call Preserved Registers

The C/C++ run-time environment specifies a set of registers whose contents must be saved and restored. Your assembly function must save these registers during the function’s prologue and restore the registers as part of the function’s epilogue. The call preserved registers must be saved and restored if they are modified within the assembly function; if a function does not change a particular register, it does not need to save and restore the register. The registers are:

  • P3 — P5
  • R4 — R7

Scratch Registers

The C/C++ run-time environment specifies a set of registers whose contents do not need to be saved and restored. Note that the contents of these registers are not preserved across function calls.

  • P0 Used as the Aggregate Return Pointer
  • P1 — P2
  • R0 — R3 The first three words of the argument list are always passed in R0, R1 and R2 if present (R3 is not used for parameters).
  • LB0 — LB1
  • LC0 — LC1
  • LT0 — LT1
  • ASTAT (Including CC)
  • A0 — A1
  • I0 — I3
  • B0 — B3
  • M0 — M3
  • RETS

Loop Counters, Overlays and DMA’d Code

The compiler does not ensure that the loop counter registers LC0 and LC1 are zero on entry or exit from a function. This does not normally cause a problem because the exit point of a hardware loop is unique within the program, and the compiler ensures that the only path to the exit is through the corresponding loop setup instruction. If overlays are being used, or if code is being DMA’d into faster memory for execution, this may no longer be the case. It is possible for an overlay or a DMA’d function to set up a loop that terminates at address A, and then for a different overlay or DMA’d function to have different code occupying address A at a later point in time. If a hardware loop is still active (LC0 or LC1 is non-zero) at the point when the instruction at address A is reached, then undefined behavior results as the hardware loop “jumps” back to the start of the loop. Therefore, in such cases, the overlay manager or the DMA manager must reset the loop counters to ensure that no hardware loops remain active that might relate to the address range covered by the variant code.

Stack Registers

The C/C++ run-time environment reserves a set of registers for controlling the run-time stack. These registers may be modified for stack management, but must be saved and restored. Stack registers are:

  • SP (P6) - Stack pointer
  • FP (P7) - Frame pointer

Managing the Stack

The C/C++ run-time environment uses the run-time stack to store automatic variables and return addresses. The stack is managed by a Frame Pointer (FP) and a Stack Pointer (SP) and grows downward in memory, moving from higher to lower addresses. A stack frame is a section of the stack used to hold information about the current context of the C/C++ program. Information in the frame includes local variables, compiler temporaries, and parameters for the next function.

The Frame Pointer serves as a base for accessing memory in the stack frame. Routines refer to locals, temporaries, and parameters by their offset from the Frame Pointer.

Note that when calling other functions, make sure that at least 12 bytes exist on the stack. The target function may assume that these top twelve bytes exist for its own usage without explicitly allocating them itself. Typically this space is used to save the argument registers (R0, R1, R2), but it may be used as scratch space as well. It's a similar concept to the Red Zone defined in the ABI of other architectures. This can easily be combined into any existing LINK instruction by adding 12 to whatever your function actually needs.

Example

To enter and perform a function, follow this sequence of steps:

  1. Linking Stack Frames - The return address and the caller’s FP are saved on the stack, and FP set pointing to the beginning of the new (callee) stack frame. SP is decremented to allocate space for local variables and compiler temporaries.
  2. Register Saving - Any registers that the function needs to preserve are saved on the stack frame, and SP is set pointing to the top of the stack frame.

At the end of the function, these steps must be performed:

  1. Restore Registers - Any registers that had been preserved are restored from the stack frame, and SP is set pointing to the top of the stack frame.
  2. Unlinking Stack Frame - The frame pointer is restored from the stack frame to the caller’s FP, RETS is restored from the stack frame to the return address, and SP is set pointing to the top of the caller’s stack frame.

A typical function prologue would be:

	LINK 16;
	[--SP] = (R7:4);
	SP += -16;
	[FP+8] = R0;
	[FP+12] = R1;
	[FP+16] = R2;

where:

  • LINK 16; is a special linkage instruction that saves the return address and the frame pointer, and updates the Stack Pointer to allocate 16 bytes for local variables.
  • [--SP] = (R7:4); allocates space on the stack and saves the registers in the save area.
  • SP += -16; allocates space on the stack for outgoing arguments. Always allocate at least 12 bytes on the stack.
  • [FP+8]=R0; [FP+12]=R1; [FP+16]=R2; saves the argument registers in the argument area.
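
The effect of the prologue on SP and FP can be followed with a small host-side model. This is a sketch only: the names are hypothetical, and it assumes that LINK pushes RETS and then the caller's FP before setting FP, which is consistent with the epilogue reading the return address from [FP+4].

```c
#include <stdint.h>

struct frame_sketch {
    uint32_t fp; /* Frame Pointer after the prologue */
    uint32_t sp; /* Stack Pointer after the prologue */
};

/* Model "LINK locals; [--SP] = (saved regs); SP += -outgoing;". */
struct frame_sketch model_prologue(uint32_t sp_on_entry, uint32_t locals,
                                   uint32_t saved_regs, uint32_t outgoing)
{
    struct frame_sketch frame;
    uint32_t sp = sp_on_entry;

    sp -= 4;              /* LINK: push RETS, later found at [FP+4] */
    sp -= 4;              /* LINK: push caller's FP, found at [FP]  */
    frame.fp = sp;        /* LINK: FP = SP                          */
    sp -= locals;         /* LINK: allocate local variables         */
    sp -= 4 * saved_regs; /* [--SP] = (R7:4) pushes four registers  */
    sp -= outgoing;       /* SP += -16: outgoing argument area      */
    frame.sp = sp;
    return frame;
}

/* Address of word n of the incoming argument list: [FP + 8 + 4*n]. */
uint32_t incoming_arg_slot(const struct frame_sketch *frame, unsigned word)
{
    return frame->fp + 8 + 4 * word;
}
```

With SP at 0x1000 on entry, the model gives FP = 0x0FF8, so [FP+8] is address 0x1000. That is the first of the twelve bytes the caller reserved, which is why the prologue can store R0, R1 and R2 at [FP+8], [FP+12] and [FP+16].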

A matching function epilogue would be:

	SP += 16;
	P0 = [FP+4];
	(R7:4) = [SP++];
	UNLINK;
	JUMP (P0);

where:

  • SP += 16; reclaims the space on the stack that was used for outgoing arguments.
  • P0 = [FP+4]; loads the return address into register P0.
  • (R7:4) = [SP++]; restores the registers from the save area and reclaims the area.
  • UNLINK; is a special instruction that restores the frame pointer and stack pointer.
  • JUMP (P0); returns to the caller.

Function Arguments and Return Values

The C/C++ run-time environment uses a set of registers and the run-time stack to transfer function parameters to assembly routines. Your assembly language functions must follow these conventions when they call (or when called by) C/C++ functions.

Passing Arguments

The details of argument passing are most easily understood in terms of a conceptual argument list. This is a list of words on the stack. Double arguments are placed starting on the next available word in the list, as are structures. Each argument appears in the argument list exactly as it would in storage, and each separate argument begins on a word boundary.

The actual argument list is like the conceptual argument list except that the contents of the first three words are placed in registers R0, R1 and R2. Normally this means that the first three arguments (if they are integers or pointers) are passed in registers R0 to R2 with any additional arguments being passed on the stack.

If any argument is greater than one word, it occupies multiple registers. The caller is responsible for extending any char or short arguments to 32-bit values.

When calling a C function, at least twelve bytes of stack space must be allocated.

The details of argument passing do not change for variable argument lists. For example, if a function is declared as:

int varying(char *fmt, ...) { /* ... */ }

it may receive one or more arguments. As with other functions, the first argument, fmt, is passed in R0, and other arguments are passed in R1, and then R2, and then on the stack, as required.

Variable argument lists are processed using the macros defined in the stdarg.h header file. The va_start() macro obtains a pointer to the list of arguments, which may be passed to other functions or walked by the va_arg() macro. To support this, the compiler begins variable argument functions by flushing R0, R1 and R2 to their reserved spaces on the stack:

_varying:
	[SP+0] = R0;
	[SP+4] = R1;
	[SP+8] = R2;

The va_start() macro can then take the address of the last non-varying argument (fmt, in the example above, at [SP+0]), and va_arg() can walk through the complete argument list on the stack.
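
On the C side, none of this register flushing is visible; the standard stdarg.h macros are used as on any other target. A minimal example, with sum_ints as a hypothetical function name:

```c
#include <stdarg.h>

/* Add up `count` int arguments that follow the fixed parameter. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);          /* start after the last fixed argument */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int); /* fetch the next word-sized argument  */
    va_end(args);
    return total;
}
```

In a call such as sum_ints(3, 10, 20, 30), the conventions above place count in R0, the next two values in R1 and R2, and the last value on the stack.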

Return Values

If a function returns a short or a char, the callee is responsible for sign- or zero-extending the return value into a 32-bit register. So, for example, a function that returns a signed short must sign-extend that short into R0. Similarly a function that returns an unsigned char must zero-extend that unsigned char into R0.

  • For functions returning aggregate values occupying less than or equal to 32 bits, the result is returned in R0.
  • For aggregate values occupying greater than 32 bits, and less than or equal to 64 bits, the result is returned in register pair R0, R1.
  • For functions returning aggregate values occupying more than 64 bits, the caller allocates the return value object on the stack and the address of this object is passed to the callee as a hidden argument in register P0.

The callee must copy the return value into the object at the address in P0.
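
The three aggregate cases can be summarized as a small classification function. This is a sketch with hypothetical names; the struct sizes in the usage note assume a 4-byte int with 4-byte alignment, as on Blackfin.

```c
#include <stddef.h>

enum return_location {
    RETURN_IN_R0,    /* up to 32 bits                      */
    RETURN_IN_R0_R1, /* more than 32 bits, up to 64 bits   */
    RETURN_VIA_P0    /* more than 64 bits: caller's object */
};

/* Pick the return location for an aggregate of the given size. */
enum return_location aggregate_return_location(size_t size_in_bytes)
{
    size_t bits = size_in_bytes * 8;

    if (bits <= 32)
        return RETURN_IN_R0;
    if (bits <= 64)
        return RETURN_IN_R0_R1;
    return RETURN_VIA_P0;
}

/* Two aggregates of different sizes. */
struct s2 { char t; char u; int v; };
struct s3 { char t; char u; int v; int w; };
```

Here sizeof(struct s2) is 8, so it is returned in R0 and R1, while sizeof(struct s3) is 12, so it is returned through the address in P0.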

Examples of Parameter Passing

Function Prototype | Parameters Passed as | Return Location
int test(int a, int b,int c) | a in R0, b in R1, c in R2 | in R0
char test(int a, char b, char c) | a in R0, b in R1, c in R2 | in R0
int test(int a) | a in R0 | in R0
int test(char a, char b, char c, char d, char e) | a in R0, b in R1, c in R2, d in [FP+20], e in [FP+24] | in R0
int test(struct *a, int b, int c) | a (addr) in R0, b in R1, c in R2 | in R0
struct s2a { char ta; char ub; int vc;} int test(struct s2a x, int b, int c) | x.ta and x.ub in R0, x.vc in R1, b in R2, c in [FP+20] | in R0
struct foo *test(int a, int b, int c) | a in R0, b in R1, c in R2 | (address) in R0
void qsort(void *base, int nel, int width, int (*compare)(const void *, const void *)) | base (addr) in R0, nel in R1, width in R2, compare (addr) in [FP+20] | (none)
struct s2 { char t; char u; int v; } struct s2 test(int a, int b, int c) | a in R0, b in R1, c in R2 | in R0 (s.t and s.u), R1 (s.v)
struct s3 { char t; char u; int v; int w; } struct s3 test(int a, int b, int c) | a in R0, b in R1, c in R2 | in *P0 (based on value of P0 at the call, not necessarily at the return)
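
The parameter locations in these examples follow mechanically from the conceptual argument list. The sketch below (hypothetical names) assigns each argument its first word in the list and names where that word lives. The [FP+n] offsets assume the callee has already executed LINK, so that word n of the list is at [FP + 8 + 4*n].

```c
#include <stddef.h>
#include <stdio.h>

/* Compute the index of the first list word used by each argument.
 * Every argument starts on a word boundary and occupies whole words. */
void first_words(const size_t *arg_sizes, size_t count, size_t *first_word)
{
    size_t next = 0;

    for (size_t i = 0; i < count; i++) {
        first_word[i] = next;
        next += (arg_sizes[i] + 3) / 4; /* round up to 4-byte words */
    }
}

/* Name the location of one word of the argument list. */
void describe_arg_word(size_t word, char *out, size_t out_size)
{
    if (word < 3)
        snprintf(out, out_size, "R%zu", word);           /* R0, R1, R2 */
    else
        snprintf(out, out_size, "[FP+%zu]", 8 + 4 * word); /* stack    */
}
```

For test(struct s2a x, int b, int c) the sizes are 8, 4 and 4 bytes, so x starts at word 0 (R0 and R1), b is word 2 (R2), and c is word 3 ([FP+20]).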

C/C++ and Assembly Interface

This section describes how to call assembly language subroutines from within C/C++ programs, and how to call C/C++ functions from within assembly language programs. Before attempting to perform either of these operations, familiarize yourself with the information about the C/C++ run-time model (including details about the stack, data types, and how arguments are handled).

Calling Assembly Subroutines From C/C++ Programs

Before calling an assembly language subroutine from a C/C++ program, create a prototype to define the arguments for the assembly language subroutine and the interface from the C/C++ program to the assembly language subroutine. Even though it is legal to use a function without a prototype in C/C++, prototypes are a strongly-recommended practice for good software engineering. When the prototype is omitted, the compiler cannot perform argument-type checking and assumes that the return value is of type integer and uses K&R promotion rules instead of ANSI promotion rules.

The compiler prefaces the name of any external entry point with an underscore. Therefore, declare your assembly language subroutine’s name with a leading underscore.

The run-time model defines some registers as scratch registers and others as preserved or dedicated registers. Scratch registers can be used within the assembly language program without worrying about their previous contents. If more room is needed (or an existing code is used) and you wish to use the preserved registers, you must save their contents and then restore those contents before returning.

Use the dedicated or stack registers for their intended purpose only; the compiler, libraries, debugger, and interrupt routines depend on having a stack available as defined by those registers.

The compiler also assumes the machine state does not change during execution of the assembly language subroutine.

Do not change any machine modes (for example, certain registers may be used to indicate circular buffering when those register values are nonzero).

If arguments are on the stack, they are addressed via an offset from the stack pointer or frame pointer. A good way to explore how arguments are passed between a C/C++ program and an assembly language subroutine is to write a dummy function in C/C++ and compile it using the save temporary files option (-save-temps).

Example

The following example assigns each argument to a global variable to show where the arguments can be found upon entry to asmfunc.

//  Sample file for exploring compiler interface …
//  global variables … assign arguments there just so
//  we can track which registers were used
//  (type of each variable corresponds to one of arguments):
int global_a;
float global_b;
int *global_p;
 
// the function itself:
int asmfunc(int a, float b, int * p)
{
	// do some assignments so assembly file will show where args are:
	global_a = a;
	global_b = b;
	global_p = p;
 
	// value gets loaded into the return register:
	return 12345;
}

When compiled with the -save-temps switch, the following code is produced.

$ bfin-elf-gcc -O2 -c foo.c -o foo.o -save-temps

.file "foo.c";
.text;
        .align 4
.global _asmfunc;
.type _asmfunc, STT_FUNC;
_asmfunc:
        P2.H = _global_a;
        P2.L = _global_a;
        LINK 0;
        [P2] = R0;
        P2.H = _global_b;
        P2.L = _global_b;
        [P2] = R1;
        P2.H = _global_p;
        P2.L = _global_p;
        [P2] = R2;
        R0 = 12345 (X);
        UNLINK;
        rts;
        .size   _asmfunc, .-_asmfunc
        .comm   _global_a,4,4
        .comm   _global_b,4,4
        .comm   _global_p,4,4
        .ident  "GCC: (GNU) 4.1.2 (ADI cvs)"

Calling C/C++ Functions From Assembly Programs

You may want to call a C/C++ callable library and other functions from within an assembly language program. As discussed in Calling Assembly Subroutines From C/C++ Programs above, you may want to create a test function to do this in C/C++, and then use the code generated by the compiler as a reference when creating your assembly language program and the argument setup. Using volatile global variables may help clarify the essential code in your test function.

The run-time model defines some registers as scratch registers and others as preserved or dedicated. The contents of the scratch registers may be changed without warning by the called C/C++ function. If the assembly language program needs the contents of any of those registers, you must save their contents before the call to the C/C++ function and then restore those contents after returning from the call.

Use the dedicated registers for their intended purpose only; the compiler, libraries, debugger, and interrupt routines all depend on having a stack available as defined by those registers.

Preserved registers can be used; their contents are not changed by calling a C/C++ function. The function always saves and restores the contents of preserved registers if they are going to change.

If arguments are on the stack, they are addressed via an offset from the stack pointer or frame pointer. Explore how arguments are passed between an assembly language program and a function by writing a dummy function in C/C++ and compiling it with the save temporary files option (see the -save-temps option). By examining the contents of volatile global variables in a *.s file, you can determine how the C/C++ function passes arguments, and then duplicate that argument setup process in the assembly language program.

The stack must be set up correctly before calling a C/C++ callable function. If you call other functions, maintaining the basic stack model also facilitates the use of the debugger. The easiest way to do this is to define a C/C++ main program to initialize the run-time system; maintain the stack until it is needed by the C/C++ function being called from the assembly language program; and then continue to maintain that stack until it is needed to call back into C/C++. However, make sure the dedicated registers are correct. You do not need to set the FP prior to the call; the caller’s FP is never used by the recipient.

C++ Template Support

The compiler supports C++ templates as defined in the ISO/IEC 14882:1998 C++ standard.

Linux Run Time

Here are the aspects of the Linux ABI that are not specific to any executable format.

Kernel Startup

While the ELF ABI for the Linux kernel is very much like that of other ELFs (it has the same symbol prefix, calling convention, etc…), the initial runtime environment is unique. Thus, any bootloader which wishes to boot the Linux kernel on the Blackfin processor must adhere to the following specification. Any register not listed here may have any value as the kernel will simply ignore it.

Register Value
R0 Address of command line
RETX Value of RETX at power on

The command line is a normal C string stored anywhere in addressable space. Typically the L1 scratch pad is used for this purpose as it is the only L1 address that exists on all Blackfin parts (0xFFB00000). A value of 0 will be interpreted as NULL (meaning no string was specified).

The RETX value is not required, but may be useful in the case of a double exception or similar crash. The kernel can detect and utilize that value to print out helpful information.

System Calls

The following argument passing conventions are used for system calls (syscalls):

REG ENTRY EXIT
P0 syscall no. preserved
R0 arg 1 return value / error
R1 arg 2 preserved
R2 arg 3 preserved
R3 arg 4 preserved
R4 arg 5 preserved
R5 arg 6 preserved

Note that, with the exception of R0, the kernel preserves the values of each of these registers as well as all other registers.

The syscall is made via:

EXCPT 0;

The returned value has dual meanings. If it is a negative value greater than -4096, it is a negated errno value. All other values depend on the system call for proper interpretation. Some example C code for handling this:

if (ret > (unsigned long)-4096) {
    /* it is an errno value */
    errno = -ret;
    return -1;
} else
    return ret;
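
The fragment above can be wrapped in a complete function for experimentation. This is a sketch, with decode_syscall_result as a hypothetical name; it takes the raw R0 value as an unsigned long, as the comparison in the fragment implies.

```c
#include <errno.h>

/* Turn the raw value left in R0 by the kernel into the usual C
 * convention: -1 with errno set on failure, the value otherwise. */
long decode_syscall_result(unsigned long ret)
{
    if (ret > (unsigned long)-4096) {
        /* Values -4095..-1 are negated errno codes. Negate in
         * unsigned arithmetic to avoid signed overflow concerns. */
        errno = (int)(0UL - ret);
        return -1;
    }
    return (long)ret;
}
```

For example, a raw result of -2 returns -1 with errno set to 2, while a raw result of 5 is returned unchanged.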

Reserved Memory

The Linux kernel is fully relocatable (start and end addresses can be anywhere), with the exception of aspects which interface to other tools (bootloaders or the toolchain). Some things have a binary ABI between these tools, which affects the kernel layout:

Suspend To RAM

If you wish to support resuming from suspend to RAM, then the ABI is as follows:

Address Meaning
0x00000000+4 Hibernate magic (needs to be 0xDEADBEEF)
0x00000004+4 Resume address
0x00000008+4 Saved stack pointer

Just before Linux finishes suspending completely, it will write out these fields for the bootloader to use when resuming.

The magic string makes sure we only resume systems that have actually been suspended. Normally we check the SCKELOW bit in the VR_CTL, but this may have problems on some parts due to hardware anomaly 307.

The resume address is where the boot loader should jump to after checking the magic string.

The stack pointer is where the stack should be restored before resuming.

We need to store these values in external memory as all other locations are not maintained across reboot of the part (core registers, core MMRs, system MMRs, on-chip memory, etc…). Only the external memory is maintained.

Also make sure that before you jump to the resume address, you have lowered yourself to IVG15. The Blackfin does not come out of reset at IVG15, but rather a much higher level.

To prevent incorrectly resuming multiple times, you should also clear the magic string before jumping to the resume address.

For examples of how to do all of this, please consult the Blackfin port of Das U-Boot.
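
As a rough illustration of the bootloader side, the sketch below checks the magic, picks up the two saved values and clears the magic so a later reset cannot resume twice. The names check_resume and struct resume_info are made up for this example and are not taken from U-Boot; the base pointer is passed in so the code can be exercised on a host, whereas a real bootloader would point it at address 0x00000000 and would also have to lower itself to IVG15 before jumping.

```c
#include <assert.h>
#include <stdint.h>

#define HIBERNATE_MAGIC 0xDEADBEEFu

/* Word indices into external memory starting at address 0x00000000. */
enum { WORD_MAGIC = 0, WORD_RESUME_ADDR = 1, WORD_SAVED_SP = 2 };

struct resume_info {
    uint32_t resume_addr;  /* where to jump after the check      */
    uint32_t saved_sp;     /* stack pointer to restore beforehand */
};

/* Returns 1 and fills *out if a suspended kernel left valid fields behind.
 * Returns 0 if the magic is absent (a normal cold boot). */
static int check_resume(volatile uint32_t *base, struct resume_info *out)
{
    assert(base && out);
    if (base[WORD_MAGIC] != HIBERNATE_MAGIC)
        return 0;
    out->resume_addr = base[WORD_RESUME_ADDR];
    out->saved_sp = base[WORD_SAVED_SP];
    base[WORD_MAGIC] = 0;  /* prevent resuming a second time */
    return 1;
}
```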

Null Pointer Section

The lowest 1kB of memory is configured so that NULL pointers are correctly caught rather than being silently ignored. This is done with a dedicated locked CPLB for this range in addition to the normal CPLB mapping set up by the kernel. Since this low 1kB is thus covered by two CPLBs, any attempt to use this memory will result in a multiple CPLB hit exception and the kernel will safely kill the application. Both an ICPLB and a DCPLB are configured in this way so instruction and data violations are caught.

Since this covers more than just address 0 (the NULL address), we can also catch accesses at an offset from a NULL pointer (up to an offset of 1024, of course). So code that indexes off a NULL pointer (e.g. foo[10]) rather than dereferencing it directly (e.g. *foo) will also be caught most of the time.

Address Meaning
0x00000000+0x400 Doubly mapped memory

Note that this does not conflict with the suspend-to-ram ABI above as that memory is used at very low level times where no CPLBs are enabled.
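
To make the "most of the time" concrete, here is a tiny sketch that computes whether an indexed access off a NULL base still lands inside the doubly mapped range. The function name is invented for the example; only the 0x400 size comes from the table above.

```c
#include <assert.h>
#include <stdint.h>

#define NULL_GUARD_SIZE 0x400u  /* lowest 1kB, per the table above */

/* Would an access at byte offset index * elem_size from a NULL base fall
 * inside the doubly mapped range (and therefore raise the exception)? */
static int null_offset_is_caught(uint32_t index, uint32_t elem_size)
{
    uint64_t addr = (uint64_t)index * elem_size;
    return addr < NULL_GUARD_SIZE;
}
```

With 4-byte elements, foo[10] is at offset 40 and is caught, while foo[256] is at offset 1024 and falls just outside the guarded range.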

Fixed Code Section

The fixed code section includes fast implementations of atomic operations, which are required to be at these exact locations. If they are not, any application built with the Blackfin/Linux Toolchain will not work. The area between 0x400 and 0x490 is currently reserved, but room is left to grow.

See the Linux fixed-code page for more details.

Address Meaning
0x00000400 FIXED_CODE_START
0x00000490 FIXED_CODE_END

Shadow Console (Early Printk Log Buffer)

The shadow console is a 2.75k buffer in memory, starting at (typically) 0x500 and ending at 0xFFF, which contains the first few boot messages of the Linux kernel. It is always terminated with double nulls (0x00 0x00), so you can use the U-Boot strings command on it.

If the kernel believes it has crashed, it will write a value of 0xDEADBEEF into location 0x4F0, and the location of the start of the shadow console buffer into 0x4F4. When the boot loader boots, if it sees the magic value at 0x4F0, it should clear the magic value and then dump the contents of the buffer pointed to by 0x4F4.

Address Meaning
0x000004F0+4 crash magic (needs to be 0xDEADBEEF)
0x000004F4+4 pointer to log buffer
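
The boot loader's part of this handshake might look roughly like the following. This is a host-runnable sketch under stated assumptions: low memory is modelled as a byte array, the pointer stored at 0x4F4 is treated as an index into that array, and the names take_crash_log, rd32 and wr32 are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRASH_MAGIC_OFF 0x4F0u
#define LOG_PTR_OFF     0x4F4u
#define CRASH_MAGIC     0xDEADBEEFu

/* Little-endian 32-bit accessors for the memory image. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void wr32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* If the crash magic is set, clear it, store the buffer address in *start
 * and return the number of bytes before the 0x00 0x00 terminator.
 * Returns -1 when no crash was recorded. */
static long take_crash_log(uint8_t *mem, size_t mem_size, uint32_t *start)
{
    assert(mem && start && mem_size >= LOG_PTR_OFF + 4);
    if (rd32(mem + CRASH_MAGIC_OFF) != CRASH_MAGIC)
        return -1;
    wr32(mem + CRASH_MAGIC_OFF, 0);
    *start = rd32(mem + LOG_PTR_OFF);
    size_t i = *start;
    while (i + 1 < mem_size && !(mem[i] == 0 && mem[i + 1] == 0))
        i++;
    return (long)(i - *start);
}
```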

L1 Stack Checking

If you compile code with -mstack-check-l1, then gcc will instrument every function to compare its stack pointer with the stack limit that the kernel stores at a fixed L1 address (see the table below). This is done before the function epilogue, so the current frame is not missed. If the stack did overflow, the application raises exception 3 itself, which the kernel uses to kill the application with a stack overflow message.

Address Meaning
0xffb00000+4 Address of the stack bottom (initialized by kernel)
0xffb00004+4 Lowest stack address seen (updated by userspace)

These two locations are always reserved, even if you don't use stack checking in your application.

The typical generated code looks something like:

_func:
	/* function prologue */
	/* function body */
	P2.H = 0xffb0;
	P2.L = 0x0000;
	P2 = [P2];
	P2 += 8;
	cc = SP < P2;
	if !cc jump 4 (bp); excpt 3;
	/* function epilogue */
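
Read as C, the comparison in the listing amounts to the predicate below. This is just a model of the listing as shown (load the stack bottom, add 8, compare against SP); the function name is invented and nothing here is emitted by gcc.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the compare in the listing: the excpt 3 is reached when
 * SP < stack_bottom + 8, otherwise it is jumped over. */
static int l1_stack_check_traps(uint32_t sp, uint32_t stack_bottom)
{
    return sp < stack_bottom + 8u;
}
```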

Page size

The page size is fixed at 4 kilobytes. The mmap2 system call will take offsets right-shifted by 12 bits, like other ports, but it will reject offsets that do not represent multiples of the page size. Programs must not, however, assume the result of mmap to be aligned to 4-kilobyte boundaries, nor that the amount of space obtained from mmap is rounded up to a multiple of the page size, since uClinux does not offer such guarantees.
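
A caller that wants to avoid a rejected mmap2 can do the same check up front. The helper below is a sketch with an invented name; it only shows the arithmetic (shift by 12, refuse offsets that are not page multiples) and does not issue the system call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_SIZE  (1u << PAGE_SHIFT)  /* 4 kilobytes */

/* Convert a byte offset into the page-shifted value that mmap2 takes.
 * Returns 0 on success, -1 if the offset is not a multiple of the page size. */
static int mmap2_page_offset(uint64_t byte_offset, uint32_t *pgoff)
{
    assert(pgoff != NULL);
    if (byte_offset & (PAGE_SIZE - 1u))
        return -1;
    *pgoff = (uint32_t)(byte_offset >> PAGE_SHIFT);
    return 0;
}
```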

FLAT Run Time

The Linux Flat ABI is very close to the Standalone ELF run time, the main difference being the application initialization.

FIXME! - what else is different, and how is the crt0 different?

FDPIC Run Time

Although the ELF and FLAT run time ABIs are very close to (if not exactly the same as) other Blackfin compilers' ABIs (including ADI's VDSP++), the Shared Library (FDPIC) ABI adds additional requirements that make it impossible to use code written for any of the other ABIs without many changes. At this time, only the GNU Toolchain supports the Shared Library (FDPIC) ABI.

The Blackfin Shared Library (FDPIC) ABI is heavily borrowed from the FR-V FDPIC ABI, and portions of this specification are borrowed directly from the FR-V FDPIC ABI version 1, written by Kevin Buettner, Alexandre Oliva and Richard Henderson; copyright Red Hat, Inc. (2004) and used with permission.

FIXME! - A lot of the current text still explains FR-V, since I didn't know how we do it … - the code examples are also still FR-V, :(

Introduction

This describes extensions (and changes) to the existing Blackfin ABI required to support the implementation of shared libraries on a system whose OS allows processes to share a common address space. This document will also attempt to explore the motivations behind and the implications of these extensions.

One of the primary goals in using shared libraries is to reduce the memory requirements of the overall system. Thus, if two processes use the same library, the hope is that at least some of the memory pages will be shared between the two processes resulting in an overall savings. To realize these savings, tools used to build a program and library must identify which sections may be shared and which must not be shared. The shared sections, when grouped together, are commonly referred to as the “text segment” whereas the non-shared (grouped) sections are commonly referred to as the “data segment”. The text segment is read-only and is usually comprised of executable code and read-only data. The data segment must be writable and it is this fact which makes it non-sharable.

Systems which utilize disjoint address spaces for their processes are free to group the text and data segments in such a way that they may always be loaded at fixed positions relative to each other. I.e., for a given load object, the offset from the start of the text segment to the start of the data segment is constant. This property greatly simplifies the design of the shared library machinery.

The design of the shared library mechanism described in this document does not (and cannot) have this property. Due to the fact that all processes share a common address space, the text and data segments will be placed at arbitrary locations relative to each other and will therefore need a mechanism whereby executable code will always be able to find its corresponding data. One of the CPU's registers is typically dedicated to hold the base address of the data segment. This register will be called the “FDPIC register” in this document. Such a register is sometimes used in systems with disjoint address spaces too, but this is for efficiency rather than necessity.

The fact that the locations of the text and data segments are at non-constant offsets with respect to each other also complicates function pointer representation. As noted above, executable code must be able to find its corresponding data segment. When making an indirect function call, it is therefore important that both the address of the function and the base address of the data segment are available. This means that a function pointer needs to be represented as the address of a “function descriptor” which contains the address of the actual code to execute as well as the corresponding data (FDPIC register) address.

FDPIC Register

The FDPIC register is used as a base register for accessing the global offset table (GOT) and function descriptors. Since both code and data are relocatable, executable code may not contain any instruction sequences which directly encode a pointer's value. Instead, pointers to global data are indirectly referenced via the global offset table. At load time, pointers contained in the global offset table are relocated by the dynamic linker to point at the correct locations.

Neither the Blackfin ELF ABI nor the FLAT ABI specifies a PIC or FDPIC register. The FDPIC ABI uses the Blackfin's P3 register for this purpose.

Upon entry to a function, the caller saved register P3 is the FDPIC register. As described above, it contains the GOT address for that function. P3 obtains its value in one of three ways:

  1. By being inherited from the calling function in the case of a direct call to a function within the same load module.
  2. By being set either in a PLT entry or in inlined PLT code.
  3. By being set from a function descriptor as part of an indirect call.

The specifics associated with each of these cases are covered in greater detail in “Procedure Linkage Table (PLT)” and “Function Calls”, below.

The prologue code of a non-leaf function should save P3 either on the stack or in one of the callee-saved registers. After each function call, P3 must be restored if it is needed later on in the function. Direct calls to functions in the same load module and direct calls which are routed through a PLT entry require that P3 be restored. Calls which use inlined PLT code and indirect calls may be able to avoid using P3; such calls will need to use some other register in which the GOT address has been saved, however. A leaf function makes no calls and need not save P3.

Note that once a function has moved P3 to one of its callee-saved registers, the function is then free to use that register as the FDPIC register for accessing data. This is why the sections describing relocations are careful to specify FDPIC-relative references instead of P3-relative references.

The location of the data segment must be chosen in such a way so that the GOT address (i.e, FDPIC register value) has double word (64-bit) alignment. Note: This makes it possible to load the resolver's descriptor stored in the dynamic linker reserve area (see below) with a single double word load instruction. People familiar with the Blackfin architecture might realize that there is no atomic 64-bit load instruction, and indeed they would be correct. See the Lazy Procedure Linkage section for details on when this matters.

Also, it's envisioned (though not mandated) that the GOT entries are located at positive FDPIC-based offsets and that function descriptors are found at negative offsets to FDPIC.

P3 Considerations

P3, a caller saved register, plays a role in effecting transfer of control for some function calls. A PLT entry (or inlined PLT code) loads a function descriptor into P1 and P3. After that, P1 will contain the code address to which control should be transferred. (P3 will contain the GOT address.) The address loaded into P1 will either be the entry point of the function itself or the address of the lazy PLT fragment corresponding to the function to call. See “Lazy Procedure Linkage” below. In either case, the PLT entry (or inlined PLT code) will branch to the address contained in P1.

Note: Upon entry to a function, P1 should not be relied upon to contain the entry point address of the function. It is possible that the function was called directly, i.e, via a call instruction. Also, after (lazy) resolution, there's no requirement for the resolver to set P1 in this manner.

Function Descriptors

A number of programs assume that pointers to functions are as wide as pointers to data, even though programming languages don't require this. However, two words are needed to represent a function pointer meaningfully: not only is the function's entry point required, but also some context information that enables the function to find the corresponding data segment in the current process. Such context information is given in the form of a pointer to the GOT in FDPIC (which is P3 upon entry to a function).

In order to keep pointers to functions as 32-bit values, while adding context information to them, we introduce function descriptors, such that, when the address of a function is taken, the address of its descriptor is obtained. As shown below, the descriptor contains pointers to both the function's entry point and its GOT. A load module will also likely contain a number of private function descriptors which are used in conjunction with a corresponding PLT entry (or inlined PLT code) for calling a function.

A function descriptor consists of two 4-byte words:

  1. The “entry point” at offset 0 contains the text address of the function. This is the address at which to start executing the function.
  2. The “GOT address” at offset 4 contains the value to which the FDPIC register must be set when executing the function.

Each direct function call requiring a PLT entry (or which uses inlined PLT code) requires a function descriptor stored in the data segment.

Each private function descriptor needs to be initialized using a 64-bit relocation which fills in both the function entry point and GOT address. The R_BFIN_FUNCDESC_VALUE relocation is used for this purpose.
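
For readers who think in C, the two-word layout can be written down as a struct. This is a sketch for illustration only: the type and function names are invented, and fixed-width integers are used so that the layout has the target's sizes on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Target-side layout of a function descriptor. */
struct bfin_funcdesc {
    uint32_t entry_point;  /* offset 0: text address of the function    */
    uint32_t got_address;  /* offset 4: FDPIC register value to install */
};

/* What a "function pointer" holds under FDPIC: the descriptor's address,
 * which keeps pointers to functions at 32 bits. */
typedef uint32_t bfin_func_ptr;

/* Returns 1 if the host compiler lays the struct out as described. */
static int funcdesc_layout_ok(void)
{
    return sizeof(struct bfin_funcdesc) == 8
        && offsetof(struct bfin_funcdesc, entry_point) == 0
        && offsetof(struct bfin_funcdesc, got_address) == 4
        && sizeof(bfin_func_ptr) == 4;
}
```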

Function Addresses

When a function address is required, the address of an “official” (or canonical) function descriptor is used. Descriptors corresponding to static, non-overridable functions are allocated by the link editor and are initialized at load time via the R_BFIN_FUNCDESC_VALUE relocation. The dynamic linker is responsible for allocating and initializing all other “official” function descriptors.

As described above, a function's address is actually the address of a function descriptor, not that of the function's entry point. As is the case with other kinds of pointers, executable code obtains the values of pointer constants via the global offset table. The R_BFIN_FUNCDESC relocation (see below) is used in global offset table entries and initialized data to obtain the addresses of function descriptors used for representing function addresses.

This document borrows many of the concepts and terminology related to function addresses and their descriptors from the IA-64 System V ABI 3)4).

Procedure Linkage Table (PLT)

In order to make direct calls to a function external to a given load module, the CALL instruction's target is a PLT entry. (Calls to internal, but overridable functions also need PLT entries.) The PLT entry contains instructions for fetching the function's start address and global pointer value from a function descriptor associated with the function in question. The function descriptor will be located at a fixed offset from the address specified by the FDPIC register. The instructions in a PLT entry look like this:

foo@plt:    P1 = [P3 + foo@FUNCDESC_GOT17M4];
            P3 = [P3 + foo@FUNCDESC_GOT17M4 + 4];
            JUMP (P1);

The PLT entries load the address of the function's entry point into P1 and the new GOT address into P3.

In order to accomplish “lazy dynamic linking” (see below), P1 must be set to the entry point address found in the function descriptor.

Since PLT entries are so short, the compiler may choose to inline them directly into the call site. The resulting code should be faster, because a branch instruction is eliminated and because it may be possible to move the loads earlier in the instruction stream. However, calling functions within the same translation unit may often be done with a single call instruction, so it's not always advantageous to do the inlining.

Dynamic Linker Reserve Area

The linker reserves three 32-bit words starting at the location pointed to by the FDPIC register for use by the dynamic linker. The first two words comprise a function descriptor for invoking the resolver used in lazy dynamic linking. The third (at P3 + 8) is used by the dynamic linker and the debugger to obtain access to information regarding the loaded module and the amount that each segment has been relocated by.

Lazy Procedure Linkage

Lazy procedure linkage requires an additional PLT fragment for each dynamic function that requires a local descriptor in the module. These entries are not large, but their aggregate will increase the size of the text segment. For this reason, the use of lazy dynamic linking is optional. (Implementation of lazy dynamic linking in the dynamic linker is mandatory, however.)

A lazy PLT fragment looks like this:

                        .word   funcdesc_value_reloc_offset(foo)
        foo@lazy_plt:   JUMP.S     resolverStub

The code for resolverStub looks like this:

        resolverStub:   P2 = [P3];
                        R3 = [P3 + 4];
                        JUMP (P2);

The link editor adds as many resolverStub fragments as necessary to ensure that the branch in each lazy PLT fragment is within range.

It is also possible to inline the resolverStub instructions as follows:

                        .word   funcdesc_value_reloc_offset(foo)
        foo@lazy_plt:   P2 = [P3];
                        R3 = [P3 + 4];
                        JUMP (P2);

Lazy PLT fragments have 16-bit alignment for space reasons.

Function descriptors residing in the GOT segment are initialized so that the entry point is that of the corresponding lazy PLT entry address. The function descriptor's GOT address is initialized to the GOT address for the load module itself. These initializations occur as the result of the dynamic linker performing R_BFIN_FUNCDESC_VALUE relocations (located in the .rel.plt section) at load time.

Thus a function call to an unresolved function will go through the lazy PLT fragment for that function as a result of picking up the lazy PLT entry point from the function descriptor. The lazy PLT fragment immediately branches to resolverStub, a special PLT entry which uses the dynamic linker reserve area (see above) to cause execution to be transferred to the actual resolver without disturbing either P1 or P3.

Upon entry to the actual (lazy) resolver, the following register values are important:

  • P2 -- the address of the resolver itself
  • R3 -- the GOT address (FDPIC value) for the resolver's GOT
  • P1 -- the address of the lazy PLT entry being resolved
  • P3 -- the GOT address for the caller's GOT

The resolver must take care not to modify the argument registers or the callee-saved registers, or if it does, to restore them to their original state when it's done.

The resolver uses the word at P1 - 4 (that is [P1 + -4], but note that it is only 16-bit aligned) which is an offset to a R_BFIN_FUNCDESC_VALUE relocation. This offset is relative to the value (address) associated with the DT_JMPREL tag in the dynamic section. (Tags related to DT_JMPREL are DT_PLTRELSZ and DT_PLTREL. The value associated with DT_PLTRELSZ provides the size of this section. The value associated with DT_PLTREL must be set to DT_REL indicating that Elf32_Rel structs are used to hold the relocation information.) The R_BFIN_FUNCDESC_VALUE relocation provides the offset to the function descriptor to update and the symbol table index of the function to resolve.
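
Because the word at P1 - 4 is only 16-bit aligned, a resolver written in C cannot simply dereference a 32-bit pointer. The sketch below assembles the little-endian word byte by byte and turns it into an index into the DT_JMPREL table, using the 8-byte size of an Elf32_Rel record. The function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ELF32_REL_SIZE 8u  /* sizeof(Elf32_Rel): r_offset plus r_info */

/* Read the 32-bit little-endian word stored just before a lazy PLT entry.
 * The word is only 16-bit aligned, so it is built from single bytes rather
 * than loaded through a uint32_t pointer. */
static uint32_t lazy_reloc_offset(const uint8_t *lazy_plt_entry)
{
    const uint8_t *p = lazy_plt_entry - 4;
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Index of the matching Elf32_Rel record within the DT_JMPREL table. */
static uint32_t lazy_reloc_index(const uint8_t *lazy_plt_entry)
{
    return lazy_reloc_offset(lazy_plt_entry) / ELF32_REL_SIZE;
}
```

Fed with the first two fragments of the example PLT further down (inlined words 0x10 and 0x20), this yields relocation indices 2 and 4.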

Assuming the resolver completes successfully, it will perform the following actions prior to transferring control to the entry point of the resolved function:

  1. Fill in the function descriptor in the caller's GOT so that the entry point and GOT address are correct for the next call of the resolved function.
  2. Set P3 to the GOT address of the resolvee's GOT.

Currently, there is a race condition between both words getting written and some other thread attempting to read them. The Blackfin does not have an atomic 64 bit load/store instruction that could be used to prevent it; it is recommended that threaded FD-PIC applications run with the LD_BIND_NOW environment variable set.
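
The overall flow of lazy binding can be modelled on a host with ordinary C function pointers. This is a deliberately simplified sketch, not Blackfin code: lazy_foo stands in for both the lazy PLT fragment and the resolver, call_foo stands in for the PLT entry, and all names are invented.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*entry_fn)(int);

struct funcdesc {
    entry_fn    entry;  /* code address (the lazy fragment until resolved) */
    const void *got;    /* GOT address of the module that owns entry       */
};

static int caller_got, callee_got;  /* stand-ins for two modules' GOTs    */
static struct funcdesc foo_desc;    /* private descriptor in caller's GOT */
static int resolve_count;

static int real_foo(int x) { return x + 1; }

/* Plays the role of foo@lazy_plt plus the resolver: patch the descriptor,
 * then continue into the resolved function. */
static int lazy_foo(int x)
{
    resolve_count++;
    foo_desc.entry = real_foo;   /* two separate stores: not atomic */
    foo_desc.got = &callee_got;
    return real_foo(x);
}

/* Load-time state: entry is the lazy fragment, GOT is the module's own. */
static void load_time_init(void)
{
    foo_desc.entry = lazy_foo;
    foo_desc.got = &caller_got;
    resolve_count = 0;
}

/* Plays the role of foo@plt: fetch the entry from the descriptor and jump. */
static int call_foo(int x) { return foo_desc.entry(x); }
```

Only the first call goes through the resolver; later calls reach real_foo directly. The two stores in lazy_foo are where the race described above sits.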

Example

Here is an example analysis of a small PLT (5 entries with lazy support):

00000588 <.plt>:
Lazy PLT
 588:   10 00           RTS;     These 4 bytes are disassembled as random garbage
 58a:   00 00           NOP;     because they're inlined data -- not instructions
 58c:   0c 20           JUMP.S 0x5a4 <resolver>;
Lazy PLT
 58e:   20 00           IDLE;    inlined relocation offset
 590:   00 00           NOP;
 592:   09 20           JUMP.S 0x5a4 <resolver>;
Lazy PLT
 594:   18 00           ILLEGAL  inlined relocation offset
 596:   00 00           NOP;
 598:   06 20           JUMP.S 0x5a4 <resolver>;
Lazy PLT
 59a:   00 00           NOP;     inlined relocation offset
 59c:   00 00           NOP;
 59e:   03 20           JUMP.S 0x5a4 <resolver>;
Lazy PLT -- no explicit jump needed here as we fall through to the resolver
 5a0:   08 00           ILLEGAL  inlined relocation offset
 5a2:   00 00           NOP;
jump to ldso resolver
 5a4:   5a 91           P2 = [P3];
 5a6:   5b a0           R3 = [P3 + 0x4];
 5a8:   52 00           JUMP (P2);
 5aa:   00 00           NOP;
 5ac:   00 00           NOP;
 5ae:   00 00           NOP;
Symbol PLT
 5b0:   19 e5 fe ff     P1 = [P3 + -0x8];  Load real function address
 5b4:   1b e5 ff ff     P3 = [P3 + -0x4];  Load GOT address
 5b8:   51 00           JUMP (P1);
Symbol PLT
 5ba:   19 e5 fc ff     P1 = [P3 + -0x10];
 5be:   1b e5 fd ff     P3 = [P3 + -0xc];
 5c2:   51 00           JUMP (P1);
Symbol PLT
 5c4:   19 e5 fa ff     P1 = [P3 + -0x18];
 5c8:   1b e5 fb ff     P3 = [P3 + -0x14];
 5cc:   51 00           JUMP (P1);
Symbol PLT
 5ce:   19 e5 f8 ff     P1 = [P3 + -0x20];
 5d2:   1b e5 f9 ff     P3 = [P3 + -0x1c];
 5d6:   51 00           JUMP (P1);
Symbol PLT
 5d8:   19 e5 f6 ff     P1 = [P3 + -0x28];
 5dc:   1b e5 f7 ff     P3 = [P3 + -0x24];
 5e0:   51 00           JUMP (P1);
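
The offsets and jump targets in the listing can be reproduced from the raw bytes. The decoders below are inferred purely from the bytes shown above, not from the instruction set reference, so treat the field widths as assumptions; the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Symbol PLT loads: the trailing 16-bit field (little-endian in the dump)
 * behaves as a signed count of 32-bit words. */
static int32_t plt_load_offset(uint16_t field)
{
    int32_t words = (field & 0x8000u) ? (int32_t)field - 0x10000
                                      : (int32_t)field;
    return words * 4;
}

/* JUMP.S: the low 12 bits behave as a signed count of 16-bit halfwords
 * relative to the address of the jump itself. */
static uint32_t jump_s_target(uint32_t pc, uint16_t opcode)
{
    int32_t halfwords = opcode & 0x0FFF;
    if (halfwords & 0x0800)
        halfwords -= 0x1000;
    return pc + (uint32_t)(halfwords * 2);
}
```

For instance, the field fe ff (0xfffe) decodes to -8, and the opcode 0c 20 (0x200c) at 0x58c decodes to a target of 0x5a4, matching the disassembly.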

Function Calls

Direct function calls are performed as follows:

	/* set up arguments as mandated by Blackfin Elf ABI */
	call foo;
	/* restore any needed caller-saved registers */

The call foo pseudo-instruction will either transfer control directly to foo's entry point or will transfer control to foo's PLT entry if one is needed.

Since PLT entries reference P3, a function must ensure that P3 is set correctly prior to making a function call.

Inlined PLT code may be able to make use of the FDPIC value stored in another register - thus avoiding the need for setting P3. A direct call with an inlined PLT entry looks like this:

	/* set up arguments as mandated by Blackfin Elf ABI */
	P1 = [fdpic + foo@FUNCDESC_GOT17M4];
	CALL (P1);
	/* restore any needed caller-saved registers */

In the sequence above, fdpic refers to either P3 or some other register containing the GOT address for the current load module.

Indirect calls are performed by loading the entry point and GOT address from the function descriptor into P1 and P3, respectively. Control is transferred via a CALL instruction to the function's entry point. The call site for an indirect function call might look like this:

	/* set up arguments as mandated by Blackfin Elf ABI */
	P1 = [fdpic];
	P3 = [fdpic + 4];
	CALL (P1);
	/* restore any needed caller-saved registers */
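
Why the descriptor has to carry the GOT address can be shown with a host-side model in which one piece of shared code serves two modules with separate data. Everything here is illustrative: the FDPIC register is modelled as an explicit parameter, and the type and function names are invented.

```c
#include <assert.h>

struct got { int *counter; };  /* one data address slot per module */
typedef int (*entry_fn)(const struct got *fdpic);

struct funcdesc {
    entry_fn          entry;  /* first word: code address                   */
    const struct got *got;    /* second word: value for the FDPIC register  */
};

/* Shared "text": it reaches its data only through the FDPIC value. */
static int bump(const struct got *fdpic) { return ++*fdpic->counter; }

/* Indirect call: load both words of the descriptor, then transfer control. */
static int call_indirect(const struct funcdesc *fp)
{
    entry_fn p1 = fp->entry;         /* entry point word, as loaded into P1 */
    const struct got *p3 = fp->got;  /* GOT address word, as loaded into P3 */
    return p1(p3);
}
```

Two descriptors with the same entry point but different GOT addresses update different counters, which is exactly the situation a bare code address could not express.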

Global Data and the Global Offset Table (GOT)

As noted earlier, position independent code must not contain any instruction sequences which directly encode a reference to global data. If they did so, load time relocations would be necessary to adjust these addresses. Also, any reference to an address in a non-shared segment would force the executable segment in question to be non-sharable.

The global offset table (GOT) contains words which hold the addresses of global data. In order to access these global data, position independent code must first use an FDPIC-relative load instruction to fetch the data address from the GOT. The data structure is then accessed as necessary using the address obtained from the GOT. It is envisioned that the various GOT related structures might look something like this:

                +-----------------------+ <--------------------\
                |          .            |                      |
                           .                                   |
                |          .            |                      |
                +-----------------------+                      |
                |                       |                      |
                +-    Func Descr #2    -+                      |
                |                       |                      |
                +-----------------------+                      |
                |                       |                      |
                +-    Func Descr #1    -+                      |
                |                       |                      |
                +-----------------------+ <---\                |
   FDPIC -----> |                       |     |                |
                +- Resolver Descriptor -+   Dynamic Linker     |
                |                       |   Reserve Area       |
                +-----------------------+     |                |
                |   link_map pointer    |     |                |
                +-----------------------+ <---/             Global
                | Global Data Addr #1   |                   Offset
                +-----------------------+                   Table
                | Global Data Addr #2   |                   (GOT)
                +-----------------------+                      |
                | Global Data Addr #3   |                      |
                +-----------------------+                      |
                |          .            |                      |
                           .                                   |
                |          .            |                      |
                +-----------------------+ <--------------------/

The link-editor is responsible for determining the precise layout of the GOT. The only hard requirements are the following:

  1. FDPIC must point at the first word of the dynamic linker reserve area.
  2. The dynamic linker reserve area needs to start on a 32-bit aligned word.
  3. Each function descriptor must be 32-bit aligned.
  4. The global offset table must reside in a non-shared segment.

In the picture above, function descriptors are placed at negative offsets relative to P3 and the GOT data address entries are placed at positive offsets relative to P3. The link editor is free to place either the function descriptors at positive offsets (subject to alignment constraints) or the data address entries at negative offsets. It may wish to do so in order to maximize the number of instructions which access the GOT via 16-bit offsets, or via 32-bit offsets once the 16-bit offset slots are used up. Also, note that there is no requirement that the function descriptors or data address entries have any particular grouping.

GOT initialization is performed at load time by the dynamic linker. In order to accomplish these initializations, the dynamic linker uses R_FRV_32 relocations that have been placed in the object file by the link editor. R_FRV_32 relocations may cause addresses of other global data in other load modules to be resolved or the relocation may refer to data within the same load module. See the description of R_FRV_32 in “New Relocations” below. (For function descriptors, the R_BFIN_FUNCDESC_VALUE relocation is used. This relocation is described in greater detail below.)

Each load module has a symbol _GLOBAL_OFFSET_TABLE_ which resolves to the GOT address for that load module. The DT_PLTGOT dynamic section entry in each load module contains the GOT address also.
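
The envisioned layout can be modelled as an array of 32-bit words with FDPIC pointing into the middle of it. This is a sketch only: struct got_model and got_word are invented names, and the placement of descriptors below FDPIC is the envisioned arrangement from the picture, not a requirement.

```c
#include <assert.h>
#include <stdint.h>

/* A module's GOT segment modelled as an array of 32-bit words. fdpic is
 * the index of the first word of the dynamic linker reserve area;
 * descriptors sit below it and data address slots above it. */
struct got_model {
    const uint32_t *words;
    int             fdpic;
};

/* Fetch the word at a byte offset from FDPIC (a multiple of 4). */
static uint32_t got_word(const struct got_model *g, int byte_off)
{
    assert(g && byte_off % 4 == 0);
    return g->words[g->fdpic + byte_off / 4];
}
```

In this model offset 0 and offset 4 hold the resolver descriptor, offset 8 holds the third reserve word, offset 12 is the first data address slot, and offsets -8 and -4 hold the first function descriptor.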

Computing the address of a data object can be done in several different ways. The simplest one is:

        sethi   #gothi(bar), gr#
        setlo   #gotlo(bar), gr#
        ld      @(gr15, gr#), gr#

or, for -fpic:

        ldi     @(gr15, #got12(bar)), gr#

If data symbol bar is known to be local to the translation unit, or to have internal, hidden or protected (but not global) visibility, different sequences can be used that assume the symbol to be located at a fixed offset within the text or data segments. If the symbol is known to be in the .data section, the following sequence computes the address of bar:

        sethi   #gotoffhi(bar), gr#
        setlo   #gotofflo(bar), gr#
        add     gr15, gr#, gr#

If the symbol is known to be in the .rodata section (that is mapped to the text segment), the following sequence has to be used instead:

        sethi   #gprelhi(bar), gr#
        setlo   #gprello(bar), gr#
        add     gr16?, gr#, gr#

gr16 (or any other register) must have been previously initialized with the gprel base address, as described in the GR16/GR17 Usage section.

The possibility of using gotoff12 or gprel12 is not affected by -fpic, since -fpic causes the GOT section to be assumed small, but not offsets from the GOT to other arbitrary sections. If bar is known to be mapped to a small data section, however, narrower offsets using gotoff12 or gprel12 relocations, can be used.

However, since there are no guarantees about _GLOBAL_OFFSET_TABLE_ or _gp being close enough to small data sections, a reasonable approach in some cases is to initialize a base register with the address of some local variable, then use this base register plus the offset between the base variable and other local variables defined in the same translation unit to reference other such variables throughout the function. For example, if gr18 is initialized in the beginning of a function or before a loop with the address of such a base variable, one can then use an instruction such as:

	ldi	@(gr18, other_var - base_var), gr#

to access other_var. This only works for symbols that are both defined in the same section in the same translation unit, and known to be non-overridable.

Taking the address of a function can be accomplished with the following sequences:

        sethi   #gotfuncdeschi(foo), gr#
        setlo   #gotfuncdesclo(foo), gr#
        ld      @(gr15, gr#), gr#

or, in case it can be assumed that the GOT is smaller:

        ldi     @(gr15, #gotfuncdesc12(foo)), gr#

If the function is local to a translation unit, or is known to have internal or hidden (but not protected or global) visibility, the canonical function descriptor of the function will be in the module, so it is possible to avoid the need for a GOT entry containing the address of the function descriptor, by using code sequences like:

        sethi   #gotofffuncdeschi(foo), gr#
        setlo   #gotofffuncdesclo(foo), gr#
        add     gr15, gr#, gr#

or, for -fpic:

        addi    gr15, #gotofffuncdesc12(foo), gr#

A global-scope variable initialized with a pointer to a function causes code like this to be generated:

bar:    .picptr #funcdesc(foo)

Variables initialized with pointers (to data or code) must not be assigned to read-only segments.

New Relocations

The following are new relocation types for supporting position independent code.

Name                        Value  Meaning
R_BFIN_GOT17M4              0x14   Used with immediate instructions for FDPIC-relative references to GOT entries
R_BFIN_GOTHI                0x15   Used with sethi for FDPIC-relative references to GOT entries
R_BFIN_GOTLO                0x16   Used with setlo for FDPIC-relative references to GOT entries
R_BFIN_FUNCDESC             0x17   Used to obtain the address of an “official” function descriptor
R_BFIN_FUNCDESC_GOT17M4     0x18   Used with immediate instructions for FDPIC-relative references to GOT entries containing the address of an “official” function descriptor
R_BFIN_FUNCDESC_GOTHI       0x19   Used with sethi for FDPIC-relative references to GOT entries containing the address of an “official” function descriptor
R_BFIN_FUNCDESC_GOTLO       0x1a   Used with setlo for FDPIC-relative references to GOT entries containing the address of an “official” function descriptor
R_BFIN_FUNCDESC_VALUE       0x1b   Used to fill in function entry point and GOT address in private function descriptors
R_BFIN_FUNCDESC_GOTOFF17M4  0x1c   Used with immediate instructions for FDPIC-relative references to private function descriptors, i.e., those used by inlined PLT code
R_BFIN_FUNCDESC_GOTOFFHI    0x1d   Used with sethi for FDPIC-relative references to private function descriptors
R_BFIN_FUNCDESC_GOTOFFLO    0x1e   Used with setlo for FDPIC-relative references to private function descriptors
R_BFIN_GOTOFF17M4           0x1f   Used with immediate instructions for FDPIC-relative references to small data
R_BFIN_GOTOFFHI             0x20   Used with sethi for FDPIC-relative references to small data
R_BFIN_GOTOFFLO             0x21   Used with setlo for FDPIC-relative references to small data

The dynamic loader needs to adjust or “fix up” portions of the data segment because it is dynamically located. The various dynamic relocation entries tell the dynamic loader how to do this. The text segment is dynamically located too, but it is read-only and must not have any relocation entries associated with it.

Dynamic relocations have the following types: R_FRV_32, R_BFIN_FUNCDESC, and R_BFIN_FUNCDESC_VALUE. The precise interpretation given to these relocation types by the dynamic linker is described in the following paragraphs.

R_FRV_32

The R_FRV_32 relocation is used to initialize pointer values in the global offset table and in initialized data. The r_offset field in the Elf32_Rel relocation struct contains the location to which the relocation should be applied. The r_info field encodes a symbol table index (as well as the R_FRV_32 relocation type).

When the symbol table index refers to a section (in which case the symbol type is STT_SECTION), the relocation value is computed by adding the base address of that section to the offset stored in the relocation location.

Otherwise, the symbol table index refers to a symbol which is defined in some other load module. The symbol's address is determined and is added to the addend at the location given by r_offset.
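As a rough illustration of the two cases above, both reduce to adding a run-time address to the word already stored at the relocated location. The sketch below is a minimal model, not the dynamic linker's code: Rel32, REL32_SYM, apply_rel32 and the sym_addr table are hypothetical names, target memory is modelled as an array of 32-bit words, and r_offset is assumed to have been translated to a run-time address already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as an ELF32 REL entry; the type name is local to this sketch. */
typedef struct {
    uint32_t r_offset;  /* location to relocate (assumed already a run-time address) */
    uint32_t r_info;    /* symbol index in the upper 24 bits, type in the low 8 */
} Rel32;

/* Standard ELF32 split of r_info. */
#define REL32_SYM(info) ((info) >> 8)

/*
 * mem holds target words, mem_base is the run-time address of mem[0].
 * sym_addr[i] is a hypothetical pre-resolved table: for an STT_SECTION
 * symbol it holds the run-time base of the section, otherwise the
 * run-time address of the symbol found in another load module.
 */
void apply_rel32(uint32_t *mem, uint32_t mem_base, size_t nwords,
                 const Rel32 *rel, const uint32_t *sym_addr)
{
    size_t idx = (rel->r_offset - mem_base) / 4;
    assert(idx < nwords);
    /* The stored word is the section offset or the addend; add the address. */
    mem[idx] += sym_addr[REL32_SYM(rel->r_info)];
}
```

In this model the relocation type in the low byte of r_info is ignored; a real loader would dispatch on it first.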

R_BFIN_FUNCDESC

The R_BFIN_FUNCDESC relocation is used to obtain the address of an “official” function descriptor from the dynamic linker. The r_offset field contains the location (offset) of the word which must receive this address. The r_info field contains an encoding of the symbol table index corresponding to the function to resolve. The dynamic linker resolves the function and determines the address of the corresponding official descriptor, allocating and initializing it as necessary. (It is the dynamic linker's responsibility to allocate and initialize all official descriptors.) The address of the official descriptor is written to the location specified by r_offset.

This relocation is always expected to reference symbols for which the dynamic linker is expected to create an “official descriptor”. References to descriptors which are allocated and initialized by the link editor are handled via the R_FRV_32 relocation.

R_BFIN_FUNCDESC_VALUE

The R_BFIN_FUNCDESC_VALUE relocation is used to initialize both words of a function descriptor. The r_offset member (in an Elf32_Rel struct) specifies the location of the descriptor to initialize. The r_info member encodes both the number associated with the R_BFIN_FUNCDESC_VALUE type and a symbol table index.

Support for lazy binding is accomplished by R_BFIN_FUNCDESC_VALUE relocations residing in the .rel.plt section. The symbol index encoded in r_info corresponds to the symbol to resolve. In the descriptor itself, the link editor sets the low word to the address of the lazy PLT entry which, when executed, will ultimately resolve the symbol. The high word is set to the index of the segment containing the lazy PLT code. Relocations in .rel.plt are potentially processed twice, once at load time to fix up the offset so that the function descriptor really points at the lazy PLT entry, and possibly later on, as a result of the code in the lazy PLT entry being run, forcing actual binding to be done.

The environment variable LD_BIND_NOW may be set to a non-null value to force binding to occur at load time. When LD_BIND_NOW is used for this purpose, the descriptor's contents are ignored, and the relocations are only processed once.

R_BFIN_FUNCDESC_VALUE relocations found outside of .rel.plt are used either for non-lazy binding support (forced at compile/link time) or for static function descriptor initializations. These cases will be considered separately.

Relocations used for resolving external functions (in a non-lazy manner) have the symbol index encoded in r_info set to correspond to the symbol to resolve. The descriptor contents are irrelevant and are ignored. The function corresponding to the symbol index is resolved and the entry point and GOT address for that function are written to the descriptor.

The R_BFIN_FUNCDESC_VALUE relocation is also used to initialize function descriptors used as addresses for static, non-overridable functions. When used for this purpose, the r_info member encodes the symbol table index for the section in which the function is found. The low word of the descriptor contains the offset to the function and the high word contains the segment index.

The segment index can be used to speed up the computation of the address of the symbol, if the dynamic linker maintains internally an array that maps a segment number to the offset by which it was relocated. Such a map is not required, though, and the dynamic linker is free to ignore segment index information.

Assembler pseudo-functions

Below is a list of additional pseudo-functions for writing assembly code:

Name                Corresponding relocation
got17m4             R_BFIN_GOT17M4
gotlo               R_BFIN_GOTLO
gothi               R_BFIN_GOTHI
gotfuncdesc17m4     R_BFIN_FUNCDESC_GOT17M4
gotfuncdeschi       R_BFIN_FUNCDESC_GOTHI
gotfuncdesclo       R_BFIN_FUNCDESC_GOTLO
funcdesc            R_BFIN_FUNCDESC
gotofffuncdesc17m4  R_BFIN_FUNCDESC_GOTOFF17M4
gotofffuncdeschi    R_BFIN_FUNCDESC_GOTOFFHI
gotofffuncdesclo    R_BFIN_FUNCDESC_GOTOFFLO
gotoff17m4          R_BFIN_GOTOFF17M4
gotoffhi            R_BFIN_GOTOFFHI
gotofflo            R_BFIN_GOTOFFLO

ELF Header

The Blackfin processor specific flag for the e_flags field in the ELF header which indicates the use of the Blackfin shared library ABI is EF_BFIN_FDPIC.

When both EF_BFIN_FDPIC and EF_BFIN_PIC are set, it means each segment of the binary can be loaded at an arbitrary address, which means sharing of text segments is possible. If EF_BFIN_FDPIC is set but EF_BFIN_PIC is clear, all segments must be relocated by the same amount. The linker should warn and clear EF_BFIN_PIC when linking FDPIC binaries if it finds any inter-segment relocation, and set it otherwise. Examples of inter-segment relocations are a GPREL relocation referencing a symbol that is not in the text segment, or a GOTOFF relocation referencing a symbol that is not in the data segment.

file: uClibc/include/elf.h
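A loader-side reading of the two e_flags bits described in this section might look like the following sketch. The numeric values given for EF_BFIN_PIC and EF_BFIN_FDPIC are assumptions and should be checked against the toolchain's elf.h; the enum and the function name are hypothetical.

```c
#include <stdint.h>

/* Assumed values, guarded in case a system header already provides them. */
#ifndef EF_BFIN_PIC
#define EF_BFIN_PIC   0x00000001
#endif
#ifndef EF_BFIN_FDPIC
#define EF_BFIN_FDPIC 0x00000002
#endif

/* Hypothetical classification of how a binary's segments may be placed. */
enum load_model {
    LOAD_NOT_FDPIC,          /* EF_BFIN_FDPIC clear: not the shared library ABI */
    LOAD_FDPIC_AS_UNIT,      /* FDPIC only: all segments move by the same amount */
    LOAD_FDPIC_PER_SEGMENT   /* FDPIC and PIC: each segment may be placed freely */
};

enum load_model classify_e_flags(uint32_t e_flags)
{
    if (!(e_flags & EF_BFIN_FDPIC))
        return LOAD_NOT_FDPIC;
    if (e_flags & EF_BFIN_PIC)
        return LOAD_FDPIC_PER_SEGMENT;
    return LOAD_FDPIC_AS_UNIT;
}
```

Only the per-segment case allows text segments to be shared between processes, which is why the linker clears EF_BFIN_PIC when it finds an inter-segment relocation.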


Start up

At the program's entry point, the stack pointer must be set to an address close to the end of the stack segment. The size of the stack segment is specified by the PT_GNU_STACK program header, and is derived from the value of the symbol __stacksize, which can be defined to an absolute value when linking a program. The default stack size is 128 KB. Starting at the address pointed to by sp, the program should be able to find its arguments, environment variables, auxiliary vector table and load maps. Here's what the stack looks like:

  sp:		argc
  sp+4:		argv[0]
  ...
  sp+4*argc:	argv[argc-1]
  sp+4+4*argc:	NULL
  sp+8+4*argc:	envp[0]
  ...
  sp+8+4*(argc+envc):	NULL

  sp+12+4*(argc+envc):   AuxVT[0].type
  sp+16+4*(argc+envc):   AuxVT[0].value
  ...
                         AT_NULL
                         0

The NULL terminator of envp is immediately followed by the Auxiliary Vector Table. Each entry is a pair of words, the first being an entry type, the second being either an integer value or a pointer. An entry type of value zero (AT_NULL) marks the end of the auxiliary vector.
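The layout can be walked word by word without knowing envc in advance. The following is a minimal sketch that assumes 32-bit words and treats the stack as a plain array; struct startup_view, parse_initial_stack and find_auxv are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical summary of what start-up code finds above sp. */
struct startup_view {
    uint32_t        argc;
    const uint32_t *argv;  /* argc pointers followed by a NULL word */
    const uint32_t *envp;  /* envc pointers followed by a NULL word */
    uint32_t        envc;
    const uint32_t *auxv;  /* (type, value) pairs ending with type 0 (AT_NULL) */
};

struct startup_view parse_initial_stack(const uint32_t *sp)
{
    struct startup_view v;
    v.argc = sp[0];
    v.argv = sp + 1;
    v.envp = v.argv + v.argc + 1;       /* skip argv's NULL terminator */
    v.envc = 0;
    while (v.envp[v.envc] != 0)         /* count entries up to envp's NULL */
        v.envc++;
    v.auxv = v.envp + v.envc + 1;       /* table starts right after that NULL */
    return v;
}

/* Returns 1 and stores the value if an entry of the given type exists. */
int find_auxv(const uint32_t *auxv, uint32_t type, uint32_t *value)
{
    for (; auxv[0] != 0; auxv += 2) {
        if (auxv[0] == type) {
            *value = auxv[1];
            return 1;
        }
    }
    return 0;
}
```

The load maps are deliberately not located this way, since they are not guaranteed to follow the auxiliary vector.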

Load maps will often, but not necessarily, follow the auxiliary vector. They use the following data structure:

file: uClibc/libc/sysdeps/linux/bfin/bits/elf-fdpic.h
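For orientation, a load map can be pictured as a small header followed by one record per mapped PT_LOAD segment. The layout below is an assumption modelled on the FDPIC definitions used by the Linux kernel, not a copy of the header named above, so the field names and widths should be verified against the real file; the two helper functions are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: one record per PT_LOAD segment. */
struct elf32_fdpic_loadseg {
    uint32_t addr;     /* run-time address the segment was mapped at */
    uint32_t p_vaddr;  /* link-time address recorded in the program header */
    uint32_t p_memsz;  /* size of the segment in memory */
};

/* Assumed layout of the map that P0 (or P1) points to. */
struct elf32_fdpic_loadmap {
    uint16_t version;                   /* protocol version of the map */
    uint16_t nsegs;                     /* number of entries in segs[] */
    struct elf32_fdpic_loadseg segs[];  /* flexible array of segments */
};

/* Translate a link-time address to a run-time address; 0 means unmapped. */
uint32_t loadseg_translate(const struct elf32_fdpic_loadseg *segs,
                           unsigned nsegs, uint32_t vaddr)
{
    for (unsigned i = 0; i < nsegs; i++) {
        /* Unsigned wrap-around makes addresses below p_vaddr fail the test. */
        uint32_t off = vaddr - segs[i].p_vaddr;
        if (off < segs[i].p_memsz)
            return segs[i].addr + off;
    }
    return 0;
}

uint32_t loadmap_translate(const struct elf32_fdpic_loadmap *map, uint32_t vaddr)
{
    return loadseg_translate(map->segs, map->nsegs, vaddr);
}
```

Because each segment carries its own run-time address, text and data can be displaced by different amounts, which is the property the FDPIC model relies on.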


At program start-up, register P0 should hold a pointer to a struct elf32_fdpic_loadmap that describes where the kernel mapped each of the PT_LOAD segments of the executable. At start-up of an interpreter for another program (e.g., ld.so), P1 will be set to the load map of the interpreter, and P2 will be set to a pointer to the PT_DYNAMIC section of the interpreter, if it was mapped as part of any loadable segment, or 0 otherwise. In the absence of an interpreter, P1 will be 0, and P2 will be the main program's PT_DYNAMIC address. All other callee-saved registers are supposed to be initialized to 0 by the kernel before it transfers control to userland, but applications shouldn't rely on this (except for R7, see below) since future extensions of the ABI may assign other meanings to these registers. Caller-saved registers have indeterminate value.

Reg  Value
P0   Pointer to executable's elf32_fdpic_loadmap
P1   Pointer to interpreter's elf32_fdpic_loadmap
P2   Pointer to PT_DYNAMIC address (interpreter if dynamic or executable if static)
R7   Kernel sets to 0, interpreter sets to fini function

Both static and dynamic executables are responsible for self-relocating and initializing the PIC register. Self-relocation is accomplished by adjusting, according to the load map pointed to by P0, every pointer in the range [__ROFIXUP_LIST__,__ROFIXUP_END__-4). The addresses of __ROFIXUP_LIST__ and __ROFIXUP_END__ can be computed by means of GP/PC-relative addressing, since they are known to be in the text segment, as in the code below:

file: uClibc/libc/sysdeps/linux/bfin/crt1.S


Note that the pointers in the .rofixup section are created by the linker; FDPIC object files should not contain .rofixup sections. The linker emits rofixup entries in static or dynamic executables that are not linked with -pie wherever it would emit a dynamic relocation in PIEs or dynamic libraries.

The linker also emits, as the last entry of the .rofixup section, the value of the _GLOBAL_OFFSET_TABLE_ symbol. The code that performs self-relocation should not dereference this last entry to relocate its contents; instead, it should simply compute the relocated value of the entry itself, thus obtaining the PIC register value without using any non-PIC or inter-segment relocation, that would force the executable to relocate as a unit.
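The self-relocation loop can be sketched in C as follows. This is a simplified model under stated assumptions rather than the start-up code itself: target memory is an array of 32-bit words indexed by run-time address divided by 4, struct seg is a hypothetical stand-in for a load map entry, and list and end are the run-time addresses of __ROFIXUP_LIST__ and __ROFIXUP_END__.

```c
#include <stdint.h>

/* Hypothetical stand-in for one load map entry. */
struct seg {
    uint32_t addr;     /* run-time address of the segment */
    uint32_t p_vaddr;  /* link-time address of the segment */
    uint32_t p_memsz;  /* size in memory */
};

/* Map a link-time address to its run-time address (unchanged if unmapped). */
uint32_t reloc_addr(const struct seg *segs, unsigned nsegs, uint32_t vaddr)
{
    for (unsigned i = 0; i < nsegs; i++) {
        uint32_t off = vaddr - segs[i].p_vaddr;
        if (off < segs[i].p_memsz)
            return segs[i].addr + off;
    }
    return vaddr;
}

/*
 * Adjust every pointer named by the entries in [list, end - 4) and
 * return the relocated value of the last entry, which is the value
 * the PIC register should receive.
 */
uint32_t self_relocate(uint32_t *mem, const struct seg *segs, unsigned nsegs,
                       uint32_t list, uint32_t end)
{
    uint32_t p;
    for (p = list; p < end - 4; p += 4) {
        /* Each entry holds the link-time address of a pointer to fix. */
        uint32_t where = reloc_addr(segs, nsegs, mem[p / 4]);
        mem[where / 4] = reloc_addr(segs, nsegs, mem[where / 4]);
    }
    /* Last entry: relocate the entry's own value, do not dereference it. */
    return reloc_addr(segs, nsegs, mem[p / 4]);
}
```

Note that the final entry is only translated, never followed, which matches the requirement that the GOT address be obtained without touching the location it names.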

In case a dynamic loader is used, it may set R7 to the address of a function descriptor that represents a function to be called at program termination time. The dynamic loader, however, must not depend on this function being called for proper termination.

The dynamic loader may change the stack pointer such that it is not aligned to a double-word boundary, but rather to a single-word boundary. It is recommended that every program's start up code adjusts the stack pointer after obtaining the program arguments from the top of the stack.

Chunks of code inserted in .init and .fini sections (_init and _fini functions, respectively) must not assume P3 to hold the value of the PIC register. _init and _fini prologues are expected to save the initial gr15 at @(fp,4), and the initial lr at @(fp,8).

Debugger Support - Overview

Debugger support is substantially different from what is normally done on a Linux system with a MMU for the following reasons:

  1. The usual method for finding the dynamic linker data structures won't work since the text and data areas for the main program itself are dynamically located. Normally, the debugger is able to find the address of the executable's sections by looking in the executable itself. This, in turn, allows the debugger to find the dynamic section in which it looks for the value of the DT_DEBUG tag. The DT_DEBUG value provides the debugger with the address of the r_debug struct which, in turn, provides access to the necessary relocation information for shared objects. But, since none of this will work, an alternate method must be found for locating the dynamic linker data structures.
  2. The debugger must relocate different sections by different amounts due to the fact that the text and data areas (and perhaps other sections too) are relocated independently. The dynamic linker's debug interface must allow the debugger to find out how much each section has been relocated by.
  3. It must be possible for the debugger to attach to a process at an arbitrary point of its execution.
  4. Text areas are truly shared among processes which means there must be some sort of kernel level support for breakpoints.

Debugger Support - Locating the Dynamic Linker's Data Structures

In a given process, for all possible values of FDPIC (which is in P3 at function entry time), the word at FDPIC+8 - which is in the dynamic linker reserve area - contains a pointer to the dynamic linker's data structures. This means that each data area for a shared library or the main executable in a given process contains a pointer to dynamic linker data structures describing the various load objects and their relocations.

Unfortunately, P3 may not keep its value throughout the execution of a function. It may be overwritten and used for any other computation. If it's needed again, it can be copied to another register or to a stack slot. It might be possible for the debugger to locate the PIC value at such alternate locations by using call-frame debug information, but to do so, it would need the PC value as in the executable, not the relocated PC value in the memory location the kernel chose to map the text segment of the executable, or of any of the shared libraries it may have been linked with.

To enable a debugger to find where an executable is located in memory, the initial load maps that the kernel passes to the program in P0 and P1 are made available with ptrace calls, as described below:

file: arch/blackfin/include/asm/ptrace.h

struct elf32_fdpic_loadmap *x;
ptrace(PTRACE_GETFDPIC, pid, PTRACE_GETFDPIC_EXEC /* or _INTERP */, &x, NULL);

With these maps plus the executable (and/or interpreter) symbol table, the debugger can locate the program's GOT in memory, and thus obtain the link_map doubly-linked list (see below), from which it can obtain the loadmaps of all loaded modules.

Obtaining r_debug requires the dynamic loader's link map and symbol tables only, to locate the _dl_debug_addr symbol defined in the dynamic loader. If there is no dynamic loader, or if it hasn't got to the point at which it sets up the main program's GOT reserve area, r_debug won't be available.

Debugger Support - Data structures

The word at FDPIC+8 (which is typically P3+8) is a pointer to a struct of the following form:

file: uClibc/include/link.h


Where l_addr's type definition is:

file: uClibc/libc/sysdeps/linux/bfin/bits/elf-fdpic.h


(struct elf32_fdpic_loadaddr is the type of field dlpi_addr in struct dl_phdr_info as well)

_dl_debug_addr (a global symbol defined in the dynamic loader) is a pointer to the following type:

file: uClibc/include/link.h


The version number for this protocol will be 1.

Debugger Support - Finding GOT Addresses

The field got_value in the link_map struct provides the debugger with the GOT address for all functions in the load module described by that link_map entry.

Debugger Support - Finding "Official" Function Descriptor Addresses

We might want to add some means for the debugger to obtain a function descriptor for a function at a certain address, like _dl_funcdesc_for(void *entry_point, void *got_value), that is defined in the dynamic loader but is static.

However, since the debugger has to make do without it for static executables, it can probably make do without it for dynamic executables as well. For global functions, it could look for dynamic R_BFIN_FUNCDESC relocations pointing to the function's symbol when it needs the same pointer that the application would use. For local functions, R_BFIN_FUNCDESC_VALUEs within the GOT of the module that defines the function would do. If it can't find a function descriptor, it has to allocate memory and initialize it with a descriptor.

There is a risk that a dlopen()ed module may trigger the creation of a canonical function descriptor for a function that previously didn't need one, in which case the debugger will have created a different function descriptor for the function and they won't compare equal. This is the only case in which _dl_funcdesc_for would come in handy. But is any of this worth all the complexity and duplication of functionality?

Debugger Support - Breakpoint Considerations

Debugger applications implement software breakpoints by causing a trap instruction to be written at the address at which a breakpoint is desired. (The debugger will first fetch the contents of the location under consideration so that it may be restored when the breakpoint is removed.)
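The save, patch and restore cycle can be modelled as below. This is a sketch under assumptions: the text segment is represented as a writable byte buffer instead of being accessed through ptrace, and the two-byte trap encoding is a placeholder, not a statement about the instruction a Blackfin debugger actually uses. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAP_LEN 2

/* Placeholder trap encoding, assumed for illustration only. */
const uint8_t trap_insn[TRAP_LEN] = {0xa1, 0x00};

struct breakpoint {
    size_t  addr;             /* offset of the patched instruction in text */
    uint8_t saved[TRAP_LEN];  /* original bytes, kept for removal */
    int     inserted;
};

/* Save the original bytes, then overwrite them with the trap. */
void bp_insert(uint8_t *text, struct breakpoint *bp)
{
    if (bp->inserted)
        return;
    memcpy(bp->saved, text + bp->addr, TRAP_LEN);
    memcpy(text + bp->addr, trap_insn, TRAP_LEN);
    bp->inserted = 1;
}

/* Put the saved bytes back when the breakpoint is removed. */
void bp_remove(uint8_t *text, struct breakpoint *bp)
{
    if (!bp->inserted)
        return;
    memcpy(text + bp->addr, bp->saved, TRAP_LEN);
    bp->inserted = 0;
}
```

The write in bp_insert is the step that requires the text of the debugged process to live in writable, unshared memory.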

In order to implement software breakpoints, the text sections for the process being debugged must reside in writable memory. It is okay for the text section of non-debugged processes to reside in read-only memory, but some provision must be made to run a process being debugged in read/write memory. Furthermore, this determination must be made at the time the process is started. (Trying to migrate a running process from read-only to read/write memory would involve attempting to fix text section pointers on the stack, which is an impossible task without type information about each stack slot.)

The solution we suggest the kernel implement on non-MMU systems is the following: when a process that is being ptrace()d calls exec(), the kernel will not share the text segment of the newly exec()ed program, nor those of an interpreter it might require. Also, the mmap() system call will not share text segments used by libraries of such a process, which it would normally do in response to the presence of MAP_EXECUTABLE and MAP_DENYWRITE in the flags passed to mmap().

This arrangement will not make processes that the debugger attaches to after they are mapped in look like they have independent sets of breakpoints; they may just crash instead if they reach a breakpoint instruction set with ptrace for another process. Enabling independent breakpoints in this case would require the kernel to monitor breakpoint installation with POKETEXT and arrange for such changes to code sections to be visible only while the affected process is running. This was regarded as a sufficiently uncommon case that we decided not to penalize every context switch with the additional checks that would have been needed to implement this solution.

It remains as an optional feature of the kernel, but it is no longer mandated by the ABI.

Blackfin Elf ABI vs. Blackfin Shared Library ABI Differences

The Blackfin shared library ABI uses the same parameter passing conventions established by the Blackfin Elf ABI, but it is a different ABI due to the following differences:

  • The representation of function pointers is different. In the Blackfin Elf ABI, a function pointer is merely the address of the function in question. In the Blackfin shared library ABI, a function pointer is the address of a descriptor containing the function's entry point and GOT address.
  • The Blackfin Elf ABI assumes that any text and data segment load time relocations will cause both segments to be relocated by the same amount. The Blackfin shared library ABI assumes that these segments will be relocated by different amounts.
  • Calling conventions are different (even though parameter passing conventions are the same). The Blackfin shared library ABI requires that P3 be set to the GOT address upon function entry. The Blackfin Elf ABI has no such requirement.
  • The mechanisms used for accessing global data are different (and incompatible) between the Blackfin Elf ABI and the Blackfin shared library ABI.
  • The numbers associated with some of the relocation types differ between the ABIs.
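The first and third differences can be pictured with a small C model of a function descriptor. This is an illustrative sketch only: struct funcdesc, pic_register (standing in for P3) and the helper names are hypothetical, and a real call sequence is emitted by the compiler, not written by hand.

```c
#include <stdint.h>

/* Hypothetical model of a descriptor: entry point plus the module's GOT address. */
struct funcdesc {
    int  (*entry_point)(int);
    void  *got_value;
};

/* Stands in for the P3 register in this sketch. */
void *pic_register;

/* Under the shared library ABI, a call through a function pointer
 * loads the callee's GOT address before jumping to its entry point. */
int call_via_descriptor(const struct funcdesc *fd, int arg)
{
    pic_register = fd->got_value;
    return fd->entry_point(arg);
}

/* A pretend module: its GOT and one function with a descriptor. */
int module_got[4];
int add_one(int x) { return x + 1; }
const struct funcdesc add_one_desc = { add_one, module_got };
```

Since a function pointer is the address of a descriptor, two descriptors for the same function would compare unequal as pointers, which is why a single “official” descriptor per function is needed whenever the address is taken.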

Differences with other compilers

The GCC compiler is not run-time or link-time compatible with any other compiler for the Blackfin processor. This includes, but is not limited to, VDSP++.

Although the ABI for Blackfin (ELF and FLAT) is common across most compilers, there are subtle differences.

Just one example (there are others, some known and some not):
In the C and C++ language standards, floating-point literal constants default to the double data type. When operations involve both float and double, the float operands are promoted to double and the operation is done at double size. GCC follows the standard while VDSP++ does not. By having double default to a 32-bit data type, the VDSP++ compiler can avoid additional expense during these promotions. This does not conform to the C and C++ standards, which require that the double type support at least 10 digits of precision. The optional VDSP++ switch -double-size-64 sets the size of the double type to 64 bits if the additional precision, or full standard conformance, is required.
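A few compile-time facts make the difference concrete. The helpers below are a small sketch with hypothetical names; each returns 1 on a compiler that follows the standard with a 64-bit double, and the last two would return 0 on a compiler configured with 32-bit doubles.

```c
#include <float.h>

/* An unsuffixed floating-point literal has type double. */
int literal_is_double(void)
{
    return sizeof(0.1) == sizeof(double);
}

/* Mixing float with a double literal yields a double result,
 * while mixing float with a float literal stays float. */
int mixed_expression_is_double(void)
{
    float f = 0.1f;
    return sizeof(f * 2.0) == sizeof(double) &&
           sizeof(f * 2.0f) == sizeof(float);
}

/* The C standard requires DBL_DIG to be at least 10. */
int double_meets_standard_precision(void)
{
    return DBL_DIG >= 10;
}

/* With 32-bit doubles, double is no wider than float. */
int double_is_wider_than_float(void)
{
    return sizeof(double) > sizeof(float);
}
```

Code that is shared between the two toolchains can use checks like these to catch silent changes in precision, for example by suffixing literals with f wherever single precision is intended.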

The ABI for Blackfin (FDPIC) has large differences, and cannot be thought of as equivalent to any other ABI.

1) some Blackfin compilers use 32-bit doubles
3) “IA-64 Software Conventions and Runtime Architecture Guide”, Intel, 2000, pp. 8-1 thru 8-4.
4) “Unix System V Application Binary Interface” (for IA-64), Intel, 2000, pp. 5-4 thru 5-9.