The binutils portion of the GNU toolchain includes a simulation environment referred to simply as the Simulator. It is often also called the GDB simulator because of its tight integration with GDB.
Different architectures have very different feature sets, so here we'll focus on the Blackfin feature set. Currently, it supports:
The performance of the simulator depends on what type of Blackfin instructions you are running and how fast your host processor is. For example, the following run was measured on a 1.83 GHz Intel processor. Unfortunately, the simulator is single-threaded, so it cannot take advantage of multi-core hosts.
$ bfin-elf-run -v --env user ./user/mp3play/mp3play.gdb -w out.raw in.mp3
run ./user/mp3play/mp3play.gdb
bfin-sim: unimplemented syscall 97
in.mp3: MPEG1-III (121238 ms)
bfin-sim: unimplemented syscall 78
bfin-sim: unimplemented syscall 78
Simulator Execution Speed
Total instructions: 2,583,456,559
Total execution time : 929.75 seconds
Simulator speed: 2,778,657 insns/second
On this host and with this application, the simulator runs at roughly 3 MIPS (about 2.78 million instructions per second).
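The reported speed is simply the instruction count divided by the wall-clock time. A minimal Python sketch of that arithmetic, using the figures from the run above (the variable names are illustrative, not part of the simulator):

```python
# Figures copied from the "Simulator Execution Speed" summary above.
total_instructions = 2_583_456_559
execution_seconds = 929.75

# Instructions per second, and the same value in millions (MIPS).
insns_per_second = total_instructions / execution_seconds
mips = insns_per_second / 1_000_000

print(f"{insns_per_second:,.0f} insns/second = {mips:.2f} MIPS")
# prints "2,778,657 insns/second = 2.78 MIPS"
```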
The simulator has different environment modes, selected with the simulator's --environment option (the examples here use the short form --env). You should also pick a CPU model via the simulator's --model option.
The simulator may be used directly via the run program, or via the GDB sim target.
Options to the simulator must come before the program. Any options after the program are interpreted as arguments to the program itself. Running something like the following will execute the Blackfin program ls with the --help option. It will not pass the --help option to the bfin-elf-run program itself!
$ bfin-elf-run ./ls --help
This example will pass --help to the bfin-elf-run program and not the Blackfin ls program.
$ bfin-elf-run --help ./ls
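When the simulator is driven from a script, the ordering rule above is easy to get wrong. The following is a small Python sketch, not part of the toolchain, that always places simulator options before the program and program arguments after it. The helper name `build_run_argv` and its parameters are hypothetical.

```python
def build_run_argv(program, program_args=(), sim_options=(), runner="bfin-elf-run"):
    """Return an argv list with simulator options placed before the program.

    Everything after `program` is seen by the simulated program, not by the
    simulator, which mirrors the ordering rule described above.
    """
    return [runner, *sim_options, program, *program_args]


# --help is delivered to the Blackfin ls program.
to_program = build_run_argv("./ls", program_args=["--help"])
# --help is delivered to bfin-elf-run itself.
to_simulator = build_run_argv("./ls", sim_options=["--help"])

print(to_program)
print(to_simulator)
```

If the toolchain is installed, a list built this way could be handed to `subprocess.run` unchanged.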
Only basic Linux userspace support exists in the simulator. Enough syscalls are supported to do basic I/O (stdin/out) and file I/O, but don't expect too much. For FLAT binaries, you must run the statically linked FLAT .gdb ELF file (rather than the raw FLAT file itself). Static FDPIC ELF binaries work fine, but dynamic FDPIC ELF binaries require you to pass the --sysroot option (set it to the bfin-linux-uclibc/runtime/ subdir in the toolchain).
For example, the flat application mp3play:
$ bfin-elf-run --env user ./user/mp3play/mp3play.gdb -w output.raw input.mp3
run ./user/mp3play/mp3play.gdb
bfin-sim: unimplemented syscall 97
/nfs_export/mp3/fpm-calm-river.mp3: MPEG1-III (121238 ms)
bfin-sim: unimplemented syscall 78
bfin-sim: unimplemented syscall 78
$
The files output.raw and input.mp3 will be in the current host directory.
To run a simple Blackfin ELF in Virtual mode:
$ bfin-elf-run ./some-blackfin-elf
To run a Blackfin ELF of an operating system (like U-Boot or Linux) in Operating mode:
$ bfin-elf-run --env operating --model bf537 ./u-boot
U-Boot 2009.11.1-00099-g1c038a0-dirty (ADI-2010R1-pre) (Apr 02 2010 - 15:05:34)
CPU: ADSP bf537-0.2 (Detected Rev: 0.0) (bypass boot)
Board: ADI BF537 stamp board
Support: http://blackfin.uclinux.org/
Clock: VCO: 0.250 MHz, Core: 0.250 MHz, System: 0.050 MHz
RAM: 64 MB
Using default environment
In: serial
Out: serial
Err: serial
KGDB: [on serial] ready
Hit any key to stop autoboot: 0
## Error: "ramboot" not defined
bfin> bdinfo
bdinfo
U-Boot = U-Boot 2009.11.1-00099-g1c038a0-dirty (ADI-2010R1-pre) (Apr 02 2010 - 15:05:34)
CPU = bf537-0.2
Board = bf537-stamp
VCO = 0.250 MHz
CCLK = 0.250 MHz
SCLK = 0.050 MHz
boot_params = 0x00000000
memstart = 0x00000000
memsize = 0x04000000
flashstart = 0x00000000
flashsize = 0x00000000
flashoffset = 0x00000000
ethaddr = (not set)
ip_addr = 03f1ffa8
baudrate = 57600 bps
bfin>
The sim target with GDB is like any other remote GDB target. You give GDB an ELF to debug, connect to the “remote” simulator, load up the ELF, and then execute things.
The sim target takes the same options as the run binary. The format is target sim [options].
To load a simple Blackfin ELF in Virtual mode in GDB and run it:
$ bfin-elf-gdb ./lsetup.s.x
(gdb) target sim
Connected to the simulator.
(gdb) load
Loading section .text, size 0x17c lma 0x0
Loading section .data, size 0x22c lma 0x117c
Start address 0xa8
Transfer rate: 7488 bits in <1 sec.
(gdb) break *_start
Breakpoint 1 at 0xa8: file lsetup.s, line 8.
(gdb) disassemble _start
Dump of assembler code for function _start:
0x000000a8 <_start+0>: R0 = 0x123 (X);
0x000000ac <_start+4>: P0 = R0;
0x000000ae <_start+6>: LSETUP(0x0xb2 <_start+10>, 0x0xb2 <_start+10>) LC0 = P0;
0x000000b2 <_start+10>: R0 += -0x1;
0x000000b4 <_start+12>: R1 = 0x0 (X);
0x000000b6 <_start+14>: CC = R1 == R0;
0x000000b8 <_start+16>: IF CC JUMP 0x0xbe <_start+22>;
0x000000ba <_start+18>: CALL 0x0x52 <_fail>;
0x000000be <_start+22>: P0 = 0xa (X);
(gdb) run
Starting program: svn/toolchain/trunk/binutils-2.17/sim/testsuite/sim/bfin/lsetup.s.x
Breakpoint 1, _start () at lsetup.s:8
8 R0 = 0x123;
Current language: auto; currently asm
(gdb) stepi
_start () at lsetup.s:9
9 P0 = R0;
(gdb) stepi
_start () at lsetup.s:10
10 LSETUP (.L1, .L1) LC0 = P0;
(gdb) regs
R0: 00000123 291 P0: 00000123 RETS: 00000000 LC0: 00000000 0
R1: 00000000 0 P1: 00000000 RETI: 00000000 LT0: 00000000
R2: 00000000 0 P2: 00000000 RETX: 00000000 LB0: 00000000
R3: 00000000 0 P3: 00000000 RETE: 00000000 LC1: 00000000 0
R4: 00000000 0 P4: 00000000 RETN: 00000000 LT1: 00000000
R5: 00000000 0 P5: 00000000 ASTAT: 00000000 LB1: 00000000
R6: 00000000 0 SP: 01000000 CC: 00000000
R7: 00000000 0 USP: 00000000 CYC1: 00000000 SEQSTAT: 00000001
PC: 000000ae FP: 00000000 CYC2: 00000000 SYSCFG: 00000030
(gdb) stepi
_start () at lsetup.s:12
12 R0 += -1;
(gdb) regs
R0: 00000123 291 P0: 00000123 RETS: 00000000 LC0: 00000123 291
R1: 00000000 0 P1: 00000000 RETI: 00000000 LT0: 000000b2
R2: 00000000 0 P2: 00000000 RETX: 00000000 LB0: 000000b2
R3: 00000000 0 P3: 00000000 RETE: 00000000 LC1: 00000000 0
R4: 00000000 0 P4: 00000000 RETN: 00000000 LT1: 00000000
R5: 00000000 0 P5: 00000000 ASTAT: 00000000 LB1: 00000000
R6: 00000000 0 SP: 01000000 CC: 00000000
R7: 00000000 0 USP: 00000000 CYC1: 00000000 SEQSTAT: 00000001
PC: 000000b2 FP: 00000000 CYC2: 00000000 SYSCFG: 00000030
(gdb) continue
Continuing.
pass
Program exited normally.
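To see why the session ends with "pass", it can help to walk through the disassembled instructions by hand. The Python sketch below mimics the first few instructions of `_start`. It is only an illustration, and it assumes the hardware loop set up by LSETUP runs its one-instruction body LC0 times. Flags, ASTAT, and everything after the compare are ignored.

```python
# Plain Python variables standing in for the registers shown by "regs".
r0 = 0x123          # R0 = 0x123 (X);
p0 = r0             # P0 = R0;
lc0 = p0            # LSETUP (.L1, .L1) LC0 = P0;  loop body is the single insn at 0xb2

iterations = 0
while lc0 > 0:
    r0 += -1        # R0 += -0x1;
    lc0 -= 1        # assumed: the loop counter drops by one per pass
    iterations += 1

r1 = 0              # R1 = 0x0 (X);
cc = (r1 == r0)     # CC = R1 == R0;

# IF CC JUMP skips the CALL to _fail, so a true CC means the test passes.
result = "pass" if cc else "fail"
print(iterations, hex(r0), result)
# prints "291 0x0 pass"
```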
While the majority of the core should be handled, the different Blackfin peripherals may vary greatly in terms of functional completeness. While IRQ routing should be fully working from peripherals through the DMAC, to the SIC, and up to the CEC, not all peripherals simulate interrupt generation yet.
| Peripheral | % Complete | Notes |
|---|---|---|
| acm | Nonexistent | |
| atapi | 10 | Reads/writes to MMR range not checked |
| can | Nonexistent | |
| cec | 100 | Simulation isn't exactly complete, but expected EVT0 behavior is unknown (since it needs an ICE hooked up) |
| ctimer | 100 | Hardware & HRM don't seem to match; everything the hardware does should work |
| dma | 90 | Striding isn't supported |
| dmac | 70 | Only primary peripheral works when multiple are muxed to one channel |
| ebiu_amc | 70 | Basic bank control works, and can be mapped to flash (CFI) or raw memory; extended flash MMRs are stubs |
| ebiu_ddrc | 20 | Stubs only; status bits clear/stay cleared; external memory fixed at start |
| ebiu_sdc | 20 | Stubs only; status bits clear/stay cleared; external memory fixed at start |
| emac | 50 | Simple sending/receiving works; MDIO works |
| eppi | 50 | Output only; no error detection; data visualized via SDL |
| gpio/pmux | 10 | Reads/writes to MMR range not checked |
| gptimer | 20 | Stubs only; status bits clear/stay cleared; unified control MMRs not checked |
| kpad | Nonexistent | |
| lockbox | Nonexistent | |
| mmu | 90 | ITEST not implemented; no cache simulation; everything else works |
| musb | 10 | Reads/writes to MMR range not checked |
| mxvr | Nonexistent | |
| nfc | 20 | Stubs only; status bits clear/stay cleared |
| otp | 70 | Reads/writes work; no CRC handling; no Lock support; no interrupts |
| pixc | Nonexistent | |
| pll | 80 | MMRs all function as expected, but no plans to bother simulating SCLK/CCLK domains |
| ppi | 50 | Output only; no error detection; data visualized via SDL |
| pwm | Nonexistent | |
| rsi/sdh | 10 | Reads/writes to MMR range not checked |
| rtc | 30 | Reading time returns host system time; otherwise only basic MMR checks |
| sic | 90 | Wakeup sources not supported, but not sure if we should bother; everything else works |
| spi | 20 | Stubs only; status bits clear/stay cleared |
| sport | 10 | Reads/writes to MMR range not checked |
| twi | 20 | Stubs only; status bits clear/stay cleared |
| trace | 100 | Hardware behavior should be matched; trace buffer extends even further than hardware |
| uart | 80 | Sending/receiving works (DMA+PIO); exotic things (IRDA/CTS/RTS) not simulated |
| uart2 | 80 | Sending/receiving works (DMA+PIO); exotic things (IRDA/CTS/RTS) not simulated |
| cnt | Nonexistent | |
| wdog | 20 | Stubs only; status bits clear/stay cleared |
The Blackfin on-chip MAC driver can be simulated with the help of the TUN/TAP driver under an operating environment simulation. On your host Linux system, you can set up the host side of the network by doing:
$ sudo tunctl -u ${USER} -t tap-gdb
$ sudo ifconfig tap-gdb 10.1.1.1 up
This should provide the tap-gdb ethernet device.
$ sudo ifconfig tap-gdb
tap-gdb Link encap:Ethernet HWaddr f2:f4:c6:fb:0b:4c
inet addr:10.1.1.1 Bcast:10.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::f0f4:c6ff:fefb:b4c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:28 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Now the host side has an IP of 10.1.1.1. Have the simulated code use an address on the same subnet (10.1.1.2, for instance) and the two sides can talk to each other. For example, this uses the bfin MAC driver in U-Boot to talk to the host Ethernet device.
$ bin/bfin-elf-run --env operating --model bf537 ./u-boot
U-Boot 2010.06-svn2356 (ADI-2010R1-pre) (Jul 16 2010 - 14:31:48)
CPU: ADSP bf537-0.2 (Detected Rev: 0.3) (bypass boot)
Board: ADI BF537 stamp board
Support: http://blackfin.uclinux.org/
Clock: VCO: 500 MHz, Core: 500 MHz, System: 125 MHz
RAM: 64 MiB
Flash: ## Unknown FLASH on Bank 1 - Size = 0x00000000 = 0 MB
0 Bytes
In: serial
Out: serial
Err: serial
KGDB: [on serial] ready
Warning: Generating 'random' MAC address
Net: bfin_mac
Hit any key to stop autoboot: 5
bfin> set ipaddr 10.1.1.2
bfin> set serverip 10.1.1.1
bfin> ping $(serverip)
Using bfin_mac device
host 10.1.1.1 is alive
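Before booting the simulated system, it can be worth checking that the address chosen for the guest really falls inside the host interface's subnet. Here is a minimal sketch using Python's standard `ipaddress` module, with the addresses and netmask taken from the ifconfig and U-Boot output above (the variable names are illustrative):

```python
import ipaddress

# Host side of the tap-gdb device: address and netmask as reported by ifconfig.
host = ipaddress.ip_interface("10.1.1.1/255.0.0.0")
# Address given to the simulated U-Boot via "set ipaddr".
guest = ipaddress.ip_address("10.1.1.2")

# The guest must live in the host's network and must not reuse the host address.
same_subnet = guest in host.network
distinct = guest != host.ip

print(host.network, same_subnet, distinct)
```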
The simulator supports a variety of tracing methods. Not all are implemented, so don't be surprised if you use one and see no output related to it.
$ bfin-elf-run --trace-insn --trace-branch --trace-core --trace-linenum ./simple0.s.x
reg: wrote SP = 0x4000000
reg: wrote SYSCFG = 0x30
reg: wrote PC = 0xa8
core: 0x0000a8 #7 __start -IBUS FETCH 2 bytes @ 0x000000a8: 0x6028
insn: 0x0000a8 #7 __start -R0 = 0x5 (X);
reg: 0x0000a8 #7 __start -wrote R0 = 0x5
reg: 0x0000a8 #7 __start -wrote PC = 0xaa
core: 0x0000aa #8 __start -IBUS FETCH 2 bytes @ 0x000000aa: 0x67f8
insn: 0x0000aa #8 __start -R0 += -0x1;
reg: 0x0000aa #8 __start -wrote ASTAT[an] = 0
reg: 0x0000aa #8 __start -wrote ASTAT[v] = 0
reg: 0x0000aa #8 __start -wrote ASTAT[az] = 0
reg: 0x0000aa #8 __start -wrote ASTAT[ac0] = 1
reg: 0x0000aa #8 __start -wrote R0 = 0x4
reg: 0x0000aa #8 __start -wrote PC = 0xac
core: 0x0000ac #9 __start -IBUS FETCH 2 bytes @ 0x000000ac: 0xf000
core: 0x0000ac #9 __start -IBUS FETCH 2 bytes @ 0x000000ae: 0x0004
insn: 0x0000ac #9 __start -DBGA (R0.L, 0x4);
reg: 0x0000ac #9 __start -wrote PC = 0xb0
core: 0x0000b0 #10 __start -IBUS FETCH 2 bytes @ 0x000000b0: 0xe3ff
core: 0x0000b0 #10 __start -IBUS FETCH 2 bytes @ 0x000000b2: 0xffb8
insn: 0x0000b0 #10 __start -CALL 0xffffff70;
branch: 0x0000b0 #10 __start -CALL to 0x20
reg: 0x0000b0 #10 __start -wrote RETS = 0xb4
reg: 0x0000b0 #10 __start -wrote PC = 0x20
core: 0x000020 #4 __pass -IBUS FETCH 2 bytes @ 0x00000020: 0xe148
core: 0x000020 #4 __pass -IBUS FETCH 2 bytes @ 0x00000022: 0x0000
insn: 0x000020 #4 __pass -P0.H = 0;
reg: 0x000020 #4 __pass -wrote P0 = 0
reg: 0x000020 #4 __pass -wrote PC = 0x24
...
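Trace output grows quickly, so a first pass is often just to count how many lines each trace category produced. The Python sketch below does that for a handful of lines copied from the trace above. The function name and the set of category prefixes are assumptions based on this sample, not a description of every prefix the simulator can emit.

```python
from collections import Counter

# A few lines copied from the trace output above.
sample_trace = """\
reg: wrote PC = 0xa8
core: 0x0000a8 #7 __start -IBUS FETCH 2 bytes @ 0x000000a8: 0x6028
insn: 0x0000a8 #7 __start -R0 = 0x5 (X);
reg: 0x0000a8 #7 __start -wrote R0 = 0x5
reg: 0x0000a8 #7 __start -wrote PC = 0xaa
branch: 0x0000b0 #10 __start -CALL to 0x20
"""

# Category prefixes seen in the sample (assumed, not exhaustive).
KNOWN_CATEGORIES = {"reg", "core", "insn", "branch"}


def count_trace_categories(text):
    """Count trace lines per category, keyed by the text before the first colon."""
    counts = Counter()
    for line in text.splitlines():
        category, sep, _rest = line.partition(":")
        if sep and category in KNOWN_CATEGORIES:
            counts[category] += 1
    return counts


counts = count_trace_categories(sample_trace)
print(dict(counts))
```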
Various profiling options can be given:
$ bfin-elf-run --profile-pc -v --env user ./user/mp3play/mp3play.gdb -w output.raw input.mp3
run ./user/mp3play/mp3play.gdb
bfin-sim: unimplemented syscall 97
input.mp3: MPEG1-III (121238 ms)
bfin-sim: unimplemented syscall 78
bfin-sim: unimplemented syscall 78
Summary profiling results:
Program Counter Statistics:
Total samples: 10,052,360
Granularity: 8,192 bytes per bucket
Size: 16 buckets
Frequency: 257 cycles per sample
Range: 0x0 0x20000
0x00000000: 6,334 0.1:
0x00002000: 1,682,325 16.7: ***************
0x00004000: 901,640 9.0: ********
0x00006000: 4,437,908 44.1: ****************************************
0x00008000: 2,999,642 29.8: ***************************
0x0000a000: 24,440 0.2:
0x0000c000: 63 0.0:
0x0000e000: 8 0.0:
Simulator Execution Speed
Total instructions: 2,583,456,559
Total execution time : 932.65 seconds
Simulator speed: 2,770,017 insns/second
A histogram of the program counter is provided.
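The header of the profiling summary can be cross-checked with a little arithmetic. The Python sketch below reproduces the bucket count from the range and granularity. It also shows that the sample count matches the instruction count divided by the sampling frequency. That second relation is an assumption that happens to fit these numbers, not a documented guarantee.

```python
# Values copied from the profiling summary above.
total_instructions = 2_583_456_559   # "Total instructions"
range_bytes = 0x20000                # upper bound of "Range: 0x0 0x20000"
granularity = 8_192                  # "Granularity: 8,192 bytes per bucket"
frequency = 257                      # "Frequency: 257 cycles per sample"

# Number of histogram buckets needed to cover the range.
buckets = range_bytes // granularity

# Assumption: roughly one sample is taken every `frequency` instructions.
samples = total_instructions // frequency

print(buckets, f"{samples:,}")
# prints "16 10,052,360"
```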
The number of PC samples can be changed with --profile-pc-frequency (the number of cycles to go by before the PC is sampled), and --profile-pc-granularity sets the size of the bins in the histogram. Sampling very often or using very small bins will negatively affect the overall speed of the simulator. For example:
$ bfin-elf-run --profile-pc --profile-pc-granularity 1024 --profile-pc-frequency 1 -v --env user ./user/mp3play/mp3play.gdb -w output.raw in.mp3
run ./user/mp3play/mp3play.gdb
bfin-sim: unimplemented syscall 97
in.mp3: MPEG1-III (121238 ms)
bfin-sim: unimplemented syscall 78
bfin-sim: unimplemented syscall 78
Summary profiling results:
Program Counter Statistics:
Total samples: 2,583,456,559
Granularity: 1,024 bytes per bucket
Size: 70 buckets
Frequency: 1 cycles per sample
Range: 0x0 0x11800
0x00000000: 12,957 0.0:
0x00000400: 232,225 0.0:
0x00000800: 99 0.0:
0x00000c00: 157 0.0:
0x00001000: 92,934 0.0:
0x00001400: 372,446 0.0:
0x00001800: 455,003 0.0:
0x00001c00: 454,504 0.0:
0x00002000: 2,397,115 0.1:
0x00002400: 145,451,550 5.6: ************************
0x00002800: 153,068,129 5.9: *************************
0x00002c00: 129,948 0.0:
0x00003000: 56,338,159 2.2: *********
0x00003800: 70,734,046 2.7: ***********
0x00003c00: 4,273,825 0.2:
0x00004000: 4,433,501 0.2:
0x00004400: 8,708,839 0.3: *
0x00004c00: 3,693,004 0.1:
0x00005000: 2,358,656 0.1:
0x00005400: 207,597,745 8.0: **********************************
0x00005800: 3,080,881 0.1:
0x00005c00: 1,812,873 0.1:
0x00006000: 119,947,811 4.6: ********************
0x00006400: 172,422,432 6.7: ****************************
0x00006800: 136,668,168 5.3: **********************
0x00006c00: 1,670,760 0.1:
0x00007000: 4,740,794 0.2:
0x00007400: 236,056,788 9.1: ***************************************
0x00007800: 239,330,640 9.3: ****************************************
0x00007c00: 229,662,084 8.9: **************************************
0x00008000: 158,777,161 6.1: **************************
0x00008400: 19,785,449 0.8: ***
0x00008800: 22,187,520 0.9: ***
0x00008c00: 157,194,704 6.1: **************************
0x00009000: 161,369,152 6.2: **************************
0x00009400: 148,607,816 5.8: ************************
0x00009800: 103,023,715 4.0: *****************
0x0000a000: 60 0.0:
0x0000a400: 67,023 0.0:
0x0000a800: 66 0.0:
0x0000ac00: 218 0.0:
0x0000b000: 848 0.0:
0x0000b400: 3,665 0.0:
0x0000b800: 152 0.0:
0x0000bc00: 6,222,694 0.2: *
0x0000c000: 334 0.0:
0x0000c800: 3,215 0.0:
0x0000cc00: 11,637 0.0:
0x0000d000: 290 0.0:
0x0000d400: 89 0.0:
0x0000d800: 778 0.0:
0x0000dc00: 710 0.0:
0x0000e000: 204 0.0:
0x0000e400: 119 0.0:
0x0000e800: 767 0.0:
0x0000f000: 38 0.0:
0x0000f400: 43 0.0:
0x00011400: 19 0.0:
Simulator Execution Speed
Total instructions: 2,583,456,559
Total execution time : 1458.86 seconds
Simulator speed: 1,770,873 insns/second
The overhead of sampling the PC on every cycle slows the simulator down to about 1.77 MIPS. However, the hot spots of the code are now much easier to find, and tools like bfin-elf-nm or bfin-elf-objdump can identify the functions in those address ranges that should be examined for performance optimization.
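Picking the hottest buckets out of a long histogram by eye is tedious. The Python sketch below parses a few histogram lines copied from the run above and lists the busiest address ranges. The regular expression and the helper name `hottest_buckets` are assumptions based on the output format shown here, which may differ between simulator versions.

```python
import re

GRANULARITY = 1024  # bytes per bucket, as reported in the run above

# A few histogram lines copied from the profiling output above.
sample_histogram = """\
0x00005400: 207,597,745 8.0: **********************************
0x00007400: 236,056,788 9.1: ***************************************
0x00007800: 239,330,640 9.3: ****************************************
0x00007c00: 229,662,084 8.9: **************************************
0x0000a000: 60 0.0:
"""

# bucket start address, sample count with thousands separators, percentage
LINE_RE = re.compile(r"^(0x[0-9a-fA-F]+):\s+([\d,]+)\s+([\d.]+):")


def hottest_buckets(text, top=3, granularity=GRANULARITY):
    """Return (start, end, samples) for the `top` busiest buckets, busiest first."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start = int(match.group(1), 16)
            count = int(match.group(2).replace(",", ""))
            rows.append((count, start))
    rows.sort(reverse=True)
    return [(start, start + granularity - 1, count) for count, start in rows[:top]]


for start, end, count in hottest_buckets(sample_histogram):
    print(f"{start:#010x}-{end:#010x}: {count:,} samples")
```

The resulting ranges can then be compared against the symbol addresses printed by bfin-elf-nm, or disassembled with bfin-elf-objdump.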