Measure stuff in embedded platforms
The reader will build the following competences in this tutorial:
To assess Flash and static RAM usage at compile time, by analysing memory map files.
To measure Heap and Stack usage at runtime using the technique of memory painting.
To measure execution time using internal and external tools (hardware timers and logic analysers).
To measure energy consumption using oscilloscopes and power profilers.
To correlate execution time and energy consumption data in a granular and step-by-step way, allowing the identification of performance bottlenecks.
To measure message sizes of IoT protocols deployed in embedded systems, using techniques such as logging and packet sniffing.
Overview of the setup and workflow:
Measuring Memory Usage
Flash and RAM (Compile Time)
analyse compiled binary with GNU size and objdump
inspect memory map files
To measure memory at compile time, one can analyse the resulting binary and associated metadata, such as memory map files. Compile-time analysis captures full Flash usage but only the static part of RAM usage, since stack and heap usage are only known at runtime. Its granularity depends on the tool used. Available tools include GNU objdump, size, and nm; the memory map file generated by the linker can also be inspected manually or parsed automatically.
For global granularity, use the GNU size command on your binary. The text and data sections hold code and initialized variables, respectively, and both occupy space in Flash. So in the example below, Flash usage is 15840 + 56 = 15896 bytes. The bss section (which for historical reasons stands for Block Started by Symbol) holds zero-initialized and uninitialized variables, so it does not occupy space in Flash. Since the variables in bss and data are modified at runtime, both occupy space in RAM; in this example the static RAM usage amounts to 1032 + 56 = 1088 bytes.
$ size target/thumbv7em-none-eabihf/debug/lakers-no_std
   text    data     bss     dec     hex filename
  15840      56    1032   16928    4220 target/thumbv7em-none-eabihf/debug/lakers-no_std
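The arithmetic above can be automated. The following is a minimal sketch, assuming the default (Berkeley) output format of GNU size shown above; the function names are illustrative and not part of any tool.

```python
def parse_size_output(output: str) -> dict:
    """Parse the two-line Berkeley-format output of GNU size."""
    header, values = [line.split() for line in output.strip().splitlines()[:2]]
    # Only the first three columns (text, data, bss) are needed here.
    return {name: int(value) for name, value in zip(header[:3], values[:3])}


def memory_usage(sections: dict) -> tuple:
    """Return (Flash bytes, static RAM bytes) from the section sizes."""
    flash = sections["text"] + sections["data"]      # code + initial values
    static_ram = sections["data"] + sections["bss"]  # variables live in RAM
    return flash, static_ram


SIZE_OUTPUT = """\
   text    data     bss     dec     hex filename
  15840      56    1032   16928    4220 target/thumbv7em-none-eabihf/debug/lakers-no_std
"""

flash, static_ram = memory_usage(parse_size_output(SIZE_OUTPUT))
print(f"Flash: {flash} bytes, static RAM: {static_ram} bytes")
# prints "Flash: 15896 bytes, static RAM: 1088 bytes"
```

In practice the string would come from running size in a build script rather than being pasted in.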
The objdump -h <binary file> command gives a slightly more detailed table: it prints all the section headers and their attributes. For example, the .text section contains code, while sections .vector_table, .rodata, and .data contain data. Note that .data has its load address (LMA) in Flash and its runtime address (VMA) in RAM, which is why it counts toward both.
$ objdump -h target/thumbv7em-none-eabihf/release/lakers-no_std
Idx Name Size VMA LMA File off Algn
0 .vector_table 00000400 00000000 00000000 00010000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 0000ceec 00000400 00000400 00010400 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .rodata 000038d0 0000d2f0 0000d2f0 0001d2f0 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .data 00000038 20000000 00010bc0 00030000 2**2
CONTENTS, ALLOC, LOAD, DATA
(...)
Sometimes, we want to measure only the sizes of certain parts of our code. For example, in lakers, we normally want to measure how much memory is needed by the library itself, but want to discard things like the cryptographic backend, since it changes across platforms.
One way to do that is by analysing the memory map file generated by the linker (one might need to enable it by passing a flag such as -Clink-args=-Map=/tmp/lakers.map to rustc, which forwards it to the linker). Different linkers generate slightly different memory map files, but all map files show exactly which symbols are placed in each section of memory, as well as their address and size. In the example below, the .text section begins at address 0x400 (__stext). The first symbol is a reset handler introduced by the cortex-m-rt crate, and the next symbol is the prepare_message_1 function, which starts at address 0x459 (the code is placed at 0x458; the odd address is the Thumb bit set on the symbol) and uses 0x13c bytes. Using a script to parse this file and selecting only the target libraries or functions gives a very granular insight into Flash usage by the program. Similarly, the .data and .bss sections can be analysed to obtain static RAM usage, while .rodata counts toward Flash.
$ cat /tmp/lakers_no-std.map | grep " .text" -A 8
400 400 15a4c 4 .text
400 400 0 1 __stext = .
400 400 58 4 /home/gfedrech/Developer/inria/dev/lakers-FORK/target/thumbv7em-none-eabihf/debug/deps/libcortex_m_rt-ab9dabb33bc95171.rlib(cortex_m_rt-ab9dabb33bc95171.cortex_m_rt.bd536e3d6951dd08-cgu.0.rcgu.o):(.Reset)
400 400 0 1 $t.1
401 401 3e 1 Reset
440 440 0 1 $d.12
458 458 13c 2 /home/gfedrech/Developer/inria/dev/lakers-FORK/target/thumbv7em-none-eabihf/debug/deps/lakers_no_std-3f752946f41f98ae.08piolvegulzp6fpuoijkvkau.rcgu.o:(.text._ZN6lakers28EdhocInitiator$LT$Crypto$GT$17prepare_message_117h45bd752ef830d0b1E)
458 458 0 1 $t.0
459 459 13c 1 lakers::EdhocInitiator$LT$Crypto$GT$::prepare_message_1::h45bd752ef830d0b1
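A parsing script for this listing could look like the sketch below. It assumes the column layout shown above (VMA, LMA, size, alignment, then the entry), and it counts only input-section rows (those ending in object:(section)), because the symbol rows underneath repeat the same bytes. The function name is illustrative, and the sample reuses rows from the listing with the directory part of the paths shortened.

```python
import os
import re

# Map row for an input section: VMA LMA Size Align object:(section)
INPUT_SECTION_ROW = re.compile(
    r"^\s*([0-9a-f]+)\s+([0-9a-f]+)\s+([0-9a-f]+)\s+(\d+)\s+(.+):\((.+)\)\s*$"
)


def flash_bytes_for(map_text: str, object_prefix: str) -> int:
    """Sum input-section sizes whose object file name starts with object_prefix."""
    total = 0
    for line in map_text.splitlines():
        match = INPUT_SECTION_ROW.match(line)
        if match is None:
            continue  # output-section headers and symbol rows are skipped
        size = int(match.group(3), 16)
        object_name = os.path.basename(match.group(5))
        if object_name.startswith(object_prefix):
            total += size
    return total


# Rows taken from the listing above, with shortened paths.
MAP_SAMPLE = """\
     400      400    15a4c     4 .text
     400      400        0     1         __stext = .
     400      400       58     4         deps/libcortex_m_rt-ab9dabb33bc95171.rlib(cortex_m_rt-ab9dabb33bc95171.cortex_m_rt.bd536e3d6951dd08-cgu.0.rcgu.o):(.Reset)
     401      401       3e     1                 Reset
     458      458      13c     2         deps/lakers_no_std-3f752946f41f98ae.08piolvegulzp6fpuoijkvkau.rcgu.o:(.text._ZN6lakers28EdhocInitiator$LT$Crypto$GT$17prepare_message_117h45bd752ef830d0b1E)
     459      459      13c     1                 lakers::EdhocInitiator$LT$Crypto$GT$::prepare_message_1::h45bd752ef830d0b1
"""

print(flash_bytes_for(MAP_SAMPLE, "lakers_no_std"))   # 0x13c = 316
print(flash_bytes_for(MAP_SAMPLE, "libcortex_m_rt"))  # 0x58 = 88
```

Matching on the object file name rather than the full path matters here: the build directory itself contains "lakers", so a plain substring match on the whole row would also count the cortex-m-rt entry.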
Stack and heap (RAM at Runtime)
Stack and heap: memory painting, probe-rs
Measuring RAM at runtime in embedded systems can be challenging due to the lack of an operating system that keeps track of memory usage. One way of circumventing this is the technique of “memory painting”: fill the RAM with a known pattern (e.g. 0xDEADBEEF) before the program executes, let it run, and finally count how many bytes are still intact. The bytes that no longer hold the pattern were used by the stack or the heap.
To fill the memory, we can use a simple loop that writes the pattern to the memory. We need, however, to find what should be the start and stop addresses. There are a few ways to do that:
Look at the target datasheet. For example, in the nRF52840, the SRAM address region goes from 0x20000000 to 0x40000000, of which 256 kB are populated.
Look at or configure the GNU linker script. For example, our application sets RAM : ORIGIN = 0x20000000, LENGTH = 64K, meaning that the RAM begins at 0x20000000 (as per the datasheet) and has a total size of 64 KiB.
Look at the generated memory map file, and find the symbols _stack_start and __sheap. Since the stack grows downward from the top of the RAM (e.g. from 0x20000000 + 64 KiB) and the heap grows upward from __sheap, _stack_start - __sheap is the total size of our allocatable RAM. Note that this excludes sections such as .bss, .data, and .uninit, which we do not want to paint!
We now know that we want to paint the memory from __sheap up to _stack_start.
Before continuing, remember that we want to write to the RAM before our code starts executing, otherwise we risk overwriting the stack that is already in use. One way of doing that is writing the loop in assembly (using only registers); another is doing it in the reset handler or in some pre-initialisation code of your platform. The cortex-m-rt crate provides a pre_init hook that runs before main, which is an ideal place for the stack painting code. The painting code has three steps. First, we obtain the address where the heap starts, using the symbol defined by the linker. Next, since our code is already executing, we do not want to overwrite already allocated stack memory, so we get the current value of the stack pointer (offset by a constant, since we are using the stack while painting it). Finally, we run the loop that writes the pattern to the memory.
Next, we flash and run the program, and after it finishes, we can use a debugger to inspect the memory and learn how much of our pattern was erased. We can use the command probe-rs read <WIDTH> <ADDRESS> <WORDS> and parse its output. For example, when reading the first 2 words after __sheap, we can see our pattern.
Assuming our RAM size is set to 4096 = 0x1000 bytes (as per the memory.x file), we can compute the runtime memory usage as follows:
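A minimal sketch of that computation is given below. It assumes the painted region has already been read back and converted into a list of 32-bit words ordered from __sheap upward; turning the debugger output into that list is left out, since the exact output format depends on the tool version. The function name and the sample dump are made up for illustration.

```python
PATTERN = 0xDEADBEEF  # must match the pattern written by the painting loop
WORD_SIZE = 4         # bytes per 32-bit word


def painted_usage(words, pattern=PATTERN):
    """words: 32-bit values read from __sheap (index 0) up to _stack_start."""
    total = len(words) * WORD_SIZE
    intact = [index for index, word in enumerate(words) if word == pattern]
    used = total - len(intact) * WORD_SIZE
    if not intact:
        # Everything was overwritten: heap and stack may have collided.
        return {"used": used, "heap": None, "stack": None}
    heap = intact[0] * WORD_SIZE                       # bytes below the first intact word
    stack = (len(words) - 1 - intact[-1]) * WORD_SIZE  # bytes above the last intact word
    return {"used": used, "heap": heap, "stack": stack}


# Made-up dump of a 4096-byte region: 2 heap words and 4 stack words were used.
dump = [0x1, 0x2] + [PATTERN] * 1018 + [0x20000FF0] * 4
usage = painted_usage(dump)
print(usage)  # {'used': 24, 'heap': 8, 'stack': 16}
```

The "used" value follows the counting approach described above. The heap and stack values are high-water marks: they stay valid even if the program happens to write the pattern value itself somewhere in the middle of the region.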
Measuring Execution Time
timers, GPIOs connected to logic analysers
There are two main approaches to measuring execution time. In the first, let’s call it internal, we rely on a timer already available on our board, and use it to timestamp the events that we want to measure. While this approach is simple, it has the downsides of consuming some energy and an additional peripheral, as well as requiring some extra code (e.g. an RTC driver). The second approach is external: an instrument such as a logic analyser is connected to one or more GPIOs on the board. Then, when we want to log an event, we toggle those GPIOs on or off, and the external instrument's software records them over time.
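Both approaches end with a small post-processing step, sketched below under stated assumptions. For the internal approach, the sketch assumes a free-running counter of known width and frequency (the example values, a 24-bit counter at 32768 Hz, are typical of a low-frequency RTC). For the external approach, it assumes the logic analyser capture has been exported as time-ordered (timestamp, level) pairs. The function names are illustrative.

```python
def elapsed_us(start_ticks, end_ticks, freq_hz, counter_bits):
    """Elapsed time in microseconds between two reads of a free-running counter."""
    mask = (1 << counter_bits) - 1
    # Masking handles a single counter wrap-around between the two reads.
    ticks = (end_ticks - start_ticks) & mask
    return ticks * 1_000_000 / freq_hz


def pulse_durations(edges):
    """edges: time-ordered (timestamp, level) pairs; one duration per high pulse."""
    durations = []
    rise = None
    for timestamp, level in edges:
        if level == 1 and rise is None:
            rise = timestamp               # GPIO raised: event starts
        elif level == 0 and rise is not None:
            durations.append(timestamp - rise)  # GPIO lowered: event ends
            rise = None
    return durations


# Internal: the counter wrapped between the two reads (32 ticks elapsed).
print(elapsed_us(0xFFFFF0, 0x000010, freq_hz=32768, counter_bits=24))  # 976.5625

# External: made-up capture with timestamps in microseconds.
print(pulse_durations([(1000, 1), (3500, 0), (10000, 1), (10200, 0)]))  # [2500, 200]
```

The wrap-around handling is only correct when the measured interval is shorter than one full counter period.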
Measuring Energy Consumption
multimeters, oscilloscopes, power profilers
To measure energy consumption we connect the device to an external instrument, such as a multimeter, an oscilloscope or a power profiler, that can measure current draw.
Multimeters can be used to read the current consumption at a given point in time, but do not allow automating measurements or taking them over time.
TODO: how to set up the hardware part of measuring with an oscilloscope?
A more sophisticated and accurate way of measuring energy consumption consists in using a power profiler, such as the nRF Power Profiler Kit or the Otii Arc Pro. For this tutorial we selected the latter, since it is designed to be board-agnostic and also allows interacting with GPIO pins.
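Whichever instrument is used, it yields current samples, which still have to be turned into energy. The sketch below assumes evenly spaced samples in amperes and a constant supply voltage, and integrates P = V * I with a simple rectangle rule. The function name and the sample values are made up for illustration.

```python
import math


def energy_joules(currents_a, voltage_v, sample_rate_hz):
    """Rectangle-rule integration of P = V * I over evenly spaced samples."""
    dt = 1.0 / sample_rate_hz  # seconds covered by each sample
    return voltage_v * math.fsum(currents_a) * dt


# Made-up trace: 10 mA drawn for one second, sampled at 1 kHz, 3.0 V supply.
trace = [0.010] * 1000
energy = energy_joules(trace, voltage_v=3.0, sample_rate_hz=1000)
print(f"{energy * 1000:.1f} mJ")  # 30.0 mJ
```

If the supply voltage is not constant, the voltage has to be sampled as well and multiplied sample by sample.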
Correlating Execution Time and Energy Consumption
Syncing execution time and energy consumption: power profiler (otii arc) with gpio
Extra: using Nordic’s Power Profiler Kit 2
There is an unofficial Python API for the Power Profiler Kit 2: IRNAS/ppk2-api-python on GitHub.
Install via pip install git+https://github.com/IRNAS/ppk2-api-python.git
Advanced Measurements for Energy and Time
Step-by-step Granular Time and Energy Consumption
Obtaining step-by-step time and energy consumption: merging results from power profiler and logic analyser
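The merge can be sketched as follows, assuming both instruments share a common time base (for example because the power profiler also records the GPIO line). Step windows come from the GPIO edges, current samples come from the power profiler, and each sample is held until the next one. The function name, the step names, and the data are made up for illustration.

```python
def per_step_energy(steps, samples, voltage_v):
    """
    steps:   list of (name, t_start_s, t_end_s) windows, e.g. from GPIO edges
    samples: time-ordered list of (t_s, current_a) from the power profiler
    Returns {name: (duration_s, energy_j)}.
    """
    results = {}
    for name, t_start, t_end in steps:
        energy = 0.0
        # Each sample's current is held until the next sample's timestamp.
        for (t_a, current_a), (t_b, _) in zip(samples, samples[1:]):
            overlap = min(t_b, t_end) - max(t_a, t_start)
            if overlap > 0:
                energy += voltage_v * current_a * overlap
        results[name] = (t_end - t_start, energy)
    return results


# Made-up data: 2 mA while idle, 10 mA during a one-second step, 3.0 V supply.
samples = [(0.0, 0.002), (0.5, 0.010), (1.0, 0.010), (1.5, 0.002), (2.0, 0.002)]
steps = [("idle", 0.0, 0.5), ("handshake", 0.5, 1.5)]

report = per_step_energy(steps, samples, voltage_v=3.0)
for name, (duration_s, energy_j) in report.items():
    print(f"{name}: {duration_s:.1f} s, {energy_j * 1000:.1f} mJ")
```

Sorting the resulting report by energy or by duration is then enough to point at the step that dominates the budget.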
PPI (Programmable Peripheral Interconnect) / EXTI (External Interrupt) Mapping
Often it is necessary to link an input GPIO directly to an output GPIO at the hardware level, bypassing software delays. Certain microcontrollers offer this kind of operation to reduce latency. A typical scenario: a peripheral device is attached to an MCU and we want to measure the energy consumption during the peripheral interaction. Instead of raising, in software, a GPIO that is connected to the measurement kit (e.g. the PPK), one can use the PPI/EXTI functionality of the MCU.
Examples:
A button press connected to an input GPIO immediately toggles an LED on an output GPIO.
A sensor’s input signal on GPIO triggers another peripheral or an output pin change via an interrupt.
HW-Platforms:
Nordic Semiconductor (nRF Series) – PPI (Programmable Peripheral Interconnect): PPI allows one hardware event (e.g., a GPIO input change) to directly trigger another (e.g., GPIO output) without CPU intervention.
STM32 Microcontrollers – EXTI (External Interrupt) with Direct GPIO Mapping: GPIOs can be mapped to trigger interrupts or interact with timers, enabling fast response without CPU control.
Texas Instruments (TI) – PRU (Programmable Real-Time Unit): On TI Sitara processors, the PRU cores handle direct GPIO linking and control, ensuring minimal latency between input and output operations.
Measuring Message Sizes
logging, hardware packet sniffing, packet analysis tools (wireshark)
Logging: can be performed on the mote or on the gateway; it is easier on the latter.
Packet sniffing: there are two main situations.
If the constrained device talks to a computer or gateway, just run Wireshark on the computer.
If two devices talk to each other, you need a third device that understands the protocol to sniff the conversation. Some IoT platforms offer facilities to save the conversation as a .cap file, which can later be analysed in Wireshark.
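For the logging approach, a common pattern is to print each message as a hex string and compute its size on the host. The sketch below assumes log lines of the form "label: hex bytes"; that format and the payloads are made up for illustration and are not real protocol messages.

```python
def message_size(hex_dump: str) -> int:
    """Return the size in bytes of a message logged as a hex string."""
    return len(bytes.fromhex(hex_dump))


# Made-up log lines, not real protocol messages.
LOG_LINES = [
    "TX message_1: 03 58 20 aa bb cc",
    "RX message_2: 58 2a 01 02",
]

sizes = {}
for line in LOG_LINES:
    label, _, payload = line.partition(": ")
    sizes[label] = message_size(payload)
    print(f"{label}: {sizes[label]} bytes")
```

Note that this gives the size of the logged payload only; headers added by lower layers are visible only through packet sniffing.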
drafts
nm (actually not recommended)
One way to measure per-symbol sizes is the nm (name list) utility. By default it returns all the symbol names and their respective addresses. To also get the symbol sizes, we set the -S flag. For example, we can filter out the crypto backend and keep the symbol belonging to the lakers library (EdhocBuffer (...) Default), which occupies 0x22 bytes.
Sum all relevant symbols with awk and we get the Flash usage for our application.
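The same summation can be sketched in Python instead of awk. The sketch assumes the usual nm -S row layout (address, size, type, name, all but the name in fixed positions) and counts only code symbols (types T and t). The function name and the sample rows, including the symbol names and addresses, are made up for illustration.

```python
def sum_symbol_sizes(nm_output: str, name_filter: str, types: str = "Tt") -> int:
    """Sum the sizes of symbols of the given types whose name contains name_filter."""
    total = 0
    for line in nm_output.splitlines():
        fields = line.split(maxsplit=3)
        if len(fields) < 4:
            continue  # no size column (e.g. undefined symbols)
        _address, size, symbol_type, name = fields
        try:
            size_bytes = int(size, 16)
        except ValueError:
            continue  # row without a size whose name contains spaces
        if symbol_type in types and name_filter in name:
            total += size_bytes
    return total


# Made-up rows in the layout printed by nm -S with demangled names.
NM_OUTPUT = """\
00000458 0000013c T lakers::EdhocInitiator<Crypto>::prepare_message_1
00000594 00000022 T <lakers::EdhocBuffer as Default>::default
000005b8 00000040 T crypto_backend::sha256
20000000 00000038 D lakers::SOME_STATIC
         U undefined_symbol
"""

print(sum_symbol_sizes(NM_OUTPUT, "lakers"))  # 0x13c + 0x22 = 350
```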