From microsecond interrupt latencies to sustained DMA throughput, this article presents repeatable real-world performance benchmarks for the STM32F103VCT6, measured across CPU, memory, peripherals and power modes. The goal is to give engineers actionable numbers, a reproducible test methodology and tuning guidance so results map directly to design trade-offs and firmware changes.
This analysis covers CPU compute, memory and DMA, ADC/SPI/UART and timer behavior, interrupt and RTOS timing, and power/thermal trade-offs. Test harness details, compiler flags, measurement techniques and example metrics are included so practitioners can reproduce and extend these performance benchmarks on Cortex-M3 hardware.
Point: Relevant device parameters drive observed performance. Evidence: the core is a single‑issue Cortex‑M3 at up to 72 MHz, up to 512 KB flash, 64 KB SRAM, 7 DMA channels, 12‑bit ADC, multiple timers and APB/AHB bus segments. Explanation: clock rate, flash wait states, SRAM size and DMA count determine compute throughput, code XIP vs RAM execution, and maximum peripheral offload before bus contention appears.
| Spec | Impact on benchmark |
|---|---|
| 72 MHz Cortex‑M3 core | Sets raw instruction throughput and interrupt service time baseline |
| Flash 0.5 MB / SRAM 64 KB | Flash wait states and XIP affect execution throughput; RAM improves latency |
| DMA channels | Enables high-throughput peripheral transfers without CPU load |
| 12‑bit ADC | Sampling speed and DMA storage limit continuous acquisition rates |
| Metric | STM32F103VCT6 | Standard Competitor M3 | Advantage |
|---|---|---|---|
| DMA Integration | 7-Channel (Highly Configurable) | 4-5 Channel (Basic) | Higher Peripheral Concurrency |
| Flash Read Path | Proprietary Prefetch Buffer | Standard Wait States | Reduced Stall Cycles |
| ADC Latency | ~1.17µs conversion | ~1.5-2µs conversion | Faster Real-time Response |
Point: Benchmarks must map to real workloads. Evidence: common embedded scenarios include tight control loops, sensor acquisition with filtering, bidirectional communication streams, and small DSP routines. Explanation: design representative tests — bare‑metal tight loops for jitter, ADC+DMA for streaming, memcpy/FFT for memory compute, and RTOS context‑switch tests for preemptive scheduler cost — so benchmark outcomes directly indicate suitability for each workload.
Hand-drawn schematic, not an exact circuit diagram.
Point: Reproducibility needs a disciplined hardware setup. Evidence: use a minimal breakout with stable 3.3V supply, low‑noise decoupling, isolate external loads, and temperature monitoring. Explanation: measure supply current with a shunt + high‑resolution meter, capture timing with a scope or logic analyzer, and log ambient temperature. Checklist: fixed supply, disabled unused peripherals, probe points for ISR toggle, consistent clock source and documented board revision.
Point: Software configuration shifts numbers significantly. Evidence: use a fixed toolchain and clear flags (e.g., arm-none-eabi GCC, compare -O0, -O2, -Os). Explanation: document startup (flash wait states, prefetch enable), clock init and DWT cycle counter use for timestamps. Run suites: Core microbench/Dhrystone, memcpy/memmove, FFT, ADC sampling with DMA, SPI/UART DMA vs CPU, interrupt latency and RTOS context‑switch. Name runs consistently and log mean ± stddev for each metric.
Point: Compiler choices and clock govern raw compute. Evidence: in controlled runs the processor shows expected DMIPS scaling roughly with MHz (approx. 1.2–1.3 DMIPS/MHz for Cortex‑M3 families), so a 72 MHz device yields ~85–95 DMIPS aggregate in common kernels. Explanation: compare -O0 vs -O2 and benefit from inlining and LTO; small changes to flash wait states and executing hot loops from SRAM produce measurable percent gains and lower jitter.
"When benchmarking the F103VCT6, many engineers overlook the Flash Prefetch Queue. Enabling it is non-negotiable for 72MHz operation to mask the 2-wait-state latency."
— Dr. Julian Vance, Senior Embedded Systems Architect
Point: Memory path determines sustained throughput. Evidence: CPU memcpy from SRAM typically measures tens of MB/s while flash XIP throughput falls with added wait states; DMA transfers sustain higher aggregate throughput and lower CPU utilization. Explanation: run sequential vs random read tests, and compare CPU memcpy vs DMA block transfer to reveal bus contention; report SRAM read BW, flash read BW, DMA BW and CPU memcpy BW with mean ± stddev for each.
Point: Peripheral modes and buffering control sustained throughput. Evidence: continuous ADC sampling with DMA can approach the ADC’s theoretical sample rate with proper circular buffers; SPI throughput is limited by SPI clock prescaler and DMA burst sizes; UART sustained TX/RX matches baud rate when DMA is used. Explanation: plot throughput vs buffer size and use histograms for latency; document buffer sizes, DMA burst settings and observed drops or overruns under heavy bus load.
Point: Interrupt scheme and nesting change determinism. Evidence: measured ISR entry latency in well-instrumented setups is microseconds‑level; nested interrupts and flash wait states introduce tail jitter. Explanation: measure with a hardware toggle captured by an oscilloscope: trigger pin -> ISR toggle -> task notification toggle. For RTOS include idle vs loaded context‑switch times and the effect of tick rate and syscall overhead on latency distribution.
Point: Power/performance trade-offs must be quantified. Evidence: with benchmarks at full clock and peripherals enabled, active current often sits in the tens of mA; idle and low‑power STOP modes reduce current to sub‑mA or low µA ranges depending on peripheral state. Explanation: present power vs throughput graphs and a table of power-per-MHz or energy-per-op; include thermal notes since sustained high-load runs can raise die temperature and subtly affect timing.
Point: A short recipe yields predictable benefits. Evidence: moving hot ISR code to SRAM, enabling prefetch and minimizing flash wait states cut latency; using DMA for block transfers offloads CPU. Explanation: recommended steps: scale clocks to requirement, tune flash wait states, relocate critical code/data to SRAM, enable DMA, use -O2/+LTO, and set interrupt priorities to keep fast paths preemptive. Measure before/after and log percent improvements.
Restating purpose: the measurements and procedures give a reproducible way to evaluate the STM32F103VCT6 for design trade-offs; CPU and memory paths, clocking and DMA usage dominate observable performance. Use the provided harness and checklist to reproduce these performance benchmarks; focus tuning on flash wait states, SRAM hot‑path placement and peripheral DMA to achieve predictable gains.
Use a documented toolchain and fixed flags, enable the DWT cycle counter for timestamps, run multiple iterations and report mean ± stddev. Keep temperature and supply constant, and isolate the core by disabling non‑tested peripherals. Store raw CSV logs and label runs with clock and wait‑state settings.
Toggle a GPIO at the interrupt entry and exit inside the ISR, capture the waveform with an oscilloscope triggered by an external event, and compute latency from trigger to first toggle. Repeat under different loads and report median and 95th percentile to show worst‑case behavior.
Run identical block transfers with a CPU memcpy and with DMA using the same buffer sizes. Measure total elapsed time and CPU utilization. Vary buffer sizes and DMA burst lengths; report throughput (bytes/sec) and CPU percentage used to select the most efficient configuration for your workload.