Arm Cortex-M7

The Cortex-M7 inside the Daisy Seed features a six-stage, dual-issue superscalar pipeline. In plain terms, while earlier Cortex-M cores such as the M4 issue at most one instruction per clock cycle, the M7 can issue up to two, provided those instructions do not depend on each other and use different execution units (like pairing a math calculation with a memory load).

If the CPU encounters a dependency (e.g., Instruction B needs the result of Instruction A), the two instructions cannot be issued together: B has to wait for A's result, and you lose the superscalar advantage for that pair.

To get the most performance out of the M7 for demanding audio algorithms, you have to write code that keeps both issue slots busy. Here are the core strategies to optimize your DSP code:

1. Break Data Dependencies

The compiler is smart, but it cannot reorder or pair two instructions when one relies directly on the output of the other.

Consider a simple audio calculation like a multiply-accumulate for an FIR filter:

\[ \begin{aligned} y[n] &= y[n] + c_0 \cdot x[n] \\ y[n] &= y[n] + c_1 \cdot x[n-1] \end{aligned} \]

In the naive approach above, the second line cannot execute until the first line finishes updating \(y[n]\).

The Fix: Use multiple accumulators.

\[ \begin{aligned} acc_0 &= c_0 \cdot x[n] \\ acc_1 &= c_1 \cdot x[n-1] \\ y[n] &= acc_0 + acc_1 \end{aligned} \]

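By splitting the calculation into acc_0 and acc_1, the two multiplications no longer depend on each other, so the Cortex-M7 can start the second without waiting for the result of the first. The only serial step left is the final addition. The compiler will normally keep both accumulators in registers. The benefit grows with the number of taps, since a single accumulator forces every multiply-accumulate to wait on the previous one.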
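As a minimal sketch of the multiple-accumulator idea, the function below computes one FIR output sample with two independent partial sums. The names `fir_sample_2acc`, `coeffs`, and `history` are illustrative and not part of any Daisy API, and `history[k]` is assumed to hold x[n-k]. Because floating-point addition is not associative, the result can differ from the single-accumulator version in the last bits. That is also why the compiler generally will not make this transformation on its own unless flags such as `-ffast-math` are enabled.

```c
#include <stddef.h>

/* Hypothetical helper: one FIR output sample,
 * y[n] = sum over k of coeffs[k] * history[k],
 * where history[k] is assumed to hold x[n-k]. */
float fir_sample_2acc(const float *coeffs, const float *history, size_t taps)
{
    float acc0 = 0.0f; /* partial sum over even-indexed taps */
    float acc1 = 0.0f; /* partial sum over odd-indexed taps  */
    size_t k = 0;

    /* The two multiply-accumulates in the loop body are independent of
     * each other, so neither has to wait for the other's result. */
    for (; k + 1 < taps; k += 2) {
        acc0 += coeffs[k]     * history[k];
        acc1 += coeffs[k + 1] * history[k + 1];
    }

    /* Leftover tap when the tap count is odd. */
    if (k < taps) {
        acc0 += coeffs[k] * history[k];
    }

    /* The only serial step: combine the partial sums. */
    return acc0 + acc1;
}
```

Whether this is actually faster depends on the compiler version and optimization flags, so it is worth comparing the generated assembly or cycle counts on the target before committing to it.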

2. Unroll Your Audio Loops

Looping over audio buffers (like processing a block of 48 samples) introduces branch overhead. Every time the loop restarts, the CPU has to evaluate the condition and jump back, which can disrupt the pipeline.

By unrolling the loop—processing 2, 4, or 8 samples per iteration—you give the compiler a larger block of sequential instructions. This makes it much easier for the compiler to find independent instructions to pair up for dual-issue execution.

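// Instead of processing 1 sample per iteration, process 4 at a time:
int i = 0;
for (; i + 3 < size; i += 4) {
    // Four independent calls the compiler can schedule together
    out[i]     = process(in[i]);
    out[i + 1] = process(in[i + 1]);
    out[i + 2] = process(in[i + 2]);
    out[i + 3] = process(in[i + 3]);
}
// Handle leftover samples when size is not a multiple of 4
for (; i < size; ++i) {
    out[i] = process(in[i]);
}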
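If hand-unrolling makes the code harder to maintain, one alternative is to ask the compiler to do it. The sketch below assumes GCC 8 or newer, which accepts `#pragma GCC unroll N` directly before a loop. The pragma is guarded so that other compilers simply see a plain loop. `apply_gain` is a hypothetical stand-in for `process`.

```c
#include <stddef.h>

/* Hypothetical per-block routine: scales every sample by a constant gain. */
void apply_gain(const float *in, float *out, size_t size, float gain)
{
    /* Request 4x unrolling on GCC 8+. The compiler generates the remainder
     * handling itself, so size does not need to be a multiple of 4. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#pragma GCC unroll 4
#endif
    for (size_t i = 0; i < size; ++i) {
        out[i] = in[i] * gain;
    }
}
```

The pragma is only a request, and its effect depends on the optimization level, so checking the generated assembly is the reliable way to confirm the loop was unrolled.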

3. Leverage Tightly Coupled Memory (TCM)

A blazing-fast superscalar CPU is useless if it spends most of its time waiting for data to arrive from slow external memory. The Daisy Seed has external SDRAM, which is great for massive delay buffers, but it is too slow for real-time DSP execution without caching.

The M7 includes ITCM (Instruction Tightly Coupled Memory) and DTCM (Data Tightly Coupled Memory). These are small blocks of ultra-fast SRAM connected directly to the CPU core, bypassing standard system buses for zero-wait-state access.

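Move Critical Code to ITCM: You can instruct the linker to place your hottest DSP functions in this memory, typically by tagging the function with a section attribute. In the Daisy ecosystem this would be wrapped in a macro along the lines of DSY_ITCM_SECTION placed before your audio callback function; check the libDaisy headers and linker script for the exact name available in your version.
Keep Working Buffers in DTCM: Make sure the block of audio samples you are currently processing lives in internal memory, and only move data to or from the slower SDRAM when you actually need it (for example, for long delay lines).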
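The placement described above can be sketched with GCC section attributes. Everything named here is an assumption for illustration: the macros `EXAMPLE_ITCM_FUNC` and `EXAMPLE_DTCM_BUF`, the section names `.itcm_text` and `.dtcm_bss`, and the function `process_block` are placeholders, and the section names must match whatever your linker script actually defines. On a non-Arm host the macros expand to nothing so the code still compiles.

```c
#include <stddef.h>

/* Assumption: the linker script defines output sections with these names.
 * ".itcm_text" and ".dtcm_bss" are placeholders, not confirmed libDaisy names. */
#if defined(__arm__) && defined(__GNUC__)
#define EXAMPLE_ITCM_FUNC __attribute__((section(".itcm_text")))
#define EXAMPLE_DTCM_BUF  __attribute__((section(".dtcm_bss")))
#else
/* Host build: no special placement. */
#define EXAMPLE_ITCM_FUNC
#define EXAMPLE_DTCM_BUF
#endif

#define BLOCK_SIZE 48

/* Scratch buffer for the current audio block, intended for DTCM. */
EXAMPLE_DTCM_BUF static float work_buf[BLOCK_SIZE];

/* Hot per-block routine, intended for ITCM. */
EXAMPLE_ITCM_FUNC void process_block(const float *in, float *out, size_t size)
{
    if (size > BLOCK_SIZE) {
        size = BLOCK_SIZE; /* never write past the scratch buffer */
    }
    for (size_t i = 0; i < size; ++i) {
        work_buf[i] = in[i] * 0.5f; /* work happens in the fast buffer */
    }
    for (size_t i = 0; i < size; ++i) {
        out[i] = work_buf[i];       /* results are then copied out */
    }
}
```

Because ITCM is RAM, code placed there has to be copied from flash at startup. Whether that happens automatically depends on the startup code and linker script in use, so it is worth verifying before relying on it.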

4. Use CMSIS-DSP (The Easy Way)

If you want to leverage the superscalar architecture, SIMD (Single Instruction, Multiple Data) instructions, and hardware FPU without manually unrolling loops and managing accumulators, use the Arm CMSIS-DSP library.

It is natively supported in the Daisy toolchain and contains optimized functions for biquad filters, FFTs, and matrix math. The engineers at Arm have already applied loop unrolling and SIMD intrinsics in these routines (they are written mostly in C rather than hand-written assembly), so they tend to make good use of the M7's dual-issue pipeline without extra effort on your part.
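To make concrete what one of these library routines computes, here is a portable reference for a single Direct Form I biquad stage. It is a sketch, not the CMSIS-DSP source: `biquad_df1_ref` is a hypothetical name, and the code only mirrors the coefficient order {b0, b1, b2, a1, a2}, the added (not subtracted) feedback terms, and the state layout documented for `arm_biquad_cascade_df1_f32`.

```c
#include <stddef.h>

/* Portable reference for one Direct Form I biquad stage.
 * coeffs = {b0, b1, b2, a1, a2}, with the feedback terms ADDED:
 *   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2]
 * state  = {x[n-1], x[n-2], y[n-1], y[n-2]}, carried across blocks. */
void biquad_df1_ref(const float coeffs[5], float state[4],
                    const float *src, float *dst, size_t block_size)
{
    float xd1 = state[0]; /* x[n-1] */
    float xd2 = state[1]; /* x[n-2] */
    float yd1 = state[2]; /* y[n-1] */
    float yd2 = state[3]; /* y[n-2] */

    for (size_t n = 0; n < block_size; ++n) {
        const float x0 = src[n];
        const float y0 = coeffs[0] * x0 + coeffs[1] * xd1 + coeffs[2] * xd2
                       + coeffs[3] * yd1 + coeffs[4] * yd2;
        /* Shift the delay lines by one sample. */
        xd2 = xd1;
        xd1 = x0;
        yd2 = yd1;
        yd1 = y0;
        dst[n] = y0;
    }

    /* Save the delay lines so the next block continues seamlessly. */
    state[0] = xd1;
    state[1] = xd2;
    state[2] = yd1;
    state[3] = yd2;
}
```

On the target, the equivalent library usage would be to include `arm_math.h`, call `arm_biquad_cascade_df1_init_f32` once with the coefficient and state arrays (4 state floats per stage), and then call `arm_biquad_cascade_df1_f32` for each audio block.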

Would you like to look at a concrete example of how to flag a specific function to run from the ITCM on the Daisy Seed, or dive into implementing a specific CMSIS-DSP filter for your pedal?