Abstract: This document describes the process used to develop and optimize nested loops for the Texas Instruments (TI) TMS320C6x digital signal processor (DSP). The performance of loops can greatly affect the performance of entire applications. Many loops are nested loops with both an inner and an outer loop. To optimize nested loops, it is necessary to consider both inner loop and outer loop performance, especially when the inner loop count is small for each execution of the outer loop.

Design Problem

In many typical DSP applications, loops account for the majority of the cycles, or MIPS. Because of this, the performance of loops can greatly affect the performance of the entire application. Many of these loops are nested loops with both an inner and an outer loop. Some common examples are FIR and IIR filters, the FFT, and the DCT. To optimize these nested loops, it is necessary to consider not only inner loop performance but also outer loop performance, especially when the inner loop count is small for each execution of the outer loop.

One technique used to optimize loops on the highly parallel C6x VelociTI architecture is software pipelining. This involves initiating new iterations of the loop before previous iterations have completed, in order to obtain high throughput. It implies that each inner loop needs some cycles to begin executing, or pipe up (the loop prolog), and some more cycles to pipe down (the loop epilog). These cycles are incurred on every outer loop execution, so they can affect performance, especially when the inner loop count is small. The more deeply pipelined the DSP is, the more cycles are required for the prolog and epilog.

Figure 1 shows a simple dot product example (with non-C6x-like single-cycle loads and multiplies) in which the inner loop setup takes 2 cycles, the prolog 2 cycles, the epilog 2 cycles, and the outer loop instructions 2 cycles.
At the end of cycle 9 there is a branch back to the beginning of the loop setup (Br 1). Thus, 8 cycles of overhead are incurred each time this inner loop is executed in an outer loop. As DSPs move to deeper and deeper pipelines for higher clock speeds, the number of overhead cycles will increase. The higher the number of cycles for setup, prolog, epilog, and outer loop instructions, and the lower the inner loop count, the more overall nested loop performance is reduced.

Figure 1. Nested Loop w/ Software Pipelined Inner Loop

Application Report SPRA519: Nested Loop Optimization on the TMS320C6x
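The overhead arithmetic for Figure 1 can be sketched as a small cycle model in C. This is only an illustration under stated assumptions: the type `loop_overhead`, the three function names, and the `kernel_cycles` parameter (the cycles spent in the steady-state kernel per inner loop execution) are hypothetical and do not come from the report, and the model ignores any effect not named in the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost record: cycles paid once per outer loop iteration.
 * Field names are illustrative, not taken from the application report. */
typedef struct {
    unsigned setup;   /* inner loop setup */
    unsigned prolog;  /* pipe up of the software-pipelined inner loop */
    unsigned epilog;  /* pipe down of the inner loop */
    unsigned outer;   /* outer loop instructions */
} loop_overhead;

/* Overhead cycles incurred each time the inner loop runs inside the outer loop. */
unsigned overhead_per_outer(const loop_overhead *c)
{
    assert(c != NULL);
    return c->setup + c->prolog + c->epilog + c->outer;
}

/* Total cycles for the whole nest.
 * kernel_cycles is an assumed input: cycles spent in the steady-state
 * kernel per inner loop execution, which depends on the inner loop count. */
unsigned long nested_loop_cycles(const loop_overhead *c,
                                 unsigned kernel_cycles,
                                 unsigned outer_count)
{
    return (unsigned long)outer_count * (overhead_per_outer(c) + kernel_cycles);
}

/* Share of cycles that is overhead, as a whole percentage (rounded down). */
unsigned overhead_percent(const loop_overhead *c, unsigned kernel_cycles)
{
    unsigned ovh = overhead_per_outer(c);
    unsigned total = ovh + kernel_cycles;
    return total == 0 ? 0 : (100u * ovh) / total;
}
```

With the Figure 1 costs (`{2, 2, 2, 2}`), `overhead_per_outer` returns 8. If the kernel runs for a single cycle, as in the 9-cycle pass described above, overhead makes up 8 of every 9 cycles; if the kernel ran for 92 cycles, the same 8 cycles would be 8 percent of the total. That is the sense in which a small inner loop count makes outer loop overhead dominate.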
Publication Year: 1999
Publication Date: 1999-01-01
Language: en
Type: article
Access and Citation
Cited By Count: 9