Around 25 years ago, when microprocessors were in their infancy, most designers had one overriding aim: to minimise the memory requirements. Memory was expensive and largely external, so that a small reduction in code size could often eliminate an external memory device and lead to a proportionately much greater reduction in component cost.
As high-level languages such as C replaced assembly-level coding, compiler developers shared the same goal, and one of the most important metrics of any new compiler was the size of the resulting code. As a result, the environment in which most software engineers learned their skills was one that favoured code compactness above all else. For example, if a program sometimes needed to calculate a sum of squares and at other times needed to calculate a sum of cubes, the standard approach was to write a general procedure that calculated the sum of nth powers, where n was supplied as a parameter. By consolidating as much common code as possible into a parameterised procedure, the amount of memory required for code storage was minimised.
However, just as there were excellent reasons why this programming approach evolved, there are also sound reasons why it does not always deliver optimum results today.
One of the factors that has changed is that processor-based systems no longer necessarily consist of a standalone microprocessor plus external memory plus external peripherals. Today, all of these elements are likely to be incorporated in a single embedded system, which allows considerably more flexibility in the amount of embedded code, data and program memory required.
Designing for low power
Another of the key factors is the growing importance of low power consumption. Today, we are on the threshold of a new paradigm where mobile devices will become an increasing part of everyday life, and here minimising power consumption is often far more important than minimising code size. There are many approaches to reducing power consumption, perhaps the most obvious being a reduction in clock frequency, but to minimise power consumption without significant performance penalties, more sophisticated techniques are required.
ST is working with many prestigious universities around the world to develop tools and methodologies that allow software to be optimised for energy reduction. For example, one recent working paper by a team from ST, Stanford University (USA) and the University of Bologna (Italy) described the team's work in developing a tool that reduces the computational effort of programs by specialising them for highly expected situations.
Take, for example, the procedure sum(n,k) mentioned above that calculates the sum of the nth powers of the first k integers. Now suppose that, in practice, the value of n is 1 in 90% of the procedure calls. In such cases, n is called a constant-like argument (CLA) because its value is often - though not always - constant. If we write a simpler procedure, eg, sum1(k), to handle this special case and make sum(n,k) call sum1(k) whenever n = 1, the program will execute faster and consume less power. Of course, the code size will be slightly increased because we have added a specialised procedure to handle the case of n = 1 but this is often a very small price to pay for the lower power consumption and greater performance that results from calling a procedure that only uses one loop rather than two in 90% of cases.
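The idea can be sketched in C. This is a minimal illustration, not the actual tool output: the names sum_pow, sum_pow_1 and sum_pow_opt are invented here for the example, standing in for the sum(n,k) and sum1(k) procedures described above.

```c
/* General case: sum of the n-th powers of the first k integers.
   Needs two nested loops: one over the integers, one to raise to the power n. */
long sum_pow(int n, int k) {
    long total = 0;
    for (int i = 1; i <= k; i++) {
        long p = 1;
        for (int j = 0; j < n; j++)
            p *= i;
        total += p;
    }
    return total;
}

/* Specialised version for the common case n = 1: a single loop. */
long sum_pow_1(int k) {
    long total = 0;
    for (int i = 1; i <= k; i++)
        total += i;
    return total;
}

/* Dispatcher: the constant-like argument n is tested once, and the
   common case (roughly 90% of calls) takes the cheaper path. */
long sum_pow_opt(int n, int k) {
    if (n == 1)
        return sum_pow_1(k);   /* common case: one loop */
    return sum_pow(n, k);      /* general fallback: two loops */
}
```

The extra code for sum_pow_1 and the dispatch test is the small code-size price paid for executing the single-loop version in the large majority of calls.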
Automatic code transformation
In real applications, the problem is far more complex. What ST and its research partners are developing is a tool that will take the input source code (written in C), find the procedure calls which are frequently executed with the same parameter values (known as 'highly expected situations' or 'common cases') and then generate additional specialised versions of the procedure code to handle the common cases with less computational effort. As research shows that reducing the computational effort usually reduces both power consumption and execution time, a tool that could automatically transform source code in this way would bring tremendous benefits for the customer.
In practice, there are three major problems to be addressed.
* The first is that a procedure may have several possible common cases and it may not be clear which common case is the most effective candidate.
* The second problem is to determine, once a common case has been selected, how best to optimise.
* Finally, after each procedure call has been specialised with the most effective combination of optimisations (such as loop unrolling), it is necessary to analyse the global effect.
This must be done not only in terms of the resulting code size but also because changes in the calling sequences may introduce cache conflicts that were not present in the original code.
Figure 1 shows the key steps in the source code transformation flow. The first three steps do not depend on the target architecture, while the two final steps use instruction level simulation to consider the underlying hardware architecture.
* Step 1: The first step is to collect information for the three search problems and estimate the computational efforts involved in the procedure calls. Two types of profiling are performed: execution frequency profiling is used to estimate the total computational effort associated with each procedure, while value profiling identifies CLAs and their common values by observing the parameter value changes of procedure calls.
* Step 2: Armed with this information, the next step is to calculate the normalised computational effort for every detected common case. A user-defined threshold allows trivial common cases to be pruned from the search. In the next stage, all the remaining common cases are specialised, with the result that the original source code is transformed in such a way that all procedures for which there are effective common cases now include a conditional statement that checks for the occurrence of a common value and executes the appropriate specialised version if it is found.
* Step 3: Finally, the global interaction of the specialised calls is examined to determine which ones can be included in the final code.
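The value profiling in Step 1 can be sketched as follows. This is a simplified illustration of the idea only, not the profiling machinery the paper describes: the names profile_n and most_common_n, and the fixed histogram size, are assumptions made for the example.

```c
/* Sketch of value profiling for one candidate argument: count how often
   each small value is observed at a call site, to detect a CLA. */
#define MAX_VAL 16
static unsigned long n_histogram[MAX_VAL];

/* Called (conceptually by instrumented code) on every procedure call. */
static void profile_n(int n) {
    if (n >= 0 && n < MAX_VAL)
        n_histogram[n]++;
}

/* Return the most frequent observed value; write its share of all
   recorded calls into *share. An argument whose share exceeds a
   user-defined threshold is a candidate common case. */
static int most_common_n(double *share) {
    unsigned long total = 0, best = 0;
    int best_n = -1;
    for (int v = 0; v < MAX_VAL; v++) {
        total += n_histogram[v];
        if (n_histogram[v] > best) {
            best = n_histogram[v];
            best_n = v;
        }
    }
    *share = total ? (double)best / (double)total : 0.0;
    return best_n;
}
```

In the sum(n,k) example, nine recorded calls with n = 1 and one with n = 2 would report the value 1 with a 90% share, making n = 1 a common case worth specialising.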
To date, the results are very promising. For example, using the ST210 VLIW as the target architecture and a variety of DSP programs based on a set of industrial benchmarks for multimedia applications (eg, G721 encoding, FFT, FIR, and edge detection and convolution of images), the average improvements in energy consumption and execution speed were both around 40%. This was achieved with an average increase in code size of just 5%. In some cases, even more spectacular improvements were observed: in the FFT program, for example, over 80% improvement in both energy consumption and execution speed was achieved with a 14% increase in code size.
As the world becomes more and more mobile and connected, designing for low power is becoming increasingly critical. The beauty of this approach is that it allows one to optimise the trade-offs between price, performance and power consumption that can make all the difference to the customer's winning edge.
Avnet Kopp, 011 809 6100, [email protected]
© Technews Publishing (Pty) Ltd | All Rights Reserved