Introduction
The exponential growth of the wireless communications industry has created a multitude of new products with advanced features that allow users to stay in touch with every aspect of their lives wherever they may be. These new products are quite diverse, demand more system performance without compromising power conservation, and have short product life cycles. Features such as video teleconferencing, global positioning and internet access require these systems to be flexible and capable of understanding a variety of digital wireless standards currently defined by the USA, Europe, Asia-Pacific and Japan.
For example, there is a growing need for cellular baseband transceivers that accommodate GSM as well as CDMA standards at a low cost. To accomplish this, a micro-architecture that couples easily to DSPs, ASICs, standard peripherals and memory devices is needed. This micro-architecture must be programmable in C or C++, supported by the most popular real-time operating systems, and offer a high degree of code reusability for rapid prototype development with a rich development tool set.
The focus of this paper is to discuss the low power features of the M·CORE architecture and describe a dual processor solution for a TDMA baseband transceiver which is currently in production. The key features of the 1.8 volt DSP56652 cellular baseband processor, currently designed into the iDEN® i1000™ phone, will be discussed, highlighting the integration of smart peripherals to reduce overall power consumption.
Low Power Architecture
Motorola’s M·CORE architecture is designed specifically for sophisticated, yet low power, applications. It is a fully static CMOS core that packs about 80,000 transistors into 2.2 mm² of silicon in a 0.36 micron process. The architecture implements logic within portions of the core execution and control blocks to minimize power and reduce EMI. In addition to providing mechanisms to power down the processor and system logic, there is a focus on minimizing dynamic power consumption when the system is active.
The M·CORE architecture utilizes a streamlined execution engine that provides many of the same performance enhancements as mainstream RISC architectures. It is implemented with a fixed 16-bit instruction length and a 32-bit internal data path, which meets the computational precision requirements of newer advanced products with the cost and power advantages previously available only with 16-bit architectures. The resulting increase in code density minimizes the overhead of memory system energy consumption.
A close examination of the M·CORE micro-RISC architecture, as illustrated in Figure 1, shows how it was designed for optimal performance and low power consumption. Key factors to consider are instruction set efficiency, memory utilization, special low power modes for static operation, power consumption during dynamic operation, and code density. Other important factors to consider during product design are the ease of interface to custom peripheral circuits and ASICs, the on-chip JTAG/OnCE™ emulation port and development tool support from third party vendors.
Instruction Set Efficiency
Optimal instruction set efficiency is accomplished in the M·CORE architecture by implementation of a universal load-store RISC engine. The core contains a 16-entry, 32-bit general purpose register file, and processes instructions using an efficient four-stage execution pipeline. All computational activity takes place within the internal registers, thus reducing external bus transients, which consume power.
The arithmetic unit contains a barrel shifter which provides fast multiply and signed or unsigned divides of integers, as well as special help in translation of incoming/outgoing data, such as single-cycle bit reversal of a 32-bit word. Data movement is accomplished using loads/stores of single or multiple registers in one instruction. This facilitates fast and efficient register utilization when entering/exiting subroutines and during context switches between user and supervisor mode.
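To give a sense of what single-cycle bit reversal saves, the following C sketch shows one common way to reverse a 32-bit word in software on a machine without such an instruction. It is only an illustration, not the M·CORE implementation: the function names `bit_reverse32` and `bit_reverse_field` are hypothetical, and the field helper is an added assumption about how reversed data might be used.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, bit 1 <-> bit 30, ...).
 * Software equivalent of a hardware bit-reverse; costs five mask-and-shift steps. */
uint32_t bit_reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u); /* swap adjacent bits */
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u); /* swap bit pairs     */
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu); /* swap nibbles       */
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu); /* swap bytes         */
    x = (x << 16) | (x >> 16);                               /* swap half-words    */
    return x;
}

/* Hypothetical helper: reverse only the low `nbits` bits of x (1..32),
 * e.g. to flip the bit order of a serial data field.
 * Bits of x above the field must be zero. */
uint32_t bit_reverse_field(uint32_t x, unsigned nbits)
{
    assert(nbits >= 1 && nbits <= 32);
    return bit_reverse32(x) >> (32u - nbits);
}
```

For instance, `bit_reverse32(0x00000001u)` yields `0x80000000u`. Every mask-and-shift step above is an instruction, and therefore energy, that a single-cycle hardware reversal avoids.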
System-level power management
To provide optimal static power management for the overall system, the M·CORE architecture provides three instructions (stop, wait, and doze) that enable external logic to disable power to parts of the system. Execution of any of these instructions causes the processor to assert the LPMD1-0 output signals in the manner described in Table 1.
The external logic uses the LPMD1-0 signals to determine exactly which parts of the overall system logic should be placed in a low-power state. The external logic can also place the processor in a low power mode by forcing the CLK input high.
Dynamic power consumption
Although reducing a system’s static power usage achieves the greatest overall reduction in power consumption, a true low power solution must address the issue of dynamic power consumption. By dynamic power consumption, we are referring to the power required by the system when it is actually being used. The M·CORE architecture optimizes dynamic power consumption by both minimizing the power needed to execute an instruction and minimizing the number of bytes that need to be fetched to perform a given function.
Power Aware instruction pipeline
The low power instructions discussed earlier provide a mechanism to power down select parts of the system when not used. With processors themselves becoming more complex, a logical extension of this is to power up only the parts of a processor that are required to execute an instruction. The M·CORE architecture achieves this benefit through its advanced power aware pipeline. The instruction pipeline recognizes which processor functions are required to execute a particular instruction. This enables it to ensure that data only transitions through the processor blocks that are actually needed to implement the instruction. For example, an add instruction would cause data to transition through the adder but not through the barrel shifter. By eliminating unnecessary transitions, the M·CORE architecture prevents switching of gates, loads, and wires in unused blocks, all of which would otherwise consume additional power.
Code density
Compilers were developed in conjunction with the M·CORE architecture instruction set to maximize code density. Code density is a measure of how many bytes of code are required to implement an application or function. Code density affects power consumption both statically and dynamically. The M·CORE architecture’s high code density results in a smaller executable image. This reduces an application’s memory requirements, which in turn reduces system cost and system power consumption. However, there is a second benefit to code density. Every time the processor fetches an instruction from memory, it must use a bus cycle. Bus cycles, of course, consume power. Since the M·CORE architecture’s dense code allows it to perform equivalent functionality with fewer bytes of code, a program executing on an M·CORE processor will consume less power because it will fetch fewer bytes from memory.
Embedded and portable benchmarks were used to make design trade-offs in the architecture and the compiler. The Powerstone benchmarks, which include paging, automobile control, signal processing, imaging and fax applications, are detailed in Table 2.
During initial analysis the M·CORE architecture instruction set was profiled by running the Powerstone benchmark suites on a cycle accurate C++ simulator. Table 3 shows the percentage of dynamic instructions utilizing the adder and barrel shifter, as well as the percentage of change of flow and load/store instructions.
Although the M·CORE architecture is a 32-bit architecture, it utilizes a 16-bit instruction set to achieve high code density. In addition to providing improved code density, the 16-bit instruction set provides a performance advantage over conventional RISC architectures in many low-cost applications. It is common for such applications to minimize cost through use of a 16-bit bus. Since conventional RISC architectures use 32-bit wide instructions, they have to perform two bus cycles to fetch an instruction, negatively impacting overall instruction throughput. In contrast, the M·CORE architecture requires only a single bus cycle to perform an instruction fetch, enabling it to run at full speed even with a 16-bit bus.
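The fetch arithmetic behind this comparison can be sketched in a few lines of C. This is a deliberately simplified model: it assumes no cache or prefetch buffer, one bus cycle per bus-width transfer, and the same instruction count on both architectures. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bus cycles needed to fetch one instruction of `insn_bits` bits
 * over a bus `bus_bits` wide, rounded up to whole cycles. */
uint32_t fetch_cycles(uint32_t insn_bits, uint32_t bus_bits)
{
    assert(insn_bits > 0 && bus_bits > 0);
    return (insn_bits + bus_bits - 1u) / bus_bits;
}

/* Total fetch cycles for a run of `n_insns` instructions
 * (simplifying assumption: every instruction is fetched exactly once). */
uint64_t total_fetch_cycles(uint64_t n_insns, uint32_t insn_bits, uint32_t bus_bits)
{
    return n_insns * fetch_cycles(insn_bits, bus_bits);
}
```

Under these assumptions, `fetch_cycles(32, 16)` is 2 while `fetch_cycles(16, 16)` is 1, so a run of 1000 instructions costs 2000 fetch cycles with 32-bit instructions against 1000 with 16-bit instructions on the same 16-bit bus.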
A comparison to other popular architectures was made to evaluate instruction set efficiency and favorable results were realized as illustrated in Figure 2. Compiler efficiency played a key role in the code density comparisons especially when evaluating function call stacking, interrupt handlers, variable manipulation and the handling of if-else conditional statements. The implementation of conditional move, increment, decrement, and clear operations supplemented traditional change of flow instructions and helped improve compiler optimization.
Rich register set
To further minimize bus activity, the M·CORE architecture reduces the need to read and write data to and from memory. It achieves this by providing a rich set of registers that enables a program to keep data variables in registers while they are live. The M·CORE architecture provides a total of 37 32-bit data registers that are available to system programmers: one set of 16 general purpose registers, an alternate register file with 16 registers, and 5 scratch registers.
The register file consumes 16% of total processor power and 42% of data path power due to the high utilization of the registers in the instruction set. Since loads and stores in a typical commercial RISC constitute approximately 23% of the dynamic instructions executed, the implementation of the alternate register file coupled with the ability to load/store multiple registers improved interrupt entry and exit latency and reduced memory accesses for instruction fetches and variable save/restore.
Support for multiple data sizes
Some commonly used data types such as chars or shorts have 8- or 16-bit, rather than 32-bit, representations. This provides an additional opportunity for the M·CORE architecture to reduce power consumption when fetching data from memory. For example, the M·CORE architecture would only toggle the 8 bits required to read or write a char, minimizing power consumption by logic external to the processor core. To speed up memory copy and initialization operations, load multiple/load quadrant and store multiple/store quadrant instructions were added for block moves of registers to memory or memory to registers. This helped compiler resolution of variable alignment in memory.
Low Voltage
Since dynamic power consumption is proportional to the square of the supply voltage required, lowering the voltage provides a disproportionately large boost to battery life. M·CORE processors are designed to require only 1.8 volts to operate, with future versions planned to use as little as 0.9 volts.
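As a worked illustration of the square-law relationship, the standard first-order CMOS switching-power model can be written as follows. The symbols for activity factor ($\alpha$), switched capacitance ($C$) and clock frequency ($f$) do not appear in the text and are introduced here as assumptions; the comparison holds them constant and ignores leakage.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(0.9\,\mathrm{V})}{P_{\mathrm{dyn}}(1.8\,\mathrm{V})}
  = \left(\frac{0.9}{1.8}\right)^{2} = 0.25
```

Under these assumptions, halving the supply from 1.8 volts to 0.9 volts would cut dynamic power to roughly one quarter, provided the same clock frequency can be sustained at the lower voltage.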
Processor Power Distribution
Analysis of the architectural implementation showed that clock and data paths consumed a large portion of the power. This led to a critical decision on whether to synthesize or custom design the data path. Research showed that synthesis required 60% more transistors and 175% more area, and consumed 40% more power. Thus the data path was custom designed to reduce power and area.
Further analysis showed that clock power was 36% of the total processor power consumption. The M·CORE processor uses a single global clock with local generation of dual phase non-overlapping clocks. Clock gating allows the clock tree to be disabled completely or partially, which permits specific data paths to be shut down during pipeline stalls, thus saving power. This is quite important since the data path consumes 36% of total power, while the remaining 28% is consumed by control logic.
Interrupt latency was significantly improved by the use of a 32 channel programmable interrupt controller. The 16 alternate registers improved interrupt entry and exit latency by eliminating the need to perform memory accesses for saving/restoring processor state. The use of a Find First One (FF1) instruction eliminated the need for interrupt priority scanning routines. This combination of special circuits realized a 37% improvement over the ARM processor with respect to interrupt service handling when performing a virtual DMA benchmark.
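The role of a find-first-one operation in interrupt dispatch can be sketched in C as below. This is a software model under stated assumptions, not Motorola's code: `ff1` is modelled as returning the offset of the first set bit counted from the most significant bit (32 if no bit is set), the mapping of higher bit numbers to higher priority is assumed, and the handler table and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IRQ_CHANNELS 32

typedef void (*irq_handler_t)(void);

/* Software model of find-first-one: scan from bit 31 downward and return
 * the offset of the first set bit from the MSB, or 32 if x is zero.
 * This loop is the scanning work a single hardware instruction replaces. */
unsigned ff1(uint32_t x)
{
    unsigned offset = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        offset++;
    }
    return offset;
}

/* Highest-numbered pending channel, or -1 if none is pending
 * (assumption: a higher bit number means a higher priority). */
int highest_pending(uint32_t pending)
{
    unsigned offset = ff1(pending);
    return (offset == 32) ? -1 : (int)(31u - offset);
}

/* Call the handler for the highest-priority pending channel, if any,
 * and return that channel number (or -1 when nothing is pending). */
int dispatch_highest(uint32_t pending, const irq_handler_t table[NUM_IRQ_CHANNELS])
{
    int channel = highest_pending(pending);
    if (channel >= 0) {
        assert(table[channel] != NULL); /* every enabled channel needs a handler */
        table[channel]();
    }
    return channel;
}
```

With a hardware FF1, the loop in `ff1` collapses to one instruction, so the dispatcher can index the handler table directly from the pending-interrupt word instead of testing each channel in turn.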