Trends in VLSI
technology scaling demand that future computing devices be narrowly focused to
achieve high performance and high efficiency, yet also target the high volumes
and low costs of widely applicable general-purpose designs. To address these
conflicting requirements, we propose a modular re-configurable architecture
called Smart Memories, targeted at computing needs in the 0.1 µm technology generation. A Smart Memories
chip is made up of many processing tiles, each containing local memory,
local interconnect, and a processor core. For efficient computation under a
wide class of possible applications, the memories, the wires, and the
computational model can all be altered to match the applications. To show the
applicability of this design, two very different machines at opposite ends of
the architectural spectrum, the Imagine stream processor and the Hydra
speculative multiprocessor, are mapped onto the Smart Memories computing
substrate. Simulations of the mappings show that the Smart Memories
architecture can successfully map these architectures with only modest
performance degradation.
The continued scaling
of integrated circuit fabrication technology will dramatically affect the
architecture of future computing systems. Scaling will make computation
cheaper, smaller, and lower power, thus enabling more sophisticated computation
in a growing number of embedded applications. However, the scaling of process
technologies makes the construction of custom solutions increasingly difficult due
to the increasing complexity of the desired devices. While designer
productivity has improved over time, and technologies like system-on-a-chip
help to manage complexity, each generation of complex machines is more
expensive to design than the previous one. High non-recurring fabrication costs
(e.g. mask generation) and long chip manufacturing delays mean that designs
must be all the more carefully validated, further increasing the design costs.
Thus, these large complex chips are only cost-effective if they can be sold in
large volumes. This need for a large market runs counter to the drive for
efficient, narrowly focused, custom hardware solutions. At the highest level,
a Smart Memories chip is a modular computer. It contains an array of processor
tiles and on-die DRAM memories connected by a packet-based, dynamically routed
network (Figure 1). The network also connects to high-speed links on the pins
of the chip to allow for the construction of multi-chip systems. Most of the
initial hardware design work in the Smart Memories project has been on the
processor tile design and evaluation, so this paper focuses on these aspects.
The organization of a
processor tile is a compromise between VLSI wire constraints and computational
efficiency. Our initial goal was to make each processor tile small enough so
the delay of a repeated wire around the semi-perimeter of the tile would be
less than a clock cycle. This leads to a tile edge of around 2.5 mm in a 0.1 µm technology. A tile of this size can contain
a processor equivalent to a MIPS R5000, a 64-bit, 2-issue,
in-order machine with 64 KB of on-die cache. Alternatively, this area can hold
2-4 MB of embedded DRAM, depending on the assumed cell size. A 400 mm² die would
then hold about 64 processor tiles, or a lesser number of processor tiles and
some DRAM tiles. Since large-scale computations may require more computation
power than what is contained in a single processing tile, we cluster four
processor tiles together into a “quad” and provide a low-overhead, intra-quad,
interconnection network. Grouping the tiles into quads also makes the global
interconnection network more efficient by reducing the number of global network
interfaces and thus the number of hops between processors. Our goal in the tile
design is to create a set of components that will span as wide an application
set as possible. In current architectures, computational elements are somewhat
standardized; today, most processors have multiple segmented functional units
to increase efficiency when working on limited precision numbers. Since much
work has already been done on optimizing the mix of functional units for a wide
application class, our efforts focus on creating the flexibility needed to
efficiently support different computational models: a flexible memory
system, a flexible interconnect between the processing node and the memory,
and flexible instruction decode.
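The tile-sizing arithmetic above can be checked with a short sketch. The 2.5 mm tile edge and 400 mm² die area are the figures given in the text; everything else is simple geometry:

```python
# Back-of-the-envelope check of the tile-sizing argument.
# The tile edge and die area come from the text; the rest is arithmetic.

TILE_EDGE_MM = 2.5    # chosen so a repeated wire spanning the tile's
                      # semi-perimeter has less than one clock cycle of delay
DIE_AREA_MM2 = 400.0  # assumed die size from the text

tile_area = TILE_EDGE_MM ** 2                  # 6.25 mm^2 per tile
tiles_per_die = int(DIE_AREA_MM2 // tile_area)

# Worst-case on-tile wire: half the tile's perimeter (one edge + one edge).
semi_perimeter_mm = 2 * TILE_EDGE_MM

print(f"tile area:       {tile_area} mm^2")
print(f"tiles per die:   {tiles_per_die}")     # ~64, matching the text
print(f"semi-perimeter:  {semi_perimeter_mm} mm of repeated wire")
```

With these numbers the die holds 64 tiles, consistent with the claim in the text; trading some processor tiles for DRAM tiles simply lowers that count.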
Continued technology
scaling causes a dilemma -- while computation gets cheaper, the design of
computing devices becomes more expensive, so new computing devices must have
large markets to be successful. Smart Memories addresses this issue by
extending the notion of a program. In conventional computing systems the
memories and the interconnect between the processors and memories are fixed, and
what the programmer modifies is the code that runs on the processor. While this
model is completely general, for many applications it is not very efficient. In
Smart Memories, the user can program the wires and the memory, as well as the
processors. This allows the user to configure the computing substrate to better
match the structure of the applications, which greatly increases the efficiency
of the resulting solution.
Our initial tile
architecture shows the potential of this approach. Using the same resources
normally found in a superscalar processor, we were able to arrange those
resources into two very different types of compute engines. One is optimized
for stream-based applications, i.e. very regular applications with large
amounts of data parallelism. In this machine organization, the tile provides
very high bandwidth and high computational throughput. The other engine was
optimized for applications with small amounts of parallelism and irregular
memory access patterns. Here the programmability of the memory was used to
create the specialized memory structures needed to support speculation.
However, this flexibility comes at a cost.
The overheads of the
coarse-grain configuration that Smart Memories uses, although modest, are not
negligible; and as the mapping studies show, building a machine optimized for a
specific application will always be faster than configuring a general machine
for that task. Yet the results are promising, since the overheads and resulting
difference in performance are not large. So if an application or set of
applications needs more than one computing or memory model, our re-configurable
architecture can exceed the efficiency and performance of existing separate
solutions.