The Alpha architecture is still the king of performance. Let us look at its future. The
overall performance of an architecture is determined by the following components:
- Ability to exploit various parallelism
- Efficiency of instruction set
- Micro-architecture
- Logic, circuit design and layout
- Process technology
- Compiler quality
- Applications
Let us look closely at each component of overall architecture performance, and at what we
can expect in the near future.
Ability to exploit various parallelism
Today the Alpha architecture contains no special support for extracting more explicit or
speculative parallelism than traditional out-of-order execution provides. The instruction
set contains no instructions that could pass the hardware information allowing it to
exploit more parallelism.
It seems more likely that the Alpha 21464 will focus on extracting more performance
through a multi-threaded design. Instead of switching threads on an L2 (or L1) cache miss
or a TLB miss, instructions from several threads will be scheduled for execution from a
common reorder buffer. This improves resource utilization even when a single thread does
not offer enough parallelism to keep the resources busy, so an 8-issue Alpha can still
extract enough parallelism to gain performance. A focus on throughput rather than on
single-thread performance may be beneficial for servers, but less so for today's
applications, not to mention the scalability issue: just look, for example, at the
scalability of Microsoft SQL Server. It should be said here that single-thread performance
should not be sacrificed for throughput. If the throughput of a single-thread CPU is 1 and
that of a 2-thread CPU is 1.2 (both threads run at 60% of the single-thread speed,
assuming totally independent threads), then in most cases the single-thread CPU should be
preferred. Note that four 2-thread CPUs are, from the scalability point of view, an 8-CPU
system. If in our example the theoretical performance of four 1-thread CPUs is 4 and that
of four 2-thread CPUs is 4.8, the real performance, once scalability is taken into
account, will be lower than that of four single-thread CPUs. A multi-threaded CPU is
obviously an advantage when single-thread performance is not significantly penalized.
However, CPU performance is rising faster than memory latency is falling, so for high-GHz
and/or wide-issue implementations the advantage of multi-threading will be greater,
delivering the expected gains.
IN MY VIEW, MULTITHREADING SHOULD BE USED ONLY WHEN TECHNIQUES FOR IMPROVING SINGLE-THREAD
PERFORMANCE ARE EXHAUSTED. BUT TODAY THERE STILL EXIST TECHNIQUES THAT CAN TAKE
ADVANTAGE OF WIDER-ISSUE IMPLEMENTATIONS TO BOOST SINGLE-THREAD PERFORMANCE.
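The 4 versus 4.8 comparison above can be explored with a small sketch. The text does not
say how scalability losses should be modelled; the code below assumes Amdahl's law with a
fixed serial fraction, which is my choice for illustration and not the author's model. The
function names and the sampled serial fractions are likewise hypothetical.

```python
def run_time(serial_fraction, workers, speed_per_worker):
    """Time for one unit of work under Amdahl's law (an assumed model).

    The serial part runs on one worker; the rest is spread over all workers.
    """
    serial = serial_fraction / speed_per_worker
    parallel = (1.0 - serial_fraction) / (workers * speed_per_worker)
    return serial + parallel


def effective_throughput(serial_fraction, workers, speed_per_worker):
    """Work completed per unit of time for the whole system."""
    return 1.0 / run_time(serial_fraction, workers, speed_per_worker)


# Figures from the text: four single-thread CPUs at speed 1.0 versus
# four 2-thread CPUs, i.e. 8 hardware threads at speed 0.6 each.
for serial_fraction in (0.0, 0.05, 0.10, 0.20):
    single = effective_throughput(serial_fraction, 4, 1.0)
    dual = effective_throughput(serial_fraction, 8, 0.6)
    print(f"serial={serial_fraction:.2f}  4x1-thread={single:.2f}  4x2-thread={dual:.2f}")
```

Under this assumed model the two systems break even at a serial fraction of 1/17 (about
6%); with a larger serial fraction the four single-thread CPUs come out ahead, which is
the effect the paragraph describes.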
Efficiency of instruction set
Here we look in more detail at the Alpha instruction set architecture and its effect on
the future evolution of the Alpha architecture.
Floating-point performance
One very positive aspect of the Alpha instruction set is that it is very orthogonal with
respect to instruction interdependencies. There is no shared flag register for conditional
branches and flag-dependent instructions. There is also no shared accumulator or
multiply-result register, as in some other RISC architectures. Relaxed memory ordering was
one of the key advantages of the Alpha architecture, mainly for in-order implementations.
However, the Alpha instruction set in its present form has some limitations.
The floating-point performance of the Alpha architecture is good today, thanks to very
high-frequency implementations and a sufficient number of registers (although some other
RISC architectures have the same number). Floating-point intensive applications typically
allow much more parallelism to be exploited than integer intensive applications. One of
the key drawbacks of the Alpha architecture is the missing support for high-performance
floating-point implementations. This mostly means floating-point multiply-add,
floating-point multiply-subtract and operations on packed floating-point operands.
Today's 4-issue Alpha implementations contain one floating-point multiplier and one
floating-point adder, so peak performance is 2 flops/tick. With 1 GHz implementations
expected in the near future, this translates into 2 Gflops peak performance. That number
is not impressive next to the Sony Emotion Engine, with its peak of 6 Gflops at 300 MHz,
or next to the 6.4 Gflops peak single-precision floating-point performance of Merced at
800 MHz. Even though real performance differs from peak performance, it is still more than
likely that Alpha (meaning the 21264 and 21364) will lose its leading role.
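The peak figures quoted above follow from one multiplication, and the same relation can be
turned around to see how many floating-point operations per tick the competing chips would
need. This is only back-of-envelope arithmetic on the numbers given in the text; the
function names are made up for the example.

```python
def peak_gflops(flops_per_tick, clock_ghz):
    """Peak Gflops = floating-point operations per tick x clock in GHz."""
    return flops_per_tick * clock_ghz


def implied_flops_per_tick(peak_gflops_value, clock_ghz):
    """Operations per tick needed to reach a quoted peak at a given clock."""
    return peak_gflops_value / clock_ghz


# Alpha: one FP multiplier + one FP adder = 2 flops/tick at 1 GHz.
print("Alpha peak Gflops:", peak_gflops(2, 1.0))

# Peak numbers and clocks as quoted in the text.
print("Emotion Engine flops/tick:", round(implied_flops_per_tick(6.0, 0.3), 1))
print("Merced flops/tick:", round(implied_flops_per_tick(6.4, 0.8), 1))
```

The gap is therefore one of work per tick rather than of clock rate, which is why the text
goes on to argue for packed and fused operations.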
Extending the Alpha instruction set with support for at least packed floating-point
operations would at least close the performance gap. From my point of view, the lack of
fused multiply-add/subtract has the following impact. Typically, the latency of a fused
multiply-add/subtract is smaller than the latency of the separate operations, especially
because one rounding is saved. However, with a properly designed reorder-buffer scheduling
mechanism it is possible to get part of the advantage of a fused operation. The
modification is pretty straightforward. The reorder buffer receives the numbers of the
registers whose results will be computed in the arithmetic units in the next tick. Sending
these a couple of ticks before the result is computed (note that FP operations are
multi-cycle on Alpha and other architectures) allows the subsequent FP add to be selected
earlier, so that a back-to-back FP add can be fused with an FP multiply in progress. At
least the latency can be improved.
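The "one rounding is saved" remark can be made concrete. A separate multiply and add round
twice (once after the product, once after the sum), while a fused operation rounds the
exact result once. The sketch below emulates the fused case with exact rational
arithmetic; it is an illustration of the numerical difference only, not a model of any
Alpha hardware, and the operand values are chosen by me to make the difference visible.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Two roundings: the product is rounded to a double, then the sum is."""
    product = a * b
    return product + c


def fused_multiply_add(a, b, c):
    """One rounding: the exact value of a*b + c is rounded once.

    Emulated in software with exact fractions (assumption: this stands in
    for a hardware fused multiply-add).
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# The exact product is 1 - 2**-54, which a double cannot represent.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print("separate:", multiply_then_add(a, b, c))   # product rounds to 1.0 first
print("fused:   ", fused_multiply_add(a, b, c))  # keeps the small remainder
```

The separate version loses the result entirely because the product is rounded to 1.0
before the subtraction, while the fused version returns the small nonzero remainder.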
Another issue is throughput. To keep pace with the fast development of other
architectures, especially in floating-point intensive applications, Alpha implementations
wider than 4-issue are imperative; the next step required is an 8-issue implementation.
However, if the Alpha instruction set had been designed with support for fused
multiply-add/subtract, a 4-issue implementation would still be enough: 2 fused
multiply-add/subtract + memory + memory/integer.
Demand for more and more FP performance will cause problems with issue width and with the
number of FP register write ports. To sustain 4 multiply-add/subtract operations per tick,
an architecture with fused multiply-add/subtract needs 4 register write ports and 4 issued
instructions, while Alpha needs 8 write ports and 8 issued instructions.
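The 4 versus 8 count above follows from one rule: a fused operation is one instruction
with one result, while a separate multiply and add are two instructions with two results.
A minimal sketch of that bookkeeping, with a hypothetical function name and ignoring loads,
stores and integer work:

```python
def fp_resources_per_tick(multiply_adds_per_tick, fused):
    """Issue slots and FP register write ports needed each tick.

    Assumes every instruction writes exactly one FP register and that
    only the multiply-add work itself is counted.
    """
    instructions_per_operation = 1 if fused else 2
    instructions = multiply_adds_per_tick * instructions_per_operation
    return {"issue_slots": instructions, "fp_write_ports": instructions}


print("fused:   ", fp_resources_per_tick(4, fused=True))
print("separate:", fp_resources_per_tick(4, fused=False))
```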
Adding fused multiply-add/subtract to the Alpha architecture will not be easy, mostly
because there is not much op-code space left for an additional register operand and
additional op-codes. A relatively useful solution would be R[rd] = R[rd] + R[ra] * R[rb]
or R[rd'] = R[rd] + R[ra] * R[rb], where rd' is derived from rd. It is not as efficient as
R[rd] = R[rc] + R[ra] * R[rb], but the performance gain is relatively good.
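The trade-off between the three-register form and the four-register form can be sketched
with a toy register file. The register names, the helper functions and the instruction
counts below are illustrative assumptions; they show where the destructive form costs an
extra move and where it costs nothing.

```python
def fma_3operand(regs, rd, ra, rb):
    """Destructive form from the text: R[rd] = R[rd] + R[ra] * R[rb]."""
    regs[rd] = regs[rd] + regs[ra] * regs[rb]


def fma_4operand(regs, rd, rc, ra, rb):
    """Full form from the text: R[rd] = R[rc] + R[ra] * R[rb]."""
    regs[rd] = regs[rc] + regs[ra] * regs[rb]


def dot_product_3operand(xs, ys):
    """Accumulation overwrites the accumulator anyway: no extra moves."""
    regs = {"acc": 0.0, "x": 0.0, "y": 0.0}
    fp_instructions = 0
    for x, y in zip(xs, ys):
        regs["x"], regs["y"] = x, y  # stands in for loads, not counted
        fma_3operand(regs, "acc", "x", "y")
        fp_instructions += 1
    return regs["acc"], fp_instructions


def multiply_add_keeping_addend(regs, rd, rc, ra, rb):
    """d = c + a*b where c must survive: the destructive form needs a copy."""
    regs[rd] = regs[rc]             # extra register move
    fma_3operand(regs, rd, ra, rb)  # then accumulate into the copy
    return 2                        # instructions issued


print(dot_product_3operand([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Inner loops such as dot products and matrix multiplies are dominated by the accumulating
pattern, which is one way to read the claim that the gain of the restricted form is still
relatively good.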
Another issue for FP performance is the number of FP registers. It is not critical for an
architecture that performs 2 multiply and add/subtract operations per tick. At 4 per tick
it plays a more important role, and beyond that a larger FP register set (I am talking
about architectural, not physical registers) becomes a necessity.
Load/store bandwidth is also an important issue. One way is simply to add another
load/store unit and the corresponding cache ports, but that is a very expensive solution.
Supporting wider load/store operations (on packed, aligned 32-bit or 64-bit FP quantities)
is less expensive and also saves instruction issue bandwidth. I fully understand that the
register-renaming logic becomes significantly more complicated when an instruction writes
2 destination registers, because in theory it requires twice as many write ports to the
register rename structure per instruction. However, with some restrictions the complexity
can be reduced to a manageable level.
The question of supporting 128-bit packed floating-point data types remains. While it
solves several issues at once (it improves effective instruction issue width, increases
register file capacity and improves load/store throughput), it requires substantially more
resources and, above all, a lot of new instructions or modified semantics for existing
instructions.
One fundamental thing for further improving FP performance that is missing in Alpha today
(it is also missing in IA-64, but is included in E2k) is hardware support for
compiler-controlled prefetching. However, this hardware and the related instruction set
extensions could be added to the Alpha instruction set architecture in the future.
The conclusion is that the Alpha instruction set should be extended with the instructions
mentioned above; otherwise it will become an under-performer sooner than one might think.
Integer performance
Improving integer performance is a tougher task than improving floating-point and
multimedia performance.
The first way to improve performance is to reduce various penalties. Reducing the branch
penalty is very important. Techniques other than the very effective branch prediction
mechanism of the Alpha 21264 will help here. One obvious candidate is predicated
instruction execution. Alpha supports only conditional move to reduce the branch penalty.
That leaves memory operations and subprogram calls, for which conditional move alone is
not adequate. However, a sophisticated compiler can play tricks, using conditional move to
change the base address of a load/store instruction in order to simulate predicated
execution. It works like this: if the load/store should be suppressed, the conditional
move places in the base register of the load/store instruction the address of a dummy
memory area reserved for this purpose. The penalty is an increased number of instructions
and extra memory resources. One alternative is to add a guarding instruction that contains
a register-based condition to be applied to the subsequent instruction(s). This
instruction just adds a tag to the micro-operations stored in the reorder buffer.
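The conditional-move trick for suppressing a store can be sketched as follows. Memory is
modelled as a Python list, and `DUMMY_ADDRESS`, `conditional_move` and `predicated_store`
are hypothetical names introduced for the example. The Python conditional expression only
stands in for a hardware conditional move; the point is that the store itself always
executes and only its target changes.

```python
DUMMY_ADDRESS = 0  # hypothetical scratch slot reserved for suppressed stores


def conditional_move(condition, new_value, old_value):
    """Models a conditional move: the result is new_value if condition holds."""
    return new_value if condition else old_value


def predicated_store(memory, address, value, predicate):
    """Store value to address only if predicate is true, without branching
    around the store: a false predicate redirects it to the dummy slot."""
    base = conditional_move(not predicate, DUMMY_ADDRESS, address)
    memory[base] = value  # the store is always executed


memory = [0] * 8
predicated_store(memory, 3, 42, predicate=True)   # lands at address 3
predicated_store(memory, 5, 99, predicate=False)  # lands in the dummy slot
print(memory)
```

The extra conditional move per guarded memory operation and the reserved dummy area are
the instruction-count and memory costs the text mentions.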
In many cases, counted loops with more than a few iterations can be handled with a 100%
branch prediction rate using the immediate operand merging technique (see the articles on
the future of the x86 architecture).
The next issue is a branch prepare instruction. Sometimes a branch is hard to predict even
with dynamic branch prediction. In that case, a branch prepare instruction allows the
branch penalty to be reduced.
Another issue is memory latency. While a load waits for a cache miss to be resolved, none
of the subsequent operations can retire, and they flood the reorder buffer. Even Alpha's
big reorder buffer will fill up on an L2 cache miss. Extending the instruction set
architecture with a load instruction that allows "early retirement" opens the way to
dealing better with memory latency.
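A rough sense of why even a large reorder buffer fills up on a miss comes from dividing
its size by the issue width. The sketch below assumes the front end keeps issuing at full
width while the load is outstanding and that nothing can retire past the load. The
80-entry buffer, 4-wide issue and 100-cycle miss latency are illustrative values chosen
for the example, not figures from the text.

```python
def cycles_until_rob_full(rob_entries, issue_width):
    """Cycles of full-width issue before the reorder buffer is full
    (ceiling division)."""
    return -(-rob_entries // issue_width)


def stalled_cycles(rob_entries, issue_width, miss_latency_cycles):
    """Cycles in which nothing new can be issued while a miss is pending."""
    busy = cycles_until_rob_full(rob_entries, issue_width)
    return max(0, miss_latency_cycles - busy)


# Illustrative parameters (assumptions).
print(cycles_until_rob_full(80, 4))
print(stalled_cycles(80, 4, 100))
```

With these numbers the buffer hides only a fraction of the miss; letting the load retire
early would free the entries behind it.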
One drawback of the Alpha instruction set is that memory operations carry no information
about locality. Including locality information has two effects. First, L1 cache miss
latency improves, because knowing in advance that an access will miss in L1 lets the L2
cache access start early. Second, the cache hit ratio can be improved. Alpha memory
instructions also have no field that can carry a prefetch hint. A really positive feature
of the Alpha 21364 is how it deals with memory latency: it integrates the L2 cache on chip
and reduces the latency of the memory subsystem by integrating the memory controller.
Multimedia performance
The Alpha concept of a simple and fast instruction set architecture results in the
relatively simple MVI instruction set extension. Even though many multimedia tasks require
more processor cycles than on the Pentium III, the faster clock compensates for the
penalty in many cases. But because the frequency disparity between Alpha and the
next-generation x86 architecture seems set to narrow, it is likely that additional
multimedia acceleration will be required to keep pace with the other architectures. It
should not negatively affect frequency, however. Packed multiply is one of the urgent
candidates for inclusion in the instruction set. It is a multi-cycle operation, which does
not create a critical path. Saturated arithmetic is a different question. Most likely,
addition/subtraction with saturation would have to be split into two cycles in a
high-frequency implementation. To maximize performance, a packed multiply
scaled-add/subtract instruction would allow Alpha to defend its performance against the
first implementations of the IA-64 architecture.
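For readers unfamiliar with saturated arithmetic, here is a minimal sketch of a packed
unsigned 8-bit add with and without saturation on a 64-bit word. It is a software model
with hypothetical function names and not a description of any MVI instruction. The
two-step structure (add, then clamp) mirrors why the text expects the operation to need
two cycles at high frequency.

```python
def packed_add_wrap_u8(a, b, lanes=8):
    """Lane-wise 8-bit add; each lane wraps around on overflow."""
    result = 0
    for lane in range(lanes):
        shift = 8 * lane
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (total & 0xFF) << shift
    return result


def packed_add_saturate_u8(a, b, lanes=8):
    """Lane-wise 8-bit add; each lane clamps to 255 on overflow."""
    result = 0
    for lane in range(lanes):
        shift = 8 * lane
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)  # step 1: add
        clamped = min(total, 0xFF)                              # step 2: clamp
        result |= clamped << shift
    return result


# Lane 0 overflows (0xFF + 0x01); lane 1 does not (0x01 + 0x01).
print(hex(packed_add_wrap_u8(0x01FF, 0x0101)))
print(hex(packed_add_saturate_u8(0x01FF, 0x0101)))
```

For pixel or audio data the clamped result is usually the desired one, since a wrapped
bright pixel turns dark.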
Micro-architecture
From the micro-architecture point of view, many of the techniques described for the x86
architecture can be applied. To unleash more parallelism, it is necessary to shorten
dependency chains and to break long instruction dependency chains. Immediate operand
merging, described for the x86 architecture, is very important. Transforming
conditional-move-based code into predicated code on the fly could significantly shorten
dependency chains. Caching the stack in registers, as described in the article about x86,
can unleash more parallelism across subprogram call boundaries by removing store-load
pairs from the critical path.
Because the instruction set architecture lacks support for efficient height reduction of
Boolean expression trees, doing this type of transformation on the fly in hardware will be
costly, but more than welcome.
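Height reduction of a Boolean expression tree means regrouping a chain such as
((a AND b) AND c) AND d into a balanced tree, so that independent pairs can be evaluated
in the same cycle. The sketch below counts the dependent steps in each form; the function
names are mine, and each AND is assumed to take one step.

```python
def and_chain(values):
    """Left-to-right evaluation: every AND depends on the previous one."""
    result = values[0]
    depth = 0
    for value in values[1:]:
        result = result and value
        depth += 1
    return result, depth


def and_tree(values):
    """Balanced evaluation: pairs on each level are independent of each other."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(level[i] and level[i + 1])
            else:
                next_level.append(level[i])  # odd element passes through
        level = next_level
        depth += 1
    return level[0], depth


conditions = [True] * 8
print(and_chain(conditions))  # 7 dependent steps
print(and_tree(conditions))   # 3 dependent steps
```

Both forms give the same answer; only the length of the dependency chain differs, and that
is what a wide-issue machine cares about.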
I think that even an 8-issue implementation of the architecture will be very complex, but
reasonable performance gains could be extracted using code-transformation techniques.
Support for two threads per CPU could be some insurance in case code transformation
techniques, compilers and applications do not live up to expectations for such a wide
implementation.
From my point of view, it is unlikely that the Alpha architecture can be scaled
effectively beyond 8-issue implementations. However, the performance of an 8-issue
implementation could be boosted dramatically by instruction set extensions rather than by
going for a wider implementation.
Another very interesting approach could be chosen in combination with multi-threading: an
ultra-pipelined, high-frequency implementation. While single-thread performance would not
be dramatically penalized, throughput could be increased dramatically. However, logic,
circuit design, layout and other engineering factors make this task extremely difficult.
Logic, circuit design and layout
To this day, Alpha products have been known for excellent logic, circuit design and
layout. While high quality in this area can be assumed to remain, it is likely that
designers of IA-64 and other architectures will soon, step by step, reduce this gap.
Process technology
Process technology is very important. It is hard to overcome a process gap through
architecture. To stay in the leading position, Alpha should be manufactured on a process
at least similar to, or better than, that of other near-future architectures.
Compiler quality
The Alpha system architecture has also been known for good compilers. However, the huge
investment in compiler technology induced by the IA-64 architecture, and the overall shift
to more sophisticated compiler technology, could eliminate any Alpha advantage here.
Applications
Applications are evolving very fast. At the application level, the Alpha architecture
contains no magic bullet that would speed up applications in a new way. With other
architectures migrating to 64-bit addressing, it is unlikely that Alpha can gain
application performance in any way that other present-day architectures cannot.
Conclusion
While today, from a purely technical (not business) point of view, the Alpha architecture
is truly a state-of-the-art and leading architecture, in the near future the overall
architecture should be revisited to keep pace with the development of other architectures,
because otherwise it can easily lose ground. From a longer perspective, in about 10 years,
it is likely that the Alpha architecture (based on today's instruction set decisions) will
fade. This does not mean, however, that the Alpha team is not able to bring a new concept
and, consequently, a new instruction set and architecture.
Site design by Kornel Kiss - Copyright ©1999 Linux3D.net & CPU Gurus