
Future of Alpha architecture
Ability to exploit various parallelism
Efficiency of instruction set
Floating-point performance
Integer performance
Multimedia performance
Micro-architecture
Logic, circuit design and layout
Process technology
Compiler quality
Applications
Conclusion


The Alpha architecture is still the king of performance. Let us look at its future. The overall performance of an architecture is determined by the following components:

- Ability to exploit various parallelism
- Efficiency of instruction set
- Micro-architecture
- Logic, circuit design and layout
- Process technology
- Compiler quality
- Applications

Let us take a closer look at each component of overall architecture performance, and at what we can expect in the near future.


Ability to exploit various parallelism

Today the Alpha architecture contains no special support for extracting more explicit or speculative parallelism than traditional out-of-order execution provides. The instruction set contains no instructions that could pass the hardware information allowing it to exploit more parallelism.
    It seems more likely that the Alpha 21464 will focus on extracting more performance through a multi-threaded design. Rather than switching threads on an L2 (or L1) or TLB miss, instructions from several threads will be scheduled for execution from a common reorder buffer, improving resource utilization even when one thread lacks sufficient parallelism to utilize the resources efficiently. In this way an 8-issue Alpha can still extract enough parallelism to gain performance. A focus on throughput-oriented rather than single-thread performance may be beneficial for servers, but less so for today's applications; and scalability is an issue in its own right (look, for example, at Microsoft SQL Server scalability). It should be said here that single-thread performance should not be sacrificed for throughput. If a single-thread CPU has a throughput of 1 and a 2-thread CPU has a throughput of 1.2 (that is, both threads run at 60% of the single-thread speed, assuming completely independent threads), then in most cases the single-thread CPU should be preferred. Note that, from the scalability point of view, four 2-thread CPUs form an 8-CPU system: if the theoretical performance of four 1-thread CPUs is 4 and that of four 2-thread CPUs is 4.8, the real performance, once scalability is taken into account, will be less than the performance of the four single-threaded CPUs. A multi-threaded CPU is obviously an advantage when single-thread performance is not significantly penalized. Moreover, CPU performance is rising at a faster pace than memory latency is falling, so for high-GHz and/or wide-issue implementations the advantage of multi-threading will be greater, delivering the expected gains.
IN MY VIEW, MULTITHREADING SHOULD BE USED ONLY WHEN THE TECHNIQUES FOR SINGLE-THREAD PERFORMANCE IMPROVEMENT ARE EXHAUSTED. TODAY, HOWEVER, THERE STILL EXIST TECHNIQUES THAT CAN TAKE ADVANTAGE OF WIDER-ISSUE IMPLEMENTATIONS TO BOOST SINGLE-THREAD PERFORMANCE.


Efficiency of instruction set

Here we will look in more detail at the Alpha instruction set architecture and its effect on the future evolution of the Alpha architecture.


Floating-point performance

One very positive aspect of the Alpha instruction set is that it is very orthogonal with respect to instruction interdependencies. There is no shared flag register linking conditional branches to flag-setting instructions, and there is no shared accumulator or multiply-result register as in some other RISC architectures. Relaxed memory ordering was one of the key advantages of the Alpha architecture, especially for in-order implementations. However, the Alpha instruction set in its present form contains some limitations.

The floating-point performance of the Alpha architecture is good today, thanks to very high-frequency implementations and a sufficient number of registers (though the same number of registers can be found in some other RISC architectures). Floating-point-intensive applications typically allow much more parallelism to be exploited than integer-intensive applications. One key drawback of the Alpha architecture is its missing support for high-performance floating-point implementations, chiefly floating-point multiply-add, floating-point multiply-subtract, and operations on packed floating-point operands.
With today's 4-issue Alpha implementations, containing one floating-point multiplier and one floating-point adder, peak performance is 2 flops/tick. With the 1 GHz implementations we expect in the near future, this translates into a 2 Gflops peak. That number is not impressive in comparison with the Sony Emotion Engine's peak of 6 Gflops at 300 MHz, or with the 6.4 Gflops peak single-precision floating-point performance of Merced at 800 MHz. Even though real performance differs from peak, it is still more than likely that Alpha (meaning the 21264 and 21364) will lose its leading role.
Extending the Alpha instruction set with support for at least packed floating-point operations would allow it at least to close the performance gap. From my point of view, the lack of fused multiply-add/subtract has the following impact. Typically, the latency of a fused multiply-add/subtract is smaller than the latency of the separate operations, especially since one rounding step is saved. However, with a proper design of the reorder-buffer scheduling mechanism it is possible to get part of the advantage of the fused operation. The modification is fairly straightforward: the reorder buffer receives the register numbers whose results will be computed in the next tick in the arithmetic unit. Sending these a couple of ticks before the result is computed (note that FP operations are multi-cycle on Alpha and other architectures) allows a dependent FP add to be selected earlier, effectively fusing a back-to-back FP add with the FP multiply in progress. At the very least, latency can be improved.
Another issue is throughput. To keep pace with the fast development of other architectures, especially in floating-point-intensive applications, wider than 4-issue implementations of Alpha are imperative; an 8-issue implementation is required next. However, if the Alpha instruction set were designed with support for fused multiply-add/subtract, a 4-issue implementation would still be enough: 2 fused multiply-add/subtract + memory + memory/integer.
    The demand for ever more FP performance will create problems with issue width and with the number of FP register write ports. An architecture with 4 fused multiply-add/subtract units needs 4 register write ports and 4 issued instructions, while Alpha would need 8 write ports and 8 issued instructions.
    Adding fused multiply-add/subtract to the Alpha architecture will not be easy, mostly because there is not much opcode space left for an additional register operand and additional opcodes. A relatively useful solution would be R[rd] = R[rd] + R[ra] * R[rb], or R[rd'] = R[rd] + R[ra] * R[rb] where rd' is derived from rd. It is not as efficient as R[rd] = R[rc] + R[ra] * R[rb], but the performance gain is still relatively good.
    Another FP-performance issue is the number of FP registers. While it is not a critical issue for an architecture with 2 multiply and add/subtract operations per tick, for 4 multiply and add/subtract operations per tick it plays a more important role, and beyond that a larger architectural (not physical) FP register set becomes more than necessary.
    Load/store bandwidth is also an important issue. One way is simply to add another load/store unit and the corresponding cache ports, but that is a very expensive solution. Supporting wider load/store operations (on packed, aligned 32- or 64-bit FP quantities) is a cheaper way and also saves instruction issue bandwidth. I fully understand that the register-renaming logic becomes significantly more complicated when 2 destination registers are written, because in theory it requires twice as many write ports to the register-rename structure per instruction. However, with some restrictions it is possible to reduce the complexity to a manageable level.
    The question of supporting 128-bit packed floating-point data types remains. While it solves several issues at once (improved effective instruction issue width, increased register file capacity, improved load/store throughput), it requires substantially more resources, and above all many new instructions or modified semantics for existing ones.
    One fundamental issue in further improving FP performance is hardware support for compiler-controlled prefetching, which is missing in Alpha today (it is also missing in IA-64, but is included in E2k). However, this hardware and the related instruction-set extensions could be added to the Alpha instruction set architecture in the future.

The conclusion is that the Alpha instruction set should be extended with the instructions mentioned above; otherwise it will be an under-performer in a future not as distant as one might think.


Integer performance

Improving integer performance is a tougher task than improving floating-point or multimedia performance.
The first way to improve performance is to reduce various penalties, and reducing the branch penalty is especially important. Other techniques will be helpful here, in addition to the very effective branch prediction mechanism included in the Alpha 21264. One obvious candidate is predicated instruction execution. Alpha supports only a conditional move for reducing branch penalties; memory operations and subprogram calls remain, for which a conditional move alone is not adequate. However, a sophisticated compiler can play tricks, using conditional moves to change the base address of a load/store instruction and thereby simulate predicated execution: if the load/store should be suppressed, a conditional move places the address of a dummy memory area, reserved for this purpose, into the base register of the load/store instruction. The penalty is an increased instruction count and extra memory. Another way would be to add a guarding instruction carrying a register-based condition that applies to the subsequent instruction(s); such an instruction would simply add a tag to the micro-operations stored in the reorder buffer.
In many cases, counted loops with more than a few iterations can be handled with a 100% branch prediction rate using the immediate-operand-merging technique (see the Future of x86 architecture articles).
The next issue is a branch-prepare instruction. Sometimes a branch is hard to predict even with dynamic branch prediction; in that case, a branch-prepare instruction allows the branch penalty to be reduced.

Another issue is memory latency. When a load waits for a cache miss to be resolved, none of the subsequent operations can retire, flooding the reorder buffer; even Alpha's large reorder buffer will fill up on an L2 cache miss. Extending the instruction set architecture with a load instruction that allows "early retirement" opens the way to dealing better with memory latency.
    One drawback of the Alpha instruction set is that memory operations carry no locality information. Including locality information has two effects: first, L1 cache miss latency improves, since the hardware knows an L1 miss is coming and can start the L2 cache access early; second, the cache hit ratio can improve. Alpha memory instructions also lack a field that could carry a prefetch hint. What is genuinely positive on the Alpha 21364 is how it deals with memory latency, integrating the L2 cache on chip and reducing the latency of the memory subsystem by integrating the memory controller.


Multimedia performance

Alpha's concept of a simple and fast instruction set architecture results in the relatively simple MVI instruction set extension. Even though many multimedia-related tasks require more processor cycles than on a Pentium III, the faster clock compensates for the penalty in many cases. But since the frequency disparity between Alpha and the next generation of the x86 architecture seems set to narrow, it is likely that additional multimedia acceleration will be required to keep pace with the rest of the architectures; it should not negatively affect frequency, however. Packed multiply is one urgent candidate for inclusion in the instruction set: it is a multi-cycle operation that does not create a critical path. Saturated arithmetic is a different question; most likely, addition/subtraction with saturation should be split into two cycles for high-frequency implementations. For maximum performance, a packed multiply scaled-add/subtract instruction would allow Alpha to defend its performance against the first implementations of the IA-64 architecture.


Micro-architecture

From the architectural point of view, many of the techniques described for the x86 architecture can be applied. To unleash more parallelism, the length of dependency chains must be reduced and long instruction dependency chains must be broken. The immediate-operand merging described for the x86 architecture is very important. Transforming conditional-move-based code into predicated code on the fly could significantly shorten dependency chains. Using stack caching in registers, as described in the article about x86, can unleash more parallelism across subprogram-call boundaries by removing store-load pairs from the critical path.
    Because the instruction set architecture lacks support for efficient height reduction of Boolean expression trees, doing this type of transformation on the fly in hardware would be costly, but more than welcome.
    I think that even an 8-issue implementation of the architecture will be very complex, but the code-transformation techniques could extract reasonable performance gains from it. Support for two threads per CPU could be some insurance in case code-transformation techniques, compilers, and applications do not live up to expectations for such a wide implementation.
    From my point of view, it is unlikely that the Alpha architecture can be scaled effectively beyond 8-issue implementations. However, the performance of an 8-issue implementation could be dramatically boosted by instruction-set extensions, rather than by going to a wider implementation.
    Another very interesting approach could be chosen in combination with multi-threading: an ultra-pipelined, high-frequency implementation. While single-thread performance would not be dramatically penalized, throughput could be increased dramatically. However, logic, circuit design, layout, and other engineering factors make this task extremely difficult.


Logic, circuit design and layout

To this day, Alpha products have been known for excellent logic, circuit design, and layout. While it can be assumed that the high quality in this area will remain, it is likely that the designers of IA-64 and other architectures will step by step close this gap.


Process technology

Process technology is very important: it is hard to overcome a process gap through architecture. To stay in the leading position, Alpha must be produced on a process at least similar to, or better than, those of the other near-future architectures.


Compiler quality

The Alpha system architecture has also been known for good compilers. However, the huge investment in compiler technology induced by the IA-64 architecture, and the overall shift to more sophisticated compiler technology, could eliminate any Alpha advantage here.


Applications

Applications are evolving very fast. At the application level, the Alpha architecture contains no magic bullet that would allow applications to be sped up in a new way. With other architectures migrating to 64-bit addressing, it is unlikely that Alpha can gain application performance in any way that today's other architectures cannot.


Conclusion

While today, from a purely technical (not business) point of view, the Alpha architecture is truly state of the art and the leading architecture, in the near future the overall architecture should be revisited to keep pace with the development of other architectures, because otherwise it can easily lose ground. In a longer perspective, of about 10 years, it is likely that the Alpha architecture (based on today's instruction set decisions) will fade. That does not mean, however, that the Alpha team is not capable of bringing out a new concept and, consequently, a new instruction set and architecture.

Site design by Kornel Kiss - Copyright ©1999 Linux3D.net & CPU Gurus