VISC CPU 3X the IPC?

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
If it works, VISC could be a boon for a large swath of the CPU industry. Soft Machines describes itself as being in the business of "licensing and co-developing" VISC products for a range of markets. Some familiar names may be at the front of the line for any licensing deals. Soft Machines says it has raised "over $125 million in funding to date" and counts among its investors Samsung Ventures, AMD, and Mubadala (the Abu Dhabi development company that owns GlobalFoundr

Found this bit interesting. I don't want to turn this into an AMD thread [oh god no], but could this be related to their 25x20 intentions?
 

positivedoppler

Golden Member
Apr 30, 2012
1,112
174
106
New art nobody has. Who cares if AMD happens to invest in this company? This had better not turn into an AMD vs. Intel pissing match. I would, however, love to hear people's expertise on this new chip design.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91

Why is the report dated Oct 27, 2014?

Interesting. Seems like a different and possibly useful idea.

However, the report seems fishy. Someone correct me if I am wrong here.


Things to note.

Compared with a high-IPC processor such as Haswell, the front end is of similar complexity, but the scheduler in each physical core is much simpler, as it manages only a few function units (versus eight in Haswell). The data cache also has fewer ports and can cycle faster. Because the execution resources of both cores can apply to a single thread, however, even a two-core VISC processor can deliver total ALU operations or memory operations per cycle that match or exceed those of Haswell.

But only on a single thread. If the task can be parallelized and is already implemented that way, the performance gain is zero.

Soft Machines has designed and fabricated a test chip that implements its VISC architecture using two physical cores. It refused to discuss the number of function units or other basic microarchitecture capabilities of these cores but characterized them as A15-class CPUs. We interpret this statement to mean that each core can execute three to four operations per cycle with moderate instruction reordering.

The entire benchmark results are measured against A15. Again, this is only valid if the processor is similar in size and complexity to A15.

Because it uses a pipeline with only 10 stages (including some extra stages for VISC scheduling), the CPU cannot match the high clock speeds of leading-edge x86 and ARM processors. We estimate the chip runs at several hundred megahertz. Even at this low speed, it completes some programs in less time than a low-end Haswell processor.

You may see power savings by running at a lower frequency, but IPC must be put into perspective: it's useless without a high frequency.

Despite its relatively simple design, the chip achieves spectacular performance. On the single-thread SPEC2006 test suite, the company reports an average IPC of 2.1, counting ARM instructions rather than VISC instructions. This IPC compares with 0.71 for Cortex-A15 and 1.39 for Haswell. (For consistency, Soft Machines measured the IPC on all three processors using GCC rather than Intel’s favorite compiler, ICC.) Thus, the VISC chip achieved three times the IPC of ARM’s highest-end CPU shipping today and 50% better IPC than Intel’s fastest mainstream CPU.

It is important to note that they are comparing the performance of a single A15 or Haswell core to VISC on multiple cores (2 in this case, or 4 simulated in the bar graph). More die area is needed and more power will be used.
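
For what it's worth, the headline ratios do follow from the quoted IPC numbers, and dividing by the core count shows the per-core picture. A rough sketch using only the figures in the quote above; splitting the VISC IPC evenly across its two physical cores is my own simplification, not something the report states.

```python
# IPC figures quoted from the report (SPEC2006, GCC builds).
VISC_IPC = 2.1       # two physical cores working on one thread
A15_IPC = 0.71       # single Cortex-A15 core
HASWELL_IPC = 1.39   # single Haswell core
VISC_CORES = 2


def ipc_ratio(a, b):
    """Return how many times higher IPC `a` is than IPC `b`."""
    return a / b


vs_a15 = ipc_ratio(VISC_IPC, A15_IPC)
vs_haswell = ipc_ratio(VISC_IPC, HASWELL_IPC)

# Assumption: the single-thread IPC is attributed evenly to both cores.
per_core_ipc = VISC_IPC / VISC_CORES
per_core_vs_a15 = ipc_ratio(per_core_ipc, A15_IPC)

print(f"VISC (2 cores) vs A15:     {vs_a15:.2f}x")
print(f"VISC (2 cores) vs Haswell: {vs_haswell:.2f}x")
print(f"VISC per physical core:    {per_core_ipc:.2f} IPC "
      f"({per_core_vs_a15:.2f}x A15)")
```

So "3x A15" and "50% better than Haswell" are consistent with the quoted numbers, but per physical core the advantage over A15 is closer to 1.5x.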

Although these results are impressive, they require some caveats to put them into perspective. A shorter CPU pipeline reduces branch penalties and other pipeline hazards, thereby improving IPC compared with a longer pipeline. In addition, a low CPU speed reduces the effective latency of caches and main memory (measured in CPU cycles), again improving IPC relative to a CPU with a faster clock. The latter effect might explain why the test chip appears to perform better on SPEC2006 than on SPEC2000, which has a smaller memory footprint.

This is easy to see, especially with wider designs built for low frequency.
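
To put a number on the latency effect: a memory access takes roughly the same wall-clock time regardless of CPU clock, so a slower core waits fewer cycles. A minimal sketch; the 80 ns DRAM latency and the 3 GHz comparison clock are made-up illustrative values, and 300 MHz is only a guess at "several hundred megahertz".

```python
def latency_in_cycles(latency_ns, clock_ghz):
    """Convert a fixed wall-clock latency into CPU cycles at a given clock.

    1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return latency_ns * clock_ghz


DRAM_LATENCY_NS = 80.0  # assumed value, for illustration only

for clock_ghz in (0.3, 3.0):
    cycles = latency_in_cycles(DRAM_LATENCY_NS, clock_ghz)
    print(f"{clock_ghz:.1f} GHz core: a miss to DRAM costs ~{cycles:.0f} cycles")
```

The same miss costs ten times fewer cycles on the slow core, which flatters its IPC without making it any faster in real time.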

Although the test chip has only two physical cores, Soft Machines has run simulations on a four-core design. As one might expect, the performance gains diminish for the additional cores: the third core adds 20–30% to single-thread performance, and the fourth adds only 10–20%. In total, the four-core design delivers about twice the performance of a single core. The unused resources in the extra cores, however, can be devoted to additional threads. For example, a four-core design could run two threads at close to their maximum performance.

[Chart: Dual Virtual Core/A15 IPC Ratio]

This seems great, but I have to ask: how are you getting 4x IPC scaling in your tests when you say later that four cores only double IPC? There seems to be a Dhrystone benchmark showing >4x IPC on a 2C design (last test, PowerPoint). That doesn't make sense unless you are dealing with a wider core than A15, because it is better than perfect scaling. Execution resource efficiency only goes down with more cores, so maximal output per physical core cannot go up. Let me be frank. They are saying that two of their cores achieve 7x the IPC of a single A15 on some tests, with an average of ~3-4x, which is impossible unless the theoretical per-core performance is different or utilization of theoretical core resources increases dramatically, by 2x up to nearly 3.5x.
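
Working backwards from the four-core figures makes the mismatch concrete. A sketch under one stated assumption: I read "the third core adds 20–30%" and "the fourth adds 10–20%" as multiplicative gains over the previous configuration, which the report does not spell out.

```python
# Figures quoted from the report.
FOUR_CORE_SPEEDUP = 2.0          # "about twice the performance of a single core"
THIRD_CORE_GAIN = (0.20, 0.30)   # third core adds 20-30%
FOURTH_CORE_GAIN = (0.10, 0.20)  # fourth core adds 10-20%
VISC_VS_A15 = 2.1 / 0.71         # two-core VISC IPC over A15 IPC


def implied_two_core_speedup(third_gain, fourth_gain):
    """Two-core speedup over one core implied by the four-core total.

    Assumes each extra core's gain multiplies the previous configuration.
    """
    return FOUR_CORE_SPEEDUP / ((1 + third_gain) * (1 + fourth_gain))


two_core_low = implied_two_core_speedup(THIRD_CORE_GAIN[1], FOURTH_CORE_GAIN[1])
two_core_high = implied_two_core_speedup(THIRD_CORE_GAIN[0], FOURTH_CORE_GAIN[0])

# If two cores are ~3x an A15, one VISC core alone must be this many times an A15.
single_core_low = VISC_VS_A15 / two_core_high
single_core_high = VISC_VS_A15 / two_core_low

print(f"Implied 2-core speedup over 1 core: {two_core_low:.2f}x to {two_core_high:.2f}x")
print(f"Implied single VISC core vs A15:    {single_core_low:.2f}x to {single_core_high:.2f}x")
```

Under that reading a single VISC core would already need roughly twice the IPC of an A15, which is hard to square with "A15-class" unless the core is wider or much better utilized.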


Performance-critical applications will benefit from VISC, but the technology can also apply to low-power designs. As the test chip demonstrates, a VISC design can operate at a relatively low clock speed while achieving the same absolute performance as a traditional design operating at a higher clock speed. Thus, it should use less power, particularly if the voltage is reduced as well. Soft Machines, however, declined to reveal the power consumption of its test chip.
Other details also remain undisclosed, including die area. Details of how the processor handles privileged operations, inter-thread synchronization, traps, and interrupts could all affect how well it runs certain applications. Performance could vary widely across different workloads.

Okay. So it gets the same MT performance as an equivalent chip, but when needed it can redistribute cores to the same thread (reverse HT in effect) and get superior IPC for a SINGLE thread.

Edit: I'm also not seeing where they are getting the power savings from. 4x is a lot, but put it in context. Adding a second core gets you an IPC gain of 50-60% but will use >50-60% more power (another core running at 50-60% load). Perhaps you can drop frequencies by 50-60% to save power, but I doubt this will save that much power. Chips tend to perform quite linearly in terms of perf/power in their "sweet" ranges.
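
A first-order way to frame the power question is the usual CMOS dynamic power relation, P ~ C * V^2 * f. The sketch below uses made-up relative voltages and frequencies (nothing here comes from Soft Machines), ignores leakage and any inter-core communication cost, and folds capacitance into the core count.

```python
def relative_dynamic_power(cores, voltage, frequency):
    """First-order dynamic power, P ~ cores * V^2 * f, in arbitrary units.

    Switched capacitance is assumed proportional to the number of active cores.
    """
    return cores * voltage ** 2 * frequency


# One core at nominal voltage and frequency is the baseline.
baseline = relative_dynamic_power(cores=1, voltage=1.0, frequency=1.0)

# Two cores at half the clock, same voltage: no dynamic power saved.
freq_only = relative_dynamic_power(cores=2, voltage=1.0, frequency=0.5)

# Two cores at half the clock and a hypothetical 20% lower voltage.
freq_and_voltage = relative_dynamic_power(cores=2, voltage=0.8, frequency=0.5)

print(f"baseline:                 {baseline:.2f}")
print(f"2 cores, f/2, same V:     {freq_only:.2f}")
print(f"2 cores, f/2, 0.8x V:     {freq_and_voltage:.2f}")
```

In this toy model, lowering frequency alone buys nothing once the second core is powered; any saving comes from the voltage reduction the lower clock allows.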
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Why do I get Bitboys flashbacks? Also, it sounds a lot like a Mitosis-type approach in hardware.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
I'm skeptical, but I wouldn't mind another competitor in the CPU space. If they're truly limited to ~300 MHz, it might be useful for the whole "IoT" space.
 

Khato

Golden Member
Jul 15, 2001
1,225
281
136
Certainly an interesting concept. If I'm understanding it correctly, then a more traditional way to think of it would be dynamic adjustment of the execution width of the 'processor' available to a given 'thread'. So when you have a workload with high ILP (instruction level parallelism) it's possible to allocate far more execution resources to it than with a traditional design. That's why they can show such improvements in SPEC2006. Whereas for a thread with low ILP only a portion of a single physical core need be allocated with the remainder used for either other threads or put in a reduced power state.

I wonder if they have some novel approach to deal with the issue of keeping data synchronized across their physical cores all working on the same thread and the associated power issues? Because instead of local registers it's now quite feasible that a different CPU core would need a result, or similarly multiple cores would all have to copy the same data into their L1 caches - there's a lot of potential for duplication of reads/writes there which adds up in terms of power. Sure their chart implies that the performance per watt is excellent, but that's what happens when you can drop frequency and voltage while maintaining the same level of performance due to high ILP of the program being run.
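
To make the "dynamic execution width" idea concrete, here is a toy model only: a thread's IPC is capped by whichever is smaller, its available ILP or the combined width of the cores allocated to it. The per-core width of 2 is a placeholder (Soft Machines did not disclose it), both function names are my own, and the model deliberately ignores the cross-core register and cache traffic questioned above.

```python
import math


def cores_to_allocate(thread_ilp, per_core_width, total_cores):
    """Smallest number of cores whose combined width covers the thread's ILP."""
    needed = math.ceil(thread_ilp / per_core_width)
    return max(1, min(needed, total_cores))


def achieved_ipc(thread_ilp, per_core_width, cores):
    """IPC is limited by the thread's ILP or by the allocated width."""
    return min(thread_ilp, per_core_width * cores)


PER_CORE_WIDTH = 2.0  # hypothetical sustained ops/cycle per physical core
TOTAL_CORES = 4

for ilp in (1.2, 3.5, 10.0):
    cores = cores_to_allocate(ilp, PER_CORE_WIDTH, TOTAL_CORES)
    ipc = achieved_ipc(ilp, PER_CORE_WIDTH, cores)
    print(f"ILP {ilp:4.1f}: allocate {cores} core(s), IPC up to {ipc:.1f}, "
          f"{TOTAL_CORES - cores} core(s) free for other threads")
```

A low-ILP thread keeps one core and leaves the rest free, while a high-ILP thread spreads across several; a real design would see less than this ideal because of the synchronization overhead.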
 

Nothingness

Platinum Member
Jul 3, 2013
2,769
1,429
136
50% more IPC than Haswell? That looks sexy, but I guess frequency will take a very serious hit, making single-thread performance less sexy in the end.
 

Haserath

Senior member
Sep 12, 2010
793
1
81
CEO and co-founder from Intel, with investments from Samsung, AMD, and Mubadala, and they have working silicon? Sounds promising.

Did AMD ever say Zen was based on x86 (I know the media jumped on x86)? It would seem like this would be the only way they'd ever have a shot at competing with Intel again.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
CEO and co-founder from Intel, with investments from Samsung, AMD, and Mubadala, and they have working silicon? Sounds promising.

Did AMD ever say Zen was based on x86 (I know the media jumped on x86)? It would seem like this would be the only way they'd ever have a shot at competing with Intel again.

Yes they did. Zen is x86 based, K12 is ARM based.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
[Independent review needed]

From the ET article, apparently this is made possible by extracting a lot of ILP, which is kind of surprising given how much ILP is already extracted by architectures like Haswell. Seeing the clock speed, I'd guess they've found a more efficient way of doing this (correct me if I'm wrong).
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
One of the methods for higher IPC is using the same approach as IA64. However, it simply moves the work from hardware to software. Itaniums got higher IPC than Haswell, for example. The 6-issue-wide Itanium could hit an IPC of 5.4 and the 12-issue can hit around 10. But again, it depends a lot on the exact workload, and it's nothing that can be sustained over a more varied workload.
 

Nothingness

Platinum Member
Jul 3, 2013
2,769
1,429
136
5.4 and 10 on SPEC 2006? I seriously doubt it. I guess you meant under some close-to-100% cache hit traffic, right?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
5.4 and 10 on SPEC 2006? I seriously doubt it. I guess you meant under some close-to-100% cache hit traffic, right?

That's the point. It requires a special set of conditions.

As an example, the workload in question runs at an IPC of 5.4 on a 6-issue-wide core when it stays within L2, 4.5 at L3, and drops to a stagnating 0.8 when going to main memory.
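
To show how quickly the slow region dominates, here is a small sketch that blends the three IPC figures above. The instruction mix is invented purely for illustration; the only point is that cycles add up, so the blended IPC is a weighted harmonic mean rather than a plain average.

```python
# IPC figures quoted above for the 6-issue-wide core.
IPC_BY_REGION = {"L2": 5.4, "L3": 4.5, "memory": 0.8}


def blended_ipc(instruction_fractions):
    """Overall IPC when each fraction of instructions runs at its region's IPC.

    Total cycles per instruction is the sum of fraction / IPC for each region.
    """
    cycles_per_instruction = sum(
        fraction / IPC_BY_REGION[region]
        for region, fraction in instruction_fractions.items()
    )
    return 1.0 / cycles_per_instruction


# Hypothetical mix: 10% of instructions execute while stalled on main memory.
mix = {"L2": 0.7, "L3": 0.2, "memory": 0.1}

print(f"All in L2:     {blended_ipc({'L2': 1.0}):.2f}")
print(f"Invented mix:  {blended_ipc(mix):.2f}")
```

With only a tenth of the instructions in the memory-bound region, the blended IPC already falls from 5.4 to roughly 3.3.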
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
What they claim sounds (to me) a lot like the now-famous RHT (reverse HT) myth and Intel's Mitosis. This time it supposedly works out of the box and distributes the instructions from the same thread to different "virtual cores" in real time. The cores are rather narrow and limited by themselves and cannot match any current x86 microarchitecture, so I wonder what kind of MT performance we are looking at here and how many of these "mini virtual cores" are needed to match one Haswell-E (8C/16T) MPU in SPECrate benchmarks. If I had to guess, I'd say 2 or 3x the core count of one Haswell-E MPU.
 

Nothingness

Platinum Member
Jul 3, 2013
2,769
1,429
136
That's the point. It requires a special set of conditions.
I was just referring to the conditions of the slide that quotes the IPC above, which deals specifically with SPEC 2006.

As an example, the workload in question runs at an IPC of 5.4 on a 6-issue-wide core when it stays within L2, 4.5 at L3, and drops to a stagnating 0.8 when going to main memory.
Makes sense. That highlights how good Haswell's memory system is. It is also why I doubt that VISC's frequency won't impact its IPC. I'd like to be proven wrong; a brand-new high-perf microarchitecture would help shake up the status quo.
 

cytg111

Lifer
Mar 17, 2008
23,561
13,121
136
I don't get it. Is this not something akin to "reverse hyperthreading"? I thought that was theorized, debunked, and left4dead years ago.

The core of this technology is essentially the virtualization of multiple CPU cores into a single virtual core that enables much higher single threaded performance.
All I see is the concept of "virtual" thrown into the mix as the magical ingredient that is supposed to make it all work.

How would it work? Anyone care to offer a 'low level' explanation of the theory?
Two cores working on the same thread, sharing of resources, synchronization, etc. It does not compute. In my mind the cores would have to share the total address space, including read rights to each other's registers (and pipelines?). Nope, can't wrap my head around it. Though, all that funding, those specific people... Who knows. When can I buy one?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
I was just referring to the conditions of the slide that quotes the IPC above, which deals specifically with SPEC 2006.

Makes sense. That highlights how good Haswell's memory system is. It is also why I doubt that VISC's frequency won't impact its IPC. I'd like to be proven wrong; a brand-new high-perf microarchitecture would help shake up the status quo.

The example used was a 6-issue Itanium, not Haswell. Note that today there are 12-issue Itaniums.

SPEC is also very multicore friendly.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
What they claim sounds (to me) a lot like the now-famous RHT (reverse HT) myth and Intel's Mitosis. This time it supposedly works out of the box and distributes the instructions from the same thread to different "virtual cores" in real time. The cores are rather narrow and limited by themselves and cannot match any current x86 microarchitecture, so I wonder what kind of MT performance we are looking at here and how many of these "mini virtual cores" are needed to match one Haswell-E (8C/16T) MPU in SPECrate benchmarks. If I had to guess, I'd say 2 or 3x the core count of one Haswell-E MPU.

It would be way, way more. Remember, the massive single-thread IPC gains only come with the use of multiple cores. Using only one core, the IPC is constant. So if using an A15-class core, MT is only equivalent to the number of A15 cores you have.

With a 300 MHz clock it would take a plethora of cores to match an 8C Haswell.

It is also worth noting that this may do nothing for SINGLE THREAD performance. 3x Haswell IPC at 300 MHz is still only equivalent on a single thread to a 900 MHz Haswell.
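
Rough numbers for both points, treating performance as IPC times clock. Everything here is back-of-the-envelope: 300 MHz is this thread's guess at the test chip clock, the 3.0 GHz Haswell-E clock is my assumption, Hyper-Threading is ignored, and MT scaling is taken as perfect.

```python
import math


def equivalent_clock_mhz(ipc_ratio, clock_mhz):
    """Clock a reference core needs to match a core with `ipc_ratio` times its IPC."""
    return ipc_ratio * clock_mhz


def cores_to_match(ref_cores, ref_ipc, ref_ghz, core_ipc, core_ghz):
    """Cores needed to equal aggregate throughput, assuming perfect MT scaling."""
    return math.ceil((ref_cores * ref_ipc * ref_ghz) / (core_ipc * core_ghz))


VISC_CLOCK_MHZ = 300   # guess from this thread, not a disclosed figure
HASWELL_E_GHZ = 3.0    # assumed clock for the comparison

# Single thread: the hypothetical 3x Haswell IPC case, then the report's 2.1 vs 1.39.
print(f"3x Haswell IPC at 300 MHz   ~ {equivalent_clock_mhz(3.0, VISC_CLOCK_MHZ):.0f} MHz Haswell")
print(f"2.1/1.39 IPC at 300 MHz     ~ {equivalent_clock_mhz(2.1 / 1.39, VISC_CLOCK_MHZ):.0f} MHz Haswell")

# Multi-thread: A15-class cores (IPC 0.71) at 300 MHz against 8 Haswell cores (IPC 1.39).
needed = cores_to_match(8, 1.39, HASWELL_E_GHZ, 0.71, VISC_CLOCK_MHZ / 1000)
print(f"A15-class cores at 300 MHz to match 8C Haswell-E: ~{needed}")
```

With the report's own IPC figures, the single-thread equivalent is closer to a 450 MHz Haswell, and the core count lands well above the 2-3x guess.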
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
Two cores working on the same thread, sharing of resources, synchronization, etc. It does not compute. In my mind the cores would have to share the total address space, including read rights to each other's registers (and pipelines?). Nope, can't wrap my head around it.

Yes, the cores have to share a single set of registers, and I assume they would also have register bypass networks between cores. This could only work if the cores were all very close to each other, and/or it runs at a very low frequency (which seems to be the case). I can't really imagine this leading to a truly high performance (not just high IPC/low clock speed) design.

SPEC is also very multicore friendly.

SPEC CPU2006 is a set of strictly single-threaded workloads. There's another thing called SPECrate, which is where you just run multiple copies of SPEC CPU2006 workloads as separate processes, and then measure the overall throughput of the system. There are other SPEC workloads like SPEC JVM, SPECjEnterprise, and others, that are truly multithreaded, but they are not nearly as popular as SPEC CPU2006 among computer architects.
 

SarahKerrigan

Senior member
Oct 12, 2014
606
1,486
136
The example used was a 6-issue Itanium, not Haswell. Note that today there are 12-issue Itaniums.

SPEC is also very multicore friendly.

Poulson is only 12-issue in the sense that Haswell is 8-issue. It has a 6-wide frontend; the 12-wide backend is only used for replays.

SPEC is single-threaded. Sadly, SPEC2006 reporting rules allow autoparallelization of that single-threaded code, which makes it somewhat less meaningful.
 