VISC CPU 3X the IPC?

Bubbleawsome · Oct 25, 2014

Huh. I won't claim to understand the specifics of any chip design, but wouldn't this be fairly limited use? At least to your normal user? It sounds like it would be great for gaming, but I imagine there would be core-syncing issues. There is a start-up I know of that could use two-cores-as-one but they doubt they could get them to sync well enough to make the game stable. Other things like web browsing don't need a super fast core really. Is this aimed purely at research and business class stuff?

DrMrLordX · Oct 25, 2014

ShintaiDK said:
Why do I get bitboys flashback? Also it sounds a lot like mitosis type in hardware.

It sounds exactly like speculative threading in hardware. If this technology works as well as the paper leads on, then clockspeed deficits could be overcome by just throwing more cores onto the CPU somehow and committing them to a single virtual core. Presumably, there will be limits as to how many actual cores can be shared by one virtual core.

Imagine if Piledriver could do this. Imagine if POWER8 could do this. Imagine if Xeon Phi could do this!

Xpage · Oct 25, 2014

I guess this would be best fit for mobile, 300mhz can be sipping power and deliver performance closer to mobile processors at 1.5ghz. Would be nice if they could crank it up to 1ghz then it'd be enough for tablets and notebooks.

cytg111 · Oct 25, 2014

Very sceptical, highly interested.

Haserath · Oct 25, 2014

http://www.extremetech.com/extreme/...nceptual-breakthrough-weve-been-waiting-for/2

The JPEG and color compression test data compared a dual-virtual-core VISC test chip with a 1MB L2 built on 28nm technology against the Asus Transformer Book T100A (Atom Z3740), an HP laptop with a Celeron-class Haswell (Intel 3550M, dual-core, no HT, 2.3GHz) and a Samsung Chromebook with a Cortex-A15 clocked at 1.7GHz. Run times and measured clock speeds are shown below:

Power consumption is listed as about the same, which doesnt tell us a great deal at this stage given that the VISC cores are in prototype

That's all there is for performance and power I have seen. Would have expected lower power.

SAAA · Oct 25, 2014

Haserath said:
That's all there is for performance and power I have seen. Would have expected lower power.

Well color me impressed if they can really do this in common workloads.
Why? Because it's spectacular to know that we can have 3x Haswell IPC when we are fighting everyday with a clock barrier around 4-5GHz that's physically impossible to break without new nodes.
Similar power consumption to a 1GHz Haswell is still decent, just try and get that thing to 2-3GHz, even at 200W and we will finally have a decent jump in single thread since long. :twisted:

Enigmoid · Oct 25, 2014

Haserath said:
http://www.extremetech.com/extreme/...nceptual-breakthrough-weve-been-waiting-for/2

That's all there is for performance and power I have seen. Would have expected lower power.

One wonders what the point of a singlethread test for an easily multithreadable tasks is.

Again. Is this an A15 class chip? Otherwise what black magic is being used to make a supposedly A15 class chip operate with the total throughput of a single A15 core clocked nearly 5 times higher? Theoretical utilization must be up by 243% if this is true.

Also note that for their power comparisons the A15 is on 32 nm.

http://www.notebookcheck.net/Samsung-Exynos-5250-Dual-SoC.86886.0.html

NostaSeronx · Oct 25, 2014

My research into the Soft Machines architecture. Make it out to be a Matrix-based MIMD architecture.

The closest architecture to it will be the IBM Cell Broadband Engine processor.

http://www.blachford.info/computer/Cell/Cell1_v2.html

There is also the ViRAM which matches the clock rate to x86 processors;
http://csl.stanford.edu/~christos/publications/2000.HotChips.VIRAM.pdf

videogames101 · Oct 25, 2014

DrMrLordX said:
It sounds exactly like speculative threading in hardware. If this technology works as well as the paper leads on, then clockspeed deficits could be overcome by just throwing more cores onto the CPU somehow and committing them to a single virtual core. Presumably, there will be limits as to how many actual cores can be shared by one virtual core.

Imagine if Piledriver could do this. Imagine if POWER8 could do this. Imagine if Xeon Phi could do this!

For single threaded tasks which can't reasonably be threaded anyways, I can't imagine a single "virtual" core being any more efficient than a single large core, otherwise you could just make a single large core. You still have the same instruction dependencies and still have the same instruction latency which are bottle-necking your single core.

Idontcare · Oct 25, 2014

Enigmoid said:
One wonders what the point of a singlethread test for an easily multithreadable tasks is.

It is the tide that lifts all boats.

From legacy apps to apps that won't be released until next year, if you can improve the single-threaded performance then the performance can go up across the board.

What's not to like if it is true? The approach itself is generic enough that there is the opportunity for any and all MPU architectures to morph to it in time.

If it is more innovation and less marketing gimmick that is.

As people are rightly noting, we've been down this road so many times already. As an industry, as a consumer segment, as a venture capitalist investor, etc.

Whether it was BitBoys, Transmeta, KillerNic, Lucid's Virtu, Infinium console (remember those guys!?), reverse-hyperthreading, mitosis, Larrabee, etc.

There has been no shortage of high-profile flash in the pans that come and go without generating much beyond a ton of hype, and put a few bucks into the pockets of the tech-journalists who at least got to make a mortgage payment or two by writing about it.

My first impressions of the marketing claims surrounding VISC is that it has an unavoidable air to it that cannot help but remind me of the marketing claims and hype that preceded the launch of Lucidlogix's "Hydra" which later morphed into "Virtu".

Sound too good to be true? Yes. Did we see it working? Sure. Do we have performance numbers? Not yet. So there's the rub for us. We really want to put this thing through its paces before we sign off on it. Running on both UT3 and Crysis (DX9 only for now -- DX10 before the product ships though) is cool, but claiming application agnostic linear scaling over an arbitrary number of GPUs of differing capability is a tough pill to swallow without independent confirmation.

http://www.anandtech.com/show/2596

Fast forward to today and you can't even find a mention of Lucidlogix's Hydra/Virtu on their homepage: http://lucidlogix.com/

Now we have the VISC, which at this time is walking like a duck and quacking like a duck. That puts the odds in favor of it too being just the latest duck.

However, If I allow myself to momentarily suspend all disbelief whilst simultaneously abandoning my history books, I can convince myself there is merit to the approach and some truth to claims of value being added.

But even within that narrowly defined perch of plausibility, the VISC strikes me as being no more widely applicable than say the Quantum Computing initiative which also finds itself as being technically legitimate but applicable to a narrow range of problem types.

I think we can safely conclude the VISC CPU is not smoke and mirrors the likes of BitBoy or the Infinium console given that the hardware exists and benches have been produced (internal only though, so outright fraud cannot be ruled out yet), but only time will tell us if the VISC is simply useless (akin to Lucidlogix Virtu) or actually has a niche viability within a limited class of compute problems (similar to quantum computing).

Idontcare · Oct 25, 2014

videogames101 said:
For single threaded tasks which can't reasonably be threaded anyways, I can't imagine a single "virtual" core being any more efficient than a single large core, otherwise you could just make a single large core. You still have the same instruction dependencies and still have the same instruction latency which are bottle-necking your single core.

Agreed.

To say it another way, one should always achieve higher performance the closer to "metal" you are operating, and it should perform all the faster the more your hardware (the metal) is dedicated hardware versus all-purpose generic hardware.

The VISC appears to be taking you two steps away from running on dedicated hardware and that much farther from the metal.

At this point it is starting to look more like magic and less like science, but that could just be a sign that my science knowledge is now too far behind the leading edge.

VirtualLarry · Oct 25, 2014

Sounds almost similar to something I proposed a few years back on these boards. Designing CPUs, like GPUs, and have an array of processing cores / resources, and have some sort of hardware thread manager that assigns those resources to ISA threads. Though I would have no idea how to actually implement that in hardware. (Design-wise, I imagined it much like ATI's R1950 cards.)

Enigmoid · Oct 25, 2014

Idontcare said:
It is the tide that lifts all boats.

From legacy apps to apps that won't be released until next year, if you can improve the single-threaded performance then the performance can go up across the board.

Perhaps I should have been more clear. The problem with using a highly multithreadable task is that it doesn't really show any single thread improvement on tasks that generally tend to be singlethreaded. It avoids the dependencies, branches, and other constraints that generally make code very hard to thread. I want to see how this issue is being addressed over multiple cores. For dependent code I can't see this approach going any faster for example.

Also note that you are not directly improving the singlethread performance of the core, instead you are allowing two cores to operates on the same thread.

No one seems to have addressed my question with the inconsistency in the claims; "the second core only adds 50%", "3-4 times the IPC with two cores".

pw257008 · Oct 25, 2014

videogames101 said:
For single threaded tasks which can't reasonably be threaded anyways, I can't imagine a single "virtual" core being any more efficient than a single large core, otherwise you could just make a single large core. You still have the same instruction dependencies and still have the same instruction latency which are bottle-necking your single core.

Except, maybe, in power and/or cost. I realize power consumption is "about equal" or whatever they said now--I'm thinking more for the future.

Maximilian · Oct 25, 2014

DrMrLordX · Oct 25, 2014

videogames101 said:
For single threaded tasks which can't reasonably be threaded anyways, I can't imagine a single "virtual" core being any more efficient than a single large core, otherwise you could just make a single large core. You still have the same instruction dependencies and still have the same instruction latency which are bottle-necking your single core.

Well, think about it like so: imagine that you've got a 4P machine with . . . oh I don't know, something as good as or better than QPI serving as the socket interconnect. If you've got a Xeon box and you're running one chunk of superbly-threaded software, then that one piece of software can utilize all the cores in all four CPUs, thanks in no small part of the QPI links' speed and low latency.

Now take the same basic platform, but replace it with VISC CPUs and replace the software with something that is single-threaded. In theory, thanks to the same quick/low latency QPI links, you should be able to commit all the "real" cores from all the CPUs to one virtual CPU, and make that thing run your one software thread.

There are limits to how much die space chip design firms can or will commit to a single core, hence the modern movement towards multicore processors along with the continued reliance on multiprocessor systems in enterprise or HPC environments (at the very least). Correct me if I am wrong, but the overall fp/integer throughput of these high-thread-capacity chips (Xeon Phi is a standout) per unit of die area seems to be higher than processors with fewer, larger cores dedicated to processing a smaller thread count, at least when comparing across the same (or similar) process nodes.

It also gives you the advantage of reconfiguring your processor (processors?) based on your workload. A VISC chip appears to be able to commit 100% of its resources to software no matter how many threads it spawns, whereas a conventional multicore CPU remains inefficient until processing at least n threads, where n = the CPUs intended thread count (which may be higher than the number of cores, think SMT/CMT).

In circumstances where the thread count always exceeds the system's overall thread capacity, then the VISC approach would appear to offer no real advantage. Just keep in mind that the VISC approach would permit the use of mass thread-parallelism in CPU design for general-purpose CPUs. We might see more stuff like the T1/T2, POWER8, Xeon Phi, or AMD's never-produced K9 design.

Idontcare said:
What's not to like if it is true? The approach itself is generic enough that there is the opportunity for any and all MPU architectures to morph to it in time.

If it is more innovation and less marketing gimmick that is.

Correct. Many promises are being made, but very little has been presented for examination. Time will tell.

videogames101 · Oct 25, 2014

DrMrLordX said:
Well, think about it like so: imagine that you've got a 4P machine with . . . oh I don't know, something as good as or better than QPI serving as the socket interconnect. If you've got a Xeon box and you're running one chunk of superbly-threaded software, then that one piece of software can utilize all the cores in all four CPUs, thanks in no small part of the QPI links' speed and low latency.

Now take the same basic platform, but replace it with VISC CPUs and replace the software with something that is single-threaded. In theory, thanks to the same quick/low latency QPI links, you should be able to commit all the "real" cores from all the CPUs to one virtual CPU, and make that thing run your one software thread.

There are limits to how much die space chip design firms can or will commit to a single core, hence the modern movement towards multicore processors along with the continued reliance on multiprocessor systems in enterprise or HPC environments (at the very least). Correct me if I am wrong, but the overall fp/integer throughput of these high-thread-capacity chips (Xeon Phi is a standout) per unit of die area seems to be higher than processors with fewer, larger cores dedicated to processing a smaller thread count, at least when comparing across the same (or similar) process nodes.

It also gives you the advantage of reconfiguring your processor (processors?) based on your workload. A VISC chip appears to be able to commit 100% of its resources to software no matter how many threads it spawns, whereas a conventional multicore CPU remains inefficient until processing at least n threads, where n = the CPUs intended thread count (which may be higher than the number of cores, think SMT/CMT).

In circumstances where the thread count always exceeds the system's overall thread capacity, then the VISC approach would appear to offer no real advantage. Just keep in mind that the VISC approach would permit the use of mass thread-parallelism in CPU design for general-purpose CPUs. We might see more stuff like the T1/T2, POWER8, Xeon Phi, or AMD's never-produced K9 design.

Correct. Many promises are being made, but very little has been presented for examination. Time will tell.

The major point here is that some software cannot be threaded, and if it could be then you could get the same performance form a normal multi-core CPU and proper programming. This is mostly because some data is used after it is computed. Simplistically:

A + B = C
D = A + C

You can't compute the second part until you compute the first part. It doesn't matter how many virtual cores, threads, VISC CPU's, or whatever, the second instruction cannot proceed until the first one is (mostly) completed.

Speaking just to single-threaded programs that could theoretically be threaded but simply were not written that way, I suppose I can imagine VISC is trying to spread different instructions in a single thread onto different cores, but then all it's doing is creating a single very large OoO core with resources spread across the four cores. Which will just be worse than a single very large OoO core for those very applications. Oh, and those very large cores suck because the amount of re-ordering you can do is limited. ...So then VISC must suck more

Idk, just musing - I hope they deliver.

monstercameron · Oct 25, 2014

how would this work in a gpu configuration and gpu workloads?

DrMrLordX · Oct 26, 2014

videogames101 said:
Idk, just musing - I hope they deliver.

Ditto. I couldn't even begin to tell you how they really get multiple cores working on the same thread, unless they are reconfiguring the hardware on the fly.

monstercameron said:
how would this work in a gpu configuration and gpu workloads?

Good question. All those little shaders . . .

tempestglen · Oct 26, 2014

http://wccftech.com/amd-invest-cpu-ipc-visc-soft-machines/

Arachnotronic · Oct 26, 2014

tempestglen said:
http://wccftech.com/amd-invest-cpu-ipc-visc-soft-machines/

The most useful part of that slide, IMO, is the comparison of the Apple Cyclone, ARM A15, ARM A57, and Haswell.

geoxile · Oct 26, 2014

Intel17 said:
The most useful part of that slide, IMO, is the comparison of the Apple Cyclone, ARM A15, ARM A57, and Haswell.

It'd be more useful if the code were 64-bit for the A57 and Cyclone.

Haserath · Oct 26, 2014

Intel17 said:
The most useful part of that slide, IMO, is the comparison of the Apple Cyclone, ARM A15, ARM A57, and Haswell.

One thing about that graph is they used the GCC for Intel's CPUs as well. ICC would help the performance significantly.

Nothingness · Oct 26, 2014

Haserath said:
One thing about that graph is they used the GCC for Intel's CPUs as well. ICC would help the performance significantly.

Using the same compiler is a good way to compare things on a level playing field. It's very well known that icc is heavily tuned for SPEC, which makes comparison quite biased.

For that last ten years, every time I tried icc on the various codes I worked on, it brought no speedup over gcc, not all code can benefit from icc (note it seems to shine at vectorizing, my code simply doesn't benefit from that as a lot of code).

Nothingness · Oct 26, 2014

geoxile said:
It'd be more useful if the code were 64-bit for the A57 and Cyclone.

Not all of SPEC benefits from running 64-bit code, far from it. And that also applies to x86 code.

VISC CPU 3X the IPC?

Diamond Member

Lifer

Senior member

Lifer

Senior member

Senior member

Platinum Member

Diamond Member

Diamond Member

Elite Member

Elite Member

No Lifer

Platinum Member

Senior member

Lifer

Lifer

Diamond Member

Diamond Member

Lifer

Member

Lifer

Senior member

Senior member

Platinum Member

Platinum Member