Optical CORDIC and VLIW for next-gen x86?

eton975 · Jan 30, 2018

Thoughts? With modern transistor budgets, could a hardware dynarec be added to the old Elbrus 2000 architecture to eliminate the old argument of 'you need software optimisation for VLIW to work'? Thus, you have a 'pure CISC' x86 CPU gobbling up massive CISC instructions in one fell swoop, and with proper scheduling you have a >20 IPC CPU that also has high work per instruction (due to CISC)? How would such a CPU clock on a modern process? Would at least 1.4 GHz, with 8 cores, be possible without a massive die size?

Going further, could a MEMS photovoltaic lookup table be used to replace the AVX units on most CPUs, if engineered in properly?

NTMBK · Jan 30, 2018

It might be worth asking your question over at the RealWorldTech forums: https://www.realworldtech.com/forum/?roomid=1 Quite a few actual CPU architects hang out there, and they might be able to answer you more accurately than us mere overclockers.

eton975 · Jan 30, 2018

"User registration is currently not allowed."

I'll take em at their word for now.

NTMBK · Jan 30, 2018

eton975 said:
"User registration is currently not allowed."

I'll take em at their word for now.

You can post threads without signing up for a RWT account. They have an anonymous posting system.

eton975 · Jan 30, 2018

NTMBK said:
You can post threads without signing up for a RWT account. They have an anonymous posting system.

Thanks, forgot about that after a few seconds.

NostaSeronx · Jan 30, 2018

arco.e.ac.upc.edu/wiki/images/8/8d/Ino_icalu_vld_main.pdf
arco.e.ac.upc.edu/wiki/images/0/02/Ooo_fifo_main.pdf
citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1001.9949&rep=rep1&type=pdf
http://www.tdx.cat/bitstream/handle/10803/81561/TAD1de1.pdf?sequence=1&isAllowed=y
homepages.inf.ed.ac.uk/rkumar2/pubs/HPCC13.pdf
homepages.inf.ed.ac.uk/rkumar2/pubs/TOCS16.pdf
https://www.researchgate.net/profil...designed-acceleration-of-Android-bytecode.pdf
pages.cs.wisc.edu/~shiliang/doc/Thesis-x86vm.pdf

https://i.imgur.com/9o1M8GD.png
https://i.imgur.com/9VmKow9.png
https://i.imgur.com/6sTqJL1.png
https://i.imgur.com/kzeROG0.png

https://www.anandtech.com/show/10025...cture-visc-ipc
https://www.fool.com/investing/2017/...quisition.aspx

^-- collection for next-gen x86. Intel is also working on SVE (ARM) / Vector Length (RISC-V)-like instructions to replace the MMX/SSE/AVX stack.

Legacy x86 => x32/x64/MMX/SSE/AVX/etc
Modern x86 => Itanium/VLA

William Gaatjes · Jan 30, 2018

Thinking of recent security issues, I find this very interesting to read about:

https://en.wikipedia.org/wiki/Elbrus_2000#cite_note-security-3

For security reasons the Elbrus 2000 architecture implements dynamic data type-checking during execution. In order to prevent unauthorized access, each pointer has additional type information that is verified when the associated data is accessed.

I wonder if this can be extended to a privilege level.

For example, just as ARM has predication bits for conditional execution and flag checking, it might be useful to embed privilege-level bits on a per-instruction basis to prevent privilege issues during speculative execution or speculative branching.
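
A minimal sketch of the idea, not how Elbrus actually implements it: every pointer carries a type tag that is checked on access, and the same check is extended with a privilege level as suggested above. The names `TaggedPointer`, `TaggedMemory`, `type_tag` and `privilege` are hypothetical, and the ring-style numbering (0 = most privileged) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPointer:
    address: int
    type_tag: str        # hypothetical tag, e.g. "int" or "code"
    privilege: int = 3   # assumed ring-style level, 0 = most privileged


class TaggedMemory:
    """Toy memory in which every cell remembers its type and required privilege."""

    def __init__(self):
        self._cells = {}  # address -> (type_tag, required_privilege, value)

    def store(self, address, type_tag, required_privilege, value):
        self._cells[address] = (type_tag, required_privilege, value)

    def load(self, ptr):
        type_tag, required_privilege, value = self._cells[ptr.address]
        # Check 1: the pointer's type tag must match the tag stored with the data.
        if ptr.type_tag != type_tag:
            raise TypeError(f"pointer tagged {ptr.type_tag!r}, data is {type_tag!r}")
        # Check 2 (the proposed extension): the pointer must be privileged enough.
        if ptr.privilege > required_privilege:
            raise PermissionError("insufficient privilege for this address")
        return value


mem = TaggedMemory()
mem.store(0x10, "int", required_privilege=0, value=42)
kernel_ptr = TaggedPointer(0x10, "int", privilege=0)
user_ptr = TaggedPointer(0x10, "int", privilege=3)
```

Here `mem.load(kernel_ptr)` returns the value, while `mem.load(user_ptr)` raises before any data is handed back. In hardware the interesting part would be doing that check before a speculative load can leave a trace in the cache, which this toy model does not attempt to show.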

Tuna-Fish · Jan 30, 2018

NostaSeronx said:
^-- collection for next-gen x86. Intel is also working on SVE (ARM) / Vector Length (RISC-V)-like instructions to replace the MMX/SSE/AVX stack.

Legacy x86 => x32/x64/MMX/SSE/AVX/etc
Modern x86 => Itanium/VLA

No. ... just no. Everything about that could not possibly be more wrong.

Just for the record: NostaSeronx is a total crank. He has all these weird "predictions" that he tries to state with authority, that are generally diametrically opposed to what is actually going on, that are based on absolutely nothing real. Then when proven wrong he just leaps to the next set of outlandish ridiculous claims. For proof, just look at his post history.

Topweasel · Jan 30, 2018

Tuna-Fish said:
No. ... just no. Everything about that could not possibly be more wrong.

Just for the record: NostaSeronx is a total crank. He has all these weird "predictions" that he tries to state with authority, that are generally diametrically opposed to what is actually going on, that are based on absolutely nothing real. Then when proven wrong he just leaps to the next set of outlandish ridiculous claims. For proof, just look at his post history.

Agreed. There is a reason why EPIC (AMD has the last laugh there)/Itanium/IA-64 and Transmeta failed. The problem with all X86 killers is the need to handle X86 in general, which is one of 3 major problems:

1. Needs to handle X86 code. Seems pretty basic, but it's constantly getting harder and harder to do that without Intel/AMD patents.
2. Needs to handle X86 code better than available hardware. Maybe the potential is there, but anything with more than a 10% deficit is going to struggle.
3. Needs to be worth the eventual transition. There are certain types of workloads that could very nearly require a change to VLIW or some other architecture to work well. But general workloads also need to see a performance bonus to make the new architecture worth transitioning to.

This is why Intel attempted to stall out X86 development by keeping it 32-bit and just ratcheting up clock speed on NetBurst while spending most of their R&D budget on Itanium. There is almost as much X86 code in the world as there are stars in the sky. So it has to run that, it has to be nearly as fast to keep people from buying X86 solutions, and it needs a benefit that most of the corporate world doing their general server stuff would see.

What you are seeing is a shift to products designed for markets in general. The Xeon Phi in the future, and GPUs being used as co-processors, are going to shift the server solutions that do need a more advanced architecture away from X86, but they are so far away in use case from what the other 6.999999 billion people need that any developments there will have almost zero effect on the CPUs we can purchase.

NostaSeronx · Jan 30, 2018

1. Physical decoders, called vertical decoders, crack CISC (x86) into RISC ops, re-using current decoders. The horizontal decode stage is a translation-optimization layer (TOL) for fusing RISC ops vertically (time compression) and horizontally (execution width). The horizontal stage can implement a bypass in which optimized code (EPIC/IA-64/Itanium) can be pushed from Fetch (L1i) to the TOL (L0i). The TOL can then optimize code in turn, effectively making the instruction stream OoO. L0i micro-op caches can be phat in size and dispatch width.
2. The possible penalty of running x86 on a VISC-enhanced architecture is 5% to 20%. This performance loss can be hidden with the virtual-cores methodology that Intel got from Soft Machines.
3. The transition can be done behind the scenes or can be targeted. The targeted side will be a fused AMD64/IA64 virtual instruction set/programmer's model, allowing x86-64 in hybrid mode to have 128 GPRs and 128 FPRs, plus access to predicate registers. It is complicated, but it is in the language at Intel.

All of this can be found in the PDFs above or by simply searching for "Intel" HW/SW co-design, "Intel" TOL, etc.
http://cgo.org/cgo2010/epic8/slides/ditzel.pdf
Huge hint here, ya that guy, you know who he is. Fuse that with Soft Machines, boom production product within 30 months from buy-out. (September 2016 -> March 2019) ((Be a nice comparison to Kittson for the slides.))

For CORDIC, no.

In the early 1980s, the Intel® 8087 Math Coprocessor introduced hardware support for a small set of elementary transcendental functions (trigonometric, inverse trigonometric, exponential, and logarithmic), accessible through x87 instructions. In the 1990s Intel replaced the 8087’s CORDIC-based approximations of the elementary transcendental functions with polynomial-based approximations. These newer polynomial-based approximations provide a large degree of backwards compatibility with the CORDIC based approximations by approximating precisely the same functions, but with greater overall accuracy and speed.

80x87 -> 80486; the P5 had a hardware multiplier, so the move to polynomial approximations gave faster results.

A faster, fully hardware-based multiplier makes instructions such as MUL and IMUL several times as fast (and more predictable) than in the 80486; the execution time is reduced from 13~42 clock cycles down to 10~11 for 32-bit operands.
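
To make the trade-off in the quotes above concrete, here is a rough sketch comparing the two approaches for sine. It assumes floating point rather than the fixed-point shifts real hardware would use, and the polynomial uses plain Taylor coefficients as a stand-in, since the coefficients Intel actually used are not given here. The function names `cordic_sin_cos` and `poly_sin` are made up for illustration.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC, valid for |theta| <= pi/2."""
    # Table of arctan(2^-i); hardware would keep these constants in a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-scale x by the accumulated CORDIC gain so no final multiply is needed.
    x = 1.0
    for i in range(iterations):
        x /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    y, z = 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        # The 2^-i factors are plain right shifts in fixed-point hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x  # (sin, cos)


def poly_sin(theta):
    """Odd Taylor polynomial up to theta^11, evaluated with Horner's rule."""
    t2 = theta * theta
    return theta * (1 + t2 * (-1 / 6 + t2 * (1 / 120 + t2 * (-1 / 5040
           + t2 * (1 / 362880 + t2 * (-1 / 39916800))))))
```

CORDIC gains roughly one bit of accuracy per iteration using only shifts, adds and a table lookup, which is attractive when multiplication is slow. The polynomial needs only a handful of multiplies for comparable accuracy on a small argument, so once a fast hardware multiplier exists it finishes in far fewer steps.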

Yotsugi · Jan 31, 2018

eton975 said:
Thoughts? With modern transistor budgets, could a hardware dynarec be added to the old Elbrus 2000 architecture to eliminate the old argument of 'you need software optimisation for VLIW to work'?

No, VLIW never works for CPUs.
Just forget about it.

SarahKerrigan · Jan 31, 2018

Bondrewd said:
No, VLIW never works for CPUs.
Just forget about it.

Sure it does. Just not for general-purpose CPUs.

Topweasel · Jan 31, 2018

SarahKerrigan said:
Sure it does. Just not for general-purpose CPUs.

In a thread called "next-gen x86", general purpose would be the whole point.

eton975 · Feb 20, 2018

So, I've been doing a fair bit of work on this. Excerpts:

http://forum.6502.org/viewtopic.php?f=4&t=5033
https://www.reddit.com/r/Amd/commen...g_and_globalfoundries_have_overtaken/dujczm9/

eton975 · Feb 20, 2018

eton975 · Feb 20, 2018

Checkmate Intel in 2019-2020?

eton975 · Feb 20, 2018

Accidental double post.

amd6502 · Mar 11, 2018

NostaSeronx said:
http://cgo.org/cgo2010/epic8/slides/ditzel.pdf
Huge hint here, ya that guy, you know who he is. Fuse that with Soft Machines, boom production product within 30 months from buy-out. (September 2016 -> March 2019)

It's a pretty dated slide; frequency predictions from 2010 were really optimistic.

Itanium is still alive. The chips are not cheap and not low wattage (the area per core must be massive). Huge L3. I wonder what the application is:

http://www.cpu-world.com/CPUs/Itanium_2/Intel-Itanium 2 9740.html

IntelUser2000 · Mar 11, 2018

Topweasel said:
Agreed there is a reason why EPIC(AMD has last laugh there)/Itanium/IA-64 or Transmeta failed. The problem with all X86 killers is the need to handle X86 in general which is one of 3 major problems.

So there are a few problems with displacing an established, mass-produced ISA like x86.

1. New ISAs may look good on paper, but not in an actual implementation.
2. Even if the new ISAs are better, the world isn't a benchmark and not everything is equal.
3. It's not worth putting all that effort for a small improvement.
4. Transformational changes in technologies are merely enablers to allow continued advancement, rather than allowing an extraordinary jump.

Look at the state of EUV lithography. 15 years ago when people were expecting EUV to be available perhaps in the 22nm generation, they did not expect conventional methods to continue improving. Because the conventional methods keep improving, the big new step isn't so big after all.

Micron also stated that there were many, many technologies promising to replace DRAM. They said, though, that thanks to the enormous investment (both monetary and intellectual), DRAM broke through many barriers and stopped the so-called DRAM-killers dead in their tracks. In this case as well, the conventional, established technology continues to advance, to the point where the technology that was thought to be a big jump isn't anymore.

Thus, you have a 'pure CISC' x86 CPU gobbling up massive CISC instructions in one fell swoop, and with proper scheduling you have a >20 IPC CPU that also has high work per instruction.

Some code just doesn't have that much instruction-level parallelism. I also think that if software hadn't become fatter as time went on, modern cores like Skylake would have brought even fewer benefits than they have. It's like how there's zero benefit to be had running Quake 3 at 1024x768 resolution on a 1080 Ti. How much faster would modern cores be running Windows XP era applications?
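
A toy illustration of the instruction-level parallelism point, assuming unit latency for every instruction and an infinitely wide machine: the best achievable IPC is the instruction count divided by the length of the longest dependency chain. The function `ilp_limit` and the two example instruction streams are hypothetical.

```python
def ilp_limit(deps):
    """Upper bound on IPC for a straight-line instruction stream.

    deps maps each instruction index to the earlier instruction indices
    it depends on. Assumes unit latency and unlimited execution width.
    """
    depth = {}
    for i in sorted(deps):
        # An instruction can issue one cycle after its deepest producer.
        depth[i] = 1 + max((depth[j] for j in deps[i]), default=0)
    critical_path = max(depth.values())
    return len(deps) / critical_path


# Pointer chasing: every load needs the result of the previous one.
chain = {i: ([i - 1] if i > 0 else []) for i in range(20)}
# Twenty fully independent operations.
independent = {i: [] for i in range(20)}

print(ilp_limit(chain))        # 1.0
print(ilp_limit(independent))  # 20.0
```

No amount of issue width or scheduling, in hardware or in a compiler, lifts the first stream above 1 IPC under these assumptions; only the second kind of code could feed a >20 IPC machine.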

Thala · Mar 11, 2018

IntelUser2000 said:
Some code just doesn't have that much instruction level parallelism. I also think that if software didn't become fatter as time went on, modern cores like Skylake would have brought even less benefits than otherwise. It's like how there's zero benefit to be had running Quake 3 at 1024x768 resolution on a 1080 Ti. How much faster would modern cores be running Windows XP era applications?

That's an interesting proposition, and I agree with it. Essentially, the set of use cases where increased CPU performance shows benefits is getting smaller. What we see is diminishing returns from CPU performance in an ever-increasing number of usage scenarios.
Another point I would like to throw into the discussion is efficiency. An architecture that is potentially faster only by throwing immense transistor budgets at the problem is doomed to fail, because today we already see thermal and power densities as the ultimate limit on performance. Therefore I am more than sceptical about the prospects of dynarec etc. as a way to achieve x86 compatibility at the ISA level, because energy per instruction will go up. For the very same reasons I consider x86 a showstopper going forward.

eton975 · Mar 19, 2018

Thala said:
That's an interesting proposition, and I agree with it. Essentially, the set of use cases where increased CPU performance shows benefits is getting smaller. What we see is diminishing returns from CPU performance in an ever-increasing number of usage scenarios.
Another point I would like to throw into the discussion is efficiency. An architecture that is potentially faster only by throwing immense transistor budgets at the problem is doomed to fail, because today we already see thermal and power densities as the ultimate limit on performance. Therefore I am more than sceptical about the prospects of dynarec etc. as a way to achieve x86 compatibility at the ISA level, because energy per instruction will go up. For the very same reasons I consider x86 a showstopper going forward.

I see potential solutions to this problem: Break the CPU die (for an 8-core design like this on 14nm FinFET: ~600-900mm^2) up into much smaller dies. These smaller dies implement the logic but are much cheaper to manufacture and have a much higher yield. Perhaps even move a lot of the DDR5 memory controller logic off the CPU and onto the motherboard/RAM: this logic will force-feed the CPU cache with data. An external branch predictor/memory controller design??? (Atomicity and cache coherency will be major problems)
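
A back-of-the-envelope sketch of the yield argument, using the simple Poisson yield model Y = exp(-A * D0). The defect density of 0.1 defects/cm^2 is an assumed figure picked for illustration, not a number for any real process, and the model ignores packaging cost, inter-die links and edge losses. The helper names are hypothetical.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of defect-free dies under the Poisson model Y = exp(-A * D0)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def silicon_per_good_product(total_area_mm2, n_dies, defects_per_cm2):
    """Wafer area (mm^2) consumed per working product.

    Assumes dies are tested individually and only known-good dies are
    assembled, so the cost scales as total area divided by per-die yield.
    """
    die_area = total_area_mm2 / n_dies
    return total_area_mm2 / poisson_yield(die_area, defects_per_cm2)


D0 = 0.1  # assumed defects per cm^2, illustration only

monolithic = silicon_per_good_product(800, 1, D0)  # one 800 mm^2 die
chiplets = silicon_per_good_product(800, 8, D0)    # eight 100 mm^2 dies
print(f"monolithic: {monolithic:.0f} mm^2, chiplets: {chiplets:.0f} mm^2")
```

Under these assumptions the single large die burns roughly twice as much silicon per working CPU as the eight small ones, which is the effect the paragraph above is relying on.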

Do initial low-volume manufacture on small 200mm wafers to lower costs. Use a TwinScan 200mm with ArF laser or better yet: electron beam etchers and precision ion dopers. Probe with an Electroglas 4090u+, TEL P-8XL, Accretech UF200, Cascade 200mm automatic or the various manual Micromanipulator, Teradyne, SPEA, Delta and Cascade probers at 200mm/8" or greater. Sell these early CPUs (with large registered-ECC-supporting motherboards and LGA 4600 or so pin sockets) to large server vendors in an aggressive bidding process for $500,000 apiece or more. Then, get GlobalFoundries, STMicroelectronics, TSMC and/or UMC to fabricate the chips at high volume on 300mm wafers. Within a few years, prices drop to $300 or less for a top-end model. Mid-end models will be PGA 2400 or so (easier to repair pin damage vs LGA socket damage).

CPUID maybe:
CertifiedCBR, CertifiedCTN, VIA VIA VIA , CentaurHauls, AuthenticAMD, ST ST ST ST .
801486-class or Am1486-class
8 core(s), 16 thread(s)
Model 0 Family 0 Stepping C0
Extensions supported:
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, VIA 64 (or AMD64 for AMD), VT-x, AES, AVX, AVX2, AVX-512, CVT16, FMA3, FMA4

(more details in that 6502.org thread at the beginning of this post).

And at 1.4 GHz, would the power draw really be so bad? The Elbrus 2000 itself, with its 300 MHz clock speed (and sadly rather poor memory controller, cache, and large die size), only drew roughly 6W of power.

Aggressively push into photonics (OptiCORDIC??? on glass wafers/glass-doped silicon wafers/vacuum or argon cavity light chambers, with inter-core waveguides for light transmission) and room-temperature superconductors for 801586 and 801686 designs. 801586 will be 100x or more faster than the 1486 at AVX workloads or even others!

Intel goes bankrupt in the mid-2020s as investors desert them due to long lead times, poor yield of large dies and slow movement against external foundries and fabless/labless chipmakers.

nismotigerwvu · Mar 19, 2018

I know this is going to sound like I'm being sarcastic and I really promise I'm not, but has there ever been a VLIW design that was a sustained success in any field? The best case I can come up with was Terascale 1/2/3 and even then those cards only hung around for 3~4 years before being phased out for GCN, which proved better in basically every way (in part due to the 28 nm admittedly). Itanium might have been around longer, but it would be hard to classify as much of a success story either.

CHADBOGA · Mar 20, 2018

eton975 said:
Intel goes bankrupt in the mid-2020s as investors desert them due to long lead times, poor yield of large dies and slow movement against external foundries and fabless/labless chipmakers.

Sharikou Ph.D, are you back?

eton975 · Mar 20, 2018

nismotigerwvu said:
I know this is going to sound like I'm being sarcastic and I really promise I'm not, but has there ever been a VLIW design that was a sustained success in any field? The best case I can come up with was Terascale 1/2/3 and even then those cards only hung around for 3~4 years before being phased out for GCN, which proved better in basically every way (in part due to the 28 nm admittedly). Itanium might have been around longer, but it would be hard to classify as much of a success story either.

The mid-2000s AMD GPUs were hardwired VLIW and enjoyed moderate success. The Elbrus 2000 and its derivatives have also had limited success in Russia for national security reasons.

IBM POWER and various SPARC CPUs could be said to be VLIW in the way they input and output data over their buses.

nismotigerwvu · Mar 20, 2018

You might need another cup of coffee; I mentioned those Terascale GPUs in my post.

I've never really seen anyone describe POWER and SPARC as VLIW, so I might have to read a bit more on that.
