Sandy Bridge Socket


ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
That is interesting, because the AMD Bulldozer for the enthusiast socket doesn't have an integrated GPU, according to the roadmap.
I'm pretty sure that starting with Llano ALL AMD chips will have GPUs on die. Same as with Intel's s1156 line, ALL will have on-chip/die GPUs.

IDC: two interesting things about the Sandy Bridge shot: the GPU being on the far side of the cores from the display controller (but it's probably a low-speed connection between them, so no big deal), and the rectangle-destroying memory controller (again). Thoughts?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
That Sandy die shot is so bad (intentionally) that I wouldn't put any faith in the labels beyond the big obvious ones (cores, GPU, L3$). All the rest of those functional units are a coin toss for labeling as far as I am concerned. I am open to being convinced otherwise, but what Goto-san has done so far doesn't convince me yet.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Yes, it is a new architecture, but it won't be like Nehalem was to Conroe. Performance/clock is not going to go up significantly from what I understand, certainly not the 0-60% improvement in performance that came with Nehalem.

Could you expand on the above statement, please?

Because as I understand it, Sandy is going to be like Merom compared to the P4.

The available info supports this too, provided it's all true. As I understand it, Sandy will be able to break single-threaded programs up into multi-threaded programs with the use of Intel's new software. But who knows. I personally have zero reason not to believe this.
 

21stHermit

Senior member
Dec 16, 2003
927
1
81
Same L2$ and less L3$.

Westmere:

IDC

Thank you for this image. One thing I've not understood about Clarkdale is the discrepancy in TDP between the CPU (65W TDP) and the GPU (8-10W TDP). Conventionally, die size drives TDP, but clock speed also matters a lot. Roughly 3GHz for the CPU and 700MHz for the GPU. Would the clock speed outweigh the die size in this case?

Thanks
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
I really think that if you're going to debate Sandy Bridge, one should read about Intel's Ct.

Ct is to Intel what CUDA is to NVIDIA. I read everything I can on Ct; if it does half of what is said, Sandy will stomp all over Nehalem.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
User-defined Operators in Ct
Ct supports a complete scalar set of types for a couple of reasons. First, operations involving both TVECs and scalars are useful. Second, programmers should be able to define their own operations or function to apply across TVECs or to use as combining operators in reductions and scans. (For functional programming folks, think map, fold, and foldlist.) This flavor of data parallel programming can be referred to as “kernel” style, as opposed to the “vector” flavor also supported in Ct. Both are useful in differing circumstances. For linear algebra, we find the “vector” style easier to express algorithms, whereas for some signal and image processing algorithms, we find the “kernel” style easier. The underlying machinery is the same, but having both styles is an expressive convenience to the programmer.

Instead of expressing the following color conversion as:

Y = YRCoeff*R + YGCoeff*G + YBCoeff*B;
U = URCoeff*R + UGCoeff*G + UBCoeff*B;
V = VRCoeff*R + VGCoeff*G + VBCoeff*B;
The programmer can write:

Y = map(RGBtoY, R, G, B);
U = map(RGBtoU, R, G, B);
V = map(RGBtoV, R, G, B);
Where the programmer has defined:

TElt<U16> RGBtoY(TElt<U16> R, TElt<U16> G, TElt<U16> B) {
return (YRCoeff*R + YGCoeff*G + YBCoeff*B);
}
This is a basic example. Things get interesting when the programmer introduces some control flow (If-Then-Else's, For loops, While loops) or wants to touch neighboring pixels, say, for a 3x3 filter.

TElt<F32> threebythreefun(TElt<F32> arg, F32 w0, F32 w1, F32 w2, F32 w3, F32 w4) {
return w0*arg + w1*arg[-1][0] + w2*arg[0][-1] + w3*arg[1][0] + w4*arg[0][1];
}
…
map(threebythreefun, some2DGrid, smooth1, smooth2, smooth3, ... [ etc.]);
The alternative would have been to shift this image around, logically creating 4 additional copies. Though the compiler would have optimized these copies away, it's a little misleading and not transparent enough in terms of the programmer's intent. One can even express partial orders on the applications of these kernels (think deblocking filters for H.264):

TElt<I32> errordiffuse(TElt<I32> pixel) {
return someexpression(pixel,RESULT[-1][0],RESULT[-1][-1],RESULT[0][-1]);
};

ditheredcolors = map(errordiffuse, filteredcolors);
With reduction and scans, things get more interesting. For circuit folks, carry look-ahead adders are just specialized scans.
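To make "reduction" and "scan" concrete for anyone who hasn't met the terms, here is a minimal sketch in plain C++17 rather than Ct; std::accumulate and std::inclusive_scan merely stand in for what a combining reduction and a prefix scan compute:

// Plain C++17, not Ct: a reduction collapses a vector with one combining
// operator; a scan produces every running partial result along the way.
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> v{1, 2, 3, 4};

    // Reduction: 1+2+3+4 -> 10
    int sum = std::accumulate(v.begin(), v.end(), 0);

    // Inclusive scan (prefix sum): 1, 3, 6, 10
    std::vector<int> prefix(v.size());
    std::inclusive_scan(v.begin(), v.end(), prefix.begin());

    std::printf("reduction: %d\nscan:", sum);
    for (int p : prefix) std::printf(" %d", p);
    std::printf("\n");
    return 0;
}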




Tasks in Ct

The Ct API is a declarative way of specifying parallel tasks that are dependent on each other. So, collections of Ct operations on TVECs become collections of task dependence graphs. This still might be too constraining for those who would like to get at tasking abstractions directly. To this end, Ct introduces two abstractions for task parallelism.

The basic tasking unit in the Ct runtime is a future, an idea that goes back to MultiLisp. A future is basically a parallel-ready thunk (a function and arguments) that may or may not be executed immediately (and asynchronously), but whose execution must precede the reading of any value it generates (the runtime guarantees this). So, futures are exposed at the API level. Simply take any Ct function and invoke it using the “spawn” function in the Ct namespace. The runtime makes sure any dependencies are satisfied so that the correct order of execution is respected while parallelizing the call. The value a Ct future returns, in essence, is the only “effect” that is visible. Ct also incorporates a structured task-parallel abstraction called hierarchical, synchronous tasks (HSTs). The basic idea is that tasks can execute in parallel, but their “effects” are isolated from each other. Some of these effects are useful, but we want them combined in some predictable (deterministic) way. This can be viewed as a generalization of bulk synchronous processes. With HSTs, Ct easily supports various other forms of structured parallelism, including pipeline and fork-join patterns (all with deterministic behaviors).
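The future idea has a rough analogue in standard C++ that may help picture it. This is not the Ct spawn API, just a std::async sketch of the same contract: the spawned work may run asynchronously, but reading its value forces it to have completed first:

// Not the Ct API -- a plain standard C++ analogy for the "future" contract.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> data(1 << 20, 1);

    // Spawn the reduction; it may run at any point after this...
    std::future<long> total = std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0L);
    });

    // ...but get() guarantees it has finished before the value is read.
    std::printf("sum = %ld\n", total.get());
    return 0;
}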




Execution Semantics of Ct

Ct is dynamically compiled via a language runtime through library initialization. This is done so that one can expose these higher-level abstractions to heavy compiler optimizations, especially as one migrates code between platforms. This means that the Ct API calls used in a program may execute out of order and asynchronously with respect to the rest of one's C/C++ code. The only Ct API calls that are guaranteed to be “synchronous” with the C/C++ code are those that move data back and forth between TVECs and native C/C++ memory. The semantics of Ct allow this aggressive reordering by the compiler and runtime, though some effort will be required to optimize one's code if one is heavily intermixing C/C++ code and Ct API calls. For functional programmers, the evaluation of Ct API calls can be viewed as “lenient” (Ct is strict and purely functional). For web programmers, the execution model will look similar to Python in that one “executes” all the code at least once, including the object declarations (which are bits of code that create and initialize the objects), then one uses the objects and their methods. In Ct, the analogy would be that the C/C++ sequencing of Ct API calls is the first pass through, then the runtime's code cache is
 

Martimus

Diamond Member
Apr 24, 2007
4,488
153
106
EDIT: Oops. Mixed up some members because they use the same avatar, and likely made myself look more like an ass than usual. Hopefully the member didn't read this before I caught it, as what was meant to be a compliment would have had the opposite effect since I was making the compliment to another member.
 
Last edited:

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
That's OK. I missed it. I was busy pasting just a little of the Ct info that's available. I am sure he won't mind your error. I don't.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
I'm pretty sure that starting with Llano ALL AMD chips will have GPUs on die. Same as with Intel's s1156 line, ALL will have on-chip/die GPUs.

IDC: two interesting things about the Sandy Bridge shot: the GPU being on the far side of the cores from the display controller (but it's probably a low-speed connection between them, so no big deal), and the rectangle-destroying memory controller (again). Thoughts?

Hey guys. I wanted to discuss the IGP in Clarkdale and Sandy Bridge. According to reports so far, the fundamental architecture of the two graphics chips is the same.

If you look at the much clearer die shot here: http://en.expreview.com/img/2009/07/06/sandy_bridge.jpg

Notice that the top portion of the GPU has symmetrical blocks. Some of the guys at a German forum speculated that those are the Execution Units. You can count 12 of the large blocks.

Well, the question is: if the IGP uses a similar architecture, then where are the similar symmetrical blocks in Clarkdale's IGP? You can even see them on the blurry Sandy Bridge die shot from the recent IDF presentation, so why not on Clarkdale's? What are those symmetrical blocks for?

(Some info about code names:

Clarkdale
-Hillel CPU core
-Ironlake GPU core)

IDC

Thank you for this image. One thing I've not understood about Clarkdale is the discrepancy in TDP between the CPU (65W TDP) and the GPU (8-10W TDP). Conventionally, die size drives TDP, but clock speed also matters a lot. Roughly 3GHz for the CPU and 700MHz for the GPU. Would the clock speed outweigh the die size in this case?

Thanks

You can't compare the two like that. The GPU is a specialized unit and can be optimized for die size and power efficiency. It's sort of like caches. The caches in the Itanium 2 91xx processors are clocked at 1.6GHz and take up something like 60% of the die, but the power used by the caches is under 6W (less than 5% of TDP).

BTW, on the G45 the Execution Units are clocked at 800MHz, but things like the texture unit and the ROPs are at 400MHz. They are higher on Clarkdale, but they're still probably not up to the EUs' speed.
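A rough back-of-the-envelope with purely illustrative numbers (nothing here is measured or official): dynamic power goes roughly as capacitance x voltage^2 x frequency, so the GPU's lower clock and (assumed) lower voltage pull its power down much faster than its share of die area alone would suggest:

// Illustrative only -- assumed numbers, not measurements.
// Dynamic power ~ alpha * C * V^2 * f, so clock and voltage dominate area.
#include <cstdio>

int main() {
    double c_ratio = 1.0 / 3.0;       // assumed GPU-vs-CPU switched capacitance
    double f_cpu = 3.0, f_gpu = 0.7;  // clocks in GHz
    double v_cpu = 1.2, v_gpu = 1.0;  // supply voltages in volts (assumed)

    double ratio = c_ratio * (f_gpu / f_cpu) * (v_gpu * v_gpu) / (v_cpu * v_cpu);
    std::printf("GPU/CPU dynamic power ratio ~ %.2f\n", ratio);  // about 0.05
    return 0;
}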
 
Last edited:

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Yes, it is a new architecture, but it won't be like Nehalem was to Conroe. Performance/clock is not going to go up significantly from what I understand, certainly not the 0-60% improvement in performance that came with Nehalem.

And you know that a year and a half before it is actually released? It is set to be released in 2011, and it is now Q4 2009. There is quite a while to go before Sandy.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
So it looks like OCW's early leak about Sandy Bridge having 1.5MB/core L3 is right.
That's what the blurry die shot says, but I don't see 3 groupings per core, only 2. I doubt Intel would organize it in 768KB blocks, and they look to be the same size. I'd expect 1 or 2MB per core based on that alone. But hey, Intel could be goofy.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
That's what the blurry die shot says, but I don't see 3 groupings per core, only 2. I doubt Intel would organize it in 768KB blocks, and they look to be the same size. I'd expect 1 or 2MB per core based on that alone. But hey, Intel could be goofy.

There are actually four groupings (the light bar down the middle goes with half the cache on the left and half on the right)... As for the organization of the blocks, they already do 256KB for the L2$, so what design/layout concerns do you think they would be trying to avoid by not doing 768KB blocks? Serious question, because I don't follow, so I'm assuming I just don't get the issue you are raising. Can you elaborate?
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
There are actually four groupings (the light bar down the middle goes with half the cache on the left and half on the right)... As for the organization of the blocks, they already do 256KB for the L2$, so what design/layout concerns do you think they would be trying to avoid by not doing 768KB blocks? Serious question, because I don't follow, so I'm assuming I just don't get the issue you are raising. Can you elaborate?
Well, yeah. Four controllers with probably 1MB blocks on either side (2 blocks per bar down the middle, 1 bar arrangement per core), so the same 2MB/core arrangement. We can probably verify that by comparing cache size vs. core size against Bloomfield shots. Not perfect, but it's the best we can do.

256KB is a 2^x arrangement, whereas 768KB isn't (nor is 1.5MB). It's an odd amount that doesn't follow the typical pattern. That's all I meant.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
So from what's out there in the rumor mills, it's actually Socket 1366 that will need a new socket for Sandy Bridge. Socket 1155 supposedly uses the same pin arrangement as 1156, so it can theoretically be used in Socket 1156 motherboards.

Three scenarios:
1. Both Socket 1366 and 1156 have to be replaced
2. Only Socket 1366 will be replaced, Socket 1156 will be able to take Sandy Bridge
3. Both the sockets can take Sandy Bridge, the extra PCI Express channels in the CPU just means more bandwidth for multi-GPU setups(the one in CPU+one in X58)

3 is very optimistic (as in "not likely"), 2 is the most likely scenario, although 1 is possible.

Sandy Bridge will use a faster QPI, which isn't a problem for Socket 1156 CPUs as they don't have QPI, but for the S1366 variants that MIGHT be a problem. Plus, some reports say that even the Socket 1366 variants will feature a PCI Express controller on the CPU die.

256KB is a 2^x arrangement, whereas 768KB isn't (nor is 1.5MB). It's an odd amount that doesn't follow the typical pattern. That's all I meant.
You could be right, but with the scarce info we have, anything could be true. I had my doubts about OCW's claim of 1.5MB of L3 per core, but with even PCWatch changing their stance, it's looking more likely that they might just be correct.

The 1.5MB of L3 per core is interesting because it represents a possible change in architecture beyond what most of us know. It also tells us how much we know (or don't).
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Do you guys (specifically Ilkhan and Idontcare) still want to talk about what the top portion of the GPU part of Sandy Bridge might be?

Or do you guys know what it really is? I don't think those are EUs, but I could be wrong.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
You could be right, but with the scarce info we have, anything could be true. I had my doubts about OCW's claim of 1.5MB of L3 per core, but with even PCWatch changing their stance, it's looking more likely that they might just be correct.

The 1.5MB of L3 per core is interesting because it represents a possible change in architecture beyond what most of us know. It also tells us how much we know (or don't).

There is nothing special or unique about SRAM layouts that restricts them to 2^n blocks. Nice, even 2^n numbers seem natural, but that is purely a human bias and has no basis in design and layout fundamentals. We encounter the same "barriers of comfort" with the idea of 3-core and 6-core CPUs... what!? Aren't they just supposed to be 2, 4, and 8... all 2^n?

1.5MB of L3$ per core could simply be yet another refinement in the ever-persistent trade-off between latency and cache size. Intel's engineers have had a couple more years to analyze software applications to get a better feel for the optimal trade-off point; we should not be surprised if the conclusion is that slightly less, but lower-latency, cache is better for whatever suite of applications Intel's engineers studied.
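A crude illustration of that trade-off, using completely made-up hit latencies and miss rates rather than anything from Intel: average memory access time is hit time plus miss rate times miss penalty, so a smaller L3 can come out ahead if the latency drop outweighs the extra misses:

// Made-up numbers, purely to illustrate the latency-vs-size trade-off.
#include <cstdio>

int main() {
    double mem_penalty = 200.0;  // assumed cycles to go to DRAM on an L3 miss

    // Hypothetical "bigger but slower" vs "smaller but faster" L3
    double amat_big   = 40.0 + 0.10 * mem_penalty;  // 40-cycle hit, 10% misses
    double amat_small = 32.0 + 0.11 * mem_penalty;  // 32-cycle hit, 11% misses

    std::printf("bigger/slower L3:  %.1f cycles\n", amat_big);    // 60.0
    std::printf("smaller/faster L3: %.1f cycles\n", amat_small);  // 54.0
    return 0;
}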

Do you guys (specifically Ilkhan and Idontcare) still want to talk about what the top portion of the GPU part of Sandy Bridge might be?

Or do you guys know what it really is? I don't think those are EUs, but I could be wrong.

Yes, we are interested. What do you think it is? 6 or 12 Larrabee cores (or whatever they want to call those cores used in their cloud-computer-on-a-chip deal)?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Nehalem has a total of 8MB of L3 cache, with both the L1 and L2 caches inclusive in the L3. The L1 and L2 aren't inclusive of each other, though. I'm thinking maybe the hierarchy changed again.

That's if it's really 1.5MB of L3 per core.

Yes, we are interested. What do you think it is? 6 or 12 Larrabee cores (or whatever they want to call those cores used in their cloud-computer-on-a-chip deal)?

That's the problem. Its fundamental architecture is supposedly based on current IGPs. Shouldn't it look similar to Ironlake, then? Texture units? Even controllers for what look like caches?
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
Do you guys (specifically Ilkhan and Idontcare) still want to talk about what the top portion of the GPU part of Sandy Bridge might be?

Or do you guys know what it really is? I don't think those are EUs, but I could be wrong.
The timeline for the GPU in Sandy really precludes it being Larrabee-based. That said, as long as it decodes 2x 1080p x264 and does Aero (both of which even Arrandale won't have any problem with), and maybe accelerates Flash x264 (fuck Adobe containers), I couldn't care less. Mostly because I'll never use a Sandy IGP. My Sandy machine will have a discrete GPU, and my "Sandy"-gen laptop will be an Ivy.

A change in cache could be the L3 answer indeed. It may be anthropomorphic, but the designers are human too.
IDC: would the cores or the cache shrink more? Compared to a Bloomfield die, the Sandy cache is slightly smaller. (Stretching the cache to vertically cover the core took a 75% stretch on Sandy, 65% on Bloomfield. An admittedly crude measurement.)
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
IDC: would the cores or the cache shrink more?

The cache will shrink more than the logic, which in turn will shrink more than the IO.

What do you estimate the mm^2 for Bloomfield's 8MB cache to be versus the mm^2 for Sandy's?

The comparison to make, though, is the cache-block density of Westmere versus Sandy's cache-block mm^2.

I will do it eventually if you don't beat me to it... feeling kinda lazy tonight.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
The estimates are 22X mm^2, not 2XX mm^2, so somebody out there has the info and has shared it to that degree of accuracy. (Not surprising given that first silicon has already been booted, etc.)

I can't remember the last time a leaked die size turned out to be seriously in error.
 