Search results

  1. B

    Intel Skylake / Kaby Lake

    what do you mean by "fully enabled": two 512-bit FMA units per core, i.e. twice the throughput with AVX-512 code vs. AVX2 code? and what is your source for this?
  2. B

    3DPM: Can we fix it? Yes we can!

    I don't see how you can reach this conclusion (log N instead of N) based on the spherical vector distribution method, which has an extra inner loop for each step with a hard-to-predict termination branch, IIRC the example discussed here the other day...
  3. B

    3DPM: Can we fix it? Yes we can!

    I don't see what this has to do with the problem at hand, whose complexity is O(M*N) with M particles and N steps, for both the original and your proposal. maybe some pseudo-code will convince me that I'm wrong?
  4. B

    3DPM: Can we fix it? Yes we can!

    neat, thank you! a good example of a pseudoscientific argument, IMHO, is your analogy with "DFT vs FFT" computational complexity: it makes it sound as if your proposal is somehow less "brute force" than the original, while your solution actually looks more complex than the original and less amenable...
  5. B

    3DPM: Can we fix it? Yes we can!

    note that I wrote "pseudo scientific" not "scientific"
  6. B

    3DPM: Can we fix it? Yes we can!

    I don't get it; I still can't tell if you have more/fewer/the same number of particles as the original, i.e. the number of distinct positions at the end of the simulation. well, this is the common word for random walks and more generally all numerical simulations; anyway, the one used by borandi in the...
  7. B

    3DPM: Can we fix it? Yes we can!

    too bad; it explains why it wasn't so easy to try to fix the supposed cache issues I have raised (hint: the out-of-this-world speedup with HT enabled). anyway, thank you for your answer; too bad I don't have free time these days to post a fully vectorized solution
  8. B

    3DPM: Can we fix it? Yes we can!

    it's not clear (IMHO) from your description what you are aiming at: 1) processing fewer particles, 2) processing fewer steps per particle, 3) faster steps, 4) something else. some pseudo-code will probably help to understand what you are trying to explain with quite strange and (to me) irrelevant analogies
  9. B

    3DPM: Can we fix it? Yes we can!

    huh? I have dozens of loops perfectly vectorized with transcendentals in my code (polynomial approximations, and if-conversion when there is more than one branch); for an off-the-shelf solution, look at a concrete example here using SVML ...
  10. B

    3DPM: Can we fix it? Yes we can!

    I don't see the point of unrolling by hand; you seem to have the same strange idea as DrMrLordX on this. just write x[i] = r[i]*cos(valpha[i]); in a loop: it's easy to maintain and vectorizes well
  11. B

    3DPM: Can we fix it? Yes we can!

    it is used automatically by the ICC vectorizer, once you have a data layout amenable to vectorization:

    void SinTest (const float * __restrict x, float * __restrict y, size_t size)
    {
        __assume_aligned(x,32);
        __assume_aligned(y,32);
        for (size_t i = 0; i < size; i++)
            y[i] = sin(x[i]);
    }
    ...
  12. B

    3DPM: Can we fix it? Yes we can!

    are you sure SVML is used for the trig funcs ? you should see stuff such as call __svml_cosf8 in the ASM dump
  13. B

    3DPM: Can we fix it? Yes we can!

    nope, don't worry: a good compiler will use a single-clock reciprocal-throughput instruction such as vcvtps2pd (for AVX targets), and you do it only once per >10000 iterations (with sin(), cos()) ..., I see that it's no longer a full checksum (x only) in your new version though (why?), *trust me* one...
  14. B

    3DPM: Can we fix it? Yes we can!

    cool to see how it evolves; I'd advise using a double for the checksum value to have a better check against regressions
  15. B

    3DPM: Can we fix it? Yes we can!

    as I mentioned above in this thread, an alternate possible interpretation is [...] pos[dummy][2] += newz; pos[dummy][2] = abs(pos[dummy][2]); this variant seems more likely when you imagine the particles bouncing off the plane z=0; the original author's input would be welcome
  16. B

    3DPM: Can we fix it? Yes we can!

    I'd advise reposting the whole code after the fixes
  17. B

    3DPM: Can we fix it? Yes we can!

    IMO this should be: [...]
    for (counter = 0; counter < steps; counter++) {
        newz = 2.0*real_rand() - 1.0;
        alpha = real_rand()*2.0*3.141592654;
        r = sqrtf(1 - newz*newz);
        pos[dummy][0] += r*cosf(alpha);
        pos[dummy][1] +=...
  18. B

    3DPM: Can we fix it? Yes we can!

    this looks basically correct; I would use a name other than "rand" since it's already used in libc with another meaning (integer values). another possible interpretation is:
    int steps, count;
    float newz, alpha, r, part_x=0, part_y=0, part_z=0;
    for (count = 0; count < steps; count++) {
        newz...
  19. B

    3DPM: Can we fix it? Yes we can!

    from this original author's comment http://forums.anandtech.com/showpost.php?p=37249640&postcount=235 we have deduced that it should be interpreted as particle[i].z += abs(newz); not the clamp to 0.0 in the pseudo-code; this way the delta is a unit vector at each step. our main problem here...
  20. B

    3DPM: Can we fix it? Yes we can!

    there is obviously a need to renormalize after your proposed transform (translate + scale z). example: let's start with a normalized vector v = [0.4 0.3 -0.866]; after the transform you described it will become v' = [0.4 0.3 0.067]. |v| = ~1.0, but |v'| = ~0.5 (|v'|^2 = ~0.25), so v' is about 2x too short in this...
  21. B

    3DPM: Can we fix it? Yes we can!

    then normalize the whole vector again to have a unit vector at each step as in the original example
  22. B

    3DPM: Can we fix it? Yes we can!

    thank you! I see that the 1st step is an iterative method (calling two rand() per loop iteration!) so it will be way slower than the *simpler* solution discussed here, and it isn't amenable to vectorization since the number of loop iterations is not known statically; for modern code we prefer to use...
  23. B

    3DPM: Can we fix it? Yes we can!

    looks cool, but this isn't exactly what the code discussed here is doing: the original code makes some progress in Z at each step, and only dx/dy are selected on a circle, so the x,y sum stays very near 0,0 after a lot of steps while the z sum progresses toward infinity; it's more or less a long cylinder...
  24. B

    3DPM: Can we fix it? Yes we can!

    yes, this looks like the author's intended meaning. note that the same result will be obtained by replacing float newz = 2 * randgen() - 1; in the original source with simply float newz = randgen(); this way newz will be >= 0 from the start and no absolute value will be required...
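
The equivalence claimed above rests on the fact that if u is uniform in [0,1], then |2u - 1| is also uniform in [0,1]. A quick spot-check of that claim, with `rand01()` as an illustrative stand-in for the snippet's `randgen()`:

```c
#include <math.h>
#include <stdlib.h>

/* Spot-check: the sample mean of fabs(2u - 1) and of u should both be
   ~0.5 if the two variants produce the same uniform distribution. */
static double rand01(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

double mean_abs_variant(long n) {    /* mean of fabs(2u - 1) */
    double sum = 0.0;
    for (long i = 0; i < n; i++) sum += fabs(2.0 * rand01() - 1.0);
    return sum / (double)n;
}

double mean_direct_variant(long n) { /* mean of u */
    double sum = 0.0;
    for (long i = 0; i < n; i++) sum += rand01();
    return sum / (double)n;
}
```

A mean comparison only checks one moment, of course; the distributional identity itself follows from |2u - 1| folding [0,1] symmetrically onto itself.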
  25. B

    3DPM: Can we fix it? Yes we can!

    not only that: unrolling by hand makes it very difficult to experiment, unlike with a compiler where you simply ask for 4x, 8x, etc. unrolling; the end result is code with subpar performance on most targets, since the best amount of unrolling is code-path dependent (obviously, you won't unroll the same way for...
  26. B

    3DPM: Can we fix it? Yes we can!

    it's a big deal in the real world for code maintenance; why don't you work in C++ with a mature vectorizer instead? the source code will be simpler and more readable, and the "fixed" benchmark really faster
  27. B

    3DPM: Can we fix it? Yes we can!

    only dz lacks a uniform angular distribution; dx/dy follow a uniform (azimuth, aka alpha) angle distribution, hence my proposal to store sin(alpha) and cos(alpha) in LUTs. FWIW the (buggy) pseudo-code is available here: http://forums.anandtech.com/showpost.php?p=37006057&postcount=197
  28. B

    3DPM: Can we fix it? Yes we can!

    one may also argue that it's not fair to write the code with 8x unrolling by hand as you do; how much performance does it buy you, btw?
  29. B

    3DPM: Can we fix it? Yes we can!

    note that the original pseudo-code isn't a proper spherical distribution: the delta Z component is chosen over a uniform [-1.0,1.0] range, so the elevation angle (aka "beta" in the literature) distribution isn't uniform
  30. B

    3DPM: Can we fix it? Yes we can!

    note that small LUTs are typically used for polynomial coefficients in piecewise approximations; this is useless for smooth functions like sin/cos, though. in this case (pre-generated random values) you can simply store precomputed sin(alpha) and cos(alpha) values, since alpha depends on a...
  31. B

    3DPM: Can we fix it? Yes we can!

    if such low accuracy is OK for you, I'd suggest using a 1024-entry lookup table instead; it's probably too low precision for the problem at hand, though. anyway, without access to the original 3DPM source and some checksums/results to compare with your version, comparing timings is a worthless...
  32. B

    3DPM: Can we fix it? Yes we can!

    FYI a Google search returns this page with the whole source code http://www.netlib.org/fdlibm/ the source for sin() after range reduction is in the file k_sin.c, and it's based on a 13th-order minimax polynomial approximation (using Horner's form of the polynomial and the Remez algorithm for better...
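
The Horner-form evaluation scheme mentioned can be illustrated with a simplified odd-polynomial kernel for |x| <= pi/4 (after range reduction). The coefficients here are truncated Taylor terms for clarity; fdlibm's k_sin.c uses minimax coefficients fitted with the Remez algorithm instead, which buys more accuracy at the same degree:

```c
#include <math.h>

/* Horner-form odd polynomial: x - x^3/3! + x^5/5! - x^7/7!,
   a simplified stand-in for fdlibm's k_sin kernel, |x| <= pi/4. */
double sin_poly(double x) {
    double z = x * x;
    return x * (1.0 + z * (-1.0/6.0 + z * (1.0/120.0 + z * (-1.0/5040.0))));
}
```

Even this short Taylor truncation is accurate to roughly 1e-7 over [-pi/4, pi/4]; the Horner form needs only one multiply-add per coefficient.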
  33. B

    3DPM: Can we fix it? Yes we can!

    this looks very strange to me, where have you seen this explained ?
  34. B

    3DPM: Can we fix it? Yes we can!

    Taylor series are generally a bad idea; you get far better results (speed vs. accuracy) with Legendre or Chebyshev polynomial approximations. also for sin/cos, with any approximation, a challenging part is range reduction. a very good book on the subject that I'd recommend is Elementary...
  35. B

    3DPM: Can we fix it? Yes we can!

    note that this is what I've done and reported here a few months ago: http://forums.anandtech.com/showpost.php?p=37026239&postcount=225 AFAIK there isn't a new version of the bench available