Your view on AMD's Bulldozer

Tiffanys:
I hope they're successful, because competition with their main competitor, Intel, will drive prices down, and a free market is healthier for people than a monopoly.

As for me, I'm sticking with Intel.

I wasn't all that impressed with their world-record-breaking 8GHz liquid nitrogen/helium stunt either. Though I hear iBuyPower is going to try to make those available to the public? Getting those kinds of speeds without extreme cooling is impossible at this point. They'd need a mini fridge for a case; even standard liquid cooling wouldn't come close to doing that...

kitamesume:
^ i wanna see ivy bridge's i3 start at 75$ and i5 at 150$ ::)

TMRNetShark:

--- Quote from: kitamesume on September 29, 2011, 07:20:18 PM ---^ i wanna see ivy bridge's i3 start at 75$ and i5 at 150$ ::)

--- End quote ---

That would never really happen. Intel knows their place... so they will keep it.

kitamesume:
^ unless the bulldozer's i3 equivalent is priced at 80$, then intel will either lower their price or suck their thumbs while throwing lawsuits XD

Lupin:
It looks like x264 benefits a lot from Bulldozer's new instructions, FMA4 and XOP, both of which are new from AMD.

http://www.planet3dnow.de/vbulletin/showpost.php?p=4501020&postcount=562
http://www.planet3dnow.de/vbulletin/showpost.php?p=4502029&postcount=585

full IRC logs: http://akuvian.org/src/x264/freenode-x264dev.log.bz2

--- Quote ---2011-09-16 23:42:16 < Dark_Shikari> Oh YI, we know now why AVX is useless on bulldozer
2011-09-16 23:42:20 < Dark_Shikari> *FYI
2011-09-16 23:42:22 < Dark_Shikari> Move elimination
2011-09-16 23:42:29 < Dark_Shikari> Their OOE engine eliminates moves and resolves them before ALU stage
2011-09-16 23:42:34 < Dark_Shikari> So moves are free, so AVX doesn't help
2011-09-16 23:42:39 < Dark_Shikari> Except reducing code size ofc
--- End quote ---
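
For anyone wondering why free moves cancel out the 3-operand advantage of AVX, here is a toy sketch. It is not a model of the real Bulldozer pipeline: the renamer, the `alu_uops` function and the register names are all made up for illustration, and the only idea taken from the log is that moves get resolved before the ALU stage.

```python
# Toy model only: count the instructions that occupy an ALU slot when
# computing c = a + b while a and b both stay live. The renamer below is
# hypothetical and only illustrates "moves are resolved before the ALU".

def alu_uops(program, move_elimination):
    """Return how many instructions actually occupy an ALU slot."""
    rename = {}       # architectural register -> physical register id
    next_phys = 0
    uops = 0

    def phys(reg):
        nonlocal next_phys
        if reg not in rename:
            rename[reg] = next_phys
            next_phys += 1
        return rename[reg]

    for op, dst, *srcs in program:
        for s in srcs:
            phys(s)
        if op == "mov" and move_elimination:
            # Alias dst to the source's physical register: no ALU work.
            rename[dst] = phys(srcs[0])
            continue
        rename[dst] = next_phys   # fresh physical register for the result
        next_phys += 1
        uops += 1
    return uops

# 2-operand SSE style: the destination is overwritten, so copy first.
sse_style = [("mov", "xmm2", "xmm0"), ("add", "xmm2", "xmm2", "xmm1")]
# 3-operand AVX style: separate destination, no copy needed.
avx_style = [("add", "xmm2", "xmm0", "xmm1")]

print(alu_uops(sse_style, move_elimination=False))  # 2
print(alu_uops(avx_style, move_elimination=False))  # 1
print(alu_uops(sse_style, move_elimination=True))   # 1
```

In this toy model the SSE-style sequence costs the same ALU work as the AVX one once moves are eliminated, which leaves code size as the remaining difference, as the log says.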


--- Quote ---2011-09-23 18:56:03 < Dark_Shikari> Okay, so I have a massive series of bulldozer profiles ready
2011-09-23 18:56:13 < Dark_Shikari> It has instruction-based sampling and all sorts of awesome stuff
2011-09-23 18:56:43 < JEEB> AMD? Awesome stuff? This sounds like something that doesn't happen very often
2011-09-23 18:57:21 < Gramner> any NDA?
2011-09-23 18:59:53 < Dark_Shikari> Technically yeah
2011-09-23 19:00:08 < Dark_Shikari> Though a lot of the stuff isn't bulldozer-specific, its performance counters are just awesome
2011-09-23 19:00:32 < Dark_Shikari> Unsurprisingly, our load/store queue is full in pixel_avg functions.
2011-09-23 19:01:25 < Dark_Shikari> Er, load queue.
2011-09-23 19:01:36 < Dark_Shikari> Our store queue, on the other hand, fills in plane_copy, mc_copy...
2011-09-23 19:01:38 < Dark_Shikari> slicetype_mb_cost?
2011-09-23 19:02:12 < Dark_Shikari> cache_load and cache_save, guess that's obvious
2011-09-23 19:02:33 < Dark_Shikari> analyse_init, naturally
2011-09-23 19:02:50 < Dark_Shikari> Okay, time for INEFFECTIVE_SW_PREFETCHES
2011-09-23 19:03:05 < Dark_Shikari> Oh, this is awesome. It tells you when a prefetch is useless, i.e. the data was already in L1 cache
2011-09-23 19:03:12 < Dark_Shikari> Almost all of the "useless prefetches", pengvado, are in hpel_filter
2011-09-23 19:03:21 < Dark_Shikari> The rest are in cache_load
2011-09-23 19:03:23 < Dark_Shikari> Guess that's expected.
2011-09-23 19:04:02 < Dark_Shikari> Next: DECODER_EMPTY.
2011-09-23 19:04:17 < Dark_Shikari> I... think this is where the instruction decoder... hmm. Is this where the decoder is too fast, or too slow?
2011-09-23 19:04:43 < Dark_Shikari> Okay, it's where the decoder is too slow (there's nothing to dispatch)

(...)
2011-09-23 21:47:40 < Dark_Shikari> Thank you performance counters, I think I just made CABAC RD way faster
2011-09-23 21:48:37 < LordRPI> nice
2011-09-23 21:49:22 < Dark_Shikari> 50% of the branch mispredictions in cabac were on one line of code
2011-09-23 21:49:26 < Dark_Shikari> a restructure of the function, kabam
--- End quote ---


--- Quote ---2011-09-27 00:55:51 < Dark_Shikari> pengvado: oh oops, vpermilps and pd are 5-operand (!!!!!)
2011-09-27 00:55:57 < Dark_Shikari> dst,src1,src2,selector,imm8
2011-09-27 00:56:25 < Dark_Shikari> I mean seriously wtf
2011-09-27 01:04:02 < Dark_Shikari> Also, they apparently dropped 3DNOW
--- End quote ---


--- Quote ---2011-09-28 01:33:41 < Dark_Shikari> AVX mbtree propagate is slower than sse2
2011-09-28 01:33:49 < Dark_Shikari> FMA only barely manages to get it fast again.
2011-09-28 01:33:49 < kemuri-_9> lol
2011-09-28 01:33:52 < Sean_McG> hahah
2011-09-28 01:33:59 < Dark_Shikari> SSE2: 342 cycles
2011-09-28 01:34:00 < Dark_Shikari> AVX: 374
2011-09-28 01:34:05 < Dark_Shikari> FMA4: 340
2011-09-28 01:34:18 < kemuri-_9> lol
2011-09-28 01:34:26 < Dark_Shikari> I guess this makes sense given that it only has 128-bit execution units
2011-09-28 01:34:34 < Dark_Shikari> and the INT16_TO_FLOAT code is obnoxiously slow because avx sucks
2011-09-28 01:34:41 < Dark_Shikari> i.e. avx has no way of doing int16_t -> float fast
2011-09-28 01:35:18 < Dark_Shikari> Hmm. I wonder if FMA4 supports sse registers?
2011-09-28 01:35:37 < Dark_Shikari> Oh. It *does*...
2011-09-28 01:35:38 < Dark_Shikari> Let me try that.
2011-09-28 01:37:45 * codestr0m ears perk up
2011-09-28 01:49:29 < Dark_Shikari> FMA4: 314 cycles. Much better
2011-09-28 01:49:46 < codestr0m> Dark_Shikari: what was the change?
2011-09-28 02:01:21 < Dark_Shikari> using the sse instead of avx version
2011-09-28 02:01:26 < Dark_Shikari> as the basis for xop
--- End quote ---
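
To put the mbtree propagate numbers side by side, a quick sketch that only does arithmetic on the cycle counts quoted above. The labels "AVX-based" and "SSE-based" for the two FMA4 results are my reading of the log (the 314-cycle version came from using the SSE version as the basis), not names used in x264.

```python
# Cycle counts copied from the log above (mbtree propagate on Bulldozer).
# The two FMA4 labels are an interpretation of the log, not official names.
cycles = {
    "SSE2": 342,
    "AVX": 374,
    "FMA4 (AVX-based)": 340,
    "FMA4 (SSE-based)": 314,
}

def speedup_vs(baseline, name):
    """Speedup of `name` relative to `baseline`; above 1.0 means faster."""
    return cycles[baseline] / cycles[name]

for name in cycles:
    ratio = speedup_vs("SSE2", name)
    print(f"{name:18s} {cycles[name]:4d} cycles  {ratio:.3f}x vs SSE2")
```

So AVX needs roughly 9% more cycles than SSE2 here, while the SSE-based FMA4 version needs roughly 8% fewer.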


--- Quote ---2011-10-01 02:09:51 < Dark_Shikari> xop will make this a lot easier, but I'm trying to do ssse3 first
--- End quote ---


--- Quote ---2011-10-04 04:46:38 < Dark_Shikari> C, with mode analysis shortcuts: 253 cycles
2011-10-04 04:46:45 < Dark_Shikari> My crappy, badly optimized XOP asm: 93 cycles
2011-10-04 04:46:56 < Dark_Shikari> This is kinda awesome
2011-10-04 04:49:35 < Dark_Shikari> Oh, and old without shortcuts: 379 cycles
2011-10-04 04:49:45 < Dark_Shikari> My asm is 4 times faster than the existing... wait where have we seen this before? XD
2011-10-04 04:49:57 < Dark_Shikari> It's just like SAD_4x4_x9 all over again!
2011-10-04 04:50:10 < JEEB>
2011-10-04 04:50:18 < JEEB> that sounds pretty awesome
2011-10-04 04:50:21 < Dark_Shikari> Except this time I'm still wondering how best to do it without vpperm
2011-10-04 04:50:33 < Dark_Shikari> Thanks AMD, for bringing back the best instruction ever after 15+ years of hiatus.
--- End quote ---
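
For readers who haven't met vpperm: it is a byte permute that picks each output byte from two source registers according to a selector register. Below is a deliberately simplified sketch of that idea. The function name `byte_permute` is hypothetical, it only uses the low 5 bits of each selector byte as an index, it ignores the extra per-byte operations the real instruction has, and which source ends up at indexes 0-15 is an assumption of the sketch, so check AMD's manual before relying on it.

```python
def byte_permute(src_a, src_b, selector):
    """Pick 16 bytes out of the 32-byte pool src_a + src_b.

    Simplified: only the low 5 bits of each selector byte are used as an
    index. The real instruction also has per-byte operations that this
    sketch ignores, and the source ordering here is an assumption.
    """
    if not (len(src_a) == len(src_b) == len(selector) == 16):
        raise ValueError("all three operands must be 16 bytes")
    pool = bytes(src_a) + bytes(src_b)
    return bytes(pool[s & 0x1F] for s in selector)

a = bytes(range(0x00, 0x10))
b = bytes(range(0x10, 0x20))
# Interleave the low 8 bytes of a and b with a single permute.
interleave = bytes(i // 2 + (16 if i % 2 else 0) for i in range(16))
print(byte_permute(a, b, interleave).hex())
```

The point is that one instruction can do an arbitrary shuffle across two registers, which would otherwise take several unpack/shuffle steps.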
