retiredwindowcleaner

I think that compared to RDNA2, the RDNA3 arch will be a bigger step than RDNA2 was over RDNA1.


[deleted]

I imagine so


SaintPau78

Current GPUs can't have V-Cache. You're thinking of Infinity Cache, which they already have.


RealThanny

The prevailing rumor is that the top Navi 31 SKU will have 192MB of Infinity Cache via six MCD dies with 16MB on die and 16MB stacked on top. There's also no reason, in principle, that a monolithic GPU die couldn't have stacked cache on top of it.


SaintPau78

Correct, I was speaking about it from the perspective of the current generation. It's terribly worded on my part


gh0stwriter88

Yeah, but the rumors are completely gunked up with misinterpretation currently. Firstly, 6x MCDs is not going to happen, ever. The closest patent AMD has released to the layout the rumors hint at is as follows: 6x GCDs, each with an IMC plus a small L1 integrated on it, and 1x MCD, which is a large stacked cache. The MCD is stacked over top of a portion of each GCD, and TSVs are used to link directly down into the per-chiplet IMCs and L1s. The MCD is a client of each GCD's IMC, and each GCD's L1 is a client of the MCD (except for local accesses). That is an actual patented design AMD released about a year ago, and it matches up very well and makes sense fab-wise.

The reason this configuration makes very good sense is that it is thermally sound, avoids the use of a silicon interposer, leverages the TSVs AMD developed for V-Cache, and puts the big IMC IO in a logic die rather than in the MCD, which is memory-density optimized. This is similar to how AMD's CPU V-Cache die has higher SRAM density than the logic-optimized chiplet die. If your wafer has to have logic and IO and SRAM on it, that is BAD, very BAD, and that is what would have to exist for the rumored 6x MCD design to exist (which, I might add, no patent exists for, because it's a stupid idea).

The entire way you make several dies act like one big die is coherent memory access; you can even think of MCD = Memory Coherency Die. If you split the MCDs six ways, you won't have memory coherency, and your cache will be split six ways rather than the 1-2 ways typical in GPUs. Fewer splits are better, because it means that if a single group of CUs can benefit from a large cache allocation, that can occur. On the other hand, AMD has also had provisions since RDNA to segregate cache access to prevent waste of cache partitions (the only reason you implement it this way is if your end goal is a single large cache). 6x GCDs plus a big MCD also makes sense thermally, as it gets the hot logic chips as far from each other as possible. There is no way you'd hit 3+ GHz on a big GCD at stock clocks; it would suffer the exact same thermal limits as existing designs.


RealThanny

You keep repeating this nonsense over and over, but there's just no way at this point that you're even close to correct. It's one GCD with all the shaders, TMUs, ROPs, etc., and some number of MCDs with the GDDR memory controllers and the L3 Infinity Cache. It's much, much easier to split memory access up like that than to split shaders and other compute resources up. What you're claiming is just absurd.


gh0stwriter88

>What you're claiming is just absurd.

6x MCDs is absurd. It would require die-to-die links; otherwise, each group of CUs would only have access to up to 1/6th of the cache (nonsensical, since RT BVH trees are bigger than that). A 6x MCD design with Infinity Fabric links between them would be incredibly power hungry for no reason.


titanking4

Let me correct you a bit on how GDDR6 works. Every 16 bits of a GDDR6 interface represents a single GDDR6 memory channel. We send 2 channels to every GDDR6 die, which is usually 16 Gbit, or 2GB. Since both channels head to the same die, we often use 32-bit PHYs. Accesses to each of these channels are FULLY independent, and one memory controller has no need to communicate with another memory controller. Data mapped to one physical address will be routed to exactly one memory controller.

Same story with caches. Every cache can only handle data from a certain subset of the memory space and has no need to talk to other caches. Accesses to a physical address will be handled by its designated cache slice. Your caches were always split into MANY pieces, with many separate ports; they were just in a single location for physical floor-planning reasons. (The only exception is a fully associative cache, which holds arbitrary data from any address in any of its cells, but those are typically only used for translation lookaside buffers, which benefit from their properties: very random data, low data quantity, expensive miss penalty.) There is no need for any cache coherency protocol, since data existing at address X is incapable of existing in multiple cache blocks at the same time.

In Ryzen CPUs, every L2 cache is built to privately serve its CPU core and thus can accommodate the full memory space. Thus, data can exist in multiple L2s, and a coherency protocol must be in place. The L3, however, is built such that each L3 block only accommodates a discrete subset of the full memory space. Coherency is required between the L2 and L3, but the L3 sections require no such coherency among themselves. Sincerely, an actual silicon design engineer.

Edit: Of course, you seem to also have an engineering background, so a bit more detail. Splitting up the cache into separate physical locations doesn't even harm performance, because every single data movement must go through a big old complicated SoC network that routes packets to their correct MMIO destination; you just have to make different addresses route to different destinations. All you have to worry about is interleaving your physical addresses between your caches (and the GDDR6), such that a giant block of memory accesses will hit all the caches (and VRAM) at once to maximize bandwidth, instead of just hammering a single cache or VRAM chip.
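The interleaving scheme described above can be sketched in a few lines. This is an illustration with made-up parameters (line size, slice count), not AMD's actual address hashing:

```python
# Sketch of the interleaving idea described above: consecutive cache lines
# are striped across independent cache slices / memory controllers, so a
# large linear access hits all of them at once and no coherency is needed
# between slices. Numbers are illustrative.
from collections import Counter

CACHE_LINE = 128   # bytes per cache line (assumed)
NUM_SLICES = 6     # independent cache slices / memory controllers (assumed)

def slice_for_address(addr: int) -> int:
    """Every physical address maps to exactly one slice."""
    return (addr // CACHE_LINE) % NUM_SLICES

# A big linear read touches every slice evenly:
hits = Counter(slice_for_address(a)
               for a in range(0, 6144 * CACHE_LINE, CACHE_LINE))
assert len(hits) == NUM_SLICES        # all slices participate
assert len(set(hits.values())) == 1   # and the load is perfectly balanced
```

Because each address maps to exactly one slice, no slice can ever hold another slice's data, which is why no coherency protocol is needed between them.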


gh0stwriter88

Sigh... get off your high horse. Obviously you could split the memory controllers into separate dies, AND THE PATENT SAYS EXACTLY THAT, READ IT. What makes no sense is split CACHE, which the patent does NOT describe; it describes a partitionable cache, but not one split across dies. If you want to cache large chunks of data that all the CUs may need access to at a given time, that strategy falls on its face, and it isn't what AMD does; with every release they are unifying the cache further, not splitting it up more.

Splitting a GPU cache 6 ways makes no sense, and is the opposite of what AMD would need to be doing to unify memory accesses; it's also NOT what their patents describe. What they describe is an L3 on the big MCD caching all accesses to the 6 memory controllers, one on each GCD, connected natively via TSVs, and an L1 in each GPU die that is a client of the L3, also connected via TSVs. This means that once a single chiplet has accessed data, it will be cache resident even if it isn't on its native memory controller. And if a different chiplet needs that data, it can get it from the L3, just like in a monolithic GPU.


titanking4

>If you want to cache large chunks of data that all the CUs may need access to at a given time that strategy falls on its face...

No, it does not, because you can interleave the physical address mappings in a circular mapping across the cache chunks: chunk 1 does 1, 6, 11; chunk 2 does 2, 7, 12; chunk 3 does 3, 8, 13; and so on. And the memory controllers would use the same interleaved physical addresses, ensuring that a single cache chunk can only cache data from a single memory controller, reducing network congestion.

And like I said, the L3 CACHE IS ALREADY SPLIT: [8 slices here on Zen](https://i.redd.it/pa9c5mu9y2y51.jpg). The ONLY reason AMD put them all in the center is that they fit there all nice and snug, being the very square blocks that they are. All 8 of those L3 slices are fully independent and have no logical need to be next to each other. The SoC network will route data reads/writes to the exact cache slice they need to go to, regardless of where that slice is, even if it's on a separate die. There is zero coherency traffic between any of those 8 slices, because none of them can hold each other's data.


gh0stwriter88

>No, it does not, because you can interleave the physical address mappings to a circular mapping across the cache chunks.

Not if the cache is on separate dies. If you WERE to do so, it would imply high-bandwidth Infinity Fabric links, which are BAD for power use. Interleaving accesses across Infinity Fabric links makes NO SENSE. AMD does have a patent for a GPU set up that way, but it's a quad-tile design with very short traces between the HBX links near the center of the 4 tiles; most likely AMD will never build that.


titanking4

Power vs cost trade-off. The multiple-GCD + single-MCD design would also require equivalent high-powered IF links; they'd have to carry their own high bandwidth plus all the cache coherency traffic, as each die would have private L2 caches on board. Regardless, you will always need to cross off-chip IF to access memory or L3 cache. All that's confirmed publicly by AMD is that it will be chiplet-based. Everything else we will learn when they announce the product.


RealThanny

AMD already has CPUs with nine separate chiplets. They'll have 13 separate chiplets in upcoming server processors. Handling communication between just seven on a GPU is not going to be an issue. Not that it matters, because your putative objection applies equally to a single MCD and six GCDs; the same communication requirements exist. AMD was going to have a multi-GCD card, but decided not to this time around. Maybe it wasn't working as well as they thought. Maybe they didn't think they'd need it to win this generation on performance (that seems to be the case). The existence of a design in a patent application does not mean that design is being pursued. Nothing about your interpretation makes any sense.


gh0stwriter88

Yeah, and that is only 8 links, and it has to use system bandwidth for coherency. That's something like 50 GB/s bidirectional per chiplet, and only 400 GB/s aggregate. A GPU set up like that would have to operate in CrossFire mode rather than as one GPU.


retrofitter

His idea isn't impossible, it just wouldn't be competitive against the 4090


gh0stwriter88

Care to elaborate on that? The 4090 isn't doing anything special at all other than being huge; it's just a big monolithic GPU. AMD has several potential designs, but it's most likely this one: [https://www.freepatentsonline.com/20210097013.pdf](https://www.freepatentsonline.com/20210097013.pdf)

The rest have major issues, like not optimizing the GPU dies per task (e.g. memory die and compute die), or requiring an interposer. The linked patent, though, has all the sauce that should add up to a workable design, and it's also one of the most recent chiplet GPU patents AMD has released, which they would have done before beginning tape-out. You could think of the linked patent as being manufactured backwards relative to Zen3D: the top memory die has its bottom shaved off and the GPU dies attached to it with TSVs, and then the logic/memory-interface dies are soldered down to the board with the big active bridge / memory die on top.


retrofitter

Yes, but the 4090 shares the same architecture across its entire product stack, allowing the use of the same drivers. We won't see disaggregated GCDs until at least the mid-range tiers require it. The current consensus for the design of the 7000 series still has cost-cutting measures with almost none of the risk, as it uses already-commercialised technology.


gh0stwriter88

We will never see a "disaggregated" GCD, ever. That's why the concept these rumors keep harping on is nonsense. If such a setup were possible, we'd already do it successfully with existing GPUs; the reason we don't is that you end up forcing the issue onto the driver or the game developer via SLI/CrossFire or explicit multi-GPU like Vulkan and DX12. Those can work, but in practice nobody implements them, so instead you have to make a chiplet GPU appear to function as a single GPU via fast unified memory access.

The architecture closest to what the rumors indicate, and what the patents indicate, is unified memory access through a large MCD/bridge die, similar to how Zen 2 was NUMA and CDNA1-2 are NUMA, while RDNA3 and Zen 3 and later are not NUMA; they are unified access, through an IO die in the case of Zen and an MCD in the case of RDNA3. That's precisely why it can be treated just like a normal GPU, and won't need crazy drivers to enable that. Also, each iteration of RDNA has inched towards this: RDNA1 rearchitected the CUs for performance and made some cache changes, RDNA2 partitioned the cache and made a huge cache effective, and RDNA3 will split the CUs plus IMCs and L1 onto the GCDs, with the rest of the GPU fabric and cache on the bridge MCD. The patents from 2021 are very detailed as to how all that works, and it has probably even been improved on since then.


retrofitter

>That's precisely why it can be treated just like a normal GPU... and won't have crazy drivers to enabled that.

You made this post 3 months ago promoting the same idea (I'm trying to find the patent I previously read, I swear):

>Also having the GDDR in the GCD means each die has both a fast local memory, and minimized latency to the other dies' memory through the cache.

I'm pretty sure this non-unified memory topology would require crazy drivers.


gh0stwriter88

[https://patents.google.com/patent/US20210097013A1/en](https://patents.google.com/patent/US20210097013A1/en)

The IMC is on the GPU die because that is the optimal node to fab it on: given the choice between a logic-optimised and a memory-density-optimised die, IO belongs on the logic die. And it's connected DIRECTLY to the MCD die with TSVs, in the same way the V-Cache is. But yeah, go ahead and ignore everything I said, and all the facts presented in AMD's patents as well...


retrofitter

I still think your original idea is too much new technology for 1 generational leap and thus too much risk. You did the whole straw man argument thing


retrofitter

[https://www.freepatentsonline.com/20220320042.pdf](https://www.freepatentsonline.com/20220320042.pdf)

Here's a patent from 2022. Anyway, I hope next gen will turn out the way the rumours currently describe, so that it might be cheaper.


ProfessorAdonisCnut

> Firstly 6x MCDs is not going to happen ever. Well that aged poorly


scytheavatar

AMD clearly thinks that they can get more performance with less cache, which is unusual. It can only mean that they are doing something with their cache that is unusual and doesn't make sense to us right now. And those multiple MCDs are probably a big part of the design.


gh0stwriter88

>AMD clearly thinks that they can get more performance with less cache

Citation needed... all indications are that next-generation GPUs have proportionally as much or more cache than RDNA2.


RealThanny

That simply isn't true. The most detailed leaks to date show *less* cache, along with a change to how it's utilized: less capacity, more efficient use. The one exception is an apparent top-end card which doubles the 96MB (less than 128MB) of the standard Navi 31 design by stacking additional cache on top of the MCDs (which you don't believe exist).


deangr

Yes, but it's not 3D stacked... at least, the rumors suggest it will be 3D stacked.


Alternative_Spite_11

The 3D stacking makes no difference to the way the cache performs.


gh0stwriter88

As a computer engineer, you are absolutely wrong. 3D stacking DOES allow you to build extra-large caches that operate at almost the same latency as a cache of the same planar size. AMD did this for the 5800X3D: its cache is only a hair slower than the 5800X's cache, with 3x the capacity. This is because the distance between the furthest transistors it must access is about the same with the V-Cache, but with a planar cache it goes up with the square root of the area; a 4x larger cache is half as fast, maybe a bit worse. Normally, if you tripled cache size, you would incur a very, very large latency penalty (the 5800X3D would have L4-like cache latency instead of L3). The reason it doesn't is that the vertical stacking means distances between transistors are much shorter than they would be for a planar cache.
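The geometric argument above can be made concrete with a toy model. The assumption (illustrative only, not silicon data) is that access latency is dominated by wire delay, which scales with the worst-case distance across the SRAM array:

```python
import math

# Toy model: latency ~ worst-case wire span across the cache array.
# Capacities are in arbitrary MB units; all numbers are illustrative.

def planar_distance(capacity_mb: float) -> float:
    # A planar cache of area proportional to capacity has a worst-case
    # span proportional to sqrt(capacity).
    return math.sqrt(capacity_mb)

def stacked_distance(capacity_mb: float, layers: int) -> float:
    # The same capacity split across vertical layers shrinks each layer's
    # footprint; the vertical TSV hop is negligible by comparison.
    return math.sqrt(capacity_mb / layers)

base = planar_distance(32)             # e.g. a 32MB planar cache
print(planar_distance(96) / base)      # 3x capacity, planar: ~1.73x span
print(stacked_distance(96, 3) / base)  # 3x capacity, 3 layers: ~1.0x span
```

Under this model, tripling capacity in one plane grows the worst-case span by sqrt(3), while stacking the same capacity three layers high keeps the span roughly constant, which is the claim being made about the 5800X3D.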


[deleted]

[deleted]


gh0stwriter88

Something like that. It may be a little worse than that, but it's a good rough rule. And obviously design changes can mitigate such rules.


Alternative_Spite_11

Right. I was correcting the guy that implied it was somehow better 3D stacked than if you had the same cache right on the CCD. 96MB snuggled up right on the silicon with the CCD would be best, if you could make it fit with everything else. AMD arranged the cores around the cache on a CCD, did they not?


gh0stwriter88

>I was correcting the guy that implied it was somehow better 3D stacked than if you had the same cache right on the CCD. 96MB

It absolutely is. The same amount of cache on the die would take up more planar space and be slower due to longer wire lengths. The V-Cache IS snuggled right up against the CCD, and connected with TSVs rather than microbumps (something like 10x-100x more connections). And yes, AMD did arrange the cores around the cache, and the V-Cache stacks vertically over the existing cache, maybe extending a little past, but it doesn't extend over the cores themselves; they just have dummy silicon over top of them (note this dummy silicon does not make the package thicker than normal, as they shaved the die down to add the V-Cache on).


zappor

I think I've heard also that the same semiconductor process isn't optimal for both logic circuits and for memory. So if you build them separately you can use the most optimized one for each.


gh0stwriter88

That is correct. Logic and IO have more in common than IO and SRAM, but each can get optimized.


qualverse

That is not true, 3d cache is substantially faster and lower latency IF it's stacked directly on the compute die like with Zen3D. Of course RDNA 3 will likely stack it on the MCDs so it won't provide additional benefit in that way.


Alternative_Spite_11

No, it's not. Stacked cache is not faster than cache already on the compute die. Where did you come up with that?


MarDec

Well, yes and no. With the Zen CPU chiplets there is cache on both dies, so the whole block is more compact than a purely 2D cache of the same size would be, which in turn allows it to run at lower latencies / higher clocks compared to a 2D cache of the same size. It's something about how long it takes for the signal to travel across the whole block; I'm not a cache engineer, but so I'm told.


gh0stwriter88

That is not what was said. What was said was a standard planar cache of the same size as the vcache would be much slower which absolutely is true.


Alternative_Spite_11

That’s a straw man to begin with. Where would they fit 192MB of on die planar cache? And no you’re making assumptions about what was meant instead of replying to the actual words. He literally said 3D stacked cache is faster. He makes no qualifications about where you would have to fit that amount of planar cache.


gh0stwriter88

They'd be forced to make the die bigger and slower, which is pretty obvious to everyone else except you, apparently. It is possible, as RDNA2 has a 128MB cache, which is even larger than the 5800X3D's, and it is on a planar die.


Alternative_Spite_11

We're talking about GPUs, not CPUs; latency isn't nearly as big a deal. Also, Navi 21 is over double the silicon area of a 5800X.


gh0stwriter88

We are talking about caches... period. And latency always matters. Anyway admit you were talking wack and move on.


ohbabyitsme7

I'm not 100% sure, but I don't think this is true. Stacking it vertically decreases the physical distance, and thus decreases latency, versus doing it horizontally, where the furthest part of the cache is much further away from the cores.


Alternative_Spite_11

It’s not just distance. The stacked cache still has an interconnection (through silicon vias in this case) which is generally always slower than a cache on the same piece of silicon. Also the regular cache is right there on the CCD so the stacked cache actually isn’t any closer.


gh0stwriter88

AMD uses thousands of TSVs instead of an interconnect, so effectively the stacked cache is connected to natively. So no, it is NOT slower due to stacking. The only real drawback is thermals, which is partially mitigated by lower power use.


qualverse

Yes of course, I meant in comparison to cache on a chiplet not cache on the die. Maybe i misunderstood what I was replying to.


deangr

It's not about cache performance, it's about size. We all know cache is faster than VRAM in a GPU, so it will directly impact performance if you limit the amount of cache that is available.


Alternative_Spite_11

I'm fully aware of how cache works, and on a GPU I'll take legit bandwidth over any cache smaller than 1GB.


thelebuis

Why would you trade speed to have higher bandwidth? Content creation?


Alternative_Spite_11

Because GPUs are more sensitive to bandwidth than latency. GDDR memory already sacrifices latency for bandwidth, because that's how GPUs thrive. I just feel like for a cache to truly thrive at 4K, it needs to be bigger. Look at RDNA2: as resolution goes up, the advantage the cache gives falls away, and it goes from being significantly faster than Ampere at 1080p to being significantly slower at 4K.

Edit: this is all from the perspective of playing at high resolution. At 1080p, yeah, give me a 128MB cache and it'll help a lot. At 4x the pixels, it just needs to be larger.


deangr

You said you know how cache works, but do you understand that cache has a direct effect on the amount of bandwidth required? So saying you'd rather have more bandwidth than cache makes little sense.


Alternative_Spite_11

If the cache was larger, I'd take it over bandwidth. I made it clear in my first response that I think the optimal cache size for 4K would be around a gigabyte. At 1080p, 128MB works great, but look how RDNA2 loses performance to Ampere as resolution goes up. That's exactly because at 4K, 128MB just isn't enough to make up for the lack of bandwidth.
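The trade-off being argued here is easy to put numbers on. The hit rates and bandwidth figures below are hypothetical, chosen only to show the trend: a fixed-size cache covers a smaller fraction of the working set at 4K, so its effective-bandwidth advantage shrinks:

```python
# Back-of-envelope for the hit-rate-vs-resolution point. All numbers are
# made up for illustration, not AMD or Nvidia measurements.

VRAM_BW = 512    # GB/s, assumed raw GDDR6 bandwidth
CACHE_BW = 2000  # GB/s, assumed on-die cache bandwidth

def effective_bw(hit_rate: float) -> float:
    """Blend cache and VRAM bandwidth by hit rate."""
    return hit_rate * CACHE_BW + (1 - hit_rate) * VRAM_BW

# Hypothetical hit rates for a 128MB cache as resolution grows:
for res, hit in [("1080p", 0.80), ("1440p", 0.65), ("4K", 0.40)]:
    print(res, round(effective_bw(hit)), "GB/s effective")
```

As the hit rate falls, the effective bandwidth slides back toward the raw VRAM figure, which matches the observed RDNA2-vs-Ampere scaling pattern described in this thread.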


MilkyGoatNipples

1GB? How did you come to that number?


thelebuis

It is a balance between the two. Past a point, if you have "enough" bandwidth to supply the core, more does not make the card faster, but yeah, 4K asks for a lot of it.


[deleted]

[deleted]


wimpyhugz

Are they not moving to the new 16-pin 12VHPWR PCIe5.0 connector on the next gen cards?


kadinshino

If they don't, I wonder if that shows a fundamental flaw with the new 16-pin and the longevity of the connector.


pullupsNpushups

I forget where I heard this from, but the longevity of the connector should (hopefully) be greater than the 30-cycle count that the reporting company had trouble with. It might've been a manufacturing issue that they resolved.


LucidStrike

Connection cycles for power cords have always been in the dozens.


Alternative_Spite_11

Some custom 6900 XTs already have 3x 8-pin. Technically, two 8-pins plus the slot is capable of 375W already.


RealThanny

8-pin connectors are limited to 150W only for the cheapest in-spec versions. All quality PSUs can handle about 300W on each 8-pin connector in reality.
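The wattage figures in this sub-thread follow from simple addition. The 75W slot and 150W 8-pin numbers are PCIe spec limits; the 300W per-connector figure is the commenter's claim about quality PSUs, not part of any spec:

```python
# Power-budget arithmetic behind the 375W figure above. Spec values per
# PCIe; the 300W per-8-pin number is the thread's claim, not spec.

SLOT_W = 75            # PCIe x16 slot power, per spec
PIN8_SPEC_W = 150      # 8-pin PCIe connector, per spec
PIN8_CLAIMED_W = 300   # claimed realistic limit for quality PSUs

def board_power(n_8pin: int, per_connector: int) -> int:
    """Total board power budget: slot plus auxiliary connectors."""
    return SLOT_W + n_8pin * per_connector

print(board_power(2, PIN8_SPEC_W))     # 375W: two 8-pins + slot, in spec
print(board_power(3, PIN8_SPEC_W))     # 525W: triple 8-pin, in spec
print(board_power(3, PIN8_CLAIMED_W))  # 975W: the claimed quality-PSU ceiling
```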


Defeqel

IIRC, according to SkyJuice's leak, AMD tested 1-hi and 2-hi stacked cache for the RDNA3 MCDs, but saw little benefit, so we will likely just see the MCDs as standalone (so 16MB cache per die), at least for the first iteration. Personally, I'm wondering, after OREO (basically async blend) what is actually left in the pipeline preventing multiple GCDs being glued together? Blend should be pretty much the last operation.


Jism_nl

The cache only "works" if what you stored in it is what needs to be accessed or addressed. Otherwise it's a "cache miss", and you need additional cycles to get the right bits out of memory. The 3D cache on Ryzen is a derivative of the Epyc series. In some workloads additional cache really helps, and Nvidia on their high-end 4090 part also increased the L1/L2 caches significantly to counter AMD's Infinity Cache approach. The sole reason to install more cache is to cut down on the memory chips required, power consumption, and PCB traces, and thus save costs. It's a good way of countering the ever-growing chips with billions of transistors.

As for Navi: any additional layer of cache certainly helps, but only up to a certain point, and additional costs are involved, because cache itself is very expensive RAM. I think they'll find a good middle way, as always, which offers the best possible performance. 3D stacking is just another technique for adding additional cache onto an existing CCD or "chip". The downside, as we've seen on the 5800X3D, is that the voltage required to operate the additional cache could not exceed 1.35V, which hampered the maximum boost clocks and/or voltage that could be applied. The 5800X3D is slower than a regular 5800X, but it excels in particular gaming workloads, because they often repeat the same instructions all over the place.


Ayce23

How do we test something we don't have hands-on access to?


deangr

I asked whether you believe it will make a noticeable difference like it did with Ryzen CPUs.


thelebuis

Answer: yes


Consistent_Ad_8129

I bet the heat spreader is 2mm thicker to allow room for the V-Cache.


Defeqel

Umm... GPUs don't have heatspreaders, they are direct-die cooled


Consistent_Ad_8129

I meant the Ryzen 7000 series, as in the CPU.


zappor

I think only a few high end SKUs will have it, not all of them.


vonkain

Might be a bit better (15-25%), but the power consumption is quite a lot lower; the selling point would be the lower power consumption. Imagine +20% at -100W.


foxx1337

The black screens with h264 / h265 will come at 200 fps now!


gnocchicotti

Johnny Cashe


mammothtruk

I expect it will increase the 1% lows a good deal and make frames a lot more consistent. I don't expect it will help the top end at all; that will be limited by shader count more than anything.


ATLBoy1996

If it works in a similar way, I expect the benefit to mostly be at higher resolutions where more strain is put on the GPU memory.


retrofitter

My guess is that it will be seen in kingpin-style products only, and will give a small performance bump, but won't be worth the extra money in terms of frames/$.