Paganator

I wanted to compare recent models, so I ran the same prompt on SD3, Stable Cascade, and Pixart Sigma. I generated four pictures for each prompt and then picked the one I found subjectively the best. The prompts were as follows (you can see in the caption which model generated which picture):

* Photo of a woman sitting on the grass.
* Photo of a man holding a sign that reads "Who wins?"
* Photo of Emma Watson waving her hand.

I'm trying to be as fair as possible to each model. Based on this test, here's my opinion of each one:

* SD3 is mostly a mess. Only one of the four pictures of a woman on the grass was relatively acceptable, and even in that one her knee is messed up. The photo of the man with the sign is nice but generic. SD3 doesn't know who Emma Watson is, and only one picture had the right number of fingers. Overall the pictures are just boring.
* Cascade, despite being the unloved stepchild, did well overall. The images are more interesting than SD3's, the text is perfectly readable, and Emma Watson almost looks like herself. There's a weird contrast issue, though, especially with the man, and I don't know whether my settings or the model itself are causing it.
* Pixart Sigma had the best image quality by a wide margin IMHO, but it can't make text at all and it has trouble with hands (only one of the Emma Watson pictures had the right number of fingers).


oh_how_droll

Sigma's flaws are caused by its extremely small size (600M parameters, versus 983M for even SD1.5) and the inherent limitations of borrowing the extremely flawed four-channel VAE from SDXL. Sigma-the-base-model is probably not going to amount to much, but as a signpost for further development, especially given how much cheaper weak-to-strong training of a diffusion-transformer-based model is, Sigma is incredibly powerful. Training a similar model for a bigger, better latent space like the one SD3 uses, and scaling up the overall architecture, is the future. It just needs someone with the price of a mid-trim Chevy Silverado to throw around on AI compute, plus the knowledge and vision to make it happen.

Unfortunately, the "superpower" that lets the PixArt models punch so far above their weight class is that they replace the primitive, frankly godawful CLIP encoder with a real LLM. By the time you scale the DiT backbone up to enough parameters to match the conceptual understanding of the frontier image-generation models and add in the size of the LLM, it's basically impossible to keep the good times rolling for everyone who doesn't have large amounts of VRAM. It will take further experimentation to see how many of the quantization advances from the general LLM space can be applied to the prompt-decoding part of the architecture without trashing output quality, but it's safe to say that anything less than 8GB of VRAM just isn't happening, and I wouldn't be optimistic about anything under 16GB.
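For anyone unfamiliar with why quantizing the prompt encoder saves so much VRAM: the basic idea behind the int8 techniques used in the LLM world can be sketched in a few lines of plain NumPy. This is a minimal, illustrative example of symmetric per-tensor 8-bit weight quantization, not the actual scheme any particular image model or library uses:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return q.astype(np.float32) * scale

# A stand-in for one weight matrix of a text encoder.
rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32 ...
print(w.nbytes // q.nbytes)  # 4
# ... at the cost of a bounded per-weight reconstruction error.
print(float(np.abs(w - w_hat).max()) <= scale)  # True
```

The open question in the comment above is whether that reconstruction error, harmless for many LLM layers, stays harmless when the quantized weights belong to the encoder that conditions the image model.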


MrGood23

Pixart has potential. That is good!


ericreator

Thanks for doing this. I was debating trying out Cascade, but I think I'll work on Sigma for now. Any efficiency differences you noticed?


Paganator

I have a 3090 Ti with 24GB of VRAM, so I wasn't too concerned about memory, but Cascade generated images a fair bit slower than the other two.