This video explains the underlying mechanisms of AI image and video generation, focusing on diffusion models. It breaks the process down by first introducing CLIP (Contrastive Language–Image Pre-training) and its shared embedding space, then explaining diffusion models and DDPM (Denoising Diffusion Probabilistic Models), detailing how they learn to reverse a noise-adding process by predicting vector fields. Finally, it discusses how CLIP and diffusion models are combined for text-to-image/video generation using techniques like conditioning, guidance, and negative prompts.
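To make the conditioning-and-guidance idea concrete, here is a minimal sketch (not from the video) of a single classifier-free-guidance denoising step. The noise predictor `eps_model`, the prompt embeddings, and the schedule constants are hypothetical placeholders standing in for a real U-Net, CLIP text encoder, and noise schedule.

```python
# A minimal sketch of one DDPM-style reverse step with classifier-free guidance.
# Everything here (eps_model, embeddings, schedule values) is a toy stand-in.
import torch

def guided_denoise_step(eps_model, x_t, t, cond_emb, uncond_emb,
                        guidance_scale, alpha_t, alpha_bar_t, sigma_t):
    """One reverse-diffusion step with classifier-free guidance.

    cond_emb   -- embedding of the prompt (e.g. from a CLIP text encoder)
    uncond_emb -- embedding of an empty or negative prompt
    """
    # Predict the noise with and without the conditioning signal.
    eps_cond = eps_model(x_t, t, cond_emb)
    eps_uncond = eps_model(x_t, t, uncond_emb)

    # Guidance: push the prediction away from the unconditional (or
    # negative-prompt) direction and toward the conditional one.
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Standard DDPM posterior mean, then add noise (except at the final step).
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t).sqrt() * eps) / alpha_t.sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + sigma_t * noise

if __name__ == "__main__":
    # Dummy noise predictor, just to show the data flow end to end.
    dummy_eps = lambda x, t, emb: 0.1 * x + emb.mean()
    x_t = torch.randn(1, 3, 8, 8)
    out = guided_denoise_step(dummy_eps, x_t, t=10,
                              cond_emb=torch.randn(16),
                              uncond_emb=torch.randn(16),
                              guidance_scale=7.5,
                              alpha_t=torch.tensor(0.98),
                              alpha_bar_t=torch.tensor(0.5),
                              sigma_t=torch.tensor(0.05))
    print(out.shape)
```

Passing a negative prompt's embedding as `uncond_emb` is what makes guidance steer the sample away from that description while moving toward the positive prompt.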
The transcript doesn't directly address why all AI images might look like "AI images" even when they appear photorealistic at first glance. However, it does offer clues that could contribute to this phenomenon: the denoising process effectively averages over many plausible outputs, generated samples may land near but not exactly on the true data manifold, and the model can subtly misinterpret the prompt.
Essentially, while AI models are getting incredibly good at mimicking reality, these underlying processes can still introduce tell-tale signs that differentiate their outputs from genuine photographs.