Abstract
Contemporary multimodal artificial intelligence systems, particularly large language models augmented with vision and audio capabilities, are frequently described as systems that "see." This dissertation argues that such descriptions are category errors. Despite impressive performance on perceptual benchmarks, web-trained multimodal models lack the architectural, dynamical, and epistemic properties required for genuine seeing. Drawing on computational neuroscience, statistical physics, cognitive science, and philosophy of perception, this work demonstrates that current systems perform context-conditioned inference rather than perception grounded in persistent world-models. A formal distinction is developed between retrodictive plausibility engines and predictive seeing systems. The dissertation concludes by proposing a principled research program for machine perception based on dynamical world-models, temporal coherence, and falsifiable commitments to reality.
Chapter 1: Introduction — The Misuse of the Term "Seeing"
1.1 Motivation
The term "seeing" has been widely applied to modern AI systems that classify images, describe scenes, and answer questions about visual inputs. This chapter argues that such usage conflates behavioral success with perceptual ontology. The motivation for this work is to clarify what seeing entails, why current systems do not meet that standard, and why this distinction is essential for progress toward artificial general intelligence.
1.3 Central Thesis
Web-trained multimodal models do not see because they lack persistent world-states, predictive commitments, and mechanisms for belief revision under uncertainty. They model correlations in appearance, not the dynamics of reality.
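The three missing properties named above can be made concrete with a toy sketch. The following example is purely illustrative and not part of the dissertation's formal apparatus: every detail (the five-state world, the 0.8 sensor accuracy, the function names) is an assumption chosen for brevity. A minimal Bayesian filter exhibits, in miniature, what the thesis claims current systems lack: a persistent belief state carried across time, a predictive commitment about the next observation, and revision of that belief when the observation arrives.

```python
import random

random.seed(0)

# Toy world: a hidden state x in {0..4} performs a noisy random walk;
# the sensor reports x correctly 80% of the time, otherwise a random value.
N = 5
P_OBS_CORRECT = 0.8

def step(x):
    # Hidden dynamics: move -1, 0, or +1, clamped to the state space.
    return max(0, min(N - 1, x + random.choice([-1, 0, 1])))

def observe(x):
    # Noisy sensor.
    return x if random.random() < P_OBS_CORRECT else random.randrange(N)

def predict(belief):
    # Predictive commitment: push the belief forward through the dynamics.
    new = [0.0] * N
    for i, p in enumerate(belief):
        for j in (i - 1, i, i + 1):
            new[max(0, min(N - 1, j))] += p / 3.0
    return new

def update(belief, obs):
    # Belief revision under uncertainty: Bayes' rule against the sensor model.
    likelihood = [P_OBS_CORRECT if i == obs else (1 - P_OBS_CORRECT) / (N - 1)
                  for i in range(N)]
    post = [b * l for b, l in zip(belief, likelihood)]
    z = sum(post)
    return [p / z for p in post]

# Persistent world-state: the belief survives across time steps.
belief = [1.0 / N] * N
x = 2
for t in range(20):
    x = step(x)
    obs = observe(x)
    belief = update(predict(belief), obs)

print("MAP estimate:", belief.index(max(belief)), "true state:", x)
```

A stateless appearance-correlation model, by contrast, would map each observation directly to an answer with no `belief` variable to carry forward, no `predict` step to commit to, and nothing for a surprising observation to falsify, which is the structural difference the thesis turns on.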