Television as World Model Installation
What if we could study the world model that television installs in its viewers?
I grew up moving around a lot. The places where I grew up (France, Senegal, Italy and the US) are very different, and a few things become salient when you live in societies that differ that much. The first is that what people consider “natural”, “normal”, “obvious” and “common sense” in one society is anything but in another. What feels like common sense in Paris can feel odd in Dakar and bizarre in New York. Another thing that becomes obvious is how different people’s media diets are, especially the television they watch. The shows people watch, the news they consume, the ads they are subject to and the stories they tell themselves about how the world works all vary enormously from one place to another. I’ve always been fascinated by these differences and think about them a lot.
Recently, I’ve been thinking about this through the lens of machine learning (ML) and, specifically, the development of world models. This has led me to an experimental idea I would love to see pursued. Unfortunately, I don’t have the data or the methodological expertise to do it myself. At the very least, it is interesting to think about, so I’m writing it up here in the hope that others find it interesting as well.
World Models
A world model is a machine learning model that maintains an internal representation of its environment, which it can use to predict what happens next, to plan actions and to simulate scenarios it hasn’t directly observed. The idea has a long history in cognitive science and ML, but recent work has made it concrete in a new way. OpenAI’s Sora, for example, is a video generation model that learns enough physics and spatial reasoning from video to generate plausible continuations of scenes it has never seen. Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) takes a different approach and learns abstract representations that predict future states without generating pixels. The details differ but the core idea is the same: train a model on enough observations of the world that it builds an internal model of how things work.
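To make that core idea concrete, here is a minimal sketch of a latent next-state predictor in PyTorch. Everything in it is an illustrative stand-in, not Sora’s or JEPA’s actual design; JEPA, for instance, uses a separate target encoder with a stop-gradient rather than the crude detach below.

```python
import torch
import torch.nn as nn

# Toy latent world model: encode frames, then predict the next latent state.
# All sizes and modules are illustrative stand-ins.
class TinyWorldModel(nn.Module):
    def __init__(self, frame_dim=1024, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.predictor = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, frames):                 # frames: (batch, time, frame_dim)
        z = self.encoder(frames)               # latents: (batch, time, latent_dim)
        z_pred, _ = self.predictor(z[:, :-1])  # predict z[t+1] from z[:t+1]
        return z_pred, z[:, 1:]

model = TinyWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(8, 16, 1024)              # stand-in for a batch of video
z_pred, z_next = model(frames)
loss = nn.functional.mse_loss(z_pred, z_next.detach())  # predict future latents
opt.zero_grad(); loss.backward(); opt.step()
```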
The basic observation that underlies this post is that television has been doing something very similar to humans for decades.
Cultivation Theory
In the 1960s and 70s, the communication researcher George Gerbner developed what he called cultivation theory. The original conceptual framework appeared in his 1969 paper “Toward ‘Cultural Indicators’: The Analysis of Mass Mediated Public Message Systems”, but the most cited is the 1976 paper co-authored with Larry Gross, “Living with Television: The Violence Profile”. I highly recommend reading it: it is well written, easy to understand and very interesting.
The main idea, which is “obvious” now but was not in the 1960s, was that heavy television viewers build their model of reality from what they see on TV. And because television systematically distorts the world (e.g., overrepresents violence, skews demographics, simplifies causality), viewers who consume a lot of it end up with a distorted view of how the world works.
Gerbner documented several effects, including:
mean world syndrome: heavy TV viewers significantly overestimate the prevalence of violence and crime because television dramatically overrepresents both.
mainstreaming: people from very different backgrounds who watch a lot of TV tend to converge on a shared worldview that looks like the world according to television.
resonance: when someone’s real-life experience happens to align with what TV depicts, the TV depiction gets reinforced and amplified.
One way to think about this is as television “installing” a world model in people that assigns high probability to certain events, like violent crime, dramatic confrontations and rapid resolutions, and low probability to others, like mundane daily life, slow institutional processes and ambiguous outcomes. People then use this distorted model to interpret their real-life observations, which, in turn, leads to systematically incorrect beliefs about the world.
The consequences of these distortions are not just theoretical. One of the most well-documented cases is the systematic association between Black people and criminality in American media. Growing up in the US in the 1990s, I watched this happen in real time. The evening news, shows like Cops, and crime dramas relentlessly depicted Black people as perpetrators, far out of proportion to actual crime statistics. Study after study confirmed that Black people were dramatically overrepresented as criminals on television and underrepresented as victims, as professionals, and as ordinary people living ordinary lives. Dixon and Linz (2000) found that Black Americans were depicted as perpetrators 37% of the time on local TV news while constituting only 21% of those actually arrested. Entman (1992) found that Black suspects were shown in more physically threatening and dehumanizing ways than comparable White suspects.
The effect on viewers was exactly what cultivation theory predicts. Gilliam and Iyengar (2000) showed that when White viewers saw digitally edited news broadcasts with a Black suspect rather than a White one, they expressed significantly greater support for punitive criminal justice policies. Dixon (2008) found that even after controlling for neighborhood diversity and local crime rates, time spent watching local TV news predicted stereotyping of Black people as criminals. Eberhardt et al. (2004) demonstrated that this association operates at the level of visual cognition: participants primed with Black faces identified images of weapons faster, and those primed with crime-related objects detected Black faces faster. So television didn’t just distort the frequency of crime; it distorted the causal structure around it. It installed a model in which being Black was a predictor of threat. This is a distorted world model with real consequences in policing, hiring and housing, but also in how people cross the street, walk around at night or get in an elevator. The world model that American television installed in its viewers for decades was, among other things, a racist one, and it was effective precisely because viewers experienced it not as bias but simply as “the way things are”.
A major problem with this idea, of course, is that you can’t easily detect the distortion from the inside. If your model of how likely something is comes from television, that model will just feel to you like common sense. An American viewer doesn’t experience their worldview as “the American television worldview” and a Senegalese viewer doesn’t experience their worldview as “the Senegalese worldview”; they just feel like reality. This is exactly why cross-cultural comparison can be revealing. When you’ve lived in places with very different media environments, you can feel the differences intuitively. You notice that what counts as “obvious” in one place is highly local and not universal at all.
The Experiment
Here is where the AI/ML connection comes in. We now have ML architectures that learn world models from video, which means we have the technical machinery to train the same model on different datasets and compare what it learns. What if we did exactly that but with national media diets as the training data?
The core experiment would work like this. Take a single world model architecture and train separate instances on media from different countries, e.g., Indian, Senegalese, French, Chinese and US television. In addition, train a control instance on raw, unscripted, unedited and boring real-world footage, e.g., dashcam, security camera and documentary footage.
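As a sketch, the protocol might look like the following, where `train_world_model` and the corpus paths are hypothetical placeholders for whatever architecture and data one settles on:

```python
# Protocol sketch: one architecture, one trained instance per media diet,
# plus a control trained on unscripted real-world footage.
# `train_world_model` and all paths are hypothetical placeholders.
CORPORA = {
    "us_tv":      "data/us_television/",
    "french_tv":  "data/french_television/",
    "senegal_tv": "data/senegalese_television/",
    "indian_tv":  "data/indian_television/",
    "chinese_tv": "data/chinese_television/",
    "real_world": "data/dashcam_cctv_documentary/",  # the control
}

models = {
    name: train_world_model(arch="shared", corpus=path, seed=0)
    for name, path in CORPORA.items()
}
```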
Then compare the resulting world models along several dimensions.
Surface Level Analysis
The most straightforward analysis would replicate Gerbner’s work but at a scale he could never achieve. Some interesting questions would be:
what do the models predict about the frequency of events?
how often does violence occur?
what’s the demographic distribution of people in different roles?
how much wealth is on display?
By comparing each model’s predicted distributions against real-world base rates, you would get a cultivation differential for each media diet, i.e., a quantitative measure of exactly how much each country’s television distorts each aspect of reality.
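As a rough sketch of what that computation could look like: estimate how often each model’s sampled continuations contain each event category, then measure the divergence from real-world base rates. The event list, the base rates, the `probe_clips` set and the `sample_continuation`/`classify_events` helpers below are all hypothetical placeholders.

```python
import numpy as np
from scipy.stats import entropy  # computes KL divergence given two distributions

EVENTS = ["violent_crime", "mundane_routine", "courtroom_drama", "slow_bureaucracy"]

def event_frequencies(model, probe_clips):
    """Estimate how often each event appears in a model's sampled rollouts."""
    counts = np.zeros(len(EVENTS))
    for clip in probe_clips:
        rollout = model.sample_continuation(clip)   # hypothetical API
        counts += classify_events(rollout, EVENTS)  # hypothetical classifier
    return counts / counts.sum()

real_world_rates = np.array([0.01, 0.90, 0.01, 0.08])  # placeholder base rates

for name, model in models.items():
    p = event_frequencies(model, probe_clips)
    print(f"{name}: cultivation differential = {entropy(p, real_world_rates):.3f}")
```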
This would already be interesting, but it’s essentially doing what Gerbner and his collaborators did with better tools.
Structural Level Analysis
Another interesting analysis would ask: what causal structure do the models learn?
You could present each model with the same ambiguous social situation (e.g., two people in a tense conversation) and ask: what happens next? A model trained on one media diet might predict escalation to confrontation, and a model trained on another might predict a different arc entirely. A model trained on raw dashcam footage might predict that nothing happens, that people just walk away.
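A sketch of this probe, reusing the `models` dictionary from above; `load_clip`, `sample_continuation` and `classify_outcome` are hypothetical helpers:

```python
OUTCOMES = ["escalation", "de_escalation", "nothing_happens"]

def outcome_distribution(model, scene, n_samples=100):
    """Sample many continuations of the same scene and tally the outcomes."""
    counts = {o: 0 for o in OUTCOMES}
    for _ in range(n_samples):
        rollout = model.sample_continuation(scene)        # hypothetical API
        counts[classify_outcome(rollout, OUTCOMES)] += 1  # hypothetical classifier
    return {o: c / n_samples for o, c in counts.items()}

scene = load_clip("probes/tense_conversation.mp4")  # same probe for every model
for name, model in models.items():
    print(name, outcome_distribution(model, scene))
```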
This is where one gets at something that Gerbner couldn’t access: the implicit model that each media environment “installs” in its viewers. You can go beyond the result that “TV viewers think there is more crime” to analyzing a full causal model: what follows what, what causes what, what’s possible and what’s unlikely.
You could present each model with a scene involving a Black male in an “ambiguous” context[1] (walking through a neighborhood, reaching into a pocket, standing near a store at night) and compare predictions. A model trained on 1990s American television, saturated with Cops and local news crime coverage, might predict criminality or confrontation. A model trained on Senegalese television, where being Black carries no particular narrative associations, would likely predict something different, if anything at all. Note that the suggestion here is to test the American- and Senegalese-trained models on the same scene. The conjecture is that the Senegalese-trained model will have learned to recognize escalation or threat based on different features than the American-trained model.
A model trained on real-world footage might predict the most common real-world outcome, which is that nothing happens. The gap between these predictions would be a measurable proxy for the racist causal structure that media installed in its viewers, and it is made legible precisely because you can compare it against models trained on other media environments where the same visual inputs don’t carry the same learned associations.
More broadly, if the models trained on different media diets disagree on a given set of scenarios, one has found places where each country’s media is installing different causal structure in its viewers. If, instead, all the TV-trained models agree on a set of scenarios but diverge from the real-world-footage model, one has found distortions inherent to television as a medium: a sort of “fiction gap”. And where the real-world-trained model also gets things wrong, maybe one has learned something about the limits of learning from passive observation.
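Assuming the per-model outcome distributions from the sketch above, this decomposition could be measured as two averages, with Jensen-Shannon distance as one arbitrary choice of metric:

```python
from itertools import combinations
import numpy as np
from scipy.spatial.distance import jensenshannon

def as_vector(dist):  # fixed outcome order -> numpy vector
    return np.array([dist[o] for o in OUTCOMES])

tv_names = [n for n in models if n != "real_world"]
vecs = {n: as_vector(outcome_distribution(models[n], scene)) for n in models}

# Cross-country divergence: different diets installing different causal structure.
cross_country = np.mean([jensenshannon(vecs[a], vecs[b])
                         for a, b in combinations(tv_names, 2)])

# "Fiction gap": average distance of TV-trained models from the real-world control.
fiction_gap = np.mean([jensenshannon(vecs[n], vecs["real_world"])
                       for n in tv_names])
```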
Temporal dimension. You could also train on television, movies and news from different eras (the 1970s, the 1990s, the 2020s) and compare the resulting world models. How has the installed model changed over time? This would give you a way to track cultural shift.
Cross-media comparison. Finally, you could extend beyond television. Train on social media feeds, YouTube recommendations, cable news, and long-form streaming separately. Does the medium itself---not just its content---introduce systematic distortions? Is the world model installed by TikTok structurally different from the one installed by CNN even when the subject matter overlaps? Binge-watching and algorithmic feeds might function as more intense cultivation mechanisms than scheduled broadcasting.
Important Limitations
I should note that cultivation theory has faced methodological criticism. A meta-analysis by Morgan and Shanahan (1997) found that while cultivation effects are statistically significant, they are small.
With that caveat about the underlying theory, I also want to stress the limits of what an experiment like this could tell us. Machine learning models are not human brains. They don’t have bodies, social contexts, prior beliefs, or the kind of lived experience that shapes how people actually process media. A world model trained on television learns statistical regularities in pixels; a person watching television is doing something far more complex, filtering what they see through identity, memory, conversations and a hundred other things.
So the results of these experiments would not tell us what model television installs in humans. They would tell us what model a particular architecture extracts from a particular corpus, which is a different and much more limited thing. The value is not in the literal correspondence but in the comparison. If the same architecture trained on different media diets learns systematically different causal structures, then that might tell us something real about what is in those media diets, regardless of how closely the model’s learning process resembles human cognition. In some sense, the machine learning model is a measuring instrument, not a model of the viewer.
Practical Considerations. Of course, this would be non-trivial. You would need curated, representative media data for each country, and it’s not clear how you would decide what is “representative”. You would need to choose a world model architecture and would need the computing power to train multiple instances. You would need an evaluation methodology for comparing model “beliefs”, possibly drawing on techniques from mechanistic interpretability.
As I mentioned, I don’t have access to the data and models to do this myself, but I think it’s the kind of experiment that could interest researchers from both the AI and communication studies sides.
Conclusion
There are several reasons I think this might be interesting.
First, it makes the implicit explicit. We’ve known since Gerbner’s work that television shapes perception but we’ve never had tools to precisely characterize what model television installs. World model training makes the implicit generative structure extractable and comparable. You can look at the model’s weights, probe its representations, and ask it to generate predictions; all things you can’t do with a human viewer’s intuitions.
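As one hedged illustration of what probing the representations could mean in practice: fit a simple linear probe on a trained model’s latent states to test whether some human-annotated attribute of a scene (say, “reads as threatening”) is linearly decodable. The `encode` method, `probe_clips` and `probe_labels` below are hypothetical, reusing the `models` dictionary from the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical: latent states for labeled probe clips from one trained model.
Z = np.stack([models["us_tv"].encode(clip) for clip in probe_clips])
y = np.array(probe_labels)  # human annotations, e.g., 1 = "reads as threatening"

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("held-out probe accuracy:", probe.score(Z_te, y_te))
# High held-out accuracy suggests the attribute is linearly encoded in the latents.
```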
You might even be able to sample from its television-installed world model so you can interactively see or “visit” its internal world.
Second, there’s an AI safety angle that I think is underappreciated. If we’re worried about what world models AI systems learn from internet video (and we should be), then we should be at least as worried about what world models humans have been learning from television and media for decades.
The AI version of the problem actually makes the human version a bit more legible (with all the caveats mentioned above). We can study the distortions in the machine model precisely because it is a machine, and then ask: are these the same distortions we’ve been installing in people?
Third, this could be a powerful tool for media literacy. Cross-cultural comparison could reveal that “common sense” is a locally installed model. Showing someone that a model trained on their country’s television makes systematically different predictions from one trained on another country’s television is a concrete, visceral way to demonstrate that their sense of “how the world works” is, in part, an artifact of their media diet.
Finally, this connects 60-year-old communication theory research to frontier AI research in a way that I think is interesting. In some sense, Gerbner was studying world model installation before the term existed and the AI community is building tools that can make his theory precise. Bridging these feels natural to me.
Epilogue
This discussion about television reminded me of a song by Aesop Rock called Basic Cable. For context, Aesop Rock is a rapper from Long Island who came up in the late 90s. He put out some great albums, my favorite of which is Labor Days, but I should warn that he is an acquired taste. His lyrics are extremely dense and his flow is really unorthodox (and this was long before Kendrick made unorthodox flows mainstream). But if you give it enough time, it eventually clicks and starts to make sense. His second full-length album, Float, had a song called Basic Cable about his (and all of our) addiction to television. The song starts with
Television, all hail grand pixelated god of
Fantasy, murder scape bent perspective
and ends with
My everyday is sitcom, soaps, news, bad dramatization
Come along with me, my friend for the most glorious sensation
[1] I put ambiguous in quotes because walking is not normally considered ambiguous, but walking while Black is.

