Tech vs arts/humanities
I was struck by the comment from Trevor Paglen in the Winter 2024 issue of Aperture that
The theory of perception underlying so much of AI and computer vision is shockingly bad from a humanities perspective.
There was a time in the “recent past” in which there was non-trivial overlap between the two. I remember Marvin Minsky giving a talk at the Massachusetts College of Art in the early 1980s in which he said that programming a computer to be expert at chess was going to be a lot easier than giving it the common sense reasoning of a toddler, a prediction which has proven spot on. Even now, in early 2025, I don’t know of any system that can match the common sense reasoning of a toddler, but for decades we’ve had many that are expert at chess.
There’s also the 1987 example of
Philip E. Agre and David Chapman’s Pengi: An Implementation of a Theory of Activity
Investigation of the dynamics of everyday routine activity reveals important regularities in the interaction of very simple machinery with its environment.
Agre & Chapman once discussed their work in an informal talk; their whole framing of the problem was based on ideas initially developed in the humanities, and they mentioned Heidegger’s Being and Time and its explication of deictic references and being-in-the-world as providing a key motivation for their approach.1 Oddly, they also characterized it as a book describing in detail how to make breakfast, which I considered an overstatement at the time: I found it impossible to imagine how an 800-page book could possibly do that. A 3,000-page book might make a start, but not a whole breakfast. Being and Time did have a good description of what it was like to use a hammer and to pick up a coffee cup, which were the inspiration for the photos at the head of this blog.
Sadly, neither of them seems particularly involved in the field any longer. They appear to have become disillusioned with it, and Agre in particular has become what can only be called a recluse.
Making AI Philosophical Again: On Philip E. Agre’s Legacy is a very good overview of Agre’s approach.
As Paglen mentions, it does seem that the most publicized current AI/vision efforts aren’t trying to model, or even provide insight into, how people actually think about things. They’ve focused their efforts on solving problems. Which is fine, at some level, as long as you know what the problem is that’s being solved.
However, solving the problem isn’t understanding the problem. And this divergence can be fundamental.
That said, I’m not sure what problems the current crop of AI models are trying to solve, probably because the term AI is used across so many fields in so many ways that it is unfair to paint them all with a single broad brush. If you’re just trying to do protein design or protein structure prediction (not to trivialize: these problems are very hard, and the results that have been published are deeply impressive), you may not need much understanding of the fundamental physics involved, since enough can be gleaned from the regularities exhibited in the extant data. In any case you’re not going to be referring to Being and Time, nor any of the concepts covered therein.
On the other hand, if you’re trying to develop AGI (Artificial General Intelligence), Being and Time could be handy.
After all, people bring a manifold of experiences to their understanding, and these experiences form the basis of the metaphors that they use to understand the world. This brings me back to Lakoff’s Women, Fire and Dangerous Things, which talks in terms of prototypes and resemblance to prototypes as the prime basis for categorization and general thinking. Note: categories may consist of more than one prototype, and each prototype could conceptually fit more than one category if decorated with suitable attributes.
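Just to make that note concrete, here is a toy sketch; the names and attributes are mine, purely illustrative and not Lakoff’s. It shows a category holding several prototypes, the same prototype anchoring more than one category, and membership judged by graded resemblance rather than by necessary and sufficient conditions.

```python
# Toy illustration of prototype-based categorization; all names/attributes
# here are hypothetical and chosen only to show the many-to-many shape.
from dataclasses import dataclass


@dataclass(frozen=True)
class Prototype:
    name: str
    attributes: frozenset


robin = Prototype("robin", frozenset({"flies", "sings", "appears in March"}))
penguin = Prototype("penguin", frozenset({"swims", "lives on ice"}))

# A category can hold more than one prototype, and the same prototype
# can anchor more than one category.
categories = {
    "bird": {robin, penguin},
    "sign of spring": {robin},
}


def resemblance(observed: frozenset, proto: Prototype) -> float:
    """Crude graded resemblance: share of the prototype's attributes observed."""
    return len(observed & proto.attributes) / max(len(proto.attributes), 1)


def rank_categories(observed: frozenset) -> list[tuple[str, float]]:
    """Rank categories by their best-matching prototype, not by strict definitions."""
    scores = {
        cat: max(resemblance(observed, p) for p in protos)
        for cat, protos in categories.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])


# The same sparse observation resembles prototypes in both categories.
print(rank_categories(frozenset({"flies", "sings"})))
```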
My prior is that there are a couple of axes here that appear problematic for an LLM. The first consists of subtle differences in the usage of the same word, either by different cultures within a discipline or between geographic areas. It would seem that if you’re not providing the LLM with a representation that includes coding for these other axes, it would be very difficult for the system to tease these differences out of a text-only dataset. This would especially be true if the words are tokenized in a way that assumes a particular type of, and implicit distance between, word meanings, one that is insensitive to these differences.
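To see what I mean about tokenization, here is a minimal sketch assuming the tiktoken library and one of its standard BPE vocabularies. “Table a motion” means roughly opposite things in UK and US parliamentary usage, yet the tokenizer emits identical token ids for the phrase in either sentence; whatever distinction exists has to be recovered from surrounding context, because nothing in the representation of the word itself marks which speech community is talking.

```python
# Minimal sketch, assuming the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

uk = "The MP moved to table the motion so it could be debated today."
us = "The senator moved to table the motion so it would never be debated."

phrase = enc.encode(" table the motion")


def contains(haystack: list[int], needle: list[int]) -> bool:
    """True if the token-id sequence `needle` occurs inside `haystack`."""
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))


# Both sentences contain byte-for-byte identical ids for " table the motion",
# even though the intended sense of "table" is roughly opposite.
print(contains(enc.encode(uk), phrase))   # True
print(contains(enc.encode(us), phrase))   # True
```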
This was just my prior, so I did a moderately detailed search into how well LLMs perform at detecting the different flavors of each word when operating on a large portion of the Internet’s English corpus.
The empirical data confirms my fears.
Can Large Language Models Understand Context indicates that it currently remains a struggle:
Experimental results reveal that LLMs under in-context learning struggle with nuanced linguistic features within this challenging benchmark, exhibiting inconsistencies with other benchmarks that emphasize other aspects of language
As does this paper from NeurIPS 2024, CultureLLM: Incorporating Cultural Differences into Large Language Models
In this paper, we propose CultureLLM, a cost-effective solution to incorporate cultural differences into LLMs.
So, current LLMs aren’t going to do nuance, and they will likely blur the meanings of words across multiple subcategories.
This reminds me of Marc Andreessen’s talk of Marxism and Capitalism as if they were each a single thing. Thinking like that isn’t going to allow you (human or bot) to compete with the average human thinker (see the discussion of stability in In the Pipeline).
If you want to achieve something on the order of AGI, it’s going to be necessary to understand how people communicate their thoughts about the world, the various media that are used to communicate them, and the differences between two models/representations/communications of them that may appear very similar but are actually very different.
This affords the heuristic that developing the right features to identify the important objects in a domain is difficult, but critical.
This is exemplified by the supplementary materials for the paper Simulating 500 million years of evolution with a language model. The authors have a feature-rich architecture, ESM3, consisting of 7 distinct tracks, including some that are specialized to convey structure information. The LLM is NOT expected to derive these from the sequence data.
Such additional mission-critical ancillary information forms the foundation of their achievement. (I don’t intend to imply that other systems haven’t done this; AlphaFold 2 & 3 certainly do similar things, and I’d argue that the tokenizers used in LLMs also implicitly provide such information.)


[Figure: ESM3 architecture]
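For a concrete sense of what supplying those tracks explicitly looks like, here is a hypothetical sketch loosely in the spirit of ESM3’s multi-track input; the field names and the builder function are mine, not the paper’s or any real ESM3/AlphaFold API. The point is only that the structure-bearing channels arrive as inputs, produced by separate tooling or experimental data, rather than being something the model is asked to reconstruct from the amino-acid sequence alone.

```python
# Hypothetical multi-track protein input, loosely inspired by ESM3's design;
# the names are illustrative and do not match any real library's API.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProteinTracks:
    sequence: str                                         # amino-acid string, e.g. "MKTAYIAK..."
    structure_tokens: Optional[Sequence[int]] = None      # discretized 3D structure, supplied upstream
    secondary_structure: Optional[str] = None             # per-residue helix/sheet/coil labels
    solvent_accessibility: Optional[Sequence[float]] = None
    function_annotations: Optional[Sequence[str]] = None


def build_model_input(tracks: ProteinTracks) -> dict:
    """Assemble the per-track channels the model conditions on.

    Structure-bearing tracks are *inputs*: they come from separate tools or
    experiments, not from the language model re-deriving them from `sequence`.
    Missing tracks are passed as empty (i.e. masked), not silently imputed.
    """
    return {
        "sequence": tracks.sequence,
        "structure": list(tracks.structure_tokens or []),
        "secondary_structure": tracks.secondary_structure or "",
        "sasa": list(tracks.solvent_accessibility or []),
        "function": list(tracks.function_annotations or []),
    }


example = ProteinTracks(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(build_model_input(example))
```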
This identifies one of the core errors in expecting LLMs to get these nuances from an ingestion of text data: the subtleties are too sparse, and they get lost in, and absorbed into, the mass of information, which muddies them into the popular sentiment and eliminates the nuance. I’d argue that this is the core of what is missed with the deprecation of the humanities in big tech: the other tape channels aren’t there, and even if they were desired, the skills to identify and incorporate them are missing.
Obviously, this is even more so for such axes as color, taste, smell, sound, and proprioception (proprioception also feels like a bridge to the hard problem of consciousness, a bridge I’m going to refrain from crossing at the moment).