I built a system for Claude Code to browse 100 non-fiction books and find interesting connections between them.
I started out with a pipeline in stages, chaining together LLM calls to build up a context of the library. I was mainly getting back the insight that I was baking into the prompts, and the results weren't particularly surprising.
On a whim, I gave CC access to my debug CLI tools and found that it wiped the floor with that approach. It gave actually interesting results and required very little orchestration in comparison.
One of my favourite trail of excerpts goes from Jobs’ reality distortion field to Theranos’ fake demos, to Thiel on startup cults, to Hoffer on mass movement charlatans (https://trails.pieterma.es/trail/useful-lies/). A fun tendency is that Claude kept getting distracted by topics of secrecy, conspiracy, and hidden systems - as if the task itself summoned a Foucault’s Pendulum mindset.
Details:
* The books are picked from HN’s favourites (which I collected before: https://hnbooks.pieterma.es/).
* Chunks are indexed by topic using Gemini Flash Lite. The whole library cost about £10.
* Topics are organised into a tree structure using recursive Leiden partitioning and LLM labels. This gives a high-level sense of the themes.
* There are several ways to browse. The most useful are embedding similarity, topic tree siblings, and topics cooccurring within a chunk window.
* Everything is stored in SQLite and manipulated using a set of CLI tools.
I wrote more about the process here: https://pieterma.es/syntopic-reading-claude/
I’m curious if this way of reading resonates for anyone else - LLM-mediated or not.
Edit/update: if you are looking for the phantom thread between texts, believe me that an LLM cannot achieve it. I have interrogated the most advanced models for hours, and they cannot do the task to any sort of satisfactory end that a smoked-out half-asleep college freshman could. The models don't have sufficient capacity...yet.
… realize that it’s nonsense and the LLM is not smart enough to figure out much without a reranker and a ton of technology that tells it what to do with the data.
You can run any vector query against a rag and you are guaranteed a response. With chunks that are unrelated any way.
The one I found most connected that the LLm didn’t was a connection between Jobs and the The Elephant in the Brain
The Elephant in the Brain: The less we know of our own ugly motives, the easier it is to hide them from others. Self-deception is therefore strategic, a ploy our brains use to look good while behaving badly.
Jobs: “He can deceive himself,” said Bill Atkinson. “It allowed him to con people into believing his vision, because he has personally embraced and internalized it.”
I'm seeing "Thanos committing fraud" in a section about "useful lies". Given that the founder is currently in prison, it seems odd to consider the lie useful instead of harmful. It kinda seems like the AI found a bunch of loosely related things and mislabeled the group.
If you've read these books I'm not seeing what value this adds.
Theranos is the fraud mentioned in the piece.
https://habr.com/en/articles/456476/
https://android-review.googlesource.com/c/platform/system/bt...
This is the part that always stuck with me:
I have often felt that programming is an art form, whose real value can only be appreciated by another versed in the same arcane art; there are lovely gems and brilliant coups hidden from human view and admiration, sometimes forever, by the very nature of the process. You can learn a lot about an individual just by reading through his code, even in hexadecimal. Mel was, I think, an unsung genius.
In "Father wound" the words "abandoned at birth" are connected to "did not". Which makes it look like those visual connections are just a stylistic choice and don't carry any meaning at all.
I just spent time getting it all running on docker compose and moved my web ui from express js to flask. I want to get the code cleaned up and open source it at some point.
I do like the idea though — perhaps there is a way to refine the prompting to do a second pass or even multiple passes to iteratively extract themes before the linking step.
Anyway, it introduced me to the idea of using computational methods in the humanities, including literature. I found it really interesting at the time!
One of the the terms it introduced me to is "distant reading", whose name mirrors that of a technique you may have studied in your gen eds if you went to university ('close reading"). The idea is that rather than zooming in on some tiny piece of text to examine very subtle or nuanced meanings, you zoom out to hundreds or thousands of texts, using computers to search them for insights that only emerge from large bodies of work as wholes. The book argued that there are likely some questions that it is only feasible to ask this way.
An old friend of mine used techniques like this for dissertation in rhetoric, learning enough Python along the way to write the code needed for the analyses she wanted to do. I thought it was pretty cool!
I imagine LLMs are probably positioned now to push distant reading forward in an number of ways: enabling new techniques, allowing old techniques to be used without writing code, and helping novices get started with writing some code. (A lot of the maintainability issues that come with LLM code generation happily don't apply to research projects like this.)
Anyway, if you're interested in other computational techniques you can use to enrich this kind of reading, you might enjoy looking into "distant reading": https://en.wikipedia.org/wiki/Distant_reading
LLMs are great at finding media by vague descriptions. ;)
The book is almost certainly by *Franco Moretti*, who coined the term "distant reading." Given the timeframe ("maybe a decade ago") and the description, it's most likely one of these two:
1. *"Distant Reading"* (2013) — A collection of Moretti's essays that directly takes the concept as its title. This would fit well with "about a decade ago."
2. *"Graphs, Maps, Trees: Abstract Models for Literary History"* (2005) — His earlier and very influential work that laid out the quantitative, computational approach to literary analysis, even if it didn't use "distant reading" as prominently in the title.
Moretti, who founded the Stanford Literary Lab, was the major proponent of the idea that we should analyze literature not just through careful reading of individual canonical texts, but through large-scale computational analysis of hundreds or thousands of works—looking at trends in genre evolution, plot structures, title lengths, and other patterns that only emerge at scale.
Given that the commenter specifically remembers learning the term "distant reading" from the book, my best guess is *"Distant Reading" (2013)*, though "Graphs, Maps, Trees" is also a strong possibility if their memory of "a decade" is approximate.
I think that this sucks the discreet joy out of reading and learning. Having the ways that the topics within a certain book can cross over in lead into another book of a different topic externalized is hollowing and I don’t find it useful.
On the other hand I feel like seeing this process externalized gives us a glimpse at how “the algorithms” (read: recommender systems) suggest seemingly disjunctive content to users. So as a technical achievement I can’t knock what you’ve done and I’m satisfied to see that you’re the guy behind the HN Book map that I thought was nice too.
At its core this looks like a representation of the advantages that LLMs can afford to the humanities. Most of us know how Rob Pike feels about them. I wonder if his senior former colleague feels the same: https://www.cs.princeton.edu/~bwk/hum307/index.html. That’s a digression, but I’d like to see some people think in public about how to reasonably use these tools in that domain.
Intuitively, I agree. This feels like the different between being a creator (of your own thoughts as inspired by another person's) and a consumer (although in a somewhat educational sense). There would need to be a big advantage to being taught those initial thoughts, analogous to why we teach folks algebra/calculus via formulas rather than having every student figure out proofs for themselves.
I was recently trying to remember a portal fantasy I read as a kid. Goodreads has some impressive lists, not just "Portal Fantasies"[0], but "Portal Fantasies where the portal is on water[1], and a seven more "where/what's the portal" categories like that.
But the portal fantasy I was seeking is on the water and not on the list.
LLMs have failed me so far, as has browsing the larger portal fantasy list. So, I thought, what if I had an LLM look through a list of kids books published in the 1990s and categorize "is this a portal fantasy?" and "which category is the portal?"
I would 1. possibly find my book and 2. possibly find dozens of books I could add to the lists. (And potentially help augment other Goodread-like sites.)
Haven't done it, but I still might.
Anyway, thanks for making this. It's a really cool project!
[0] https://www.goodreads.com/list/show/103552.Portal_Fantasy_Bo...
[1] https://www.goodreads.com/list/show/172393.Fiction_Portal_is...
You have an interesting idea here, but looking over the LLM output, it's not clear what these "connections" actually mean, or if they mean anything at all.
Feeding a dataset into an LLM and getting it to output something is rather trivial. How is this particular output insightful or helpful? What specific connections gave you, the author, new insight into these works?
You correctly, and importantly point out that "LLMs are overused to summarise and underused to help us read deeper", but you published the LLM summary without explaining how the LLM helped you read deeper.
A trail that hits that balance well IMO is https://trails.pieterma.es/trail/pacemaker-principle/. I find the system theory topics the most interesting. In this one, I like how it pulled in a section from Kitchen Confidential in between oil trade bottlenecks and software team constraints to illustrate the general principle.
I'm not familiar with he term "Pacemaker Principle" and Google search was unhelpful. What does it mean in this context? What else does this general principle apply to?
I'm perfectly willing to believe that I am missing something here. But reading thought many of the supportive comments, it seems more likely that this is an LLM Rorschach test where we are given random connections and asked to do the mental work of inventing meaning in them.
I love reading. These are great books. I would be excited if this tool actually helps point out connections that have been overlooked. However, it does not seem to do so.
This made me realize that so many influential figures have either absent fathers, or fathers that berated them or didn't give them their full trust/love. I think there's something to the idea that this commonality is more than coincidence. (that's the only topic of the site I've read through yet, and I ignored the highlighted word connections)
How is that different from having an insight yourself and later doing the work to see if it holds on closer inspection?
But so many of the links just don't make sense, as several comments have pointed out. Are these actually supposed to represent connections between books, or is it just a random visual effect that's suppose to imply they're connected?
I clicked on one category and it has "Us/Them" linked to "fictions" in the next summary. I get that it's supposed to imply some relationship but I can't parse the relationships
this to me sounds off. I read the same 8, to 10 books over and over and with every read discover new things. the idea of more books being more useful stands against the same books on repeat. and while I'm not religious, how about dudes only reading 1 book (the Bible, or Koran), and claiming that they're getting all their wisdom from these for a 1000 years?
If I have a library of 100+ books and they are not enough then the quality of these books are the problem and not the number of books in the library?
I won't pile on to what everyone else has said about the book connections / AI part of this (though I agree that part is not the really interesting or useful thing about your project) but I think a walk-through of how you approach UI design would be very interesting!
This is the best way to re-enforce a copilot because models are pretty smart most of the time and I can correct the cases where it stumbles with minimal cognitive effort. Over time I find more and more tasks are solved by agent intelligence or these happy path hints. As primitive as it is, CLAUDE.md is the best we have for long-term adaptive memory.
The visual style of linking phrases from one section to the next looks neat, but the connections don’t seem correct. There’s a link from “fictions” to “internal motives” near the top of the first link and several other links are not really obviously correct.
There's two stages to the linking: first juxtaposing the excerpts, then finding and linking key phrases within them. I find the excerpts themselves often have interesting connections between them, but the key phrases can be a bit out there. The "fictions" to "internal motives" one does gel for me, given the theme of deceiving ourselves about our own motivations.
#1: would a larger dataset increase the depth and breadth of insight ( go to #2) #2: with the initial top 100, are there key ‘super node’ books that stand out as ones to read due the breadth they offer. Would a larger dataset identify further ‘super node’ books.
Wouldn't it be good if recursive Leiden and cypher was built into an embedded DB?
That's what I'm looking into with mcp-server-ladybug.
https://en.wikipedia.org/wiki/Netflix_Prize
(Are people still trying to improve upon the original winning solution?)
Solid technical execution too. Well done!
I really appreciate you mentioning this. I think this is the nature of LLMs in general. Any symbol it processes can affect its reasoning capabilities.
Interesting... seems like it wants the keys on your system! ;)
It's all fun and game 'till someone loses an eye/mind/even-tenuous-connection-to-reality.
Edit: I'd mention that the themes Claude finds qualify as important stuff imo. But they're all pretty grim and it's a bit problematic focusing on them for a long period. Also, they are often the grimmest spin things that are well known.
A LLM is a transformer. It transforms a prompt into a result.
Or a human idea into a concrete java implementation.
Currently I'm exploring what unexpected or curious transformations LLMs are capable of but haven't found much yet.
At least I myself was surprised that an LLM can transform a description of something into an IMG by transforming it into a SVG.
:DWhat makes exploration valuable is the cycle: act, observe, recognize whether you're closer to what you wanted, then refine. Without that recognition ("closer" or "drifting"), you're exploring blind.
Context is what lets the loop close. You need enough of it to judge the outcome. I think that real shift isn't generators → agents. It's one-shot output → iterative refinement with judgment in the loop.