It uses Snowflake’s Arctic model for embeddings and HNSW for fast similarity search. Each “story cluster” shows who published first, how fast it propagated, and how the narrative evolved as more outlets picked it up.
Would love feedback on the architecture, scaling approach, and any ways to make the clusters more accurate or useful.
Live demo: https://yandori.io/news-flow/
I have long thought that search engines, news aggregators and social media companies have a journalistic responsibility to favor the original/primary source of every story, but things have not worked out that way. If you can manage to truly develop something like this it would be a valuable tool for rewarding the work of reporting over SEO.
Anyway, please consider that headlines and time stamps do not tell the entire story when it comes to sourcing.
For example: Your website offers this story (https://hotspotatl.com/6587626/dr-jackie-married-to-medicine...) as first to publish. But right in the text it cites another website BOSSIP as the source of the interview.
Also: there doesn't appear to be a way to link results from your website.
Yea, I need to do some work on improving first to publish... currently I'm relying pretty heavily on the published date provided in the story itself, but sometimes that is wrong and makes it look like a later publisher was first to publish.
This is complicated somewhat by the few that take an already-circulating story and then add their own actual research rather than just rewording and opining.
E.g. the recent Mark Kelly story: I went through many articles trying to find a link to the actual video of what he said, and couldn't find it.
headlines with “[person said X]” tend to be bullshit
It’s all circular.
I don’t know how one is supposed to trust any of the media at this point. Especially “reputable” ones that are just as guilty of circular nonsense as anything else.
If you don’t follow the media, you are uninformed. If you follow it, you are misinformed.
No translation yet.
I think the biggest problem is that I'm relying too much on the published date from the news source itself, and it's wrong sometimes... not super often, but if 1 out of 100 sources gets it wrong, it can steal credit for being the source article when it's not.
[1] https://wnyt.com/ap-top-news/rubio-says-us-ukraine-talks-on-...
I think checking source in story is next step...
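A rough sketch of what that check could look like, assuming each article in a cluster carries its domain, its self-reported publish time, and its raw HTML (the dict keys and the `likely_origin` name are hypothetical, not the site's actual code). It starts from the earliest timestamp, then follows any link to another outlet in the same cluster, on the theory that a story citing another outlet is probably not the origin:

```python
import re
from urllib.parse import urlparse

# Naive link extraction; a real pipeline would use an HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')

def cited_domains(html):
    """Return the set of domains an article links to (relative links are skipped)."""
    domains = set()
    for url in LINK_RE.findall(html):
        netloc = urlparse(url).netloc.lower()
        if netloc:
            domains.add(netloc.removeprefix("www."))
    return domains

def likely_origin(cluster):
    """cluster: list of dicts with 'domain', 'published' (datetime), 'html'.

    Assumes one article per domain in the cluster.
    """
    by_domain = {a["domain"]: a for a in cluster}
    current = min(cluster, key=lambda a: a["published"])
    seen = set()
    while current["domain"] not in seen:
        seen.add(current["domain"])
        # Other cluster members that the current candidate cites.
        cited = [by_domain[d] for d in cited_domains(current["html"])
                 if d in by_domain and d not in seen]
        if not cited:
            break
        current = min(cited, key=lambda a: a["published"])
    return current
```

This only catches citations made with a link; a plain-text mention of another outlet would need name matching on top.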
This surprises me. The system is based on embeddings. AFAIK embeddings cluster the same concept in different languages in roughly the same place? Maybe it depends on the model (or maybe it's not exact and the clustering cutoff loses it).
The embeddings themselves will probably cluster OK in different languages (but I have not tested this yet).
I’m not aware of any that don’t. RSS is alive and well.
Cool website. As others note if this could tie in deep sources like FB, X, Reddit, etc...it would be almost "chain of evidence" canonical.
A view where websites/sources were associated with geo data (possibly involving a globe or map) would be very cool, too.
I’ve been curious how much news starts from social media. So many news stories today are “someone said x on twitter”.
I'm not pulling from social media yet.
I'll dump a few thoughts as they come for the creators, feel free to riff with me on the thread if that'll be of value.
My perspective, as a User, is I'm interested in rooting out bias and where it's coming from. Moreover, the influence networks are fascinating as well.
I think, for example, understanding which publications "picked up" a story vs. didn't is a very viral use case, as you could imagine people using you as a backdrop to a social post about editorial bias. That said, I think you need to pick who you serve, because the folks who will be interested in this aren't average people, who are not super news focused.
One way to learn may be looking at the types of meta-stories posted about the analysis on media and seeing how you could support those types of ongoing analysis. Scoring, honestly, is another really interesting idea. What are publications "for" or "against" based on how they do editorial, and how they bias their headlines and ledes?
We have been (low-key) working on something similar (more from an academic point of view) for the past few years:
This is the introductory article (open access): "Comparison of news commonality and churn in international news outlets with TARO" https://dl.acm.org/doi/abs/10.1145/3603163.3609062
(Allow me a moment of pride for the student leading this project: the paper won the Ted Nelson Award at ACM Hypertext 2023.)
The view showing the flow with a play animation was a nice concept, but I couldn't see much value in it. I'm wondering if you could get more aggregate stats that show a connection between these different flows; maybe they follow a pattern like ad-based campaigns, or share publishers who own these domains, which would explain things. Expanding on this idea, you could even try to set up different scores and metrics based on major groups and on sponsored content versus organic spread.
Where'd you find all those RSS feeds? Have you done anything else with RSS feeds? :)
Also agree with the others this definitely needs interactive graphs!
Curious how you sourced the feeds? It seems to have a bias towards India/Sri Lanka/Iran/Indonesia/Turkey etc., i.e. not the traditional Western-centric reporting. I'm always interested in trying to get a more balanced news diet, so anything you could share around that would be interesting. Most out-of-the-box news tools seem to automatically lean West.
FYI layout sometimes breaks like so:
Thanks for that bug feedback - I'll get it fixed.
Front-end downstream of clicking on a card doesn't seem to work correctly on every reload... but it works sometimes.
Cool concept though - the source count and "+N" spread metrics give a quick sense of which stories have legs.
Afaict, it is the usual topic trending over time, or maybe it is showing direct syndication?
Computing actual derivation flow would be neato, esp precisely at scale vs just the usual embeddings
For any given clip, short or excerpt, find the most complete, unedited version that it was taken from.
Some stories are very clearly manufactured
Ubuntu 24.04, Firefox 145.0.1 (64-bit)
> Opinion: Operation Holiday serves a critical need in our communities
> Dhru Fusion WooCommerce Integration Plugin
> Powering the Future of Wellness Through Premium Food Supplement Ingredients
That isn't even remotely important at all so really unreliable.
I get most of it, but I think especially around the holiday some stuff is getting through... Some black friday deals were actually hitting like news does...
The following headlines look more like spam rather than factual breaking news.
Thanks for sharing some details. It's cool that HNSW is useful for near-realtime usage. For some reason I had categorized it in my head as having very, very high insertion cost, needing to rebuild worlds to work, but that's not a well-founded belief; very cool that it's usable here.
I really hope we see some open source work of this variety. Trying to understand news or even social media is something the world seems so unprepared for. Different subject, sort of, but watching the Internet Observatory be dismantled by the current political administration, by disinformation grifters, was a woeful loss of one of the few mirrors that humanity had to understand itself with, to see how we networked.
A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
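Roughly, the check looks like the sketch below. The `Article` shape, the 0.85 similarity cutoff, and the 48-hour window are illustrative assumptions, not tuned values or the demo's real schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from math import sqrt

@dataclass
class Article:
    embedding: list[float]
    published: datetime
    entities: set[str] = field(default_factory=set)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def should_merge(a, b, min_sim=0.85, max_gap=timedelta(hours=48)):
    """Second-pass check: merge only if similar AND temporally/entity coherent."""
    if cosine(a.embedding, b.embedding) < min_sim:
        return False
    too_far_apart = abs(a.published - b.published) > max_gap
    no_shared_entities = not (a.entities & b.entities)
    # Close in embedding space but incoherent: leave in adjacent clusters.
    return not (too_far_apart or no_shared_entities)
```

Entity extraction is assumed to happen upstream (any NER pass would do); the check itself is just a cheap veto on top of the embedding match.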
Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
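One simple way to do this, as a sketch only: collapse wire copies into a single group before clustering, keyed by canonical URL when the page declares one and otherwise by a hash of the normalized body. The dict keys and function names here are hypothetical:

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the body with case, punctuation, and whitespace normalized away."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def collapse_syndicated(articles):
    """articles: list of dicts with 'url', 'body', and optionally 'canonical_url'.

    Returns a list of groups; each group is one underlying story copy.
    """
    groups = {}
    for art in articles:
        key = art.get("canonical_url") or fingerprint(art["body"])
        groups.setdefault(key, []).append(art)
    return list(groups.values())
```

An exact hash only catches verbatim reprints. Lightly edited wire copy would need a near-duplicate method such as shingling or MinHash, and a copy with a canonical URL will not be grouped with an identical copy that lacks one.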
Overall, really nice work. The propagation timeline is especially useful.
maybe the author uses LLMs in some comments and not others. that is, it's not a bot, just someone manually using LLM tools sometimes