I'm just a bit skeptical about this quote:
> Harper takes advantage of decades of natural language research to analyze exactly how your words come together.
But it's just a rather small collection of hard-coded rules:
https://docs.rs/harper-core/latest/harper_core/linting/trait...
Where did the decades of classical NLP go? No gold-standard resources like WordNet? No statistical methods?
There's nothing wrong with this; the solution is a good pragmatic choice. It's just interesting how our collective consciousness of expansive scientific fields can be so thoroughly purged when a new paradigm arises.
LLMs have completely overshadowed ML NLP methods from 10 years ago, and they themselves replaced decades of statistical NLP work, which in turn replaced another few decades of symbolic grammar-based NLP work.
Progress is good, but it's important not to forget all those hard-earned lessons; it can sometimes be a real superpower to be able to leverage that old toolbox in modern contexts. In many ways, we had much more advanced methods in the 60s for solving this problem than what Harper is doing here by naively reinventing the wheel.
Before our rule engine has a chance to touch the document, we run several pre-processing steps that imbue the words it reads with semantic meaning.
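To make that concrete for readers: here's a hypothetical sketch (not Harper's actual code or API; the lexicon, types, and rule below are invented) of what "tag first, then run rules" can look like. Tokens get word-class metadata from a lexicon, so rules can match on classes rather than raw strings:

    use std::collections::HashMap;

    #[derive(Debug, Clone, Copy, PartialEq)]
    enum WordClass { Determiner, Noun, Verb, Unknown }

    struct Token<'a> { text: &'a str, class: WordClass }

    // Pre-processing step: attach a word class to every token.
    fn tag<'a>(text: &'a str, lexicon: &HashMap<&str, WordClass>) -> Vec<Token<'a>> {
        text.split_whitespace()
            .map(|w| Token { text: w, class: *lexicon.get(w).unwrap_or(&WordClass::Unknown) })
            .collect()
    }

    fn main() {
        // Toy lexicon; a real checker would use a full dictionary with affix data.
        let lexicon = HashMap::from([
            ("the", WordClass::Determiner),
            ("dog", WordClass::Noun),
            ("barks", WordClass::Verb),
        ]);
        let tokens = tag("the barks dog", &lexicon);
        // A rule can now inspect classes instead of surface strings.
        for pair in tokens.windows(2) {
            if pair[0].class == WordClass::Determiner && pair[1].class == WordClass::Verb {
                println!("possible error: determiner '{}' directly before verb '{}'",
                         pair[0].text, pair[1].text);
            }
        }
    }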
> LLMs have completely overshadowed ML NLP methods from 10 years ago, and they themselves replaced decades of statistical NLP work, which in turn replaced another few decades of symbolic grammar-based NLP work.
This is a drastic oversimplification. I'll admit that transformer-based approaches are indeed quite prevalent, but I do not believe that "LLMs" in the conventional sense are "replacing" a significant fraction of NLP research.
I appreciate your skepticism and attention to detail. For a rough picture of the lineage that led to modern LLMs, these are good reads:
1. https://jalammar.github.io/illustrated-word2vec/
2. https://jalammar.github.io/visualizing-neural-machine-transl...
3. https://jalammar.github.io/illustrated-transformer/
4. https://jalammar.github.io/illustrated-bert/
5. https://jalammar.github.io/illustrated-gpt2/
And from there it's mostly work on improving optimization (both at training and inference time), training techniques (many stages), data (quality and modality), and scale.
---
There are also state space models, but I don't believe they've gone mainstream yet.
https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
And diffusion models, but I'm struggling to find a good resource, so: https://ml-gsai.github.io/LLaDA-demo/
---
All this being said: many tasks are solved very well using a linear model and TF-IDF, and the results are actually interpretable.
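A minimal sketch of that TF-IDF-plus-linear-model recipe on a toy sentiment task (the corpus, learning rate, and epoch count are purely illustrative, not from any real dataset):

    use std::collections::HashMap;

    // TF-IDF vector for one document, given document frequencies over the corpus.
    fn tfidf<'a>(doc: &'a str, df: &HashMap<&str, f64>, n_docs: f64) -> HashMap<&'a str, f64> {
        let mut tf: HashMap<&str, f64> = HashMap::new();
        for w in doc.split_whitespace() {
            *tf.entry(w).or_insert(0.0) += 1.0;
        }
        tf.into_iter()
            .map(|(w, c)| {
                let idf = (n_docs / (1.0 + df.get(w).copied().unwrap_or(0.0))).ln() + 1.0;
                (w, c * idf)
            })
            .collect()
    }

    fn main() {
        // Toy labelled corpus: 1.0 = positive, 0.0 = negative.
        let docs = [
            ("great product works well", 1.0),
            ("terrible waste of money", 0.0),
            ("works great highly recommend", 1.0),
            ("money wasted terrible quality", 0.0),
        ];

        // Document frequencies for the IDF term.
        let mut df: HashMap<&str, f64> = HashMap::new();
        for &(doc, _) in &docs {
            let mut seen: Vec<&str> = Vec::new();
            for w in doc.split_whitespace() {
                if !seen.contains(&w) {
                    seen.push(w);
                    *df.entry(w).or_insert(0.0) += 1.0;
                }
            }
        }
        let n_docs = docs.len() as f64;

        // Logistic regression trained with plain gradient descent.
        let mut weights: HashMap<&str, f64> = HashMap::new();
        let lr = 0.5;
        for _ in 0..200 {
            for &(doc, label) in &docs {
                let x = tfidf(doc, &df, n_docs);
                let z: f64 = x.iter()
                    .map(|(w, v)| weights.get(w).copied().unwrap_or(0.0) * v)
                    .sum();
                let p = 1.0 / (1.0 + (-z).exp());
                for (w, v) in &x {
                    *weights.entry(*w).or_insert(0.0) += lr * (label - p) * v;
                }
            }
        }

        // The interpretability payoff: one signed weight per word.
        let mut by_weight: Vec<_> = weights.into_iter().collect();
        by_weight.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        for (w, wt) in by_weight {
            println!("{w:>12} {wt:+.3}");
        }
    }

After training, each word carries a single signed weight you can read off directly, which is where the interpretability comes from.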
Indeed, before that there was a lot of work on applying classical ML classifiers (Naive Bayes, Decision Trees, SVMs, Logistic Regression...) and clustering algorithms (fancily referred to as unsupervised ML) to bag-of-words vectors. This was a big field, with some overlap with Information Retrieval, which led to fancier weightings and normalizations of bag-of-words vectors (TF-IDF, BM25). There was also the whole field of Topic Modeling.
Before that there was a ton of statistical NLP modeling (Markov chains and such), primarily focused on machine translation before neural networks got good enough (like the early versions of Google Translate).
And before that there were a few decades of research on grammars (starting with Chomsky), with a lot of overlap with compilers, theoretical CS (state-machines and such) and symbolic AI (lisps, logic programming, expert systems...).
I myself don't have a very clear picture of all of this. I learned some in undergrad and read a few ancient NLP books (60s - 90s) out of curiosity. I started around the time when NLP, and AI in general, had been rather stagnant for a decade or two. It was rather boring and niche, believe it or not, but was starting to be revitalized by the new wave of ML, and then by word2vec and DNNs.
The Neovim configuration for the LSP looks neat: https://writewithharper.com/docs/integrations/neovim
The whole thing seems cool. Automattic should mention this on their homepage. Tools like this are the future of something.
(^^ alien language that was developed in less than a decade)
Not an insurmountable problem: ChatGPT will use "aight fam" only in context-sensitive ways and will remove it if you ask it to rephrase to sound more like a professor. But RLHFing slang into predictable use is likely a bigger potential challenge than simply ensuring the word list of an open-source program is up to date enough to include slang whose etymology dates back to the noughties or nineties, if phrasing things in that particular vernacular is even a target for your grammar-linting tool...
aight, trippin, fr (at least the spoken version), and fam were all very common in the 1990s (which was the last decade I was able to speak like that without getting jeered at by peers).
Also, it takes at most a few developers to write those rules into a grammar-checking system, compared to the millions and more who need to learn a given piece of "evolved" language as it becomes impossible to avoid. It's not only fast enough to do this manually; it's also much less work-intensive and more scalable.
Or, in other words: if you "just" want a utility that can learn speech on the fly, you don't need a rigid grammar checker, just a good enough approximator. If you want to check if a document contains errors, you need to define what an error is, and then if you want to define it in a strict manner, at that point you need a rule engine of some sort instead of something probabilistic.
Languages are used to successfully communicate. To achieve this, all parties involved in the communication must know the language well enough to send and receive messages. This obviously includes messages that transmit changes in the language, for instance, if you tried to explain to your parents the meaning of the current short-lived meme and fad nouns/adjectives like "skibidi ohio gyatt rizz".
It takes time for a language feature to become widespread and de-facto standardized among a population. This is because people need to asynchronously learn it, start using it themselves, and gain critical mass, so that even people who do not like the feature need to start respecting its presence. This inertia is the main source of the slowness I mention, and also a requirement for any kind of grammar-checking software. From the point of view of such software, a language feature that (almost) nobody understands is not a language feature, but an error.
> You also seem to be making an assumption that natural languages (English, I'm assuming) can be well defined by a simple set of rigid patterns/rules?
Yes, that set of patterns is called a language's grammar. Even dialects and slang have grammars of their own, even if they're different, less popular, have fewer formal materials describing them, and/or aren't taught in schools.
I find that clear-cut, rigid rules tend to be the least helpful ones in writing. Obviously this class of rule is also easy/easier to represent in software, so it also tends to be the source of false positives and frustration that lead me to disable such features altogether.
When writing for utility and communication, though, English grammar is simple and standard enough. Browsing Harper sources, https://github.com/Automattic/harper/blob/0c04291bfec25d0e93... seems to have a lot of the basics already nailed down. Natural language grammar can often be represented as "what is allowed to, should, or should not, appear where, when, and in which context" - IIUC, Harper seems to tackle the problem the same way.
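As a toy illustration of that "what should appear where, in which context" framing (this is not Harper's code, and the rule is deliberately naive, ignoring exceptions like "an hour" or "a university"):

    // Flag "a" before a vowel-initial word and "an" before a consonant-initial word.
    fn check_articles(text: &str) -> Vec<String> {
        let words: Vec<&str> = text.split_whitespace().collect();
        let mut lints = Vec::new();
        for pair in words.windows(2) {
            let (article, next) = (pair[0].to_lowercase(), pair[1].to_lowercase());
            let vowel_start = next.chars().next().map_or(false, |c| "aeiou".contains(c));
            if article == "a" && vowel_start {
                lints.push(format!("Use `an` before `{}`", pair[1]));
            } else if article == "an" && !vowel_start {
                lints.push(format!("Use `a` before `{}`", pair[1]));
            }
        }
        lints
    }

    fn main() {
        for lint in check_articles("I ate a apple and an banana") {
            println!("{lint}");
        }
    }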
Even these few posts follow innumerable “rules” which make it easier to (try to) communicate.
Perhaps what you’re angling against is where rules of language get set in stone and fossilized until the “Official” language is so diverged from the “vulgar tongue” that it’s incomprehensibly different.
Like church/legal Latin compared to Italian, perhaps. (Fun fact - the Vulgate translation of the Bible was INTO the vulgar tongue at the time: Latin).
Certainly we would never want our language to be less expressive. There’s no point to that.
And what would be the point of changing for the sake of change? Sure, we blop use the word ‘blop’ instead of the word ‘could’ without losing or gaining anything, but we’d incur the cost of changing books and schooling for … no gain.
Ah, but it’d be great to increase expressiveness, right? The thing is, as far as I am aware all human languages are about equal in terms of expressiveness. Changes don’t really move the needle.
So, what would the point of evolution be? If technology impedes it … fine.
Being equally expressive overall, but more focussed where current needs are.
OTOH, I don't think anything is going to stop language from evolving in that way.
Agreed. Same with those non-ASCII single and double quotes.
https://github.com/languagetool-org/languagetool
I generally run it in a Docker container on my local machine:
https://hub.docker.com/r/erikvl87/languagetool
I haven't messed with Harper closely but I am aware of its existence. It's nice to have options, though.
It would sure be nice if the Harper website made clear that one of the two competitors it compares itself to can also be run locally.
https://dev.languagetool.org/finding-errors-using-n-gram-dat...
I would suggest diving into it more because it seems like you missed how customizable it is.
I also only ever used the web app, copy+pasting into it, since installing the app is for all intents and purposes installing a keylogger.
Grammar works on rules; I'm not sure why that needs an LLM. Grammarly certainly worked better for me when it was dumber and rule-based.
It's not a problem; I make the determination which option I like better, but it is funny.
Not that I think an LLM is always better, but it would be interesting to compare these two approaches.
Given LISP was supposed to build "The AI"... pretty sad that a dumb LLM is taking its place now.
They must have acquired fantastic data for their models, especially given the business language and professional translations they focus on.
They keep your intended message intact and just refine it, like post-editing a book. Grammarly and other tools force you to sound the way they think is best.
DeepL shows, in my opinion, how much more useful a model trained for specific uses is.
So just like English teachers I see
I've relied on Grammarly to spellcheck all my writing for a few years (dyslexia prevents me from seeing the errors even when reading it 10 times). However, its increasing focus on LLMs and its insistence on rewriting sentences in more verbose ways bother me a lot. (It removes personality and makes human-written text read like AI text.)
So I've tried out alternatives, and Harper is the closest I've found at the moment... but I still feel like Grammarly does a better job at basic word suggestions.
Really, all I wish for is a spellcheck that can use the context of the sentence to suggest words. Most ordinary dictionary spellchecks can pick the wrong word because it's closer in spelling. They may replace "though" with "thought" because I wrote "thougt", when the sentence clearly indicates "though" is correct; and I see no difference visually between any of the three words.
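Here's roughly what that kind of context-aware suggestion could look like: candidates close in spelling are re-ranked by how well they fit the neighbouring words. The bigram counts below are invented for illustration; a real checker would use corpus statistics or a small language model.

    use std::collections::HashMap;

    // Standard Levenshtein edit distance with a rolling row.
    fn edit_distance(a: &str, b: &str) -> usize {
        let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        for i in 1..=a.len() {
            let mut cur = vec![i];
            for j in 1..=b.len() {
                let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
                cur.push((prev[j] + 1).min(cur[j - 1] + 1).min(prev[j - 1] + cost));
            }
            prev = cur;
        }
        prev[b.len()]
    }

    fn main() {
        let dictionary = ["though", "thought", "tough", "through"];
        // Made-up counts of (previous word, candidate) pairs.
        let bigrams: HashMap<(&str, &str), u32> = HashMap::from([
            (("even", "though"), 50),
            (("even", "thought"), 2),
            (("i", "thought"), 40),
        ]);

        let previous = "even";
        let typo = "thougt";

        let mut candidates: Vec<(&str, usize, u32)> = dictionary
            .iter()
            .map(|&w| (w, edit_distance(typo, w), *bigrams.get(&(previous, w)).unwrap_or(&0)))
            .filter(|&(_, dist, _)| dist <= 2)
            .collect();
        // Prefer a close spelling, then the candidate that best fits the context.
        candidates.sort_by(|a, b| a.1.cmp(&b.1).then(b.2.cmp(&a.2)));
        println!("suggestions for `{previous} {typo}`: {candidates:?}");
    }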
There are some areas where it seems like LLMs (or even SLMs) should be way more capable. For example, when I touch a word on my Kindle, I'd think Amazon would know how to pick the most relevant definition. Yet it just grabs the most common definition. For example, consider the proper definition of "toilet" in this passage: "He passed ten hours out of the twenty-four in Saville Row, either in sleeping or making his toilet."
No errors detected. So this needs a lot of rule contributions to get to Grammarly level.
> In large, this is _how_ anything crawler-adjacent tends to be
It suggests
> In large, this is how _to_ anything crawler-adjacent tends to be
Even in British English I'm not sure how widely they actually use it - do they say "I've a car" and "I haven't a car"?
Contractions are common in Australian English too, though becoming less so due to the influence of US English.
Has to be a bug in their rule-based system?
Using an LLM would also help make it multilingual. Both Grammarly and Harper only support English and will likely never support more than a few dozen very popular languages. LLMs could help cover a much wider range of languages.
LLMs are trained so hard to be helpful that it's really difficult to confine them to other tasks.
It is, of course, mostly very good at it, but it's very far from "trustworthy", and it tends to mirror the mistakes you make.
https://writewithharper.com/docs/integrations/language-serve...
https://automattic.com/2024/11/21/automattic-welcomes-harper...
I honestly don't trust Grammarly... I mean, it's essentially a keylogger.
I did try it a bit once, and I never seemed to have it work that well for me. But I am multilingual, so maybe that's part of my hurdle.
Do you have a setup where this is possible or do you copy paste between text fields? (Genuine question. I’d love to use a local LLM integrating with an LSP).
I used Grammarly briefly when it came out and liked the idea. Admittedly it has more polish than Vale [0] for people writing in Google Docs, &c. Still, I stick with Vale. Is there any case for moving to Harper?
[0] https://vale.sh/
It’s missing a default rule set with rules that are generally okay without being too opinionated.
Passes.
For reference: https://youtu.be/w-R_Rak8Tys?si=h3zFCq2kyzYNRXBI
Otherwise, it's great work. There should be an option to import/export the correction rules though.
I guess it's a nice and lightweight enhancement on top of the good old spellchecker, though.
I wonder whether it will impact performance (in Firefox) and make things noticeably slower...
Recently I noticed that highlighting extensions in Firefox were slowing things down significantly, not just when loading but also while scrolling up and down web pages.
Why would you pass a writing job to someone who isn't 100% fluent in the language and then make up for it by buying complex tools?
Instead tell me how it compares to the built-in spellcheck in my browser/IDE/word processor/OS.
The Chrome enhanced grammar checker is still awful after decades.
Maybe the AI hype will finally fix this? I'm still surprised this wasn't the first thing they did.
https://tidbits.com/2025/01/30/why-grammarly-beats-apples-wr...
(George Carlin or something, quote's veracity depends on what you mean by “average.”)
I think everybody could benefit from having something like Grammarly on their computer. None of us writes perfectly, and it's always beneficial to strive for improvement.
Also, once I asked an LLM to check a message. It said everything looked fine and included a copy of the message in its response with one sentence in the middle removed.
We've had some contributors have a go at adding LaTeX support in the past, but they've yet to succeed with a truly polished option. The irregularity of LaTeX makes it somewhat difficult to parse.
We accept contributions, if anyone is interested in getting us across the finish line.
Is there any reason why there is no Firefox extension?
If Harper does better at this I’d change in a minute.
I.e. if you write a "MISTAEK" and then scroll, the highlight follows you around the page.
I tried with the following phrase -- "This should can't logic be done me." --
No errors.
> We currently only support English and its dialects British, American, Canadian, and Australian. Other languages are on the horizon, but we want our English support to be truly amazing before we diversify.
Then, post-COVID, with the increase in screen-sharing video calls, I soon realised that nearly every non-native English speaker from countries around the world heavily relied on it in their job, as I could see it installed when people shared screens.
Huge market, good luck.
https://writewithharper.com/docs/rules
https://github.com/Automattic/harper/blob/0c04291bfec25d0e93...
"PointIsMoot" => (
["your point is mute"],
["your point is moot"],
"Did you mean `your point is moot`?",
"Typo: `moot` (meaning debatable) is correct rather than `mute`."
),
That it doesn't use LLMs is its advantage: it runs in under 10 ms, can be easily embedded in software, and still provides useful grammar checking, even if it's not exhaustive.