As a result, I'm mostly using this selectively so far, and I wouldn't want it turned on by default for every PR.
Nail on the head. Every time I've seen it applied, it's awful at this. However, this is the one thing I loathe in human reviews as well, where people leave twenty comments about naming and the actual FUNCTIONAL issue is buried inside all of that mess. A good code reviewer knows how to drop all the things that irk them and hyperfocus on what matters, if there's a functional issue with the code.
I wonder if AI is ever gonna be able to conquer that one, as it's quite nuanced. If they do though, then I feel the industry as it is today is kinda toast for a lot of developers, because outside of agency, this is the one thing we were sorta holding out on as not very automatable.
What this meant was that instead of leaving nitpicky comments, people would just change things that were nitpicky but clear improvements. They'd only leave comments (which blocked release) for stuff that was interesting enough to discuss.
This was typically a big shock for new hires who were used to the "comment for every nitpick" system; I think it can feel insulting when someone changes your feature. But I quickly came to love it and can't imagine doing code review any other way now. It's so much faster!
I'm not sure how to tie this to AI code review tbh. Right now I don't think I'd trust a model's taste for when to change things and when to leave a comment. But maybe that'll change. I agree that if you automated away my taste for code it'd put me in a weird spot!
I like this review method too though, and like that some pr review tools have a 'suggest changes' and 'apply changes' button now too
Fire both. There is no amount of skill and productivity that can justify that amount of pettiness.
You fire both or at least one of them. Problem solved.
This is my dream; have only had a team with little enough ego to actually achieve it once for an unfortunately short period of time. If it's something that there's a 99% chance the other person is going to say 'oh yeah, duh' or 'sure, whatever' then it's just wasting both of your time to not just do it.
That said, I've had people get upset over merging their changes for them after a LGTM approval when I also find letting it sit to be a meaningless waste of time.
If the reviewer can make changes without someone reviewing their change, it's just waiting to blow up in your face.
It used to be "open source" in that some of the code was available, but afaik it wasn't ever possible to actually run it externally because of how tightly it integrated with other internal systems.
It's pretty straightforward: you checkout a PR, move around, and either make some edits (that you can commit and push to the feature branch) or add comments.
If I had to pick, I actually think ONLY being able to submit "counter-patches" would be better than only being able to submit comments. Comments could just be actual programming language style comments submitted as code changes.
On the other hand, if it's less important code or the renaming is not clearly an improvement, it can be quite useless. But I've met some developers who see reviews as pointless and just say "this works, just approve it already", which can be very frustrating in a codebase with a lot of collaboration.
1. you are violating a previously agreed upon standard for naming things
2. inconsistent naming, eg some places you use "catalog ID" and other places you use "item ID" (using separate words and spaces here because case is irrelevant).
3. the name you chose makes it easy to conflate two or more concepts in your system
4. the name you chose calls into question whether you correctly understood the problem domain you are addressing
I'm sure there are other good naming comments, but this is a reasonable representation of the kinds of things a good comment will address.
However, most naming comments are just bike shedding.
The bigger problem is people who feel ownership of shared codebases tied to their ego and who get angry when people suggest changes to names and other bits of interfaces instead of just making the suggested change.
If you get code review feedback, the default answer is "Done" unless you have a strong reason not to. If it's not obvious whether the name suggested by the author or the reader is better, the reader's choice should be taken every time.
I used to think that way, but in many nontrivial circumstances, every conceivable name will be a mismatch for where some person is coming from, and not be self-evident for their mental model. Even the same person, over a longer time span. There is often a gap to bridge from name to meaning, and a comment isn’t the worst way to bridge it.
yes but it can be severely diminishing returns. Like let's step back a second and ask ourselves if:
var itemCount = items.Count;
vs
var numberOfItems = items.Count;
is ever worth spending the time discussing, versus how much of a soft improvement it makes to the code base. I've literally been in a meeting room with three other senior engineers killing 30 minutes discussing this and I just think that's a complete waste of time. They're not wrong, the latter is clearer, but if you have a PR that improves the repo and you're holding it back because of something like this, then I don't think you have your priorities straight.
I'm at the sort of place that is fussy about test naming, so where I would do something like:
TestSearchCriteriaWhere
they'd want
Test_That_Where_Clauses_In_Search_Criteria_Work
I think it's a waste of typing but idk, I'm willing to let it slide because I think it's a pointless hill to die on.
var itemCount = items.Count;
depends on what `items` is, no? Is the `.Count` O(1)? Do you really need a variable or is it fine for the (JIT) compiler to take care of it? Is it O(n) and n is significant enough? Maybe you need a variable, and then you spend time arguing about that name. Yes, I chose this because almost everyone I know would at least argue you always have to create the variable (and then argue about the name) ;)
> fussy about test naming
I get fussiness about test naming. I believe that a good test "name" should tell you enough for you to be able to "double check" the test setup as well as the assertions against the test name, with some sort of "reasonable" knowledge of the code/problem domain. As such, both of those test names are really bad, because they don't tell you anything at all about whether you're testing for the correct thing. How do I know that your assertions are actually asserting that it "works"?
Instead, I'd want a test named something like this (assuming that that's what this particular test is actually about - i.e. imagine this particular test in the context of a user defined search, where one of the options is that they can specify a project to search by and this particular test is about verifying that we check the permissions the user has for said project. There would be different tests for each of the relevant where clauses that specifying a project in the search params would entail and different tests again for each of the other user specifiable parameters that result in one or more where clauses to be generated):
shouldCheckProjectPermissionsWhenProjectIdInSearchParams()
Every single test case gives you the ability to specify both a good name and clear, concise test assertions. If I see anything but a bunch of assertions related to project permissions for the logged-in user in this test, I will fight you tooth and nail on that test ;) I couldn't care less tho if you use camelCase or snake_case or whatever, I just had to choose something to post.

I also couldn't care less if you had 17 different assertions in the test (we all know that "rule", right?). I think the "test one thing" / "one assertion" rule is not about the actual number of "assert statements"; people who think that got the rule wrong. It's all about the "thing" the assertions test. If you have 17 assertions that are all relevant to testing the project permission in question, then they're great and required to be there. If 1 is for asserting the project permissions and the other 16 are repeating all the other "generic assertions" we copy and pasted from previous tests, then they're not supposed to be there. I will reject such a PR every time.

I appreciate neither of those test names are great but it was just a strawman example to show the fussiness.
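[Editor's note] To make the kind of test described above concrete, here is a minimal JUnit 5 + Mockito sketch; the SearchService/PermissionChecker/SearchParams names are invented for illustration, not taken from the thread:

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

class SearchServiceTest {

    // Hypothetical collaborators, defined inline so the sketch is self-contained.
    interface PermissionChecker {
        void checkProjectAccess(String userId, long projectId);
    }

    record SearchParams(Long projectId) {}

    static class SearchService {
        private final PermissionChecker permissions;

        SearchService(PermissionChecker permissions) {
            this.permissions = permissions;
        }

        void search(SearchParams params, String userId) {
            if (params.projectId() != null) {
                permissions.checkProjectAccess(userId, params.projectId());
            }
            // ... build and run the where clauses ...
        }
    }

    @Test
    void shouldCheckProjectPermissionsWhenProjectIdInSearchParams() {
        PermissionChecker permissions = mock(PermissionChecker.class);
        SearchService service = new SearchService(permissions);

        service.search(new SearchParams(42L), "alice");

        // The only assertion here is about project permissions for the logged-in user --
        // exactly what the name promises, nothing copy-pasted from other tests.
        verify(permissions).checkProjectAccess("alice", 42L);
    }
}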
So clearly distinguishing the local `numberOfItems` from `items.Count` _could_ be helpful. But I wouldn't ping it in a review.
(Because the correct English term is “item count”, not “items count”.)
Personally, I tend to only name it “count” if it’s a variable that is used to keep a count, i.e. it is continually incremented as new items are processed.
Otherwise I tend to prefer `numItems`.
Yes, this is very close to bike-shedding. There is, however, an argument to be made for consistency in a code base.
var relevantCount = items.Where(x => x.SomeValue > 5);
vs
var numberOfRelevantItems = items.Where(x => x.SomeValue > 5);
so it wasn't necessarily cheap enough to want to repeat.
Here though you're missing the ".Count", so
var relevantItems = items.Where(x => x.SomeValue > 5);
relevantItems.Count
As long as it's not a property that's calculated every time it's accessed, this still seems better than pulling the count into its own variable to me.

All of the non-critical words in English aren't useless bloat; they remove ambiguity and act as a kind of error correction if something is wrong.
It may feel that way to many. I mostly use "suggestion", "thought", and "todo". When I type out "nit..." I usually realize it isn't worth it. I'd rather comment on the higher-level aspects of the changes.
If the comment must be addressed before the review is approved, then it is not a nit, it is a blocker (a "changes required"). Blockers should not be marked as nits — nor vice versa.
I agree that prefixing comments with "Nit:" (or vice versa in extreme cases "This is a big one:") is psychologically useful. Yet another reason it's useful is that it's not uncommon for perceived importance to vary over time: you start with "hmm, this could be named blah" and a week later you've convinced yourself it's a blocker — so, force yourself to recognize that it was originally phrased as a nit, and force yourself to come back and say explicitly "I've changed my mind: I think this is important." With or without the "nit/blocker" prefixing pattern, the reviewer may come off as capricious; but with the pattern, he's at least measurably capricious.
Eventually I've told him "if your comment does not affect performance or business logic, I'm ignoring it". He finally got the message. The fact that he accepted this tells me that deep down he knew his comments were just bike shedding.
Then, going through their code, they make excuses about their code not meeting the same standards they demand.
As the other responder recommends, a style guide is ideal; you can even create an unofficial one and point to it when conflicting style requests are made.
Yes!! Exactly. When it comes to my PRs, he once made this snarky comment about him having high expectations in terms of code quality. When it comes to his PRs, he does the things he tells me not to do. In fact, I once sent him a "dis u?" with a link to his own code, as a response to something he told me I shouldn't do. To his credit he didn't make excuses, he responded "I could've done better there, agreed".
In general he's not bad, but his nitpicking is bad. I don't really understand what's going on in his mind that drives this behavior, it's weird.
For example, most people would agree you should use exhaustive checks when possible (matching in Rust, records in TypeScript, etc.). But how do you enforce that? Ban other types of control flow? Even if you find a good, balanced way to enforce it, you won't always want to enforce it. There are plenty of good use cases where you explicitly don't want a check to be exhaustive. At which point you've got to make sure there's an escape mechanism for whatever crackhead check you've set up. Better to just leave a comment with a link to your style guide explaining why this is done. Many experienced developers who are new to Rust or TypeScript simply never think of things like this, so it's worthwhile to document it.
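[Editor's note] For what it's worth, recent Java has the same idea; a rough sketch (Java 21 sealed interfaces plus switch expressions; the PaymentEvent names are made up for illustration):

// A sealed hierarchy: the compiler knows every possible subtype.
sealed interface PaymentEvent permits Authorized, Captured, Refunded {}
record Authorized(long amountCents) implements PaymentEvent {}
record Captured(long amountCents) implements PaymentEvent {}
record Refunded(long amountCents, String reason) implements PaymentEvent {}

class Ledger {
    // Exhaustive: no default branch, so adding a new PaymentEvent subtype
    // becomes a compile error here instead of a silent fall-through.
    static String describe(PaymentEvent event) {
        return switch (event) {
            case Authorized a -> "authorized " + a.amountCents();
            case Captured c -> "captured " + c.amountCents();
            case Refunded r -> "refunded " + r.amountCents() + " (" + r.reason() + ")";
        };
    }

    // The escape mechanism: a default branch deliberately opts out of exhaustiveness,
    // which is exactly the kind of choice worth a comment linking to the style guide.
    static boolean movesMoney(PaymentEvent event) {
        return switch (event) {
            case Captured c -> true;
            case Refunded r -> true;
            default -> false;
        };
    }
}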
- If it's a rough PR, you're looking for feedback on direction rather than nitpicks.
- If it's in a polished state, it's good to nitpick assuming you have a style guide you're intending to adhere to.
Perhaps this can be provided in the system prompt?
As an aside, naming is highly subjective. Like in writing, you tailor naming to the problem domain and the audience.
You should check it out at Augmentcode.com
It's not very good with the rest, because there's an intuition that needs to be developed over time that takes all the weirdness into account. The dead code, the tech debt, the stuff that looks fundamentally broken but is depended on because of unintended side effects, etc. The code itself is not enough to explain that, it's not a holistic documentation of the system.
The AI is no different to a person here: something doesn't 'feel' right, you go and fix it, it breaks, so you have to put it back again because it's actually harder than you think to change it.
I think this is the problem with just about every tool that examines code.
I've had the same problem with runtime checkers, with static analysis tools, and now ai code reviews.
Might be the nature of the beast.
probably happens with human code reviews too. Lots of style false positives :)
We wrote about our approach to it some time ago here - https://www.greptile.com/blog/make-llms-shut-up
Much has changed in our approach since then, so we'll probably write a new blog post.
The tl;dr of what makes it hard:

- different people have different ideas of what a nitpick is
- it's not a spectrum, the differences are qualitative
- LLMs are reluctant to risk downplaying the severity of an issue and are therefore unable to usefully filter out nits
- theory: they are paid by the token and so they say more stuff
Now, that could happen with a human reviewer as well. But it didn't catch the context of the change.
I wouldn't replace human review with LLM-review but it is a good complement that can be run less frequently than human review.
Maybe that's why I find it easy to ignore the noise: I have it do a huge review task after a lot of changes have happened. It'll find 10 or so things, and the top 3 or 4 are likely good ones to look deeper into.
I have to constantly push back against it proposing C++ library code, like std::variant, when C-style basics are working great.
1) give it a number of things to list in order of severity
and
2) tell it to grade how serious of a problem it may be
The human reviewer can then look at the top ten list and what the LLM thinks about its own list for a very low overhead of thinking (i.e. if the LLM thinks its own ideas are dumb a human probably doesn't need to look into them too hard)
It also helps to explicitly call out types of issue (naming, security, performance, correctness, etc)
The human doesn't owe the LLM any amount of time considering, it's just an idea generating tool. Looking through a top ten list formatted as a table can be scanned in 10 seconds in a first pass.
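[Editor's note] A rough sketch of that kind of prompt, written as a Java text block only to stay consistent with the thread's other examples; the wording itself is invented, not from any particular tool:

class ReviewPrompt {
    // Hypothetical instructions: a bounded, ranked list with self-graded severity and
    // issue type, so a human can scan the table and drop the low-confidence rows fast.
    static final String TOP_TEN_REVIEW = """
            Review the attached diff. Output a table with at most 10 rows, ordered from
            most to least severe, with these columns:
            rank | type (correctness/security/performance/naming/other) | file:line | issue | severity 1-5 | your confidence that this matters
            Do not pad the list: if you only find 3 real issues, output 3 rows.
            """;
}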
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
I would argue they go far beyond linters now, which was perhaps not true even nine months ago.
To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.
do you have a bot to do this too?
Sample number of true positives says more or less nothing on its own.
Clippy + Rusts type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly reduced the bar for bringing clippy quality linting to every language but at the cost of determinism.
Reading embarrassment into that is extremely childish and disrespectful.
9,078 comments / 7 (days) / 8 (hours) = 162.107 though, so if human that person is making 162 comments an hour, 8 hours a day, 7 days a week?
// stuff
obj.setSomeData(something);
// fifteen lines of other code
obj.setSomeData(something);
// more stuff
The 'something' was a little bit more complex, but it was the same something with slightly different formatting.

My linter didn't catch the repeat call. When asking the AI chat for a review of the code changes, it did correctly flag that there was a repeat call.
It also caught a repeat call in
List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
// ...
var something = someFunc(objs);
Thingy someFunc(List<Objs> param) {
return param.stream().filter(o -> o.field.isPresent()). ...
Where one of the filter calls is unnecessary... and it caught that across a call boundary.

So, I'd say that AI code reviews are better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the code conventions for the team (in particular the use of internal terms within naming conventions).
If the AI doesn’t understand it, chances are it’s counter-intuitive. Of course not all LLM’s are equal, etc, etc.
So you've got this Java:
public List<Integer> someCall() {
return IntStream.range(1,10).boxed().toList();
}
public List<Integer> filterEvens(List<Integer> ints) {
return ints.stream()
.filter(i -> i % 2 == 0)
.toList();
}
int aMethod() {
List<Integer> data = someCall();
return filterEvens(data.stream().filter(i -> i % 2 == 0).toList()).size();
}
And I can mock the class and return a spied List. But now I've got to have that spied List return a spied stream that checks to see if .filter(i -> i % 2 == 0) was called. But then someone comes and writes it later as .filter(i -> i % 2 != 1) and the test breaks. Or someone adds another call to sort them first, and the test breaks.

To that end, I'd be very curious to see the test code that verifies that when aMethod() is called, the List returned by someCall is not filtered twice.
What's more, it's not a useful test - "not filtered twice" isn't something that is observable. It's an implementation detail that could change with a refactoring.
Writing a test that verifies that filterEvens returns a list that only contains even numbers? That's a useful test.
Writing a test that verifies that aMethod returns back the size of the even numbers that someCall produced? That's a useful test.
Writing a test that tries to enforce a particular implementation between the {} of aMethod? That's not useful and incredibly brittle (assuming that it can be written).
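[Editor's note] A quick sketch of those two useful tests against the class above (I'm calling it NumberService; that name is mine, and the tests only pin observable behaviour, not how many times anything was filtered):

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

class NumberServiceTest {

    // Assumes someCall/filterEvens/aMethod from the snippet above live on NumberService,
    // and that the test sits in the same package so it can call package-private aMethod().
    private final NumberService service = new NumberService();

    @Test
    void filterEvensKeepsOnlyEvenNumbers() {
        // Observable behaviour: the output is exactly the even inputs, in order.
        assertEquals(List.of(2, 4), service.filterEvens(List.of(1, 2, 3, 4, 5)));
    }

    @Test
    void aMethodReturnsCountOfEvensFromSomeCall() {
        // someCall() produces 1..9, of which 2, 4, 6 and 8 are even.
        assertEquals(4, service.aMethod());
    }
}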
I think they are just arguing for the sake of arguing.
I think we're at the point where you need concrete examples to talk about whether it's worth it or not. If you have functions that can't be called twice, then you have no option but to test implementation details like that.
Yeah there's a tradeoff between torturing your code to make everything about it testable and enforce certain behavior or keeping it simpler.
I have worked in multiple code bases where every function call had asserts on how many times it was called and what the args were.
How would you assert that a given std::vector only was filtered by std::ranges::copy_if once? And how would you test that the code that was in the predicate for it wasn't duplicated?
How would you write a failing test for this function keeping the constraint that you are working with std::vector?
std::vector<int> doThing(const std::vector<int>& nums) {
std::vector<int> tmp1;
std::vector<int> tmp2;
std::ranges::copy_if(nums,
std::back_inserter(tmp1),
[](int n) { return n % 2 == 0; }
);
std::ranges::copy_if(tmp1,
std::back_inserter(tmp2),
[](int n) { return n % 2 == 0; }
);
return tmp2;
}

void func() {
printEvens(someCall().stream().filter(n -> n % 2 == 0).toList());
}
void printEvens(List<Integer> nums) {
nums.stream().filter(n -> n % 2 == 0).forEach(n -> System.out.println(n));
}
The first filter is redundant in this example. Duplicate code checkers are checking for exactly matching lines. I am unaware of any linter or static analyzer that would flag this.
What's more, unit tests to test the code for printEvens (there exists one) pass because they're working properly... and the unit test that calls the calling function passes because it is working properly too.
Alternatively, write the failing test for this code.
You could write a test that makes sure the output of someCall is passed directly to printEvens without being modified.
The example as you wrote it is hard to test in general. It's probably not something you would write if you're serious about testing.
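[Editor's note] Roughly what that test could look like with a Mockito spy, assuming func/someCall/printEvens all live on one non-final class I'm calling EvenPrinter; the name and setup are assumptions, and note that the test does pin an implementation detail:

import static org.mockito.ArgumentMatchers.anyList;
import static org.mockito.Mockito.doNothing;
import static org.mockito.Mockito.doReturn;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.verify;

import java.util.List;
import org.junit.jupiter.api.Test;

class EvenPrinterTest {

    @Test
    void funcPassesSomeCallOutputToPrintEvensUnmodified() {
        EvenPrinter printer = spy(new EvenPrinter());
        List<Integer> canned = List.of(1, 2, 3, 4);
        doReturn(canned).when(printer).someCall();
        doNothing().when(printer).printEvens(anyList());

        printer.func();

        // Fails against the code above: func() filters before delegating, so
        // printEvens receives [2, 4] rather than the canned list.
        verify(printer).printEvens(canned);
    }
}

As the parent points out, this is brittle: any change in where the filtering happens breaks it, even if the observable output stays the same.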
#include <vector>
#include <iostream>
#include <algorithm>
#include <iterator>  // for std::back_inserter
std::vector<int> someCall()
{
return {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
}
void printEvens(const std::vector<int>& nums)
{
std::ranges::for_each(nums, [](int n)
{
if (n % 2 == 0)
{
std::cout << n << '\n';
}
});
}
int main()
{
std::vector<int> data = someCall();
std::vector<int> tmp;
std::ranges::copy_if(data,
std::back_inserter(tmp),
[](int n) { return n % 2 == 0; }
);
printEvens(tmp);
return 0;
}
---

Nothing in there is wrong. There is no test that would fail short of going through the hassle of creating a new type that does some sort of introspection of its call stack to verify which function it's being called in.
Likewise, identify if a linter or other static analysis tool could catch this issue.
Yes, this is a contrived example and it likely isn't idiomatic C++ (C++ isn't my 'native' language). The actual code in Java was more complex and had a lot more going on in other parts of the files. However, it should serve to show that there isn't a test for printEvens or someCall that would fail because it was filtered twice. Additionally, it should show that a linter or other static analysis wouldn't catch the problem (I would be rather impressed with one that did).
From ChatGPT a code review of the code: https://chatgpt.com/share/69780ce6-03e0-8011-a488-e9f3f8173f...
But why would anyone ever do that? There's nothing incorrect about the code, it's just less efficient than it should be. There's no reason to limit calls to printEvens to accept only output from someCall.
You could pick it up if you were to explicitly track whether it's being called redundantly but it'd be very hard and by the time you'd thought of doing that you'd certainly have already manually checked the code for it.
Now I'm getting a robot to review the branch at regular intervals and poking holes in my thinking. The trick is not to use an LLM as a confirmation machine.
It doesn't replace a human reviewer.
I don't see the point of paying for yet another CI integration doing LLM code review.
It's very unlikely that any of these tools are getting better results than simply prompting verbatim "review these code changes" in your branch with the SOTA model du jour.
Not guaranteed though of course.
I would on the whole prefer a 'lint-style' tool to catch most stuff because they don't hallucinate.
But obviously they don't catch everything so an LLM-based review seems like an additional useful tool.
Where they suck is high-level problems: is the code actually solving the business problem? Is it using the right dependencies? Does it fit into the broader design?
Which is expected, and a great help for me. As a human I'm happier spending less time checking whether you're managing the lifecycle of a pointer correctly, and more time ensuring the code does what it needs to do.
It regularly finds problems, including subtle but important problems that human reviewers struggle to find. And it can make pretty good suggestions for fixes.
It also regularly complains about things that are possible in theory but impossible in practice, so we've gotten used to just resolving those comments without any action. Maybe if we used types more effectively it would do that less.
We pay a lot more attention to what CodeRabbit says than what DeepSource said when we used it.
Like anything else AI you need to understand what you're doing, so you need to understand your code and the structure of your application or service or whatever because there are times it will say something that's just completely wide of the mark, or even the polar opposite of what's actually the case. And so you just ignore the crap and close the conversation in those situations.
At the same time, it does catch a lot of bugs and problems that fall into classes where more traditional linters really miss the mark. It can help fill holes in automated testing, spot security issues, etc., and it'll raise PRs for fixes that are generally decent. Sometimes not but, again, in these cases you just close them and move on.
I'd certainly say that an AI code review is better than no code review at all, so it's good for a startup where you might be the only developer or where there are only one or two of you and you don't cross over that much.
But the point I actually wanted to get to is this: I use Copilot because it's available as part of my GitHub subscription. Is it the best? I don't know. Does it add value with zero integration cost to me? Yes. And that, I suspect, is going to make it the default AI code review option for many GitHub subscribers.
That does leave me wondering how much of a future there is for AI code review as a product or service outside of the hosting platforms like GitHub and Gitlab, and I have to imagine that an absolutely savage consolidation is coming.
Not a jumping off point, but I'm having pretty great results on a complicated fork on a big project with a `git diff main..fork > main.diff`, then load in the specs I keep, and tell it to review the diff in chunks while updating a ./review.md
It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
The low quality of HN comments has been blowing my mind.
I have quite literally been doing what you describe every working day for the last 6+ months.
In our case, agentastic.dev, we just baked the code review right into our IDE. It just packages the diff for the agent, with some prompt, and sends it out to different agents of your choice (Claude, Codex) in parallel. The reason our users like it so much is because they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is you don't need to read freaking poems.
it's a valuable problem to solve, amplified by the fact that ai coding produces much more code.
that being said, i think it's damn hard to compete with openai or anthropic directly on a core product offering in the long run. they know that it's an important problem and will invest accordingly.
* Suggested silencing an exception instead of crashing and burning, for "style" (the potential exception was handled earlier in the code, but it did not manage to catch that context). When I commented that silencing the exception could lead to uncaught bugs, it replied "You're absolutely right, remove the try-catch" which I of course never added.
* Us using Python 3.14 is a logic error because "python 3.14 does not exist yet".
* "Review the async/await patterns Heavy use of async in model validation might indicate these should be application services instead." Whatever this vague sentence means. Not sure if it is suggesting we change the design pattern used in our entire code base.
Also the "confidence" score added to each PR being 4/5 or something due to these irrelevant comments was a really annoying feature IMO. In general AI tools giving a rating when they're wrong feels like a big productivity loss as then the human reviewer will see that number and think something is wrong with the PR.
--
Before this we were running Coderabbit which worked really well and caught a lot of bugs / implementation gotchas. It also had "learnings" which it referenced frequently so it seems like it actually did not repeat commenting on intentional things in our code base. With Coderabbit I found myself wanting to read the low confidence comments as well since they were often useful (so too quiet instead of too noisy). Unfortunately our entire Coderabbit integration just stopped working one day and since then we've been in a long back and forth with their support.
--
I'm not sure what the secret sauce is but it feels like Greptile was GPT 3.5-tier and Coderabbit was Sonnet 4.5-tier.
In my case using these prompts:
https://github.com/masoncl/review-prompts
Took things from "pure noise" to a world where, if you say there's a bug in your patch, people's first question will be "has the AI looked at it?"
FWIW in my case the AI has never yet found _the_ bug I was hunting for but it has found several _other_ significant bugs. I also ran it against old commits that were already reviewed by excellent engineers and running in prod. It found a major bug that wasn't spotted in human review.
Most of the "noise" I get now just leads me to say "yeah I need to add more context to the commit message". E.g the model will say "you forgot to do X" when X is out of scope for the patch and I'm doing it in a later one. So ideally the commit messages should mention this anyway.
You have a great product. Looking forward to getting back to using it!
However, I disagree that independence is a competitive advantage. If it’s true that having a “firewall” between the coding agent and review agent leads to better code, I don’t see why a company like Cursor can’t create full independence between their coding and review products but still bundle them together for distribution.
Furthermore, there might well be benefits to not being fully independent. Imagine if an external auditor was brought in to review every decision made inside your company. There would likely be many things they simply don’t understand. Many decisions in code might seem irrational to an external standalone entity but make sense in the broader context of the organization’s goals. In this sense, I’m concerned that fully independent code review might miss the forest for the trees relative to a bundled product.
Again, I’m rooting for you guys. But I think this is food for thought.
There is no reason to not just ask Claude for a review and then distill this into PR comments. Especially because "every LLM output has to be filtered through a human" is a good policy with the current state of these tools.
However, this industry loves distilling frivolities into web tools and it sells for some unfathomable reason. It is the same with the existing static analyzers etc that some orgs pay for. I do not understand why.
Those tools are better than vanilla agents because they dedicate expensive human time to evaluating and fine-tuning models. You can also build various integration, management and reporting features to add value. If you freeze model progress today, or 12 months ago when most of those companies started, it's a viable business I think.
But any gains you make on the first part will be lost to newer models, and the 2nd part is not as valuable when llms allow people to build fairly complicated features quickly.
I don't know if they're worthless, but all those companies have very limited time to gather customers and at least make themselves valuable for an acquisition.
The real question is how can you compete. There are lots of answers here, but something new and good is rare.
Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering but I just don't see a benefit over their competition. We have GH copilot natively available on all our PRs, it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.
It's not really clear to me what Sentry/Vercel are offering beyond what Copilot does, and in my brief testing of them I didn't see a noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice, and they're ultimately limited on DX by how deeply GH and other source control services allow them to integrate.
What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.
> Independence
Any "agent" running against code review instead of code generation is "independent"?
> Autonomy
Most other code review tools can also be automated and integrated.
> Loops
You can also ping other code review tools for more reviews...
I feel like this article actually works against you by presenting the problem and inadequately solving them.
It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways.

Question: would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?
> Autonomy
Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, and so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that might say people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.
> Loops
You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin in the `/plugins` command that lets Claude Code run a plan, write, commit, get review comments, plan, write loop.
You're assuming models/prompts insist on a previous iteration of their work being right. They don't. Models try to follow instructions, so if you ask them to find issues, they will. 'Trust' is a human problem, not a model/harness problem.
> Our view is that code validation will be completely autonomous in the medium term.
If reviews are going to be autonomous, they'd be part of the coding agent. Nobody would see it as the independent activity you mentioned above.
> Our first step towards making this easier is a native Claude Code plugin.
Claude can review code based on a specific set of instructions/context in an MD file. An additional plugin is unnecessary.
My view is that to operate in this space, you gotta build a coding agent or get acquired by one. The writing was on the wall a year ago.
It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. Eg, adding type checking to lazy python code. These types of checks really are model independent, and using agents to build and manage them might bring a lot of value.
Is there empirical evidence for that? Where is it on an epistemic meter between (1) “it sounds good when I say it”, and (10) “someone ran evaluation and got significant support.”
“Vibes” (2/3 on scale) are ok, just honestly curious.
So, absolutely necessary and essential?
In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.
> Today's agents are better than the median human code reviewer
Which is it? You cannot have it both ways.
"...at catching issues and enforcing standards, and they're only getting better".
I took this to mean that what good code review is, is subjective. But if you clearly define standards and patterns for your code, your linter/automated tools/AI code reviewer will always catch more than humans.
Here's a summary of the top-level ideas behind it. Hope it's helpful!
Core Philosophy
- "Advisor, not gatekeeper" - Every issue includes a "Could be wrong if..." caveat because context matters and AI can't see everything. Developers make the final call.
(Just this idea makes it less annoying and stops devs going down rabbit holes, because it is pretty good at thinking about why it might be wrong)
- Prompt it to be critical but not pedantic - Focus on REAL problems that matter (bugs, security, performance), not style nitpicks that linters handle.
- Get the team to run it on the command line just before each commit. Small, focused reviews, not one review after batching 10 commits. Small diffs get better feedback.
Smart Context Gathering
- Full file contents, not just diffs - The tool reads complete changed files plus 1-level-deep imports to understand how changed code interacts with the codebase.
Prompt Engineering
- Diff-first, context-second - The diff is marked as "REVIEW THIS" while context files are explicitly marked "DO NOT REVIEW - FOR UNDERSTANDING ONLY" to prevent false positives on unchanged code. BUT that extra context makes a huge difference in correctness.
- Structured output format - Emoji-prefixed bullets (Critical, Major, Minor), max 3 issues per section, no fluff or praise.
- Explicit "Do NOT" list - Prevents common AI review mistakes: don't flag formatting (Prettier handles it), don't flag TypeScript errors (IDE shows them), don't repeat issues across files, don't guess line numbers.
Final note
- Also plugged it in to a github action for last pass, but again non blocking.
But like everything else with it, it tries to do too much.
What I want is a review "wizard" agent -- something that identifies the pieces I should look at, and takes me through them diff by diff asking me to read them, while offering its commentary ("this appears to be XX....") and letting me make my own.
So in the end I think there will still be some disappointment, as one would expect it should be fully automated and only about reading the code, like this article suggests. In reality, I think it is harder than writing code.
- it will have a higher chance of convincing the author that the issue was important by throwing more darts, something a human wouldn't do because it takes real mental effort to go through an authentic review,
- it will sometimes find real big issue which reinforces the bias that it's useful
- there will always be tendency towards more feedback (not higher quality) because if it's too silent, is it even doing anything?
So I believe it will just add more round of back and forth of prompting between more people, but not sure if net positive
Plus PRs are a good reality check if your code makes sense, when another person reviews it. A final safeguard before maintainability miss, or a disaster waiting to be deployed.
When developers create a PR, they already feel they are "done", and they have likely already shifted their focus to another task. False positives are horrible at this point, especially when they keep changing with each push of commits.
If by engineering you mean using the engineering design process, then it would not be engineered at all, let alone over-engineered.
https://www.greptile.com/benchmarks
https://www.greptile.com/greptile-vs-coderabbit
https://www.greptile.com/greptile-vs-bugbot
I feel like these are often not well defined? "Its not a bug it's a feature", "premature optimization is the root of all evil", etc
In different contexts, "performant enough" means different things. Similarly, many times I've seen different teams within a company have differing opinions on "correctness"
I would be interested to hear of some specific use-cases for LLMs in code review.
With static analysis, tests, and formatters I thought code review was mostly interpersonal at this point. Mentorship, ensuring a chain of liability in approvals, negotiating comfort levels among peers with the shared responsibility of maintaining the code, that kind of thing.
2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.
More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.
You're totally right that PR reviews go a lot farther than catching issues and enforcing standards. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.
i think the distribution channel is the only defensive moat in low-to-mid-complexity, fast-to-implement features like code-review agents. So in the case of Linear and Cursor Bugbot it makes a lot of sense. I wonder when Github/Gitlab/Atlassian or Xcode will release their own review agent.
> The point of a PR is to share knowledge and to catch structural gaps.
Well, it was to share knowledge and to catch structural gaps.
Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.
PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.
A code review requires reasoning and understanding, things that to my knowledge a generative model cannot do.
Surely the most an AI code review ever could be is something that looks like a code review.
What do the vendors provide?
I looked at a couple which were pretty snazzy at first glance, but now that I know more about how copilot agents work and such, I'm pretty sure in a few hours, I could have the foundation for my team to build on that would take care of a lot of our PR review needs....
I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason
> Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.
Or it won’t understand some invariant that you know but is not explicit anywhere
https://www.augmentcode.com/blog/we-benchmarked-7-ai-code-re...
has anyone tried it?
You do get a handful of false positives, especially if what it reports is technically correct but we're just handling the issue in a sort of weird/undocumented way. But it's only one comment that's easy to dismiss, and it's fairly rare. It's not like huge amounts of AI vomit all over PRs. It's a lot more focused.
I would think this idea of creating a third-party to verify things likely centers more around liability/safety cover for a steroidal increase in velocity (i.e. --dangerously-skip-permissions) rather than anything particularly pragmatic or technical (but still poised to capture a ton of value)
> As the proprietors of an, er, AI code review tool suddenly beset by an avalanche of competition, we're asking ourselves: what makes us different?
> Human engineers should be focused only on two things - coming up with brilliant ideas for what should exist, and expressing their vision and taste to agents that do the cruft of turning it all into clean, performant code.
> If there is ambiguity at any point, the agents Slack the human to clarify.
Was this LLM advertisement generated by an LLM? Feels so at least.
still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.
Ok good, now I know not to bother reading through any of their marketing literature, because while the product at first interested me, now I know it's exactly not what I want for my team.
The actual "bubble" we have right now is a situation where people can produce and publish code they don't understand, and where engineers working on a system no longer are forced to reckon with and learn the intricacies of their system, and even senior engineers don't gain literacy into the very thing they're working on, and so are somewhat powerless to assess quality and deal with crisis when it hits.
The agentic coding tools and review tools I want my team (and myself) to have access to are ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
What we got instead with claude code & friends is a thing way too eager to take over the whole thing. And while it can produce some good results it doesn't produce understandable systems.
To be clear, it's been a long time since writing code was the hard part of the job in many, many domains. The hard part is systems & architecture, and while these tools can help with that, there's nothing more potentially terrifying than a team full of people who have agentically produced a codebase that they cannot holistically understand the nuances of.
So, yeah, I want review tools for that scenario. Since these people have marketed themselves off the table... what is out there?
> The agentic coding tools and review tools I want my team (and myself) to have access to are ones that force an explicit knowledge interview & acquisition process during authoring and involve the engineer more intricately in the whole flow.
We spend a ton of time looking at the code and blocking merges, and the end result is still full of bugs. AI code review only provides a minor improvement. The only reason we do code review at all is humans don't trust that the code works. Know another way to tell if code works? Running it. If our code is so utterly inconceivable that we can't make tests that can accurately assess if the code works, then either our code design is too complicated, or our tests suck.
OTOH, if the reason you're doing code review is to ensure the code "is beautiful" or "is maintainable", again, this is a human concern; the AI doesn't care. In fact, it's becoming apparent that it's easier to replace entire sections of code with new AI generated code than to edit it.
Tests don't catch architectural mistakes or time bombs. If you remove reviews and rely solely on tests, you end up with a "working" big ball of mud that is impossible to maintain. AI won't help if it's the one generating the mud.
No amount of telling the LLM to "Dig up! Make no mistakes!" will help with non-designed slop code actively poisoning the context, but you have to admire the attempt when you see comments added while removing code, referring to the code that's being removed.
It's weird to see tickets now effectively go from "ready for PR" to 0% progress, but at least you're helping that person meet whatever the secret AI* usage quota is for their performance review this year.
This is what acceptance tests are for. Does it do the thing you wanted it to do? Design a test that makes it do the thing, and check the result matches what you expect. If it's not in the test, don't expect it to work anywhere else. Obviously this isn't easy, but that's why we either need a different design or different tests. Before that would have been a tremendous amount of work, but now it's not.
(Making this work requires learning how to make it work right. This is a skill with brand-new techniques which 99.999% of people will need over a year to learn)
> or that a core new piece that's going to be built upon next is barely-coherent, poorly-performing slop that "works" but is going to need to be actually designed while being rewritten by the next person instead
This is the "human" part I mentioned being irrelevant now. AI does not care if the code is slop or maintainable. AI can just rewrite the entire thing in an hour. And if the tests pass, it doesn't matter either. Take the human out of the loop.
(Concerned about it "rewriting tests" to pass them? You need independent agents, quality gates, determinism, feedback loops, etc. New skills and methods designed to keep the AI on the rails, like a psychotic idiot savant that can build a spaceship if you can keep it from setting fire to it)
> or that you skipped trying to understand how the feature should work or thinking about the performance characteristics of the solution before you started and just let the LLM drive, so you never designed anything
This is not how AI driven coding works. You have to give the AI very specific design instructions. If you do it right, it will make what you want. Sadly, this means most programmers today will be irrelevant because they can't design their way out of a wet paper bag.
(You know how agile eschews planning and documentation, telling developers and product people to just build "whatever works right now" and keep rewriting it indefinitely as they meet blockers they never planned for? AI now encourages the planning and documentation.)
I have to be fair and say that yes, occasionally, some bug slips past the humans and is caught by the robot. But these bugs are usually also caught by automated unit/integration tests or by linters. All in all, you have to balance the occasional bug with all the time lost "reviewing the code review" to make sure the robot didn't just hallucinate something.
Not my experience
> A human rubber-stamping code being validated by a super intelligent machine
What? I dunno how they define intelligence, but LLMS are absolutely not super intelligent.
> If agents are approving code, it would be quite absurd and perhaps non-compliant to have the agent that wrote the code also approve the code.
It's all the same frontier models under the hood. Who are you kidding.
since they're likely telling you things you know if you test and write your own code.
oh - writing your own code is a thing of the past - a.i writes, a.i then finds bugs
Can drop the extra words
Code review presupposes a different perspective, which no platform can offer at the moment because they are just as sophisticated as the model they wrap. Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess? If I want more emphasis I can ask Claude myself. Or Qwen. I can't even begin to understand this rationale.
https://news.ycombinator.com/newsguidelines.html
Edit: Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.