Tree sitter grammars are primarily written to support syntax highlighting and often use a best effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it. We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.
GumTree is good at returning a result quickly, but there are quite a few cases where it always returned bad matches for us, no matter how many follow-up papers with improvements we tried to implement. In the end we switched over to a dijkstra based approach that tries to minimize the cost of the mapping, which is more computationally expensive but gives much better results. Difftastic uses a similar approach as well.
As to mismatches: yes, they are bound to happen in some cases. Even for line-based diffing, Git uses rather convoluted heuristics to avoid them (with the "histogram" diff algorithm), but they can't be completely ruled out there either. I hope that with enough safeguards (helper to review merges, downstream consistency checks with local fall-back to line-based diffing) they can be lived with. I'm happy to try other matching algorithms if they are more promising though (there isn't much coupling with the rest of the pipeline).
Concerning tree-sitter, I have noticed some small issues, but nothing that was a show-stopper so far. I actually like it that it's designed for syntax highlighting, because it's really helpful that the representations it gives stay faithful to the original source, to avoid introducing reformatting noise in the merging process. Parsers written for a specific language can sometimes be too zealous (stripping comments out, doing some normalizations behind your back). That's a problem in Spork (which uses Spoon, a pretty advanced Java parser). And the uniform API tree-sitter offers over all those parsers is just too good to give up, in my opinion.
We try to really get the diff quality nailed down before going after merges. We don't have merge functionallity in SemanticDiff yet.
The main issue we have with tree-sitter is that the grammars are often written from scratch and not based on the upstream grammar definition. Sometimes they only cover the most likely cases which can lead to parsing errors or incorrectly parsed code. When you encounter parsing errors it can be difficult to fix them, because the upstream grammar is structured completely different. To give you an example, try to compare the tree-sitter Go grammar for types [1] with the upstream grammar [2]. It is similar but the way the rules are structured is somewhat inverted.
We use separate executables for the parsers (this also helps to secure them using seccomp on Linux), and they all use the same JSON schema for their output. This allows us to write the parser executable in the most appropriate language for the target language. Building all them statically and cross-platform for our VS Code extension isn't easy though ;)
[1]: https://github.com/tree-sitter/tree-sitter-go/blob/master/gr... [2]: https://go.dev/ref/spec#Types
- for diffing, the matching of the leaves is what matters the most, for merging the internal nodes are more important,
- for diffing, it feels more acceptable to restrict the matching to be monotonous on the leaves since it's difficult to visually represent moves if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,
- diffing needs to be faster than merging, so the accuracy/speed tradeoffs can be different.
Packaging parsers into separate executables seems like hard work indeed! I assume you also considered fixing the tree-sitter grammars (vendoring them as needed, if the fixes can't be upstreamed)? Tree-sitter parsers are being used for a lot more than syntax highlighting these days (for instance GitHub's "Symbols" panel) so I would imagine maintainers should be open to making grammars more faithful to the official specs. I'm not particularly looking forward to maintaining dozens of forked grammars but it still feels a lot easier than writing parsers in different languages. I guess you have different distribution constraints also.
The leaves are the ones that end up being highlighted in the diff, but the inner nodes play an important role as well. We try to preserve as much of the code structure as possible when mapping the nodes. A developer is unlikely to change the structure of the code just for fun. A mapping with a larger number of structural changes is therefore more likely to be incorrect.
> - for diffing, it feels more acceptable to restrict the matching to be monotonous on the leaves since it's difficult to visually represent moves if you can detect them. For merging, supporting moves is more interesting as it lets you replay changes on the moved element,
We use a pipeline based approach and visualizing the changes is the last step. For some types of changes we don't have a way to visualize them yet (e.g. moves within the same line) and ignore that part of the mapping. We are still trying to get the mapping right though :)
We upstreamed a few bug fixes for tree-sitter itself. The grammars were a bit more complicated because we were just using them as a starting point. We patched tree-sitter, added our own annotations to the grammars and restructured them to help our matching algorithm achieve better results and improve performance. In the end there was not much to upstream any more.
Using a well tested parsing library, such as Roslyn for C#, and writing some code to integrate it into our existing system aligned more with our goals than tinkering with grammars. Context-sensitive keywords in particular were a constant source of annoyance. The grammar looks correct, but it will fail to parse because of the way the lexer works. You don't want your tool to abort just because someone named their parameter "async".
I imagine this means you're trying to abstract over those parsers somehow? How well is that going, and have you written about your approach?
(I wrote `resholve` to identify and rewrite references to external dependencies in bash/posixy Shell scripts to absolute paths. This is helpful in the Nix ecosystem to confirm the dependencies are known, specified, present, don't shift when run from a service with a different PATH, etc.
It builds on the mostly-bash-compatible OSH parser from the oilshell/oils-for-unix project for the same reasons you're citing.
It would be ~nice to eventually generalize out something that can handle scripts for other shell languages like fish, zsh, nushell, elvish, the ysh part of the oils-for-unix project, etc., but I suspect that'll be a diminishing-return sort of slog and haven't had any lightbulb-moments to make it feel tractable yet.
We also have some ~related needs here around identifying hardcoded or user-controlled exec...)
The language specific logic does not end with the parsers though. The core of SemanticDiff also contains language specific rules that are picked up by the matching and visualization steps. For example, the HTML module might add a rule that the order of attributes within a tag is irrelevant. So it all comes down to writing a generic rule system that makes it easy to add new languages.
Of course, the existence of the preprocessor means there are situations where it's completely impossible to know what the correct parse is; it will necessarily be heuristic in some cases.
https://docs.rs/clang/latest/clang/struct.Parser.html#method...
https://docs.rs/clang/latest/clang/enum.EntityKind.html#vari...
https://docs.rs/clang/latest/clang/enum.EntityKind.html#vari...
I wonder how you transition from tree-sitter to other builtin parsers? Tree-sitter gave a unified interface to all languages. Using language native parsers will require significant work for various FFI if I am not wrong.
For a tool I’m writing, the tree-sitter query language is a core piece of the puzzle as well. Once you only have JSON of the concrete syntax trees, you’re back to implementing a query language yourself. Not that OP needs it, but ast-grep might?
But surely you need to support code that doesn't parse correctly by the actual language's grammar anyway? 'Merge branch fix-syntax-error'
I've wondered if you could add annotation keywords to languages to convert them into something that could be parsed reliably with a tree sitter grammar.
I say this as someone that feels like you really want diffs that say, changed 'struct x name from y to z' instead of here's a huge list of files with ten changes each.
Armed with that, I can tolerate some rough edges. Without it, I'll get stuck in weird ways that your docs can't anticipate.
I really don’t like that. At the language level, order may not matter, but quite frequently in such cases the order does matter, insofar as almost every human would put the two things in a particular order; or where there is a particular convention active. If you automatically merge the two sides in a different order from that, doing it automatically has become harmful.
My clearest example: take Base `struct Foo; struct Bar;`, then between these two items, Left inserts `impl Foo { }`, Right inserts `struct Baz;`. To the computer, the difference doesn’t matter, but merging it as `struct Foo; struct Baz; impl Foo { } struct Bar;` is obviously bad to a human. This is the problem: it’s handling language syntax semantics, but can’t be aware of logical semantics. (Hope you can grasp what I’m trying to convey, not sure of the best words.) Left was not inserting something between Foo and Bar, it was attaching something to the end of Foo. Whereas Right was probably inserting something between Foo and Bar—but maybe even it was inserting something before Bar. You perceive that these are all different things, logically.
Another example where this will quickly go wrong: in CSS rulesets, some will sort the declarations by property name lexicographically, some by property name length (seriously, it’s frequently so pretty), some will group by different types of property… you can’t know.
https://projectlombok.org/features/constructor
The first two kinds of conflicts that Mergiraf handles looks somewhat dangerous to me when handled by a computer.
I think it's the most useful CSS organization method I've found yet.
The reason for that is that most of the time when I'm resolving huge numbers merge conflicts... I don't give a shit about details like field order. I just want to get some code that's functionally correct at p<0.05 so I can figure out what performance characteristics my old feature branch would have if I resurrected it. Or I want to kick off the slow integration tests ASAP. 9/10 times it's that kind of thing.
The 1/10 times where I'm like "OK now I actually wanna merge this code, I have to look over the resolutions, I will upload them to Gerrit/GitHub and do a self-review" I am more than happy to spend 20 minutes correcting order etc. Or I'll happily just switch this tool off and do a totally manual merge.
So yeah I think it just comes down to usecase.
> To let Mergiraf reorder using statements to fix conflicts, we actually need to specify that they can be reordered with any of their siblings (any other child of their parent in the syntax tree).
That’s too coarse-grained. No idea about C♯, but languages could impose a rule that imports must come before anything else—long ago Rust had such a rule, for example. So it might be that within your compilation_unit node, only its using_directive children are commutative, and only among themselves.
Otherwise (lapsing into Rust syntax for convenience), with Base `use A; struct X…`, Left `use A; use B; struct X…` and Right `use A; struct Y; struct X…`, you could end up with the invalid `use A; struct Y; use B; struct X…`.
C#'s using aren't imports, and order indeed doesn't matter.
As soon as you do any merge you're accepting that there might be edits you don't agree with.
Kudos for the nonviolence. :)
I just watched a surgeon slice open a woman and save her life and the baby. Blood was everywhere, it was pretty violent. Nothing wrong with it.
Violence != Malice
Mergiraf looks pretty awesome though.
But we humans that use these animals as metaphors when trying to improve our communication skills can apply a moral judgement and say that this way of communicating is better than that way. But we literal minded often get derailed when the metaphors are not 100% solid so I often think that metaphors are bad, or at least risky. (but I like these specific metaphors in nvc somehow)
Surgery is not a harmless procedure. But the surgeon must decide that the patient is more likely to benefit overall from it.
x = input()
if x == 'x': print('foo')
print('bar')
If you delete the first print in a branch, delete the other print in another branch, then merge the two branches, you'll have this:x = input()
if x == 'x':
Both branches delete a portion of the code inside the if block, leaving it only with a whitespace. In Python it is not a valid code, as they empty scopes need to be declared with pass.
I installed Mergiraf to see if it can solve this situation, but sadly, it doesn't support Python...
if a == b: print("x")
versus
if a == b:
print("x")
Another approach would be for the tool to accept doing structured merging even if there are error nodes in the parsed tree. If those error span the parts of the file where the extended language is used, then the tool could still help with merging the other parts, treating the errors as atomic blocks. I'd be a bit reluctant to do that, because there could be errors for all sorts of other reasons.
The rust bindings themselves are a thin ffi wrapper.
If you wanted to make it a little smoother than needing to compile the tree sitter syntax you could compile/bundle grammars up with wasm so its sandboxed and cross platform
Edit: found this vscode extension that dynamically loads syntaxes compiled to wasm. You should be able to do the same thing in rust: https://github.com/selfint/vscode-tree-sitter
The best I've got to is zdiff3 in VSCode (not using their fancy merge view which I don't understand at all). But it's missing:
1. Blame for the merge base.
2. Detection of the commit that introduced the first conflict.
3. Most annoyingly, no way to show diffs between the "current" and "incoming". IIRC it has buttons to compare both of those to the merge base, but not to each other. That often leaves me visually scanning the text to manually find differences like a neanderthal. Sometimes it's annoying enough that I copy & paste current/incoming into files and then diff those but that's a right pain.
Can this also detect some scenarios where semantic conflicts (but not line conflicts) arise, usually due to moved code?
I don't know the exact circumstances when this happens, but occasionally you can have e.g a function defined twice after two branches both move the same function elsewhere.
I never found the time to make a cli out of this.