Two prompts:
Build a tool where I can drag and drop on two PDF files and
it uses PDF.js to turn each of their pages into canvas
elements and then displays those pages side by side with a
third image that highlights any differences between them, if
any differences exist
rewrite that code to not use React at all
Here's the result: https://tools.simonwillison.net/compare-pdfs
It actually works quite well! Screenshot here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...
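For readers wondering what the "third image that highlights any differences" step amounts to, here is a minimal sketch of the idea in Python. It is not the code the tool generated (that runs in the browser on canvas elements); the function name `highlight_diff`, the `tolerance` parameter, and the red-on-white colour choice are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels, e.g. one rendered page

RED: Pixel = (255, 0, 0)
WHITE: Pixel = (255, 255, 255)


def highlight_diff(a: Image, b: Image, tolerance: int = 0) -> Tuple[Image, int]:
    """Return a diff image (red where pages differ) and the changed-pixel count.

    `tolerance` is a hypothetical knob: channel differences at or below it
    are treated as equal, which helps with anti-aliasing noise.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pages must have the same dimensions")

    diff: Image = []
    changed = 0
    for row_a, row_b in zip(a, b):
        out_row = []
        for pa, pb in zip(row_a, row_b):
            # A pixel differs if any colour channel is off by more than tolerance.
            differs = any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb))
            out_row.append(RED if differs else WHITE)
            changed += differs
        diff.append(out_row)
    return diff, changed
```

A changed-pixel count of zero is the "no differences exist" case from the prompt, in which the third image can simply be skipped.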
I didn't find any actual difference. But maybe it's just me that's hallucinating.
For terminal access I like using my own https://llm.datasette.io/ tool with the https://github.com/simonw/llm-claude-3 plugin
For Python library access I recommend checking out Claudette: https://www.answer.ai/posts/2024-06-21-claudette.html
I modified the HTML a tiny bit before publishing it - I set the font to Helvetica and added the note at the bottom of the page showing the prompt I used.
The whole project took less than 5 minutes - then another 10 to write it up.
I remember evaluating this diff-pdf tool and finding that it fell short in some way, although it's been so long that I don't recall the specifics. Most of the tools I tried either failed to identify changes or reported false positives. I also remember being disappointed, since this one was open source and could easily be scripted.
ImageMagick can do a visual PDF compare:
magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
(density = 100, $1 and $2 are the filenames to compare, $TMP is the output file)
You need to do some work to support multiple pages, so I use this script:
https://gist.github.com/mbafford/7e6f3bef20fc220f68e467589bb...
This also uses `imgcat` to show the difference directly in the terminal.
You can also use ImageMagick to get a perceptual hash difference using something like:
convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info:
Git lets you configure custom diff tools, and I take advantage of this with the following in my .gitconfig:
[diff "pdf"]
    command = ~/bin/git-diff-pdf
And in my .gitattributes I enable the above with:
*.pdf binary diff=pdf
~/bin/git-diff-pdf does a diff of the output of `pdftotext -layout` (from poppler) and also runs pdf-compare-phash.
To use this custom diff with `git show`, you need to add an extra argument (`git show --ext-diff`), but it uses it automatically if running `git diff`.
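The actual ~/bin/git-diff-pdf script is not shown here, so the following is only a rough sketch of what the text half of such a driver might look like, written in Python. It relies on two things I'm confident of: git invokes an external diff command with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode), and `pdftotext -layout FILE -` writes the extracted text to stdout. The function names are made up, the perceptual hash step is omitted, and added or deleted files (where git passes /dev/null) are not handled.

```python
#!/usr/bin/env python3
import difflib
import subprocess
import sys
from typing import List


def pdf_to_text(path: str) -> List[str]:
    """Extract layout-preserving text via poppler's pdftotext ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def text_diff(old: List[str], new: List[str], name: str) -> List[str]:
    """Unified diff of two extracted texts, labelled like a normal git diff."""
    return list(difflib.unified_diff(old, new, "a/" + name, "b/" + name, lineterm=""))


def main(argv: List[str]) -> int:
    # git calls the driver as:
    #   driver path old-file old-hex old-mode new-file new-hex new-mode
    if len(argv) != 8:
        print("expected the seven arguments git passes to a diff driver", file=sys.stderr)
        return 2
    path, old_file, new_file = argv[1], argv[2], argv[5]
    for line in text_diff(pdf_to_text(old_file), pdf_to_text(new_file), path):
        print(line)
    return 0


# Only act when invoked the way git invokes a diff driver.
if __name__ == "__main__" and len(sys.argv) == 8:
    sys.exit(main(sys.argv))
```

Saved as an executable file and referenced from the `command =` line, a script along these lines would be picked up by `git diff` for any path matching the .gitattributes pattern.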
I'm still blown away by how powerful ImageMagick is after using it for a decade or two. What an inspiring piece of open source software.
https://gist.github.com/thbar/d1ce2afef68bf6089aeae8d9ddc05d...
The code contains git-stored reference PDFs, and the test suite regenerates them and asserts that nothing has changed.
Helped a lot to audit visual changes, or PDF library upgrades!
A number of the financial and medical institutions I deal with re-generate PDFs every time you request them, but the content is 99-100% identical. Sometimes just a date changes. So I use a perceptual hash and content comparison to automate detecting truly new documents vs. ones that are only slightly changed.
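To make the "perceptual hash" part concrete, here is a small sketch of one of the simplest variants, an average hash, in plain Python. This is an assumption on my part about the general technique, not the commenter's actual code and not ImageMagick's phash metric. It takes a page already rendered to grayscale; the 8x8 grid and the `max_bits` threshold are illustrative values you would need to tune.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 luminance values for one rendered page


def average_hash(img: Gray, size: int = 8) -> int:
    """Shrink to a size x size grid by block averaging, then set one bit per
    cell depending on whether the cell is brighter than the overall mean."""
    h, w = len(img), len(img[0])
    if h < size or w < size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for gy in range(size):
        y0, y1 = gy * h // size, (gy + 1) * h // size
        for gx in range(size):
            x0, x1 = gx * w // size, (gx + 1) * w // size
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for cell in cells:
        bits = (bits << 1) | (cell > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_bits: int = 5) -> bool:
    # max_bits is a hypothetical threshold: small distances mean "same
    # document, slightly changed"; large ones mean "truly new document".
    return hamming(average_hash(a), average_hash(b)) <= max_bits
```

A changed date only nudges a cell or two, so the hash distance stays small, whereas a genuinely different layout flips many bits. Pairing this with a text comparison, as described above, covers the cases where the layout is identical but the content is not.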
I output the original page, the original rectangle, the original page with a colored rectangle, the new page and the new rectangle, and the diff both cropped and uncropped. Only after that do I start using my caveman eyeballs.
I also pixelate it a bit and apply a brightness cutoff to the diff to see if the difference actually matters. Optionally, I also try re-cropping, shifting by a limited number of pixels, to check whether it is an ignorable difference where everything just moved to the left a bit.
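The pixelate, cutoff, and shift steps could be sketched roughly like this in Python. This is my reading of the description, not the commenter's code: the block size, the cutoff of 32, the shift range, and all function names are assumptions, and the images are assumed to be same-sized grayscale crops that are larger than the maximum shift.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 0-255 luminance values


def abs_diff(a: Gray, b: Gray, dx: int = 0, dy: int = 0) -> Gray:
    """Per-pixel |a - b| with b shifted by (dx, dy); only the overlap is compared."""
    h, w = len(a), len(a[0])
    return [
        [abs(a[y][x] - b[y - dy][x - dx]) for x in range(max(0, dx), min(w, w + dx))]
        for y in range(max(0, dy), min(h, h + dy))
    ]


def pixelate(img: Gray, block: int = 4) -> List[List[float]]:
    """Average the image over block x block tiles (edge tiles may be smaller)."""
    if not img:
        return []
    out = []
    for y0 in range(0, len(img), block):
        row = []
        for x0 in range(0, len(img[0]), block):
            cells = [v for r in img[y0:y0 + block] for v in r[x0:x0 + block]]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def significant_blocks(a: Gray, b: Gray, block: int = 4, cutoff: float = 32.0,
                       dx: int = 0, dy: int = 0) -> int:
    """Count pixelated diff tiles whose average brightness exceeds the cutoff."""
    coarse = pixelate(abs_diff(a, b, dx, dy), block)
    return sum(v > cutoff for row in coarse for v in row)


def best_shift(a: Gray, b: Gray, max_shift: int = 2, block: int = 4,
               cutoff: float = 32.0) -> Tuple[int, int, int]:
    """Try small shifts of b and return (count, dx, dy) for the quietest one,
    preferring the smallest shift when several are equally quiet."""
    candidates = [
        (significant_blocks(a, b, block, cutoff, dx, dy), abs(dx) + abs(dy), dx, dy)
        for dx in range(-max_shift, max_shift + 1)
        for dy in range(-max_shift, max_shift + 1)
    ]
    count, _, dx, dy = min(candidates)
    return count, dx, dy
```

If `best_shift` reports zero significant blocks at a non-zero shift, the change is probably just content that moved by a pixel or two, which is the "ignorable difference" case.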
I also recommend exporting the new PDF from the CI/CD tool so it can be put back into the test as the reference. Even between Linux distros and versions, small changes in fonts and the like make a difference.
Whole article is worth reading, but if you want the relevant bits, search for “I wrote a Dart script that would take a PDF of the book”.
That said, next project we want to try something more integrated with EDA tools. If anyone else has followed this path, we'd love to know.
I'm genuinely curious. I've heard a lot about BC being 'the tool' for diffing. I'm used to Meld, but my current employer has a pretty strict policy on which tools can be used, so at some point I managed to get a licence for an older version of BC. For some reason, though, I found its UI and the way it works a bit less optimal than what I was accustomed to. Since I mostly do text diffs these days, I usually use the diff tool from IntelliJ IDEA (I have IDEA open all the time).
For me, I eliminated BC immediately because I was often diffing prose and it didn't have word wrap; that ability is apparently available now in the beta version of BC5, but it wasn't when I was testing it. I suspect it will continue to be non-optimized for prose in how it handles long lines.
It shows the differences in the GUI side-by-side instead of overlaid.
Another option is to compare the two files visually in a simple GUI, using the --view argument:
$ diff-pdf --view a.pdf b.pdf
This opens a window that lets you view the files' pages and zoom in on details. It is also possible to shift the two pages relative to each other using Ctrl-arrows (Cmd-arrows on macOS). This is useful for identifying translation-only differences.
https://github.com/github-linguist/linguist/blob/master/docs...
exiftool -all= -o ${filename}.stripped.pdf ${filename}.pdf
That won't help you with small differences in the contents, but might help with small differences in metadata. Running `md5sum` on the stripped PDF should give more reliable dedupe results.
I was recently working on a similar problem for JPG, RAW, and MP4 files (photo/video backup) so it is fresh in my mind.
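Once the metadata has been stripped, the dedupe step is just grouping files by digest. A minimal Python sketch, assuming the stripped files already exist on disk (the function names are made up for illustration, and MD5 is used only because it matches the `md5sum` suggestion above):

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files with identical content; only groups of two or more are returned."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        by_digest[md5_of(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

For example, `duplicate_groups(Path(".").glob("*.stripped.pdf"))` would list the sets of stripped files that are byte-for-byte identical.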
I used vbindiff instead.
The diff-pdf project was my inspiration but I wanted to create a version that was distributable to non-programmers.
I rely heavily on PDF comparison via PDF-XChange Editor, which is accurate for text, but often has trouble highlighting visual changes correctly.
Good to see post-cyberresilience alternatives :)
PDF diffs are really great for versioning/comparing PCB designs. (The only real use case I had 15 years back.)
I genuinely need a side-by-side PDF comparison tool, and the diff-pdf tool linked from the main link doesn't do that. Any thoughts?
- Gemini at first only diff'd the text, and then when pushed it identified the items in the images and then hallucinated the differences between the versions. It could not produce an image output.
- Claude only diff'd the text and refused to believe that there were images in the PDFs.
- ChatGPT attempted to write and execute Python code for this, which errored out.
I agree it's not the best initial example to demonstrate the tool, but it does show how it can be used to detect even minor spacing changes.