This is probably the most exciting thing I've read about compilers since Alexia Massalin's dissertation. gabrielsroka found this previous thread https://news.ycombinator.com/item?id=35294111 with discussion with the author and a lot of related reading.

The title is an amusing crash blossom; it's about a generator of fast 6502 code, rather than a fast generator of 6502 code or one of the other four possibilities.

He says it generates better code than anything he's seen, and on reading the writeup, it's easy to see why:

> The ordered basic blocks are then processed from beginning to end, generating multiple assembly code combinations per operation. The list of combinations will grow at an exponential rate, but most can be pruned using ideas from dynamic programming and branch-and-bound. (...) simulated annealing (...) solving a partitioned boolean quadratic problem (PBQP).

It's tempting to assume that all this would have been infeasible for financial reasons when the 6502 was in common use, and maybe it is a little memory-hungry, but I think the bigger issue is that this is the kind of work you'd normally expect to see in a doctoral dissertation in operations research, applied mathematics, or GOFAI; there just weren't that many people to whom such an approach would occur.

Now that it's out there, though, it seems like you could apply the same techniques to code generation for more mainstream hardware. (Some of the techniques, like the tree-search register allocation, would probably require adjustment.) Maybe it wouldn't provide the same benefits because mainstream compilers already generate such good code, I don't know. Has anyone tried?

Right, back in the day the main challenge was just to do what you wanted to do within memory and performance constraints, especially when trying to run a compiler on the target system as opposed to cross-compiling. This kind of super-sophisticated code generation, which makes the compiler itself larger and more memory-hungry, just wasn't an option.

This is a super hard-core hobby project! The 6502 is a great CPU, at least if you are hand-writing assembler for it, but a poor target for high level languages and difficult to generate high quality code for due to the extremely limited number of registers, quirky addressing modes, and lack of regularity. It's great to see the 6502 getting some love, even if it's 40 years too late to be very useful!

  • lmm · 3 weeks ago
I've done this kind of thing for a toy VLIW processor. In some sense it's the most obvious, naive way to write an optimising compiler: just generate all the possible combinations and then pick the fastest. I kind of assumed it wouldn't scale up to full size programs of the kind people write these days, but who knows.
Well, it's not just doing exhaustive search. I don't think exhaustive search would get you far.
  • lmm · 3 weeks ago
Further than you might think, but yeah then you use heuristics and various ways to prune the tree. But conceptually you're searching through the space, rather than optimising a single piece of code - I found it a useful way to flip my perception anyway.
  • twic · 3 weeks ago
Wasn't most performance-sensitive 6502 code back in the day handwritten in assembly? How does the performance of this compiler compare to that?

Not to detract from this work - it's really cool and interesting! But I wonder if it would have been relevant at the time.

  • dzdt · 3 weeks ago
Essentially all commercial programs released for the 6502 were written in assembly language. Modern compiled languages like this hope to get close to the speed of hand-optimized assembly language while making the code much quicker and easier to write and modify. At the time systems like NES and C64 were popular, this would absolutely have been relevant. The time schedule for delivery of games was often extremely tight, and anything that would have let developers accomplish more in less time (with fairly minor compromises on execution speed) would have seen strong usage.
While this is very impressive there is an important qualifier missing from the article.

They are NOT comparing the output from the same programming languages and so probably not the same source code either.

- Their compiler uses a custom language (NesFAB) and does NOT support C

- The reference compilers are all using C

"NESFab is a new programming language for creating NES games. Designed with 8-bit limitations in mind, the language is more ergonomic to use than C, while also producing faster assembly code."

And http://pubby.games/nesfab/doc.html#_what_is_nesfab

Presumably they made a best effort to implement their algorithm as closely as possible in the two languages, but that's a lot different than comparing the output of compilers using identical input source code.

So instead of saying

- "In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to."

It would be more accurate to say

- "In a bizarre twist, my compiler [and language] generate faster code than GCC, LLVM, and every other compiler I compared it to [using C]."

Eh. The article uses

    extern char fn(void);

    char kekw(void) {
        const char foo = fn() ^ 5;
        return foo - (foo & 3);
    }
as a running example, and while godbolt only has cc6502, that compiler indeed generates worse code, with helper subroutine calls and some strange stack juggling.
It is clearly not what their charts are based on though.

Their example output

  jsr fn
  eor #5
  ldx #3
  sax S
  sec
  sbc S
  rts
is better, but not drastically so, than the output of another 6502 compiler on godbolt:

  _kekw_sloc0_1_0:
          .ds 1
  _kekw:
          jsr     _fn
          eor     #0x05
          tax
          and     #0x03
          sta     *_kekw_sloc0_1_0
          txa
          sec
          sbc     *_kekw_sloc0_1_0
          rts
However, the output of llvm-mos on godbolt is garbage.
I tend to agree with your sentiment since there are a number of other 6502 compilers available to compare with - I'm most familiar with KickC for example.

A decent recent summary here, showing a lot of different approaches over the years (it doesn't include NesFAB, though):

https://8bitworkshop.com/blog/compilers/higher-level-6502-pr...

A possibly interesting comparison of KickC vs CC65 vs VBCC vs 6502-gcc:

https://sgadrat.itch.io/super-tilt-bro/devlog/219534/benchma...

As someone who makes games for two different 6502 systems, this is a really cool read. I'm curious if I could use this in any of the toolchains I use for those platforms.
I am writing a 6502 emulator in Lua.

My goal is to build an NES emulator. I will try using this code generator to check compatibility.

Source code: https://github.com/willtobyte/NES/blob/main/MOS6502.lua

  • 3 weeks ago
Is it meaningful to measure the speed of code generation without measuring the speed of the generated code when executed?
Eh? The whole article is about generating code that runs faster, not about doing the code generation quickly.
And even so, fast code generation with a slower-performing compiled program would be a desirable property for me when programming. I'm making changes and checking if they work. If compile time is sped up 10x and run time slowed down to 0.5x, that would be a win for my development cycle. Then I'd finally compile normally with optimisations, at the regular slow compile speed, when I'm happy with my changes.
  • m000 · 3 weeks ago
Dude, TFA is talking about the code generator of a cross-compiler targeting a 40-year-old, 8-bit, 2 MHz CPU that can address a whopping max of 64 KB of memory.

There's really not a whole lot of code to generate, and your compiler host machine is a beast compared to the target machine. Focusing on making code generation faster makes almost zero sense in this case.

I know, but I’m talking about the general case now. In response to what parent commenter was saying.