Having a fuzz test was in-va-lu-ab-le. I caught several bugs right there. The delete function was particularly tricky.
I ended up with two fuzz tests because my hash table has a key feature: convergence. Given the same contents, it has exactly the same bits in its buffer; in other words, the layout is independent of insertion/deletion order. To check that invariant I added the second fuzz test, and I would add a third if I realized there was an important invariant I had not covered. It is not much work, but it is so useful!
https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASH....
https://github.com/gritzko/librdx/blob/master/abc/fuzz/HASHd...
P.S. In your case, I recommend clang's libfuzzer with ASAN.
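For reference, a minimal libFuzzer harness for a hash table boils down to something like the sketch below. The ht_* API is a made-up stand-in for whatever table you are testing; the libFuzzer entry point and the sanitizer flags are real.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical hash-table API under test; swap in your own. */
    typedef struct ht ht;
    ht  *ht_new(void);
    void ht_put(ht *t, const void *k, size_t ksz, const void *v, size_t vsz);
    void ht_del(ht *t, const void *k, size_t ksz);
    void ht_free(ht *t);

    /* libFuzzer entry point: treat the input as a stream of (op, key)
       pairs so the fuzzer explores insert/delete interleavings. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        ht *t = ht_new();
        for (size_t i = 0; i + 2 <= size; i += 2) {
            uint8_t op = data[i], key = data[i + 1];
            if (op & 1)
                ht_put(t, &key, 1, &key, 1);  /* insert or overwrite */
            else
                ht_del(t, &key, 1);           /* delete; may be a no-op */
        }
        ht_free(t);
        return 0;
    }
    /* Build: clang -g -O1 -fsanitize=fuzzer,address harness.c table.c */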
https://github.com/FRRouting/frr/blob/master/tests/lib/test_...
(this file is #include'd from another, with some #define set up:)
https://github.com/FRRouting/frr/blob/master/tests/lib/test_...
The "tribe" of C developers with distaste for the preprocessor will absolutely hate how this is written. But it works and caught real issues. It doesn't use an external fuzzing tool but rather a deterministic PRNG (see prng_* calls) to get a good mix of call combinations. Coverage analysis was done to make sure it reaches everywhere.
(disclaimer: I wrote this. And it did catch bugs. Also it really should have a lot of comments.)
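For anyone who hasn't seen the pattern: the idea reduces to a fixed-seed PRNG driving a random mix of operations, cross-checked against a dumb reference model. A generic sketch (not FRR's actual prng_* code; the ht_* API is hypothetical):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical hash-table API under test. */
    typedef struct ht ht;
    ht  *ht_new(void);
    void ht_put(ht *t, const void *k, size_t ksz, const void *v, size_t vsz);
    void ht_del(ht *t, const void *k, size_t ksz);
    void *ht_get(ht *t, const void *k, size_t ksz);
    void ht_free(ht *t);

    static uint32_t rng_state = 12345;   /* fixed seed: reproducible runs */
    static uint32_t rng(void) {          /* xorshift32, good enough here */
        uint32_t x = rng_state;
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        return rng_state = x;
    }

    int main(void) {
        uint8_t model[256] = {0};        /* reference: model[k] == present? */
        ht *t = ht_new();
        for (long i = 0; i < 1000000; i++) {
            uint8_t k = rng() & 0xff;
            switch (rng() % 3) {
            case 0: ht_put(t, &k, 1, &k, 1); model[k] = 1; break;
            case 1: ht_del(t, &k, 1);        model[k] = 0; break;
            case 2: assert((ht_get(t, &k, 1) != NULL) == model[k]); break;
            }
        }
        ht_free(t);
        return 0;
    }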
https://github.com/c-blake/bst/blob/main/test/bst-interact.c
Essentially, you just write a very simple (some might say trivial) interactive command shell that can include auto-checking of (in that case, tree) invariants after each edit, keyed off an environment variable. The only missing piece is some little tool to generate random inserts/deletes/etc. If the micro-shell has a pretty-printer, it is easy to "replay" any failed series of edits with some "p" commands before and after the invariant failure.
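A compressed sketch of that micro-shell idea (the tree_* API is hypothetical; the point is the loop, not the tree):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical tree API under test. */
    typedef struct tree tree;
    tree *tree_new(void);
    void  tree_insert(tree *t, long key);
    void  tree_delete(tree *t, long key);
    void  tree_print(const tree *t);
    int   tree_check(const tree *t);   /* returns 0 on invariant violation */
    void  tree_free(tree *t);

    int main(void) {
        tree *t = tree_new();
        char op; long key;
        while (scanf(" %c", &op) == 1) {           /* e.g. "i 5", "d 5", "p" */
            switch (op) {
            case 'i': if (scanf("%ld", &key) == 1) tree_insert(t, key); break;
            case 'd': if (scanf("%ld", &key) == 1) tree_delete(t, key); break;
            case 'p': tree_print(t); break;
            case 'q': tree_free(t); return 0;
            }
            if (getenv("TREE_CHECK") && !tree_check(t)) {
                fprintf(stderr, "invariant violated\n");
                abort();                           /* replay the input to debug */
            }
        }
        tree_free(t);
        return 0;
    }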
(Almost: we have a full-featured command shell and just use that for testing)
Shell bindings: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
Input: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
Expected output: https://github.com/FRRouting/frr/blob/master/tests/ospf6d/te...
Absolutely agree this is a great option to have for some kinds of tests.
I have done this for data structures even as fast as integer-keyed hash tables (though in that hyper-fast case you may need to measure and subtract off the parser-loop/IO-dispatch overhead, and/or use statistical testing to decide whether the results are actually different at all, perhaps along the lines of https://github.com/c-blake/bu/blob/main/doc/tim.md).
https://craftinginterpreters.com/hash-tables.html#deleting-e...
First, a correctness question:
- if a struct contains padding, the value of these padding bytes is undefined. What happens if you use such a struct as a key?
And then some integration questions:
- how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?
- how do you make this typesafe for its keys and values, i.e. remove accident-prone casts on read access?
- how do you not call the compare and hash functions through function pointers? (Calling through function pointers is both performance relevant as well as a target for exploitation by overwriting the pointer)
(None of these have "nice" answers. But thinking about them is IMHO extremely valuable since these are the most challenging in organizing and engineering an actual codebase itself.)
It's undefined for uninitialized objects. If I understand correctly, when an initializer is provided the padding bits are set to 0. Additionally, setting the whole chunk of memory to 0 w/ memset would work.
https://stackoverflow.com/a/37642061/24042444
> - how do you make this typesafe against mixing up 2 distinct uses of hashmaps in the application?
Lacking templates, the user of the library could create slim wrapper functions for each type (which coerce to void* to call the underlying hashmap lib). You could still accidentally call the wrong function, but it would help a bit. I suppose you could wrap the real hashmap in a new type to help further, so each set of wrapper functions for a given type would be forced to use the same wrapper struct.
Alternatively, the library could provide helper macros to accomplish the same.
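A sketch of the wrapper approach, with made-up hm_* calls standing in for the untyped library underneath:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical untyped library underneath. */
    typedef struct hashmap hashmap;
    void *hm_get(hashmap *hm, const void *key, size_t k_sz);
    void  hm_put(hashmap *hm, const void *key, size_t k_sz, void *val);

    struct user;   /* some application type */

    /* Distinct wrapper structs: the compiler now rejects passing a
       session_by_token where a user_by_id is expected. */
    typedef struct { hashmap *hm; } user_by_id;
    typedef struct { hashmap *hm; } session_by_token;

    static inline void user_by_id_put(user_by_id *m, uint64_t id,
                                      struct user *u) {
        hm_put(m->hm, &id, sizeof id, u);
    }
    static inline struct user *user_by_id_get(user_by_id *m, uint64_t id) {
        /* the void*-to-user* conversion now lives in exactly one place */
        return hm_get(m->hm, &id, sizeof id);
    }

The macro variant would be a DEFINE_HASHMAP(name, key_type, val_type) that stamps out the wrapper struct and the accessor functions per instantiation.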
https://interrupt.memfault.com/blog/c-struct-padding-initial... looks at this (albeit a bit outdated; clang 13 and GCC 11) and claims to see undefined values even when initializers are given, at higher optimization levels
> Additionally, setting the whole chunk of memory to 0 w/ memset would work.
This seems to be the only thing that can be relied on to work (… sadly)
An alternate answer is: don't treat structs as chunks of bytes when the chunk crosses padding (i.e. create functions to hash specific structs)
Another alternate answer: remove the padding and add "_pad12[34]" members instead. But then you need to figure out where you actually have padding, which is architecture dependent…
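Sketches of both approaches discussed above, assuming a hypothetical hash_bytes() primitive:

    #include <stdint.h>
    #include <string.h>

    uint64_t hash_bytes(const void *p, size_t n, uint64_t seed); /* assumed */

    struct key { uint8_t tag; /* 7 padding bytes on typical ABIs */ uint64_t id; };

    /* Option 1: zero the whole object first; memset is the one thing
       that reliably gives the padding bytes a known value. */
    struct key make_key(uint8_t tag, uint64_t id) {
        struct key k;
        memset(&k, 0, sizeof k);
        k.tag = tag;
        k.id  = id;
        return k;
    }

    /* Option 2: hash named fields only, never the struct's raw bytes,
       so padding never enters the picture. */
    uint64_t key_hash(const struct key *k) {
        uint64_t h = hash_bytes(&k->tag, sizeof k->tag, 0);
        return hash_bytes(&k->id, sizeof k->id, h);
    }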
To speed it up by reducing the number of malloc() calls, it may be worth adding a simple arena allocation scheme: allocate one larger block (e.g. 1 MB) initially, then double the size each time you run out. All malloc()/calloc() calls can then become local salmagundi_alloc() calls: macro invocations that return a pointer into the arena buffer and increment that pointer as a side effect.
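A rough sketch of that doubling-arena idea (names made up, written as a function rather than the macro described above, error handling mostly elided):

    #include <stdlib.h>

    typedef struct arena {
        char   *base;        /* payload of the current block */
        size_t  used, cap;   /* bytes handed out / block size */
        struct arena *prev;  /* older blocks stay alive, pointers stay valid */
    } arena;

    void *arena_alloc(arena **a, size_t n) {
        n = (n + 15) & ~(size_t)15;              /* keep results aligned */
        if (!*a || (*a)->used + n > (*a)->cap) {
            size_t cap = *a ? (*a)->cap * 2 : (size_t)1 << 20;  /* 1 MB start */
            while (cap < n) cap *= 2;
            arena *blk = malloc(sizeof *blk + cap);  /* header + payload */
            if (!blk) return NULL;
            blk->base = (char *)(blk + 1);
            blk->used = 0;
            blk->cap  = cap;
            blk->prev = *a;
            *a = blk;
        }
        void *p = (*a)->base + (*a)->used;
        (*a)->used += n;
        return p;
    }

Freeing is then a single walk over the prev chain when the whole map is destroyed.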
I also recommend you have a look at Chris Hanson's book "C: Interfaces and Implementations", which has a few neat C API tricks that your code could benefit from (e.g. for reducing name space pollution, for avoiding null pointer argument errors, API method naming etc.).
Furthermore, a typical arena allocator only grows and never shrinks: a deleted item still holds on to its chunk of the block, and that memory is never released. You will need a more sophisticated allocator to support deletion; for fixed-size items, a memory pool may be an option.
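For the fixed-size case, the classic free-list pool is only a few lines; a sketch:

    #include <stdlib.h>

    /* Each free slot doubles as a link in the free list ("intrusive"). */
    typedef union slot { union slot *next; char payload[32]; } slot;
    typedef struct { slot *free_list; } pool;

    void *pool_alloc(pool *p) {
        if (p->free_list) {              /* reuse a previously freed slot */
            slot *s = p->free_list;
            p->free_list = s->next;
            return s;
        }
        return malloc(sizeof(slot));     /* or carve one from an arena block */
    }

    void pool_free(pool *p, void *ptr) { /* O(1): push onto the free list */
        slot *s = ptr;
        s->next = p->free_list;
        p->free_list = s;
    }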
Additionally, in case of overwrite, the memory previously allocated for the key is leaked.
All in all, I don't think line 100 is really needed, and removing it wouldn't do any harm.
Finally, deletes are hard in hash maps. Either you decide that deletes are not allowed, or you implement them properly with tombstones or re-insertions. A half-baked solution is not an option.
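For the record, the tombstone variant is small; a sketch (the slot layout is hypothetical, and it assumes the table is never completely full):

    #include <stddef.h>
    #include <string.h>

    enum state { EMPTY, OCCUPIED, TOMBSTONE };
    struct slot { enum state st; const char *key; void *val; };

    /* Lookup skips tombstones but stops at truly empty slots. */
    struct slot *find(struct slot *slots, size_t cap, size_t h,
                      const char *key) {
        for (size_t i = h % cap; ; i = (i + 1) % cap) {
            if (slots[i].st == EMPTY)
                return NULL;                      /* a real gap: not present */
            if (slots[i].st == OCCUPIED && strcmp(slots[i].key, key) == 0)
                return &slots[i];
            /* TOMBSTONE or key mismatch: keep probing past it */
        }
    }

    /* Deletion marks the slot instead of emptying it, so probe chains
       that pass through it stay intact. Inserts may reuse tombstones. */
    void del(struct slot *slots, size_t cap, size_t h, const char *key) {
        struct slot *s = find(slots, cap, h, key);
        if (s)
            s->st = TOMBSTONE;
    }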
Hope these suggestions help you.
Deletes are hard in open-addressing hash maps like this one; they're trivial in chained hash maps.
(I'm only posting this because this seems to be an exercise/learning-ish HN post and I think it's a relevant point to not forget.)
Thinking about it, I don't grasp why the author opted for linear probing here. Given the number of malloc() calls, they might as well have used linked lists (chaining).
And for an exercise, it might be worth implementing linear probing just to get more learning experience out of it(?)
You can corrupt memory in such a way that allows executing arbitrary code, in which you could do anything that the process has privilege to do, which is probably reading and writing files as the current user and a lot more.
Is the algorithm for hash collisions just to find the next empty slot (linear probing)? What happens when the original entry is removed: are the ones in subsequent slots cycled backwards? I don't see that...
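The "cycled backwards" option described above is usually called backward-shift deletion. A sketch over a generic slot layout (hash() stands in for the table's hash function, and the table must never be completely full):

    #include <stddef.h>

    enum state { EMPTY, OCCUPIED };
    struct slot { enum state st; const char *key; void *val; };
    size_t hash(const char *key);   /* assumed: the table's hash function */

    void delete_backshift(struct slot *slots, size_t cap, size_t i) {
        slots[i].st = EMPTY;                         /* open the gap */
        for (size_t j = (i + 1) % cap; slots[j].st == OCCUPIED;
             j = (j + 1) % cap) {
            size_t home = hash(slots[j].key) % cap;  /* slot it hashed to */
            /* Move the entry back iff the gap lies cyclically in
               [home, j), i.e. a lookup probing from home would hit the
               empty gap before ever reaching j. */
            if ((j > i && (home <= i || home > j)) ||
                (j < i && home <= i && home > j)) {
                slots[i] = slots[j];
                slots[j].st = EMPTY;
                i = j;                               /* the gap moved forward */
            }
        }
    }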
Also the name is new to me! TIL 'salmagundi' is an English word (not even a loan-word) for a type of salad made of many ingredients: an excellent name for a hash map that can contain many effectively random items.
Many of my toy and hobby projects are exactly that. Some make allowances for the sake of performance and generality.
The hash map, though, is up there with sorting algorithms in the breadth of wild implementations. Salmagundi is unlikely to be the fastest, or the smartest. But it’s cute and portable.
I edited my comment with some more questions, btw, likely as you were answering - apologies if that ended up confusing.
I need a bit more clarification on your first question. There should be no mutation of the underlying map when it grows — we’re just allocating more room for entries and adjusting the old map’s item “ownership” — moving pointers->items around.
https://www.nytimes.com/1978/01/16/archives/the-salmagundi-d...
#ifndef BD9DF82A4540BB19368E48E4747C0706
#define BD9DF82A4540BB19368E48E4747C0706
Does some IDE do that for you? Did you just pick a random sequence? Why not just a variant of #ifndef _salmagundi_h, which is more common? Other than that, I strongly echo others' advice about avoiding so many mallocs.
Eschewing pragma once these days is pure ignorance and superstition --- then again, so is writing new code in C at all.
And I think C is one of the better languages you could use to write code today. Also, among the programs I use daily, the more stable and reliable ones tend to be written in C.
i#ifndef __foo_h <escape>yyp:s/ifndef/define/<enter>o<enter>#endif<escape>
… which I'd fix by using it more :D
Not having the include guard use the same mechanism makes these a bit easier to read and removes a little bit of cognitive load. (And every bit of cognitive load counts since that's a very hard limit of people's brains.)
Personally (and for the projects I work on), I couldn't care less whether it exists in ISO C or not. I have a list of compilers and platforms I want/need to support. But of course your project may be different.
Then indent your macros for readability. clang-format can do it for you if you don't want to maintain them.
¹ especially across an open source project with more than a hundred contributors from a multitude of companies.
In the hm_put function, when you overwrite, why do you malloc and copy the key again, and what happens to the old key pointer and the old value pointer? (There is no free for the key, and the memset zeros everything?)
Looks like the key logic is missing an overwrite case; the value is reallocated on overwrite, but the key is not.
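For contrast, a leak-free overwrite path would look something like this (hypothetical entry layout, not the author's actual code):

    #include <stdlib.h>
    #include <string.h>

    struct entry { void *key; size_t k_sz; void *val; size_t v_sz; };

    /* Overwrite in place: keep the already-allocated key, replace only
       the value, and free the old value so neither pointer leaks. */
    static int entry_overwrite(struct entry *e, const void *val, size_t v_sz) {
        void *nv = malloc(v_sz);
        if (!nv)
            return -1;                /* report failure, don't half-update */
        memcpy(nv, val, v_sz);
        free(e->val);                 /* old value released, key untouched */
        e->val  = nv;
        e->v_sz = v_sz;
        return 0;
    }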
Probably ignorance on my part, but what is

    (void)k_sz;

doing? I've seen fussy people doing

    (void) printf( "hello" );

but not what you have written. As I say, probably ignorance on my part.

It's only supported by gcc 11+ and clang 11+ (both from ~2021) because it was a suggested C2x extension. It's been accepted in C23, so other compilers will eventually support it, but I wouldn't hold my breath regarding when that happens.
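For context, the cast-to-void idiom itself is ancient C and works everywhere: it explicitly discards a value, which silences "unused parameter" warnings. The compiler-version caveat above presumably refers to the C23 [[maybe_unused]] attribute. A sketch of both spellings:

    #include <stddef.h>

    /* Both parameters are required by the callback signature; the casts
       explicitly discard them, silencing -Wunused-parameter. */
    static size_t const_hash(const void *key, size_t k_sz) {
        (void)key;
        (void)k_sz;
        return 0;   /* degenerate hash, purely for illustration */
    }

    /* C23 spelling of the same intent:
       static size_t const_hash(const void *key,
                                [[maybe_unused]] size_t k_sz) { ... }  */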
    idx = (idx + 1) % map->cap;

becomes

    iterCount++;
    idx = (idx + iterCount) % map->cap;
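Bumping the step on each probe gives triangular-number offsets from the start (1, 3, 6, 10, ...), which breaks up the clusters that plague plain linear probing; the classic caveat is that the visit-every-slot guarantee holds when the capacity is a power of two. A self-contained sketch of the resulting loop:

    #include <stddef.h>

    enum state { EMPTY, OCCUPIED };

    /* Find the first free slot starting from the key's home index.
       Assumes at least one EMPTY slot exists. */
    size_t probe(const enum state *slots, size_t cap, size_t idx) {
        size_t iterCount = 0;
        while (slots[idx] == OCCUPIED) {
            iterCount++;
            idx = (idx + iterCount) % cap;   /* step grows by 1 each probe */
        }
        return idx;
    }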
Never change, HN.
I consider `char*` to be the type of `foo`. What's the rationale of having `char *foo`?
char* cat, dog; /* one char pointer, one char */
char *cat, *dog; /* two char pointers */
typedef foo * FooP;
FooP a, b; // Both are pointers
Pointer typedefs are misguided, but there is no denying that C is inconsistent on whether the '*' is part of the type or not. A consistent spelling would be something like:

    char[8] buffer;
Alas, it's not the syntax Dennis Ritchie settled upon.

Also, case in point: how do you format "char * const foo"?
Similarly, what makes C type declarators tricky for many is their "inverse nature": "char *foo" means that applying the operator "*" to the identifier "foo" yields the base type "char". So the "char* foo" form creates a cue in the opposite direction of operator precedence, just like "x+y * z" would. For just one indirection it's no big deal, but for things like "char *foo[]" and beyond it becomes more important. So the "mis"-spacing "char* foo" sets people up for later conceptual failure.
This is in addition to the above point by @onre about multiple identifiers and @anacrolix's initial rather, ahem, pithy take. There is also another "where would ()s be redundant/unneeded" rationale (though this largely recapitulates the 2nd paragraph here). If you are looking for a catchphrase then maybe it could be "Don't space-against-syntax".
The only difference with the spacing against syntax in "x+y * z" is that there are fewer type-relevant operators (e.g. '+' does not work in a type expression). One does NOT hear people advocating for more space around '*' than '+' (though I have sometimes heard "equal space"). Your non-disagreement has not persuaded me that this is a bad way to explain the problem, although I agree I could cut down on the double negatives. :-) { Even grade school kids have some sense of precedence of '+' & '*' which is so broadly adopted that even hand calculators do it. }
"How misleading" spacing against syntax can be is admittedly more subjective. I think whitespace can really matter to human readers. There are many other examples. gcc added -Wmisleading-indentation a while back. Off-side rule languages like Python embrace things that sometimes prevent that problem. The Nim compiler was once sensitive to exactly this kind of operator spacing where it would parse "x+y * z" { space around '*' } as "(x+y)*z" in an attempt to have the compiler "Do what I mean".