Show HN: Cloudberry Database – Greenplum Fork, Now an Apache Incubator Project - https://news.ycombinator.com/item?id=42256186 - Nov 2024 (1 comment)
It seems like Cloudberry uses PostgreSQL in some way? What does this entail? Can we use PostgreSQL extensions? How does it compare to ParadeDB?
Sorry for the question dump :-P
Both are based on old versions of Postgres; once you fork Postgres and change the innards, keeping up with the upstream has historically been a chore (you see the same problem with forks like Yugabyte). Cloudberry is not so bad compared to some; they are now up to date with Postgres 14, I believe.
In the Cloudberry architecture, as with Greenplum, database tables are partitioned into "segments", which are automatically distributed over a cluster of "segment hosts". A special optimizer called GPORCA figures out how to parallelize the query across these segments and then merge the results back together.
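To make the segment idea concrete, here is a toy sketch in Python. It is not how Cloudberry or GPORCA is implemented; the hash function, the segment count, and all function names are assumptions made up for illustration. Rows are assigned to segments by hashing a distribution key, each segment computes a partial aggregate over its own rows, and a coordinator merges the partial results.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment (illustrative hash, not Cloudberry's)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Split rows into per-segment lists based on the distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


def segment_partial_sum(segment_rows, group_column, value_column):
    """Work done locally on one segment: a partial SUM ... GROUP BY."""
    partial = defaultdict(int)
    for row in segment_rows:
        partial[row[group_column]] += row[value_column]
    return dict(partial)


def merge_partials(partials):
    """Work done on the coordinator: combine the per-segment partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)


def distributed_sum(rows, key_column, group_column, value_column,
                    num_segments=NUM_SEGMENTS):
    """Scatter rows to segments, aggregate locally, then gather and merge."""
    segments = distribute(rows, key_column, num_segments)
    partials = [segment_partial_sum(s, group_column, value_column)
                for s in segments]
    return merge_partials(partials)
```

Calling distributed_sum(rows, "id", "region", "amount") on a list of dicts gives the same totals as a single-node GROUP BY; the point is only that each segment sees a subset of the rows.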
The strategy is a classic shared-nothing, single-master architecture. It differs from newer disaggregated compute/data architectures (used especially in "data lake" systems like Delta Lake and Iceberg) in that compute is kept close to the original data; each segment is basically a full database instance, except that it holds only a subset of the data.
GPORCA achieves high speed by "pushing down" operators such as filters and joins to the individual segments. Greenplum is designed to be used with a low-latency, high-bandwidth network interconnect, on top of which it uses a custom UDP protocol, because each query needs to fan out to a potentially large number of parallel executors.
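A toy illustration of why pushdown matters, again in Python and again only a sketch: the function names and the "rows shipped" counter are made up for this example and say nothing about GPORCA's actual plans. Both functions return the same result, but the second one filters on each segment before anything is sent to the coordinator.

```python
def scan_without_pushdown(segments, predicate):
    """Ship every row to the coordinator, then filter there."""
    shipped = [row for segment in segments for row in segment]
    result = [row for row in shipped if predicate(row)]
    return result, len(shipped)  # (matching rows, rows sent over the network)


def scan_with_pushdown(segments, predicate):
    """Filter on each segment first, so only matching rows are shipped."""
    shipped = []
    for segment in segments:
        shipped.extend(row for row in segment if predicate(row))
    return shipped, len(shipped)  # (matching rows, rows sent over the network)
```

With a selective predicate, the second count is much smaller than the first, which is the whole appeal when a query fans out to many segments.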
Like ClickHouse, Cloudberry supports columnar table layouts (like CH's MergeTree engine family) as well as the native Postgres row-oriented layout (I'm not aware of a close CH equivalent; Atomic is a database engine, not a table engine). A difference from CH is that there's not really any single-server mode; distributed tables are always distributed by the cluster for you.
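For anyone unfamiliar with the row vs. columnar distinction, here is a minimal Python sketch. It is purely conceptual (in-memory lists and dicts, hypothetical function names) and does not reflect the on-disk format of Cloudberry, Postgres, or ClickHouse. A row layout keeps each record together; a columnar layout keeps each column's values together, so an aggregate over one column only has to read that column.

```python
def to_columnar(rows):
    """Convert a list of row dicts (all with the same keys) into per-column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows, column):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[column] for row in rows)


def sum_columnar_layout(columns, column):
    """Columnar layout: only the one requested column is read."""
    return sum(columns[column])
```

Both sums agree; the difference is how much data has to be touched to compute them, which is why columnar layouts tend to suit analytic scans and row layouts tend to suit point lookups and updates.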
I can't compare CH's optimizer to Cloudberry's, but I suspect the latter is more sophisticated. I also don't know how performance compares. Cloudberry inherits a lot from Postgres, so I suspect that for non-columnar (OLTP) data, performance may be a lot better, but not necessarily for columnar (OLAP) use cases.
I think there is a plan to integrate with Iceberg; see this discussion for reference: https://github.com/apache/cloudberry/discussions/369. We are also discussing the new roadmap, FYI.