At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.
What I observed in practice:
- Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.
I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation
With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
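To make "columns distributed instead of rows" concrete, here is a minimal in-memory sketch. It is not the actual implementation described above: the `ColumnShardedTable` class, the CRC32-based placement, and the dict-per-node storage are all assumptions made for illustration. The point it shows is that a projection only touches the nodes holding the requested columns, so the cost depends on the number of selected columns rather than on table width.

```python
from collections import defaultdict
from zlib import crc32


class ColumnShardedTable:
    """Toy table that places whole columns on nodes (hypothetical design)."""

    def __init__(self, num_nodes):
        # Each "node" is simulated as a dict: column name -> list of values.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, column):
        # Stable hash, so a given column always maps to the same node.
        return crc32(column.encode("utf-8")) % len(self.nodes)

    def insert_column(self, column, values):
        # Adding a column is a single write to a single node.
        self.nodes[self._node_for(column)][column] = list(values)

    def select(self, columns, row_ids):
        # Group the requested columns by node so each node is visited once.
        by_node = defaultdict(list)
        for col in columns:
            by_node[self._node_for(col)].append(col)

        fetched = {}
        for node_id, cols in by_node.items():
            store = self.nodes[node_id]
            for col in cols:
                data = store[col]
                fetched[col] = [data[r] for r in row_ids]

        # Reassemble rows in the requested column order.
        return [
            tuple(fetched[col][i] for col in columns)
            for i in range(len(row_ids))
        ]


table = ColumnShardedTable(num_nodes=2)
for c in range(1000):
    # Column f<c> holds the value c * 10 + r in row r.
    table.insert_column(f"f{c}", [c * 10 + r for r in range(5)])

rows = table.select(["f7", "f999", "f42"], row_ids=[0, 4])
print(rows)
```

Because there are no joins or transactions in this model, nothing in `select` needs to coordinate between nodes beyond merging the fetched column slices.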
On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?
If you can drop the “distributed” part, then plug in DuckDB (https://duckdb.org/) and use it to query Parquet (supported out of the box) or Vortex (https://duckdb.org/docs/stable/core_extensions/vortex.html).
My best experience has been ignoring SQL and using (sparse) matrix formats for the genomic data itself, possibly combined with some small metadata tables that can fit easily in existing solutions (often even in memory). Sparse matrix formats like CSC/CSR can store numeric data at ~12 bytes per non-zero entry, so a single one of your servers should handle 10B data points in RAM and another 10x that comfortably on a local SSD. Maybe no need to pay the cost of going distributed?
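As a rough check on the ~12 bytes per non-zero figure, here is a small pure-Python CSR sketch. It assumes 8-byte float values and 4-byte column indices (the helper names `to_csr`, `csr_row`, and `bytes_per_nonzero` are made up for this example; in practice you would use a real sparse matrix library), and it ignores the row-pointer array, which is small when rows have many non-zeros.

```python
from array import array


def to_csr(dense):
    """Convert a list of equal-length rows into CSR arrays."""
    values = array("d")        # non-zero values, 8 bytes each
    col_idx = array("i")       # column index per value (C int, 4 bytes on common platforms)
    row_ptr = array("q", [0])  # offset where each row starts in values/col_idx
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_row(values, col_idx, row_ptr, i):
    """Return row i as a {column: value} dict containing only non-zeros."""
    start, end = row_ptr[i], row_ptr[i + 1]
    return {col_idx[k]: values[k] for k in range(start, end)}


def bytes_per_nonzero(values, col_idx):
    """Storage cost per non-zero entry, excluding the row pointers."""
    return values.itemsize + col_idx.itemsize


dense = [
    [0, 0, 3.5, 0],
    [1.0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2.0, 0, 4.0],
]
values, col_idx, row_ptr = to_csr(dense)

per_nnz = bytes_per_nonzero(values, col_idx)
# Back-of-the-envelope size for 10 billion non-zero data points, in GB.
estimated_gb = 10_000_000_000 * per_nnz / 1e9
print(per_nnz, estimated_gb)
```

Under these assumptions, 10B non-zeros come to about 120 GB, which is why a single 128 GB server is a tight but plausible fit.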
Self plug: if you're in the single cell space, I wrote a paper on my project BPCells which has some storage format benchmarks up to a 60k column, 44M row RNA-seq matrix.
I think one possible answer is to try to "compress" columns with custom datatypes. It may require touching some of the innards of the SQL engine (in PostgreSQL, for example, you need to do it in C), but it is a viable option in many cases: what you would otherwise express in JSON, for example, is often really a custom type that could be stored efficiently if there is a way to translate it into more primitive types. You then have to make sure that indexes still work on it.
The second option is to hide part of the join complexity with views.
I created a system to support my custom object store where the metadata tags are stored within key-value stores. I can use them to create relational tables and query them just like conventional row stores used by many popular database engines.
My 'columnar store database' can handle many thousands of columns within a single table. So far, I have only tested it out to 10,000 columns, but it should handle many more.
I can get sub-second query times against it running on a single desktop. I haven't promoted this feature, since no one I have talked to about it has had a compelling use for it.
A concrete case where this comes up is multi-omics research. A single study routinely combines ~20k gene expression values, 100k–1M SNPs, thousands of proteins and metabolites, plus clinical metadata — all per patient.
Today, this data is almost never stored in relational tables. It lives in files and in-memory matrices, and a large part of the work is repeatedly rebuilding wide matrices just to explore subsets of features or cohorts.
In that context, a “wide table” isn’t about transactions or joins — it’s about having a persistent, queryable representation of a matrix that already exists conceptually. Integration becomes “load patients”, and exploration becomes SELECT statements.
I’m not claiming this fits every workload, but based on how much time is currently spent on data reshaping in multi-omics, I’m confident there is a real need for this kind of model.
As I indicated in my previous post, I have a unique kind of data management system that I have built over the years as a hobby project.
It was originally designed to be a replacement for conventional file systems. It is an object store where you could store millions or billions of files in a single container and attach metadata tags to each one. Searches for data could be based on these tags. I had to design a whole new kind of metadata manager to handle these tags.
Since thousands or millions of different kinds of tags could be defined, each with thousands or millions of unique values within them, the whole system started to look like a very wide, sparse relational table.
I found that I could also use the individual 'columnar stores' that I built to build conventional database tables. I was actually surprised at how well it worked when I started benchmarking it against popular database engines.
I would test my code by downloading and importing various public datasets and then doing analytics against that data. My system does both analytic and transactional operations pretty well.
Most of the datasets only had a few dozen columns and many had millions of rows, but I didn't find any with over a thousand columns.
As I said before, I had previously only tested it out to 10,000 columns. But since reading your original question, I started to play with large numbers of columns.
After tweaking the code, I got it to create tables with up to a million columns and add some random test data to them. A 'SELECT *' query against such a table can take a long time, but queries that returned only a few dozen of the columns ran very fast.
How many patients were represented in your dataset? I assume that most rows did not have a value in every column.
ClickHouse and Scuba are extremely good at what they’re designed for: fast OLAP over relatively narrow schemas (dozens to hundreds of columns) with heavy aggregation.
The issue I kept running into was extreme width: tens or hundreds of thousands of columns per row, where metadata handling, query planning, and even column enumeration start to dominate.
In those cases, I found that pushing width this far forces very different tradeoffs (e.g. giving up joins and transactions, distributing columns instead of rows, and making SELECT projection part of the contract).
If you’ve seen ClickHouse or Scuba used successfully at that kind of width, I’d genuinely be interested in the details.
Feel free to email if you want to chat more.
You mention Parquet and Spark, but I’m wondering if you tried any of the “Lakehouse” formats that are basically Parquet plus a metadata layer (e.g. Iceberg). I’d probably at least give Trino or Presto a shot, although I suspect that you’ll have similar metadata issues with those engines.
What is the design?
Reference: https://www.hopsworks.ai/post/a-taxonomy-for-data-transforma...
Usually this would be stored in a sparse long form though. So I might be wrong.
That said, I have never seen 1 million columns.
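For comparison, the sparse long form mentioned above can be sketched with SQLite from the Python standard library. The table name `measurements`, the feature names, and the `select_wide` helper are all hypothetical. Each (patient, feature) pair is one row, and a wide view over a chosen subset of features is rebuilt on demand with conditional aggregation, so the schema never needs more than three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " patient_id INTEGER, feature TEXT, value REAL,"
    " PRIMARY KEY (patient_id, feature))"
)
# Only observed values are stored; missing features simply have no row.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [
        (1, "gene_A", 0.5),
        (1, "snp_17", 2.0),
        (2, "gene_A", 1.5),
        (3, "protein_X", 7.25),
    ],
)


def select_wide(conn, features):
    """Pivot the requested features into one row per patient."""
    # One conditional aggregate per feature; feature names are bound as
    # parameters rather than interpolated into the SQL text.
    exprs = ", ".join(
        "MAX(CASE WHEN feature = ? THEN value END)" for _ in features
    )
    sql = (
        f"SELECT patient_id, {exprs} FROM measurements "
        "GROUP BY patient_id ORDER BY patient_id"
    )
    return conn.execute(sql, list(features)).fetchall()


wide = select_wide(conn, ["gene_A", "snp_17"])
print(wide)
```

The trade-off is that every wide query pays for a pivot, which is the repeated reshaping cost that a persistent wide table tries to avoid.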
It used to only be available for big enterprises, but now there is a totally free version you can try out: https://www.exasol.com/personal