DuckDB unknown unknowns
Some software is very lovable (see zed_is_good_software.md). For the past couple of months, DuckDB was one such piece of software, and I held it in high regard. But after a single hair-rippingly frustrating day of debugging certain issues¹, I have a more nuanced opinion of DuckDB.
This scenario brings to mind a favourite quote, "There are unknown unknowns", and this great article. In this context, I wrangled with two issues:
DuckDB does not support query timeouts:
This seems crazy to me. How does someone consider this for a production use case without such a fundamental feature? All the workarounds I found involve running the query in a separate process (multithreading doesn't work) with a timeout, and consequently moving away from in-memory usage to file-backed DuckDB, to avoid the penalty of running queries against uncached tables each time.
Again, I understand the reasoning for this (now). DuckDB is a single, in-process engine, and it's just not feasible to implement a timeout. But I wish this had been considered before we started using it in our system.
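For what it's worth, here is a minimal sketch of the multiprocessing workaround, not what we actually run. The names `run_with_timeout`, `duckdb_query` and `QueryTimeout` are made up for illustration, and it assumes a Unix-like host (it uses the `fork` start method) and a DuckDB database file on disk that the child process opens read-only.

```python
import multiprocessing as mp


class QueryTimeout(Exception):
    """Raised when the child process does not answer in time."""


def _child(conn, fn, args):
    # Runs in the child: call fn and ship back the result or the error text.
    try:
        conn.send(("ok", fn(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(fn, args=(), timeout_s=30.0):
    """Run fn(*args) in a child process; kill it after timeout_s seconds."""
    ctx = mp.get_context("fork")  # assumption: Unix-like host (e.g. a Linux container)
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, fn, args))
    proc.start()
    child_conn.close()  # the parent only reads
    try:
        if not parent_conn.poll(timeout_s):
            raise QueryTimeout(f"no result after {timeout_s}s")
        try:
            status, payload = parent_conn.recv()
        except EOFError:
            raise RuntimeError("child exited without sending a result")
    finally:
        if proc.is_alive():
            proc.terminate()  # this is the actual "timeout": the process dies
        proc.join()
        parent_conn.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def duckdb_query(db_path, sql):
    # Hypothetical helper: opens the on-disk database inside the child only.
    import duckdb

    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```

Usage would look something like `run_with_timeout(duckdb_query, ("insights.duckdb", "SELECT 42"), timeout_s=60)`, where `insights.duckdb` is a placeholder file name. The result has to be picklable to cross the pipe, which `fetchall()` rows are, and every call pays for a new process and a fresh connection, which is exactly the in-memory penalty mentioned above.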
DuckDB's AWS support is poor.
I'm not sure why, but I saw intermittent failures: in the middle of a long task in ECS, DuckDB would just stop processing queries and throw a 400 auth exception. I'm guessing it's related to one of these GitHub issues - issue A, issue B, …. But the point is that debugging it was much harder than it should have been. And going by the GitHub issues, the AWS support could clearly be better.
Just glad to be over this speed bump. :/
-
We use it at work, as the Python library, to run thousands of queries against Parquet files sitting in S3 to generate insights.