KantoDB developer guide
Welcome to the KantoDB developer guide.
This document is intended to give a competent software developer, one familiar with backend software, storage, data structures, and Rust, a better idea of how KantoDB works internally.
The Open Source source code for KantoDB is available at https://git.kantodb.com/kantodb/kantodb.
Keeping documentation up to date is a difficult task, doubly so for developer-oriented documentation. Statements made in this document may in fact be false. Projections of what work will be done in the future, and how things might work one day, are even more hazardous.
If anything here seems out of touch with reality, please reach out on the community forums.
Roadmap
As of 2025Q1, the content and order of goals are very much in flux.
Your input is very much welcome; please engage with us through a support contract or on the community forums.
About this page
Tags used in the roadmap
- #external: Likely source of the bug is not us, but in a dependency
- #maybe: We might not do this
- #urgency/high
- #urgency/medium
- #urgency/low
- #severity/high
- #severity/medium
- #severity/low
- #size/small
- #size/medium
- #size/big
- #api-change: Changes already-published API
- #bug: Likely bug in the software (not documentation)
- #enhancement: New feature or improving behavior some other way
- #performance: Improves performance, minimal effect on other behaviors
- #ops: Affects operational aspects, not functionality; for example, observability
- #dev: Only affects development and building from source, not the final product
- #test: Topic where test coverage needs to be better
- #doc: Improvements or additions to documentation
- #good-first-issue: Good for newcomers
- #need-more-info: We don't know enough yet to work on this
- #help-wanted: Extra attention would be appreciated
- #ecosystem/*: We are hoping our dependencies will improve
- #waiting: Cannot proceed at this time
The checklist items are in very rough priority order. Order will be adjusted based on feedback and discovery of critical paths.
The big picture
- v0: Bootstrap to a real SQL database
- v1: Production-ready database using RocksDB
- v2: Pure Rust with new storage subsystem
Platforms
Support for non-Linux operating systems is not currently on the roadmap. We intend to actively use Linux-specific functionality for performance, security and simplicity.
For now, only x86-64 is actively tested, but everything is supposed to work cross-architecture (higher risk: 32-bit processing >4 GiB of data at once, 32-bit in general, big-endian).
Pending work
Work that is more or less "about to be done".
- #dev make sequence inc take a type, promise never to return `u64::MAX`, add some intermediate type or trait to make that work, avoid lots of error checks
- #dev re-evaluate `rocky` vs `rust-rocksdb` high-level API
- #dev stringy internal errors like `internal error#`, in `key-format-v1` and `record-format-v1`; do we need to return even more errors
- #dev extract a higher-level backend provider, split non-RocksDB specifics from `kanto_backend_rocksdb`
- #dev eliminate remaining `expect` (Panic policy)
- #dev eliminate or excuse clippy exceptions that are either missing a reason or whose reason includes `TODO`
  - `clippy::indexing_slicing`
- #dev `BlockingSentinel` or such to make sure we're doing blocking right
- #bug random rare `pthread lock: invalid argument` (likely RocksDB shutdown, which is supposed to be safe; cancelling background tasks is said to be a speedup only)
- table scan: start/stop from DataFusion predicate pushdown #performance #severity/high #ref/1twdr3jqbpzen
- table scan: inform DataFusion of the sort order we guarantee for primary keys #performance
  - populate `output_ordering`: `ExecutionPlan::properties`, `ExecutionPlanProperties::output_ordering`
- implement `TableProvider::constraints` https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.constraints #performance
  - background: https://docs.rs/datafusion/latest/datafusion/common/struct.FunctionalDependence.html
  - note non-public helper function for building constraints https://github.com/apache/datafusion/issues/15443 #ecosystem/datafusion
  - ugly: to fill `Unique`, we need to know all indexes for the table; maybe merge `IndexDef` into `TableDef` (and then maybe `IndexId` gets scoped within one table, change RocksDB keys)
- #performance use indexes for lookups, not just correctness
  - see https://github.com/apache/datafusion/discussions/9963
  - variants: table scan (current), table get, index scan, index get, index and table
  - is that maybe `TableProvider::get_logical_plan`?
  - probably need a custom query planner step
- index scan: inform DataFusion of the sort order we guarantee for keys of this index #performance
- index scan: start/stop from DataFusion predicate pushdown #performance #severity/high #ref/1twdr3jqbpzen
- maybe use RocksDB `multi_get` for table lookups from index scan: https://docs.rs/rocksdb/latest/rocksdb/struct.DBCommon.html#method.batched_multi_get_cf_opt
  - also `multi_get_for_update`
- #test fix everything from `cargo-mutants`
- #test validate index state better, integrity test to run between mutations
- #test transaction semantics better, including isolation and concurrent access, `SELECT FOR UPDATE`
  - use https://github.com/ept/hermitage
  - use http://www.redbook.io/ch6-isolation.html
  - see https://jepsen.io/consistency
  - see https://www.postgresql.org/docs/current/sql-select.html#SQL-FOR-UPDATE-SHARE
  - see `LOCK IN SHARE MODE` https://stackoverflow.com/questions/32827650/mysql-innodb-difference-between-for-update-and-lock-in-share-mode
- ensure iterator snapshot consistency
  - add a mutation test tap point where disabling this causes a test failure
- hide `arbitrary` behind a feature
- re-evaluate unsafe (non-validating) use of strings
  - consider https://crates.io/crates/bstr -- but can't make DataFusion use that
- study RocksDB size estimation
  - https://github.com/facebook/rocksdb/wiki/Approximate-Size
  - https://docs.rs/librocksdb-sys/0.17.1+9.9.3/librocksdb_sys/fn.rocksdb_approximate_sizes_cf.html
  - https://docs.rs/librocksdb-sys/0.17.1+9.9.3/librocksdb_sys/fn.rocksdb_approximate_sizes_cf_with_flags.html
  - https://docs.rs/librocksdb-sys/0.17.1+9.9.3/librocksdb_sys/fn.rocksdb_column_family_metadata_get_size.html
- `kantodb rocksdb-create` twice with the same path gives a confusing error message
- ensure `fsync` on commit, audit our use of RocksDB, validate with tests (see the sketch after this list)
  - prefer to delay until we can demonstrate it working, for trust
  - leave a mutation test tap point where disabling fsync on commit causes a test failure
  - is this just `WriteOptions` `sync`, or more involved?
    - maybe that can be optimized based on writes being no-sync within a transaction, with only the transaction commit being sync
  - https://github.com/facebook/rocksdb/wiki/Basic-Operations#synchronous-writes
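A minimal sketch of the `WriteOptions`-based idea referenced above, using the `rocksdb` crate's plain `DB` API for illustration. Whether setting `sync` only on the commit write is actually sufficient for our transactional use is exactly the open question, so treat this as an assumption to verify, not the implementation.

```rust
use rocksdb::{Options, WriteBatch, WriteOptions, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/kanto-fsync-sketch")?;

    // Writes accumulated during a "transaction" could stay non-sync...
    let mut batch = WriteBatch::default();
    batch.put(b"key1", b"value1");
    batch.put(b"key2", b"value2");

    // ...while the commit itself asks RocksDB to fsync the WAL before returning.
    let mut commit_opts = WriteOptions::default();
    commit_opts.set_sync(true);
    db.write_opt(batch, &commit_opts)?;
    Ok(())
}
```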
Unimplemented or buggy SQL operations
- DataFusion explicitly refuses to work with many SQL types even when it has the corresponding underlying type; see https://docs.rs/datafusion/latest/datafusion/sql/planner/struct.SqlToRel.html#method.build_schema and `convert_data_type`, e.g. `UInt64` vs `UNSIGNED BIGINT`
- large number literals behave differently than in Postgres; `SELECT i64_max_plus_1` gives a float, Postgres looks to be using big numbers
- `CREATE UNIQUE INDEX .. [NULLS DISTINCT] with null columns` (#ref/unimplemented_sql/w9j8q485anior)
  - use a multi-index with nullable columns in the value not the key, prefix scan on insert to detect conflicts
- `UPDATE of column in primary key` (#ref/unimplemented_sql/fpb157cronqcy)
- tables without primary keys are not usable
  - DataFusion `UPDATE` semantics are tricky, we need to dissect the `input` `LogicalPlan` and convert it to a table scan of the current data (instead of blindly executing it), updating records as we see them in the scan
- #bug `DELETE` (#ref/gs5muy13tu1uk) or `UPDATE` (#ref/i61jpodrazscs) for a table using `row_id` (not primary keys)
  - possible fix: adjust `logical_plan` to also get a magic-named column `__kanto_row_key`, use that to do deletes
  - possible fix: understand the `logical_plan`, don't use DataFusion to produce record batches, run a custom table/filter scan that also deletes on the fly
- return number of rows affected
- index existing data at `CREATE INDEX` time #severity/high #urgency/high
  - ideally go straight to concurrent index building logic, no table locks
- `INSERT RETURNING` (#ref/hixyde6h9n77o), `UPDATE RETURNING` (#ref/pp5wz7nq4yhzs), `DELETE RETURNING` (#ref/yn84qzsk1f9ny) refused by DataFusion SQL parser, not sure how to work around #ecosystem/datafusion #severity/high #urgency/medium
- CTEs (test coverage): `SELECT`, `UPDATE`, `INSERT`, `DELETE` inside a CTE; `SELECT`, `UPDATE`, `INSERT`, `DELETE` following a CTE
- rest of the data types
  - should we enforce `VARCHAR(30)` etc length, truncate; forbidding that syntax until clarified
  - `INTEGER(10)` etc precision
    - `type { INTEGER(precision) | INT(precision) | INT4(precision) }` (#ref/unimplemented_sql/tiptbb5d8buxa)
    - `type { UNSIGNED INTEGER(precision) | UNSIGNED INT(precision) | UNSIGNED INT4(precision) }` (#ref/unimplemented_sql/fzt1q8mfgkr6n)
    - `type { (UNSIGNED) INT2(precision) | (UNSIGNED) SMALLINT(precision) }` (#ref/unimplemented_sql/g57gd8499t4hg)
    - `type { BIGINT(precision) | INT8(precision) }` (#ref/unimplemented_sql/88fzntq3mz14g)
    - `type { UNSIGNED BIGINT(precision) | UNSIGNED INT8(precision) }` (#ref/unimplemented_sql/sze33j7qzt91s)
    - `type TINYINT(precision)` (#ref/unimplemented_sql/hueii8mjgk8qq)
    - `type UNSIGNED TINYINT(precision)` (#ref/unimplemented_sql/ewt6tneh7ecps)
  - `type { NUMERIC | DECIMAL | DEC }` (#ref/unimplemented_sql/uoakmzwax448h)
  - `FLOAT(42)` etc precision
    - `type FLOAT(precision)` (#ref/unimplemented_sql/xfwbr486jaw5q)
    - `type DOUBLE(precision)` (#ref/unimplemented_sql/wo43sei9ubpon)
  - `NUMERIC(precision, scale)`, `NUMERIC(precision)`, `NUMERIC`; aka `DECIMAL(precision, scale)`
  - `type { CHARACTER | CHAR | FIXEDSTRING }` (#ref/unimplemented_sql/hdhf8ygp9zna4): fixed-length strings
  - `type INT128 (ClickHouse syntax)` (#ref/unimplemented_sql/4sg3mftgxu9j4)
  - `type INT256 (ClickHouse syntax)` (#ref/unimplemented_sql/ex9mnxxipbqqs)
  - `type UINT128 (ClickHouse syntax)` (#ref/unimplemented_sql/riob7j6jahpte)
  - `type UINT256 (ClickHouse syntax)` (#ref/unimplemented_sql/q6tzjzd7xf3ga)
  - `type { MEDIUMINT (UNSIGNED) }` (#ref/unimplemented_sql/m7snzsrqj68sw)
  - `type { BIGNUMERIC | BIGDECIMAL } (BigQuery syntax)` (#ref/unimplemented_sql/an57g4d4uzwew)
  - `type BIT (VARYING)` (#ref/unimplemented_sql/88uneott9owc1)
  - `type UUID` (#ref/unimplemented_sql/531ax7gb73ce1)
  - `type TIMESTAMP` (#ref/unimplemented_sql/njad51sbxzwnn)
  - `type TIME` (#ref/unimplemented_sql/bppzhg7ck7xhr)
  - `type DATE` (#ref/unimplemented_sql/1r5ez1z8j7ryo)
  - `type DATE32` (#ref/unimplemented_sql/781y5fxih7pnn)
  - `type DATETIME (ClickHouse syntax)` (#ref/unimplemented_sql/oak3s5g3rnutq)
  - `type DATETIME (MySQL syntax)` (#ref/unimplemented_sql/dojz85ngwo5wr)
  - `type INTERVAL` (#ref/unimplemented_sql/5wxynwyfua69s)
  - `type { JSON | JSONB }` (#ref/unimplemented_sql/abcw878jr35qc)
    - JSON manipulation functions #ecosystem/datafusion (best done in DataFusion https://github.com/apache/datafusion/issues/7845)
    - inspiration: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
  - `type BINARY` (#ref/unimplemented_sql/i16wqwmie17eg)
  - `type ENUM` (#ref/unimplemented_sql/1geqh5qxdeoko)
  - `type ARRAY` (#ref/unimplemented_sql/hnmqj3qbtutn4)
  - `type SET` (#ref/unimplemented_sql/hm53sq3gco3sw)
  - `type STRUCT` (#ref/unimplemented_sql/nsapujxjjk9hg)
  - `type MAP` (#ref/unimplemented_sql/7ts8zgnafnp7g)
  - `type UNION` (#ref/unimplemented_sql/oownfhnfj5b9r)
  - `type TUPLE` (#ref/unimplemented_sql/x7qwkwdhdznxe)
  - `type NESTED` (#ref/unimplemented_sql/fa6nofmo1s49g)
  - `type { CHARACTER LARGE OBJECT | CHAR LARGE OBJECT | CLOB }` (#ref/unimplemented_sql/ew7uhufhkzj9w)
  - `type TRIGGER` (#ref/unimplemented_sql/zf5jx1ykoc5s1)
  - `custom type` (#ref/unimplemented_sql/dt315zutjqpzc)
  - `type UNSPECIFIED` (#ref/unimplemented_sql/916n48obgiqh6)
  - `type LOWCARDINALITY` (#ref/unimplemented_sql/78exey8xjz3yk)
  - `type REGCLASS (Postgres syntax)` (#ref/unimplemented_sql/quzjzcsusttgc)
  - `type NULLABLE (ClickHouse syntax)` (#ref/unimplemented_sql/ra5kspnstauu6)
- emulate more Postgres system tables, like `pg_cancel_backend`
Unimplemented SQL syntax support
See also DataFusion SQL support status: https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html, roadmap https://arrow.apache.org/datafusion/contributor-guide/roadmap.html.
For reference, see Postgres SQL, SQLite SQL, and MariaDB SQL.
- `SET LOCAL` (#ref/unimplemented_sql/7b1ab6tms3uz4)
- `SHOW variable` (#ref/unimplemented_sql/7a69iu5nr1oxk)
- `CREATE TABLE IF NOT EXISTS` (#ref/unimplemented_sql/pf1mfhd9sz3jn)
- `DROP TABLE` (#ref/unimplemented_sql/47hnz51gohsx6)
- `ALTER TABLE` (#ref/unimplemented_sql/dny7j9hx3ihha)
- `CREATE INDEX IF NOT EXISTS` (#ref/unimplemented_sql/tzwc96kk9173n)
- `DROP INDEX` (#ref/unimplemented_sql/4kgkg4jhqhfrw)
- `ALTER INDEX` (#ref/unimplemented_sql/ha1fybdqqwa8k)
- `SET TRANSACTION` (#ref/unimplemented_sql/yybqytw7q8bmn)
  - may cause a bigger refactoring of our transaction start; we currently do eager creation
- `CREATE TABLE without columns` (#ref/unimplemented_sql/g7hdmcm5bk69a)
- constraints on tables #severity/high #urgency/high
  - `CREATE TABLE with table constraint CHECK` (#ref/unimplemented_sql/8zaqnh9mb9tor)
- constraints on columns
  - `CREATE TABLE with column option { PRIMARY KEY | UNIQUE }` (#ref/unimplemented_sql/hwcjakao83bue) #severity/high #urgency/medium
  - `CREATE TABLE with column option CHECK` (#ref/unimplemented_sql/dwcpda8aj66d6) #severity/high #urgency/medium
- default and computed values for columns
  - `CREATE TABLE with column option DEFAULT` (#ref/unimplemented_sql/944dzboki4djc) #severity/high #urgency/medium
  - `CREATE TABLE with column option GENERATED` (#ref/unimplemented_sql/bxzpaj1quo5a4) #severity/high #urgency/medium
- `CREATE INDEX with dotted column name` (#ref/unimplemented_sql/kbezqzh7p6yky)
- `CREATE INDEX with non-column expression` (#ref/unimplemented_sql/8cjqfzsfpnduw)
- `CREATE INDEX WHERE` (#ref/unimplemented_sql/9134bk3fe98x6): partial indexes
- `CREATE INDEX .. WITH` (#ref/unimplemented_sql/w7t6c8xrehsnh)
- `CREATE INDEX .. INCLUDE` (#ref/unimplemented_sql/ip415b5s8sa6h): non-key included columns
- foreign keys #severity/high #urgency/high
  - `CREATE TABLE with table constraint FOREIGN KEY` (#ref/unimplemented_sql/mamfde4rdzgeo)
  - `CREATE TABLE with column option REFERENCES` (#ref/unimplemented_sql/ufnoiinkkp3wy)
  - enforce
- roles and access control #severity/high #urgency/medium
  - `CREATE ROLE` (#ref/unimplemented_sql/hqyj3srdk1g4h)
  - `DROP ROLE` (#ref/unimplemented_sql/q8wh8nqixoytq)
  - `ALTER ROLE` (#ref/unimplemented_sql/u6set5k1gzyhh)
  - `GRANT` (#ref/unimplemented_sql/mekqqm6hxy64s)
  - `REVOKE` (#ref/unimplemented_sql/wha7884jy4bty)
  - `SET ROLE` (#ref/unimplemented_sql/ryqxdr55r49sr)
- prepared statements
  - `PREPARE` (#ref/unimplemented_sql/fgi986yc4d7cw)
  - `EXECUTE` (#ref/unimplemented_sql/3w91mfrbzguyh)
  - `DEALLOCATE` (#ref/unimplemented_sql/fn1q45ys1t8ks)
- views, first dynamic, then materialized, then incremental updates etc extras #severity/high
  - `CREATE VIEW` (#ref/unimplemented_sql/b9m5uhu9pnnsw)
  - `ALTER VIEW` (#ref/unimplemented_sql/uhwpi8nzojcxy)
  - `DROP VIEW` (#ref/unimplemented_sql/hwk7gp6ffh8zy)
- `TRUNCATE` (#ref/unimplemented_sql/uf3rr4diw3o1o)
- temporary tables
  - `CREATE TEMPORARY TABLE` (#ref/unimplemented_sql/3u561ykck5m76)
- sequences (exposed to SQL)
  - `CREATE SEQUENCE` (#ref/unimplemented_sql/rdkjx9ryredcw)
  - `DROP SEQUENCE` (#ref/unimplemented_sql/ca149fq1ixm5h)
- `START TRANSACTION READ ONLY` (#ref/unimplemented_sql/bhetj3emnc8n1)
- multiple schemas
  - make schema creation explicit not implicit
  - `CREATE SCHEMA` (#ref/unimplemented_sql/x9cna73uf8cew)
  - `DROP SCHEMA` (#ref/unimplemented_sql/5rnpn33hia5a6)
- multiple databases/catalogs/backends; all three might be the same thing for us
  - re-evaluate the whole "catalog name is backend" idea, especially when a backend can store multiple catalogs; duplication can lead to mismatch, renames are weird; maybe it's a lookup order?
    - is a backend a `TABLESPACE`? `sqlparser-rs` doesn't have `TABLESPACE` support as of 2025-03
  - re-evaluate the idea of invisibly spreading every transaction across all backends; maybe force each transaction to stick to one, maybe make it lazy
  - Postgres docs specifically say "Databases are equivalent to catalogs"
  - `CREATE DATABASE` (#ref/unimplemented_sql/6c76p5eqsuea6)
  - `ALTER DATABASE`: not in `sqlparser` v0.54.0 #ecosystem/sqlparser-rs
  - `DROP DATABASE` (#ref/unimplemented_sql/qn6s3c397johw)
- `EXPLAIN` (#ref/unimplemented_sql/fugjaua7zkfd4)
- `CREATE INDEX DESC` (#ref/unimplemented_sql/n7x3zhjk7bf8k): descending sort order
- `CREATE INDEX CONCURRENTLY` (#ref/unimplemented_sql/wya1a3w874xba)
- `COMMIT AND CHAIN` (#ref/unimplemented_sql/ic5jpcqsjghm6)
- `ROLLBACK AND CHAIN` (#ref/unimplemented_sql/q1pay866nux9n)
- `COPY` (#ref/unimplemented_sql/1fxo667s3wwes): the kind that transfers data over the wire; missing support in https://github.com/sunng87/pgwire #waiting
  - we do not intend to support anything where server-side file read/write gets a path from the user over the network
- `RELEASE SAVEPOINT` (#ref/unimplemented_sql/hyzhkewjpnug6): #severity/low #urgency/low #size/small
  - test the error from trying to use it again
- `{ LISTEN | UNLISTEN | NOTIFY }` (#ref/unimplemented_sql/8t9hqpx4umzaq)
- user-defined functions #severity/medium #urgency/low
  - `CREATE FUNCTION` (#ref/unimplemented_sql/fj9ny1gu79cbg)
  - `DROP FUNCTION` (#ref/unimplemented_sql/ptp9d3xe1456h)
  - `SHOW FUNCTIONS` (#ref/unimplemented_sql/yqiswgsqjc5uh): #severity/low
- user-defined types #severity/medium #urgency/low
  - `CREATE TYPE` (#ref/unimplemented_sql/ur33y834br1wa): user-defined types (see Postgres https://www.postgresql.org/docs/16/sql-createtype.html, DataFusion issue)
  - `DROP TYPE` (#ref/unimplemented_sql/t1fti3yz8jsdn)
- collation, as in locale-specific ordering:
  - `CREATE TABLE .. COLLATE` (#ref/unimplemented_sql/qd36gh6hj78br): configurable collation order for columns: #severity/medium #urgency/low
  - see https://databasearchitects.blogspot.com/2016/08/equivalence-of-unicode-strings-is.html
  - see https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
- procedures #severity/medium #urgency/low
  - `CREATE PROCEDURE` (#ref/unimplemented_sql/69jzm59u8c6jc)
  - `CALL` (#ref/unimplemented_sql/dx5ic5bkaaq7c): see Postgres
  - `DROP PROCEDURE` (#ref/unimplemented_sql/w173ctuodtcaa)
- triggers #severity/medium #urgency/low
  - `CREATE TRIGGER` (#ref/unimplemented_sql/cxts4du59pwqn)
  - `DROP TRIGGER` (#ref/unimplemented_sql/i1omqjczqksqk)
- `KILL [CONNECTION]` (#ref/unimplemented_sql/jqji1j6my4f5w): #severity/medium #urgency/low
- `KILL QUERY` (#ref/unimplemented_sql/77ecapaed3bso): #severity/medium #urgency/low
- `MERGE` (#ref/unimplemented_sql/y5gzyy5kxg5ws): #severity/low #urgency/low
- `CREATE TABLE .. COMMENT` (#ref/unimplemented_sql/dgyhzknpfx756)
- `CREATE TABLE with column option COMMENT` (#ref/unimplemented_sql/4itjrujemcztq) #severity/medium #urgency/low
- miscellaneous column options
- `COMMENT ON` (#ref/unimplemented_sql/e8uhkzz1krcgw): #severity/low #urgency/low
- populate table with data at creation time #severity/low #urgency/low
  - `CREATE TABLE AS SELECT` (#ref/unimplemented_sql/cirnphk69igxy): (`Ctas` in https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.WriteOp.html)
  - `CREATE TABLE AS SELECT` (#ref/unimplemented_sql/nuqha5e3r7g6h)
- `READ COMMITTED`: currently upgrading to `REPEATABLE READ`, could support directly (#ref/bm4q45m5ajz9s)
- `START TRANSACTION ISOLATION LEVEL SERIALIZABLE` (#ref/unimplemented_sql/dy1ch15p7n3ze): maybe just a better error message: #severity/low #urgency/low
  - I think supporting this requires two things: `TransactionOptions::set_snapshot(true)`, and every `get` is upgraded to `get_for_update`.
  - then marking transactions as `READ ONLY` can relax that for performance
- cursors (see Postgres; not in standard interactive SQL) #severity/low #urgency/low
  - `DECLARE` (#ref/unimplemented_sql/159jmakdyuewy)
  - `FETCH` (#ref/unimplemented_sql/9nf8hju6ie9e1)
  - `CLOSE` (#ref/unimplemented_sql/gc1557zek4hne)
- row-level security (see Postgres)
  - `CREATE POLICY` (#ref/unimplemented_sql/gce4sb6hptjfy)
  - `DROP POLICY` (#ref/unimplemented_sql/n9zgxwn6736wr)
  - `ALTER POLICY` (#ref/unimplemented_sql/wj1tnq4hkth7s)
- `DISCARD` (#ref/unimplemented_sql/9a5zkcwgbhewh): (see Postgres): #severity/low #urgency/low
- `REPLACE INTO (MySQL syntax) or INSERT OR REPLACE (SQLite syntax)` (#ref/unimplemented_sql/54xqh7ornzpso): #severity/low #urgency/low
- `CREATE OR REPLACE TABLE` (#ref/unimplemented_sql/65so99qr1hyck): #severity/low #urgency/low
- `ASSERT (PowerSQL syntax)` (#ref/unimplemented_sql/qq1s1fq45ztdw): #severity/low #urgency/low
- `CREATE TABLE .. LIKE` (#ref/unimplemented_sql/xutp6m7rjnc54)
- `CREATE TABLE .. CLONE` (#ref/unimplemented_sql/7b5kah43yqquc): optionally also copies data; the RocksDB backend likely will not support that option
- `CREATE TABLE .. WITH` (#ref/unimplemented_sql/3kozdfoeczmrk): see Postgres, likely parse superficially and error if given #severity/low #urgency/low
- `RENAME TABLE (MySQL syntax)` (#ref/unimplemented_sql/85ax9gtrkcneg)
- `CREATE TABLE with column option CHARACTER SET` (#ref/unimplemented_sql/ifi7dygtua3oc): parse superficially and refuse non-UTF-8 #severity/low #urgency/low
- `CREATE TABLE .. DEFAULT CHARSET (MySQL syntax)` (#ref/unimplemented_sql/dcan9mz8g4eja): parse superficially and refuse non-UTF-8 #urgency/low #severity/low #size/small
- relaxing constraints to end of transaction:
  - `CREATE TABLE with table constraint PRIMARY KEY .. DEFERRABLE` (#ref/unimplemented_sql/uzf4oabd178rr)
  - `CREATE TABLE with table constraint PRIMARY KEY .. INITIALLY DEFERRED` (#ref/unimplemented_sql/ysnkhzw6ukbms)
- `BEGIN {DEFERRED | IMMEDIATE | EXCLUSIVE} (SQLite syntax)` (#ref/unimplemented_sql/o969ikdsjz76r): #severity/low #urgency/low
- `CREATE TABLE with column option ON CONFLICT (SQLite syntax)` (#ref/unimplemented_sql/ehefbqgnnfqwe) #severity/medium #urgency/low
- `PRAGMA (SQLite syntax)` (#ref/unimplemented_sql/a1k38hk48fpzr): #severity/low #urgency/low
- `SHOW CREATE (MySQL syntax)` (#ref/unimplemented_sql/c7cewp15zm5ys)
- `SHOW STATUS (MySQL syntax)` (#ref/unimplemented_sql/ejj6ir6dsxhrc)
- `SHOW DATABASES (MySQL syntax)` (#ref/unimplemented_sql/kskgotui7rupc)
- `SHOW COLUMNS (MySQL syntax)` (#ref/unimplemented_sql/zx98bjxguht5o)
- `SHOW VARIABLES (MySQL syntax)` (#ref/unimplemented_sql/n5cs7ue5gzpde)
- `SHOW SCHEMAS (Snowflake syntax)` (#ref/unimplemented_sql/poi5oxunjyq5a)
- `SHOW VIEWS (Snowflake syntax)` (#ref/unimplemented_sql/zzi1wb3jq5jeh)
- `{ EXPLAIN | DESCRIBE } TABLE (MySQL syntax)` (#ref/unimplemented_sql/4tpmma7ewdf3a): #severity/low #urgency/low
- `CREATE TABLE .. AUTO_INCREMENT (MySQL syntax)` (#ref/unimplemented_sql/18kxgbi8rzb5r): #severity/low #urgency/low
- `USE` (#ref/unimplemented_sql/ng34e6kt6rrt4): #severity/low #urgency/low
- `CREATE TABLE with table constraint PRIMARY KEY .. COMMENT (MySQL syntax)` (#ref/unimplemented_sql/ss9iqz1c8ffrh)
- MySQL-style names on constraints or indexes:
  - `CREATE TABLE with table constraint PRIMARY KEY given a name (MySQL syntax)` (#ref/unimplemented_sql/5rr7ouyasn1dk)
  - `CREATE TABLE with table constraint PRIMARY KEY given an index name (MySQL syntax)` (#ref/unimplemented_sql/mexuaio6ngad4)
- `CREATE TABLE with table constraint UNIQUE {INDEX | KEY} (MySQL syntax)` (#ref/unimplemented_sql/m3op6enhb5pgw)
- `CREATE TABLE with table constraint PRIMARY KEY .. NOT ENFORCED (MySQL syntax)` (#ref/unimplemented_sql/u339u8jzzkqdy)
- `CREATE TABLE with column option ON UPDATE timestamp function (MySQL syntax)` (#ref/unimplemented_sql/zaub41ukgqfmq) #severity/low #urgency/low
- `CREATE TABLE with column option MATERIALIZED (ClickHouse syntax)` (#ref/unimplemented_sql/ouq6dfrgbxn9w) #severity/low #urgency/low
- `CREATE TABLE with column option EPHEMERAL (ClickHouse syntax)` (#ref/unimplemented_sql/78qipewmqn31h) #severity/low #urgency/low
- `CREATE TABLE with column option ALIAS (ClickHouse syntax)` (#ref/unimplemented_sql/1adkef6yd1qgc) #severity/low #urgency/low
- `CREATE TABLE with column option OPTIONS (BigQuery syntax)` (#ref/unimplemented_sql/7drpfhozbre1k) #severity/low #urgency/low
- `CREATE TABLE with column option TAG (Snowflake syntax)` (#ref/unimplemented_sql/sjpe4f636kn8n) #severity/low #urgency/low
- `CREATE TABLE with column option { MASKING POLICY | PROJECTION POLICY } (Snowflake syntax)` (#ref/unimplemented_sql/ynucdy5u7a35w) #severity/low #urgency/low
- `CREATE TABLE with weird column option` (#ref/unimplemented_sql/4jqhus4cj4fho) #severity/low #urgency/low
There are plenty of corners of SQL syntax we have chosen not to support.
These are marked in the source code as `NotSupportedSql`.
We may have judged wrong; let us know if you need some unsupported syntax.
Later
Features and improvements we would like to work on, but that aren't started yet.
- #admin set up security contacts
- reconsider each item in https://datafusion.apache.org/user-guide/configs.html
- maybe interpret `INT8` and `UINT8` as bits vs bytes based on SQL dialect; ClickHouse is the odd one out
- safer arithmetic than silent wraparound in SQL
- we could pack booleans tighter in a record, prefer finding an unused bit in the last byte/u32 allocated to booleans
- existence of a `test_session` method in a couple of places probably means the non-test API is not convenient enough
- move away from `arrow-row` for keys, it's not guaranteed stable; we're mostly using it to avoid writing our own SIMD code for now
- #maybe #dev split types so different byte slices aren't confused with each other: row key, index key, RocksDB key, record, Arrow data, etc
  - would require wrapping the RocksDB API to enforce that?
- #maybe #performance special-case point queries so they don't go through a table scan and RocksDB iterator
  - `rust-rocksdb` makes copies of keys to do FFI, bad design
  - perhaps there is even a way to avoid columnar result overhead?
- make `information_schema` from DataFusion show our types, not `Utf8View` etc
  - potential: DataFusion supporting logical types: https://github.com/apache/datafusion/issues/12622 #waiting
  - alternatively, re-implement ourselves
  - issue on what is missing from `information_schema`: https://github.com/apache/datafusion/issues/8916
- #performance #maybe partition workloads (inputs for insert, key ranges for table scan) and run multiple RocksDB blocking worker threads
  - `datafusion::physical_plan::execute_stream_partitioned` instead of `execute_stream`
- #performance consider switching to minitrace: https://crates.io/crates/minitrace, or NIH it https://news.ycombinator.com/item?id=41570163
- #performance support `pgwire` extended queries, newer clients might not implement simple queries
- #ops make server use only UTC time as native, never local time
- #ops landlock, seccomp etc sandboxing
- #cleanup #dev maybe we want stronger types to never confuse key-value keys and values?
- #dev #doc `KantoError` convention, error codes
- #dev lint: every `UnimplementedSql` and `NotSupportedSql` must have an SLT test
- #dev lint: handle `Unimplemented` like `UnimplementedSql`, figure out a way to avoid duplication
- #dev lint: all `expect` must be internal errors and unique (in Panic policy)
- #dev lint: all `#waiting` have a URL to an upstream issue
- #dev lint: `thiserror` source field: named `source` or `error`, be consistent
- #ops more readable logging (maybe by postprocessing the JSON)
- #ops collect metrics in-memory and expose them via a Prometheus endpoint
  - maybe also expose Tokio metrics: https://docs.rs/tokio-metrics/latest/tokio_metrics/
- TLS support in Postgres-compatible server
- shutdown on I/O error (in Panic policy)
- reevaluate panic on OOM
- create a REPL for ad hoc interaction (useful for library, not as useful for server) (limitation: exclusive access only): https://docs.rs/clap/latest/clap/_derive/_cookbook/repl/index.html
- #enhancement Streaming remote backup
- #dev set up branch coverage (needs nightly Rust: https://github.com/taiki-e/cargo-llvm-cov/issues/8)
- #dev continuous-integration: `./task/nextest-run-valgrind`
- #dev continuous-integration: `cargo miri nextest run`
- #dev continuous-integration: (with unstable) `cargo fmt --check -- --unstable-features --config-path=./rustfmt-unstable.toml`
- #dev continuous-integration: `cargo run -p lint-unique-error-codes`
- #dev continuous-integration: `cargo run -p lint-unimplemented-sql`
- #dev continuous-integration: `cargo shear`
- #dev continuous-integration: `cargo spellcheck`
- #dev continuous-integration: `typos`
- #test tests for `CREATE TABLE`, `CREATE INDEX` etc that dump the rkyv `TableDef`/`IndexDef` as JSON
- #test regularly test on nightly
  - with `-Zrandomize-layout` (see `flake.nix`)
  - re-evaluate `rustc` flags for enhanced testing every now and then
- #dev lint: no accidental public API changes, and eyeball the public API for sanity
  - `cargo-public-api`
    - broken with Rust >=1.84: https://github.com/cargo-public-api/cargo-public-api/issues/750 #ecosystem #waiting
    - the diff logic is annoying, I really don't want networking or touching my git checkout; use as a library only: https://docs.rs/public-api/latest/public_api/
  - `cargo-semver-checks`
- #dev lint that we re-export every crate exposed by public APIs
- #dev speed up builds
  - study https://matklad.github.io/2021/09/04/fast-rust-builds.html
  - study https://fasterthanli.me/articles/why-is-my-rust-build-so-slow
  - study https://doc.rust-lang.org/cargo/reference/timings.html
  - study https://crates.io/crates/cargo-llvm-lines
  - study https://www.reddit.com/r/rust/comments/1bhpfeb/vastly_improved_recompile_times_in_rust_with/
  - study https://corrode.dev/blog/tips-for-faster-rust-compile-times/
  - study https://nnethercote.github.io/perf-book/build-configuration.html#minimizing-compile-times
- #maybe speed up linking with `lld` or `mold`
- #dev set up MSRV policy
- #doc #website support `/docs/<version>`, navigation between those and `/docs/latest`
- bindings to C
- bindings to TypeScript/JavaScript: Deno, Node.js
- bindings to Python
- does compiling to WASM make sense, given the filesystem use; is a good enough API emerging
Maybe
Possible features we have not at this time committed to implementing, or that are so far out that we just don't have a clear picture of how to implement them.
- #performance `kanto-backend-rocksdb`: use merges where possible (e.g. `x=x+1`)
  - what are the wanted update messages? set, delete, delta_wraparound, delta_saturated? https://en.wikipedia.org/wiki/Fractal_tree_index#Upserts
- `kanto-backend-rocksdb`: use compaction filtering for vacuum
- encryption at rest
- standalone integrity checker/fixer ("fsck")?
- UNIX domain socket support in Postgres-compatible server
- RocksDB bind to C++ directly
- #performance #maybe tracing: filter out some low-level tracing messages when in release mode
  - https://docs.rs/tracing/latest/tracing/level_filters/index.html#compile-time-filters
  - not sure I actually want to remove the `TRACE` level; it's a bit annoying that `DEBUG` is more important than `TRACE`.
- #doc #maybe #urgency/low include a section in the docs with all used licenses
- #admin study tokio-console, would it benefit us
- re-evaluate legacy compatibility with SQLite
  - we have marked some vendor-specific SQL as `NotSupportedSql` that could be implemented; they were just not relevant for Postgres compatibility
- re-evaluate legacy compatibility with MySQL
  - we have marked some vendor-specific SQL as `NotSupportedSql` that could be implemented; they were just not relevant for Postgres compatibility
Small cleanups
- `exact_zip` helper: we use `Iterator::zip` in many places where having non-equal lengths would be a bug; systematize (a sketch follows below)
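A possible shape for the `exact_zip` helper mentioned above; the name and panic behavior are hypothetical, not existing KantoDB API. The point is that a length mismatch surfaces as a loud bug instead of silent truncation.

```rust
/// Like `Iterator::zip`, but panics if the two iterators have different lengths.
/// Hypothetical sketch; a real version might return `Result` or require `ExactSizeIterator`.
fn exact_zip<A, B>(
    a: impl IntoIterator<Item = A>,
    b: impl IntoIterator<Item = B>,
) -> impl Iterator<Item = (A, B)> {
    let mut a = a.into_iter();
    let mut b = b.into_iter();
    std::iter::from_fn(move || match (a.next(), b.next()) {
        (Some(x), Some(y)) => Some((x, y)),
        (None, None) => None,
        _ => panic!("exact_zip: iterators have different lengths"),
    })
}

fn main() {
    let columns = ["id", "name"];
    let values = [1, 2];
    for (col, val) in exact_zip(columns, values) {
        println!("{col} = {val}");
    }
}
```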
Ecosystem wishlist
DataFusion
Specific bugs and missing features:
- DataFusion `CASE WHEN` is wrong: https://github.com/apache/arrow-datafusion/issues/8475
- DataFusion `DataFrame::cache` breaks if some `CASE` branches are `NULL`: https://github.com/apache/arrow-datafusion/issues/8476
- DataFusion `CASE` mishandle: https://github.com/apache/arrow-datafusion/issues/8909
- DataFusion `COALESCE` mishandle: https://github.com/apache/arrow-datafusion/issues/8910 #ecosystem/datafusion #waiting #severity/medium #urgency/low
- DataFusion `MIN` mishandle: https://github.com/apache/datafusion/issues/8911
- DataFusion does not support `avg(DISTINCT)`: https://github.com/apache/arrow-datafusion/issues/2408 #ecosystem/datafusion #waiting #severity/medium #urgency/low
- DataFusion non-unique column names: https://github.com/apache/arrow-datafusion/issues/6543, https://github.com/apache/arrow-datafusion/issues/8379 #ecosystem/datafusion #waiting #severity/high #urgency/medium
- support Postgres `'\x0102'` style escaped strings
- wishlist: DataFusion should have better ways to convert from `ResolvedTableReference` to `TableReference::Full`
- #project-management enumerate DataFusion trouble from sqllogictest `is_datafusion_limitation`, ensure upstream tickets exist
- `TableProvider::constraints` helper function is no longer public: https://github.com/apache/datafusion/issues/15443
Architectural:
- Arrow `is_null`/`is_valid` API is broken by design: https://github.com/apache/arrow-rs/issues/4840
- DataFusion schema handling is not great for us
  - see https://github.com/apache/datafusion/discussions/8887
  - with the "pre-scan" idea, the statement gets scanned twice, once by us and once by DataFusion
- `INSERT RETURNING` (#ref/yn84qzsk1f9ny) et al; or, take more control over DML execution
sqlparser-rs
- `UNCACHE` fails to parse with a semicolon: https://github.com/sqlparser-rs/sqlparser-rs/issues/1244
- docs link to non-English content: https://github.com/apache/datafusion-sqlparser-rs/issues/1523
- `CreateTable.location` is confusing: https://github.com/apache/datafusion-sqlparser-rs/issues/1518 #severity/low
- Snowflake `COPY INTO` fails to parse with a semicolon: https://github.com/apache/datafusion-sqlparser-rs/issues/1519 #severity/low
- Snowflake `LIST @foo` and `REMOVE @foo` fail to parse with a semicolon: https://github.com/apache/datafusion-sqlparser-rs/issues/1773 #severity/low
- `SKEWED BY` not supported/buggy: https://github.com/apache/datafusion-sqlparser-rs/issues/1499 #severity/low
- docs: typo in docs of ClickHouse column option `MATERIALIZED`: https://github.com/apache/datafusion-sqlparser-rs/issues/1760 #severity/low #urgency/low #doc #dev
- docs: `ColumnOption::ForeignKey` docs have wrong SQL syntax: https://github.com/apache/datafusion-sqlparser-rs/issues/1761 #severity/low #urgency/low #doc #dev
sqllogictest-rs
- expected column type API is the wrong way around: https://github.com/risinglightdb/sqllogictest-rs/issues/227
  - our `*.slt` probably has lots of incorrect `query I` lines because nothing is enforcing correctness yet
- sqllogictest-rs can't run sqlite tests: https://github.com/risinglightdb/sqllogictest-rs/issues/148
clippy (?)
- no way to lint or `debug_assert` that we're not blocking in an async context; how do I validate that I never use RocksDB's potentially-blocking FFI functions in Tokio async context?
  - blocking RocksDB calls should only be made inside `block_in_place` or such
  - https://github.com/rust-lang/wg-async/issues/19, https://github.com/rust-lang/wg-async/issues/248
  - clippy issues: https://github.com/rust-lang/rust-clippy/issues/4377, https://github.com/rust-lang/rust-clippy/issues/10794
  - https://github.com/rust-lang/wg-async/issues/73
  - https://github.com/rust-lang/async-book/issues/64
  - https://internals.rust-lang.org/t/warning-when-calling-a-blocking-function-in-an-async-context/11440
  - maybe only use my own wrapper that uses thread-local state for that (a sketch follows below)
  - Tokio's own docs claim that their `block_on` panics if called in the wrong context
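A sketch of the thread-local wrapper idea from the last bullet; names like `allow_blocking` and `debug_assert_blocking_allowed` are hypothetical, not existing KantoDB code. Blocking sections set a flag, and a debug-build check at every potentially-blocking FFI wrapper catches calls made outside such a section.

```rust
use std::cell::Cell;

thread_local! {
    // True while the current thread is inside a section where blocking is allowed.
    static BLOCKING_ALLOWED: Cell<bool> = Cell::new(false);
}

/// Mark a region of code where blocking calls (e.g. RocksDB FFI) are permitted,
/// e.g. the body of a `tokio::task::block_in_place` closure.
fn allow_blocking<R>(f: impl FnOnce() -> R) -> R {
    BLOCKING_ALLOWED.with(|flag| {
        let prev = flag.replace(true);
        let result = f();
        flag.set(prev);
        result
    })
}

/// Call at the top of every wrapper around a potentially-blocking FFI function.
fn debug_assert_blocking_allowed() {
    debug_assert!(
        BLOCKING_ALLOWED.with(Cell::get),
        "blocking call made outside an allow_blocking section"
    );
}

fn rocksdb_get_stub(key: &[u8]) -> Option<Vec<u8>> {
    debug_assert_blocking_allowed();
    // ... a real wrapper would call into RocksDB here ...
    let _ = key;
    None
}

fn main() {
    let value = allow_blocking(|| rocksdb_get_stub(b"k"));
    assert!(value.is_none());
}
```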
tracing
- `tracing_subscriber` (and thus `test_log`) intentional leak makes valgrind annoying: https://github.com/tokio-rs/tracing/issues/2069#issuecomment-1096929365
- the tracing output formats are not very useful
  - wish to combine the best of tailing logs and nested traces
  - I think I want to log to JSON, and use some sort of a viewer GUI -- but what?
Rust
- `rustfmt` fails silently extremely often: https://github.com/rust-lang/rustfmt/issues/3863, https://github.com/rust-lang/rustfmt/issues/6202, https://github.com/rust-lang/rustfmt/issues/3697, https://github.com/rust-lang/rustfmt/issues/6491
  - I find any discussion about "overlong lines" simply a sign of bad design; a formatter needs to work toward a goal, not silently stop working except in the one case where it can achieve perfection; the fact that the line remains long doesn't mean you should stop indenting it (or even worse, the whole containing block)
  - the fact that `rustfmt` can eat comments is really concerning to me; this is just a bad architecture: https://github.com/rust-lang/rustfmt/issues/6044
- `error_generic_member_access`, once stabilized, could help with providing context in `KantoError`: https://github.com/rust-lang/rust/issues/99301, https://github.com/rust-lang/rust/issues/96024
  - Likely reimplement `Op`, maybe `code` too, with this? Make sure the unique code scanner keeps working, including the code kind that now comes from an enum variant.
misc
- branch coverage stabilized: https://github.com/taiki-e/cargo-llvm-cov/issues/8
- uninitialized value in `opendir` (seen with valgrind)
Details
RocksDB bind to C++ directly #maybe
RocksDB is a C++ library we're currently using through a less-featureful C shim. This crate allows calling C++ directly: https://github.com/dtolnay/cxx/. What are the downsides?
See also https://github.com/cozodb/cozo/tree/main/cozorocks, https://crates.io/crates/autocxx.
Compare to https://github.com/hkalbasi/zngur, https://crates.io/crates/cpp.
Prior work at https://github.com/google/autocxx/issues/1073.
`autocxx` has been stagnant since 2022, so I'm pessimistic about this approach now.
MSRV policy
- https://doc.rust-lang.org/cargo/reference/manifest.html#the-rust-version-field
- verify it builds with `cargo-msrv` or just Nix
- discussions:
Streaming remote backup
- also usable for Change Data Capture (CDC), e.g. send to pubsub and update consumers
- inspiration: Litestream (WAL follower), LiteFS (FUSE server that understands SQLite file formats), LiteVFS (SQLite plugin):
- Litestream: https://litestream.io/how-it-works/, https://fly.io/blog/litefs-cloud/, https://fly.io/blog/all-in-on-sqlite-litestream/
- LiteFS has a format that is more efficient to restore (tl;dr: it stores pages, not WAL segments): https://news.ycombinator.com/item?id=36604556
- some details on the LiteFS file format https://news.ycombinator.com/item?id=36606114
- using LiteFS as basis of read-only replicas: https://news.ycombinator.com/item?id=36604370
- experiences using Litestream: https://news.ycombinator.com/item?id=39065201
Streaming remote backup for RocksDB backend
- could be done in a second process with https://github.com/facebook/rocksdb/blob/main/examples/multi_processes_example.cc
- inspiration: replication features of https://github.com/apache/kvrocks
- inspiration: https://github.com/pinterest/rocksplicator (dead project)
Coding conventions
Prefer lints over documenting things here. What needs to be explicitly said, we do not have a lint for.
C pointers
- Name the struct field `raw` (or `raw_*` for multiple).
- Use `std::ptr::NonNull` wherever possible.
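A sketch of the naming convention, with a made-up opaque C handle type standing in for something from a `-sys` crate:

```rust
use std::ptr::NonNull;

// `FfiHandle` is a hypothetical stand-in for an opaque C struct.
#[repr(C)]
struct FfiHandle {
    _private: [u8; 0],
}

// Convention: the C pointer lives in a field named `raw`,
// and `NonNull` documents that it is never null.
struct Handle {
    raw: NonNull<FfiHandle>,
}

impl Handle {
    fn as_ptr(&self) -> *mut FfiHandle {
        self.raw.as_ptr()
    }
}
```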
Keepalives
- Name the struct field `_keepalive_*`, or if it needs to be assigned to after creation, `keepalive_*`.
- Put keepalives at the end of the struct; struct fields are dropped in declaration order.
Reference count keepalives
- If `A` depends on `B` (for example, `rocksdb_a_destroy` must be called before `rocksdb_b_destroy`), and doesn't naturally refer to `B`, add a struct field `_keepalive_b: B` (or `Arc<B>`, if the reference count is not hidden as an `inner` field).
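A sketch of the keepalive conventions, with hypothetical `Database` and `ColumnFamily` types (think of a RocksDB column family handle that must be destroyed before the database):

```rust
use std::sync::Arc;

struct Database {
    // ... would own the underlying C database handle ...
}

struct ColumnFamily {
    name: String,
    // `ColumnFamily` does not otherwise refer to `Database`, but the database
    // must stay alive until this handle is gone.
    // Declared last: struct fields are dropped in declaration order.
    _keepalive_database: Arc<Database>,
}

fn open_column_family(db: &Arc<Database>, name: &str) -> ColumnFamily {
    ColumnFamily {
        name: name.to_owned(),
        _keepalive_database: Arc::clone(db),
    }
}
```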
C pointer keepalives
- If you are passing Rust-allocated memory to C, where C holds on to that memory past the function call, use `std::pin::Pin` to ensure Rust does not move the data from underneath.
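A sketch combining the `Pin` and keepalive conventions; `ffi_register_buffer` is a hypothetical C function that keeps the pointer after the call returns:

```rust
use std::pin::Pin;

// Hypothetical C function that stores the pointer for later use.
unsafe fn ffi_register_buffer(_ptr: *const u8, _len: usize) {}

struct Registered {
    // C still points into this allocation, so keep it alive and pinned
    // for as long as the registration exists.
    _keepalive_buffer: Pin<Box<[u8]>>,
}

fn register(data: Vec<u8>) -> Registered {
    let buffer: Pin<Box<[u8]>> = Pin::new(data.into_boxed_slice());
    // SAFETY (sketch only): the buffer outlives the registration because it is stored below.
    unsafe { ffi_register_buffer(buffer.as_ptr(), buffer.len()) };
    Registered {
        _keepalive_buffer: buffer,
    }
}
```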
No lint leftovers
(Needed to state justification for manual `#[allow]`s.)
Use `#[expect(...)]` instead of `#[allow(...)]` to allow lints.
Acceptable exceptions:
- macros or other generated code that only sometimes triggers the lint rule
For lints that trigger only on some architectures/platforms/configurations, express that with `cfg`.
Indent don't align
Trying to align "columns" in source code where the amount of content in the "left column" varies leads to unnecessary code churn; don't do it.
Examples:
- aligning `#` or `//` in inline comments
- aligning `=` in assignments
`tx` means database transactions, not transmit
The identifier `tx` shall refer to database transactions, or to lower-level transactions such as the RocksDB transaction used to implement a database transaction.
For consistency, avoid `txn`.
Ends of channels are called "sender" and "receiver", not tx/rx.
Rust value builders have `.build()` methods
For consistency, avoid `.finish()`.
Transactions are finished.
Inner error fields are named `error`, not `source`
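A sketch of the convention using `thiserror`; the error type and message here are made up, not the real KantoError definition:

```rust
/// Example only; not actual KantoDB code.
#[derive(Debug, thiserror::Error)]
#[error("backend I/O failed")]
struct BackendIoError {
    // Convention: the wrapped inner error is a field named `error`, not `source`.
    #[source]
    error: std::io::Error,
}
```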
Prefer `Box<[T]>` over `Vec<T>` when the length does not change
Also collect directly into `Box<[T]>`, not into `Vec<T>` only to then call `Vec::into_boxed_slice`.
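A small illustration of the convention (example data only):

```rust
fn main() {
    let names = ["id", "name", "created_at"];

    // Collect straight into the final fixed-length container...
    let upper: Box<[String]> = names.iter().map(|s| s.to_uppercase()).collect();

    // ...instead of going through Vec<T> and into_boxed_slice.
    let upper_via_vec: Box<[String]> = names
        .iter()
        .map(|s| s.to_uppercase())
        .collect::<Vec<_>>()
        .into_boxed_slice();

    assert_eq!(upper, upper_via_vec);
}
```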
Check vs validate vs assert
To validate user input, parse, don't validate: `fn parse(input: &[u8]) -> Result<..., KantoError>`.
Check non-parsing-related operational preconditions, where failing the check is within normal operation and depends on user input.
Think `42.checked_add(13)`.
Typically `fn check_foo(...) -> Result<(), KantoError>`.
Validate invariants and such that are always expected to be true, where failing means corruption or an internal error.
Typically `fn validate_foo(...) -> Result<(), KantoError>`.
Assert additional, potentially costly checks that are only done in debug builds.
Typically `fn assert_foo(...)` with no return value.
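A sketch of the three shapes side by side. The function names and the placeholder error type are illustrative only; the real code would use `KantoError`.

```rust
// Placeholder standing in for the real KantoError.
#[derive(Debug)]
struct KantoError(String);

// Check: a precondition that can fail in normal operation, driven by user input.
fn check_insert_fits(row_count: u64, to_insert: u64) -> Result<(), KantoError> {
    row_count
        .checked_add(to_insert)
        .map(|_| ())
        .ok_or_else(|| KantoError("too many rows".to_owned()))
}

// Validate: an invariant that should always hold; failure means corruption or an internal error.
fn validate_key_length(key: &[u8], expected: usize) -> Result<(), KantoError> {
    if key.len() == expected {
        Ok(())
    } else {
        Err(KantoError(format!(
            "internal error: key length {} != {expected}",
            key.len()
        )))
    }
}

// Assert: potentially costly extra checking, debug builds only, no return value.
fn assert_sorted(keys: &[&[u8]]) {
    debug_assert!(keys.windows(2).all(|w| w[0] <= w[1]), "keys must be sorted");
}

fn main() -> Result<(), KantoError> {
    check_insert_fits(10, 5)?;
    validate_key_length(b"12345678", 8)?;
    assert_sorted(&[b"a".as_slice(), b"b".as_slice()]);
    Ok(())
}
```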
It is less painful to type `tracing::debug!` than to deal with import churn
Sometimes being explicit and verbose is actually the shorter and simpler way.
Manifesto
(For things we do have a lint for.)
Panic policy
The codebase still has undesirable uses of `.unwrap()` and `.expect()`.
We are working to limit their use.
KantoDB should not panic regardless of library API misuse, corrupt or malicious database files, or user input (SQL queries, network protocol). Please report it on the community forum if you see an unexpected panic.
- using `.unwrap()` can be ok during development, but typically not in final code
- can panic on out of memory; for now, this is the easy path, waiting for Rust Allocator changes and the ecosystem to catch up
- ideally, never panic on corrupt data, but this is tough to enforce if panicking on internal errors is allowed; fuzzing is only probabilistic
- debug-mode-only invariant checking can panic (use `debug_assert!`); this should be used only where it is impossible to trigger by file/network input
- open question: should we panic on internal errors where we "know it will be `Some`" but can't express that through APIs; this should be used only where it is impossible to trigger by file/network input
  - current plan is to unify to the `.expect("internal error XYZ: ...")` style (sketched below), add a lint that those are the only expects, and then let them stay in the code as long as a fuzzer doesn't hit them
- never panic on I/O error; however, the default policy might change to a controlled stop & recommend reboot
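A sketch of the planned `.expect()` convention from the list above; the error code and message are made up:

```rust
use std::collections::HashMap;

fn lookup(table: &HashMap<u64, String>, id: u64) -> &str {
    // Internal-error style: a unique, grep-able message.
    // The planned lint would enforce that every remaining expect looks like this.
    table
        .get(&id)
        .expect("internal error kanto/0000: row must exist for id handed out by the scan")
}

fn main() {
    let mut table = HashMap::new();
    table.insert(1, "row one".to_owned());
    println!("{}", lookup(&table, 1));
}
```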
Resources:
- https://www.thecodedmessage.com/posts/2022-07-14-programming-unwrap/ (discussion: HN)
- https://blog.burntsushi.net/unwrap/
Decisions
Not using TableProvider.insert_into
That's a fine convenience, but since there's nothing comparable for `UPDATE` or `DELETE`, we're going to need the less convenient code path anyway.
Links:
- https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html#method.insert_into
- https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/struct.InsertExec.html
- https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html
Developer glossary
Composite key
A key consisting of multiple values.
Primary key
A primary key identifies a row in tables that have `PRIMARY KEY` specified in the schema.
It is a Composite key.
Row key
Whatever identifies a row, whether a Row ID or Primary key.
Row ID
A Row ID identifies a row in tables that do not have `PRIMARY KEY` specified in the schema.
It is a numeric sequential identifier automatically assigned to rows as needed.
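A hypothetical sketch of how these glossary terms relate; these are not the actual KantoDB types:

```rust
// Illustration only, not real KantoDB code.

#[derive(Debug)]
enum Value {
    Int64(i64),
    Text(String),
}

/// Composite key: multiple values acting as one key.
type CompositeKey = Vec<Value>;

/// Row key: whatever identifies a row.
enum RowKey {
    /// Tables without PRIMARY KEY: an automatically assigned sequential number.
    RowId(u64),
    /// Tables with PRIMARY KEY: the primary key, which is a composite key.
    PrimaryKey(CompositeKey),
}
```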
Resources
Unorganized
- `rocksdb` crate docs
- The Design of the Postgres Storage System (VLDB 1987)
- How CockroachDB's MVCC Model Works (RocksDB Meetup 2015, video, slides)
  - they store one key for metadata, then a key per MVCC version
  - custom ordering logic to get MVCC versions sorted largest-first
- RocksDB transactions: https://github.com/facebook/rocksdb/wiki/Transactions#optimistictransactiondb, https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions, https://github.com/facebook/rocksdb/wiki/WriteUnprepared-Transactions
- RocksDB objects must be deleted before exit (or you crash with `"pure virtual method called" in destructor`): https://github.com/facebook/rocksdb/issues/649#issuecomment-119779477 via https://github.com/rust-rocksdb/rust-rocksdb/issues/463
- RocksDB merge operator: https://github.com/facebook/rocksdb/wiki/Merge-Operator, https://github.com/facebook/rocksdb/wiki/Merge-Operator-Implementation
- each RocksDB column family takes about 64 MiB of RAM: https://github.com/facebook/rocksdb/issues/2426