chdb-core
chdb-core is the foundational engine of the chDB ecosystem — an in-process SQL OLAP Engine powered by ClickHouse [^1]
Table of Contents
chDB Ecosystem
The chDB project is split into two packages:
| Package |
Role |
Install |
| chdb-core (this repo) |
C++ engine + Session / Connection / DB-API interfaces |
pip install chdb-core |
| chDB |
Pandas-compatible DataStore API built on top of chdb-core |
pip install chdb |
┌───────────────────────────────────────────┐
│ chDB (pip install chdb) │
│ ┌─────────────────────────────────────┐ │
│ │ DataStore: pandas-like lazy API │ │
│ │ QueryPlanner / dual-engine exec │ │
│ └──────────────────┬──────────────────┘ │
│ │ │
│ ┌──────────────────▼──────────────────┐ │
│ │ chdb-core (pip install chdb-core) │ │
│ │ C++ ClickHouse Engine │ │
│ │ Session / Connection / DB-API │ │
│ │ query() / UDF / Stream │ │
│ └─────────────────────────────────────┘ │
└───────────────────────────────────────────┘
chdb-core provides everything you need to run SQL queries with ClickHouse performance — no server required. If you want a higher-level pandas-compatible DataFrame API, install chDB instead.
Features
- In-process SQL OLAP Engine, powered by ClickHouse
- No need to install ClickHouse
- Minimized data copy from C++ to Python with python memoryview
- Input & Output support Parquet, CSV, JSON, Arrow, ORC and 60+ more formats
- Session and Connection management with stateful queries
- Streaming query support for constant-memory processing
- Python DB-API 2.0 compliance
- User Defined Functions (UDF) support
- AI-assisted SQL generation
Architecture
Installation
Currently, chdb-core supports Python 3.9+ on macOS and Linux (x86_64 and ARM64).
pip install chdb-core
Quick Start
import chdb
result = chdb.query("SELECT version()", "Pretty")
print(result)
API Reference
One-shot Query
The simplest way to run SQL — no session or connection needed:
import chdb
# Basic query with CSV output (default)
result = chdb.query("SELECT 1, 'hello'")
print(result)
# Pandas DataFrame output
df = chdb.query("SELECT number, number * 2 AS double FROM numbers(10)", "DataFrame")
print(df)
# Parameterized queries
df = chdb.query(
"SELECT toDate({base_date:String}) + number AS date "
"FROM numbers({total_days:UInt64}) "
"LIMIT {items_per_page:UInt64}",
"DataFrame",
params={"base_date": "2025-01-01", "total_days": 10, "items_per_page": 5},
)
print(df)
Query on Files (Parquet, CSV, JSON, Arrow, ORC and 60+)
import chdb
res = chdb.query('SELECT * FROM file("data.parquet", Parquet)', "JSON")
print(res)
res = chdb.query('SELECT * FROM file("data.csv", CSV)', "CSV")
print(res)
# Query result statistics
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, "
f"storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, "
f"elapsed {res.elapsed()} seconds")
# Pandas DataFrame output
chdb.query('SELECT * FROM file("data.parquet", Parquet)', "Dataframe")
Connection API
Connection-based API for cursor-style interaction, supporting both in-memory and file-based databases:
import chdb
conn = chdb.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test (id UInt32, name String) ENGINE = Memory")
cur.execute("INSERT INTO test VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie')")
cur.execute("SELECT * FROM test ORDER BY id")
print(cur.fetchone()) # (1, 'Alice')
print(cur.fetchmany(2)) # ((2, 'Bob'), (3, 'Charlie'))
print(cur.column_names()) # ['id', 'name']
print(cur.column_types()) # ['UInt32', 'String']
# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
print(row)
# Always close resources when done
cur.close()
conn.close()
For more details, see examples/connect.py.
# File-based persistent database
conn = chdb.connect("mydata.db")
conn.query("CREATE TABLE IF NOT EXISTS logs (ts DateTime, msg String) ENGINE = MergeTree ORDER BY ts")
conn.query("INSERT INTO logs VALUES (now(), 'started')")
result = conn.query("SELECT * FROM logs", "Pretty")
print(result)
conn.close()
Stateful Session
Sessions provide a higher-level API with automatic resource management:
from chdb import session as chs
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table (x String, y Int) ENGINE = Log")
sess.query("INSERT INTO db_xxx.log_table VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5)")
sess.query("CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table LIMIT 4")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))
see also: test_stateful.py.
Streaming Query
Process large datasets with constant memory usage through chunked streaming:
from chdb import session as chs
sess = chs.Session()
rows_cnt = 0
with sess.send_query("SELECT * FROM numbers(200000)", "CSV") as stream_result:
for chunk in stream_result:
rows_cnt += chunk.rows_read()
print(rows_cnt) # 200000
# Example 2: Manual iteration with fetch()
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
chunk = stream_result.fetch()
if chunk is None:
break
rows_cnt += chunk.rows_read()
print(rows_cnt) # 200000
For more details, see test_streaming_query.py.
Python DB-API 2.0
import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))
conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()
Query on Table (Pandas DataFrame, Parquet file/bytes, Arrow bytes)
import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))
# Pandas DataFrames are automatically registered as temporary tables in ClickHouse
chdb.query("SELECT * FROM Python(df1) t1 JOIN Python(df2) t2 ON t1.a = t2.c").show()
Python Table Engine
Query on Pandas DataFrame
import chdb
import pandas as pd
df = pd.DataFrame(
{
"a": [1, 2, 3, 4, 5, 6],
"b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
}
)
chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()
Query on Arrow Table
import chdb
import pyarrow as pa
arrow_table = pa.table(
{
"a": [1, 2, 3, 4, 5, 6],
"b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
}
)
chdb.query("SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b").show()
see also: test_query_py.py.
User Defined Functions (UDF)
chDB supports native Python UDFs that run in-process with full type safety.
import chdb
from chdb.session import Session
from chdb.sqltypes import INT64, STRING
sess = Session()
# Using the @chdb.func decorator
@chdb.func([INT64, INT64], INT64)
def add(a, b):
return a + b
print(sess.query("SELECT add(12, 22)"))
# With type annotations (types inferred automatically)
@chdb.func()
def multiply(a: int, b: int) -> int:
return a * b
print(sess.query("SELECT multiply(3, 7)"))
# Using chdb.create_function directly
chdb.create_function("strlen", len, arg_types=[STRING], return_type=INT64)
print(sess.query("SELECT strlen('hello')"))
# Remove a registered function
chdb.drop_function("strlen")
Key features:
- Type-safe: supports
INT64, FLOAT64, STRING, BOOL, DATETIME64, etc. See chdb.sqltypes for full list.
- Type inference: automatically inferred from Python annotations (
int, str, bool, etc.)
- NULL handling:
on_null=NullHandling.SKIP (default) skips the function call and returns NULL; NullHandling.PASS passes None to the function.
- Exception handling:
on_error=ExceptionHandling.PROPAGATE (default) raises the error to the caller; ExceptionHandling.IGNORE returns NULL for that row and continues.
See also: test_func_udf.py, test_func_udf_types.py.
Query Progress
import chdb
# Auto-detect: TTY progress in terminal, progress bar in notebook
conn = chdb.connect(":memory:?progress=auto")
conn.query("SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4")
Progress options: progress=auto | progress=tty | progress=err | progress=off
AI-assisted SQL Generation
import chdb
conn = chdb.connect("file::memory:?ai_provider=openai&ai_model=gpt-4o-mini")
conn.query("CREATE TABLE nums (n UInt32) ENGINE = Memory")
conn.query("INSERT INTO nums VALUES (1), (2), (3)")
sql = conn.generate_sql("Select all rows from nums ordered by n desc")
print(sql) # SELECT * FROM nums ORDER BY n DESC
print(conn.ask("List the numbers table", format="Pretty"))
Command Line
python3 -m chdb SQL [OutputFormat]
python3 -m chdb "SELECT 1,'abc'" Pretty
For more examples, see examples and tests.
Demos and Examples
Benchmark
---
Documentation
AI Coding Agent Skill
chDB provides an AI Skill that teaches AI coding agents (Cursor, Claude Code, etc.) chDB's multi-source data analytics API. Install it so your AI assistant can write correct chDB code out of the box:
curl -sL https://raw.githubusercontent.com/chdb-io/chdb/main/install_skill.sh | bash
Events
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Help test and report bugs
- Help improve documentation
- Help improve code quality and performance
Bindings
We welcome bindings for other languages, please refer to bindings for more details.
Version Guide
Please refer to VERSION-GUIDE.md for more details.
Paper
License
Apache 2.0, see LICENSE for more information.
Acknowledgments
chDB is mainly based on ClickHouse [^1]
for trade mark and other reasons, I named it chDB.
[^1]: ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.