DuckDB
Install dlt with DuckDB
To install the dlt library with DuckDB dependencies, run:
pip install "dlt[duckdb]"
Setup Guide
1. Initialize a project with a pipeline that loads to DuckDB by running:
dlt init chess duckdb
2. Install the necessary dependencies for DuckDB by running:
pip install -r requirements.txt
3. Run the pipeline:
python3 chess_pipeline.py
Write disposition
All write dispositions are supported.
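As a minimal sketch (the players table and its columns are illustrative and not part of the chess example above), the disposition is passed directly to pipeline.run:
import dlt

pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")
data = [{"player_id": 1, "rating": 1850}]

# append: add the rows to the existing table
pipeline.run(data, table_name="players", write_disposition="append")
# replace: drop and recreate the table, then load
pipeline.run(data, table_name="players", write_disposition="replace")
# merge: deduplicate/upsert on the primary key
pipeline.run(data, table_name="players", write_disposition="merge", primary_key="player_id")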
Data loading
dlt will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are okay with installing pyarrow, we suggest switching to parquet as the file format. Loading is faster (and also multithreaded).
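A minimal sketch of switching a single run to parquet, assuming pyarrow is installed (the games table and its data are illustrative):
import dlt

pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")
# loader_file_format overrides the default insert-values format for this run
pipeline.run(
    [{"game_id": 1, "result": "1-0"}],
    table_name="games",
    loader_file_format="parquet",
)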
Data types
duckdb supports various timestamp types. These can be configured using the column flags timezone and precision in the dlt.resource decorator or the pipeline.run method.
- Precision: supported precision values are 0, 3, 6, and 9 for fractional seconds. Note that timezone and precision cannot be used together; attempting to combine them will result in an error.
- Timezone:
  - Setting timezone=False maps to TIMESTAMP.
  - Setting timezone=True (or omitting the flag, which defaults to True) maps to TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ).
Example precision: TIMESTAMP_MS
import dlt

@dlt.resource(
    columns={"event_tstamp": {"data_type": "timestamp", "precision": 3}},
    primary_key="event_id",
)
def events():
    yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123"}]

pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(events())
Example timezone: TIMESTAMP
import dlt

@dlt.resource(
    columns={"event_tstamp": {"data_type": "timestamp", "timezone": False}},
    primary_key="event_id",
)
def events():
    yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}]

pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(events())
Names normalization
dlt uses the standard snake_case naming convention to keep identical table and column identifiers across all destinations. If you want to use duckdb's wide range of characters (e.g., emojis) for table and column names, you can switch to the duck_case naming convention, which accepts almost any string as an identifier:
- \n, \r, and " are translated to _
- multiple _ are translated to a single _
Switch the naming convention using config.toml:
[schema]
naming="duck_case"
or via the env variable SCHEMA__NAMING or directly in the code:
dlt.config["schema.naming"] = "duck_case"
duckdb identifiers are case insensitive but display names preserve case. This may create name collisions if, for example, you load JSON with {"Column": 1, "column": 2}, as it will map data to a single column.
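As a hedged sketch of switching the convention in code and loading identifiers that snake_case would otherwise rewrite (the emoji table and column names are illustrative):
import dlt

# set the naming convention before the pipeline normalizes any data
dlt.config["schema.naming"] = "duck_case"

pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")
# with duck_case, the emoji identifiers are kept (almost) as-is
pipeline.run([{"🦆 count": 1}], table_name="🦆 ducks")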
Supported file formats
You can configure the following file formats to load data to duckdb:
- insert-values is used by default
- parquet is supported
  note: duckdb cannot COPY many parquet files to a single table from multiple threads. In this situation, dlt serializes the loads. Still, that may be faster than INSERT.
- jsonl
duckdb has timestamp types with resolutions from milliseconds to nanoseconds. However, only microsecond resolution (the most commonly used) is time zone aware. dlt generates timestamps with timezones by default, so loading parquet files with default settings will fail (duckdb does not coerce tz-aware timestamps to naive timestamps). Disable the timezones by changing the dlt parquet writer settings as follows:
DATA_WRITER__TIMESTAMP_TIMEZONE=""
This disables tz adjustments.
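The same setting can also be kept in config.toml; the section name below is an assumption based on how dlt maps the DATA_WRITER__ environment prefix to TOML sections:
[data_writer]
timestamp_timezone=""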
Supported column hints
duckdb can create unique indexes for columns with unique hints. However, this feature is disabled by default as it can significantly slow down data loading.
Destination Configuration
By default, a DuckDB database will be created in the current working directory with the name <pipeline_name>.duckdb (chess.duckdb in the example above). After loading, it is available in read/write mode via with pipeline.sql_client() as con:, which is a wrapper over DuckDBPyConnection. See the duckdb docs for details.
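A minimal sketch of querying the database through that client after a load (the query only lists tables, so it works regardless of what was loaded):
import dlt

pipeline = dlt.pipeline(pipeline_name="chess", destination="duckdb", dataset_name="chess_data")

with pipeline.sql_client() as client:
    # execute_sql returns the result rows of a SELECT statement
    rows = client.execute_sql("SELECT table_name FROM information_schema.tables")
    print(rows)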
The duckdb credentials do not require any secret values. You are free to pass the credentials and configuration explicitly. For example:
import dlt

# will load data to files/data.db (relative path) database file
p = dlt.pipeline(
    pipeline_name='chess',
    destination=dlt.destinations.duckdb("files/data.db"),
    dataset_name='chess_data',
    dev_mode=False,
)

# will load data to /var/local/database.duckdb (absolute path)
p = dlt.pipeline(
    pipeline_name='chess',
    destination=dlt.destinations.duckdb("/var/local/database.duckdb"),
    dataset_name='chess_data',
    dev_mode=False,
)
The destination accepts a duckdb connection instance via credentials, so you can also open a database connection yourself and pass it to dlt to use.
import dlt
import duckdb

db = duckdb.connect()
p = dlt.pipeline(
    pipeline_name="chess",
    destination=dlt.destinations.duckdb(db),
    dataset_name="chess_data",
    dev_mode=False,
)

# Or if you would like to use an in-memory duckdb instance
db = duckdb.connect(":memory:")
p = dlt.pipeline(
    pipeline_name="in_memory_pipeline",
    destination=dlt.destinations.duckdb(db),
    dataset_name="chess_data",
)

print(db.sql("DESCRIBE;"))
# Example output
# ┌──────────┬────────────┬─────────────────────┬─────────────────────┬──────────────────────┬───────────┐
# │ database │   schema   │        name         │    column_names     │     column_types     │ temporary │
# │ varchar  │  varchar   │       varchar       │      varchar[]      │      varchar[]       │  boolean  │
# ├──────────┼────────────┼─────────────────────┼─────────────────────┼──────────────────────┼───────────┤
# │ memory   │ chess_data │ _dlt_loads          │ [load_id, schema_n… │ [VARCHAR, VARCHAR, … │ false     │
# │ memory   │ chess_data │ _dlt_pipeline_state │ [version, engine_v… │ [BIGINT, BIGINT, VA… │ false     │
# │ memory   │ chess_data │ _dlt_version        │ [version, engine_v… │ [BIGINT, BIGINT, TI… │ false     │
# │ memory   │ chess_data │ my_table            │ [a, _dlt_load_id, … │ [BIGINT, VARCHAR, V… │ false     │
# └──────────┴────────────┴─────────────────────┴─────────────────────┴──────────────────────┴───────────┘
Be careful! The in-memory instance of the database will be destroyed once your Python script exits.
This destination accepts database connection strings in the format used by duckdb-engine.
You can configure a DuckDB destination with secret / config values (e.g., using a secrets.toml file):
destination.duckdb.credentials="duckdb:///_storage/test_quack.duckdb"
The duckdb:// URL above creates a relative path to _storage/test_quack.duckdb. To define an absolute path, you need to specify four slashes, i.e., duckdb:////_storage/test_quack.duckdb.
dlt supports a unique connection string that triggers specific behavior for the duckdb destination:
- :pipeline: creates the database in the working directory of the pipeline, naming it quack.duckdb.
Please see the code snippets below showing how to use it:
- Via config.toml
destination.duckdb.credentials=":pipeline:"
- In Python code
p = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination=dlt.destinations.duckdb(":pipeline:"),
)
Additional configuration
Unique indexes may be created during loading if the following config value is set:
[destination.duckdb]
create_indexes=true
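If you prefer to configure this in code, the duckdb destination factory likely accepts the same flag directly (an assumption; verify against your dlt version):
import dlt

# assumption: create_indexes is accepted by the destination factory
dest = dlt.destinations.duckdb("files/data.db", create_indexes=True)
pipeline = dlt.pipeline(pipeline_name="chess", destination=dest, dataset_name="chess_data")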
dbt support
This destination integrates with dbt via dbt-duckdb, which is a community-supported package. The duckdb database is shared with dbt. In rare cases, you may see information that the binary database format does not match the database format expected by dbt-duckdb. You can avoid that by updating the duckdb package in your dlt project with pip install -U duckdb.
Syncing of dlt state
This destination fully supports dlt state sync.
Additional Setup guides
- Load data from Jira to DuckDB in Python with dlt
- Load data from Coinbase to DuckDB in Python with dlt
- Load data from Bitbucket to DuckDB in Python with dlt
- Load data from Azure Cloud Storage to DuckDB in Python with dlt
- Load data from Microsoft SQL Server to DuckDB in Python with dlt
- Load data from HubSpot to DuckDB in Python with dlt
- Load data from Klaviyo to DuckDB in Python with dlt
- Load data from Google Analytics to DuckDB in Python with dlt
- Load data from Rest API to DuckDB in Python with dlt
- Load data from Adobe Analytics to DuckDB in Python with dlt