Filesystem & buckets
The Filesystem destination stores data in remote file systems and bucket storages like S3, Google Storage, or Azure Blob Storage. Underneath, it uses fsspec to abstract file operations. Its primary role is to be used as a staging for other destinations, but you can also quickly build a data lake with it.
💡 Please read the notes on the layout of the data files. Currently, we are getting feedback on it. Please join our Slack (icon at the top of the page) and help us find the optimal layout.
Install dlt with filesystem​
To install the dlt library with filesystem dependencies:
pip install "dlt[filesystem]"
This installs the s3fs and botocore packages.
You may also install the dependencies independently. Try:
pip install dlt
pip install s3fs
so pip does not fail on backtracking.
Initialise the dlt project​
Let's start by initializing a new dlt project as follows:
dlt init chess filesystem
This command will initialize your pipeline with chess as the source and the AWS S3 filesystem as the destination.
Set up bucket storage and credentials​
AWS S3​
The command above creates a sample secrets.toml and a requirements file for an AWS S3 bucket. You can install those dependencies by running:
pip install -r requirements.txt
To edit the dlt credentials file with your secret info, open .dlt/secrets.toml, which looks like this:
[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,
[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
If you have your credentials stored in ~/.aws/credentials, just remove the [destination.filesystem.credentials] section above, and dlt will fall back to your default profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: dlt-ci-user):
[destination.filesystem.credentials]
profile_name="dlt-ci-user"
You can also pass an AWS region:
[destination.filesystem.credentials]
region_name="eu-central-1"
You need to create an S3 bucket and a user who can access that bucket. dlt
does not create buckets automatically.
You can create the S3 bucket in the AWS console by clicking on "Create Bucket" in S3 and assigning the appropriate name and permissions to the bucket.
Once the bucket is created, you'll have the bucket URL. For example, if the bucket name is dlt-ci-test-bucket, then the bucket URL will be: s3://dlt-ci-test-bucket
To grant permissions to the user being used to access the S3 bucket, go to IAM > Users, and click on “Add Permissions”.
Below you can find a sample policy that grants the minimum permissions dlt requires on the bucket we created above. The policy contains permissions to list files in a bucket and to get, put, and delete objects. Remember to place your bucket name in the Resource section of the policy!
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DltBucketAccess",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:GetObjectAttributes",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::dlt-ci-test-bucket/*",
"arn:aws:s3:::dlt-ci-test-bucket"
]
}
]
}
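If you manage several buckets, the policy above can be generated rather than edited by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_dlt_bucket_policy` is hypothetical and is not part of dlt or any AWS tooling.

```python
import json


def build_dlt_bucket_policy(bucket_name: str) -> dict:
    """Return the sample policy shown above for the given bucket (hypothetical helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DltBucketAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:GetObjectAttributes",
                    "s3:ListBucket",
                ],
                # Same two resources as in the sample: the objects and the bucket itself.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            }
        ],
    }


policy = build_dlt_bucket_policy("dlt-ci-test-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor after checking that the bucket name is the one you intend to grant access to.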
- To obtain the access key and secret key for the user, go to IAM > Users, open “Security Credentials”, click on “Create Access Key”, preferably select “Command Line Interface”, and create the access key.
- Copy the “Access Key” and “Secret Access Key” that were created; these are the values to use in "secrets.toml".
Using S3 compatible storage​
To use an S3 compatible storage other than AWS S3, like MinIO or Cloudflare R2, you may supply an endpoint_url in the config. This should be set along with AWS credentials:
[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,
[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
endpoint_url = "https://<account_id>.r2.cloudflarestorage.com" # copy your endpoint URL here
Adding Additional Configuration​
To pass any additional arguments to fsspec, you may supply kwargs and client_kwargs in the config as a stringified dictionary:
[destination.filesystem]
kwargs = '{"use_ssl": true, "auto_mkdir": true}'
client_kwargs = '{"verify": "public.crt"}'
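Because `kwargs` and `client_kwargs` are passed as strings, a single stray quote makes them unparseable. One way to avoid that, sketched below with the standard library only, is to build the dictionaries in Python and serialize them with `json.dumps`. That the strings are read as JSON is an assumption based on the lowercase `true` literals in the example above.

```python
import json

# The same values as in the TOML example above.
fs_kwargs = {"use_ssl": True, "auto_mkdir": True}
client_kwargs = {"verify": "public.crt"}

# json.dumps emits lowercase true/false and double-quoted keys.
kwargs_str = json.dumps(fs_kwargs)
client_kwargs_str = json.dumps(client_kwargs)

# Wrap in single quotes so the result is a TOML literal string.
print(f"kwargs = '{kwargs_str}'")
print(f"client_kwargs = '{client_kwargs_str}'")
```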
Google Storage​
Run pip install "dlt[gs]" which will install the gcsfs package.
To edit the dlt credentials file with your secret info, open .dlt/secrets.toml.
You'll see AWS credentials by default. Replace them with the Google Cloud credentials that you may know from the BigQuery destination:
[destination.filesystem]
bucket_url = "gs://[your_bucket_name]" # replace with your bucket name,
[destination.filesystem.credentials]
project_id = "project_id" # please set me up!
private_key = "private_key" # please set me up!
client_email = "client_email" # please set me up!
Note that you can share the same credentials with BigQuery: replace the [destination.filesystem.credentials] section with a less specific one, [destination.credentials], which applies to both destinations.
If you have default Google Cloud credentials in your environment (e.g. on a cloud function), remove the credentials sections above and dlt will fall back to the available default.
Use Cloud Storage admin to create a new bucket. Then assign the Storage Object Admin role to your service account.
Azure Blob Storage​
Run pip install "dlt[az]" which will install the adlfs package to interface with Azure Blob Storage.
Edit the credentials in .dlt/secrets.toml. You'll see AWS credentials by default; replace them with your Azure credentials.
Two forms of Azure credentials are supported:
SAS token credentials​
Supply storage account name and either sas token or storage account key
[destination.filesystem]
bucket_url = "az://[your_container name]" # replace with your container name
[destination.filesystem.credentials]
# The storage account name is always required
azure_storage_account_name = "account_name" # please set me up!
# You can set either account_key or sas_token, only one is needed
azure_storage_account_key = "account_key" # please set me up!
azure_storage_sas_token = "sas_token" # please set me up!
If you have the correct Azure credentials set up on your machine (e.g. via azure cli), you can omit both azure_storage_account_key and azure_storage_sas_token and dlt will fall back to the available default.
Note that azure_storage_account_name is still required as it can't be inferred from the environment.
Service principal credentials​
Supply a client ID, client secret and a tenant ID for a service principal authorized to access your container
[destination.filesystem]
bucket_url = "az://[your_container name]" # replace with your container name
[destination.filesystem.credentials]
azure_client_id = "client_id" # please set me up!
azure_client_secret = "client_secret"
azure_tenant_id = "tenant_id" # please set me up!
Concurrent blob uploads
dlt limits the number of concurrent connections for a single uploaded blob to 1. By default, adlfs, which we use, splits blobs into 4 MB chunks and uploads them concurrently, which leads to gigabytes of used memory and thousands of connections for larger load packages. You can increase the maximum concurrency as follows:
[destination.filesystem.kwargs]
max_concurrency=3
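To see why unbounded concurrency gets expensive, it helps to count chunks. The sketch below only does the arithmetic for the 4 MB chunk size mentioned above. The blob sizes are made-up examples, and the real memory and connection usage depends on adlfs internals that are not modeled here.

```python
import math

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB chunks, as described above


def chunk_count(blob_size_bytes: int) -> int:
    """Number of chunks a single blob of the given size is split into."""
    return math.ceil(blob_size_bytes / CHUNK_SIZE)


for size_gb in (1, 10):  # hypothetical blob sizes
    size_bytes = size_gb * 1024**3
    print(f"{size_gb} GB blob -> {chunk_count(size_bytes)} chunks")
```

With many blobs in one load package, these per-blob chunk counts add up, which is the motivation for keeping max_concurrency low.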
Local file system​
If for any reason you want to have those files in a local folder, set up the bucket_url as follows (you are free to use config.toml for that as there are no secrets required):
[destination.filesystem]
bucket_url = "file:///absolute/path" # three / for an absolute path
For handling deeply nested layouts, consider enabling automatic directory creation for the local filesystem destination. This can be done by setting kwargs in secrets.toml:
[destination.filesystem]
kwargs = '{"auto_mkdir": true}'
Or by setting environment variable:
export DESTINATION__FILESYSTEM__KWARGS='{"auto_mkdir": true}'  # use false to disable
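The environment variable above follows a visible pattern: the TOML section path and the key are upper-cased and joined with double underscores. The sketch below derives such names from a nested dictionary. The helper `to_env_vars` is hypothetical, and applying the same pattern to the credentials keys is an extrapolation from this one example.

```python
import json


def to_env_vars(prefix: str, values: dict) -> dict:
    """Flatten a nested config dict into NAME -> string pairs (hypothetical helper)."""
    env = {}
    for key, value in values.items():
        name = f"{prefix}__{key}".upper()
        if isinstance(value, dict):
            # Nested TOML sections become additional "__"-separated segments.
            env.update(to_env_vars(name, value))
        else:
            env[name] = value if isinstance(value, str) else json.dumps(value)
    return env


config = {
    "bucket_url": "file:///absolute/path",
    "kwargs": '{"auto_mkdir": true}',
    "credentials": {"aws_access_key_id": "please set me up!"},
}
env_vars = to_env_vars("destination__filesystem", config)
for name, value in env_vars.items():
    print(f"export {name}='{value}'")
```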
dlt correctly handles native local file paths. Indeed, using the file:// scheme may not be intuitive, especially for Windows users.
[destination.unc_destination]
bucket_url = 'C:\a\b\c'
In the example above we specify bucket_url using toml's literal strings that do not require escaping of backslashes.
[destination.unc_destination]
bucket_url = '\\localhost\c$\a\b\c' # UNC equivalent of C:\a\b\c
[destination.posix_destination]
bucket_url = '/var/local/data' # absolute POSIX style path
[destination.relative_destination]
bucket_url = '_storage/data' # relative POSIX style path
In the examples above we define a few named filesystem destinations:
- unc_destination demonstrates a Windows UNC path in native form
- posix_destination demonstrates a native POSIX (Linux/Mac) absolute path
- relative_destination demonstrates a native POSIX (Linux/Mac) relative path. In this case, the filesystem destination will store files in the $cwd/_storage/data path, where $cwd is your current working directory.
dlt supports Windows UNC paths with the file:// scheme. They can be specified using the host or purely as the path component.
[destination.unc_with_host]
bucket_url="file://localhost/c$/a/b/c"
[destination.unc_with_path]
bucket_url="file:////localhost/c$/a/b/c"
Windows supports paths up to 255 characters. When you access a path longer than 255 characters, you'll see a FileNotFound exception.
To go over this limit, you can use extended paths. dlt recognizes both regular and UNC extended paths:
[destination.regular_extended]
bucket_url = '\\?\C:\a\b\c'
[destination.unc_extended]
bucket_url='\\?\UNC\localhost\c$\a\b\c'
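The two extended forms above differ only in their prefix. As a rough illustration, the sketch below converts a native Windows path into its extended form using plain string handling. The helper `to_extended_path` is hypothetical and not part of dlt, and it assumes the input is already an absolute native path.

```python
def to_extended_path(path: str) -> str:
    """Add the Windows extended-length prefix to an absolute native path."""
    if path.startswith("\\\\?\\"):
        return path  # already in extended form
    if path.startswith("\\\\"):
        # UNC path: \\host\share\... becomes \\?\UNC\host\share\...
        return "\\\\?\\UNC\\" + path[2:]
    # Drive path: C:\... becomes \\?\C:\...
    return "\\\\?\\" + path


print(to_extended_path(r"C:\a\b\c"))
print(to_extended_path(r"\\localhost\c$\a\b\c"))
```

The results match the two bucket_url values shown in the configuration above.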
SFTP​
Run pip install "dlt[sftp]" which will install the paramiko package alongside dlt, enabling secure SFTP transfers.
Configure your SFTP credentials by editing the .dlt/secrets.toml
file. By default, the file contains placeholders for AWS credentials. You should replace these with your SFTP credentials.
Below are the possible fields for SFTP credentials configuration:
sftp_port # The port for SFTP, defaults to 22 (standard for SSH/SFTP)
sftp_username # Your SFTP username, defaults to None
sftp_password # Your SFTP password (if using password-based auth), defaults to None
sftp_key_filename # Path to your private key file for key-based authentication, defaults to None
sftp_key_passphrase # Passphrase for your private key (if applicable), defaults to None
sftp_timeout # Timeout for establishing a connection, defaults to None
sftp_banner_timeout # Timeout for receiving the banner during authentication, defaults to None
sftp_auth_timeout # Authentication timeout, defaults to None
sftp_channel_timeout # Channel timeout for SFTP operations, defaults to None
sftp_allow_agent # Use SSH agent for key management (if available), defaults to True
sftp_look_for_keys # Search for SSH keys in the default SSH directory (~/.ssh/), defaults to True
sftp_compress # Enable compression (can improve performance over slow networks), defaults to False
sftp_gss_auth # Use GSS-API for authentication, defaults to False
sftp_gss_kex # Use GSS-API for key exchange, defaults to False
sftp_gss_deleg_creds # Delegate credentials with GSS-API, defaults to True
sftp_gss_host # Host for GSS-API, defaults to None
sftp_gss_trust_dns # Trust DNS for GSS-API, defaults to True
For more information about credentials parameters: https://docs.paramiko.org/en/3.3/api/client.html#paramiko.client.SSHClient.connect
Authentication Methods​
SFTP authentication is attempted in the following order of priority:
1. Key-based authentication: If you provide a key_filename containing the path to a private key or a corresponding OpenSSH public certificate (e.g., id_rsa and id_rsa-cert.pub), these will be used for authentication. If the private key requires a passphrase, specify it via sftp_key_passphrase and it will be used to attempt to unlock the key.
2. SSH Agent-based authentication: If allow_agent=True (default), Paramiko will look for any SSH keys stored in your local SSH agent (such as id_rsa, id_dsa, or id_ecdsa keys stored in ~/.ssh/).
3. Username/Password authentication: If a password is provided (sftp_password), plain username/password authentication will be attempted.
4. GSS-API authentication: If GSS-API (Kerberos) is enabled (sftp_gss_auth=True), authentication will use the Kerberos protocol. GSS-API may also be used for key exchange (sftp_gss_kex=True) and credential delegation (sftp_gss_deleg_creds=True). This method is useful in environments where Kerberos is set up, often in enterprise networks.
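The priority order can be pictured as a simple filter over the configured fields. The sketch below is a simplified model for illustration only: `auth_attempt_order` is a hypothetical helper, it uses the `sftp_`-prefixed field names from the list of credentials fields, and it does not reproduce Paramiko's actual connection logic (for example, `sftp_look_for_keys` is ignored).

```python
def auth_attempt_order(credentials: dict) -> list:
    """List the authentication methods that would be tried, in priority order."""
    order = []
    if credentials.get("sftp_key_filename"):
        order.append("key")
    if credentials.get("sftp_allow_agent", True):  # defaults to True
        order.append("agent")
    if credentials.get("sftp_password"):
        order.append("password")
    if credentials.get("sftp_gss_auth", False):  # defaults to False
        order.append("gss-api")
    return order


print(auth_attempt_order({"sftp_username": "foo", "sftp_password": "pass"}))
print(auth_attempt_order({"sftp_key_filename": "/path/to/id_rsa", "sftp_allow_agent": False}))
```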
1. Key-based Authentication​
If you use an SSH key instead of a password, you can specify the path to your private key in the configuration.
[destination.filesystem]
bucket_url = "sftp://[hostname]/[path]"
file_glob = "*"
[destination.filesystem.credentials]
sftp_username = "foo"
sftp_key_filename = "/path/to/id_rsa" # Replace with the path to your private key file
sftp_key_passphrase = "your_passphrase" # Optional: passphrase for your private key
2. SSH Agent-based Authentication​
If you have an SSH agent running with loaded keys, you can allow Paramiko to use these keys automatically. You can omit the password and key fields if you're relying on the SSH agent.
[destination.filesystem]
bucket_url = "sftp://[hostname]/[path]"
file_glob = "*"
[destination.filesystem.credentials]
sftp_username = "foo"
sftp_key_passphrase = "your_passphrase" # Optional: passphrase for your private key
The loaded key must be one of the following types stored in ~/.ssh/: id_rsa, id_dsa, or id_ecdsa.
3. Username/Password Authentication​
This is the simplest form of authentication, where you supply a username and password directly.
[destination.filesystem]
bucket_url = "sftp://[hostname]/[path]" # The hostname of your SFTP server and the remote path
file_glob = "*" # Pattern to match the files you want to upload/download
[destination.filesystem.credentials]
sftp_username = "foo" # Replace "foo" with your SFTP username
sftp_password = "pass" # Replace "pass" with your SFTP password
Notes:​
- Key-based Authentication: Make sure your private key has the correct permissions (chmod 600), or SSH will refuse to use it.
- Timeouts: It's important to adjust timeout values based on your network conditions to avoid connection issues.
This configuration allows flexible SFTP authentication, whether you're using passwords, keys, or agents, and ensures secure communication between your local environment and the SFTP server.
Write disposition​
The filesystem destination handles the write dispositions as follows:
- append - files belonging to such tables are added to the dataset folder
- replace - all files that belong to such tables are deleted from the dataset folder, and then the current set of files is added.
- merge - falls back to append
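The three dispositions can be modeled in memory to make the difference concrete. This is a toy model only: `apply_disposition` is a hypothetical function, the file names are made up, and it does not touch any real storage or reflect dlt's implementation.

```python
def apply_disposition(existing: dict, table: str, new_files: list, disposition: str) -> dict:
    """Return the dataset folder contents (table -> files) after one load."""
    result = {name: list(files) for name, files in existing.items()}
    if disposition == "replace":
        # Old files of this table are dropped, then the new set is added.
        result[table] = list(new_files)
    elif disposition in ("append", "merge"):  # merge falls back to append
        result.setdefault(table, []).extend(new_files)
    else:
        raise ValueError(f"unknown write disposition: {disposition}")
    return result


folder = {"players": ["players/100.a.jsonl"]}
print(apply_disposition(folder, "players", ["players/200.b.jsonl"], "append"))
print(apply_disposition(folder, "players", ["players/200.b.jsonl"], "replace"))
```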
🧪 merge
with delta
table format​
The upsert merge strategy is supported when using the delta table format.
The upsert merge strategy for the filesystem destination with delta table format is considered experimental.
import dlt

@dlt.resource(
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key="my_primary_key",
    table_format="delta"
)
def my_upsert_resource():
    ...
...
Known limitations​
- hard_delete hint not supported
- deleting records from child tables not supported
  - This means updates to json columns that involve element removals are not propagated. For example, if you first load {"key": 1, "nested": [1, 2]} and then load {"key": 1, "nested": [1]}, then the record for element 2 will not be deleted from the child table.
File Compression​
The filesystem destination in the dlt library uses gzip compression by default for efficiency, which may result in the files being stored in a compressed format. This format may not be easily readable as plain text or JSON Lines (jsonl) files. If you encounter files that seem unreadable, they may be compressed.
To handle compressed files:
- To disable compression, you can modify the data_writer.disable_compression setting in your "config.toml" file. This can be useful if you want to access the files directly without needing to decompress them. For example:
[normalize.data_writer]
disable_compression=true
- To decompress a gzip file, you can use tools like gunzip. This will convert the compressed file back to its original format, making it readable.
For more details on managing file compression, please visit our documentation on performance optimization: Disabling and Enabling File Compression.
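Compressed files can also be read directly from Python without decompressing them on disk first. The sketch below uses only the standard library; it writes a small gzip-compressed JSON Lines file to a temporary folder and reads it back. The file name and the sample rows are made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Write a small gzip-compressed JSON Lines file to stand in for a loaded file.
rows = [{"id": 1, "name": "magnus"}, {"id": 2, "name": "hikaru"}]  # made-up rows
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")  # hypothetical file name
with gzip.open(path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back: gzip.open in text mode decompresses transparently.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```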
Files layout​
All the files are stored in a single folder with the name of the dataset that you passed to the run or load methods of the pipeline. In our example chess pipeline, it is chess_players_games_data.
Bucket storages are, in fact, key-blob storage so the folder structure is emulated by splitting file names into components by separator (/).
You can control files layout by specifying the desired configuration. There are several ways to do this.
Default layout​
Current default layout: {table_name}/{load_id}.{file_id}.{ext}
The default layout format has changed from {schema_name}.{table_name}.{load_id}.{file_id}.{ext}
to {table_name}/{load_id}.{file_id}.{ext}
in dlt 0.3.12. You can revert to the old layout by setting it manually.
Available layout placeholders​
Standard placeholders​
- schema_name - the name of the schema
- table_name - the table name
- load_id - the id of the load package from which the file comes
- file_id - the id of the file; if there are many files with data for a single table, they are copied with different file ids
- ext - the format of the file, e.g. jsonl or parquet
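The placeholder syntax matches Python's `str.format` fields, so a layout can be previewed with plain string formatting. The sketch below is an approximation for illustration; all values are made up, and dlt's own rendering may differ in details.

```python
current_layout = "{table_name}/{load_id}.{file_id}.{ext}"
old_layout = "{schema_name}.{table_name}.{load_id}.{file_id}.{ext}"

# All values below are made up for illustration.
placeholders = {
    "schema_name": "chess",
    "table_name": "players_games",
    "load_id": "1685299832",
    "file_id": "0a1b2c",
    "ext": "jsonl",
}

# str.format ignores unused keys, so schema_name can stay in the mapping.
current_name = current_layout.format(**placeholders)
old_name = old_layout.format(**placeholders)
print(current_name)
print(old_name)
```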
Date and time placeholders​
Keep in mind all values are lowercased.
- timestamp - the current timestamp in Unix Timestamp format rounded to seconds
- timestamp_ms - the current timestamp in Unix Timestamp format in milliseconds
- load_package_timestamp - timestamp from load package in Unix Timestamp format rounded to seconds
- load_package_timestamp_ms - timestamp from load package in Unix Timestamp format in milliseconds
Both timestamp_ms and load_package_timestamp_ms are integers in milliseconds (e.g., 12334455233) rather than fractional seconds, so they keep millisecond precision without a decimal point.
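The difference between the seconds and milliseconds variants can be shown with the standard library. This is a minimal sketch with a made-up example moment; it truncates to whole units, and how dlt itself rounds is not modeled here.

```python
from datetime import datetime, timezone

# Example moment: 2024-04-14 00:00:00.250 UTC (made up for illustration).
moment = datetime(2024, 4, 14, 0, 0, 0, 250000, tzinfo=timezone.utc)

timestamp = int(moment.timestamp())            # whole seconds
timestamp_ms = int(moment.timestamp() * 1000)  # whole milliseconds, no decimal point

print(timestamp)
print(timestamp_ms)
```

The milliseconds value carries the extra 250 ms as trailing digits instead of a fractional part, which keeps it safe to use inside a file path.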
- Years
  - YYYY - 2024, 2025
  - Y - 2024, 2025
- Months
  - MMMM - January, February, March
  - MMM - Jan, Feb, Mar
  - MM - 01, 02, 03
  - M - 1, 2, 3
- Days of the month
  - DD - 01, 02
  - D - 1, 2
- Hours 24h format
  - HH - 00, 01, 02...23
  - H - 0, 1, 2...23
- Minutes
  - mm - 00, 01, 02...59
  - m - 0, 1, 2...59
- Seconds
  - ss - 00, 01, 02...59
  - s - 0, 1, 2...59
- Fractional seconds
  - SSSS - 000[0..] 001[0..] ... 998[0..] 999[0..]
  - SSS - 000 001 ... 998 999
  - SS - 00, 01, 02 ... 98, 99
  - S - 0 1 ... 8 9
- Days of the week
  - dddd - Monday, Tuesday, Wednesday
  - ddd - Mon, Tue, Wed
  - dd - Mo, Tu, We
  - d - 0-6
- Q - quarters 1, 2, 3, 4
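To get a feel for how the numeric date placeholders expand, the sketch below computes a handful of them with the standard library and substitutes them into a layout. It covers only a subset of the tokens, the example moment and file values are made up, and it is an approximation rather than dlt's own formatting code.

```python
from datetime import datetime

moment = datetime(2024, 4, 14, 9, 5, 0)  # example value

# A subset of the numeric tokens listed above.
tokens = {
    "YYYY": f"{moment.year}",
    "MM": f"{moment.month:02d}",
    "M": f"{moment.month}",
    "DD": f"{moment.day:02d}",
    "HH": f"{moment.hour:02d}",
    "mm": f"{moment.minute:02d}",
    "Q": f"{(moment.month - 1) // 3 + 1}",
}

layout = "{table_name}/{YYYY}-{MM}-{DD}/{HH}{mm}/{load_id}.{file_id}.{ext}"
rendered = layout.format(
    table_name="players", load_id="1685299832", file_id="0", ext="jsonl", **tokens
)
print(rendered)
```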
You can change the file name format by providing the layout setting for the filesystem destination like so:
[destination.filesystem]
layout="{table_name}/{load_id}.{file_id}.{ext}" # current preconfigured naming scheme
# More examples
# With timestamp
# layout = "{table_name}/{timestamp}/{load_id}.{file_id}.{ext}"
# With timestamp of the load package
# layout = "{table_name}/{load_package_timestamp}/{load_id}.{file_id}.{ext}"
# Parquet-like layout (note: it is not compatible with the internal datetime of the parquet file)
# layout = "{table_name}/year={year}/month={month}/day={day}/{load_id}.{file_id}.{ext}"
# Custom placeholders
# extra_placeholders = { "owner" = "admin", "department" = "finance" }
# layout = "{table_name}/{owner}/{department}/{load_id}.{file_id}.{ext}"
A few things to know when specifying your filename layout:
- If you want a different base path that is common to all filenames, you can suffix your bucket_url rather than prefix your layout setting.
- If you do not provide the {ext} placeholder, it will automatically be added to your layout at the end with a dot as a separator.
- It is the best practice to have a separator between each placeholder. Separators can be any character allowed as a filename character, but dots, dashes, and forward slashes are most common.
- When you are using the replace disposition, dlt will have to be able to figure out the correct files to delete before loading the new data. For this to work, you have to
  - include the {table_name} placeholder in your layout
  - not have any other placeholders except for the {schema_name} placeholder before the table_name placeholder and
  - have a separator after the table_name placeholder
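The three requirements for the replace disposition can be checked mechanically before a pipeline runs. The sketch below is a hypothetical pre-flight check built on `string.Formatter` from the standard library; `supports_replace` is not a dlt function and does not claim to mirror dlt's own validation.

```python
from string import Formatter


def supports_replace(layout: str) -> bool:
    """Check the three layout rules listed above for the replace disposition."""
    # Each parsed item is (literal_text_before_field, field_name, format_spec, conversion).
    parts = list(Formatter().parse(layout))
    names = [name for _, name, _, _ in parts]
    if "table_name" not in names:
        return False  # rule 1: {table_name} must be present
    position = names.index("table_name")
    if any(name != "schema_name" for name in names[:position]):
        return False  # rule 2: only {schema_name} may precede {table_name}
    # rule 3: the literal text right after {table_name} must be non-empty
    return position + 1 < len(parts) and bool(parts[position + 1][0])


print(supports_replace("{table_name}/{load_id}.{file_id}.{ext}"))
print(supports_replace("{load_id}/{table_name}/{file_id}.{ext}"))
```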
Please note: dlt will mark complete loads by creating a json file in the ./_dlt_loads folder that corresponds to the _dlt_loads table. For example, if the chess__1685299832.jsonl file is present in the loads folder, you can be sure that all files for the load package 1685299832 are completely loaded.
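Given a listing of the loads folder, the completed load ids can be recovered from the marker file names. The sketch below assumes the `<name>__<load_id>.jsonl` pattern seen in the example above; the helper `completed_load_ids` and the second file name are hypothetical.

```python
def completed_load_ids(loads_folder_files: list) -> set:
    """Extract load ids from marker file names like 'chess__1685299832.jsonl'."""
    load_ids = set()
    for name in loads_folder_files:
        stem, _, ext = name.rpartition(".")
        if ext != "jsonl" or "__" not in stem:
            continue  # not a load marker, e.g. an init file
        load_ids.add(stem.rsplit("__", 1)[1])
    return load_ids


listing = ["chess__1685299832.jsonl", "chess__1685300000.jsonl", "init"]
print(sorted(completed_load_ids(listing)))
```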
Advanced layout configuration​
The filesystem destination configuration supports advanced layout customization and the inclusion of additional placeholders. This can be done through config.toml
or programmatically when initializing via a factory method.
Configuration via config.toml
​
To configure the layout and placeholders using config.toml
, use the following format:
[destination.filesystem]
layout = "{table_name}/{test_placeholder}/{YYYY}-{MM}-{DD}/{ddd}/{mm}/{load_id}.{file_id}.{ext}"
extra_placeholders = { "test_placeholder" = "test_value" }
current_datetime="2024-04-14T00:00:00"
# for automatic directory creation in the local filesystem
kwargs = '{"auto_mkdir": true}'
Ensure that the placeholder names match the intended usage. For example, {test_placeholer}
should be corrected to {test_placeholder}
for consistency.
Dynamic configuration in the code​
Configuration options, including layout and placeholders, can be overridden dynamically when initializing and passing the filesystem destination directly to the pipeline.
import pendulum

import dlt
from dlt.destinations import filesystem

pipeline = dlt.pipeline(
    pipeline_name="data_things",
    destination=filesystem(
        layout="{table_name}/{test_placeholder}/{timestamp}/{load_id}.{file_id}.{ext}",
        current_datetime=pendulum.now(),
        extra_placeholders={
            "test_placeholder": "test_value",
        }
    )
)
Furthermore, it is possible to
- Customize the behavior with callbacks for extra placeholder functionality. Each callback must accept the positional arguments schema_name, table_name, load_id, file_id, and ext, and return a string.
- Customize the current_datetime, which can also be a callback function that is expected to return a pendulum.DateTime instance.
import pendulum

import dlt
from dlt.destinations import filesystem

def placeholder_callback(schema_name: str, table_name: str, load_id: str, file_id: str, ext: str) -> str:
    # Custom logic here
    return "custom_value"

def get_current_datetime() -> pendulum.DateTime:
    return pendulum.now()

pipeline = dlt.pipeline(
    pipeline_name="data_things",
    destination=filesystem(
        layout="{table_name}/{placeholder_x}/{timestamp}/{load_id}.{file_id}.{ext}",
        current_datetime=get_current_datetime,
        extra_placeholders={
            "placeholder_x": placeholder_callback
        }
    )
)
Recommended layout​
The currently recommended layout structure is straightforward:
layout="{table_name}/{load_id}.{file_id}.{ext}"
Adopting this layout offers several advantages:
- Efficiency: it's fast and simple to process.
- Compatibility: supports replace as the write disposition method.
- Flexibility: compatible with various destinations, including Athena.
- Performance: a deeply nested structure can slow down file navigation, whereas a simpler layout mitigates this issue.
Supported file formats​
You can choose the following file formats:
Supported table formats​
You can choose the following table formats:
- Delta is supported
Delta table format​
You need the deltalake
package to use this format:
pip install "dlt[deltalake]"
You also need pyarrow>=17.0.0
:
pip install 'pyarrow>=17.0.0'
Set the table_format
argument to delta
when defining your resource:
@dlt.resource(table_format="delta")
def my_delta_resource():
    ...
dlt always uses parquet as loader_file_format when using the delta table format. Any setting of loader_file_format is disregarded.
Delta table partitioning​
A Delta table can be partitioned (Hive-style partitioning) by specifying one or more partition
column hints. This example partitions the Delta table by the foo
column:
@dlt.resource(
    table_format="delta",
    columns={"foo": {"partition": True}}
)
def my_delta_resource():
    ...
It is not possible to change partition columns after the Delta table has been created. Trying to do so causes an error stating that the partition columns don't match.
Storage options​
You can pass storage options by configuring destination.filesystem.deltalake_storage_options
:
[destination.filesystem]
deltalake_storage_options = '{"AWS_S3_LOCKING_PROVIDER": "dynamodb", "DELTA_DYNAMO_TABLE_NAME": "custom_table_name"}'
dlt passes these options to the storage_options argument of the write_deltalake method in the deltalake library. Look at their documentation to see which options can be used.
You don't need to specify credentials here. dlt merges the required credentials with the options you provided before passing them as storage_options.
❗ When using s3, you need to specify storage options to configure locking behavior.
get_delta_tables
helper​
You can use the get_delta_tables
helper function to get deltalake
DeltaTable objects for your Delta tables:
from dlt.common.libs.deltalake import get_delta_tables
...
# get dictionary of DeltaTable objects
delta_tables = get_delta_tables(pipeline)
# execute operations on DeltaTable objects
delta_tables["my_delta_table"].optimize.compact()
delta_tables["another_delta_table"].optimize.z_order(["col_a", "col_b"])
# delta_tables["my_delta_table"].vacuum()
# etc.
Syncing of dlt
state​
This destination fully supports dlt state sync. To this end, special folders and files will be created at your destination that hold information about your pipeline state, schemas, and completed loads. These folders DO NOT respect your settings in the layout section. When using filesystem as a staging destination, not all of these folders are created, as the state and schemas are managed in the regular way by the final destination you have configured.
You will also notice init
files being present in the root folder and the special dlt
folders. In the absence of the concepts of schemas and tables
in blob storages and directories, dlt
uses these special files to harmonize the behavior of the filesystem
destination with the other implemented destinations.
Additional Setup guides​
- Load data from The Local Filesystem to PostgreSQL in python with dlt
- Load data from Coinbase to AWS S3 in python with dlt
- Load data from IFTTT to Azure Cloud Storage in python with dlt
- Load data from Azure Cloud Storage to Supabase in python with dlt
- Load data from Bitbucket to Azure Cloud Storage in python with dlt
- Load data from Chargebee to Azure Cloud Storage in python with dlt
- Load data from Trello to The Local Filesystem in python with dlt
- Load data from The Local Filesystem to Microsoft SQL Server in python with dlt
- Load data from Azure Cloud Storage to Redshift in python with dlt
- Load data from Box Platform API to Google Cloud Storage in python with dlt