(deltalake)=
(deltalake-loader)=

# Delta Lake I/O

## About

Import and export data into/from [Delta Lake] tables ([paper]),
for humans and machines.

## Synopsis

Load from Delta Lake table:
```shell
ctk load table {file,http,https,s3,abfs,gs,hdfs,lakefs}+deltalake://...
```

Export to Delta Lake table:
```shell
ctk save table {file,http,https,s3,abfs,gs,hdfs,lakefs}+deltalake://...
```

## Install

```shell
uv tool install --upgrade 'cratedb-toolkit[deltalake]'
```

:::{tip}
For speedy installations, we recommend using the [uv] package manager.
Install it using `brew install uv` on macOS or `pipx install uv` on
other operating systems.
:::

## Usage

Delta Lake supports different backends like S3, Azure Data Lake, and
Google Cloud; see [Loading a Delta Table]. Please look up the available
configuration parameters in the reference documentation of the uniform
API library `object_store` ([s3 options], [azure options], [gcs options]).
Otherwise, derive your ETL commands from the examples shared below.

### Load

Load from filesystem.
```shell
ctk load table \
  "file+deltalake://./var/lib/delta/demo/taxi-tiny" \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny"
```

Load from HTTP.
```shell
ctk load table \
  "https+deltalake://datahub.example.org/delta/demo/taxi-tiny" \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny"
```

Load from AWS S3.
```shell
ctk load table \
  "s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=&AWS_SECRET_ACCESS_KEY=&AWS_ENDPOINT=&AWS_REGION=" \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny"
```

Load from Azure Data Lake.
```shell
ctk load table "abfs+deltalake://container/path?AZURE_STORAGE_ACCOUNT_NAME=devstoreaccount1&AZURE_STORAGE_ACCOUNT_KEY=foo&..."
```

Load from Google Cloud.
```shell
ctk load table "gs+deltalake://bucket1/demo/taxi-tiny?GOOGLE_SERVICE_ACCOUNT=...&GOOGLE_SERVICE_ACCOUNT_KEY="
```

Load from HDFS.
```shell
ctk load table "hdfs+deltalake://localhost:9000"
```

Load from LakeFS.
```shell
ctk load table "lakefs+deltalake://bucket1/demo/taxi-tiny?endpoint=https://lakefs.example.org&access_key_id=LAKEFSID&secret_access_key=LAKEFSKEY"
```

:::{tip}
After loading your data into CrateDB, query it.
```shell
ctk shell --command 'SELECT * FROM "demo"."taxi-tiny";'
ctk show table '"demo"."taxi-tiny"'
```
:::

### Save

Save to filesystem.
```shell
ctk save table \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny" \
  "file+deltalake://./var/lib/delta/demo/taxi-tiny"
```

Save to AWS S3.
```shell
ctk save table \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny" \
  "s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=&AWS_SECRET_ACCESS_KEY=&AWS_ENDPOINT=&AWS_REGION="
```

For other target URLs, derive them from the examples in the "Load"
section above; a sketch for Azure Data Lake follows below.
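Save to Azure Data Lake. This is a sketch only, mirroring the
`abfs+deltalake` source URL from the "Load" section as the target URL;
the container, path, and account credentials are placeholders.
```shell
# Sketch: container, path, and account credentials are placeholders.
ctk save table \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny" \
  "abfs+deltalake://container/path?AZURE_STORAGE_ACCOUNT_NAME=devstoreaccount1&AZURE_STORAGE_ACCOUNT_KEY=foo"
```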
### Cloud

A canonical invocation for copying data from a Delta Lake table on
AWS S3 to CrateDB Cloud. Please note the `ssl=true` query parameter
on the CrateDB cluster URL.

```shell
ctk load table \
  "s3+deltalake://bucket1/demo/taxi-tiny?AWS_ACCESS_KEY_ID=&AWS_SECRET_ACCESS_KEY=&AWS_ENDPOINT=&AWS_REGION=" \
  --cluster-url="crate://admin:dZ...6LqB@green-shaak-ti.eks1.eu-west-1.aws.cratedb.net:4200/testdrive/demo?ssl=true"
```

## Options

### Generic parameters

#### `batch-size`

The source URL accepts the `batch-size` option to configure pagination.
You will need to find an optimal value based on the shape of your data.
The default value is `75000`.

:::{rubric} Example usage
:::
```shell
ctk load table "file+deltalake://./var/lib/delta/?batch-size=200000"
```
```shell
ctk save table --cluster-url="crate://?batch-size=200000"
```

### CrateDB parameters

#### `if-exists`

The target CrateDB table will be created automatically if it does not
exist. If it does exist, the `if-exists` URL query parameter can be
used to configure this behavior. The possible values are:

* `fail`: Raise a ValueError. (default)
* `replace`: Drop the table before inserting new values.
* `append`: Insert new values into the existing table.

:::{rubric} Example usage
:::
In order to always replace the target table, i.e. to drop and re-create
it prior to inserting data, use `?if-exists=replace`.
```shell
export CRATEDB_CLUSTER_URL="crate://crate:crate@localhost:4200/testdrive/demo?if-exists=replace"
ctk load table ...
```

### Delta Lake parameters

#### `mode`

By default, the target Delta Lake table must not exist. If it exists
and you want to overwrite it, use `mode=overwrite`. On the other hand,
`mode=append` will append to an existing table.

:::{rubric} Example usage
:::
```shell
ctk save table "file+deltalake://./var/lib/delta/?...&mode=append"
```


[azure options]: https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variants
[Delta Lake]: https://delta.io/
[gcs options]: https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants
[paper]: https://www.databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf
[Loading a Delta Table]: https://delta-io.github.io/delta-rs/usage/loading-table/
[s3 options]: https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html#variants
[uv]: https://docs.astral.sh/uv/
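Following up on the `mode=append` example above, here is a minimal
sketch for overwriting an existing Delta Lake table, assuming that
`mode=overwrite` is passed as a URL query parameter in the same way;
the path is a placeholder.
```shell
# Sketch: overwrite an existing Delta Lake table; the path is a placeholder.
ctk save table \
  --cluster-url="crate://crate:crate@localhost:4200/demo/taxi-tiny" \
  "file+deltalake://./var/lib/delta/demo/taxi-tiny?mode=overwrite"
```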