DSBulk Migrator

Use DSBulk Migrator to perform simple migrations of smaller data quantities, where data validation (other than post-migration row counts) is not necessary.

DSBulk Migrator prerequisites

  • Install or switch to Java 11.

  • Install Maven 3.8.x.

  • Optionally, install DSBulk Loader if you want to reference your own external DSBulk installation instead of the DSBulk that is embedded in DSBulk Migrator.

  • Install Simulacron 0.12.x and its prerequisites if you plan to run the integration tests.
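
To confirm that the Java and Maven prerequisites are met, you can check the installed versions before building. This is an optional sanity check, not part of DSBulk Migrator itself:

java -version
mvn -version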

Building DSBulk Migrator

Build DSBulk Migrator with Maven. First, clone the Git repository to your local machine. For example:

cd ~/github
git clone git@github.com:datastax/dsbulk-migrator.git
cd dsbulk-migrator

Then run:

mvn clean package

The build produces two distributable fat jars:

  • dsbulk-migrator-<VERSION>-embedded-driver.jar: contains an embedded Java driver; suitable for live migrations using an external DSBulk, or for script generation. This jar is NOT suitable for live migrations using an embedded DSBulk, since no DSBulk classes are present.

  • dsbulk-migrator-<VERSION>-embedded-dsbulk.jar: contains an embedded DSBulk and an embedded Java driver; suitable for all operations. Note that this jar is much bigger than the previous one, due to the presence of DSBulk classes.
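
After a successful build, both fat jars are in the target directory of the cloned project. For example (the exact file names depend on the project version):

ls target/dsbulk-migrator-*.jar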

Testing DSBulk Migrator

The project contains a few integration tests. Run them with:

mvn clean verify

The integration tests require Simulacron. Be sure to meet all the Simulacron prerequisites before running the tests.

Running DSBulk Migrator

Launch the DSBulk Migrator tool:

java -jar /path/to/dsbulk-migrator.jar { migrate-live | generate-script | generate-ddl } [OPTIONS]

When doing a live migration, the options configure DSBulk and the connections to the Origin and Target clusters.

When generating a migration script, most options serve as default values in the generated scripts. Note however that, even when generating scripts, this tool still needs to access the Origin cluster in order to gather metadata about the tables to migrate.

When generating a DDL file, only a few options are meaningful. Because standard DSBulk is not used, and the import cluster is never contacted, import options and DSBulk-related options are ignored. The tool still needs to access the Origin cluster in order to gather metadata about the keyspaces and tables for which to generate DDL statements.

DSBulk Migrator reference

  • Live migration command-line options

  • Script generation command-line options

  • DDL generation command-line options

  • Getting DSBulk Migrator help

  • DSBulk Migrator examples

Live migration command-line options

The following options are available for the migrate-live command. Most options have sensible default values and do not need to be specified, unless you want to override the default value.

-c

--dsbulk-cmd=CMD

The external DSBulk command to use. Ignored if the embedded DSBulk is being used. The default is simply 'dsbulk', which assumes that the command is available on the PATH.

-d

--data-dir=PATH

The directory where data will be exported to and imported from. The default is a 'data' subdirectory in the current working directory. The data directory will be created if it does not exist. Tables will be exported and imported in subdirectories of the data directory specified here. There will be one subdirectory per keyspace in the data directory, then one subdirectory per table in each keyspace directory.
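
For example, exporting two tables named users and orders from a hypothetical keyspace ks1 would produce a layout like the following under the data directory (the keyspace and table names shown are illustrative):

    /path/to/data/dir/
        ks1/
            users/
            orders/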

-e

--dsbulk-use-embedded

Use the embedded DSBulk version instead of an external one. The default is to use an external DSBulk command.

--export-bundle=PATH

The path to a secure connect bundle to connect to the Origin cluster, if that cluster is a DataStax Astra DB cluster. Options --export-host and --export-bundle are mutually exclusive.

--export-consistency=CONSISTENCY

The consistency level to use when exporting data. The default is LOCAL_QUORUM.

--export-dsbulk-option=OPT=VALUE

An extra DSBulk option to use when exporting. Any valid DSBulk option can be specified here, and it will be passed as is to the DSBulk process. DSBulk options, including driver options, must be passed as --long.option.name=<value>. Short options are not supported.
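
For example, to pass extra DSBulk settings when exporting (the option values shown are illustrative):

    --export-dsbulk-option "--executor.maxPerSecond=1000" \
    --export-dsbulk-option "--connector.csv.maxCharsPerColumn=65536"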

--export-host=HOST[:PORT]

The host name or IP and, optionally, the port of a node from the Origin cluster. If the port is not specified, it will default to 9042. This option can be specified multiple times. Options --export-host and --export-bundle are mutually exclusive.
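
For example, to specify two contact points, one of them on a non-default port (the addresses and port shown are illustrative):

    --export-host=10.0.1.12 \
    --export-host=10.0.1.13:9043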

--export-max-concurrent-files=NUM|AUTO

The maximum number of concurrent files to write to. Must be a positive number or the special value AUTO. The default is AUTO.

--export-max-concurrent-queries=NUM|AUTO

The maximum number of concurrent queries to execute. Must be a positive number or the special value AUTO. The default is AUTO.

--export-max-records=NUM

The maximum number of records to export for each table. Must be a positive number or -1. The default is -1 (export the entire table).

--export-password

The password to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all. Omit the parameter value to be prompted for the password interactively.

--export-splits=NUM|NC

The maximum number of token range queries to generate. Use the NC syntax to specify a multiple of the number of available cores. For example, 8C = 8 times the number of available cores. The default is 8C. This is an advanced setting; you should rarely need to modify the default value.

--export-username=STRING

The username to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all.

-h

--help

Displays this help text.

--import-bundle=PATH

The path to a secure connect bundle to connect to the Target cluster, if it’s a DataStax Astra DB cluster. Options --import-host and --import-bundle are mutually exclusive.

--import-consistency=CONSISTENCY

The consistency level to use when importing data. The default is LOCAL_QUORUM.

--import-default-timestamp=<defaultTimestamp>

The default timestamp to use when importing data. Must be a valid instant in ISO-8601 syntax. The default is 1970-01-01T00:00:00Z.
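
For example, to apply a fixed write timestamp to all imported rows (the instant shown is illustrative):

    --import-default-timestamp=2020-01-01T00:00:00Z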

--import-dsbulk-option=OPT=VALUE

An extra DSBulk option to use when importing. Any valid DSBulk option can be specified here, and it will be passed as is to the DSBulk process. DSBulk options, including driver options, must be passed as --long.option.name=<value>. Short options are not supported.

--import-host=HOST[:PORT]

The host name or IP and, optionally, the port of a node from the Target cluster. If the port is not specified, it will default to 9042. This option can be specified multiple times. Options --import-host and --import-bundle are mutually exclusive.

--import-max-concurrent-files=NUM|AUTO

The maximum number of concurrent files to read from. Must be a positive number or the special value AUTO. The default is AUTO.

--import-max-concurrent-queries=NUM|AUTO

The maximum number of concurrent queries to execute. Must be a positive number or the special value AUTO. The default is AUTO.

--import-max-errors=NUM

The maximum number of failed records to tolerate when importing data. The default is 1000. Failed records will appear in a load.bad file in the DSBulk operation directory.

--import-password

The password to use to authenticate against the Target cluster. Options --import-username and --import-password must be provided together, or not at all. Omit the parameter value to be prompted for the password interactively.

--import-username=STRING

The username to use to authenticate against the Target cluster. Options --import-username and --import-password must be provided together, or not at all.

-k

--keyspaces=REGEX

A regular expression to select keyspaces to migrate. The default is to migrate all keyspaces except system keyspaces, DSE-specific keyspaces, and the OpsCenter keyspace. Case-sensitive keyspace names must be entered in their exact case.
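
For example, the first expression below selects two hypothetical keyspaces named shop and inventory, and the second selects all keyspaces whose names start with app_:

    --keyspaces='^(shop|inventory)$'
    --keyspaces='^app_.*'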

-l

--dsbulk-log-dir=PATH

The directory where DSBulk should store its logs. The default is a 'logs' subdirectory in the current working directory. This subdirectory will be created if it does not exist. Each DSBulk operation will create a subdirectory in the log directory specified here.

--max-concurrent-ops=NUM

The maximum number of concurrent operations (exports and imports) to carry out. The default is 1. Set this to a higher value to allow exports and imports to run concurrently. For example, with a value of 2, each table is imported as soon as it is exported, while the next table is being exported.

--skip-truncate-confirmation

Skip truncate confirmation before actually truncating tables. Only applicable when migrating counter tables, ignored otherwise.

-t

--tables=REGEX

A regular expression to select tables to migrate. The default is to migrate all tables in the keyspaces that were selected for migration with --keyspaces. Case-sensitive table names must be entered in their exact case.

--table-types=regular|counter|all

The table types to migrate. The default is all.

--truncate-before-export

Truncate tables before the export instead of after. The default is to truncate after the export. Only applicable when migrating counter tables, ignored otherwise.

-w

--dsbulk-working-dir=PATH

The directory where DSBulk should be executed. Ignored if the embedded DSBulk is being used. If unspecified, it defaults to the current working directory.

Script generation command-line options

The following options are available for the generate-script command. Most options have sensible default values and do not need to be specified, unless you want to override the default value.

-c

--dsbulk-cmd=CMD

The DSBulk command to use. The default is simply 'dsbulk', which assumes that the command is available on the PATH.

-d

--data-dir=PATH

The directory where data will be exported to and imported from. The default is a 'data' subdirectory in the current working directory. The data directory will be created if it does not exist.

--export-bundle=PATH

The path to a secure connect bundle to connect to the Origin cluster, if that cluster is a DataStax Astra DB cluster. Options --export-host and --export-bundle are mutually exclusive.

--export-consistency=CONSISTENCY

The consistency level to use when exporting data. The default is LOCAL_QUORUM.

--export-dsbulk-option=OPT=VALUE

An extra DSBulk option to use when exporting. Any valid DSBulk option can be specified here, and it will be passed as is to the DSBulk process. DSBulk options, including driver options, must be passed as --long.option.name=<value>. Short options are not supported.

--export-host=HOST[:PORT]

The host name or IP and, optionally, the port of a node from the Origin cluster. If the port is not specified, it will default to 9042. This option can be specified multiple times. Options --export-host and --export-bundle are mutually exclusive.

--export-max-concurrent-files=NUM|AUTO

The maximum number of concurrent files to write to. Must be a positive number or the special value AUTO. The default is AUTO.

--export-max-concurrent-queries=NUM|AUTO

The maximum number of concurrent queries to execute. Must be a positive number or the special value AUTO. The default is AUTO.

--export-max-records=NUM

The maximum number of records to export for each table. Must be a positive number or -1. The default is -1 (export the entire table).

--export-password

The password to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all. Omit the parameter value to be prompted for the password interactively.

--export-splits=NUM|NC

The maximum number of token range queries to generate. Use the NC syntax to specify a multiple of the number of available cores. For example, 8C = 8 times the number of available cores. The default is 8C. This is an advanced setting. You should rarely need to modify the default value.

--export-username=STRING

The username to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all.

-h

--help

Displays this help text.

--import-bundle=PATH

The path to a secure connect bundle to connect to the Target cluster, if it’s a DataStax Astra DB cluster. Options --import-host and --import-bundle are mutually exclusive.

--import-consistency=CONSISTENCY

The consistency level to use when importing data. The default is LOCAL_QUORUM.

--import-default-timestamp=<defaultTimestamp>

The default timestamp to use when importing data. Must be a valid instant in ISO-8601 syntax. The default is 1970-01-01T00:00:00Z.

--import-dsbulk-option=OPT=VALUE

An extra DSBulk option to use when importing. Any valid DSBulk option can be specified here, and it will be passed as is to the DSBulk process. DSBulk options, including driver options, must be passed as --long.option.name=<value>. Short options are not supported.

--import-host=HOST[:PORT]

The host name or IP and, optionally, the port of a node from the Target cluster. If the port is not specified, it will default to 9042. This option can be specified multiple times. Options --import-host and --import-bundle are mutually exclusive.

--import-max-concurrent-files=NUM|AUTO

The maximum number of concurrent files to read from. Must be a positive number or the special value AUTO. The default is AUTO.

--import-max-concurrent-queries=NUM|AUTO

The maximum number of concurrent queries to execute. Must be a positive number or the special value AUTO. The default is AUTO.

--import-max-errors=NUM

The maximum number of failed records to tolerate when importing data. The default is 1000. Failed records will appear in a load.bad file in the DSBulk operation directory.

--import-password

The password to use to authenticate against the Target cluster. Options --import-username and --import-password must be provided together, or not at all. Omit the parameter value to be prompted for the password interactively.

--import-username=STRING

The username to use to authenticate against the Target cluster. Options --import-username and --import-password must be provided together, or not at all.

-k

--keyspaces=REGEX

A regular expression to select keyspaces to migrate. The default is to migrate all keyspaces except system keyspaces, DSE-specific keyspaces, and the OpsCenter keyspace. Case-sensitive keyspace names must be entered in their exact case.

-l

--dsbulk-log-dir=PATH

The directory where DSBulk should store its logs. The default is a 'logs' subdirectory in the current working directory. This subdirectory will be created if it does not exist. Each DSBulk operation will create a subdirectory in the log directory specified here.

-t

--tables=REGEX

A regular expression to select tables to migrate. The default is to migrate all tables in the keyspaces that were selected for migration with --keyspaces. Case-sensitive table names must be entered in their exact case.

--table-types=regular|counter|all

The table types to migrate. The default is all.

DDL generation command-line options

The following options are available for the generate-ddl command. Most options have sensible default values and do not need to be specified, unless you want to override the default value.

-a

--optimize-for-astra

Produce CQL scripts optimized for DataStax Astra DB. Astra DB does not allow some options in DDL statements. When this option is set, the forbidden options are omitted from the generated CQL files.

-d

--data-dir=PATH

The directory where data will be exported to and imported from. The default is a 'data' subdirectory in the current working directory. The data directory will be created if it does not exist.

--export-bundle=PATH

The path to a secure connect bundle to connect to the Origin cluster, if that cluster is a DataStax Astra DB cluster. Options --export-host and --export-bundle are mutually exclusive.

--export-host=HOST[:PORT]

The host name or IP and, optionally, the port of a node from the Origin cluster. If the port is not specified, it will default to 9042. This option can be specified multiple times. Options --export-host and --export-bundle are mutually exclusive.

--export-password

The password to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all. Omit the parameter value to be prompted for the password interactively.

--export-username=STRING

The username to use to authenticate against the Origin cluster. Options --export-username and --export-password must be provided together, or not at all.

-h

--help

Displays this help text.

-k

--keyspaces=REGEX

A regular expression to select keyspaces to migrate. The default is to migrate all keyspaces except system keyspaces, DSE-specific keyspaces, and the OpsCenter keyspace. Case-sensitive keyspace names must be entered in their exact case.

-t

--tables=REGEX

A regular expression to select tables to migrate. The default is to migrate all tables in the keyspaces that were selected for migration with --keyspaces. Case-sensitive table names must be entered in their exact case.

--table-types=regular|counter|all

The table types to migrate. The default is all.

Getting DSBulk Migrator help

Use the following command to display the available DSBulk Migrator commands:

java -jar /path/to/dsbulk-migrator-embedded-dsbulk.jar --help

For help with an individual command and its options:

java -jar /path/to/dsbulk-migrator-embedded-dsbulk.jar COMMAND --help
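
For example, to display the options of the live migration command:

java -jar /path/to/dsbulk-migrator-embedded-dsbulk.jar migrate-live --help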

DSBulk Migrator examples

These examples show sample username and password values that are for demonstration purposes only. Do not use these values in your environment.

Generate migration script

Generate a migration script to migrate from an existing Origin cluster to a Target Astra DB cluster:

    java -jar target/dsbulk-migrator-<VERSION>-embedded-driver.jar generate-script \
        --data-dir=/path/to/data/dir \
        --dsbulk-cmd=${DSBULK_ROOT}/bin/dsbulk \
        --dsbulk-log-dir=/path/to/log/dir \
        --export-host=my-origin-cluster.com \
        --export-username=user1 \
        --export-password=s3cr3t \
        --import-bundle=/path/to/bundle \
        --import-username=user1 \
        --import-password=s3cr3t

Migrate live using external DSBulk install

Migrate live from an existing Origin cluster to a Target Astra DB cluster using an external DSBulk installation. Passwords will be prompted interactively:

    java -jar target/dsbulk-migrator-<VERSION>-embedded-driver.jar migrate-live \
        --data-dir=/path/to/data/dir \
        --dsbulk-cmd=${DSBULK_ROOT}/bin/dsbulk \
        --dsbulk-log-dir=/path/to/log/dir \
        --export-host=my-origin-cluster.com \
        --export-username=user1 \
        --export-password \
        --import-bundle=/path/to/bundle \
        --import-username=user1 \
        --import-password

Migrate live using embedded DSBulk install

Migrate live from an existing Origin cluster to a Target Astra DB cluster using the embedded DSBulk installation. Passwords will be prompted interactively. In this example, additional DSBulk options are passed.

    java -jar target/dsbulk-migrator-<VERSION>-embedded-dsbulk.jar migrate-live \
        --data-dir=/path/to/data/dir \
        --dsbulk-use-embedded \
        --dsbulk-log-dir=/path/to/log/dir \
        --export-host=my-origin-cluster.com \
        --export-username=user1 \
        --export-password \
        --export-dsbulk-option "--connector.csv.maxCharsPerColumn=65536" \
        --export-dsbulk-option "--executor.maxPerSecond=1000" \
        --import-bundle=/path/to/bundle \
        --import-username=user1 \
        --import-password \
        --import-dsbulk-option "--connector.csv.maxCharsPerColumn=65536" \
        --import-dsbulk-option "--executor.maxPerSecond=1000"

In the example above, you must use the dsbulk-migrator-<VERSION>-embedded-dsbulk.jar fat jar. Otherwise, an error will be raised because no embedded DSBulk can be found.

Generate DDL to recreate Origin schema in Target

Generate DDL files to recreate the Origin schema in a Target Astra DB cluster:

    java -jar target/dsbulk-migrator-<VERSION>-embedded-driver.jar generate-ddl \
        --data-dir=/path/to/data/dir \
        --export-host=my-origin-cluster.com \
        --export-username=user1 \
        --export-password=s3cr3t \
        --optimize-for-astra