Troubleshooting tips

Refer to the tips on this page for information that can help you troubleshoot issues with your migration.

How to retrieve the ZDM Proxy log files

Depending on how you deployed the ZDM Proxy, there are different ways to access the logs. If you used the ZDM Automation, see View the logs for a quick way to view the logs of a single proxy instance, or follow the instructions in Collect the logs for a playbook that systematically retrieves the logs from all instances and packages them in a zip archive for later inspection.

If you did not use the ZDM Automation, you might have to access the logs differently. If you deployed with Docker, enter the following command to export the logs of a container to a file. Because the ZDM Proxy writes its log output to stderr, redirect both streams:

docker logs my-container > log.txt 2>&1

Keep in mind that Docker logs are deleted if the container is recreated.
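
To keep the logs around for later inspection, you can capture them with timestamps and package them, similar to what the Collect the logs playbook does. The following is a minimal sketch for a Docker deployment; the container name zdm-proxy and the file names are assumptions, so adjust them to your environment.

# Assumes a container named "zdm-proxy"; run on each proxy host.
docker logs --timestamps zdm-proxy > "zdm-proxy-$(hostname).log" 2>&1
tar czf "zdm-proxy-logs-$(hostname).tar.gz" zdm-proxy-*.log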

What to look for in the logs

Make sure that the log level of the ZDM Proxy is set to the appropriate value:

  • If you deployed the ZDM Proxy through the ZDM Automation, the log level is determined by the variable log_level in vars/zdm_proxy_core_config.yml. This value can be changed in a rolling fashion by editing this variable and running the playbook rolling_update_zdm_proxy.yml. For more information, see Change a mutable configuration variable.

  • If you did not use the ZDM Automation to deploy the ZDM Proxy, change the environment variable ZDM_LOG_LEVEL on each proxy instance and restart it, as in the sketch after this list.
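
For a deployment without the ZDM Automation, the following is a minimal sketch of that restart for Docker. The container name zdm-proxy and the env file ./zdm-proxy.env are assumptions about your setup; adapt the commands to however you actually launch the proxy.

# Recreate the proxy container with a more verbose log level.
# "zdm-proxy" and "./zdm-proxy.env" are assumptions about your deployment.
docker rm -f zdm-proxy
docker run -d --name zdm-proxy \
  --env-file ./zdm-proxy.env \
  -e ZDM_LOG_LEVEL=DEBUG \
  datastax/zdm-proxy:2.x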

Here are the most common messages you’ll find in the proxy logs:

ZDM Proxy startup message

Assuming the log level does not filter out INFO entries, you can look for the following type of log message to verify that the ZDM Proxy started up correctly. Example:

{"log":"time=\"2023-01-13T11:50:48Z\" level=info
msg=\"Proxy started. Waiting for SIGINT/SIGTERM to shutdown.
\"\n","stream":"stderr","time":"2023-01-13T11:50:48.522097083Z"}

ZDM Proxy configuration

The first few lines of the ZDM Proxy log file contain all the configuration variables and their values, printed as one long JSON string. You can copy and paste the string into a JSON formatter or viewer to make it easier to read. Example log message:

{"log":"time=\"2023-01-13T11:50:48Z\" level=info
msg=\"Parsed configuration: {\\\"ProxyIndex\\\":1,\\\"ProxyAddresses\\\":"...",
[remaining of json string removed for simplicity]
","stream":"stderr","time":"2023-01-13T11:50:48.339225051Z"}

Seeing the configuration settings is useful while troubleshooting issues. However, remember to check the log level variable to ensure you’re viewing the intended types of messages. Setting the log level to DEBUG might cause a slight performance degradation.

Be aware of current log level

When you find a log message that looks like an error, the most important thing is to check the log level of that message.

  • A log message with level=debug or level=info is very likely not an error, but something expected and normal.

  • Log messages with level=error must be examined, as they usually indicate an issue with the proxy, the client application, or the clusters.

  • Log messages with level=warn are usually related to events that are not fatal to the overall running workload, but may cause issues with individual requests or connections.

  • In general, log messages with level=error or level=warn should be brought to the attention of DataStax if their meaning is not clear: submit a GitHub Issue in the ZDM Proxy GitHub repo to ask about error or warn messages that are unclear. The example after this list shows one way to surface such messages.
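
As a quick way to triage a log file, you can filter for warning- and error-level entries. This is a hedged example for a Docker deployment; the container name zdm-proxy is an assumption.

# Show only warn- and error-level log entries for review.
docker logs zdm-proxy 2>&1 | grep -E "level=(warn|error)"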

Protocol log messages

Here’s an example of a log message that looks like an error but is actually expected and normal:

{"log":"time=\"2023-01-13T12:02:12Z\" level=debug msg=\"[TARGET-CONNECTOR]
Protocol v5 detected while decoding a frame. Returning a protocol message
to the client to force a downgrade: PROTOCOL (code=Code Protocol [0x0000000A],
msg=Invalid or unsupported protocol version (5)).\"\n","stream":"stderr","time":"2023-01-13T12:02:12.379287735Z"}

There are cases where protocol errors are fatal, in which case they kill an active connection that was being used to serve requests. However, if you find a log message similar to the example above at log level debug, it’s likely not an issue. Instead, it’s more likely an expected part of the handshake process during connection initialization; that is, the normal protocol version negotiation.

How to identify the ZDM Proxy version

In the ZDM Proxy logs, the first message contains the version string (just before the message that shows the configuration):

time="2023-01-13T13:37:28+01:00" level=info msg="Starting ZDM proxy version 2.1.0"
time="2023-01-13T13:37:28+01:00" level=info msg="Parsed configuration: {removed for simplicity}"

You can also pass the -version command-line parameter to the ZDM Proxy to print only the version. Example:

docker run --rm datastax/zdm-proxy:2.x -version
ZDM proxy version 2.1.0

Do not use --rm when actually launching the ZDM Proxy; otherwise, you will not be able to access the logs when it stops (or crashes).

How to leverage the metrics provided by ZDM Proxy

The ZDM Proxy exposes an HTTP endpoint that returns metrics in the Prometheus format. At the moment this is the only way to obtain metrics from ZDM Proxy, but we plan to introduce other ways to expose metrics in the future.
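
If you want to inspect the raw metrics without Prometheus or Grafana, you can query the endpoint directly. The following is a hedged example: the host is a placeholder, and the port (14001) and /metrics path reflect a default configuration, so check the metrics settings of your own deployment.

# Fetch the raw Prometheus-format metrics and show a small sample.
curl -s http://<zdm-proxy-host>:14001/metrics | head -n 20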

The ZDM Automation can deploy Prometheus and Grafana, configuring them automatically, as explained here. The Grafana dashboards come ready to go with the metrics that are scraped from the ZDM Proxy instances.

If you already have a Grafana deployment, you can import the dashboards from the two ZDM dashboard files at this location.

Grafana dashboard for ZDM Proxy metrics

There are three groups of metrics in this dashboard:

  • Proxy level metrics.

  • Node level metrics.

  • Asynchronous read requests metrics.

Image: Grafana dashboard showing the three categories of ZDM Proxy metrics.

Proxy-level metrics

  • Latency:

    • Read Latency: total latency measured by the ZDM Proxy (including post-processing like response aggregation) for read requests. This metric has two labels (reads_origin and reads_target): the label that has data will depend on which cluster is receiving the reads, i.e. which cluster is currently considered the primary cluster. This is configured by the ZDM Automation through the variable primary_cluster, or directly through the environment variable ZDM_PRIMARY_CLUSTER of the ZDM Proxy.

    • Write Latency: total latency measured by the ZDM Proxy (including post-processing like response aggregation) for write requests.

  • Throughput (same structure as the previous latency metrics):

    • Read Throughput.

    • Write Throughput.

  • In-flight requests.

  • Number of client connections.

  • Prepared Statement cache:

    • Cache Misses: a prepared statement was sent to the ZDM Proxy, but it wasn’t in its cache, so the proxy returned an UNPREPARED response to make the driver send the PREPARE request again.

    • Number of cached prepared statements.

  • Request Failure Rates: number of request failures per interval. You can set the interval via the Error Rate interval dashboard variable at the top.

    • Read Failure Rate: one cluster label with two settings: origin and target. The label that contains data depends on which cluster is currently considered the primary (same as the latency and throughput metrics explained above).

    • Write Failure Rate: one failed_on label with three settings: origin, target and both.

      • failed_on=origin: the write request failed on Origin ONLY.

      • failed_on=target: the write request failed on Target ONLY.

      • failed_on=both: the write request failed on BOTH clusters.

  • Request Failure Counters: total number of request failures (resets when the ZDM Proxy instance is restarted):

    • Read Failure Counters: same labels as read failure rate.

    • Write Failure Counters: same labels as write failure rate.

To see error metrics by error type, see the node-level error metrics in the next section.

Node-level metrics

  • Latency: metrics in this group are not split by request type like the proxy-level latency metrics, so writes and reads are mixed together:

    • Origin: latency measured by the ZDM Proxy up to the point it received a response from the Origin connection.

    • Target: latency measured by the ZDM Proxy up to the point it received a response from the Target connection.

  • Throughput: same as the node-level latency metrics; reads and writes are mixed together.

  • Number of connections per Origin node and per Target node.

  • Number of Used Stream Ids:

    • Tracks the total number of used stream ids ("request ids") per connection type (Origin, Target and Async).

  • Number of errors per error type per Origin node and per Target node. Possible values for the error type label:

    • error=client_timeout

    • error=read_failure

    • error=read_timeout

    • error=write_failure

    • error=write_timeout

    • error=overloaded

    • error=unavailable

    • error=unprepared

Asynchronous read requests metrics

These metrics are specific to asynchronous reads, so they are only populated if asynchronous dual reads are enabled. This is done by setting the ZDM Automation variable read_mode, or its equivalent environment variable ZDM_READ_MODE, to DUAL_ASYNC_ON_SECONDARY as explained here.

These metrics track:

  • Latency.

  • Throughput.

  • Number of dedicated connections per node for async reads: whether it’s Origin or Target connections depends on the ZDM Proxy configuration. That is, if the primary cluster is Origin, then the asynchronous reads are sent to Target.

  • Number of errors per error type per node.

Insights via the ZDM Proxy metrics

Some examples of problems manifesting in these metrics:

  • Number of client connections close to 1000 per ZDM Proxy instance: by default, ZDM Proxy starts rejecting client connections after having accepted 1000 of them.

  • Continuously increasing Prepared Statement cache metrics: both the cached entries and the cache misses metrics.

  • Error metrics depending on the error type: these need to be evaluated on a per-case basis.

Go runtime metrics dashboard and system dashboard

This dashboard in Grafana is not as important as the ZDM Proxy dashboard, but it can be useful for troubleshooting performance issues. Here you can see memory usage, Garbage Collection (GC) duration, open fds (file descriptors, useful to detect leaked connections), and the number of goroutines:

Image: example Go runtime metrics dashboard.

Some examples of problem areas in these Go runtime metrics (see the query sketch after this list):

  • An always increasing “open fds” metric.

  • GC pause durations frequently in (or close to) the triple digits of milliseconds.

  • Always increasing memory usage.

  • Always increasing number of goroutines.
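
To spot these trends without Grafana, you can also check the raw Go runtime metrics on the proxy metrics endpoint. This is a hedged sketch: the metric names shown are the standard Prometheus Go client collector names, and the host and port are placeholders and defaults to verify against your own deployment.

# Sample the goroutine count, open file descriptors, and allocated heap bytes.
curl -s http://<zdm-proxy-host>:14001/metrics | grep -E "^(go_goroutines|process_open_fds|go_memstats_alloc_bytes) "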

The ZDM monitoring stack also includes a system-level dashboard collected through the Prometheus Node Exporter. This dashboard contains hardware and OS-level metrics for the host on which the proxy runs. This can be useful to check the available resources and identify low-level bottlenecks or issues.

Reporting an issue

If you encounter a problem during your migration, please contact us by submitting a GitHub Issue in the ZDM Proxy GitHub repo. To the extent that the issue description does not expose your proprietary or private information, please include the following:

  • ZDM Proxy version (see here).

  • ZDM Proxy logs: ideally at debug level if you can reproduce the issue easily and can tolerate a restart of the proxy instances to apply the configuration change.

  • Version of database software on the Origin and Target clusters (relevant for DSE and Apache Cassandra deployments only).

  • If Astra DB is being used, please let us know in the issue description.

  • Screenshots of the ZDM Proxy metrics dashboards from Grafana or whatever visualization tool you use. If you can provide a way for us to access those metrics directly, that would be even better.

  • Application/Driver logs.

  • Driver and version that the client application is using.

Reporting a performance issue

If the issue is related to performance, troubleshooting can be more complicated and dynamic. Because of this, we ask you to provide additional information, which usually comes down to the answers to a few questions (in addition to the information from the prior section):

  • Which statement types are being used: simple, prepared, batch?

  • If batch statements are being used, which driver API is being used to create these batches? Are you passing a BEGIN BATCH CQL query string to a simple/prepared statement? Or are you using the actual batch statement objects that drivers allow you to create?

  • How many parameters does each statement have?

  • Is CQL function replacement enabled? You can see if this feature is enabled by looking at the value of the Ansible advanced configuration variable replace_cql_functions if using the automation, or the environment variable ZDM_REPLACE_CQL_FUNCTIONS otherwise. CQL function replacement is disabled by default.

  • If permissible within your security rules, please provide us access to the ZDM Proxy metrics dashboard. Screenshots are fine, but for performance issues it is more helpful to have access to the actual dashboard so the team can use all the data from these metrics in the troubleshooting process.
