Troubleshooting Guides


Table of Contents

Troubleshooting

I’m Getting a Timeout Error

Depending on the connection and resources, the installation process may time out. If that occurs, use the --wait-duration flag to provide additional time. The time must be provided in Go duration format (for example: 60s, 1m, etc). The following example extends the wait duration to 10 minutes:

kubectl kots install wallaroo/ea -n wallaroo --license-file myfile.yaml --shared-password wallaroo --wait-duration 600s

Preflight Checks are Failing at the Command Line

If your system does not meet all of the preflight requirements, the installation process may fail when performing an automated installation. It is highly recommended to install Wallaroo on a system that meets all requirements or else performance will be degraded.

Before continuing, use the following command and note down any and all pre-flight checks that are listed as a failure. The license will be installed in later steps through the browser.

install wallaroo/ea -n wallaroo

To ignore preflight checks, use the --skip-preflights flag, as in the following example (Note: This is not recommended, only provided as an example.):

kubectl kots install wallaroo/ea -n wallaroo --license-file myfile.yaml --shared-password wallaroo --skip-preflights

When Launching JupyterHub, I get a Server 500 error

If you shut down and restart a Wallaroo instance in a new environment or change the IP address, some settings may not be updated. Run the following command to restart the deployment process and update the settings to match the current environment. Note that the namespace wallaroo is used - modify this to match the environment where Wallaroo is installed.

kubectl rollout restart deployment hub -n wallaroo

The resource wallaroo/daemonset/spire-agent shows as not loading - how do I fix it?

If the resource wallaroo/daemonset/spire-agent does not load, verify the kots version is 1.124.4. To upgrade the kots version in your Kubernetes cluster, see see Updating KOTS.

Usually this can be done by:

  1. Updating the version of kots with the command:

    curl https://kots.io/install/1.124.4 | REPL_USE_SUDO=y bash
    
  2. Upgrade kots in the Kubernetes cluster with the following, replacing the namespace wallaroo with the namespace you have installed the Wallaroo Ops instance in:

    kubectl kots admin-console upgrade --namespace wallaroo
    

How do I Send Logs and Configurations to Wallaroo?

See the Wallaroo Support Bundle Generation Guide for instructions on how to create a support bundle used to troubleshoot installation and configuration issues.

How do I Fix a Stuck Upgrade?

When doing an upgrade of Wallaroo and access the Wallaroo instance via a browser, the following message indicates a “stuck” upgrade when it remains after the upgrade process is complete:

  • Wallaroo upgrade in progress

Use the following procedures to clear the upgrade stream based on how Wallaroo was installed.

Cluster Install

For a cluster based install of Wallaroo, enter the following command, replacing $NAMESPACE with the namespace associated with the Wallaroo install:

kubectl run -i --rm --tty nats-box --restart=Never -n $NAMESPACE --image=natsio/nats-box

For example, for the namespace wallaroo this command is:

kubectl run -i --rm --tty nats-box --restart=Never -n wallaroo --image=natsio/nats-box

In a separate terminal shell, enter the following command:

nats -s nats stream purge wallaroo_upgrade_status

Air-Gapped Install

For air-gapped installations of Wallaroo, determine the private image registry with the following command. This returns the running pod that is the issue.

kubectl describe deployment dashboard|grep Image

The following is an example of the response from this command:

Image:      abcdefg.dkr.ecr.us-east-2.amazonaws.com/wallaroo/dashboard:v2026.1.0-main-6081

For this example, the registry here is abcdefg.dkr.ecr.us-east-2.amazonaws.com/wallaroo/.

Enter the following command, replacing the image value with the pod returned in the previous command:

kubectl run -i --rm --tty nats-box --restart=Never -n wallaroo --image=abcdefg.dkr.ecr.us-east-2.amazonaws.com/wallaroo/nats-box:0.17.0-nonroot

Failed Deploy Status When Attempting to Upgrade

If a “Failed Deploy" message is shown in kots or helm, retrieve the pods to check for a stalled pre-upgrade hook. Replace the $NAMESPACE variable with the namespace used to install Wallaroo.

kubectl get pods -n $NAMESPACE

For example, for the namespace wallaroo the command is:

kubectl get pod -n wallaroo

If the pre-upgrade hook still shows Running under STATUS the following command will remove the pod so the upgrade can be attempted again. Replace the $NAMESPACE variable with the namespace used to install Wallaroo.

kubectl delete job wallaroo-pre-upgrade-hook -n $NAMESPACE

For example, for the namespace wallaroo the command is:

kubectl delete job wallaroo-pre-upgrade-hook -n wallaroo

How to Resolve the Message “failed to render archive directory” for Single Node Linux Installations

If the error failed to render archive directory appears during an installation of Single Node Installation or Air-Gapped Single Node installation, this may indicate the incorrect ingress mode was selected during the installation process.

Verify that Ingress mode is Embedded:

Add nodes to cluster

Wallaroo Stops Responding

A potential issue causing Wallaroo to stop responding is when the Postgres database no longer has storage capacity for additional logs and data. The following procedure confirms this as an issue and offers a resolution. Note that these commands require administrative access to the cluster hosting the Wallaroo installation and the kubectl command line tools.

Confirm the Disk is Full

To confirm the disk is full, run the following command. Adjust the namespace for the namespace used to install Wallaroo.

NAMESPACE=wallaroo
PG_POD=$(kubectl get pod -n $NAMESPACE -l role=postgres -o name | head -1)
kubectl exec -n $NAMESPACE $PG_POD -- df -h /var/lib/postgresql/data

The following shows a typical, healthy output for postgresql/data.

Filesystem                Size      Used Available Use% Mounted on
/dev/sdc                 97.9G     81.0M     97.8G   0% /var/lib/postgresql/data

If the Use% is near or equals 100%, this indicates an issue with storage space availability. The following shows this condition.

Filesystem                Size      Used Available Use% Mounted on
/dev/sdc                 97.9G     97.8G     97.8G   99% /var/lib/postgresql/data

If this condition occurs, continue following the guide. If storage is 80% or lower, contact your Wallaroo support representative for more assistance.

Delete the Emergency Buffer

If the postgresql/data output is near or at 100%, delete the buffer to free reserved space. Typically this is around 5Gi of storage capacity recovered.

kubectl exec -n $NAMESPACE $PG_POD -- \
  rm /var/lib/postgresql/data/emergency_buffer.bin
# Confirm space is recovered
kubectl exec -n $NAMESPACE $PG_POD -- df -h /var/lib/postgresql/data

Trigger the Cleaner Immediately

The next step is to trigger the Cleaner job as a cronjob command. The cleaner command:

  • Delete rows older than dbcleaner.maxAgeDays
  • Delete oldest rows from any table exceeding its disk quota safety threshold, down to the target watermark.
  • Run VACUUM FULL on any table that had rows deleted, physically returning space to the OS.

The executes this command now rather than waiting for the regularly scheduled run.

kubectl create job -n $NAMESPACE dbcleaner-manual-$(date +%s) \
  --from=cronjob/dbcleaner-job

To view the progress, the logs are followed via the kubernetes logs command with the -f flag to “follow” the logs:

kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dbcleaner-job -f

Verify Cleanup Was Successful

Once the process is complete, verify the disk space was updated via the following command:

kubectl exec -n $NAMESPACE $PG_POD -- df -h /var/lib/postgresql/data

For example, the following output is created:

Filesystem                Size      Used Available Use% Mounted on
/dev/sdc                 97.9G     81.0M     97.8G   0% /var/lib/postgresql/data

Confirm VACUUM completed cleanly (dead rows should be zero or near zero) via the following command:

kubectl exec -n $NAMESPACE $PG_POD -- \
  psql -U postgres -c "
    SELECT relname,
           n_dead_tup,
           n_live_tup,
           last_vacuum,
           last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname IN ('fluentbit', 'sidekick_logs');
  "

The following shows a sample output. Note that he setting n_dead_tup should be 0 or very low after a successful VACUUM FULL.

 relname  | n_dead_tup | n_live_tup | last_vacuum |        last_autovacuum
-----------+------------+------------+-------------+-------------------------------
 fluentbit |          0 |       1004 |             | 2026-01-17 16:00:24.672247+00

Recreate the Emergency Buffer

Once disk pressure is resolved, restart the Postgres pod so the postStart hook recreates the buffer file via the following commands.

kubectl rollout restart deployment/postgres -n $NAMESPACE
kubectl rollout status deployment/postgres -n $NAMESPACE

The following shows standard execution of these commands.

❯ kubectl rollout restart deployment/postgres -n $NAMESPACE
deployment.apps/postgres restarted
❯ kubectl rollout status deployment/postgres -n $NAMESPACE
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled out

Verify the buffer was recreated via the following commands.

NEW_PG_POD=$(kubectl get pod -n $NAMESPACE -l role=postgres -o name | head -1)
kubectl exec -n $NAMESPACE $NEW_PG_POD -- \
  ls -lh /var/lib/postgresql/data/emergency_buffer.bin

Note: If the pod does not have enough free space to allocate the buffer, the postStart hook logs a warning and continues execution. Ensure at least emergencyBufferSize (default 5Gi) of free space is available before restarting, or repeat steps 3–4 until enough space is reclaimed.

If the Problem Recurs

If log tables are growing faster than the cleaner can keep up, consider adjusting these Helm values:

ValueDefaultEffect
dbcleaner.schedule*/8 * * * *Increase frequency (e.g. */2 * * * *)
dbcleaner.maxAgeDays--Reduce to retain fewer days of logs
dbcleaner.safetyPercentage0.85Lower to trigger cleanup earlier
dbcleaner.targetPercentage0.70Lower to clean more aggressively
postgres.diskSize100GiExpand the PVC if more retention is needed