Why do I receive an EOF error when attempting to backup by DTR environment?

Article ID: KB000531

Issue

When performing a backup of a DTR environment using the command docker run --rm -i docker/dtr backup, the backup fails with the following message:

FATA[0063] Error waiting for dtr-phase2 to finish:
An error occurred trying to connect: Post https://docker-ucp1:443/v1.22/containers/<id>/wait: EOF, code: -1

It can also happen when you try to open a backup file:

tar -cf /tmp/backup-images.tar dtr-registry-<replica-id>
...
tar: Unexpected EOF in archive 
tar: Error is not recoverable: exiting now 
...

Prerequisites

Before performing these steps, you must meet the following requirements:

  • Have configured a load balancer to balance traffic between your DTR replicas

Root Cause

During a DTR backup job, the bootstrap script for the backup command spins up a dtr-phase2 container, where most of the backup work is performed. The bootstrapper then monitors the progress of dtr-phase2 via an ongoing call to the ContainerWait API endpoint which blocks until an exit status is returned from the container.

The ContainerWait API is not performing a large amount of traffic on the wire, if any at all. This is problematic when an incorrectly configured load balancer is involved in the communication and is not configured to keep connections alive for a large enough amount of time. This leads to the load balancer cutting the connection and the EOF error.

Diagnostic Steps

The following steps can be used to test for a UCP loadbalancer timeout independently of the dtr backup command:

  1. As the Admin user, download and source a UCP client certificate bundle:

    https://docs.docker.com/datacenter/ucp/2.2/guides/user/access-ucp/cli-based-access/

  2. Test if docker wait for a long-running container times out after approximately the same amount of time that dtr backup is prematurely exiting.

    time docker wait $(docker ps -qaf name=ucp-controller |head -n1)
    

    If this command prematurely exits with an error in approximately the same amount of time as dtr backup, then a load balancer may be terminating the connection.

Resolution

To fix this issue, increase the tcp_keepalive setting on the load balancer balancing traffic across the DTR replicas to a value of 5 minutes.