I assume care is taken to always defer unlock after acquiring a lock. Have you (Docker Inc as a whole, not you specifically) not got any scripts to loop over the goroutines in a running daemon and dump out what they're blocked on? I've just compiled the daemon with symbols and then loaded those symbols into a gdb attached to the docker daemon.
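For reference, a Go process can also dump its own goroutines without gdb. This is only a minimal sketch using the standard `runtime/pprof` package; `dumpGoroutines` is a hypothetical helper name, and nothing here is taken from the Docker daemon's code.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

// dumpGoroutines returns a stack trace for every goroutine in this process.
// (Hypothetical helper for illustration, not a Docker API.)
func dumpGoroutines() string {
	var buf bytes.Buffer
	// debug=2 prints one block per goroutine, including its wait state.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "failed to write goroutine profile: " + err.Error()
	}
	return buf.String()
}

func main() {
	var mu sync.Mutex
	mu.Lock() // held and never released, to simulate a stuck lock

	go func() {
		mu.Lock() // this goroutine blocks here
		defer mu.Unlock()
	}()

	// Give the goroutine a moment to reach the blocking call.
	time.Sleep(100 * time.Millisecond)
	fmt.Print(dumpGoroutines())
}
```

The blocked goroutine shows up in the dump with a frame inside `sync.(*Mutex).Lock`, which is the same thing to look for in a gdb backtrace.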
Let me know, we can build a debug binary with deadlock detection.
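To make "deadlock detection" concrete, here is one possible shape for it, as a sketch only: a lock that reports when it cannot be acquired within a deadline. `watchedMutex`, `lockWithin` and `unlock` are hypothetical names, and this is not how any actual Docker debug build works. It also shows the lock-then-defer-unlock pairing mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// watchedMutex is a hypothetical mutex built on a one-slot channel so that
// acquisition can time out instead of blocking forever.
type watchedMutex struct {
	slot chan struct{}
}

func newWatchedMutex() *watchedMutex {
	return &watchedMutex{slot: make(chan struct{}, 1)}
}

// lockWithin tries to take the lock and gives up after d.
// A false result means the lock was held for the whole interval,
// which is a hint (not proof) of a deadlock.
func (m *watchedMutex) lockWithin(d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case m.slot <- struct{}{}:
		return true
	case <-timer.C:
		return false
	}
}

// unlock releases the lock; it must only be called by the current holder.
func (m *watchedMutex) unlock() { <-m.slot }

func main() {
	m := newWatchedMutex()

	if m.lockWithin(time.Second) {
		// Pair every successful lock with a deferred unlock.
		defer m.unlock()
	}

	// A second acquire while the lock is still held simulates a caller
	// stuck behind a lock that is never released.
	if !m.lockWithin(50 * time.Millisecond) {
		fmt.Println("possible deadlock: lock not acquired within 50ms")
	}
}
```

A real detector would also want to record who holds the lock, but a timeout alone is enough to turn a silent hang into a log line.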
Is the lock held for the entire life of the container?
After that, docker ps hung with no way to recover but a reboot.
This was on centos atomic 4a524a58cb with docker 1.6.0.
--storage-opt dm.loopdatasize=150G
    └─1051 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8811 -container-ip 172.17.0.1 -container-port 8811

The agent container, from systemctl status:

    CGroup: /system.slice/agent.service
    └─1009 /usr/bin/docker run -a STDOUT -a STDERR --name agent -p 8811 -h agent23 --security-opt label:disable -v /usr/bin/docker:/usr/bin/docker -v /var/run/docker.sock:/var/run/-v /home/build:/home/build -v /var/srv/data:/opt/data /root/run

The build container, from the build script:

    docker run -t --rm --name builder -e JOB=$proj -c 256 -u build -w /home/build --security-opt label:disable -v /opt/data/scripts/docker:/home/build/scripts -v $workdir/build:/home/build/workdir -v /home/build:/home/build/homedir -e MAVEN_OPTS=$maven_opts $(cat builder-image.txt) /home/build/scripts/build-maven $proj $target

No other containers were running.
After the build gave up, a subsequent build started which included a script that kills any left-over containers.
It happens sometimes, but I have one case where it kills a container every time. I want to make sure it is a bug before opening a ticket. Here's what happens:

qba19 ~ # pgm/perf-tools/bin/tpoint syscalls:sys_enter_kill
Tracing syscalls:sys_enter_kill.
67219.620193: sys_kill(pid: 3b0, sig: 0) docker-18317  ....
67225.462948: sys_kill(pid: 92d, sig: 0) docker-18317  ....
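A note on reading the trace: `kill` with signal 0 sends nothing and only checks whether the target process exists and may be signalled, so those two calls look like liveness probes. The sketch below shows the same probe from Go on a Unix-like system. It assumes the pids in the trace are printed in hexadecimal; `parseHexPID` and `processExists` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"syscall"
)

// parseHexPID converts a pid printed in hex (as assumed for the trace
// above) into an int.
func parseHexPID(s string) (int, error) {
	n, err := strconv.ParseInt(s, 16, 32)
	return int(n), err
}

// processExists probes a pid with signal 0. No signal is delivered; the
// kernel only performs the existence and permission checks.
func processExists(pid int) bool {
	err := syscall.Kill(pid, syscall.Signal(0))
	// EPERM means the process exists but we may not signal it.
	return err == nil || err == syscall.EPERM
}

func main() {
	for _, hex := range []string{"3b0", "92d"} {
		pid, err := parseHexPID(hex)
		if err != nil {
			fmt.Println("bad pid:", hex, err)
			continue
		}
		fmt.Printf("pid 0x%s = %d, exists here: %v\n", hex, pid, processExists(pid))
	}
}
```

Under that assumption, `3b0` and `92d` are pids 944 and 2349 in decimal.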
I can now write python scripts to traverse all the goroutines looking for particular things. I know what I'm looking for. If we grep out the callers to Lock then we find the callsites goroutines are blocked at:

[email protected]:~# grep -A1 Lock | grep -v '^--$' | grep -v 'Mutex' | sort | uniq -c
      1 #5 0x00000000004bc5ba in github.com/docker/docker/daemon.(*Daemon).ContainerInspect (daemon=0xc2080691e0, job=0xc208857980, ~r1=0) at /go/src/github.com/docker/docker/daemon/inspect.go:17
      1 #5 0x00000000004c861f in github.com/docker/docker/daemon.(*State).

Do you have any links to previous tickets like this? Unfortunately there are 618 goroutines :) I've scripted their dumping, result at https://gist.github.com/aidanhs/960beaf2db1de622a2dd.
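The grep pipeline above can be approximated in a few lines of Go, which may be easier to extend than shell once the dump is large. This is a rough sketch, not an exact reimplementation of grep: `lockCallers` is a hypothetical name, and the sample frames in `main` are made up for illustration rather than taken from the real dump.

```go
package main

import (
	"fmt"
	"strings"
)

// lockCallers roughly mirrors
//   grep -A1 Lock | grep -v 'Mutex' | sort | uniq -c
// It selects every line containing "Lock" plus the line after it, drops
// lines mentioning "Mutex", and counts how often each remaining line occurs.
func lockCallers(dump string) map[string]int {
	lines := strings.Split(dump, "\n")
	selected := make([]bool, len(lines))
	for i, l := range lines {
		if strings.Contains(l, "Lock") {
			selected[i] = true
			if i+1 < len(lines) {
				selected[i+1] = true // the "-A1" line of context
			}
		}
	}
	counts := map[string]int{}
	for i, l := range lines {
		l = strings.TrimSpace(l)
		if selected[i] && l != "" && !strings.Contains(l, "Mutex") {
			counts[l]++
		}
	}
	return counts
}

func main() {
	// Hypothetical, abbreviated backtrace frames for two goroutines.
	dump := "#4 0x01 in sync.(*Mutex).Lock (m=0x1)\n" +
		"#5 0x02 in example.(*Daemon).Inspect ()\n" +
		"#6 0x03 in example.handler ()\n" +
		"#4 0x01 in sync.(*Mutex).Lock (m=0x1)\n" +
		"#5 0x02 in example.(*Daemon).Inspect ()\n"

	for frame, n := range lockCallers(dump) {
		fmt.Printf("%7d %s\n", n, frame)
	}
}
```

Each remaining line is the frame directly above a `Mutex.Lock` call, so the counts show which callsites the blocked goroutines are piling up at.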
This will lead to unexpected behavior, data loss and errors.