HTCondor Docker universe throws core.STARTER

This is a problem observed when using HTCondor in the Docker universe.

After reconfiguring HTCondor and Docker on one processing node, every job submitted to it causes the following errors to be dumped in the corresponding slot's StarterLog.slot1_N:

(pid:24877) Found 33 entries in docker image cache.
Stack dump for process 24877 at timestamp 1497439919 (13 frames)
/lib64/libcondor_utils_8_7_1.so(dprintf_dump_stack+0x72)[0x7fbbc3e6f0b2]
/lib64/libcondor_utils_8_7_1.so(_Z18linux_sig_coredumpi+0x24)[0x7fbbc3ffb534]
/lib64/libpthread.so.0(+0xf370)[0x7fbbc25bc370]
/lib64/libstdc++.so.6(_ZNSt8__detail15_List_node_base9_M_unhookEv+0x7)[0x7fbbc2f76077]
/lib64/libcondor_utils_8_7_1.so(_ZN9DockerAPI3runERN14compat_classad7ClassAdES2_RKSsS4_S4_RK7ArgListRK3EnvS4_St4listISsSaISsEERiPiR11CondorError+0x42e)[0x7fbbc3e31cee]
condor_starter(_ZN10DockerProc8StartJobEv+0xb66)[0x454656]
condor_starter(_ZN8CStarter8SpawnJobEv+0xc3)[0x45b753]
condor_starter(_ZN8CStarter14SpawnPreScriptEv+0x197)[0x459757]
/lib64/libcondor_utils_8_7_1.so(_ZN12TimerManager7TimeoutEPiPd+0x182)[0x7fbbc3ff9952]
/lib64/libcondor_utils_8_7_1.so(_ZN10DaemonCore6DriverEv+0x9cb)[0x7fbbc3fdb59b]
/lib64/libcondor_utils_8_7_1.so(_Z7dc_mainiPPc+0x13e8)[0x7fbbc3ffebe8]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbbc220db35]

A core.STARTER file is also generated, and the output of `gdb /var/log/condor/core.STARTER <<< "where"` is:

Core was generated by `condor_starter -f -a slot1_1 fqdn.domain.com'.

(gdb) Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0xb1340bc0:

What led to this was a bug in the Docker thinpool storage driver, which prompted a switch to the overlay2 driver together with a Docker reinstall.

Solution:
There are 'hidden' dot files in the condor log directory that contain cache information, and these can interfere with job submission. To fix this, stop condor, remove those files, and start condor again. Once that is done, the node starts accepting Docker universe jobs again.

systemctl stop condor
cp /var/log/condor/.s* /tmp/
rm -f /var/log/condor/.s*
systemctl start condor
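
The backup-and-remove step above can also be wrapped in a small helper, shown here only as a sketch. The function name `clear_starter_cache` and the backup directory are hypothetical and not part of HTCondor. It assumes the cache files are regular files matching `.s*` in the log directory, as in the commands above, and it removes a file only after the copy has succeeded.

```shell
#!/bin/sh
# Hypothetical helper (not an HTCondor tool): back up and remove the
# hidden .s* cache files from the condor log directory.
#   $1 = condor log directory, e.g. /var/log/condor
#   $2 = backup directory (created if missing)
clear_starter_cache() {
    log_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for f in "$log_dir"/.s*; do
        # Skip the unexpanded pattern (no match) and anything that is
        # not a regular file.
        [ -f "$f" ] || continue
        # Remove the original only if the backup copy succeeded.
        cp -p "$f" "$backup_dir"/ && rm -f "$f" || return 1
    done
    return 0
}

# Intended use, run as root (the backup path is an assumption):
#   systemctl stop condor
#   clear_starter_cache /var/log/condor /tmp/condor-cache-backup
#   systemctl start condor
```

Keeping the copies in a separate directory makes it possible to restore the files if removing them turns out not to help.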
