Installing and testing latest version of Qserv in Docker


(Teng Li) #1

Dear Qserv developers,

Recently I’ve been trying to install and test the latest version of Qserv (b3a533f at the time of writing).

I’m doing a simple multi-node installation on a single CentOS 7 host, using the deployment script in the Qserv repo, admin/tools/docker/deployment/localhost/run-multinode-tests.sh. This is my env.sh:

VERSION=b3a533f
NB_WORKERS=2

# Set nodes names

DNS_DOMAIN=localdomain
MASTER=master."$DNS_DOMAIN"
for i in $(seq 1 "$NB_WORKERS");
do
    WORKERS="$WORKERS worker${i}.$DNS_DOMAIN"
done

# Set images names

MASTER_IMAGE="qserv/qserv:${VERSION}_master"
WORKER_IMAGE="qserv/qserv:${VERSION}_worker"

The worker nodes fail to start correctly. The worker container prints the following error:

Container timezone not modified
INFO: Qserv execution directory : /qserv/run
Starting MySQL. SUCCESS!
Starting xrootd./qserv/run/etc/init.d/qserv-functions: line 46: 267 Aborted (core dumped) /qserv/stack/stack/miniconda3-4.7.12-984c9f7/Linux64/xrootd/lsst-dev-g628f1b81fb/bin/xrootd -c /qserv/run/etc/lsp.cf -l @libXrdSsiLog.so -n worker -I v4 -+xrdssi /qserv/run/etc/xrdssi.cnf >> /qserv/run/var/log/xrootd.log 2>&1
ERROR! : Manager of pid-file quit without updating file.
ERROR!
See startup logfiles : /qserv/run/var/log/xrootd.log, /qserv/run/var/log/worker/xrootd.log
Starting cmsd. SUCCESS!
Starting qserv-wmgr SUCCESS!

Looking inside the container, I found that the xrootd daemon crashes on start, producing a core file but no error message in the log file.

I’m wondering whether I’m testing a broken build or have done something else wrong.
I can provide further information if you need it.

Many Thanks,
Teng


(Fritz Mueller) #2

Hi Teng,

The containers at the SHA you mention were chronologically the most recent built by Travis, but happen to have been built off one of Fabrice’s as-yet un-merged branches, and so aren’t guaranteed to be in working order.

Our latest SHA on master currently is 93c3b68. Could you please give it another shot with those containers and see if you achieve a better result? If not, follow up again here and we’ll be glad to help figure it out!

Thanks,
–FritzM.


(Teng Li) #3

Hi Fritz,

Thank you for your reply, I’ll try 93c3b68.

Cheers,
Teng


(Teng Li) #4

Hi Fritz,

It seems the error is the same with the latest version:

[qserv@worker2 bin]$ ./qserv-start.sh 
INFO: Qserv execution directory : /qserv/run
Starting MySQL. SUCCESS! 
Starting xrootd./qserv/run/etc/init.d/qserv-functions: line 46:   348 Aborted                 (core dumped) /qserv/stack/stack/miniconda3-4.7.12-984c9f7/Linux64/xrootd/lsst-dev-g628f1b81fb/bin/xrootd -c /qserv/run/etc/lsp.cf -l @libXrdSsiLog.so -n worker -I v4 -+xrdssi /qserv/run/etc/xrdssi.cnf >> /qserv/run/var/log/xrootd.log 2>&1
 ERROR! : Manager of pid-file quit without updating file.
 ERROR! 
See startup logfiles : /qserv/run/var/log/xrootd.log, /qserv/run/var/log/worker/xrootd.log
Starting cmsd. SUCCESS! 
Starting qserv-wmgr SUCCESS! 
[qserv@worker2 bin]$ 
[qserv@worker2 bin]$ ls
core.348  env.sh  qserv-connect-mysql-proxy.sh  qserv-connect-mysql-sock.sh  qserv-restart.sh  qserv-start.sh  qserv-status.sh  qserv-stop.sh  worker

(K-T Lim) #5

That core.348 file will be useful, if it contains data. Please try to preserve it and put it somewhere where Fritz’s team can get at it.
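
One way to check whether the core file contains data, and to look at where the crash happened, is sketched below. This is a minimal sketch under stated assumptions: the helper name inspect_core is hypothetical, gdb may not be installed inside the worker container, and the location of core.348 (taken here to be /qserv/run/bin, based on the shell prompt above) is a guess. The xrootd binary path is the one shown in the startup error.

```shell
#!/bin/sh
# Hypothetical helper: report the size of a core file and, when gdb is
# installed, print a backtrace from it.
# Usage: inspect_core CORE_FILE BINARY
inspect_core() {
    core=$1
    binary=$2
    # -s is true only if the file exists and is larger than zero bytes.
    if [ ! -s "$core" ]; then
        echo "core file missing or empty: $core" >&2
        return 1
    fi
    ls -lh "$core"
    if command -v gdb >/dev/null 2>&1; then
        # -batch exits after running the commands; -ex bt prints the backtrace.
        gdb -batch -ex bt "$binary" "$core"
    else
        echo "gdb not found; keep $core for later analysis" >&2
    fi
}

# Example call (core location is an assumption, binary path is from the log):
# inspect_core /qserv/run/bin/core.348 \
#     /qserv/stack/stack/miniconda3-4.7.12-984c9f7/Linux64/xrootd/lsst-dev-g628f1b81fb/bin/xrootd
```

A backtrace is most readable when gdb runs in the same container image that produced the core, since the matching shared libraries are available there.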


(Teng Li) #6

The core.348 is 82M. How could I share it with you?

Teng


(K-T Lim) #7

If it’s (only) 82MB, you could attach it to a message here using the “Upload” button in the editor.


(Teng Li) #8

There’s a 4 MB limit on attachments. I can email it to anybody who wants it.


(Fritz Mueller) #9

Hi Teng, thanks for checking master, that will make it easier for us to troubleshoot.

We’d be interested to see:

  • core file
  • the two called-out log files
  • /qserv/run/etc/xrdssi.cnf from inside the container

If you’d like to give email a shot for the core file, you could hit me at fritzm@slac.stanford.edu. Otherwise, if you put it up somewhere on AFS we’ll come and get it?


(Fritz Mueller) #10

Oh, also:

  • /qserv/run/etc/lsp.cf from inside the container
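
The files requested above could be collected into a single compressed archive before sharing. The sketch below is only an illustration: the function name bundle_diagnostics and the output file name are hypothetical, and the core file location is an assumption. Files that do not exist are skipped with a warning rather than aborting the archive.

```shell
#!/bin/sh
# Hypothetical helper: bundle whichever of the given files exist into a
# gzip-compressed tarball.
# Usage: bundle_diagnostics OUTPUT.tar.gz FILE...
bundle_diagnostics() {
    out=$1
    shift
    # Rebuild the argument list so that it holds only existing files.
    for f in "$@"; do
        shift
        if [ -e "$f" ]; then
            set -- "$@" "$f"
        else
            echo "skipping missing file: $f" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "nothing to bundle" >&2
        return 1
    fi
    tar -czf "$out" "$@"
}

# Example call from inside the worker container (core path is an assumption):
# bundle_diagnostics /tmp/qserv-diagnostics.tar.gz \
#     /qserv/run/bin/core.348 \
#     /qserv/run/var/log/xrootd.log \
#     /qserv/run/var/log/worker/xrootd.log \
#     /qserv/run/etc/xrdssi.cnf \
#     /qserv/run/etc/lsp.cf
```

The resulting archive can then be copied off the container and uploaded wherever is convenient.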

(Teng Li) #11

Hi Fritz,

I’ve sent you the files. /qserv/run/var/log/worker/xrootd.log doesn’t exist, so there’s only one xrootd.log.

Thanks,
Teng


(Fritz Mueller) #12

Hi Teng – your email has apparently not made it through. Checked my spam filter too, but have come up empty.

Any other way than email you might get us that core file? public_html somewhere? Google drive? AFS? …?

cheers,
–FritzM.


(Teng Li) #13

Sorry for the delay. I’ve made a google drive share:
https://drive.google.com/file/d/1jZ9m_FM7afRJ2wgClviAi0acuwSLbnDk/view?usp=sharing


(Fritz Mueller) #14

Thanks, Teng – got it, taking a look now…


(Fritz Mueller) #15

Hi Teng,

I managed to get a backtrace from that core file, and what we find is a boost::uuids::entropy_error being thrown from boost::uuids::random_generator().

This is likely a result of the Qserv worker Docker container being hosted on an older kernel that is missing some recent syscalls. Could you let us know which kernel version your Docker host is running?
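
A quick way to answer the kernel question is sketched below. Containers share the host's kernel, so uname -r gives the same answer on the Docker host and inside the worker container. The comparison against 3.17 is an assumption on my part: that is the release in which getrandom(2) was added, and it is one syscall that would fit the description above, but the thread does not confirm which syscall is involved. The helper name kernel_at_least is hypothetical, and since distribution kernels sometimes backport syscalls, the version number is only a hint.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel release string in $1 is at
# least MAJOR.MINOR, given as $2 and $3.
kernel_at_least() {
    release=$1
    want_major=$2
    want_minor=$3
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%.*}           # text before the next dot
    minor=${minor%%[!0-9]*}     # drop any non-numeric suffix
    if [ "$major" -gt "$want_major" ]; then
        return 0
    fi
    if [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; then
        return 0
    fi
    return 1
}

host_kernel=$(uname -r)
echo "Kernel release: $host_kernel"
# Assumption: getrandom(2), added in Linux 3.17, is the syscall of interest.
if kernel_at_least "$host_kernel" 3 17; then
    echo "Kernel is 3.17 or newer; getrandom(2) should be available."
else
    echo "Kernel is older than 3.17; getrandom(2) may be missing unless backported."
fi
```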