Monday, 2017-12-11

*** jkilpatr has joined #openstack-sprint00:09
*** baoli has quit IRC03:34
*** skramaja has joined #openstack-sprint05:26
*** ianychoi_ is now known as ianychoi06:12
fricklerianw: hi, sorry for being late, is there anything left I could help you with? or are you done for today?08:37
* frickler is seeing messages of type "< openstackrecheck> Console logs not available after ..." again this morning for the first time in weeks, does anyone know what happened there?09:04
ianwfrickler: hey, thanks for the reviews10:47
ianwyou can pick out any parts, or wait for clarkb etc10:48
ianwif you want to jump further into ethercalc, feel free10:48
ianwbasically, ssh to the testing host 23.253.119.134 and "cd /opt/system-config/production; puppet apply -v --modulepath=modules:/etc/puppet/modules manifests/site.pp" and keep fixing stuff till it works :)  there's some notes on the etherpad, for sure our puppet needs to ship a .service file instead of an upstart, for example10:50
ianwi'm off but will jump back in tomorrow!10:50
fricklerI've looked at your notes for ethercalc and was wondering whether we should do a systemd service file directly10:51
fricklerotherwise I can go on with iterating. do I need to become root for that?10:52
ianwfrickler: we'll want to replace https://git.openstack.org/cgit/openstack-infra/puppet-ethercalc/tree/templates/upstart.erb (and all the stuff that writes that out) with a .service file10:54
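A minimal sketch of what that .service file could look like, written here as a heredoc so it can be tried by hand on the test host first; the user, install path and lack of extra ethercalc flags are assumptions, not the final puppet template:

```
sudo tee /etc/systemd/system/ethercalc.service <<'EOF'
[Unit]
Description=EtherCalc collaborative spreadsheet
After=network.target

[Service]
User=ethercalc
ExecStart=/usr/local/bin/ethercalc
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ethercalc
```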
ianwfrickler: yep; for these hosts log in as yourself and sudo -s, for ci hosts you log in as root@ (your key should be deployed, i went through all that today :)10:55
fricklerianw: yeah, I'm on the host already, will go ahead and try to build a service definition10:56
ianwthat'll be step 1 ... the nodejs deployment stuff might need fiddling.  i think that will work out common for etherpad too, so that's good.  it's just a matter of trying & fixing failures till it works really10:58
ianwi thought this would be an easy one :)  status.o.o is probably *really* an easy one ... but you never know :)10:59
fricklereverything looks easy from the outside probably ;)11:01
*** skramaja_ has joined #openstack-sprint11:20
*** skramaja has quit IRC11:21
*** skramaja has joined #openstack-sprint11:25
*** skramaja_ has quit IRC11:25
*** jkilpatr has quit IRC11:37
*** ianychoi has quit IRC11:37
*** ianychoi has joined #openstack-sprint11:50
*** jkilpatr has joined #openstack-sprint12:11
*** baoli has joined #openstack-sprint13:10
*** clarkb has joined #openstack-sprint13:15
clarkbfrickler: dmsimard the puppetmaster:/etc/puppet/hieradata/production git repo is where we keep the root non public hiera data13:20
fricklerclarkb: so do I connect with my account and use sudo then?13:20
clarkbfrickler: correct13:20
clarkbthe reason our email addresses are not public is that we found people were using our puppet modules with our email addresses still configured as the root contact, resulting in us getting their root email13:21
fricklerclarkb: can you take a look at elasticsearch in the meantime? seems the cluster is stuck13:21
fricklerprobably since ian did some updates earlier13:21
clarkblooks like es02 and es04 did not have their elasticsearch processes running (we don't let them start on boot to give us more control over cluster management) so I started the service on those two hosts13:22
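For reference, bringing those processes back and checking the cluster amounts to something like this on each affected host (standard elasticsearch service name and default REST port assumed):

```
sudo service elasticsearch start
curl -s 'http://localhost:9200/_cluster/health?pretty'   # wait for status to leave red
```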
clarkbin general editing this hiera repo is what we'll do to update ssl certs or db credentials and so on13:23
clarkbso adding yourself to the sysadmins list is a good first exposure to where that lives and how to update it13:23
fricklero.k., so I'll start with that now13:23
clarkband since it's a shared repo, when we edit it we'll usually drop a note in IRC to let others know not to conflict with us13:25
clarkbfrickler: looks like you are all done?13:27
frickleroh, I'm seeing a note in the log about issues with google mail13:27
fricklermy address is also gmail rebranded13:27
fricklermaybe I should set up something different from my work email anyway13:28
fricklerbut for now I'm done with editing, yes13:28
clarkbit may not be an issue for you, I think pabelanger's red hat email is gmail too13:28
clarkbI personally had problems with gmail and switched away from it though13:29
clarkbfrickler: dmsimard the next thing I had in mind was to replace a logstash-workerNN.openstack.org node each, since those are straightforward to replace and should let us focus more on process than on specific service details13:30
fricklermy private stuff is hosted at hetzner.de but I need to move things around there a bit first13:30
clarkbfrickler: dmsimard if you haven't seen it yet you probably want to start at https://git.openstack.org/cgit/openstack-infra/system-config/tree/launch/README13:30
clarkbsystem-config/launch is our openstack cloud VM launching tool for booting new instances in clouds13:31
clarkbwhen executed from the puppetmaster it can make use of our clouds.yaml on that node making the process fairly straightforward13:32
clarkbI personally have a git clone of system-config in my homedir on the puppetmaster that I run that from13:32
fricklerI noticed that I'm in admin group but not puppet. Is the idea to set this up manually when needed or should this get better automation?13:32
clarkbfrickler: I think we've always just run the manual group addition like in that doc, but we probably could automate that instead13:33
clarkbif you want to work on a change to automate that I think it would be a good addition13:33
clarkb(but maybe for later so we can focus on launch node things now)13:34
frickleryeah, I'll put it on my todo list13:34
*** skramaja has quit IRC13:36
fricklerthe pip install is failing for me https://git.openstack.org/cgit/openstack-infra/system-config/tree/launch/README#n2113:37
clarkbfun, is it a dependency issue?13:37
fricklerfailing to build multiple wheels13:37
* clarkb works to reproduce13:37
fricklerhttp://paste.openstack.org/show/62861513:38
fricklerthats the full log13:38
clarkblooks like it failed to find Python.h which comes from python-dev13:40
clarkbI think this may be a new dependency or we otherwise were able to pull wheels for it in the past?13:40
*** ianychoi has quit IRC13:40
dmsimardI'm here for around ~30 minutes before I have to afk briefly, going to look at step 0 and bootstrapping13:40
clarkboh wait I see13:40
clarkbfrickler: we have python 2 dev files installed but not python3 and virtualenv defaulted to python3 for some reason13:41
clarkbfrickler: I'm testing with `virtualenv -p python2 launch-env`13:42
Shrewsclarkb: i'm probably going to need the same bootstrapping as frickler and dmsimard13:42
clarkbShrews: good morning, feel free to follow along, ask questions, etc. We have plenty of logstash worker nodes so there should be plenty of room.13:43
*** baoli has quit IRC13:43
clarkbShrews: dmsimard has indicated he is editing the sysadmins list in hiera and since that is a shared git repo we will have to wait for him to indicate completion before you add yourself13:43
clarkbShrews: the file for that is puppetmaster.openstack.org:/etc/puppet/hieradata/production/common.yaml when dmsimard is done13:43
Shrewsk k13:43
clarkbbasically you edit and commit as root and sign off on the change with your name in the commit message13:44
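The edit-and-sign-off flow described above looks roughly like this (repo path from the conversation; the commit message wording is only an example):

```
ssh puppetmaster.openstack.org
sudo -s
cd /etc/puppet/hieradata/production
$EDITOR common.yaml                              # e.g. add yourself to the sysadmins list
git add common.yaml
git commit -m "Add <nick> to sysadmins list"     # sign off with your name/nick in the message
```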
dmsimardclarkb: I see symlinks from <nickname> to production13:44
clarkbdmsimard: yes, that is an artifact of puppet environments13:44
dmsimardclarkb: is that a system used to "lock" ? i.e, we grep to see if there is a user doing it ?13:44
dmsimardoh, ok13:44
clarkbI've personally not used puppet environments any time recently because they are often quite clunky (and I think ansible-puppet may have mostly negated their usefulness by local applying everything)13:45
clarkbInstead I do my best to run puppet locally until I'm happy with it (which is probably a better way to do things anyways)13:45
clarkbfrickler: yes virtualenv -p python2 launch-env seemed to work13:45
clarkbfrickler: that forced virtualenv to make the env with python2 instead of python313:45
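Putting the README step and the workaround together, the working sequence on the puppetmaster was roughly the following (the python2 dev headers were already present there; keep whatever version pins launch/README specifies, which are omitted here):

```
virtualenv -p python2 launch-env             # a bare "virtualenv" picked python3 and failed to build wheels
./launch-env/bin/pip install ansible shade   # use the pinned versions from launch/README
```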
dmsimardclarkb: do people typically remain as their own user or they sudo as root ? i.e, I'd want to move to /etc/puppet/hieradata13:45
clarkbdmsimard: I think it's a mix. I know pleia2 for example was really good about always sudoing everything and never properly becoming root. I came from an env where we didn't have sudo and only had proper root, so I end up as proper root more often than is probably good13:46
fricklerclarkb: confirmed, ansible installed fine now13:47
dmsimardoh wow, nano as default editor on git commit.. that's something I haven't seen in a long time13:47
dmsimard:D13:48
fricklerdmsimard: I stumbled over that too ;)13:48
clarkbdmsimard: I think that is how we've avoided the vi(m) vs emacs battle :P13:48
clarkbfrickler: great, I'll push a patch up for that now and then add you two to the infra root gerrit group so you can review it for me :)13:48
dmsimardShrews: I'm done editing hieradata13:48
Shrewsdmsimard: k13:49
Shrewsdmsimard: i noticed you didn't sign your commit. want to amend before I change anything?13:50
dmsimardShrews: let me see..13:51
clarkbfrickler: https://review.openstack.org/527092 and I will have gerrit groups updated momentarily13:51
dmsimardShrews: by sign you mean append my nickname to the commit description ?13:51
dmsimardShrews: or gpg sign ?13:52
Shrewsdmsimard: just a nick in the commit msg13:52
dmsimardShrews: ok, I added it13:52
dmsimardShrews: er, hang on..13:52
clarkb13:50:23   Shrews | dmsimard: k13:53
clarkbsilly weechat mouse mode13:54
dmsimardIt's picking up the author as "Your Name <you@example.com>"13:54
dmsimard¯\_(ツ)_/¯13:54
dmsimardfixing that13:54
dmsimardShrews: ok, go13:55
clarkbfrickler: dmsimard you have been added to the infra-core group in gerrit. So you can now +/-2 +/-A changes like https://review.openstack.org/52709213:56
fricklerclarkb: already done ;)13:57
clarkbdmsimard: you'll want to read https://git.openstack.org/cgit/openstack-infra/system-config/tree/launch/README next and follow the steps through to line 21 (but using my edit in change 527092)13:57
Shrewsalright, done. i don't have any option but a gmail address, so *shrug* if there's a problem with that13:57
clarkbShrews: ^ you'll want to follow that too13:57
dmsimardyou'd think that "virtualenv" would be py2 and "virtualenv-3.4" would be py3 :D13:58
clarkbya I'm not sure why it's picking python3 yet13:58
clarkbI think because it got installed under python3?13:58
dmsimardahhhh13:59
dmsimardThe default is the interpreter that virtualenv was installed with (/usr/bin/python3)13:59
clarkbI'm going to make tea while everyone makes virtualenvs14:00
* Shrews is now virtual14:01
dmsimardThose are some kind of old versions of shade and ansible by now -- Ansible 2.1 is EOL actually. Are they pinned for a good reason ?14:01
clarkbdmsimard: they are pinned because releases of both tend to break things. I'm not sure that they are pinned to those specific versions for a good reason though14:02
clarkbI expect that ansible 2.3 would work as well14:02
dmsimardclarkb: yeah that's totally fair, I would up the pin. I'll guinea pig ?14:02
clarkbdmsimard: maybe after the first round so that we can hopefully avoid problems first time through?14:03
dmsimardsure14:03
clarkbwhen we upgrade nodes typically what that actually means is replacing the instance with a new instance running newer software14:04
clarkbI only know of one case where we upgraded in place which was the lists.openstack.org upgrade and we did that to keep the IP and its reputation for sending email14:05
clarkbto upgrade logstash worker nodes we will be using the replace method14:05
clarkbSo the next step is looking at the old instance(s) to see what flavor/size/distro we need `openstack --os-cloud openstackci-rax --os-region DFW server show logstash-worker01.openstack.org` should be runnable as a normal user on puppetmaster to give you that info14:06
clarkbin this case we see the flavor is 'performance1-4' and it is indeed a trusty node so we will want to replace it with a 16.04 xenial node14:06
Shrewsaye14:07
dmsimardclarkb: should we grab a copy of clouds.yaml from root and put it in our home directory ?14:07
clarkbdmsimard: no, you should probably use the root copy; it should be readable by your user14:07
* dmsimard looks14:08
clarkbdmsimard: the root copy is the default for openstack client and this way we can keep it up to date more easily14:08
clarkbyou can also do things like flavor list and image list to get a sense of what flavors and images are available14:08
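For example, all of these run as a normal user on the puppetmaster against the control-plane account (cloud and region names taken from the conversation; the grep is just a convenience):

```
openstack --os-cloud openstackci-rax --os-region DFW server show logstash-worker01.openstack.org
openstack --os-cloud openstackci-rax --os-region DFW flavor list
openstack --os-cloud openstackci-rax --os-region DFW image list | grep -i xenial
```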
clarkbone piece of information that the launch README doesn't really call out, and that is probably worth being more explicit about, is that we have two tenants/users/projects/whateveritscalledtoday14:09
clarkbwe have the openstackci account and the openstackjenkins/openstackzuul account. openstackci is where we run the control plane servers and openstackjenkins/openstackzuul is what we give nodepool access to14:09
Shrewsyeah, that's kinda important14:10
clarkbin this case we are using the openstackci account because logstash workers are in the control plane but when you work with nodepool nodes you will use the openstackzuul/openstackjenkins account14:10
dmsimardclarkb: yeah I guess that's why I was asking for the clouds.yaml -- in order to use openstackclient "freely"14:10
clarkbdmsimard: you should be able to use it freely already14:11
clarkbdoes the command I pasted above work for you?14:11
clarkb(it should work as is)14:11
Shrewsi don't see the other account(s) in clouds.yaml14:11
Shrewsoh, all-clouds.yaml has them14:12
clarkbShrews: oh that brings up another important piece of info. We have two clouds.yaml files: the default file only has control plane stuff, and then there is all-clouds.yaml which you can point an env var at for everything14:12
Shrews*nod*14:12
dmsimardclarkb: what is this magic, are we created as uid 0 ?14:12
clarkbthe reason for this is the ansible-puppet things use the default file and we don't want it attempting to puppet nodepool nodes14:12
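A sketch of switching to the full file when the nodepool accounts are needed (OS_CLIENT_CONFIG_FILE is the standard os-client-config variable; the all-clouds.yaml path and the openstackzuul cloud name here are assumptions based on the conversation):

```
export OS_CLIENT_CONFIG_FILE=/etc/openstack/all-clouds.yaml
openstack --os-cloud openstackzuul-rax --os-region DFW server list
```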
*** baoli has joined #openstack-sprint14:12
clarkbdmsimard: I think its just group membership14:12
dmsimardclarkb: huh, I totally expected osc to seek in ~/.config, not ~/root/.config14:13
clarkbdmsimard: ya group admin gets rw access to the file14:13
dmsimardwell, wfm14:13
clarkbdmsimard: it is actually looking at /etc/openstack/clouds.yaml14:13
Shrewsdmsimard: shade (or occ, rather) will look in /etc/openstack and ~/.config14:13
Shrewspart of occ magic14:13
dmsimardclarkb: ohhhhhh, yeah /etc/openstack totally makes more sense than my confused explanation14:14
Shrewsos-client-config for non-shorthand14:14
clarkbso now we should all pick a unique logstash-workerNN NN value then we can start running some boots14:14
* frickler picks 0114:14
* Shrews picks 0214:15
dmsimardI have to afk briefly but I'm all set up, I'll pick a number when I'm back14:15
dmsimardLet's use the pad to keep up with who's doing what14:15
clarkbdmsimard: ++ to using etherpad to track14:15
dmsimardpad is here: https://etherpad.openstack.org/p/infra-sprint-xenial-upgrades14:15
dmsimardok, brb14:16
Shrewsclarkb: and we use launch-node.py, right?14:16
fricklerclarkb: what does our quota look like? do we need to check before launching new servers?14:17
clarkbShrews: correct14:17
clarkbfrickler: I don't actually know but we can ask openstackclient for that info (or we can just execute the command and if we don't have enough quota it will fail fast)14:17
clarkbbefore we start though a few more things14:17
Shrewsclarkb: value for $FQDN can be the same as the thing we are replacing?14:18
clarkbsince this is a base distro image upgrade we should be careful to explicitly set the image name we want. Also make sure the flavor matches the old server's14:18
clarkbShrews: yes14:18
* clarkb will make a quick paste for what the commands should look like in this specific case14:18
Shrewsa3b50a75-2fe0-437a-bf7a-04c2cf0adf4c | Ubuntu 16.04 LTS (Xenial Xerus) (PVHVM)14:19
clarkbya, something like http://paste.openstack.org/show/628623/14:20
clarkbreplacing the NN with your chosen value14:21
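A hedged reconstruction of what the pasted command looks like (flavor and image names are from this conversation; check launch/README for the exact flags your checkout supports):

```
cd ~/system-config/launch
./launch-node.py logstash-workerNN.openstack.org \
    --cloud openstackci-rax --region DFW \
    --flavor performance1-4 \
    --image "Ubuntu 16.04 LTS (Xenial Xerus) (PVHVM)"
```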
clarkbalso I tend to run this in screen14:21
clarkbsome server builds take longer than expected and being able to close the laptop is nice14:21
Shrewsoh, yeah. that's a good tip14:22
dmsimardOh yay other screen users14:22
* dmsimard needs to learn tmux14:22
Shrewsespecially since i have a chiro appointment soon14:22
clarkbfrickler: Shrews but ya I think you can go ahead and run that whenever you are ready14:22
clarkbin this specific case the server we are bringing up is largely stateless and will start its life firewalled off from the rest of the cluster so very little to worry about :)14:23
* Shrews launching14:24
* frickler is launching too and will be back in a couple of minutes14:24
* fungi sprints in, very late14:26
fungii'll get something good in the channel topic in just a sec14:27
fungididn't we have an ml thread discussing this? was it just in meetings?14:28
fungii guess i can link the 'pad14:28
clarkbthere is a ml thread too but I think the etherpad is likely most useful14:29
*** ChanServ changes topic to "OpenStack Infra team Xenial upgrade sprint | Coordinating at https://etherpad.openstack.org/p/infra-sprint-xenial-upgrades"14:31
clarkbShrews: frickler let me know when that completes (there should be a bunch of information about dns related items which we'll talk about next once we have that info)14:32
Shrewsooh exception14:33
clarkbwoo fun14:33
fungithe launch script raises an exception if puppet (or anything really) fails during the process14:34
Shrewshttp://paste.openstack.org/show/628626/14:34
clarkbok I think I actually know what this bug is14:35
clarkbI expect we'll be seeing a lot of this one because systemd14:35
* fungi shakes fist at systemd14:35
clarkbwell systemd and puppet. The problem (I think) is that we use sys V init scripts which systemd supports but you have to reload its config for it to find them14:35
clarkbpuppet does not reload this config for us automatically so we'll need to add some puppetry to do that14:35
clarkbI've done this before for zanata let me dig up that change14:36
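By hand the missing step is just a daemon-reload after the init script lands, which is what the added puppetry needs to trigger (unit name taken from the failure that shows up later in this log):

```
sudo systemctl daemon-reload
sudo systemctl status jenkins-log-worker-B   # the sysv-generated unit should now resolve
```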
dmsimardok, I'm back14:36
fungifor basically any service we're adding custom initscripts with i suppose14:36
clarkbfungi: ya14:36
* Shrews has to step away a sec. brb14:36
fungii guess the distro packages use maintscripts to register their initscripts with systemd14:36
dmsimardfungi: that was the thread for the sprint: http://lists.openstack.org/pipermail/openstack-infra/2017-November/005702.html14:36
*** mrhillsman has joined #openstack-sprint14:37
clarkbfungi: ya and puppet's excuse is that this is how you are suppoesd to use puppet14:37
fungidmsimard: ahh, back in november. no wonder i wasn't spotting it14:37
clarkbwe basically need to add the code that was removed in https://review.openstack.org/#/c/423369/3/manifests/wildfly.pp to the puppet for logstash workers14:39
clarkb(it was removed in ^ because an external dep solved the problem for us, but we don't have external deps for logstash workers like that so we'll carry it ourselves)14:39
clarkbdoes someone else want to work on that change or should I?14:39
clarkbShrews: frickler another note: by default launch-node.py will clean up after itself on failure, so you shouldn't need to do anything special here14:40
fungiand if you need it _not_ to clean up after itself, add --keep14:41
dmsimardclarkb: I can send a patch.14:41
clarkbdmsimard: cool I think you want to edit worker.pp in puppet-log_processor repo14:41
fungii'm starting on subunit-worker01 (to replace subunit-worker02) since i had actually started trying to boot it on xenial a month or so ago and then got sidetracked by other stuff14:42
fungiodds are i'll want to copy dmsimard's patch for that14:42
clarkbyup14:42
fungishould we patch tools/launch-node.py to switcn the default image to 'Ubuntu 16.04 LTS (Xenial Xerus) (PVHVM)' now?14:45
fungis/switcn/switch/14:45
clarkbfungi: probably a good idea to prevent regressions launching new servers or mistakes if we forget to specify the image14:45
fungipatch on the way then14:46
fungido we have a review topic we're using?14:46
dmsimardshould we use a topic for sprint patches ?14:46
dmsimardwow, fungi beat me to it :)14:47
fungiheh14:47
fungilet's use topic:xenial-upgrades14:47
dmsimardwfm14:47
dmsimardhttps://review.openstack.org/527109 is up for logprocessor14:47
fricklerianw started with topic infra-xenial already14:48
fungiahh, i'll adjust accordingly. now i see it in the notes section of the pad14:48
fungitotally missed it earlier14:48
clarkbdmsimard: +214:48
fricklercouple of patches that could be reviewed there already https://review.openstack.org/#/q/status:open+topic:infra-xenial14:49
dmsimardok /me switches topic14:49
ShrewsSo once that lands to the puppet-log_processor repo, do we need to update a repo on puppetmaster, or is that done automatically by the launch script?14:49
clarkbShrews: the puppet modules are updated by the ansible run puppet cron, which runs every 15 minutes but, due to how long it takes to get through, effectively runs every 45 minutes14:50
clarkbin this case I think we can go ahead and update the git repo early to speed up the process14:51
dmsimardfrickler: oh your comment on https://review.openstack.org/#/c/515279/ .. I remember writing a blog post exactly for stuff around those lines when 14.04 came out14:51
fungiheh, pabelanger already beat me to https://review.openstack.org/502856 "Bump default image to xenial to launch-node.py"14:52
fungiso we can already skip specifying --image14:52
dmsimardclarkb: it looks like https://review.openstack.org/#/c/515279/ would save us some trouble14:54
clarkbreading now14:55
Shrewsclarkb: so I'm not seeing another logstash-worker02 in the server list. i guess the process didn't get far enough to create it14:55
Shrewsor it automatically deleted it?14:56
*** jeblair has joined #openstack-sprint14:56
fungilikely the latter14:56
clarkbShrews: it automatically deleted it14:56
Shrewsmaybe i should just look at the code :)14:56
clarkblaunch-node tries to be helpful that way14:56
fungiif the launch fails for any reason then the script will delete the instance14:56
Shrewsyay14:56
fungiunless you specify --keep and then you can use the temporary root ssh key for that uuid in /tmp to log into it if you need to investigate it directly14:57
*** baoli has quit IRC14:57
clarkbfungi: are you willing to be second reviewer on https://review.openstack.org/#/c/527109/1 ?14:57
Shrewsfungi: *nod* thx14:57
*** baoli has joined #openstack-sprint15:00
*** baoli has quit IRC15:00
*** pabelanger has joined #openstack-sprint15:01
pabelangero/15:01
pabelangerrunning a little behind this morning15:01
pabelangerjust getting coffee and will start reviewing changes that are up15:02
fungisubunit-worker01 puppet-user[11998]: (/Stage[main]/Subunit2sql/Package[subunit2sql]/ensure) change from absent to latest failed: Could not update: Execution of '/usr/local/bin/pip install -q --upgrade subunit2sql' returned 1: Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-9bVQMS/netifaces/setup.py';f=getattr(tokenize, 'open',15:03
fungiopen)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-wuw6O6-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-9bVQMS/netifaces/15:03
fricklernot directly related but adding me to accessbot could also use another review https://review.openstack.org/52612515:03
fungilooks like building netifaces from sdist is failing when attempting to install subunit2sql15:03
*** baoli has joined #openstack-sprint15:04
clarkbfungi: I wonder if that is due to the same issue we had with virtualenv on the puppetmaster (using python3 instead of 2)15:04
fungiahh, right, this is the case where we want to override --upgrade-strategy for pip15:04
dmsimardfrickler: ah I guess I should do that too.15:04
fungiit's calling /usr/bin/python according to that message, so should be python 2.715:05
clarkboh ya15:05
* clarkb pops out for a few, brb15:06
fungii think it's just trying to upgrade to a later netifaces than the distro package because it sees that what's on pypi is newer (even though what's already installed is sufficient)15:06
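The pip behaviour under discussion: with the default (eager) upgrade strategy, pip rebuilds dependencies like netifaces from sdist even when the already-installed version satisfies the requirement, while only-if-needed leaves them alone:

```
# eager (the default at the time) tries to upgrade every dependency:
pip install -q --upgrade subunit2sql
# only-if-needed skips deps whose installed versions already satisfy requirements:
pip install -q --upgrade --upgrade-strategy only-if-needed subunit2sql
```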
pabelangercool, looks like logstash-workers have been started15:06
pabelangerI'm going to delete puppetdb.o.o and puppetdb01.o.o, and clean up system-config15:07
ShrewsALL: I have to step out for a chiropractor appointment now. I'll catch up on things when I return. Shouldn't be long.15:07
fungididn't we have a change up to do --upgrade-strategy=only-if-needed for pip in some module recently? codesearch isn't turning it up for me so maybe hasn't merged yet?15:09
clarkbfungi: yes, ara in puppet-zuul iirc15:10
fungithanks, finding15:10
clarkbdoesn't look like it merged though15:10
clarkbfungi: https://review.openstack.org/#/c/516740/ yup not merged yet15:11
pabelangerremote:   https://review.openstack.org/449167 Remove puppetdb / puppetboard server15:12
fungiannoying that gerrit message searches unconditionally replace hyphens with spaces so you can't search for strings containing hyphens15:12
pabelangerclarkb: fungi: any objections to deleting puppetdb^/puppetboard above? it is still precise15:12
pabelangererr15:12
pabelangeryah, precise15:13
clarkbfor some reason I thought that was already done so no objection from me15:13
fungipabelanger: by all means15:13
pabelangerokay, done15:17
pabelangerupdating etherpad15:17
dmsimardpabelanger, clarkb: that reminds me.. i'll take the opportunity of the sprint week to write the draft for continuous deployment dashboard to replace puppetboard15:20
clarkbwaiting on gating for the log processor fix was clearly a missed opportunity to make breakfast15:21
dmsimardyeah it hasn't passed check yet15:22
* dmsimard starts working on draft15:22
pabelangerdmsimard: great15:22
pabelangerokay, working on tripleo mirror now, going to ping them for a larger flavor. 100GB is the max listed right now15:23
dmsimardpabelanger: yeah good idea15:23
pabelangerI also think, we might be able to now move mirror-update.o.o into a zuulv3 job and periodic pipeline (may have to create)15:25
clarkbI'm worried that the log processor fix for centos 7 is out to lunch15:28
clarkber the centos7 job is15:28
clarkbwe may have to recheck it and if that happens I am making breakfast15:29
dmsimardclarkb: should we check out the patch locally ?15:31
dmsimardor wait it out ?15:31
clarkbI think we should wait it out, if the job doesn't go out to lunch it runs fairly quickly and this way we can't lose track of where we have or haven't fixed this particular systemd/xenial thing15:32
dmsimardack.15:32
clarkb(also its not an emergency)15:32
dmsimardindeed.15:33
jeblairpabelanger: let's not tackle mirror-update right now.  i think it will take some work, and just replacing the server will be easier.15:33
pabelangersure15:37
jeblairwhat data does grafana require be migrated?15:38
pabelangerjeblair: for AFS services, we can join the new (xenial) servers to the existing AFS cells right? Then after some sync process retire the original trusty based servers?15:38
jeblairpabelanger: depends on the servers -- can you be more specific15:39
pabelangerjeblair: sure, afsdb01/afsdb02 right now. Could we bring online afsdb03 and join the existing?15:40
pabelangerjeblair: I think we'd need to update puppet-grafana in system-config to working xenial, it is also possible we might need to patch grafyaml too. I think they changed some of the APIs in newer versions.15:41
jeblairpabelanger: yes -- i forget off the top of my head how to tell it to join, but we should be able to tell it to sync its data from the others, then remove them.15:41
pabelangerokay cool15:42
jeblairpabelanger: okay, that's not a data migration though...15:42
jeblairthere's a lot of servers under "these servers require data to be migrated" which i don't think require data to be migrated15:43
pabelangerYah, I might have put it there by mistake. We shouldn't need any data because of grafyaml15:43
*** baoli has quit IRC15:45
*** baoli has joined #openstack-sprint15:45
clarkbdmsimard: frickler Shrews ok if someone can recheck that change when the centos7 job finally times out I am going to make breakfast (there are plenty of other roots around now to answer questions, walk through process so feel free to ping them too)15:47
dmsimardk15:48
pabelangerI'm going to start on eavesdrop01.o.o replacement15:49
*** baoli has quit IRC15:50
pabelangerIIRC, we'll need to migrate the volume between servers15:50
pabelangerclarkb: mind a +3: https://review.openstack.org/449167/15:51
*** baoli has joined #openstack-sprint15:57
fungipabelanger: yeah, /dev/mapper/main-meetbot seems to be on a cinder volume15:58
pabelangeryah15:58
pabelangerremote:   https://review.openstack.org/527139 Update eavesdrop.o.o to support xenial15:58
pabelangerreworks eavesdrop.o.o to support numeric hosts15:59
pabelangerand ups our testing to start on xenial15:59
fricklerso I have a patch to make puppet-ethercalc work on xenial. question is: do we need to keep it backwards compatible for < xenial at the same time? or can we avoid a lot of extra code and target only xenial/systemd-based hosts?16:03
clarkblooks like the log processor fix is queuing the centos7 job again so we may not need a recheck after all16:08
clarkbfrickler: I think it best to keep support for both16:08
clarkbfrickler: makes the upgrade process (replacing servers) a little simpler16:09
fricklerclarkb: hmm, I just submitted the xenial-only version, will update later: https://review.openstack.org/527144 Update to work on Ubuntu Xenial or newer16:09
clarkbpabelanger: done16:13
pabelangerclarkb: frickler: Yah, that one doesn't look too bad to support both. for ethercalc16:13
pabelangerclarkb: danke16:13
*** baoli has quit IRC16:15
pabelangerI'm going to run into town for a quick errand / lunch. But have 2 servers in my name16:16
pabelangerI also added a 'Bug Fixes' section to https://etherpad.openstack.org/p/infra-sprint-xenial-upgrades so we can quickly identify things we need to merge16:16
pabelangerwe should also pick a topic to make it easier, if somebody wants to do so16:16
pabelangershould be back in 45mins16:16
clarkbthere is a topic for the sprint already, not sure if we need another for bugfixes?16:17
dmsimardlet's use the same topic ?16:22
dmsimardwe have infra-xenial right now16:22
*** baoli has joined #openstack-sprint16:27
*** baoli has quit IRC16:29
*** baoli has joined #openstack-sprint16:29
* jeblair looks up grafana stats on cacti16:33
jeblairgrafana has like no cpu or memory usage.  i think we can shrink the flavor16:33
jeblairthe 1 year max used ram is 771M (!)16:34
clarkb++ to shrinking flavor16:35
jeblairthe load average is 0.01616:35
jeblair1 year max16:35
jeblair2G then?16:35
clarkbwhat is it now?16:36
clarkbbut ya thats double max ram usage which seems like safe overhead16:37
jeblair8G16:37
clarkb2G sounds good to me16:37
jeblairhttp://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=2715&rra_id=4&view_type=&graph_start=1479957086&graph_end=151301027016:38
jeblairwe will need to reboot it at least once every 2 years i think.16:38
fricklerpabelanger: clarkb: next question then, may also affect other upgrades: do we want to continue piping the output to log files like here for backwards compatibility? or can we use native systemd/journald log handling? http://git.openstack.org/cgit/openstack-infra/puppet-ethercalc/tree/templates/upstart.erb#n2616:39
clarkbfrickler: I'm good with journald, the one thing we should check on that before we commit to it though is whether or not journald is logging persistently on our nodes16:40
clarkbshould get infra-root's larger opinion too16:40
* Shrews catches up16:42
fungias long as stuff gets logged *somewhere* and i can find it, i'm fine16:42
clarkbit doesn't look like journald is currently logging persistently on ubuntu fwiw16:42
clarkbwe can address that though16:43
jeblairme too.  that hasn't been my experience with journald to date, but if someone's willing to go on a limb and guarantee that, i'm fine with it.  :)16:43
clarkbShrews: zuul had a burp on one job for the systemd fix so we are still waiting for that to merge, but its close now16:44
Shrewsah, then i haven't missed much fun16:45
Shrewslooks like it just merged17:00
fricklerso https://review.openstack.org/527109 merged, do we need to update puppetmaster or is there a cron?17:00
clarkbfrickler: there is a cron but its a "slow" one. The cron that updates those puppet modules is our main run puppet with ansible cron job17:00
clarkbfrickler: Shrews dmsimard you can check where that cron is by looking at /var/log/puppet_run_all.log on the puppetmaster17:01
clarkbit looks like it just started at 1700UTC which I think means it will have just updated the module for us17:02
clarkbfrickler: Shrews dmsimard you can confirm this by running git log at puppetmaster:/etc/puppet/modules/log_processor17:02
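Those two checks as commands on the puppetmaster:

```
sudo tail -n 50 /var/log/puppet_run_all.log                  # see when the run_all cron last started
git -C /etc/puppet/modules/log_processor log -3 --oneline    # confirm the merged fix is present
```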
clarkbonce you've poked at those items and have convinced yourselves that I actually did check before making those claims :) I think we can go ahead and try the new instance boot again17:03
Shrewsyuppers17:03
Shrewsi used the wait time to setup my tmux properly17:04
* clarkb migrates into the office now that the kids are awake17:06
Shrewsis this something to be concerned about? http://paste.openstack.org/show/628639/17:11
Shrewsit seems we progressed past that, but happened to notice it in the output17:11
clarkbpabelanger: I think ^ may be related to your host removal work17:11
clarkbShrews: my guess is that the host deletions pabelanger has been doing have resulted in some groups defined that don't match any instances17:12
clarkbpabelanger: is that something you can look into?17:12
clarkbif that is the cause then I don't think we need to worry about it17:12
frickleranyway it failed again for me, will retry with --keep for better debugging, not sure about the failure from the log17:14
clarkbfrickler: can you share the log?17:14
jeblair17:12 < openstackgerrit> James E. Blair proposed openstack-infra/system-config master: Support xenial on health  https://review.openstack.org/52716917:14
jeblair17:14 < openstackgerrit> James E. Blair proposed openstack-infra/system-config master: Support xenial on stackalytics  https://review.openstack.org/52717117:14
jeblairsince the first step is to update the node selector and the node-os comment in site.pp, and then wait for that to gate, is there any reason we shouldn't do a bunch of those ahead of time ^ ?17:15
clarkbjeblair: probably not, just split them up so that failures can be debugged individually17:15
jeblairclarkb: ya, i've pushed up 3 all based on tip so far17:16
fricklerhttp://paste.openstack.org/show/628641/ is the tail of it, neglected to tee all of it17:16
clarkbfrickler: ya may need --keep or a bigger screen buffer to see why puppet is unhappy17:18
clarkbsome dependency for logrotate failed looks like17:19
fricklerya, need to amend my tmux settings to have more scrollback and searching17:19
clarkbmaybe its a new package name or different dir path for that config?17:19
jeblairjust so we're really clear, i'm pushing up a bunch of changes, but i don't plan on doing all these servers, i'm just trying to save time so that the initial step (with a bunch of waiting) is already done.  please grab/update/abandon my changes as needed as you work on servers.17:20
fricklerclarkb: /tmp/launch-log on puppetmaster is the complete log now, instance is kept for checking17:25
clarkbfrickler: it's looking like the reload for systemctl isn't finding the sys v compat scripts? maybe permissions or something is wrong with them?17:27
clarkbfrickler: running the systemctl reload in the foreground may have more details? possibly also list-units?17:28
clarkbI need to pop out again to help with kids now that they are awake. Back in a bit. Look forward to seeing what you find out17:30
fricklerclarkb: http://paste.openstack.org/show/628645/ looking deeper into the service definitions now17:32
Shrewsi know less about puppet than anybody, but there is this in that log: | Dec 11 17:20:36 logstash-worker01 puppet-agent[10308]: Could not run: SIGTERM17:32
clarkbShrews: that is expected since we are puppet apply only I think. That happened as a result of the puppet agent stop I think17:34
dmsimardfrickler: the daemon reload isn't working17:35
dmsimardfrickler: (/Stage[main]/Openstack_project::Logstash_worker/Log_processor::Worker[B]/Service[jenkins-log-worker-B]/enable) change from false to true failed: Could not enable jenkins-log-worker-B:17:35
fricklerya, fix upcoming17:35
jeblaircould folks +3 https://review.openstack.org/527168 please?17:36
dmsimardjeblair: do we actually have different grafana numbered nodes ?17:36
jeblairdmsimard: not yet -- we're transitioning all of the hosts to numbered hosts so it's easier to replace them17:37
dmsimardjeblair: makes sense17:37
jeblairdmsimard: so the replacement for grafana.o.o will be grafana01.o.o, with a cname in dns17:37
dmsimardjeblair: in any case, that pattern should match numbered or not17:37
jeblairyep17:37
fricklerclarkb: dmsimard: that fixed it on my node: https://review.openstack.org/527193 Fix multiple workers for systemd17:37
jeblairi'm using \d* so we continue to have puppet operate on the current host17:38
dmsimardfrickler: makes sense17:38
jeblairdmsimard, frickler: not sure if you're aware -- the node-os comment is read by the infra apply jobs, so adding that xenial comment causes those jobs to run, and we verify that at least puppet apply -noop works on that os.17:39
dmsimardI wasn't aware those comments were actually important, thanks for that17:40
* frickler needs a break now, will take another look later17:40
jeblairit looks like i got 36% of the way through site.pp updating the node matchers and os comments.  i'm going to stop there and leave more for others to do.  :)17:41
dmsimardjeblair: have a comment on https://review.openstack.org/#/c/527172/17:45
dmsimardquestion came up when I was looking at https://review.openstack.org/#/c/527186/1/manifests/site.pp with the files group left intact17:45
clarkbfrickler: back and reviewing your fix as well as jeblairs now17:53
clarkboh frickler is taking a break, I have a comment on the fix I'll just update the patch17:54
clarkbdmsimard: no patchset on https://review.openstack.org/527193 can you rereview? jeblair care to review as well?17:55
dmsimardclarkb: ah I guess frickler's patch was working although it was a little bit uglier with two dashes17:56
clarkbdmsimard: ya and may have confused systemd slightly depending on how important that name is17:56
clarkbfigure better to just get it matching the name used elsewhere and not worry about it17:56
* dmsimard nods17:57
jeblairdmsimard: good catch thanks18:01
*** baoli has quit IRC18:01
*** baoli_ has joined #openstack-sprint18:04
clarkbI'm just going to approve all those changes without check results as long as my eyeballs don't catch anything wrong with them. Then if tests do fail we can sort them out (otherwise there is just too much state to track)18:05
clarkbits unfortunate that our puppet apply --noop testing won't catch the systemd reload issue though18:06
fungii'm looking at the implementation of that in puppet-zuul18:09
fungilooks like there's a manifests/systemd_reload.pp classfile implementing it18:09
fungiwhich gets called out as a require line in services18:09
fungibut then there's also what looks like basically a duplicate implementation of it in manifests/executor.pp18:10
clarkbfungi: that would be one way to do it. The tricky thing is requiring something that won't necessarily be in place on all platforms (but hiding it in a class of its own is one way to do that)18:10
fungiam i right in thinking that's redundant?18:10
clarkboh ya if there is something else doing it then it probably is redundant /me looks18:10
fungior is it serving some subtle purpose i'm not picking up?18:11
clarkbit looks redundant to me as well, but maybe there is an ordering issue that isn't immediately apparent that that works around18:12
*** baoli has joined #openstack-sprint18:46
*** baoli_ has quit IRC18:47
clarkbfor anyone wondering why it got quiet all of a sudden we are mostly just waiting on CI to finish and changes to merge at this point (lots of demand in zuul right now)18:54
pabelangerand back18:57
pabelangercatching up on backscroll18:57
clarkbthe log_processor fix has finally started jobs19:05
clarkbhopeflly will be in gate in the not too distant future then shrews and dmsimard (and frickler if still around) can give it another go.19:05
fungii have a couple of puppet-subunit2sql changes proposed to help me build the replacement worker19:12
clarkbI'll do another round of reviews shortly19:13
pabelangerremote:   https://review.openstack.org/526194 Remove zuulv2 long lived servers19:18
pabelangercould use another +3 on^ had to rebase19:18
pabelangerclarkb: Shrews: is the pastebin from above on expand-groups.sh still an issue?19:18
dmsimardpabelanger: I believe so19:22
pabelangerk, lets land 526194, then delete ansible-inventory cache, since we've deleted some servers19:23
pabelangerokay, tripleo has bumped the flavor for mirror to 150GB19:27
pabelangeruploading xenial cloud image to tripleo-test-cloud-rh1 now19:27
clarkbfungi: did you see https://review.openstack.org/#/c/527193/ ? you may need similar for subunit2sql19:33
fungiclarkb: oh, thanks! i missed that. will update my open change if it's not merged yet19:36
fungiadded19:39
pabelangerokay, mirror01.regionone.tripleo-test-cloud-rh1.openstack.org launched properly19:40
pabelangersetting up DNS now19:40
pabelangerhttp://mirror01.regionone.tripleo-test-cloud-rh1.openstack.org/19:43
pabelangereverything seems okay19:43
pabelangerI'm going to redirect mirror.regionone to mirror01.regionone now19:44
pabelangerDNS updated, waiting to confirm it correct19:49
clarkbpabelanger: remember to use hour-long TTLs on those records (to avoid DNS requerying)19:50
pabelangerclarkb: Yup! confirmed at 60min19:50
pabelangerand cname is working19:50
pabelangerwill accept ssh hostkey on puppetmaster19:51
pabelangerremote:   https://review.openstack.org/507266 Comment out server in puppet.conf19:54
pabelangerI believe that will stop puppet from hanging for 2mins when we boot new servers19:55
clarkbpabelanger: will puppet apply do that?19:55
clarkbseems like that should be a noop19:55
clarkbespecially now that ianw's change to stop the agent is in19:56
ianwoh good19:56
clarkbianw: good morning19:56
ianwsorry, just catching up with reviews etc19:56
ianwmorning!19:57
pabelangerclarkb: I think it is a race condition: we install puppet with install_puppet.sh, but the server boots and puppet-agent tries to connect to the puppetmaster while puppet apply is running. So I think it might be too late20:01
pabelangeralso, trying to see the change ian did20:01
clarkbpabelanger: ya but ianw's patch explicitly stop puppet agent20:02
clarkband puppet apply shouldn't talk to a server aiui20:02
pabelangerclarkb: I don't think it worked, cause It still happened when I tried bringing tripleo mirror online20:02
pabelangerlet me see which system-config I had20:03
pabelanger | Dec 11 19:36:28 mirror01 puppet-agent[4061]: Could not request certificate: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)20:03
clarkbya I'm thinking your system-config may not have been up to date? the change just merged a few hours ago20:03
pabelangeralso, I see HTTP requests to new tripleo mirror now20:03
clarkbShrews: the fix for log_processor appears to be about to merge, will you be able to give that another shot in a few minutes?20:04
Shrewsclarkb: yeah20:04
Shrewsgetting frustrated with sockets so could use a diversion20:04
ianwi dropped a comment ... so the package just assumes that there's a resolvable remote host called "puppet" ?20:04
clarkbianw: ya thats puppets default behaviopr20:05
pabelangeryup20:05
pabelangerI think I had ianw commit when I ran it just now20:05
ianwok, TIL :)20:06
pabelangerbut, will know in a moment when I try to launch next server20:06
pabelangerserver=puppet20:06
pabelangerthat is what the default is in puppet.conf for us20:07
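What change 507266 boils down to on a freshly launched node, sketched by hand (the conf path is the standard puppet 3 location; the actual change presumably lands this via our install tooling rather than running sed on the host):

```
sudo sed -i 's/^server=puppet/#server=puppet/' /etc/puppet/puppet.conf
sudo systemctl disable --now puppet   # keep the agent from retrying the nonexistent "puppet" host
```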
pabelangerat one point I think we managed it on server boot20:07
clarkbis the problem that we can't stop/disable the service until after we've already started it and sent it off trying?20:07
pabelanger | + systemctl disable puppet20:08
pabelangerokay I see that in my console20:08
pabelanger | Executing /lib/systemd/systemd-sysv-install disable puppet20:08
pabelanger | Dec 11 19:32:26 mirror01 systemd[1]: Started Puppet agent.20:09
pabelangerso, something started it again20:09
pabelangerthen20:09
pabelanger | Dec 11 19:32:28 mirror01 puppet-agent[4061]: Could not request certificate: Failed to open TCP connection to puppet:8140 (getaddrinfo: Name or service not known)20:09
clarkbhuh20:11
clarkbShrews: arg zuul just put the trusty job back to queuing20:12
clarkbI'm worried that infracloud networking is falling over with nodepool running at full capacity20:12
Shrewsclarkb: did we put your nodepool fix in?20:12
Shrewsclarkb: if we didn't, we could be hitting that again20:13
clarkbShrews: I don't think so. that reminds me, I want to say tobias had comments for me to address and I completely forgot with the sprint stuff this morning20:13
* Shrews checks nodepool20:13
clarkblooks like the comments are more along the lines of "this is weird and test doesn't do a good job reproducing but dunno what is going on yet"20:14
Shrewsclarkb: hrm, only 1 ready&locked node, so unlikely we're hitting the issue you found20:15
Shrewsjust busy20:15
clarkbShrews: ya I'm thinking the networking in hpcloud just can't handle the demand and is dropping connections20:16
pabelangerokay, moving on to eavesdrop01.o.o20:23
pabelanger | Dec 11 20:30:35 eavesdrop01 puppet-user[11951]: Could not find data item openstack_meetbot_password in any Hiera data file and no default supplied at /opt/system-config/production/manifests/site.pp:347 on node eavesdrop01.openstack.org20:32
pabelangerhow did we handle hiera data for numeric hosts again?20:32
pabelangerdid we just move them into a group20:32
clarkbpabelanger: yes that is what I have been doing with eg translate20:32
pabelangerokay, wanted to confirm20:33
pabelangerI'll send a patch for eavesdrop here shortly20:33
clarkbShrews: ok fix for log_processor merged20:34
clarkbShrews: puppetmaster:/var/log/puppet_run_all.log says that the ansible puppet cron is currently running so we can either wait for it to finish or just manually update the puppet module on the puppet master20:35
clarkbShrews: if you are able to give the node launch another go right now I can walk through updating the puppet module20:35
clarkbdmsimard: ^you too20:35
dmsimardyeah will give a try after extinguishing a fire20:36
clarkb(I expect at this point frickler has called it a day)20:36
Shrewsclarkb: waiting for the puppet repo to update20:38
Shrewsclarkb: oh, that's what you want to walk me thru20:38
Shrews:)20:38
Shrewsyeah, i'm ready20:38
clarkbShrews: cool, so the module is at /etc/puppet/modules/log_processor20:40
ianwwith something like 527144 ... do we care about effectively dropping trusty support?  should we put a tag in before merging maybe?20:40
ianwthat's puppet-ethercalc btw, moving from an upstart file to a .service file20:40
clarkbShrews: as root you will want to do a `git remote update` to fetch latest changes then `git checkout origin/master` it might be `git checkout origin master` I can never remember where git wants the /20:41
clarkbShrews: however the cron will update it in 3 minutes if you want to wait (and avoid conflicts though git should sort those for us in this case)20:41
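For reference, the manual update looks like this (the detached HEAD it leaves behind is expected, as discussed just below):

```
cd /etc/puppet/modules/log_processor
sudo git remote update
sudo git checkout origin/master   # detached HEAD here is fine; we track reviewed upstream states
```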
clarkbianw: ya frickler had asked about that and I had asked to keep trusty support for now. Simplifies the transition/upgrade too20:42
Shrewsclarkb: those commands put me in a detached HEAD state. is that the norm?20:42
clarkbShrews: yes20:43
clarkbShrews: rather than try and curate a local branch we just checkout upstream states20:43
clarkbShrews: its easier this way when you rely on code review to specify a state20:43
Shrewsclarkb: that's done then20:43
clarkbcool I think you can give the launch node script another go then20:43
Shrewsif i could get my copy-pasta fixed20:48
Shrewsk. kicked off20:49
Shrewsfwiw, launch-node.py does not play nicely with tee20:50
clarkbis it writing to stderr?20:50
Shrewsi guess?20:50
clarkbI wonder if that is because that is how ansible does it?20:50
clarkbmordred or dmsimard may know20:50
dmsimardShrews: launch-node.py 2>&1 | tee -a file.out ?20:51
dmsimardor maybe the PYTHONUNBUFFERED thing20:51
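Combining both suggestions into one invocation (PYTHONUNBUFFERED is the standard python env var; the redirect must come before the pipe, and the log file name is just an example):

```
PYTHONUNBUFFERED=1 ./launch-node.py <args> 2>&1 | tee -a launch.log
```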
Shrewsi'll just depend on my tmux buffer20:52
dmsimardif we were really motivated, we could do, like, launch-node.py | systemd-cat -t launch-node20:55
dmsimardthat sends the output straight to the journal and then you can do, like, journalctl -u launch-node20:55
clarkbdmsimard: currently no journald on that node20:55
clarkbfungi: https://review.openstack.org/#/c/527203/2 failed testing20:57
fungigrar20:58
* dmsimard pictures fungi growling20:58
fungiit's not a particularly intimidating growl20:58
*** jkilpatr has quit IRC20:59
Shrewsclarkb: looks like we can haz new node21:00
clarkbShrews: yay21:00
clarkbShrews: ok now don't immediately do the dns stuff yet21:00
clarkbbecause dns is a pita we should probably talk about it a little21:01
clarkbdmsimard: maybe you want to get to the point where you have a launched logstash worker too and we can go through that together?21:01
fungicue rant about proprietary dns hosting api21:01
dmsimardclarkb: fire almost extinguished21:01
clarkbdmsimard: cool21:01
Shrewsaaaaaaaand go go gadget rant21:01
clarkbShrews: does it work if I grab lunch and dmsimard gets a node launched before we dig into the next step?21:02
dmsimardfor context, I don't think I've mentioned this before but I'm basically infra-root for RDO's infrastructure21:02
dmsimardso from time to time there's those fires :)21:02
Shrewsclarkb: yes. i will task switch back to the finger gateway, but we do have the zuul meeting in an hour21:02
jeblairi'm back from lunch if needed here21:03
clarkbShrews: oh right zuul meeting21:04
clarkbShrews: we can also just go through the dns stuff and take the pressure off getting everything done in that time21:04
clarkbits not the end of the world to go through it multiple times21:04
clarkbShrews: so the deal with DNS is its hosted by rackspace and they use a proprietary client and service for managing it21:05
clarkbShrews: this works reasonably well for when you are just adding a new host (and not replacing an existing one) because adding records is super easy21:05
clarkbShrews: the problems largely lie in removing old records safely because there is no version control like you get with gandi and other services21:05
clarkbShrews: and since we share the openstack.org domain with the foundation we have had cases of stepping on each others toes in the past :/21:06
clarkbShrews: in this case of replacing an instance my preferred method is to use the command line client to update only the reverse PTR records, then log in to the web ui and delete the old A and AAAA records and add new ones21:07
clarkbthis means we'll only run half of the commands printed out by the launch script (2/4 that update the reverse ptr records)21:07
clarkbfungi: jeblair do you recall if the reverse ptr records are the first two commands or the second two? I think they are the first two21:08
jeblairthey are the first21:08
Shrewsso command line for one direction resolution, gui for the other21:09
jeblairexample: http://paste.openstack.org/show/628658/21:09
clarkbShrews: correct21:09
clarkbShrews: so you can go ahead and run the commands above line 15 in jeblair's example (but use the commands that were printed out for your launch invocation)21:10
jeblairwe have some (a lot of) hiera data assigned by fqdn.  i'm guessing that as we transition nodes to numbered, we're going to need to move those to groups, yeah?21:11
clarkbjeblair: yup, pabelanger ran into that with eavesdrop and I did with translate*. Making a copy of the hiera data in a group is what I did for translate21:12
clarkbthen once things are transitioned we can remove the fqdn specific data21:12
fungiclarkb: the entries with ip addresses are the address records, then entries with server uuids are the reverse ptrs21:12
fungii don't recall what order they wind up in21:12
jeblairclarkb: i can never remember how our split group system works.  what do i need to do to make a grafana group and add grafana01 to it?21:12
clarkbjeblair: in the site.pp add group = grafana line like the other examples in there21:13
jeblairthat's the only thing?21:13
clarkbjeblair: then we need to update the ansible group file that I can never remember the path to /me finds it21:13
jeblairyeah, that's the thing i was worried about :)21:13
clarkbjeblair: openstack-infra/system-config/modules/openstack_project/files/puppetmaster/groups.txt21:14
Shrewsclarkb: done21:14
clarkbShrews: ok next step is the fun step21:14
* clarkb actually goes through process with shrews to figure it out21:15
Shrewsyou mean the fun doesn't stop there?????21:15
Shrews:)21:15
clarkbShrews: go to https://www.rackspace.com/login then click on cloud control panel login21:15
clarkbShrews: username and password can be found in the file being sourced on line 16 in jeblairs example21:16
funginext, attempt to extrude your brain matter through a colander21:16
clarkbShrews: once there click on Networking -> Cloud DNS21:17
clarkbthen click on openstack.org21:17
fungibecause, you know, dns is totally a network thing21:17
clarkbNow my favorite part of this whole process: it doesn't load all of the records for you to search at once, so you want to scroll that scroll bar until it's done loading all the things it can load21:18
* fungi wonders why they don't also put database services under the "storage" menu21:18
jeblairremote:   https://review.openstack.org/527245 Create a grafana group21:18
jeblairclarkb: can you ^ pls?21:18
clarkbjeblair: yup21:18
clarkbShrews: let me know when you get there21:18
Shrewsclarkb: there, and see logstash-worker0221:19
pabelangerremote:   https://review.openstack.org/527246 Add eavesdrop into groups.txt21:19
fungiyeah, i basically scroll as far down as it will go, then do that again, and again, and again... until it stops letting me do it any longer or i get distracted and go do something else21:19
pabelangerclarkb: jeblair: also^21:19
fungiShrews: there will be two, one for ipv4 and one for ipv6... and they won't be even remotely adjacent in the ui21:19
Shrewsoh21:20
clarkbpabelanger: you have two different regexes in use fwiw21:20
jeblairpabelanger, clarkb: i used \d* and pabelanger used \d+.  which is better?21:20
fungiwhich is why once you've gotten it to load all the paginated chunks of the set, you can then use in-browser keyword searching to find them all21:20
clarkbjeblair: I think you got yours right because it matches the node spec in site.pp21:20
jeblairi mean, specifically, because i don't understand the group system, i don't know if things will break if they are different21:20
clarkbpabelanger should update his change to use * in groups.txt I think21:20
pabelangerah, I copypasted another21:20
pabelangerlet me fix21:20
Shrewsah yes. i see both A and AAAA entries21:20
jeblairclarkb: sounds like you are inclined to think they may break -- ie, puppet will expect a group to be present that ansible won't have placed on the filesystem, unless they match?21:21
clarkbjeblair: I don't think they will break but the old server will continue to fail to find the group it thinks it is in and fall back to the fqdn system instead until it is gone21:21
clarkbShrews: cool now you can use browser search to find logstash-worker02 (I think you were 02)21:21
jeblairclarkb: that sounds reasonable too.  hrm21:21
pabelangerokay, updated21:22
pabelangerremote:   https://review.openstack.org/527246 Add eavesdrop into groups.txt21:22
Shrewsclarkb: yup21:22
clarkbShrews: then click the little gear next to the records name and click modify record21:22
clarkbShrews: then, if modifying the AAAA record, replace the ipv6 address with the one launch printed out, or the ipv4 address if modifying the A record21:22
clarkbShrews: and do that for both the A and AAAA records21:22
clarkbpabelanger: approved21:23
*** jkilpatr has joined #openstack-sprint21:24
Shrewsclarkb: done21:24
clarkbShrews: then you can `dig +short logstash-worker02.openstack.org` and `dig +short AAAA logstash-worker02.openstack.org` to see when the records update21:25
clarkbonce that happens there is one last step we have for the lgostash workers which is updating the firewalls to accept the new host and making sure services on new host are functioning21:26
Shrewsgroovy. i can dig it21:26
Shrewsfar out21:26
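(A minimal sketch of watching for those records to flip over, using the same dig commands; the 30-second interval is arbitrary, ctrl-c once both show the new addresses:)
    # poll the forward records until they show the new addresses
    while true; do
        dig +short logstash-worker02.openstack.org
        dig +short AAAA logstash-worker02.openstack.org
        sleep 30
    done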
pabelangerokay, I'm going to delete the old mirror in triple-test-cloud-rh1, I don't see any traffic in apache logs for 45mins now21:27
* Shrews speaks fungi language21:27
clarkbpabelanger: sounds good21:28
fungignarly21:29
Shrewspfft, that's 2 decades beyond21:29
jeblairi need a translator21:29
dmsimardok fire extinguished21:30
dmsimardgoing through a logstashworker now.21:30
pabelangerand deleted21:30
Shrewsclarkb: anyhoo, dig seems to be immediately returning the correct things (from multiple places)21:30
clarkbShrews: awesome, so now some logstash-worker specific things. We use unauthenticated connectivity to gearman (which could be changed) and to elasticsearch (which can't be changed without paying them money or writing our own auth plugin for es)21:31
pabelangerianw: clarkb: so, what are we thinking on https://review.openstack.org/507266/ (puppet DNS error on server boot)21:31
fungiclarkb: there are also two other steps... updating the ssh host key cached by root on puppetmaster, and truncating the ansible inventory cache21:32
pabelangerclarkb: Shrews: we'll also need to restart firewalls, to pick up new IP addresses21:32
clarkbShrews: this means we have to kick the firewall on logstash.openstack.org (where gearman server runs) and elasticsearch[2-7].openstack.org where elasticsearch runs to have it pick up the new IPs based on name21:32
clarkbfungi: oh right21:32
fungisteps which i frequently forget21:32
pabelangeri think the last time we changed out logstash workers I wrote an ansible-playbook to restart firewalls, I think I added it to system-config21:33
clarkbShrews: the way to restart the firewall on those nodes is to run `service restart iptables-persistent`21:33
ianwpabelanger: my only thought is that it's quite untested on everything other than xenial?21:33
clarkbfungi: doesn't launch node automatically truncate the cache file now?21:33
clarkbfungi: I think it may, but the ssh key add will need to be done21:34
clarkbpabelanger: oh cool21:34
ianwpabelanger: maybe we should just limit it to that for now?21:34
pabelangerianw: sure, we can do it for xenial, then add it to others21:34
fungiclarkb: oh, maybe21:34
clarkbpabelanger: I don't see it, maybe it hasn't merged?21:34
pabelangerclarkb: yah, looking now21:35
clarkbShrews: anyways let me know once that is run on logstash.o.o and elasticsearch[2-7].o.o (can just ssh directly or figure out ansible)21:35
ianwpabelanger: although actually, the apply tests do run it21:35
ianwhttp://logs.openstack.org/66/507266/2/check/legacy-infra-puppet-apply-3-centos-7/5db1915/job-output.txt.gz#_2017-12-11_20_21_21_95069021:36
Shrewsclarkb: will do21:36
fungiclarkb: easiest way to be sure is to check whether the old instance continues to appear in the inventory cache file, i guess21:36
Shrewsclarkb: should these be done in any particular order?21:38
pabelangerclarkb: yah, I don't see it any more but it wasn't a big playbook. I can whip up a replacement if needed21:38
Shrewsclarkb: like logstash.o.o first, then the elasticsearch nodes?21:38
clarkbShrews: probably best if elasticsearch is done first as it's at the end of the data processing pipeline21:38
clarkbShrews: this way we don't try processing anything until the whole pipeline can talk21:38
clarkbfungi: ya there is code to make sure the inventory cache file is not out of date in launch script21:39
fungioh, good21:40
pabelangerianw: yah, your call. If you want only xenial, I can propose that.21:42
dmsimardI guess I'll go learn what the DNS stuff looks like while logstash-worker03 is installing.21:42
Shrewsclarkb: that should be 'service iptables-persistent restart', right?21:43
clarkbShrews: possibly, systemctl goes one way and service the other so I mix them up21:43
dmsimardShrews: oh, that's different from trusty to xenial21:43
clarkbdmsimard: ya21:43
dmsimardShrews: in xenial it's netfilter-persistent21:43
clarkbShrews: if your command works and mine doesn't then yours is correct :)21:44
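(A minimal sketch of doing that over plain ssh from an admin workstation or puppetmaster; the hostnames follow the [2-7] range mentioned above and the zero-padding is an assumption, adjust to match the real inventory:)
    # restart the persistent firewall on the hosts that allow the workers in
    # (these hosts are still trusty, so the service is iptables-persistent;
    # on xenial it would be netfilter-persistent)
    for host in logstash.openstack.org elasticsearch0{2..7}.openstack.org; do
        ssh "$host" sudo service iptables-persistent restart
    done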
dmsimardclarkb: where is the rackdns script ?21:46
clarkbyou mean where does things like rdns-create live?21:47
clarkbdmsimard: http://paste.openstack.org/show/628658/ is jeblairs example. It lives in the virtualenv that is sourced early in that21:47
dmsimardoh, root/rackdns-venv/21:47
Shrewsclarkb: those are done21:48
clarkbShrews: ok now we want to hop on the node itself and check the services are working, then we will swing around and do the thing fungi mentioned and remove the old instance21:49
clarkbShrews: there are 4 log worker processes that log in /var/log/logprocessor and one logstash jvm process that logs in /var/log/logstash21:49
clarkbShrews: if you tail the files in /var/log/logprocessor you should see it grabbing gearman jobs and pushing log files21:50
clarkblogstash on the other hand seems to make on-demand http connections to the elasticsearch servers, so as long as the process is running it should be fine21:50
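(A minimal sketch of those checks run on the new worker; the process name patterns in the grep are assumptions:)
    # confirm the four log worker processes and the logstash jvm are up
    ps -ef | grep -E 'logstash|logprocessor' | grep -v grep
    # watch the workers grab gearman jobs and push log files
    sudo tail -f /var/log/logprocessor/*
    # logstash itself logs less, but its log dir is worth a glance too
    sudo tail -n 50 /var/log/logstash/*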
dmsimardwhat was that about the firewall ? I think I need that too. /me reads backlog21:51
dmsimardgetting connection denied to gearman from the new worker21:51
clarkbunfortunately logstash doesn't log as well as I'd like21:51
Shrewsclarkb: yep, seeing that21:51
fungipretty ironic considering its name21:51
clarkbdmsimard: yup we use the dns names to set up firewall rules so you need to "restart" the iptables-persistent service once you are happy with the state of dns21:52
clarkbon logstash.o.o and elasticsearch[2-7].o.o21:52
fungior netfilter-persistent21:52
Shrewsthose machines are still trusty21:52
fungiahh, right-o21:52
clarkbShrews: so I think this node is happy21:52
dmsimardclarkb: hmm, so we need to change the DNS before the worker can connect to gearman ?21:52
clarkbdmsimard: correct21:52
dmsimardthus we can't really validate that it works21:52
dmsimardshould we perhaps use /etc/hosts ?21:53
dmsimardat least before changing the DNS to ensure it works21:53
fungiif we're worried about not being able to switch back and forth quickly enough, set a low ttl on the record21:53
clarkbfungi: ya that21:53
clarkbdmsimard: ^21:53
dmsimardTTLs are mostly a suggestion though21:53
dmsimardbut sure21:53
clarkbthis is also fairly specific to the logstash workers, of which we have many and which can be replaced at any time21:53
clarkbbecause elasticsearch is money grabbing for features21:54
dmsimardlol21:54
*** pabelanger_ has joined #openstack-sprint21:55
*** EmilienM_ has joined #openstack-sprint21:55
clarkbShrews: now before zuul meeting. As root on puppet master you need to ssh to logstash-worker02 and accept its ssh host key. This is so that ansible can ssh to it for puppetting21:55
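(A minimal sketch of that step; the assumption is that ansible connects as root, so the key needs to land in root's known_hosts on puppetmaster:)
    # run as root on puppetmaster: connect once so the new host key is
    # recorded in root's known_hosts (login as root@ is an assumption)
    ssh root@logstash-worker02.openstack.org hostname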
*** EmilienM has quit IRC21:56
*** pabelanger has quit IRC21:56
clarkbShrews: then for deleting the old instance, when we are happy with how the new one is functioning (seems fine to me so far)21:56
clarkbShrews: I like to do something like `openstack --os-cloud openstackci-rax --os-region DFW server show cf873928-122c-447b-ad24-d1e213d277f0` to confirm the uuid I think is the old instance is actually the old instance21:56
*** EmilienM_ is now known as EmilienM21:56
dmsimardTTL is already 300, short enough21:56
clarkbShrews: then I can change the 'show' in that command to 'delete' to delete it21:56
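(A minimal sketch of that verify-then-delete flow, reusing the UUID, cloud, and region from clarkb's example above:)
    # confirm the UUID really is the old instance before removing it
    openstack --os-cloud openstackci-rax --os-region DFW server show cf873928-122c-447b-ad24-d1e213d277f0
    # once satisfied, run the same command with 'delete' instead of 'show'
    openstack --os-cloud openstackci-rax --os-region DFW server delete cf873928-122c-447b-ad24-d1e213d277f0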
*** EmilienM has quit IRC21:56
*** EmilienM has joined #openstack-sprint21:56
Shrewsclarkb: known_hosts updated21:57
*** pabelanger_ is now known as pabelanger21:57
dmsimardwill it work if we do a rdns create/record create on a record that already exists ?21:58
clarkbdmsimard: sort of21:58
Shrewsclarkb: old server deleted. many thx for the guidance21:58
dmsimardclarkb: heh, okay, let's see.21:58
fungiyeah, having the same reverse dns for multiple systems is perfectly fine21:58
clarkbdmsimard: I walked Shrews through it above; for the reverse dns you can run the commands that launch spat out. So basically everything above line 15 in jeblair's example21:58
clarkbdmsimard: but when replacing a server it is easier to update the forward A and AAAA records through the gui21:59
Shrewsetherpad updated. now meeting21:59
clarkbdmsimard: otherwise you get a round robin between the instances21:59
clarkbShrews: thanks!21:59
dmsimardclarkb: yeah I've seen that, but for an existing node I'd tend to do a delete before the create -- or there is a record modify command, but not a rdns modify.21:59
pabelangerIIRC, rdns won't update, but will create a 2nd DNS entry21:59
Shrewsclarkb: oh, updating ansible inventory cache?21:59
clarkbShrews: launch handled that for us, built-in features21:59
Shrewscool cool cool22:00
clarkbdmsimard: ya rdns is specific to the IP address22:00
clarkbdmsimard: and the other rdns record gets removed when you delete the old instance22:00
clarkbdmsimard: whereas A and AAAA are specific to the name22:00
pabelangersorry, record-create will not update22:00
jeblairit's zuul meeting time in #openstack-meeting-alt22:00
clarkbdmsimard: so its an artifact of how DNS + rax dns service operate22:00
fungidmsimard: problem is you need to know the "record id" for it, which you can only get from the api, but the api refuses to return more than 100 records i think, and has no pagination, so you usually can't get the info you need to delete or modify a record via the api22:00
dmsimardbah22:01
dmsimardjeblair has not written a raxtty yet? :D22:01
fungifor the a/aaaa records22:01
fungii doubt jeblair has any interest in writing a client for a proprietary api22:01
dmsimardit was mostly a joke, but indeed22:01
clarkbdmsimard: so your general process here is to run the commands for reverse dns, then ignore the forward dns commands. Switch over to the rax gui using the steps I described above for Shrews and modify the A and AAAA records to point at the new IP addresses22:02
clarkbdmsimard: then once dig reports new addrs "restart" the iptables-persistent service on the nodes that firewall things (logstash.o.o and elasticsearch[2-7].o.o)22:02
dmsimardyup, I'll figure it out and report back if I have issues22:02
clarkband be very careful when modifying openstack.org records as there is no revision control and it is a shared resource :/22:03
clarkbDNS is basically the least optimal part of this whole process22:03
clarkbdmsimard: also totally happy to walk you through it step by step like I did with shrews after the zuul meeting if you like22:21
*** baoli has quit IRC22:28
*** baoli has joined #openstack-sprint22:29
*** baoli has quit IRC22:33
*** larainema has quit IRC22:45
dmsimardclarkb: DNS updated so I'll check every once in a while. Someone mentioned there was an ansible inventory somewhere ?22:52
clarkbdmsimard: there is, it is what the ansible that runs puppet uses to know what to puppet, but the launch node script automatically updates that for you so you should be fine22:53
dmsimardclarkb: oh, it was mostly to do like ansible -i inventory -m command "dig ..." :)22:54
clarkboh that, I think the default inventory will work22:54
clarkbbut default inventory has every control plane host in it so be careful22:54
dmsimardyeah, but where is it ?22:55
dmsimardoh, /etc/ansible/hosts, got it22:55
clarkbdmsimard: /etc/ansible/hosts/openstack it uses the openstack dynamic inventory thing22:55
clarkb(with a cache file that is the thing that launch-node.py updates)22:55
dmsimardansible -i /etc/ansible/hosts/openstack logstash.openstack.org,elasticsearch* --list-hosts <-- does what I wanted22:59
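(Extending that into the actual ad-hoc run is a small step; a sketch, with the command module arguments as an assumption:)
    # run the dig check across the firewall hosts from the dynamic inventory
    ansible -i /etc/ansible/hosts/openstack 'logstash.openstack.org,elasticsearch*' \
        -m command -a 'dig +short logstash-worker03.openstack.org'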
dmsimardclarkb: new logstash-worker03 is processing things \o/23:02
clarkbdmsimard: woot23:02
dmsimardso delete the old one and done ?23:03
clarkbdmsimard: so the last two steps are to make sure root accepts the host key for the new host on puppetmaster (just ssh to the host and accept it if it looks good), then delete the old one23:03
clarkbdmsimard: like I told Shrews, I like to use openstack server show $uuid to check that the uuid I have is the right one, then change show to delete to delete it23:03
*** jesusaur has quit IRC23:03
clarkbdmsimard: and you have to use uuid in this case because there are duplicate matching names23:03
dmsimardyeah23:04
dmsimardI always use UUIDs anyway, even for flavors and images23:04
dmsimardname matching is nice but..23:04
pabelangerheads up, I'm modifying /etc/puppet/hieradata for eavesdrop0123:04
pabelangertesting out, then will commit changes23:04
clarkbdmsimard: looks like 582c3ddf-a669-4c2b-bdd3-87a5ca088d0f in this case23:05
dmsimardyeah23:05
dmsimard582c3ddf-a669-4c2b-bdd3-87a5ca088d0f is deleted \o/23:06
pabelangercool23:06
dmsimardok, that was easy enough once we churned through some of the patches23:06
clarkbdmsimard: if the host key has been accepted I think that's it23:06
dmsimardI have to step away for dinner but I'll probably take a few out23:06
dmsimardclarkb: yeah, did that too.23:07
clarkbya I'm about to call it a day myself. Got up very early and expect I'll try that again to walk frickler through the rest of the process23:07
dmsimardclarkb: I'll send you a link later tonight for continuous deployment dashboard spec23:07
dmsimardno rush, just sayin23:07
*** jesusaur has joined #openstack-sprint23:09
ianwwould someone mind a quick eye on https://review.openstack.org/#/c/526975/ and i'll see about status.o.o23:10
ianwi'm also working through the puppet for nodejs and ethercalc23:10
clarkbianw: ya I can take a look before I call it a day23:10
ianwyep we were chatting yesterday, all good23:11
clarkbianw: re 526975 I think you also want to add a status group? see https://review.openstack.org/52724523:13
ianwclarkb: ok, done23:15
clarkbianw: one thing inline23:17
pabelangerokay, hieradata for the eavesdrop group works, I've committed the change23:18
jeblairi've added a grafana group to private hiera23:18
clarkbianw: +2 thanks23:19
pabelangerokay, eavesdrop server failed. running with --keep to debug and propose fixes23:20
ianwhiera will fall back to the fqdn if the group doesn't exist?23:26
clarkbianw: yes23:26
clarkbmost specific match wins23:26
clarkbin the case of status.o.o -> status01.o.o there won't be an fqdn file for status01.o.o. What I did for translate was to copy the existing translate.o.o fqdn hiera data to a group for translate23:27
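(A minimal sketch of that copy; the fqdn/ and group/ layout under the production hieradata checkout and the file names are assumptions:)
    # as root on puppetmaster
    cd /etc/puppet/hieradata/production
    # copy the per-host data into a group file, as clarkb describes doing
    # for translate; the same approach would apply to a status group
    cp fqdn/translate.openstack.org.yaml group/translate.yaml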
clarkband I've now got kids telling me it's walk time so I gotta go23:27
clarkbthanks everyone see you tomorrow23:27
jeblairhttp://grafana01.openstack.org/dashboard/db/zuul-status23:29
jeblairthat looks really promising23:29
jeblairi'll delete dns for the old server and add a cname now23:30
pabelangernice23:38
jeblairnew dns has taken effect for me23:41
jeblairi'll delete the old server tomorrow unless someone screams23:42
*** baoli has joined #openstack-sprint23:46
*** baoli_ has joined #openstack-sprint23:50
*** baoli has quit IRC23:50
pabelangerokay, I see the issue with eavesdrop0123:52
pabelangerDec 11 23:26:21 eavesdrop01 puppet-user[11794]: (/Stage[main]/Ptgbot/Exec[install_ptgbot]) Failed to call refresh: Could not find command 'pip3'23:52
pabelangerI'll start working on a fix23:52
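(The eventual fix belongs in the puppet module, but as a hedged sketch of the manual equivalent on a xenial host — the package name is assumed to be python3-pip:)
    # pip3 comes from the python3-pip package on ubuntu xenial
    sudo apt-get update && sudo apt-get install -y python3-pip
    pip3 --version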
