19:01:12 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Jun 25 19:01:12 2019 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:19 <corvus> o/
19:01:25 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2019-June/006408.html
19:01:51 <clarkb> #topic Announcements
19:02:22 <clarkb> First up a warning that we did remove the zuul cloner shim and bindep fallback file from all jobs except for those that parent to openstack's legacy-base job
19:02:56 <clarkb> This should speed up the vast majority of our jobs, but there has been some fallout, with jobs needing to either use bindep, drop zuul-cloner, or reparent to legacy-base
19:03:14 <clarkb> So be aware that change happened if people ask why mysql isn't installed or why zuul-cloner isn't found
19:03:42 <clarkb> Also the Shanghai OpenInfra Summit CFP closes July 2
19:03:45 <clarkb> #link https://cfp.openstack.org/ Shanghai Summit CFP Deadline July 2
19:04:01 <clarkb> If you'd like to speak there in November don't forget to get your proposals submitted
19:04:50 <clarkb> #topic Actions from last meeting
19:05:05 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-06-18-19.00.txt minutes from last meeting
19:05:33 <clarkb> corvus: had a chance to work with jroll on self managed openstack github org yet?
19:05:50 <jroll> yeah, he emailed me
19:06:07 <jroll> I've been meaning to start that conversation with the TC today
19:06:27 <clarkb> sounds like progress, thanks
19:06:53 <clarkb> I haven't seen mordred here today, but he was going to create an opendevadmin account on github. Anyone know if that happened?
19:07:13 <clarkb> he also was going to look into cleaning up our unused, now stale openstack-infra github org
19:07:32 <corvus> i'm unsure if either of those happened
19:07:51 <clarkb> #action mordred create opendevadmin github account
19:08:04 <fungi> clarkb: oh, it also came up yesterday that there are still a ton of opendev namespace repos in the openstack governance reference projects list. are we planning to remove those?
19:08:04 <clarkb> #action mordred look into cleaning up openstack-infra github org as it is no longer used and is now stale
19:08:04 <corvus> mordred is expecting next week to be a more regular work week, so hopefully something more concrete next meeting
19:08:46 <clarkb> fungi: I think that likely requires a bigger, zuul-like discussion?
19:08:54 <fungi> ahh, fair
19:09:01 <corvus> ++
19:09:03 <clarkb> and maybe we are far enough along to have that discussion now
19:09:05 <fungi> came up in the context of the python3 migration goal for openstack
19:09:21 <fungi> and whether the opendev namespace repos need to be included in that work
19:09:48 <fungi> (this is the stein supported runtimes addition of 3.7 and removal of 3.5)
19:10:22 <clarkb> much of our stuff still needs work to support python3 properly. I think we'd gladly accept help with those ports, but it's probably getting ahead of ourselves to think we can even skip 3.5
19:10:49 <fungi> yeah, i told coreycb to deprioritize any in the opendev namespace for now
19:11:00 <clarkb> sounds good
19:11:27 <clarkb> Probably a good transition to talking about opendev
19:11:28 <clarkb> #topic Priority Efforts
19:11:31 <clarkb> #topic OpenDev
19:12:27 <fungi> opendev: it's what's for breakfast
19:12:46 <clarkb> I've put a little bit of time this morning into thinking through a gitea06 redeployment. The rough plan I'm going to use is: boot a new gitea06 on a manually uploaded image built by nodepool, add that to the inventory in system-config, but exclude it from the play that creates projects in gitea within remote_puppet_git.yaml
19:13:36 <clarkb> that will have ansible install docker and deploy an empty gitea install. I can then restore the db from, say, gitea01 into the new gitea06, update dns, trigger replication from gerrit to gitea06, and when that is all done remove the exclusion in remote_puppet_git and add it to haproxy
19:13:49 <clarkb> corvus: ^ knowing what you know of gitea and its database any concerns with that process?
19:13:54 <fungi> is there concern about image proliferation since any image uploaded has to be kept around in glance for as long as there is a server instance which used it for boot-from-volume cow?
19:14:04 <corvus> clarkb: that sounds fine
19:14:17 <clarkb> fungi: yes before I went on vacation I started the process to remove control plane image management from nodepool for that reason
19:14:41 <clarkb> fungi: that is also why I manually uploaded one of the nodepool images outside of nodepool so that we can have a working good image without nodepool trying to unsuccessfully manage it for us
19:14:43 <fungi> so we probably want to pick particular checkpoints and then reuse the same images for a while
19:15:04 <corvus> i think we just have to live with that restriction, but because this is outside of nodepool, i don't think we'll be too bothered
19:15:05 <fungi> rather than upload a new image each time we boot a new server
19:15:21 <corvus> i think we're also still expecting to have a second nodepool manage these images eventually
19:15:25 <fungi> and end up with almost as many images in glance as we have servers
19:15:36 <corvus> so i think we will have image proliferation... i'm not too bothered by it
19:15:45 <corvus> (as long as it isn't mucking up zuul's nodepool)
19:15:51 <clarkb> ya I think the bigger issue was the mixing with our test side stuff
19:15:56 <fungi> yeah, i guess if we had a separate nodepool, it wouldn't have to upload new images daily either
19:15:58 <clarkb> as it will cause problems there
19:16:21 <clarkb> also these images are about 3GB in raw format
19:16:25 <fungi> weekly/monthly images might be fine, and could cut down on the number of unique images we end up with in glance
19:17:20 <clarkb> for now we should be able to safely use the manually uploaded relatively recent image
19:17:35 <clarkb> I want to say our launch process updates packages and reboots too
19:17:47 <ianw> (some recent discussions on kernels etc for dib has suggested we could get that 3gb a bit lower too)
19:17:49 <corvus> yeah
19:18:46 <clarkb> ianw: improving that in dib would be great for a bunch of reasons :)
19:18:57 <clarkb> Anything else OpenDev related?
19:20:27 <clarkb> #topic Update Config Management
19:20:52 <clarkb> ianw: I've not caught up on the status of the backups-with-ansible work since getting back. Is that something that still needs review?
19:23:07 <ianw> reviews in, thanks ... i will work on implementing it now
19:23:41 <ianw> it needs a trivial rebase to fix up the .zuul jobs after kafs things merged around it, that's all
19:23:47 <clarkb> sounds good.
19:24:06 <clarkb> Did anyone else have puppet replacement with ansible and/or docker changes in flight?
19:26:07 <clarkb> Sounds like no. Lets move on then.
19:26:11 <clarkb> #topic Storyboard
19:26:31 <clarkb> fungi: Looks like the db lock issues that we thought had been fixed are still an issue?
19:27:30 <fungi> yes, the retries are not a solution because the first hit on that db deadlock causes the transaction to be rolled back and so the session gets set inactive and can't be reused
19:27:50 <clarkb> ah so need to retry with an entirely new session?
19:28:15 <fungi> also i have a feeling the number of retries would need to scale (linearly) with the number of initial tasks being added, so it was a bit of a dirty workaround anyway in retrospect
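[A minimal sketch of the fix implied here: retry the deadlocked work with a brand new session on each attempt instead of reusing the rolled-back one. The helper names are illustrative only, not StoryBoard's actual API.]

from sqlalchemy.exc import OperationalError

def run_with_retries(session_factory, work, attempts=3):
    """Run work(session) with a fresh session per attempt."""
    for attempt in range(attempts):
        # New session each time; the previous one is unusable after rollback.
        session = session_factory()
        try:
            result = work(session)
            session.commit()
            return result
        except OperationalError:  # e.g. a MySQL deadlock (error 1213)
            session.rollback()
            if attempt == attempts - 1:
                raise
        finally:
            session.close()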
19:28:57 <fungi> we did also find some regressions in worklist and board creation which crept in with the team ownership feature, but those are patched now
19:29:09 <clarkb> Have we seen improvements on the slow query log side of things? Seems like there were changes in flight to address what it had found before I took a week off
19:29:52 <fungi> and also an infrequent assertionerror which seems to occur when trying to write to rabbitmq (possibly when the event coincides with a heartbeat timeout, and those are happening every few minutes in production)
19:30:48 <fungi> i don't think there's been much new movement on the query optimizing, since the outreachy intern we got ended up accepting a "real job" and having to cancel on us at the last moment
19:31:36 <corvus> aw bummer, congrats!
19:31:41 <clarkb> unfortunate for us but good for them, I suspect
19:31:58 <fungi> oh, also just moments ago cloudnull noticed that trying to import storyboardclient after pip installing it fails with a pbr versioning exception if your cwd is not a git repo
19:32:20 <fungi> that's a strange one
19:32:48 <fungi> like it can't find where the metadata got written
19:32:49 <clarkb> that sounds like maybe an old pbr problem
19:32:57 <clarkb> I want to say a bug around that was fixed years ago
19:33:20 <fungi> pbr==5.3.1
19:33:50 <fungi> which seems to be the latest
19:34:24 <fungi> i suppose it could be a regression in last week's release, but seems unlikely we'd be the first to spot it
19:35:00 <fungi> (last week's pbr 5.3.1 release i mean)
19:35:29 <clarkb> seems like that should be reproducible if the problem is outside of storyboardclient
19:35:48 <clarkb> I'll probably give that a quick go after the meeting
19:35:58 <fungi> agreed, just starting to dig into it
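[A minimal sketch of one way to give that a quick go, assuming storyboardclient is already pip-installed: import it with the working directory set to a scratch directory that is not a git repo and see whether pbr raises a versioning exception.]

import subprocess
import sys
import tempfile

# A pbr versioning exception on import would show up as a non-zero exit status here.
with tempfile.TemporaryDirectory() as scratch:
    proc = subprocess.run(
        [sys.executable, "-c", "import storyboardclient"],
        cwd=scratch,
    )
    print("exit status:", proc.returncode)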
19:37:01 <clarkb> ok anything else or should we move on to the next items on the agenda?
19:38:02 <clarkb> #topic General Topics
19:38:21 <clarkb> First up a quick check to make sure I'm not neglecting any needed help with the wiki upgrade or status host upgrade
19:38:28 <clarkb> fungi: mordred ^ anything we can help with there?
19:38:53 <fungi> #link https://review.opendev.org/666162 Canonicalize clone URLs
19:39:07 <fungi> seems to be the next issue i'm hitting with wiki-dev deployment
19:39:41 <clarkb> that is an easy one if anyone else has a quick moment
19:39:48 <fungi> for whatever reason it seems like the clone url redirects are confusing the vcsrepo module
19:40:28 <fungi> and causing it to record what it considers the relative path to the repo as the content of a .git file instead of creating a .git directory
19:41:34 <clarkb> fungi: as the output of the puppet git clone operations?
19:41:46 <fungi> yes
19:42:28 <fungi> so you end up with an otherwise empty directory that only contains a one-line file named .git whose contents are a relative file path
19:42:46 <fungi> for example...
19:42:50 <fungi> gitdir: ../../.git/modules/extensions/Renameuser
19:43:22 <clarkb> are those submodules that need initing maybe?
19:44:14 <clarkb> well we can debug that outside of the meeting
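[A small sketch of the check suggested above, under the assumption that those "gitdir:" pointer files are submodule gitlinks: look for mode-160000 entries in the parent repo's index and initialize them if found. Path handling is illustrative only.]

import subprocess

def init_submodules_if_any(repo_path):
    # Gitlink (submodule) entries appear in the index with mode 160000.
    index = subprocess.run(
        ["git", "-C", repo_path, "ls-files", "--stage"],
        capture_output=True, text=True, check=True,
    ).stdout
    if any(line.startswith("160000 ") for line in index.splitlines()):
        subprocess.run(
            ["git", "-C", repo_path, "submodule", "update", "--init", "--recursive"],
            check=True,
        )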
19:44:33 <clarkb> Next up is talking about the status of afs on bionic and where we are at with kafs vs openafs
19:44:39 <clarkb> ianw: ^ want to fill us in?
19:45:39 <clarkb> I'm particularly interested in which opendev mirrors we are using in prod, which of them are running kafs vs openafs, and what versions of their respective afs/kernel they are on?
19:45:55 <clarkb> mostly to get a sense for how far into the future we've patched ourselves to get a working system
19:47:31 <ianw> sorry, yeah ... so it looks like kafs is working
19:47:58 <ianw> using the very latest afs-next branch from upstream developer dhowells
19:48:14 <ianw> those patches are being finalised and making their way to the main kernel tree very soon
19:48:34 <clarkb> so we are providing useful real world testing of kafs? :)
19:48:45 <ianw> i think the next step now is to bring up another kafs mirror a long way from the servers to test out higher latency and network interruptions
19:49:00 <ianw> clarkb: yes; Tested-By: us :)
19:49:24 <clarkb> ianw: is the dfw mirror running the PPA built 1.8.3 openafs?
19:49:45 <ianw> yes, that has also been going along fine afaik, though i haven't been monitoring it as actively
19:50:38 <clarkb> sounds like we've got viable paths forward with both openafs and kafs then (and ya I've not seen screaming about out of date mirrors recently)
19:51:13 <ianw> yep, once kafs changes make it to an upstream kernel, we have all the ansible roles now etc to deploy it
19:51:54 <ianw> we certainly could create roles to build custom kernels etc, it might even be useful for upstream CI, but at this stage I think we just keep it a bit manual as we try things out
19:52:09 <clarkb> I met someone on canonical's ubuntu server team over the weekend and they seemed to appreciate my bbq, so maybe I can churn out more ribs to get stuff backported into bionic too >_>
19:52:49 <ianw> only thing i got a bit stuck on with CI was re-establishing the zuul console streamer inside the nested ansible runs when we reboot for a new kernel in CI ... not sure we need to discuss that but i'm open to ideas :)
19:52:59 <fungi> worth keeping in mind so far both of these are only testing anonymous/unauthenticated read-only afs client deployments
19:53:20 <corvus> ianw: yeah, i guess that would need to be a ci-contingent task in our production playbooks
19:53:24 <fungi> ianw: is adding a systemd unit file to start the console streamer at boot not a good solution?
19:53:41 <clarkb> ianw: I think zuul would appreciate tooling around making reboots better in general so probably worthwhile to see if we can figure that out
19:53:43 <fungi> (having the job-only ansible add that unit file i mean)
19:53:53 <clarkb> (Rebooting comes up occasionally as a thing people want for a variety of reasons)
19:53:53 <corvus> well, this is easy in zuul
19:53:54 <ianw> fungi: it may be ... although it's usually initiated from the "other" side
19:54:05 <fungi> what's initiated?
19:54:07 <corvus> this is zuul-running-ansible-running-ansible
19:54:24 <clarkb> corvus: oh, it's the second-level ansible that makes it difficult because it knows not of zuul?
19:54:29 <corvus> clarkb: yep
19:54:30 <ianw> fungi: the daemon; via ansible from the executor
19:54:32 <clarkb> got it
19:54:37 <ianw> yeah, it's the nesting that gets you :)
19:54:51 <corvus> clarkb: so far options are: put a CI task in the production playbook, or put a CI systemd unit on the production systems...
19:55:00 <fungi> ianw: but can't it be started again from the node itself?
19:55:08 <corvus> both are layer violations which we may just have to deal with
19:55:09 <fungi> or does it have to be started from the executor after reboot?
19:55:53 <fungi> not sure we have to put a ci systemd unit on the production systems if it's the job ansible which installs that unit
19:56:23 <ianw> fungi: ideally, it would be started by the executor after reboot; but as far as the executor is concerned, it's just running one big task "ansible-playbook ..." so there are no points in between to insert the zuul_console: call, if that makes sense
19:56:25 <clarkb> we could even have our base job add that to all jobs
19:56:26 <corvus> fungi: that might be a solution
19:56:34 <clarkb> then jobs can reboot without worrying about managing the daemon themselves?
19:56:42 <clarkb> that being the unit
19:56:55 <fungi> basically at the very start of the job zuul adds a systemd unit file to the node
19:57:01 <clarkb> fungi: ya
19:57:13 <corvus> yeah, maybe we should try that for this job (stick it in the run-base pre playbook) and see how it goes
19:57:18 <corvus> and if we like it, think about putting it in base
19:57:21 <fungi> and then the node can reboot as many times as it likes and zuul doesn't need to know
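[A rough sketch of the idea fungi describes: have job-side setup drop a systemd unit so the console streamer comes back on its own after a reboot, without the executor needing to re-run zuul_console. The unit name, streamer script path, and install steps are assumptions for illustration, not how zuul actually ships the streamer.]

import pathlib
import subprocess

# Hypothetical unit; the ExecStart path is an assumed location for the streamer script.
UNIT = """\
[Unit]
Description=Zuul console log streamer
After=network.target

[Service]
ExecStart=/usr/bin/python3 /var/lib/zuul-console/zuul_console.py
Restart=always

[Install]
WantedBy=multi-user.target
"""

def install_console_unit():
    # Write the unit, reload systemd, and start it now plus on every future boot.
    pathlib.Path("/etc/systemd/system/zuul-console.service").write_text(UNIT)
    subprocess.run(["systemctl", "daemon-reload"], check=True)
    subprocess.run(["systemctl", "enable", "--now", "zuul-console.service"], check=True)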
19:57:29 <corvus> wow, everything is named base
19:57:36 <fungi> all our base
19:57:41 <clarkb> belong to zuul
19:57:50 <fungi> you got it
19:57:51 <clarkb> as a time check we only have a couple minutes left
19:57:57 <corvus> meme overload
19:57:58 <clarkb> so I'll open the floor to anything else
19:58:02 <clarkb> #topic Open Discussion
19:58:26 <clarkb> You can find us in #openstack-infra or on the infra mailing list (openstack-infra@lists.openstack.org) if ~1.5 minutes isn't enough time
19:59:17 * corvus starts looking for a tshirt with "cats: there is no base, only zuul"
19:59:52 <clarkb> sounds like that may be it. Thank you everyone for the hour of your time
20:00:04 <clarkb> #endmeeting