19:01:12 #startmeeting infra
19:01:13 Meeting started Tue Jun 25 19:01:12 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:19 o/
19:01:25 #link http://lists.openstack.org/pipermail/openstack-infra/2019-June/006408.html
19:01:51 #topic Announcements
19:02:22 First up a warning that we did remove the zuul cloner shim and bindep fallback file from all jobs except for those that parent to openstack's legacy-base job
19:02:56 This should speed up the vast majority of our jobs, but there has been some fallout with jobs needing to either use bindep, drop z-c, or reparent to legacy-base
19:03:14 So be aware that change happened if people ask why mysql isn't installed or why zuul-cloner isn't found
19:03:42 Also the Shanghai OpenInfra Summit CFP closes July 2
19:03:45 #link https://cfp.openstack.org/ Shanghai Summit CFP Deadline July 2
19:04:01 If you'd like to speed there in november don't forget to get your proposals submitted
19:04:19 s/speed/speak/
19:04:50 #topic Actions from last meeting
19:05:05 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-06-18-19.00.txt minutes from last meeting
19:05:33 corvus: had a chance to work with jroll on self managed openstack github org yet?
19:05:50 yeah, he emailed me
19:06:07 I've been meaning to start that conversation with the TC today
19:06:27 sounds like progress, thanks
19:06:53 I haven't seen mordred here today but he was going to create an opendevadmin account in github. Anyone know if that happened?
19:07:13 also was going to look into cleaning up our unused, now stale openstack-infra github org
19:07:32 i'm unsure if either of those happened
19:07:51 #action mordred create opendevadmin github account
19:08:04 clarkb: oh, also came up yesterday that there are still a ton of opendev namespace repos in openstack governance reference projects list. are we planning to remove those?
19:08:04 #action mordred look into cleaning up openstack-infra github org as it is no longer used and is now stale
19:08:04 mordred is expecting next week to be a more regular work week, so hopefully something more concrete next meeting
19:08:46 fungi: I think that likely requires a bigger zuul-like discussion?
19:08:54 ahh, fair
19:09:01 ++
19:09:03 and maybe we are far enough along to have that discussion now
19:09:05 came up in the context of the python3 migration goal for openstack
19:09:21 and whether the opendev namespace repos need to be included in that work
19:09:48 (this is the stein supported runtimes addition of 3.7 and removal of 3.5)
19:10:22 much of our stuff still needs work to support python3 properly. I think we'd gladly accept help with those ports but probably getting ahead of ourselves to think we can even skip 3.5
19:10:49 yeah, i told coreycb to deprioritize any in the opendev namespace for now
19:11:00 sounds good
19:11:27 Probably a good transition to talking about opendev
19:11:28 #topic Priority Efforts
19:11:31 #topic OpenDev
19:12:27 opendev: it's what's for breakfast
19:12:46 I've put a little bit of time this morning into thinking through a gitea06 redeployment. I think the rough plan I'm going to use is to boot a new gitea06 on a manually uploaded image built by nodepool, add that to the inventory in system-config but exclude it from the play that creates projects in gitea within remote_puppet_git.yaml
19:13:36 that will have ansible install docker and deploy an empty gitea install. I can then restore the db from say gitea01 into the new gitea06, update dns, trigger replication from gerrit to gitea06, then when that is all done remove the exclusion in remote_puppet_git and add it to haproxy
19:13:49 corvus: ^ knowing what you know of gitea and its database any concerns with that process?
19:13:54 is there concern about image proliferation since any image uploaded has to be kept around in glance for as long as there is a server instance which used it for boot-from-volume cow?
19:14:04 clarkb: that sounds fine
19:14:17 fungi: yes before I went on vacation I started the process to remove control plane image management from nodepool for that reason
19:14:41 fungi: that is also why I manually uploaded one of the nodepool images outside of nodepool so that we can have a good working image without nodepool trying to unsuccessfully manage it for us
19:14:43 so we probably want to pick particular checkpoints and then reuse the same images for a while
19:15:04 i think we just have to live with that restriction, but because this is outside of nodepool, i don't think we'll be too bothered
19:15:05 rather than upload a new image each time we boot a new server
19:15:21 i think we're also still expecting to have a second nodepool manage these images eventually
19:15:25 and end up with almost as many images in glance as we have servers
19:15:36 so i think we will have image proliferation... i'm not too bothered by it
19:15:45 (as long as it isn't mucking up zuul's nodepool)
19:15:51 ya I think the bigger issue was the mixing with our test side stuff
19:15:56 yeah, i guess if we had a separate nodepool, it wouldn't have to upload new images daily either
19:15:58 as it will cause problems there
19:16:21 also these images are about 3GB in raw format
19:16:25 weekly/monthly images might be fine, and could cut down on the number of unique images we end up with in glance
19:17:20 for now we should be able to safely use the manually uploaded relatively recent image
19:17:35 I want to say our launch process updates packages and reboots too
19:17:47 (some recent discussions on kernels etc for dib have suggested we could get that 3gb a bit lower too)
19:17:49 yeah
19:18:46 ianw: improving that in dib would be great for a bunch of reasons :)
19:18:57 Anything else OpenDev related?
19:20:27 #topic Update Config Management
19:20:52 ianw: I've not caught up on the backups-with-ansible status since getting back. Is that something that still needs review?
19:23:07 reviews in, thanks ... i will work on implementing it now
19:23:41 it needs a trivial rebase to fix up the .zuul jobs after kafs things merged around it, that's all
19:23:47 sounds good.
19:24:06 Did anyone else have puppet replacement with ansible and/or docker changes in flight?
19:26:07 Sounds like no. Let's move on then.
19:26:11 #topic Storyboard
19:26:31 fungi: Looks like the db lock issues that we thought had been fixed are still an issue?
19:27:30 yes, the retries are not a solution because the first hit on that db deadlock causes the transaction to be rolled back and so the session gets set inactive and can't be reused
19:27:50 ah so need to retry with an entirely new session?
19:28:15 also i have a feeling the number of retries would need to scale (linearly) with the number of initial tasks being added, so it was a bit of a dirty workaround anyway in retrospect
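(For context on the deadlock/session exchange above: retrying after a MySQL deadlock only helps if each attempt opens a brand-new session, since the failed transaction has already been rolled back and the old session cannot be reused. A minimal illustrative SQLAlchemy sketch follows; this is not StoryBoard's actual code, and the helper name and connection string are placeholders.)

    from sqlalchemy import create_engine
    from sqlalchemy.exc import OperationalError
    from sqlalchemy.orm import sessionmaker

    # Placeholder DSN; a real deployment would read this from configuration.
    engine = create_engine("mysql+pymysql://user:password@localhost/storyboard")
    Session = sessionmaker(bind=engine)

    def run_with_retry(work, attempts=3):
        """Run work(session), retrying deadlocks with a fresh session each time."""
        for attempt in range(attempts):
            session = Session()
            try:
                result = work(session)
                session.commit()
                return result
            except OperationalError:
                # MySQL reports deadlocks as error 1213; the server has already
                # rolled back the transaction, so this session is discarded and
                # a new one is opened on the next iteration.
                session.rollback()
                if attempt == attempts - 1:
                    raise
            finally:
                session.close()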
19:28:57 we did also find some regressions in worklist and board creation which crept in with the team ownership feature, but those are patched now
19:29:09 Have we seen improvements on the slow query log side of things? Seems like there were changes to address what that had found before I took a week off
19:29:52 and also an infrequent assertionerror which seems to occur when trying to write to rabbitmq (possibly when the event coincides with a heartbeat timeout, and those are happening every few minutes in production)
19:30:48 i don't think there's been much new movement on the query optimizing, since the outreachy intern we got ended up accepting a "real job" and having to cancel on us at the last moment
19:31:36 aw bummer, congrats!
19:31:41 unfortunate for us but good for them I suspect
19:31:58 oh, also just moments ago cloudnull noticed that trying to import storyboardclient after pip installing it fails with a pbr versioning exception if your cwd is not a git repo
19:32:20 that's a strange one
19:32:48 like it can't find where the metadata got written
19:32:49 that sounds like maybe an old pbr problem
19:32:57 I want to say a bug around that was fixed years ago
19:33:20 pbr==5.3.1
19:33:50 which seems to be the latest
19:34:24 i suppose it could be a regression in last week's release, but seems unlikely we'd be the first to spot it
19:35:00 (last week's pbr 5.3.1 release i mean)
19:35:29 seems like that should be reproducible if it is outside of storyboardclient
19:35:48 I'll probably give that a quick go after the meeting
19:35:58 agreed, just starting to dig into it
19:37:01 ok anything else or should we move on to the next items on the agenda?
19:38:02 #topic General Topics
19:38:21 First up a quick check to make sure I'm not neglecting any needed help with the wiki upgrade or status host upgrade
19:38:28 fungi: mordred ^ anything we can help with there?
19:38:53 #link https://review.opendev.org/666162 Canonicalize clone URLs
19:39:07 seems to be the next issue i'm hitting with wiki-dev deployment
19:39:41 that is an easy one if anyone else has a quick moment
19:39:48 for whatever reason it seems like the clone url redirects are confusing the vcsrepo module
19:40:28 and causing it to record what it considers the relative path to the repo as the content of a .git file instead of creating a .git directory
19:41:34 fungi: as the output of the puppet git clone operations?
19:41:46 yes
19:42:28 so you end up with an otherwise empty directory that only contains a one-line file named .git whose contents are a relative file path
19:42:46 for example...
19:42:50 gitdir: ../../.git/modules/extensions/Renameuser
19:43:22 are those submodules that need initing maybe?
19:44:14 well we can debug that outside of the meeting
19:44:33 Next up is talking about the status of afs on bionic and where we are at with kafs vs openafs
19:44:39 ianw: ^ want to fill us in?
19:45:39 I'm particularly interested in which opendev mirrors we are using in prod, which of them are running kafs and which are running openafs, and what versions of their respective afs/kernel they are on
19:45:55 mostly to get a sense for how far into the future we've patched ourselves to get a working system
19:47:31 sorry, yeah ... so it looks like kafs is working
19:47:58 using the very latest afs-next branch from upstream developer dhowells
19:48:14 those patches are being finalised and making their way to the main kernel tree very soon
19:48:34 so we are providing useful real world testing of kafs? :)
19:48:45 i think the next step now is to bring up another kafs mirror a long way from the servers to test out higher latency and network interruptions
19:49:00 clarkb: yes; Tested-By: us :)
19:49:24 ianw: is the dfw mirror running the PPA-built 1.8.3 openafs?
19:49:45 yes, that has also been going along fine afaik, though i haven't been monitoring it as actively
19:50:38 sounds like we've got viable paths forward with both openafs and kafs then (and ya I've not seen screaming about out of date mirrors recently)
19:51:13 yep, once kafs changes make it to an upstream kernel, we have all the ansible roles now etc to deploy it
19:51:54 we certainly could create roles to build custom kernels etc, it might even be useful for upstream CI, but at this stage I think we just keep it a bit manual as we try things out
19:52:09 I met someone on canonical's ubuntu server team over the weekend and they seemed to appreciate my bbq so maybe I can churn out more ribs to get stuff backported into bionic too >_>
19:52:49 only thing i got a bit stuck on with CI was re-establishing the zuul console streamer inside the nested ansible runs when we reboot for a new kernel ... not sure we need to discuss that but i'm open to ideas :)
19:52:59 worth keeping in mind so far both of these are only testing anonymous/unauthenticated read-only afs client deployments
19:53:20 ianw: yeah, i guess that would need to be a ci-contingent task in our production playbooks
19:53:24 ianw: is adding a systemd unit file to start the console streamer at boot not a good solution?
19:53:41 ianw: I think zuul would appreciate tooling around making reboots better in general so probably worthwhile to see if we can figure that out
19:53:43 (having the job-only ansible add that unit file i mean)
19:53:53 (Rebooting comes up occasionally as a thing people want for a variety of reasons)
19:53:53 well, this is easy in zuul
19:53:54 fungi: it may be ... although it's usually initiated from the "other" side
19:54:05 what's initiated?
19:54:07 this is zuul-running-ansible-running-ansible
19:54:24 corvus: oh it's the second level ansible that makes it difficult because it knows not of zuul ?
19:54:29 clarkb: yep
19:54:30 fungi: the daemon; via ansible from the executor
19:54:32 got it
19:54:37 yeah, it's the nesting that gets you :)
19:54:51 clarkb: so far options are: put a CI task in the production playbook, or put a CI systemd unit on the production systems...
19:55:00 ianw: but can't it be started again from the node itself?
19:55:08 both are layer violations which we may just have to deal with
19:55:09 or does it have to be started from the executor after reboot?
19:55:53 not sure we have to put a ci systemd unit on the production systems if it's the job ansible which installs that unit
19:56:23 fungi: ideally, it would be started by the executor after reboot; but as far as the executor is concerned, it's just running one big task "ansible-playbook ..." so there's no point in between to insert the zuul_console: call, if that makes sense
19:56:25 we could even have our base job add that to all jobs
19:56:26 fungi: that might be a solution
19:56:34 then jobs can reboot without worrying about managing the daemon themselves?
19:56:42 that being the unit
19:56:55 basically at the very start of the job zuul adds a systemd unit file to the node
19:57:01 fungi: ya
19:57:13 yeah, maybe we should try that for this job (stick it in the run-base pre playbook) and see how it goes
19:57:18 and if we like it, think about putting it in base
19:57:21 and then the node can reboot as many times as it likes and zuul doesn't need to know
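(For context: the job-installed unit file being floated here could look roughly like the sketch below. It is hypothetical; the console streamer script is normally copied to the node and started by Zuul's zuul_console module from the executor, so the installed path and invocation shown are assumptions, and the job-only ansible would still need to place the script and this unit before enabling it.)

    [Unit]
    Description=Zuul console log streamer (CI-only, installed by the job's ansible)
    After=network.target

    [Service]
    # Assumed location for the streamer script; not where zuul_console puts it by default.
    ExecStart=/usr/bin/python3 /usr/local/bin/zuul-console
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target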
19:57:29 wow, everything is named base
19:57:36 all our base
19:57:41 belong to zuul
19:57:50 you got it
19:57:51 as a time check we only have a couple minutes left
19:57:57 meme overload
19:57:58 so I'll open the floor to anything else
19:58:02 #topic Open Discussion
19:58:26 You can find us in #openstack-infra or on the infra mailing list (openstack-infra@lists.openstack.org) if ~1.5 minutes isn't enough time
19:59:17 * corvus starts looking for a tshirt with "cats: there is no base, only zuul"
19:59:52 sounds like that may be it. Thank you everyone for the hour of your time
20:00:04 #endmeeting