19:01:22 <clarkb> #startmeeting infra
19:01:23 <openstack> Meeting started Tue Mar 13 19:01:22 2018 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:26 <openstack> The meeting name has been set to 'infra'
19:01:36 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:44 <clarkb> #topic Announcements
19:01:56 <pabelanger> o/
19:01:57 <clarkb> fungi has created a rocky release gpg key that needs signing
19:02:06 <clarkb> infra-rooters should sign that key as they are able
19:02:23 <fungi> oh, right
19:02:25 <fungi> lemme get a link
19:02:41 <diablo_rojo> o/
19:02:58 <fungi> #link https://sks-keyservers.net/pks/lookup?op=vindex&search=0xc31292066be772022438222c184fd3e1edf21a78&fingerprint=on Rocky Cycle signing key
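(For reference: attesting to a cycle signing key generally looks like the sketch below. The fingerprint comes from the keyserver link above; the hkps server choice and local gpg steps are a typical workflow, not the documented infra process.)

    FPR=c31292066be772022438222c184fd3e1edf21a78
    gpg --keyserver hkps://sks-keyservers.net --recv-keys "$FPR"
    gpg --fingerprint "$FPR"    # verify out-of-band before signing
    gpg --sign-key "$FPR"       # attest with your own key
    gpg --keyserver hkps://sks-keyservers.net --send-keys "$FPR"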
19:03:05 <fungi> i also have a related documentation update
19:03:31 <dmsimard> I signed that, but I had never uploaded my own personal key before; I still need to do that
19:03:39 <fungi> #link https://review.openstack.org/551376 Update signing docs for Zuul v3
19:04:08 <clarkb> looks like I need to re-review it
19:04:21 <fungi> there are a couple of minor tweaks to the process which also impact attestation (specifically the location of the per-cycle revocation certificates it asks that you download a copy of)
19:04:36 <fungi> which is why i mention it
19:05:05 <fungi> in another weekish i'll un-wip the corresponding release change and move forward with the remaining steps to replace the key in zuul configuration
19:05:15 <clarkb> cool. That was the only announcement I had
19:05:39 <clarkb> #topic Actions from last meeting
19:05:47 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-03-06-19.01.txt minutes from last meeting
19:06:19 <clarkb> fungi had an action to generate the rocky key. That is done. I also got around to going over the three specs I had called out for cleanup. Two were abandoned and one was rebased with some minor edits
19:07:17 <clarkb> I still have it as a low-priority todo item to continue going through specs and clean up/rebase as necessary, but I'm not sure it needs to be an action item until there is something actionable
19:07:33 <clarkb> #topic Specs approval
19:07:43 <clarkb> #link https://review.openstack.org/#/c/550550/ Improve IRC discoverability
19:08:08 <dmsimard> It's not so much up for approval as it is me pointing out that I wrote that thing, so I'm looking for reviews
19:08:11 <clarkb> this spec is dmsimard's. Looks like it has gotten some reviews. Do you (dmsimard) and its reviewers think it is ready to go up for a vote?
19:08:14 <clarkb> ah ok
19:08:39 <dmsimard> Sorry, I didn't realize this section was "for approval"
19:08:59 <clarkb> dmsimard: it's not the only thing we use this section for, so it's fine.
19:09:19 <clarkb> I guess this gets it in front of people who will hopefully review it now :) and thank you to those of you who have already reviewed it
19:09:53 <clarkb> please do review that if you have time. But I'm going to keep moving as there is quite a bit left on the agenda, and important things too
19:09:56 <clarkb> #topic Priority Efforts
19:10:01 <clarkb> #topic Zuul v3
19:10:12 <clarkb> #link https://review.openstack.org/552637 Proposal to make Zuul a top level project
19:10:37 <fungi> yay!
19:10:37 <clarkb> corvus has proposed a change to openstack governance to remove Zuul from openstack in order to make it its own top level project
19:10:58 <clarkb> If you have any questions or concerns I'm happy to field them here and I think the TC and corvus are also more than willing to discuss what this means
19:11:11 <corvus> these are true statements
19:12:08 <clarkb> The other zuul item worth mentioning was a security-related bug in zuul's executors that bwrap isolated us against
19:12:15 <fungi> #link http://lists.zuul-ci.org/pipermail/zuul-announce/2018-March/000001.html Zuul untrusted information disclosure
19:12:25 <corvus> we're running with the fix for that now
19:12:50 <corvus> we're expecting a couple more like that
19:13:08 <fungi> there's also a discussion underway about rescheduling the zuul weekly meeting
19:13:14 <fungi> #link http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-March/000064.html Zuul meeting time
19:14:21 <pabelanger> ++
19:15:00 <clarkb> were there any other zuul items to bring up? I'll also give us a few minutes before continuing on, in case anyone wants to talk about the zuul governance change or security
19:16:26 <corvus> i think that's about it
19:16:58 <tobiash> Maybe to note that we're trying out a storyboard based approach for dealing with security issues
19:17:19 <corvus> oh yes!
19:17:26 <clarkb> oh ya and you'll find direction on how to submit such things in the zuul docs soon (if that isn't already merged)
19:17:29 <corvus> hopefully we'll have some documentation for that soon
19:18:14 <tobiash> So I consider myself a beta user of this new process :)
19:18:39 <fungi> right, i've got a to-do this week to draft some vulnerability management process and user-facing documentation on reporting suspected vulnerabilities
19:19:24 <fungi> so far, at least, storyboard hasn't seemed to entirely fight us on having a private vulnerability report and discussing proposed patches
19:19:37 <fungi> we'll see how it goes
19:20:11 <clarkb> alright, if anyone has questions or concerns you can find us in #openstack-infra and in #zuul and the TC over in #openstack-tc. There is also the infra and zuul and openstack dev mailing lists. Definitely reach out if you want to.
19:20:32 <clarkb> #topic Project Renames
19:21:12 <clarkb> before the PTG we pencilled in this friday as a project rename day. Since then I think we've lost monty who was involved with one of the projects wanting a rename
19:21:35 <clarkb> I expect that the release team will be ok with renames on Friday if we want to move forward since they've loosened the requirements on trailing project releases
19:21:54 <clarkb> fungi: corvus: do we think we can realistically rename projects for mordred without mordred being around?
19:22:12 <corvus> i think monty is expecting to be back by friday, but i can't guarantee that :)
19:22:43 <fungi> mmm
19:22:49 <clarkb> there was one other project as well. I do think it would be good to have at least one rename behind us with zuulv3
19:22:54 <clarkb> just so that we can sort out a working process
19:23:12 <corvus> i can help out, but i don't have time to prepare/drive it this week
19:23:28 <clarkb> the other item that made this somewhat urgent (fixing nova-specs repo replication) has been addressed already so I think we don't have to commit to it this week if impractical without monty
19:23:32 <fungi> yeah, i'm a smidge lost on what the order of operations will be with code changes to perform the rename (which in theory is the only real change to the process?)
19:23:50 <clarkb> fungi: ya I think its all on the zuul side that changes
19:23:55 <clarkb> to make sure we don't wedge zuul
19:24:07 <fungi> did we say we'll need to force in follow-up changes to repos with broken zuul configuration prior to restarting zuul?
19:24:52 <clarkb> there was talk of making zuul ignore broken configs on start up. corvus did that happen? if so I don't think we need to force things in, instead the project would just have to get updates in before they would function again
19:25:12 <tobiash> Not yet
19:25:16 <corvus> we shouldn't need changes to the repos themselves at this point, as long as they don't have project names in their config
19:25:27 <corvus> that was the biggest concern
19:26:08 <corvus> so it may just be a matter of landing a main.yaml change after gerrit is back up.
19:26:39 <corvus> oh... there's probably something in project-config too... hrm
19:27:13 <corvus> so if there's an entry in project-config for it, we'll need 3 changes.  remove from the live config, rename in the tenant config, add back to live config with new name
19:27:59 <clarkb> corvus: the last two have to be force merges currently once gerrit is back?
19:28:13 <corvus> clarkb: no, should just be regular project-config merges
19:28:33 <clarkb> so zuul will start back up again even in that state?
19:28:59 <fungi> okay, but basically we can't rename in one project-config change (even if the repo-specific config doesn't require its own editing)? we need to remove with a change and then add with another change?
19:29:31 <clarkb> oh right its the removal that allows it to start up cleanly
19:29:47 <corvus> clarkb: oh, if we're stopping zuul (i guess we are because gerrit is going down) we want 4 changes.  remove from live config, remove from tenant.  [rename].  add to tenant, add to live config.
19:29:56 <corvus> all of those can be regular changes though.
19:30:00 <corvus> no force-merges
19:30:04 <dmsimard> I had topics for the meeting, but with the DST change the middle of the meeting now conflicts with my kids getting out of school, so I'll have to drop. Just looking for reviews and a general question I'm not required to be here for. Be back in a while o/
19:30:11 <clarkb> right its the removal that allows it to be regular changes
19:30:22 <corvus> ya
19:30:28 <fungi> so changes 1 and 2 merge, then we stop zuul, do the offline manual rename bits, start zuul, merge the 3rd and 4th changes?
19:30:33 <corvus> yep
19:30:34 <clarkb> fungi: ya
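(To summarize the agreed sequence as a rough sketch only: service names, filesystem paths, and the "old-name"/"new-name" placeholders below are all assumptions; the authoritative steps belong in the rename docs fungi is drafting.)

    # changes 1 and 2: remove the project from the live project-config and
    # from the tenant config (main.yaml); merge both, then take the outage:
    sudo systemctl stop zuul-scheduler      # unit names are assumptions
    sudo systemctl stop gerrit
    sudo mv /home/gerrit2/review_site/git/openstack-infra/old-name.git \
            /home/gerrit2/review_site/git/openstack-infra/new-name.git
    # ...plus the usual gerrit database/replication updates from the docs...
    sudo systemctl start gerrit
    sudo systemctl start zuul-scheduler
    # changes 3 and 4: add the project back to main.yaml under its new name,
    # then re-add it to the live project-config; both merge normally.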
19:31:05 <fungi> at what point do we merge in-repo config changes, if required?
19:31:17 <clarkb> fungi: I think that would be step 5
19:31:34 <corvus> there shouldn't be any... but if there are, we need to think about a custom plan
19:31:36 <fungi> there'll be at least one for .gitreview anyway, so presumably zuul config tweaks can go in the same change
19:31:48 <corvus> for example, if the repo has jobs used by other repos, that's a problem
19:31:57 <fungi> oh, right
19:32:02 <clarkb> hrm
19:32:19 <corvus> that's the sort of thing where we either need 'load with broken config' or we just need to force-merge all the changes with zuul off
19:32:39 <clarkb> so maybe in this case we are best off actually writing down the steps above in our normal renaming documentation, along with the concerns, until zuul can gracefully fall back
19:32:45 <corvus> (or, if it's just one job on one other repo, maybe we temporarily remove it then add it back)
19:32:51 <clarkb> then check each of the repos to be renamed against those concerns, then do it
19:32:56 <fungi> this is where the tc resolution we've been talking about comes in, to permit us to bypass code review for changes to various projects when required for expedient modification to ci/automation configuration
19:33:39 <corvus> i have that half-written.  but i don't think we have to block in this case, i'm sure we can get permission from the involved projects.
19:33:49 <fungi> totally agree
19:33:49 <corvus> (i hit writers block on that resolution, sorry)
19:33:59 <clarkb> ya I don't think the tc resolution would block us
19:34:12 <clarkb> more that I don't want us to discover on friday afternoon that we all of a sudden have to merge a bunch of stuff to get zuul started again
19:34:54 <clarkb> also I'm not entirely sure I know exactly what mordred wants to rename
19:35:07 <clarkb> so it might be difficult to evaluate this particular case with regards to his project(s)
19:35:41 <corvus> "Something related to sdks/shade (via mordred)"
19:35:45 <corvus> i agree that's not much to go on
19:35:57 <corvus> i mean, we could just take a guess and see if he likes what we came up with
19:35:57 <clarkb> ya I wrote that down based on a random irc message from mordred at some point
19:36:00 <clarkb> corvus: ha
19:36:17 <corvus> we renamed it "shadybee".  we thought you'd like that.
19:36:45 <clarkb> I'm inclined to say let's focus on getting the docs updated with concrete steps to do the rename and gotchas to watch out for, like jobs in one project being used by another project
19:37:12 <clarkb> then resync with monty, get concrete todo (probably in the form of a gerrit change with the new project name and all the details), then do rename based on the docs update
19:37:19 <corvus> yah.  after this, i'm leaning toward saying the plan is to force-merge the project-config changes, until we support loading broken configs.
19:37:46 <clarkb> fungi: is the process documentation still somethign you were interested in doing?
19:37:56 <corvus> that's probably the simplest, and (bizarrely) least risky.  main thing is if we typo something in project-config, we might need to force-merge a followup fix.
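(If force-merging does become the plan, the usual mechanism is the gerrit ssh CLI with admin rights; a hedged sketch, where the hostname, change number, and exact label set are assumptions:)

    # approve and submit a change while zuul is offline, bypassing CI;
    # requires membership in a gerrit group allowed to vote these labels
    ssh -p 29418 review.openstack.org gerrit review 12345,1 \
        --label Code-Review=+2 --label Verified=+2 --label Workflow=+1 \
        --submit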
19:38:38 <clarkb> I'm inclined to volunteer for making zuul handle broken configs but I've read the config loader code and wouldn't want to promise anything :)
19:38:39 <fungi> clarkb: it is, but i feel like it'll be an untested shell until we actually try to do it at least once (and probably not fully fleshed-out until we've done it a few times and hit various corner cases)
19:39:08 <clarkb> fungi: ya but at least we'll have something to start with and then iterate on
19:39:14 <corvus> clarkb: i'd recommend at least waiting until i finish making it "better" :)
19:39:27 <fungi> sure, i'm up for writing that and having something in gerrit circa thursday-ish
19:39:30 <clarkb> ok lets start there then and resync on tuesday hopefully with a mordred around
19:39:38 <clarkb> I'll update the release team
19:39:46 <clarkb> (and now moving on because we're running out of time)
19:39:51 <clarkb> #topic General Topics
19:40:04 <clarkb> ianw: would like to talk about AFS, reprepro and reliability
19:41:01 <ianw> yes
19:41:04 <clarkb> my understanding is that having reprepro write out new repo mirrors is not working so well
19:41:15 <ianw> no -- http://paste.openstack.org/show/700292/
19:41:34 <ianw> it is not very fault tolerant at all
19:42:08 <ianw> compared to, say, an rsync.  so there was some chatter with auristor in openstack-infra ... we debugged some of the client issues
19:42:29 <ianw> *some* of it might be fixed in later versions; particularly stuff around interrupted syscalls
19:42:58 <clarkb> ianw: do we think the fileservers or the clients or both need to be updated?
19:43:00 <ianw> so, my proposal is to custom build some later version client debs and manually install on mirror-update, and see how we go for a while
19:43:09 <fungi> something to the effect that the version of openafs shipped in ubuntu is incompatible with the kernel shipped in the same ubuntu version
19:43:46 <clarkb> ianw: can you reuse the debs that were built for the zuul executors? that would make this simpler probably?
19:44:01 <corvus> what do the initial errors look like?
19:44:17 <corvus> (the pasted errors seem to generally be follow-up errors about a lockfile sticking around)
19:44:18 <ianw> possibly, i was also thinking of a more radical jump to the 1.8 branch ... if nothing else at least we can provide feedback
19:44:54 <ianw> corvus: it varies a lot, but the underlying issue is usually that afs issues cause broken writes
19:45:06 <pabelanger> when I've looked at reprepro / afs failures, it was because we lost access to AFS; reprepro died and left a lockfile.
19:45:23 <ianw> then once the reprepro fails, or the db gets corrupt, we have to intervene
19:45:33 <fungi> if 1.8.0 works out, we can presumably upgrade to ubuntu bionic and get the same version
19:45:34 <ianw> yep, what pabelanger said :)
19:46:00 <ianw> fungi: yes, that was my thinking, to pre-stage that essentially
19:46:16 <pabelanger> but, it does seem to be happening more. I wonder if it is because our reprepro database is starting to get larger, now with 3 distros (trusty, xenial, bionic)
19:46:24 <fungi> they're on 1.8.0~pre5-1 at the moment
19:46:25 <ianw> if it explodes worse, then we can go back.
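(A sketch of what the custom client build might look like, e.g. backporting the newer Debian source package fungi mentions; the suite and version strings are illustrative, not a tested recipe:)

    # rebuild the newer openafs source package locally on xenial;
    # assumes a bionic deb-src line is configured in sources.list
    apt-get source openafs/bionic
    cd openafs-1.8.0~pre5/              # directory name is illustrative
    sudo apt-get build-dep openafs
    dpkg-buildpackage -us -uc -b        # unsigned, binary-only build
    # then install the client debs by hand on mirror-update, e.g.
    #   sudo dpkg -i ../openafs-client_*.deb ../openafs-modules-dkms_*.deb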
19:46:34 <corvus> at best, that would *reduce* the rate of failure.  but any network filesystem is going to hit an issue at some point.  are we sure we can't just delete the lockfile and retry?
19:47:03 <ianw> corvus: in some cases, yes.  but likely as not some of the .db files are corrupt and you know how that goes :)
19:47:07 <corvus> pabelanger: surely they each have their own databases?
19:47:36 <ianw> today i will go through that paste list and try to get them all restarted
19:48:07 <corvus> maybe we could automatically recover by rsyncing the read-only volume contents back to the read-write volume and restarting?
19:48:07 <clarkb> I think auristor also mentioned that some behavior change in the linux kernel means that clients writes will be flakier
19:48:11 <clarkb> (something to do with interrupts)
19:48:34 <fungi> yeah, this is why he was suggesting newer openafsclient
19:48:37 <pabelanger> corvus: I don't think so, we clump them into a single reprepro update run. We could split that out into per distro updates I think
19:48:39 <corvus> [to be clear, upgrading to be less flaky sounds great, i'm just wondering if we can do that and make it more robust at the same time]
19:48:47 <clarkb> corvus: ++
19:49:21 <ianw> my thought is that it's unlikely to be worse than the current status quo, and possibly likely to be better
19:49:48 <clarkb> I'm willing to try updated afs client on mirror-update and continue to try and make it more reliable from there with smarter fault recovery
19:49:56 <clarkb> resyncing from the ro volume seems like a good place to start for that
19:50:04 <pabelanger> Yah, I've tested replacing the (corrupt) read/write database with the read-only database, re-running reprepro, and it does recover. Just takes some time to index everything again
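(Putting those two recovery ideas together as a sketch: clear a stale lock, restore the db files from the last released read-only copy, and re-run. The /afs/.openstack.org read-write vs /afs/openstack.org read-only path convention is standard AFS practice, but the mirror layout here is an assumption.)

    RW=/afs/.openstack.org/mirror/ubuntu   # read-write volume path (assumed)
    RO=/afs/openstack.org/mirror/ubuntu    # last released read-only copy
    # remove a stale lock only if no reprepro process is still running
    pgrep -f reprepro >/dev/null || rm -f "$RW/db/lockfile"
    # roll the database back to the last-good state from the RO volume
    rsync -a --delete "$RO/db/" "$RW/db/"
    reprepro --basedir "$RW" update        # re-index; this takes a while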
19:50:55 <ianw> good idea, i could definitely add a roll-back step to the script
19:51:29 <clarkb> ianw: anything else to add before we talk arm64?
19:51:38 <corvus> i wish there were an afs command to do that instantly (since i know under the hood it's cow), but i don't think there is; i think rsync is the best option.
19:51:57 <ianw> so we want to try the updated client?
19:52:04 <corvus> wfm
19:52:10 <clarkb> ianw: sounds reasonable to me especially given what auristor has said
19:52:11 <fungi> we also know we want newer openafs anyway for various security reasons, so i'm all for whatever gets us a sane story there (whether that's upgrading servers sooner, or coming up with a convenient means of maintaining more recent openafs on existing distro versions)
19:52:24 <ianw> ok, i'll work on that
19:53:01 <ianw> if we have issues with the server side, there might be options too
19:53:08 <ianw> but let's discuss that another time
19:53:19 <clarkb> re arm64 we ran a job successfully
19:53:28 <clarkb> does that mean the infra pieces are largely in place?
19:53:48 <ianw> yep, i think so (modulo the now broken ubuntu-ports mirroring re above :)
19:54:14 <clarkb> there was also mailing list thread about getting a kolla job running on arm64
19:54:24 <ianw> i dropped some comments in https://review.openstack.org/549319 after looking at the run; it's not super fast
19:54:27 <clarkb> which should do a decent job of exercising it a bit more than pep8
19:54:44 <ianw> so we'll have to take that into account
19:54:45 <gema> ok, I have asked xinliang in my team to add a kolla build (he is working with jeffrey4l on that)
19:55:07 <ianw> yep, i didn't respond to the email yet but of course ping me if there's issues :)
19:55:15 <gema> ianw: will do, thanks a lot
19:55:28 <ianw> i will merge the dib bits now, i think they're sufficiently tested
19:55:34 <gema> +1
19:55:50 <ianw> i think that's it for now
19:55:56 <clarkb> yay progress
19:56:07 <clarkb> thank you everyone that has worked to move that stuff along
19:56:26 <clarkb> #link https://review.openstack.org/#/q/topic:ara-sqlite-middleware Ara sqlite middleware will allow us to go back to having ara results on all jobs
19:56:55 <clarkb> dmsimard: ^ would like reviews on this topic. One of the things it will do is cut the number of files copied per job a lot which should speed up job runtimes and allow us to have ara results on all jobs again
19:57:35 <clarkb> In addition to that dmsimard was thinking it would be a good idea to have the infra meeting time and location in our irc channel topic. I think this is a reasonable thing to do, though our topic is already so long
19:57:38 <dmsimard> Also, FWIW 1.0 work has resumed and it's very interesting. Can't wait to share more about it.
19:57:43 * dmsimard is back from school
19:57:56 <pabelanger> https://review.openstack.org/549644/ also gets us started on arm64 wheels, which I've +3'd too for testing
19:57:59 <clarkb> I can't see the whole infra topic in my irc client so I'm probably a bad one to ask if that would be useful
19:58:25 <clarkb> dmsimard: I think if no one opposes it we can go ahead and make that change
19:58:31 <corvus> people see it on joining
19:58:35 <corvus> well, they "see" it
19:58:37 <fungi> my client shows this much of the docs url... "http://d"
19:58:41 <clarkb> corvus: thats a good point, it does message it to you on join
19:58:47 <dmsimard> yeah what corvus said, it's the first thing that's printed when joining a channel
19:58:55 <corvus> perhaps "they are shown it" is the right phrasing :)
19:59:01 <fungi> wfm
19:59:29 <clarkb> we are basically at time now. So rather than official open discussion time, feel free to join us in #openstack-infra for any remaining items/topics
19:59:43 <clarkb> thank you everyone
19:59:49 <clarkb> #endmeeting