19:01:22 #startmeeting infra
19:01:23 Meeting started Tue Mar 13 19:01:22 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:26 The meeting name has been set to 'infra'
19:01:36 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:44 #topic Announcements
19:01:56 o/
19:01:57 fungi has created a rocky release gpg key that needs signing
19:02:06 infra-rooters should sign that key as they are able
19:02:23 oh, right
19:02:25 lemme link
19:02:41 o/
19:02:58 #link https://sks-keyservers.net/pks/lookup?op=vindex&search=0xc31292066be772022438222c184fd3e1edf21a78&fingerprint=on Rocky Cycle signing key
19:03:05 i also have a related documentation update
19:03:31 I signed that, but I had never uploaded my own personal key before; I still need to do that
19:03:39 #link https://review.openstack.org/551376 Update signing docs for Zuul v3
19:04:08 looks like I need to re-review it
19:04:21 there are a couple of minor tweaks to the process which also impact attestation (specifically the location of the per-cycle revocation certificates it asks that you download a copy of)
19:04:36 which is why i mention it
19:05:05 in another weekish i'll un-wip the corresponding release change and move forward with the remaining steps to replace the key in zuul configuration
19:05:15 cool. That was the only announcement I had
19:05:39 #topic Actions from last meeting
19:05:47 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-03-06-19.01.txt minutes from last meeting
19:06:19 fungi had an action to generate the rocky key. That is done. I also got around to going over the three specs I had called out for cleanup. Two were abandoned and one was rebased with some minor edits
19:07:17 I still have it as a low-priority todo item to continue to go through specs and clean up/rebase as necessary, but I'm not sure it needs to be an action item until there is something actionable
19:07:33 #topic Specs approval
19:07:43 #link https://review.openstack.org/#/c/550550/ Improve IRC discoverability
19:08:08 It's not so much up for approval as just pointing out that I wrote that thing, so I'm looking for reviews
19:08:11 this spec is dmsimard's. Looks like it has gotten some reviews. Do you (dmsimard) and its reviewers think it is ready to go up for a vote?
19:08:14 ah ok
19:08:39 Sorry, I didn't realize it was "for approval"
19:08:59 dmsimard: it's not the only thing we use this section for, so it is fine.
19:09:19 I guess this gets it in front of people who will hopefully review it now :) and thank you to those of you who have already reviewed it
19:09:53 please do review that if you have time. But I'm going to keep moving as there is quite a bit left on the agenda, and important things too
19:09:56 #topic Priority Efforts
19:10:01 #topic Zuul v3
19:10:12 #link https://review.openstack.org/552637 Proposal to make Zuul a top level project
19:10:37 yay!
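For reference, signing and publishing the Rocky key mentioned above generally follows a flow like the one below; this is a minimal sketch, assuming a working gpg setup and keyserver access, with YOUR_KEY_ID standing in for your own personal key. The signing docs change linked above (551376) covers the full attestation process, including the per-cycle revocation certificates.

    # Fetch the Rocky cycle signing key from the keyserver linked above
    gpg --keyserver hkps://sks-keyservers.net --recv-keys 0xc31292066be772022438222c184fd3e1edf21a78
    # Sign it with your personal key after verifying the fingerprint out of band
    gpg --sign-key 0xc31292066be772022438222c184fd3e1edf21a78
    # Publish the signature, and your own key if it has never been uploaded
    gpg --keyserver hkps://sks-keyservers.net --send-keys 0xc31292066be772022438222c184fd3e1edf21a78
    gpg --keyserver hkps://sks-keyservers.net --send-keys YOUR_KEY_ID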
19:10:37 corvus has proposed a change to openstack governance to remove Zuul from openstack in order to make it its own top level project
19:10:58 If you have any questions or concerns I'm happy to field them here, and I think the TC and corvus are also more than willing to discuss what this means
19:11:11 these are true statements
19:12:08 The other zuul item worth mentioning was a security-related bug in zuul's executors that bwrap isolated us against
19:12:15 #link http://lists.zuul-ci.org/pipermail/zuul-announce/2018-March/000001.html Zuul untrusted information disclosure
19:12:25 we're running with the fix for that now
19:12:50 we're expecting a couple more like that
19:13:08 there's also a discussion underway about rescheduling the zuul weekly meeting
19:13:14 #link http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-March/000064.html Zuul meeting time
19:14:21 ++
19:15:00 were there any other zuul items to bring up here? I'll also give us a few minutes here before continuing on if anyone wants to talk about the zuul governance change or security
19:16:26 i think that's about it
19:16:58 Maybe to note that we're trying out a storyboard-based approach for dealing with security issues
19:17:19 oh yes!
19:17:26 oh ya, and you'll find direction on how to submit such things in the zuul docs soon (if that isn't already merged)
19:17:29 hopefully we'll have some documentation for that soon
19:18:14 So I consider myself a beta user of this new process :)
19:18:39 right, i've got a todo this week to draft some vulnerability management process and user-facing documentation on reporting suspected vulnerabilities
19:19:24 so far, at least, storyboard hasn't seemed to entirely fight our needs with having a private vulnerability report and discussing proposed patches
19:19:37 we'll see how it goes
19:20:11 alright, if anyone has questions or concerns you can find us in #openstack-infra and in #zuul, and the TC over in #openstack-tc. There are also the infra, zuul, and openstack dev mailing lists. Definitely reach out if you want to.
19:20:32 #topic Project Renames
19:21:12 before the PTG we pencilled in this friday as a project rename day. Since then I think we've lost monty, who was involved with one of the projects wanting a rename
19:21:35 I expect that the release team will be ok with renames on Friday if we want to move forward, since they've loosened the requirements on trailing project releases
19:21:54 fungi: corvus: do we think we can realistically rename the project for mordred without mordred being around?
19:22:12 i think monty is expecting to be back by friday, but i can't guarantee that :)
19:22:43 mmm
19:22:49 there was one other project as well. I do think it would be good to have at least one rename behind us with zuulv3
19:22:54 just so that we can sort out a working process
19:23:12 i can help out, but i don't have time to prepare/drive it this week
19:23:28 the other item that made this somewhat urgent (fixing nova-specs repo replication) has been addressed already, so I think we don't have to commit to it this week if impractical without monty
19:23:32 yeah, i'm a smidge lost on what the order of operations will be with code changes to perform the rename (which in theory is the only real change to the process?)
19:23:50 fungi: ya, I think it's all on the zuul side that changes
19:23:55 to make sure we don't wedge zuul
19:24:07 did we say we'll need to force in follow-up changes to repos with broken zuul configuration prior to restarting zuul?
19:24:52 there was talk of making zuul ignore broken configs on startup. corvus: did that happen? If so, I don't think we need to force things in; instead the project would just have to get updates in before it would function again
19:25:12 Not yet
19:25:16 we shouldn't need changes to the repos themselves at this point, as long as they don't have project names in their config
19:25:27 that was the biggest concern
19:26:08 so it may just be a matter of landing a main.yaml change after gerrit is back up.
19:26:39 oh... there's probably something in project-config too... hrm
19:27:13 so if there's an entry in project-config for it, we'll need 3 changes: remove from the live config, rename in the tenant config, add back to the live config with the new name
19:27:59 corvus: the last two have to be force-merges currently once gerrit is back?
19:28:13 clarkb: no, they should just be regular project-config merges
19:28:33 so zuul will start back up again even in that state?
19:28:59 okay, but basically we can't rename in one project-config change (even if the repo-specific config doesn't require its own editing)? we need to remove with a change and then add with another change?
19:29:31 oh right, it's the removal that allows it to start up cleanly
19:29:47 clarkb: oh, if we're stopping zuul (i guess we are because gerrit is going down) we want 4 changes: remove from live config, remove from tenant. [rename]. add to tenant, add to live config.
19:29:56 all of those can be regular changes though.
19:30:00 no force-merges
19:30:04 I had topics for the meeting but with the DST changes, the middle of the meeting conflicts with kids ending school, so I'll have to drop. Just looking for reviews and a general question I'm not required to be here for. Be back in a while o/
19:30:11 right, it's the removal that allows it to be regular changes
19:30:22 ya
19:30:28 so changes 1 and 2 merge, then we stop zuul, do the offline manual rename bits, start zuul, merge the 3rd and 4th changes?
19:30:33 yep
19:30:34 fungi: ya
19:31:05 at what point do we merge in-repo config changes, if required?
19:31:17 fungi: I think that would be step 5
19:31:34 there shouldn't be any... but if there are, we need to think about a custom plan
19:31:36 there'll be at least one for .gitreview anyway, so presumably zuul config tweaks can go in the same change
19:31:48 for example, if the repo has jobs used by other repos, that's a problem
19:31:57 oh, right
19:32:02 hrm
19:32:19 that's the sort of thing where we either need 'load with broken config' or we just need to force-merge all the changes with zuul off
19:32:39 so maybe in this case we are best off actually writing down the steps above in our normal renaming documentation, with the concerns, until zuul can gracefully fall back
19:32:45 (or, if it's just one job on one other repo, maybe we temporarily remove it then add it back)
19:32:51 then check each of the repos to be renamed against those concerns, then do it
19:32:56 this is where the tc resolution we've been talking about comes in, to permit us to bypass code review for changes to various projects when required for expedient modification to ci/automation configuration
19:33:39 i have that half-written. but i don't think we have to block on it in this case, i'm sure we can get permission from the involved projects.
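A rough sketch of what the tenant-config side of the four regular project-config changes described above could look like; the repository name, file paths, and sed pattern here are illustrative assumptions rather than the team's actual plan:

    # Changes 1 and 2: drop the repo from the live Zuul config (e.g. zuul.d/projects.yaml)
    # and from the tenant config, as two normal reviews that merge before the outage
    sed -i '/openstack-infra\/old-project-name/d' zuul/main.yaml
    git commit -am "Remove old-project-name ahead of rename"
    git review
    # ... Zuul and Gerrit are stopped and the repository is renamed out of band ...
    # Changes 3 and 4: add the repo back under its new name to zuul/main.yaml and then
    # to the live config, again as regular code-reviewed changes (no force-merges)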
19:33:49 totally agree
19:33:49 (i hit writer's block on that resolution, sorry)
19:33:59 ya, I don't think the tc resolution would block us
19:34:12 it's more that I don't want us to discover on friday afternoon that we all of a sudden have to merge a bunch of stuff to get zuul started again
19:34:54 also I'm not entirely sure I know exactly what mordred wants to rename
19:35:07 so it might be difficult to evaluate this particular case with regard to his project(s)
19:35:41 "Something related to sdks/shade (via mordred)"
19:35:45 i agree that's not much to go on
19:35:57 i mean, we could just take a guess and see if he likes what we came up with
19:35:57 ya, I wrote that down based on a random irc message from mordred at some point
19:36:00 corvus: ha
19:36:17 we renamed it "shadybee". we thought you'd like that.
19:36:45 I'm inclined to say let's focus on getting the docs updated with concrete steps to do the rename and gotchas to watch out for, like jobs in one project being used by another project
19:37:12 then resync with monty, get a concrete todo (probably in the form of a gerrit change with the new project name and all the details), then do the rename based on the docs update
19:37:19 yah. after this, i'm leaning toward saying maybe we should just say the plan is to force-merge the project-config changes, until we support loading broken configs.
19:37:46 fungi: is the process documentation still something you were interested in doing?
19:37:56 that's probably the simplest, and (bizarrely) least risky. the main thing is if we typo something in project-config, we might need to force-merge a follow-up fix.
19:38:38 I'm inclined to volunteer for making zuul handle broken configs, but I've read the config loader code and wouldn't want to promise anything :)
19:38:39 clarkb: it is, but i feel like it'll be an untested shell until we actually try to do it at least once (and probably not fully fleshed out until we've done it a few times and hit various corner cases)
19:39:08 fungi: ya, but at least we'll have something to start with and then iterate on
19:39:14 clarkb: i'd recommend at least waiting until i finish making it "better" :)
19:39:27 sure, i'm up for writing that and having something in gerrit circa thursday-ish
19:39:30 ok, let's start there then and resync on tuesday, hopefully with a mordred around
19:39:38 I'll update the release team
19:39:46 (and now moving on because we're running out of time)
19:39:51 #topic General Topics
19:40:04 ianw would like to talk about AFS, reprepro, and reliability
19:41:01 yes
19:41:04 my understanding is that having reprepro write out new repo mirrors is not working so well
19:41:15 no -- http://paste.openstack.org/show/700292/
19:41:34 it is not very fault-tolerant at all
19:42:08 compared to, say, an rsync. so there was some chatter with auristor in openstack-infra ... we debugged some of the client issues
19:42:29 *some* of it might be fixed in later versions; particularly stuff around interrupted syscalls
19:42:58 ianw: do we think the fileservers or the clients or both need to be updated?
19:43:00 so, my proposal is to custom-build some later-version client debs, manually install them on mirror-update, and see how we go for a while
19:43:09 something to the effect that the version of openafs shipped in ubuntu is incompatible with the kernel shipped in the same ubuntu version
19:43:46 ianw: can you reuse the debs that were built for the zuul executors? that would probably make this simpler?
19:44:01 what do the initial errors look like?
19:44:17 (the pasted errors seem to generally be follow-up errors about a lockfile sticking around)
19:44:18 possibly, i was also thinking a more radical jump to the 1.8 branch ... if nothing else, at least we can provide feedback
19:44:54 corvus: it varies a lot, but the underlying issue is usually that afs issues cause broken writes
19:45:06 when I've looked at reprepro / afs failures, it was because we lost access to AFS and reprepro died, leaving a lockfile.
19:45:23 then once the reprepro fails, or the db gets corrupt, we have to intervene
19:45:33 if 1.8.0 works out, we can presumably upgrade to ubuntu bionic and get the same version
19:45:34 yep, what pabelanger said :)
19:46:00 fungi: yes, that was my thinking, to pre-stage that essentially
19:46:16 but it does seem to be happening more. I wonder if it is because our reprepro database is starting to get larger, now with 3 distros (trusty, xenial, bionic)
19:46:24 they're on 1.8.0~pre5-1 at the moment
19:46:25 if it explodes worse, then we can go back.
19:46:34 at best, that would *reduce* the rate of failure. but any network filesystem is going to hit an issue at some point. are we sure we can't just delete the lockfile and retry?
19:47:03 corvus: in some cases, yes. but likely as not some of the .db files are corrupt, and you know how that goes :)
19:47:07 pabelanger: surely they each have their own databases?
19:47:36 today i will go through that paste list and try to get them all restarted
19:48:07 maybe we could automatically recover by rsyncing the read-only volume contents back to the read-write volume and restarting?
19:48:07 I think auristor also mentioned that some behavior change in the linux kernel means that client writes will be flakier
19:48:11 (something to do with interrupts)
19:48:34 yeah, this is why he was suggesting a newer openafs client
19:48:37 corvus: I don't think so, we clump them into a single reprepro update run. We could split that out into per-distro updates I think
19:48:39 [to be clear, upgrading to be less flaky sounds great, i'm just wondering if we can do that and make it more robust at the same time]
19:48:47 corvus: ++
19:49:21 my thought is that it's unlikely to be worse than the current status quo, and possibly likely to be better
19:49:48 I'm willing to try the updated afs client on mirror-update and continue to try to make it more reliable from there with smarter fault recovery
19:49:56 resyncing from the ro volume seems like a good place to start for that
19:50:04 Yah, I've tested replacing the (corrupt) read/write database with the read-only database and re-running reprepro, and it does recover. It just takes some time to index everything again
19:50:55 good idea, i could definitely add a rollback step to the script
19:51:29 ianw: anything else to add before we talk arm64?
19:51:38 i wish there were an afs command to do that instantly (since i know under the hood it's cow), but i don't think there is; i think rsync is the best option.
19:51:57 so we want to try the updated client?
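A minimal sketch of the manual recovery discussed above (clearing the stale lock and copying the last-good read-only database back over the read-write copy before re-running reprepro); the AFS paths, reprepro config directory, and volume name are illustrative assumptions, not the actual mirror-update layout:

    # Remove the lock left behind when reprepro died mid-write
    rm -f /afs/.openstack.org/mirror/ubuntu/db/lockfile
    # Restore the read-only (last released) database over the corrupt read-write copy
    rsync -a --delete /afs/openstack.org/mirror/ubuntu/db/ /afs/.openstack.org/mirror/ubuntu/db/
    # Re-run the mirror update and publish the volume again
    reprepro --confdir /etc/reprepro/ubuntu update
    vos release mirror.ubuntu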
19:52:04 wfm
19:52:10 ianw: sounds reasonable to me, especially given what auristor has said
19:52:11 we also know we want newer openafs anyway for various security reasons, so i'm all for whatever gets us a sane story there (whether that's upgrading servers sooner, or coming up with a convenient means of maintaining more recent openafs on existing distro versions)
19:52:24 ok, i'll work on that
19:53:01 if we have issues with the server side, there might be options too
19:53:08 but let's discuss that another time
19:53:19 re arm64, we ran a job successfully
19:53:28 does that mean the infra pieces are largely in place?
19:53:48 yep, i think so (modulo the now-broken ubuntu-ports mirroring re above :)
19:54:14 there was also a mailing list thread about getting a kolla job running on arm64
19:54:24 i dropped some comments in https://review.openstack.org/549319 after looking at the run; it's not super fast
19:54:27 which should do a decent job of exercising it a bit more than pep8
19:54:44 so we'll have to take that into account
19:54:45 ok, I have asked xinliang in my team to add a kolla build (he is working with jeffrey4l on that)
19:55:07 yep, i didn't respond to the email yet, but of course ping me if there are issues :)
19:55:15 ianw: will do, thanks a lot
19:55:28 i will merge the dib bits now, i think they're sufficiently tested
19:55:34 +1
19:55:50 i think that's it for now
19:55:56 yay progress
19:56:07 thank you to everyone who has worked to move that stuff along
19:56:26 #link https://review.openstack.org/#/q/topic:ara-sqlite-middleware ARA sqlite middleware will allow us to go back to having ARA results on all jobs
19:56:55 dmsimard: ^ would like reviews on this topic. One of the things it will do is cut the number of files copied per job a lot, which should speed up job runtimes and allow us to have ARA results on all jobs again
19:57:35 In addition to that, dmsimard was thinking it would be a good idea to have the infra meeting time and location in our irc channel topic. I think this is a reasonable thing to do, though our topic is already so long
19:57:38 Also, FWIW, 1.0 work has resumed and it's very interesting. Can't wait to share more about it.
19:57:43 * dmsimard is back from school
19:57:56 https://review.openstack.org/549644/ also gets us started on arm64 wheels, which I've +3'd too for testing
19:57:59 I can't see the whole infra topic in my irc client, so I'm probably a bad one to ask if that would be useful
19:58:25 dmsimard: I think if no one opposes it we can go ahead and make that change
19:58:31 people see it on joining
19:58:35 well, they "see" it
19:58:37 my client shows this much of the docs url... "http://d"
19:58:41 corvus: that's a good point, it does message it to you on join
19:58:47 yeah, what corvus said, it's the first thing that's printed when joining a channel
19:58:55 perhaps "they are shown it" is the right phrasing :)
19:59:01 wfm
19:59:29 we are basically at time now. So rather than an official open discussion time, feel free to join us in #openstack-infra for any remaining items/topics
19:59:43 thank you everyone
19:59:49 #endmeeting