22:03:04 #startmeeting zuul
22:03:05 Meeting started Mon Oct 9 22:03:04 2017 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:03:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:03:08 The meeting name has been set to 'zuul'
22:03:37 this is, i hope, going to be the last openstack-infra heavy meeting for a while
22:04:04 i'd like to use most of the time to assess where we are on blockers for the infra rollback, and whether we can move forward
22:04:25 then, hopefully next week we can shift this meeting back to being about zuul in the abstract
22:04:41 #link zuulv3 infra rollout issues: https://etherpad.openstack.org/p/zuulv3-issues
22:05:19 modulo that deadlock you found today, seems to be performing waaaaay better
22:05:25 ya
22:05:27 i *had* hoped to have this all sorted out by the meeting, but other things came up
22:06:09 but skimming the outstanding debug list -- i think the only serious issues still outstanding are the nodepool issue Shrews has a fix for, and the git deadlock issue i have a fix for
22:06:39 do we (openstack infra team) feel okay switching back to production running on a gitpython fork?
22:06:51 jeblair: does the git deadlock issue explain the backup today?
22:06:53 yeah, that's something that's worth discussing
22:07:19 Shrews: which backup? :)
22:07:25 i'm still not sure why we end up with thousands of requests at times
22:07:51 just a capacity thing?
22:07:53 Shrews: oh, i expect we have thousands of requests because we have thousands of jobs waiting on nodes because we only have a portion of our capacity supplying v3
22:08:04 Shrews: i think fungi said there were 1100 *changes* total in queues
22:08:24 that was many hours ago too
22:08:26 Shrews: which could easily mean 10,000 requests
22:08:33 jeblair: ok, that sort of lines up with 6000+ requests i saw this morning
22:09:11 Shrews: as long as there was movement at all, the git stuff probably wasn't related. the thing i saw today only blocked one build
22:09:20 however, the git bug can, fairly easily, stop the entire system
22:10:04 a single instance of it can stop the gate pipeline altogether, and a few more instances can leave us with no operating mergers
22:11:04 as for job-specific problems, it's seemed to me like we're at the point where we're addressing them almost as soon as they're identified (for the ones i've added to the pad, i've usually already had a fix in mind if not in fact pushed into gerrit yet)
22:11:29 so if we don't run with a fix for the git bug in place (in whatever form that fix takes), or we switch back to v3 before the fix is in place, then we need to consider ourselves on-call to find and kill any stuck git processes
22:11:30 granted, when we roll forward again, the reporting rate for broken jobs will pick up substantially
22:12:36 jeblair: for the record, i'm cool with running on a gitpython fork (apparently master plus your pr at this point) until they get a fix in master for that, and then switch to the latest release as soon as they tag one
22:13:02 i just figured we should acknowledge that's what's going on if we decide to
22:13:15 i think i'm inclined to suggest that i polish https://review.openstack.org/509517 so it passes tests and then we run with my gitpython fork and https://review.openstack.org/509517 locally applied
22:13:36 fungi: ya. i don't like it, but i think it's the least worst option, and hopefully very temporary.
22:13:43 i, too, am fine with that. and since this is starting to seem like a meeting of 3, i think all votes are in
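(For reference: a minimal sketch of what being "on-call to find and kill any stuck git processes" (22:11:29) could look like in practice. This assumes psutil is installed, arbitrarily treats git processes older than an hour as suspect, and is illustrative only -- it is not the team's actual tooling.)

    #!/usr/bin/env python3
    # Sketch: report long-running git processes so an operator can decide
    # whether they are hung merger/executor subprocesses. The one-hour
    # threshold and the commented-out kill are placeholders.
    import time
    import psutil

    STUCK_AFTER = 3600  # seconds; an assumed threshold, not a project standard

    now = time.time()
    for proc in psutil.process_iter(['pid', 'name', 'cmdline', 'create_time']):
        try:
            if proc.info['name'] != 'git':
                continue
            age = int(now - proc.info['create_time'])
            if age > STUCK_AFTER:
                cmd = ' '.join(proc.info['cmdline'] or [])
                print('possibly stuck: pid=%d age=%ds cmd=%s'
                      % (proc.info['pid'], age, cmd))
                # proc.kill()  # only after confirming the process is really hung
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue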
22:13:44 and probably goes without saying we shouldn't officially release zuul 3.0.0 with a dep on forked gitpython
22:13:53 and believe me -- i considered monkeypatching. :)
22:14:10 so best to hold the release until there's a fixed gitpython we can list as our minimum
22:14:15 fungi: ya. in a similar vein, we need a github3.py release too.
22:14:27 i'm happy to hold the line on both of those.
22:14:28 good point. i had forgotten about that one
22:15:11 Shrews, fungi: cool. it's unanimous then. :) i'll have that running before i go to bed tonight. :)
22:15:12 Shrews: mrhillsman is here too!
22:15:31 so meeting of four ;)
22:15:34 oh i really hope mrhillsman votes for our crazy plan :)
22:15:49 hehe
22:15:59 not sure my vote has any merit
22:16:23 o/ (but a bit behind, so just lurking :)
22:16:31 we _consider_ all opinions in here ;)
22:16:53 i'm really looking forward to v3 since it is the underpinning of our openlab efforts
22:16:56 as for the nodepool issue -- as long as that's landed before we restart launchers, we should be fine (it's a bug in launcher restart), so i don't think we need to fret too much over that; i imagine we can get it landed before it creeps up again.
22:17:34 mrhillsman: ++ (and sorry i haven't been able to jump in as much; hopefully we'll be done fighting fires soon!)
22:17:48 so far my small little setup is "working" just have not got a job running yet so working on that
22:17:51 jeblair: i don't think it's limited to restart
22:18:07 Shrews: true; i guess that's when it's most likely to appear though
22:18:17 yeah
22:18:51 fungi: so yeah, it looks like there are a few things still in the jobs section
22:19:01 are any of those blockers?
22:19:57 maybe the unbound setup issue
22:20:18 i'm guessing dmsimard is on holiday today too
22:20:45 i guess without that, our rax random error rate will go up?
22:20:51 yeah
22:21:03 yeah, i mean my guess is that many (most) of the lingering items in the jobs list are probably already fixed and we just need to circle back around to confirm
22:21:16 what's propose-updates?
22:21:27 i can look into the unbound thing, since i did a bit of ansible around the mirror setup
22:22:36 ianw: cool, thanks -- dmsimard has his name next to that on the etherpad, so be sure to let him know what you find/do with that for when he gets back
22:22:37 how about "i will"; i'll put any updates into the etherpad
22:22:46 ++ :)
22:23:17 jeblair: good question about the "propose-updates" job. that's so surprisingly vague i can't even figure out from the log what it's supposed to be doing yet
22:23:57 unbound is the only thing there that doesn't have a fix next to it that i would consider a potentially big enough problem to be a blocker. infra-index updating is something we can fix at our leisure, for instance.
22:24:23 yeah, i mentioned the unbound configuration because it potentially impacts all jobs in the system
22:24:47 whereas the rest of these look like they could be dealt with in isolation
22:25:56 also i think the playbook that job mentions no longer exists?
22:26:12 that would be a problem :)
22:26:16 overall, i'd say we're *almost* ready to switch back, and have a pretty legit chance of being *actually* ready by tomorrow morning. i think we're close enough we can consider flipping the switch as early as tomorrow morning.
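(For context on the monkeypatching that was considered and rejected at 22:13:53: overriding a library method at import time generally looks like the sketch below. This is hypothetical -- the plan above was instead to run a forked gitpython plus https://review.openstack.org/509517 applied locally -- and the wrapper body is a placeholder.)

    # Hypothetical monkeypatch sketch (not what was done): replace GitPython's
    # low-level command runner with a wrapped version at import time, e.g. to
    # add logging or timeout handling around every git subprocess.
    import git.cmd

    _orig_execute = git.cmd.Git.execute

    def _wrapped_execute(self, command, *args, **kwargs):
        # extra handling (logging, timeouts, cleanup) would go here
        return _orig_execute(self, command, *args, **kwargs)

    git.cmd.Git.execute = _wrapped_execute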
22:26:29 git.openstack.org/openstack-infra/project-config/playbooks/proposal/propose-updates
22:26:36 i don't see it at all
22:26:52 should we do that? or should we give more lead time for an announcement, etc?
22:26:52 no, i'm just blind. it's there
22:27:15 jeblair: fungi: should we consider adding a subset of projects first, rather than a total switch? or is it easier to just do them all at once?
22:27:35 Shrews: it's really hard to do anything other than all at once
22:27:36 Shrews: i think it's been tough enough to run with the minimal split we've got
22:27:58 *nod*
22:29:03 i'm on board with announcing a date/time, but consider <24h to mean a lot of people could be surprised (many will be surprised anyway, but at least 24 hours gives people around the globe a chance to read it)
22:29:44 also it's meeting day for the infra team
22:30:09 we could shoot for something like 16:00z wednesday? (is that too early, or not early enough for you?)
22:30:12 yeah, i normally like longer lead-times; but considering the sort of extended-maintenance + partial-rollback state we're in, i figure anything's on the table. :)
22:30:30 o/
22:30:34 sorry I am late
22:30:42 too far into the week, and i agree we risk not having enough opportunity to spot issues before the weekend
22:31:14 fungi: i think we should do the switch as soon as you or pabelanger or mordred are online
22:31:36 the earlier the better. cool
22:31:37 i'm online at 14:00
22:32:14 I'll be online first thing too
22:32:16 i think the ideal is for us to make the switch before (or as close to 'before' as possible) the US-time surge
22:32:23 pabelanger: mordred: are you around wednesday morning?
22:32:33 fungi: yes
22:32:56 i figure actually executing the switch may take 30 to 60 or.. idunno, maybe even more minutes.
22:33:06 i can be up pretty early, though we're talking 11:00z if we want to catch the start of the morning ramp-up
22:33:18 so if you get started first thing, i probably won't actually miss much.
22:33:40 Sure
22:34:16 #link https://etherpad.openstack.org/p/zuulv3-cutover
22:34:25 that still the steps we want?
22:34:56 looking
22:35:00 fungi: we probably need to refresh some changes there
22:35:14 sure
22:35:33 fungi: let's ask mordred to do that tomorrow, and make sure we have changes staged and steps written down for wed morning
22:35:38 prepping those should probably be a top priority for at least a couple of us tomorrow
22:35:51 do we want to keep infra-check / infra-gate for a few more days or revert that change too?
22:36:00 pabelanger: i say drop 'em
22:36:06 ack
22:36:15 agreed
22:36:33 nodepool back to nodepool-launchers should be straightforward too
22:36:44 exciting
22:37:29 #agreed zuulv3 cutover wed 11 oct at 11:00 utc
22:37:33 ^ that look right?
22:37:40 wfm
22:37:48 can i suggest keeping infra-check and infra-gate
22:37:53 at least for the first week after the cutover
22:38:03 it's really useful having a high priority queue to get project-config changes in quickly
22:38:21 #action mordred stage and document rollback steps
22:38:35 project-config changes*
22:38:50 mnaser: that's true, but with the stabilization we've done over the past week, will it be as necessary?
22:38:59 mnaser makes a good point... the number of fixes to legacy jobs is likely to shoot back up
22:39:11 jeblair in the high volume of jobs we deal with, sometimes i see xenial jobs taking almost an hour to get a node
22:39:14 * jeblair asks mnaser to predict future :)
22:39:33 and "stabilization" (under v2 at least) is often an hour or two to get check results
22:39:45 when we're pressed for capacity
22:40:10 infra-gate is less necessary, but infra-check may be useful
22:40:29 okay, i could be convinced to keep it for just project-config
22:41:10 yeah, infra-gate probably wouldn't actually get us anything. but check would make a difference.
22:41:42 okay, so maybe keep infra-check for a few more days, remove infra-gate
22:41:48 (this is also actually a good use for multi-tenancy, but i don't want to muddy things too much right now)
22:42:04 :>
22:42:08 looking forward to test that too :)
22:42:28 yep, if we keep infra-check around temporarily, i think we make a judgment call at some point where the volume of project-config changes has dropped off significantly and fold it back into normal check then
22:43:04 okay, we likely can merge that right away too
22:43:37 but i guess we move zuul, nodepool, openstack-zuul-jobs and zuul-jobs repos back to normal check
22:43:53 so that infra-check is just for project-config changes?
22:44:07 regarding zuul-jobs and openstack-zuul-jobs -- we *could* also put them in infra-check, however, depends-on works for those, so they don't *need* to block changes. though it may be convenient so we don't have to use depends-on as much.
22:44:16 my inclination would be only to use infra-check for project-config
22:44:20 and rely on depends-on for the others
22:44:33 i figure with the others, we still have the option to punt a change straight to the gate pipeline when urgent
22:44:41 (i don't like being a special case)
22:44:45 fungi: that's true too
22:44:50 as long as there's an infra-root around to make that exception
22:45:31 #agreed keep infra-check for project-config only (other repos move back to regular check/gate). remove infra-gate completely.
22:45:36 look good ^?
22:45:44 we're already a special case insofar as project-config is the point of coordination for all this, so makes sense to just acknowledge that for now
22:45:57 yeah, i'm good with that
22:46:01 fungi: yeah, it's a config-project, so it really is special :)
22:46:29 project-config wfm
22:46:44 (sorry internets also slow and spotty right now)
22:46:44 project-config knows it's special 'cause its momma told it so
22:47:20 since we're expecting reports of broken jobs to increase again, i suggest we copy all the cruft from the zuulv3-issues etherpad to a backup, and continue to use zuulv3-issues to track new job and/or zuul bugs
22:47:51 that also makes sense. i can do that now... want me to do a proper copy with the etherpad api, or just copy/paste content?
22:48:04 or wait until we're ready to do the cutover?
22:48:25 fungi: i think copy/paste content -- we still have the history on the main page if we have any questions.
22:48:33 sounds good
22:48:35 wfm
22:48:49 just a small suggestion: it'd be useful to split the issues into zuul functionality issues and job issues .. i would really want to help out fixing jobs but i have no idea how to fix zuul things
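(For reference on the depends-on mechanism discussed around 22:44:07-22:44:33: zuul v3 picks up cross-repository dependencies from a commit-message footer roughly like the example below. The change title and review URL here are made up for illustration.)

    Fix the legacy translation job

    Depends-On: https://review.openstack.org/510000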
22:48:59 #agreed clear cruft off of zuulv3-issues etherpad and continue using it to track new issues after roll-forward
22:49:13 i guess we should give ourselves some time tomorrow to quiesce the etherpad
22:49:13 mnaser: we sort of have that, but we can make it clearer
22:49:15 yeah, cleanup
22:49:35 the "debugging" section ended up being the "zuul issues" for the most part
22:50:04 but yeah, let's make sure we have "fixed" "job issues" "zuul issues" "un-triaged" sections
22:50:20 yeah, in many (particularly early) cases, it was hard to tell whether a job was broken due to misconfiguration or due to a zuul bug
22:50:36 but i expect it will be a lot more straightforward to figure out most of them now
22:51:07 and if not, use the triage section for that. we should just move things out of there quickly after triage.
22:52:25 to make the most of this >24h lead time, we should send an announcement now, yeah? shall i send that out?
22:52:47 +1
22:53:36 I didn't see in backscroll, and may have missed it. But have we restarted zuul-executors recently? Want to make sure we have ABORTED patch in place before we go live again
22:53:59 pabelanger: i think that happened over the wknd
22:54:10 okay great
22:54:23 jeblair: i'm in favor of an announcement as early as we can provide one. happy to send it if you have other things you'd prefer to work on
22:55:05 kast restarted october 7 for ze01
22:55:09 s/kast/last/
22:55:29 fungi: i'm going to take you up on that offer and let you send it out then. :)
22:55:41 i'll get to work on that now
22:55:48 i'll do the git thing and review the nodepool change
22:55:54 thanks!
22:55:58 thank you!
22:56:00 fungi: great, oct 6 was commit
22:56:06 anything else, or should we wrap this up?
22:56:15 pabelanger: note i didn't check any other executors, just ze01
22:56:20 i have nothing else
22:56:25 nothing here
22:58:18 thanks!
22:58:20 #endmeeting