22:03:04 <jeblair> #startmeeting zuul
22:03:05 <openstack> Meeting started Mon Oct  9 22:03:04 2017 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:03:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
22:03:08 <openstack> The meeting name has been set to 'zuul'
22:03:37 <jeblair> this is, i hope, going to be the last openstack-infra heavy meeting for a while
22:04:04 <jeblair> i'd like to use most of the time to assess where we are on blockers for the infra rollback, and whether we can move forward
22:04:25 <jeblair> then, hopefully next week we can shift this meeting back to being about zuul in the abstract
22:04:41 <jeblair> #link zuulv3 infra rollout issues: https://etherpad.openstack.org/p/zuulv3-issues
22:05:19 <fungi> modulo that deadlock you found today, seems to be performing waaaaay better
22:05:25 <jeblair> ya
22:05:27 <jeblair> i *had* hoped to have this all sorted out by the meeting, but other things came up
22:06:09 <jeblair> but skimming the outstanding debug list -- i think the only serious issues still outstanding are the nodepool issue Shrews has a fix for, and the git deadlock issue i have a fix for
22:06:39 <fungi> do we (openstack infra team) feel okay switching back to production running on a gitpython fork?
22:06:51 <Shrews> jeblair: does the git deadlock issue explain the backup today?
22:06:53 <jeblair> yeah, that's something that's worth discussing
22:07:19 <jeblair> Shrews: which backup? :)
22:07:25 <Shrews> i'm still not sure why we end up with thousands of requests at times
22:07:51 <Shrews> just a capacity thing?
22:07:53 <jeblair> Shrews: oh, i expect we have thousands of requests because we have thousands of jobs waiting on nodes because we only have a portion of our capacity supplying v3
22:08:04 <jeblair> Shrews: i think fungi said there were 1100 *changes* total in queues
22:08:24 <fungi> that was many hours ago too
22:08:26 <jeblair> Shrews: which could easily mean 10,000 requests
22:08:33 <Shrews> jeblair: ok, that sort of lines up with 6000+ requests i saw this morning
22:09:11 <jeblair> Shrews: as long as there was movement at all, the git stuff probably wasn't related.  the thing i saw today only blocked one build
22:09:20 <jeblair> however, the git bug can, fairly easily, stop the entire system
22:10:04 <jeblair> a single instance of it can stop the gate pipeline altogether, and a few more instances can leave us with no operating mergers
22:11:04 <fungi> as for job-specific problems, it's seemed to me like we're at the point where we're addressing them almost as soon as they're identified (for the ones i've added to the pad, i've usually already had a fix in mind if not in fact pushed into gerrit yet)
22:11:29 <jeblair> so if we don't run with a fix for the git bug in place (in whatever form that fix takes), or we switch back to v3 before the fix is in place, then we need to consider ourselves on-call to find and kill any stuck git processes
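The "find and kill any stuck git processes" on-call task can be sketched as a small parser over `ps -eo pid,etimes,comm` output; the 300-second age threshold is an illustrative assumption, not a procedure agreed in this meeting:

```python
def find_stuck_git_processes(ps_output, max_age_seconds=300):
    """Parse `ps -eo pid,etimes,comm` output and return PIDs of git
    processes running longer than max_age_seconds (a hypothetical
    threshold); an operator would then kill those PIDs by hand."""
    stuck = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, etimes, comm = parts
        if comm.startswith("git") and int(etimes) > max_age_seconds:
            stuck.append(int(pid))
    return stuck
```

In practice the list would be fed to `os.kill` or a plain `kill` after a human sanity check, since a long-running git process is not necessarily hung.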
22:11:30 <fungi> granted, when we roll forward again, the reporting rate for broken jobs will pick up substantially
22:12:36 <fungi> jeblair: for the record, i'm cool with running on a gitpython fork (apparently master plus your pr at this point) until they get a fix in master for that, and then switch to the latest release as soon as they tag one
22:13:02 <fungi> i just figured we should acknowledge that's what's going on if we decide to
22:13:15 <jeblair> i think i'm inclined to suggest that i polish https://review.openstack.org/509517 so it passes tests and then we run with my gitpython fork and https://review.openstack.org/509517 locally applied
22:13:36 <jeblair> fungi: ya.  i don't like it, but i think it's the least worst option, and hopefully very temporary.
22:13:43 <Shrews> i, too, am fine with that. and since this is starting to seem like a meeting of 3, i think all votes are in
22:13:44 <fungi> and probably goes without saying we shouldn't officially release zuul 3.0.0 with a dep on forked gitpython
22:13:53 <jeblair> and believe me -- i considered monkeypatching.  :)
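For context, the monkeypatching jeblair considered (and rejected in favor of the fork) would look roughly like this; `FakeRepo` and the wrapper are illustrative stand-ins, not actual GitPython internals:

```python
import functools

class FakeRepo:
    """Stand-in for the GitPython class whose method would be patched."""
    def fetch(self):
        return "fetched"

def with_call_tracking(func):
    """Wrap a method at import time. A real patch might add a timeout or
    subprocess cleanup around the hang-prone call; this sketch only
    counts invocations to show the mechanism."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# The monkeypatch: rebind the method on the class before any repo is used.
FakeRepo.fetch = with_call_tracking(FakeRepo.fetch)
```

The downside, and likely why a fork won out, is that such a patch silently depends on the library's internal method names and call paths staying stable.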
22:14:10 <fungi> so best to hold the release until there's a fixed gitpython we can list as our minimum
22:14:15 <jeblair> fungi: ya.  in a similar vein, we need a github3.py release too.
22:14:27 <jeblair> i'm happy to hold the line on both of those.
22:14:28 <fungi> good point. i had forgotten about that one
22:15:11 <jeblair> Shrews, fungi: cool.  it's unanimous then.  :)  i'll have that running before i go to bed tonight.  :)
22:15:12 <fungi> Shrews: mrhillsman is here too!
22:15:31 <fungi> so meeting of four ;)
22:15:34 <jeblair> oh i really hope mrhillsman votes for our crazy plan :)
22:15:49 <mrhillsman> hehe
22:15:59 <mrhillsman> not sure my vote has any merit
22:16:23 <ianw> o/ (but a bit behind, so just lurking :)
22:16:31 <fungi> we _consider_ all opinions in here ;)
22:16:53 <mrhillsman> i'm really looking forward to v3 since it is the underpinning of our openlab efforts
22:16:56 <jeblair> as for the nodepool issue -- as long as that's landed before we restart launchers, we should be fine (it's a bug in launcher restart), so i don't think we need to fret too much over that; i imagine we can get it landed before it creeps up again.
22:17:34 <jeblair> mrhillsman: ++  (and sorry i haven't been able to jump in as much; hopefully we'll be done fighting fires soon!)
22:17:48 <mrhillsman> so far my small little setup is "working" just have not got a job running yet so working on that
22:17:51 <Shrews> jeblair: i don't think it's limited to restart
22:18:07 <jeblair> Shrews: true; i guess that's when it's most likely to appear though
22:18:17 <Shrews> yeah
22:18:51 <jeblair> fungi: so yeah, it looks like there are a few things still in the jobs section
22:19:01 <jeblair> are any of those blockers?
22:19:57 <fungi> maybe the unbound setup issue
22:20:18 <jeblair> i'm guessing dmsimard is on holiday today too
22:20:45 <jeblair> i guess without that, our rax random error rate will go up?
22:20:51 <fungi> yeah
22:21:03 <fungi> yeah, i mean my guess is that many (most) of the lingering items in the jobs list are probably already fixed and we just need to circle back around to confirm
22:21:16 <jeblair> what's propose-updates?
22:21:27 <ianw> i can look into the unbound thing, since i did a bit of ansible around the mirror setup
22:22:36 <jeblair> ianw: cool, thanks -- dmsimard has his name next to that on the etherpad, so be sure to let him know what you find/do with that for when he gets back
22:22:37 <ianw> how about "i will"; i'll put any updates into the etherpad
22:22:46 <ianw> ++ :)
22:23:17 <fungi> jeblair: good question about the "propose-updates" job. that's so surprisingly vague i can't even figure out from the log what it's supposed to be doing yet
22:23:57 <jeblair> unbound is the only thing there that doesn't have a fix next to it that i would consider a potentially big enough problem to be a blocker.  infra-index updating is something we can fix at our leisure, for instance.
22:24:23 <fungi> yeah, i mentioned the unbound configuration because it potentially impacts all jobs in the system
22:24:47 <fungi> whereas the rest of these look like they could be dealt with in isolation
22:25:56 <fungi> also i think the playbook that job mentions no longer exists?
22:26:12 <jeblair> that would be a problem :)
22:26:16 <jeblair> overall, i'd say we're *almost* ready to switch back, and have a pretty legit chance of being *actually* ready by tomorrow morning.  i think we're close enough we can consider flipping the switch as early as tomorrow morning.
22:26:29 <fungi> git.openstack.org/openstack-infra/project-config/playbooks/proposal/propose-updates
22:26:36 <fungi> i don't see it at all
22:26:52 <jeblair> should we do that?  or should we give more lead time for an announcement, etc?
22:26:52 <fungi> no, i'm just blind. it's there
22:27:15 <Shrews> jeblair: fungi: should we consider adding a subset of projects first, rather than a total switch? or is it easier to just do them all at once?
22:27:35 <jeblair> Shrews: it's really hard to do anything other than all at once
22:27:36 <fungi> Shrews: i think it's been tough enough to run with the minimal split we've got
22:27:58 <Shrews> *nod*
22:29:03 <fungi> i'm on board with announcing a date/time, but consider <24h to mean a lot of people could be surprised (many will be surprised anyway, but at least 24 hours gives people around the globe a chance to read it)
22:29:44 <fungi> also it's meeting day for the infra team
22:30:09 <fungi> we could shoot for something like 16:00z wednesday? (is that too early, or not early enough for you?)
22:30:12 <jeblair> yeah, i normally like longer lead-times; but considering the sort of extended-maintenance + partial-rollback state we're in, i figure anything's on the table.  :)
22:30:30 <pabelanger> o/
22:30:34 <pabelanger> sorry I am late
22:30:42 <fungi> too far into the week, and i agree we risk not having enough opportunity to spot issues before the weekend
22:31:14 <jeblair> fungi: i think we should do the switch as soon as you or pabelanger or mordred are online
22:31:36 <fungi> the earlier the better. cool
22:31:37 <jeblair> i'm online at 14:00
22:32:14 <pabelanger> I'll be online first thing too
22:32:16 <jeblair> i think the ideal is for us to make the switch before (or as close to 'before' as possible) the US-time surge
22:32:23 <fungi> pabelanger: mordred: are you around wednesday morning?
22:32:33 <pabelanger> fungi: yes
22:32:56 <jeblair> i figure actually executing the switch may take 30 to 60 or.. idunno, maybe even more minutes.
22:33:06 <fungi> i can be up pretty early, though we're talking 11:00z if we want to catch the start of the morning ramp-up
22:33:18 <jeblair> so if you get started first thing, i probably won't actually miss much.
22:33:40 <pabelanger> Sure
22:34:16 <fungi> #link https://etherpad.openstack.org/p/zuulv3-cutover
22:34:25 <fungi> that still the steps we want?
22:34:56 <pabelanger> looking
22:35:00 <jeblair> fungi: we probably need to refresh some changes there
22:35:14 <fungi> sure
22:35:33 <jeblair> fungi: let's ask mordred to do that tomorrow, and make sure we have changes staged and steps written down for wed morning
22:35:38 <fungi> prepping those should probably be a top priority for at least a couple of us tomorrow
22:35:51 <pabelanger> do we want to keep infra-check / infra-gate for a few more days or revert that change too?
22:36:00 <jeblair> pabelanger: i say drop 'em
22:36:06 <pabelanger> ack
22:36:15 <fungi> agreed
22:36:33 <pabelanger> nodepool back to nodepool-launchers should be straightforward too
22:36:44 <pabelanger> exciting
22:37:29 <jeblair> #agreed zuulv3 cutover wed 11 oct at 11:00 utc
22:37:33 <jeblair> ^ that look right?
22:37:40 <fungi> wfm
22:37:48 <mnaser> can i suggest keeping infra-check and infra-gate
22:37:53 <mnaser> at least for the first week after the cutover
22:38:03 <mnaser> it's really useful having a high priority queue to get project-config in quickly
22:38:21 <jeblair> #action mordred stage and document cutover steps
22:38:35 <mnaser> project-config changes*
22:38:50 <jeblair> mnaser: that's true, but with the stabilization we've done over the past week, will it be as necessary?
22:38:59 <fungi> mnaser makes a good point... the number of fixes to legacy jobs are likely to shoot back up
22:39:11 <mnaser> jeblair with the high volume of jobs we deal with, sometimes i see xenial jobs taking almost an hour to get a node
22:39:14 * jeblair asks mnaser to predict future :)
22:39:33 <fungi> and "stabilization" (under v2 at least) is often an hour or two to get check results
22:39:45 <fungi> when we're pressed for capacity
22:40:10 <fungi> infra-gate is less necessary, but infra-check may be useful
22:40:29 <jeblair> okay, i could be convinced to keep it for just project-config
22:41:10 <jeblair> yeah, infra-gate probably wouldn't actually get us anything.  but check would make a difference.
22:41:42 <pabelanger> okay, so maybe keep infra-check for a few more days, remove infra-gate
22:41:48 <jeblair> (this is also actually a good use for multi-tenancy, but i don't want to muddy things too much right now)
22:42:04 <mnaser> :>
22:42:08 <pabelanger> looking forward to test that too :)
22:42:28 <fungi> yep, if we keep infra-check around temporarily, i think we make a judgment call at some point where the volume of project-config changes has dropped off significantly and fold it back into normal check then
22:43:04 <pabelanger> okay, we likely can merge that right away too
22:43:37 <fungi> but i guess we move zuul, nodepool, openstack-zuul-jobs and zuul-jobs repos back to normal check
22:43:53 <fungi> so that infra-check is just for project-config changes?
22:44:07 <jeblair> regarding zuul-jobs and openstack-zuul-jobs -- we *could* also put them in infra-check, however, depends-on works for those, so they don't *need* to block changes.  though it may be convenient so we don't have to use depends-on as much.
22:44:16 <jeblair> my inclination would be only to use infra-check for project-config
22:44:20 <jeblair> and rely on depends-on for the others
22:44:33 <fungi> i figure with the others, we still have the option to punt a change straight to the gate pipeline when urgent
22:44:41 <jeblair> (i don't like being a special case)
22:44:45 <jeblair> fungi: that's true too
22:44:50 <fungi> as long as there's an infra-root around to make that exception
22:45:31 <jeblair> #agreed keep infra-check for project-config only (other repos move back to regular check/gate).  remove infra-gate completely.
22:45:36 <jeblair> look good ^?
22:45:44 <fungi> we're already a special case insofar as project-config is the point of coordination for all this, so makes sense to just acknowledge that for now
22:45:57 <fungi> yeah, i'm good with that
22:46:01 <jeblair> fungi: yeah, it's a config-project, so it really is special :)
22:46:29 <pabelanger> project-config wfm
22:46:44 <pabelanger> (sorry internets also slow and spotty right now)
22:46:44 <fungi> project-config knows it's special 'cause its momma told it so
22:47:20 <jeblair> since we're expecting reports of broken jobs to increase again, i suggest we copy all the cruft from the zuulv3-issues etherpad to a backup, and continue to use zuulv3-issues to track new job and/or zuul bugs
22:47:51 <fungi> that also makes sense. i can do that now... want me to do a proper copy with the etherpad api, or just copy/paste content?
22:48:04 <fungi> or wait until we're ready to do the cutover?
22:48:25 <jeblair> fungi: i think copy/paste content -- we still have the history on the main page if we have any questions.
22:48:33 <fungi> sounds good
22:48:35 <pabelanger> wfm
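The "proper copy with the etherpad api" fungi mentioned would go through Etherpad's HTTP API (e.g. `getText`); a minimal sketch of building such a request URL, where the base URL is deployment-specific and a real call also needs the instance's `apikey` parameter:

```python
from urllib.parse import urlencode

def etherpad_api_url(base, method, **params):
    """Build an Etherpad HTTP API URL, e.g. method='getText' to read a
    pad's plain-text content. Does not perform the request itself."""
    return "{}/api/1/{}?{}".format(base.rstrip("/"), method, urlencode(params))
```

A copy script would fetch `getText` for the source pad and post the result to the destination pad via `setText`, which preserves the text but, like copy/paste, not the revision history.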
22:48:49 <mnaser> just a small suggestion: it'd be useful to split the issues into zuul functionality issues and job issues .. i would really want to help out fixing jobs but i have no idea how to fix zuul things
22:48:59 <jeblair> #agreed clear cruft off of zuulv3-issues etherpad and continue using it to track new issues after roll-forward
22:49:13 <fungi> i guess we should give ourselves some time tomorrow to quiesce the etherpad
22:49:13 <jeblair> mnaser: we sort of have that, but we can make it clearer
22:49:15 <fungi> yeah, cleanup
22:49:35 <jeblair> the "debugging" section ended up being the "zuul issues" for the most part
22:50:04 <jeblair> but yeah, let's make sure we have "fixed" "job issues" "zuul issues" "un-triaged" sections
22:50:20 <fungi> yeah, in many (particularly early) cases, it was hard to tell whether a job was broken due to misconfiguration or due to a zuul bug
22:50:36 <fungi> but i expect it will be a lot more straightforward to figure out most of them now
22:51:07 <jeblair> and if not, use the triage section for that.  we should just move things out of there quickly after triage.
22:52:25 <jeblair> to make the most of this >24h lead time, we should send an announcement now, yeah?  shall i send that out?
22:52:47 <pabelanger> +1
22:53:36 <pabelanger> I didn't see in backscroll, and may have missed it. But have we restarted zuul-executors recently? Want to make sure we have ABORTED patch in place before we go live again
22:53:59 <jeblair> pabelanger: i think that happened over the wknd
22:54:10 <pabelanger> okay great
22:54:23 <fungi> jeblair: i'm in favor of an announcement as early as we can provide one. happy to send it if you have other things you'd prefer to work on
22:55:05 <fungi> last restarted october 7 for ze01
22:55:29 <jeblair> fungi: i'm going to take you up on that offer and let you send it out then.  :)
22:55:41 <fungi> i'll get to work on that now
22:55:48 <jeblair> i'll do the git thing and review the nodepool change
22:55:54 <fungi> thanks!
22:55:58 <jeblair> thank you!
22:56:00 <pabelanger> fungi: great, oct 6 was commit
22:56:06 <jeblair> anything else, or should we wrap this up?
22:56:15 <fungi> pabelanger: note i didn't check any other executors, just ze01
22:56:20 <fungi> i have nothing else
22:56:25 <pabelanger> nothing here
22:58:18 <jeblair> thanks!
22:58:20 <jeblair> #endmeeting