19:01:08 <clarkb> #startmeeting infra
19:01:09 <openstack> Meeting started Tue Jan 30 19:01:08 2018 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 <openstack> The meeting name has been set to 'infra'
19:01:25 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:29 <mordred> o/
19:01:33 <clarkb> #topic Announcements
19:01:47 <AJaeger> o/
19:01:57 <Shrews> o/
19:02:00 <clarkb> #link http://lists.openstack.org/pipermail/openstack-dev/2018-January/126192.html Vancouver Summit CFP open. Submit your papers.
19:02:16 <clarkb> You have until February 8, which is ~10 days from now, to submit to the summit CFP
19:02:19 <pabelanger> o/
19:02:34 <clarkb> OpenDev is being run alongside the summit in vancouver and will have two days dedicated to it
19:02:57 <clarkb> #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:03:22 <clarkb> We also have a PTG coming up in just under a month. Looks like we've started brainstorming topics there \o/
19:03:54 <clarkb> PTL election season just started
19:04:28 <clarkb> (I just put my token back into the nomination hat for Infra)
19:05:01 <clarkb> Also my wife is traveling this week to visit family (all happened last minute) so I'm going to have weird hours and be more afk than usual while watching the kids
19:05:06 <clarkb> Anything else?
19:05:42 <clarkb> #topic Actions from last meeting
19:05:51 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-01-23-19.01.txt minutes from last meeting
19:06:03 <clarkb> corvus patch spammed projects with name removals
19:06:11 <mordred> yay spam patches!
19:06:13 <clarkb> I still haven't gotten around to classifying old nodepool and zuul changes
19:06:32 <clarkb> but it is still on my list /me rerecords it
19:06:54 <clarkb> #action clarkb / corvus / everyone / to take pass through old zuul and nodepool master branch changes to at least categorize changes
19:07:01 <corvus> i have started spamming
19:07:10 <corvus> but i stopped; i don't think uploading was a problem
19:07:17 <clarkb> #action clarkb abandon specs per: http://lists.openstack.org/pipermail/openstack-infra/2018-January/005779.html
19:07:22 <corvus> but apparently mass approvals on monday morning were
19:07:36 <corvus> i think perhaps i should restart them now?
19:07:55 <clarkb> maybe check with the release team to make sure they don't have any urgent changes they want in first?
19:08:11 <clarkb> but ya I expect that normal weekly interrupts will spread those changes out appropriately
19:08:28 <fungi> i think the release team is still waiting for go-ahead from us that things are stable enough to go back to approving releases
19:08:42 <pabelanger> they did do a few releases last night
19:08:50 <AJaeger> corvus: yeah, restart them - and please send an email to openstack-dev explaining what you do and that you don't need help ;)
19:08:56 <clarkb> the big outstanding item for stability is the dfw mirror right?
19:08:57 <pabelanger> fungi: and don't think there were any issues
19:08:59 <fungi> i was hesitant to say we're back to normal stability until we finish working out what's going on with the rax-dfw mirror
19:09:05 <pabelanger> yah
19:09:07 <clarkb> any jobs running there run the risk of failing, though we did reduce max servers there
19:09:38 <fungi> which so far seems to have had minimal impact on bandwidth utilization (theorized to be due to the rdo ci system also using that mirror)
19:10:59 * clarkb looks at agenda. Lets move on and get back to this a little later
19:11:11 <clarkb> #topic Specs approval
19:11:37 <clarkb> Between travel and dadops I'm a bit behind on specs. I don't think there are any that need review this week. Did I miss some?
19:13:09 <clarkb> #topic Priority Efforts
19:13:17 <clarkb> #topic Storyboard
19:13:38 <clarkb> fungi: Zara search engine indexing is still on the list. Anything new we need to go over on that topic?
19:14:33 <fungi> oh, right, i was going to propose dropping the robots.txt restriction and see how that goes
19:14:38 <fungi> i haven't done that yet
19:14:52 <clarkb> sounds like a plan at least :)
19:15:25 <clarkb> #topic Zuul v3
19:16:00 <corvus> are folks aware of the changes related to reporting on ansible?
19:16:09 <clarkb> I am not
19:16:27 <fungi> i am not as familiar as i'd like to be
19:16:29 <corvus> we have found (again) that github can be occasionally unreliable
19:16:42 * fungi puts on his "shocked" face
19:16:43 <Shrews> wha?????
19:16:45 <corvus> that caused us to occasionally report spurious merge failures
19:17:14 <corvus> to keep the ansible folks from getting mad at us, we pulled ansible out of the config temporarily
19:17:28 <corvus> we need to make fetching in the merger a bit more robust
19:17:51 <pabelanger> +1
19:17:52 <corvus> but also, we probably never really want to report merge-failure to ansible
19:18:03 <corvus> but reporting merge-failure is a pipeline config option
19:18:22 <corvus> so mordred has proposed https://review.openstack.org/539286 to create a new pipeline which does not report merge failures
19:18:42 <clarkb> corvus: in the case where pull request can't merge into master we would not report anything and rely on github's ability to detect that case instead?
19:19:16 <corvus> clarkb: yep
19:19:33 <corvus> so i think the only potential issue is if a spurious merge failure causes nothing to report on a change where it should
19:20:01 <fungi> and "recheck" is the answer to that for now?
19:20:22 <corvus> with no merge-failure reports, we would only rely on the folks managing the openstack modules in ansible to notice "hey, there should be a zuul report for this"
19:20:29 <corvus> fungi: yeah, if they notice
19:21:07 <mordred> yup. since it's advisory third-party style testing anyway, it's probably better to err on the side of quietness than spamness
19:21:28 <AJaeger> should we give the third-party pipeline a low precedence?
19:23:08 <corvus> oh, one other thing -- currently config error reports go through the merge-failure reporters
19:23:13 <corvus> # TODOv3(jeblair): consider a new reporter action for this
19:23:21 <corvus> perhaps this is motivation to implement that :)
19:23:25 <fungi> AJaeger: might be better to keep it timely, at least in the beginning, unless we discover the volume is problematic?
19:23:25 <mordred> corvus: :)
19:23:37 <corvus> because i bet we would want those reported even if we don't want *merge* failures reported
19:23:45 <clarkb> corvus: ++
19:23:48 <AJaeger> fungi, I've just been seeing a very high load this week and last...
19:23:59 <clarkb> though that could still be spammy if they ignore the message on a broken config and merge it
19:24:09 <mordred> AJaeger: this first batch should be very low volume
19:24:09 <clarkb> then zuul will spam all ansible PRs with that message?
19:24:15 <fungi> AJaeger: because of ansible/ansible pull request volume?
19:24:24 <mordred> well - ansible doesn't currently have any .zuul.yaml in which to put config ...
19:24:34 <mordred> so we shouldn't see any persistent config errors
19:25:20 <mordred> but yes - I think if they did merge a .zuul.yaml and we had the new reporter category, then they'd start getting messages on all their changes - which might be motivation to fix the merged file
19:25:21 <corvus> i guess that's not urgent then
19:25:45 <corvus> well, if they did merge a .zuul.yaml with a broken config
19:25:52 <corvus> at that point we'll have a broken config in our system, and zuul will run with the most recent working config until it restarts (at which point it will fail to start).
19:26:01 <corvus> so it won't spam all changes
19:26:11 <corvus> it'll just do something worse :)
19:26:27 <clarkb> we had talked about ignoring projects that don't have working configs on start too right?
19:26:31 <corvus> both of those things need fixing before we roll this out too much.
19:26:33 <corvus> clarkb: ya
19:26:34 <AJaeger> fungi: I mean: We are currently, according to grafana, waiting for 163 nodes - which is unusually low. We will not be able to launch new nodes for testing immediately. Should we give it the same prio as the check pipeline?
19:26:48 <corvus> but i think we can continue with ansible testing for now since we have no immediate plans for .zuul.yaml there
19:26:59 <fungi> AJaeger: i assumed similar priority to check, yes
19:27:29 <corvus> i lean toward trying same prio as check for now, and decreasing later if necessary
19:27:34 <clarkb> wfm
19:27:48 <pabelanger> wfm
19:27:58 <AJaeger> let's try - and monitor our workloads...
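As a reference point, here is a minimal sketch (not the actual contents of review 539286) of what a check-style pipeline without a merge-failure reporter could look like; the pipeline name, the "github" connection name, and the trigger details are illustrative assumptions:

```yaml
# Hypothetical third-party check pipeline: identical in spirit to a normal
# check pipeline, but with no merge-failure reporter, so spurious merge
# failures from flaky fetches are simply not reported to the pull request.
- pipeline:
    name: third-party-check
    description: Advisory check pipeline for projects hosted outside Gerrit.
    manager: independent
    precedence: normal   # same priority as check, per the discussion above
    trigger:
      github:
        - event: pull_request
          action:
            - opened
            - changed
            - reopened
        - event: pull_request
          action: comment
          comment: (?i)^\s*recheck\s*$
    start:
      github:
        status: pending
    success:
      github:
        status: success
    failure:
      github:
        status: failure
    # Intentionally no merge-failure reporter here.
```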
19:28:42 <clarkb> sounds like we know what the issues are and have a reasonable path forward.
19:28:48 <clarkb> anything else zuul related we want to go over?
19:29:00 <clarkb> I want to say pabelanger had something yesterday?
19:29:05 <mordred> clarkb: and as a worst-case scenario, Shrews and I have commit rights in ansible/ansible, so we could just back out a broken .zuul.yaml addition
19:29:14 <pabelanger> clarkb: yah, I can do it now or open topic
19:29:30 <clarkb> ok lets try to get to open topic
19:29:56 <clarkb> because I do think we should talk about job stability and make sure we aren't missing something important
19:30:12 <clarkb> #topic General Topics
19:31:13 <clarkb> The last week or so has been rough for jobs in openstack ci. We lost the log server filesystem, which eventually led to replacing it. We have current mirror issues in rax dfw. There are also changes to job configs to address test timeouts (which I guess are being blamed on Meltdown)
19:31:33 <clarkb> For the logs server is there anything outstanding that we need to do?
19:31:45 <clarkb> I guess decide if we want to keep the old logs around and if not delete those volumes?
19:31:48 <AJaeger> I think all test timeout increases have merged, I'm not aware of open ones
19:32:20 <clarkb> also if you are familiar with fallout 3's soundtrack I currently have that playing for this topic >_>
19:32:24 <pabelanger> clarkb: we need to decide if we want to rsync old logs back to logs.o.o, with the right permissions
19:32:32 <fungi> clarkb: nice choice!
19:32:38 <pabelanger> I'm not sure anybody has even asked for old logs yet
19:33:01 <pabelanger> AJaeger: I think we need to get stable branches for nova still
19:33:09 <cmurphy> i didn't know we could ask for old logs, i've certainly noticed them missing
19:33:11 <clarkb> considering the amount of time required to sync the logs I'm leaning towards just rolling forward with new logs
19:33:17 <AJaeger> pabelanger: ah, might be...
19:33:18 <clarkb> cmurphy: that's good info though
19:33:34 <cmurphy> not that big a deal to me if they don't come back
19:33:38 <clarkb> pabelanger: maybe we should start with a sync for just 00 and make sure we got it all right then do the rest
19:33:47 <fungi> longer-term, i suppose we should revisit redesigning the logs service
19:33:52 <pabelanger> clarkb: sure
19:34:11 <pabelanger> fungi: yes, I added a topic for the PTG. I've seen a few plans floating around recently
19:34:19 <clarkb> that would be a great ptg topic
19:34:40 <fungi> pabelanger: yeah, we had a list of some half-dozen options last time the topic came up
19:34:56 <fungi> i think there's an etherpad somewhere
19:35:02 <clarkb> pabelanger: I'm not going to volunteer for the sync task simply because I'm being pulled a million directions this week but am happy to help with that whenever I am at a keyboard
19:35:02 <pabelanger> ++
19:35:16 * mordred has been poking at what our options are re: swift - should have things to report back to folks soonish, but definitely by the PTG
19:35:59 <clarkb> the other outstanding issue then is the dfw mirror
19:36:08 <clarkb> fungi: ^ you said it hasn't reduced its bw usage
19:36:16 <clarkb> maybe we should build a new server and see if it works better
19:37:00 <fungi> actually, bandwidth utilization has started subsiding in recent minutes
19:37:07 <fungi> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64259&rra_id=all
19:37:09 <clarkb> fungi: but we still have a cap of 100mbps ya?
19:37:16 <pabelanger> I'm trying to help rdoproject get onto their own AFS mirror, it is online but will take some time to rebuild some images.
19:37:22 <fungi> right, and no word back from rax on the ticket for that yet
19:37:25 <pabelanger> also happy to try a new mirror in dfw too
19:37:39 <clarkb> I'm just thinking new mirror in dfw may be way to address the 100mbps problem
19:37:45 <clarkb> especially if it is a hypervisor issue
19:38:01 <pabelanger> yah, rebuilding mirrors has generally been good for us
19:38:25 <mordred> there's also nothing stopping us from doing more than one with a load-balancer - given how they work
19:38:37 <mordred> but doing one new one as a first step seems like a great next thing to do
19:38:50 <clarkb> ok anyone want to volunteer for that?
19:38:50 <ianw> can we dns bounce between mirror01 / 02?
19:38:53 <fungi> though also, i wouldn't be surprised to discover a bulk of bandwidth utilization there is from rdo and not our jobs, so may drop well below current levels once they stop using this mirror for their ci system
19:38:53 <pabelanger> didn't we discuss that at last PTG too?
19:39:00 <mordred> fungi: agree
19:39:03 <pabelanger> clarkb: I can, if nobody else wants to
19:39:09 <clarkb> pabelanger: thanks
19:40:09 <fungi> would be unfortunate to go to the effort of building out a load-balancing solution only to end up not needing nearly that much bandwidth in a few hours/days
19:40:13 <clarkb> in this case dns round robin may not be so great if the available resources are not symmetric
19:41:01 <ianw> it looks like with rax cloud dns you just put in two of the same A records for the same host
19:41:14 <fungi> though i suppose dns round-robin doesn't really take any extra effort beyond the second server build
19:41:17 <clarkb> ianw: ya we've done it for git.o.o before. It will round robin
19:41:25 <ianw> i mean, i could try that if we like over the next few hours
19:41:27 <clarkb> my only concern with it is you can't weight it
19:41:44 <clarkb> so a 100mbps node and an 800mbps node may still be an issue
19:41:47 <corvus> the mirror hostname is a cname
19:42:00 <corvus> so we would need to change that to A records
19:42:22 <ianw> right
19:42:24 <fungi> also, odds of the rdo cloud sharing a recursive dns resolver and all going to the same server...
19:42:30 <clarkb> in any case building a new server sounds like step 0 then we can decide if we want to keep old server
19:42:38 <corvus> however, the host isn't supposed to be limited to 100mbps, right?
19:42:49 <clarkb> corvus: yes exactly. I think it should be 800mbps
19:42:51 <corvus> so regardless of who's using the bandwidth, there's still a problem, yeah?
19:42:57 <clarkb> iirc we get half the rxtx factor for public interface
19:43:00 <fungi> yep
19:43:06 <clarkb> which is why I think building a new server is what we want
19:43:16 <fungi> sounds fine to me
19:43:24 <corvus> yeah, maybe do that after a reasonable period waiting on info via the ticket?
19:43:36 <corvus> just in case it's "oops, we fixed it". may save some work
19:43:37 <clarkb> corvus: wfm
19:43:52 <corvus> like maybe 2 more hours or something :)
19:43:56 <pabelanger> sure
19:44:20 <fungi> bw is down around 60Mbps now anyway, so i expect we're in the clear for a little while
19:44:26 <clarkb> alright, are there any other known issues with CI?
19:45:05 <clarkb> (we also had ubuntu mirror issues which have since been corrected)
19:45:21 <pabelanger> yes
19:45:32 <mordred> devstack jobs are still using rax dfw for UCA mirror
19:45:38 <pabelanger> reprepro is working and nodepool-builder is back to using AFS mirrors for ubuntu
19:45:41 <mordred> for reasons that surpass my understanding
19:45:53 <clarkb> mordred: pabelanger indicated we reverted the change to fix that?
19:46:03 <pabelanger> no, I was confused
19:46:21 <clarkb> ah ok
19:46:26 <clarkb> so thats something else to look into
19:46:35 <mordred> OH
19:46:41 <mordred> I think I just grokked why
19:46:46 <mordred> the change in question was on stable/pike
19:46:46 <clarkb> yay
19:46:51 <clarkb> that will do it
19:46:53 <mordred> yup
19:47:22 <pabelanger> ah
19:47:32 <clarkb> ok before we run out of time I was also super curious to catch up on arm64 and gpt and uefi and all that
19:47:49 <clarkb> ianw: ^ my skimming on chat logs and email seem to tell me that things are going well?
19:47:59 <ianw> yep
19:48:24 <ianw> gpt i plan on merging
19:48:28 <ianw> #link https://review.openstack.org/533490
19:48:33 <clarkb> You had the question of whether or not we should default to gpt on our images. Is that still an outstanding question?
19:48:36 <ianw> and will spend a little more time with the uefi bits
19:49:14 <ianw> let's come back to that ... we may default to a 3 partition layout that is suitable for everything
19:49:27 <ianw> mbr boot and uefi boot
19:49:31 <clarkb> hax
19:49:42 <ianw> i have brought up mirror.cn1.linaro.openstack.org
19:49:52 <ianw> it's currently a little bit of a bespoke server
19:50:10 <clarkb> did afs build alright on the arm kernel?
19:50:10 <ianw> #link https://review.openstack.org/#/c/539083/
19:50:19 <ianw> no, that's my next thing
19:50:36 <ianw> all the other puppet bits worked (modulo that change for the arch-specific sources)
19:51:10 <clarkb> not too bad then I guess
19:51:12 <ianw> i found some worryingly old patches but will have to dig into it.  it might end up similar to centos where we keep our own little mirror
19:51:21 <pabelanger> yah, that's good news
19:51:41 <ianw> we will definitely want the mirror up though; i've noticed the network in cn1 is a little flaky
19:52:43 <ianw> so that's about it ... just keep picking off things one by one till we can run a node there :)
19:52:53 <clarkb> exciting
19:53:04 <clarkb> frickler: had an agenda entry for review spam
19:53:21 <clarkb> I think this was part of last week's meeting. Any thing else we need to cover on that?
19:53:21 <AJaeger> clarkb: we discussed last week, can be removed
19:53:24 <clarkb> AJaeger: thanks
19:53:33 <clarkb> #topic Open Discussion
19:53:50 * AJaeger asks for some more config-core review
19:53:59 <clarkb> #link https://review.openstack.org/539248 change to work around nodepool pool deadlocking
19:54:05 <pabelanger> tripleo-test-cloud-rh1 is in the process of being removed from nodepool, only a few more patches to remove jobs from zuul
19:54:15 <pabelanger> those jobs now run in rdocloud (OVB)
19:54:18 <AJaeger> I would appreciate if others could review regularly open changes
19:55:03 <Shrews> pabelanger and clarkb found a neat nodepool issue yesterday while I was having dinner. I have yet to come up with a programmatic way to deal with it, but a change to our configs to use max-ready-age helps curb the issue until then.
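For context, max-ready-age is a per-label setting in the nodepool config that recycles nodes which have sat in the ready state for too long; below is a rough sketch with assumed values (the label name and numbers are placeholders, not what 539248 actually proposes):

```yaml
# Hypothetical nodepool.yaml fragment. max-ready-age (in seconds) deletes
# ready-but-unused nodes after the given age, which curbs the deadlock by
# eventually freeing stuck ready nodes.
labels:
  - name: ubuntu-xenial
    min-ready: 10
    max-ready-age: 3600
```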
19:55:07 <pabelanger> we also lost console streaming on ze01 (maybe more) due to the OOM killer. Wanted to see if we'd like zuul-executor to have a lower OOM score, or to increase the score for the ansible-playbook processes? Only way to restore streaming now is to restart the executor
19:55:17 <clarkb> AJaeger: this is specifically project-config?
19:55:21 <AJaeger> since frickler is now infra-root and yolanda hasn't been reviewing for a few weeks, we're missing some regular config-core reviewers. So, how do we get more?
19:55:30 <pabelanger> anything that helps keep zuul-executor from being killed would be good
19:55:31 <AJaeger> clarkb: project-config and openstack-zuul-jobs especially
19:55:48 <AJaeger> clarkb: I'm sure there are more that are neglected but those two I care about ;)
19:56:04 <pabelanger> AJaeger: yah, I'll increase my reviews, I've been lacking in recent months
19:56:11 <mordred> me too - sorry about that
19:56:18 <clarkb> #info project-config and openstack-zuul-jobs need review help. Please give them attention as you are able. Also, if anyone is interested in becoming config-core, reviews on those are a great place to start
19:56:51 <ianw> ++ good reminder
19:57:03 <AJaeger> thanks!
19:57:16 <clarkb> I know I've not been able to give gerrit the attention I'd like for the last week and a half :(
19:57:53 <AJaeger> the last few days were far too much firefighting anyhow...
19:58:15 <clarkb> pabelanger: I don't have any immediate good ideas for that other than the governor idea
19:58:55 <Shrews> I was hoping we could discuss whether or not the change in https://review.openstack.org/539248 is acceptable, or if we want a different timeout, or just want to wait on a real fix (no idea on that one yet)
19:59:01 <pabelanger> yah, I have to loop back to adding tests for that, but haven't had time to do so
19:59:18 <Shrews> but, alas, we are out of time  :(
19:59:37 <AJaeger> Shrews: I'm sceptical - as commented - on some corner cases. But happy to try...
19:59:43 <dmsimard|afk> I'm mostly afk all week so it doesn't help -- sorry about that
19:59:45 <clarkb> Shrews: fwiw I think its an acceptable work around for the deadlock issue
19:59:47 <pabelanger> Shrews: I'm okay with the workaround, if others are
20:00:14 <clarkb> and yup we are out of time
20:00:26 <clarkb> Thank you everyone and sorry I've not been around as much as I'd like recently
20:00:28 <clarkb> #endmeeting