19:01:08 #startmeeting infra
19:01:09 Meeting started Tue Jan 30 19:01:08 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 The meeting name has been set to 'infra'
19:01:25 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:29 o/
19:01:33 #topic Announcements
19:01:47 o/
19:01:57 o/
19:02:00 #link http://lists.openstack.org/pipermail/openstack-dev/2018-January/126192.html Vancouver Summit CFP open. Submit your papers.
19:02:16 You have until February 8, which is about 10 days from now, to submit to the summit CFP
19:02:19 o/
19:02:34 OpenDev is being run alongside the summit in Vancouver and will have two days dedicated to it
19:02:57 #link https://etherpad.openstack.org/p/infra-rocky-ptg PTG Brainstorming
19:03:22 We also have a PTG coming up in just under a month. Looks like we've started brainstorming topics there \o/
19:03:54 PTL election season just started
19:04:28 (I just put my token back into the nomination hat for Infra)
19:05:01 Also my wife is traveling this week to visit family (all happened last minute), so I'm going to have weird hours and be more afk than usual while watching the kids
19:05:06 Anything else?
19:05:42 #topic Actions from last meeting
19:05:51 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-01-23-19.01.txt minutes from last meeting
19:06:03 corvus patch spammed projects with name removals
19:06:11 yay spam patches!
19:06:13 I still haven't gotten around to classifying old nodepool and zuul changes
19:06:32 but it is still on my list /me rerecords it
19:06:54 #action clarkb / corvus / everyone to take a pass through old zuul and nodepool master branch changes to at least categorize changes
19:07:01 i have started spamming
19:07:10 but i stopped; i don't think uploading was a problem
19:07:17 #action clarkb abandon specs per: http://lists.openstack.org/pipermail/openstack-infra/2018-January/005779.html
19:07:22 but apparently mass approvals on monday morning was
19:07:36 i think perhaps i should restart them now?
19:07:55 maybe check with the release team to make sure they don't have any urgent changes they want in first?
19:08:11 but ya, I expect that normal weekly interrupts will spread those changes out appropriately
19:08:28 i think the release team is still waiting for a go-ahead from us that things are stable enough to go back to approving releases
19:08:42 they did do a few releases last night
19:08:50 corvus: yeah, restart them - and please send an email to openstack-dev explaining what you do and that you don't need help ;)
19:08:56 the big outstanding item for stability is the dfw mirror, right?
19:08:57 fungi: and I don't think there were any issues
19:08:59 i was hesitant to say we're back to normal stability until we finish working out what's going on with the rax-dfw mirror
19:09:05 yah
19:09:07 any jobs running there run the risk of failing, though we did reduce max-servers there
19:09:38 which so far seems to have had minimal impact on bandwidth utilization (theorized to be due to the rdo ci system also using that mirror)
19:10:59 * clarkb looks at agenda. Let's move on and get back to this a little later
19:11:11 #topic Specs approval
19:11:37 Between travel and dadops I'm a bit behind on specs. I don't think there are any that need review this week. Did I miss some?
19:13:09 #topic Priority Efforts
19:13:17 #topic Storyboard
19:13:38 fungi: Zara search engine indexing is still on the list. Anything new we need to go over on that topic?
19:14:33 oh, right, i was going to propose dropping the robots.txt restriction and see how that goes
19:14:38 i haven't done that yet
19:14:52 sounds like a plan at least :)
19:15:25 #topic Zuul v3
19:16:00 are folks aware of the changes related to reporting on ansible?
19:16:09 I am not
19:16:27 i am not as familiar as i'd like to be
19:16:29 we have found (again) that github can be occasionally unreliable
19:16:42 * fungi puts on his "shocked" face
19:16:43 wha?????
19:16:45 that caused us to occasionally report spurious merge failures
19:17:14 to keep the ansible folks from getting mad at us, we pulled ansible out of the config temporarily
19:17:28 we need to make fetching in the merger a bit more robust
19:17:51 +1
19:17:52 but also, we probably never really want to report merge-failure to ansible
19:18:03 but reporting merge-failure is a pipeline config option
19:18:22 so mordred has proposed https://review.openstack.org/539286 to create a new pipeline which does not report merge failures
19:18:42 corvus: in the case where a pull request can't merge into master, we would not report anything and rely on github's ability to detect that case instead?
19:19:16 clarkb: yep
19:19:33 so i think the only potential issue is if a spurious merge failure causes nothing to report on a change where it should
19:20:01 and "recheck" is the answer to that for now?
19:20:22 with no merge-failure reports, we would only rely on the folks managing the openstack modules in ansible to notice "hey, there should be a zuul report for this"
19:20:29 fungi: yeah, if they notice
19:21:07 yup. since it's advisory third-party style testing anyway, it's probably better to err on the side of quietness rather than spamminess
19:21:28 should we give the third-party pipeline a low precedence?
19:23:08 oh, one other thing -- currently config error reports go through the merge-failure reporters
19:23:13 # TODOv3(jeblair): consider a new reporter action for this
19:23:21 perhaps this is motivation to implement that :)
19:23:25 AJaeger: might be better to keep it timely, at least in the beginning, unless we discover the volume is problematic?
19:23:25 corvus: :)
19:23:37 because i bet we would want those reported even if we don't want *merge* failures reported
19:23:45 corvus: ++
19:23:48 fungi, I have just seen a very high load this week and last...
19:23:59 though that could still be spammy if they ignore the message on a broken config and merge it
19:24:09 AJaeger: this first batch should be very low volume
19:24:09 then zuul will spam all ansible PRs with that message?
19:24:15 AJaeger: because of ansible/ansible pull request volume?
19:24:24 well - ansible doesn't currently have any .zuul.yaml in which to put config ...
19:24:34 so we shouldn't see any persistent config errors
19:25:20 but yes - I think if they did merge a .zuul.yaml and we had the new reporter category, then they'd start getting messages on all their changes - which might be motivation to fix the merged file
19:25:21 i guess that's not urgent then
19:25:45 well, if they did merge a .zuul.yaml with a broken config
19:25:52 at that point we'll have a broken config in our system, and zuul will run with the most recent working config until it restarts (at which point it will fail to start).
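For reference, a pipeline along the lines of what https://review.openstack.org/539286 proposes might look roughly like the sketch below. This is an illustration under assumptions, not the contents of that review: the pipeline name, trigger events, and reporter details are guesses; the point it shows is a GitHub-triggered check pipeline with success and failure reporters but no merge-failure reporter, so spurious merge failures stay quiet.

```yaml
# Illustrative sketch only: the name, triggers, and reporters here are
# assumptions, not the actual contents of review 539286.
- pipeline:
    name: third-party-check
    description: >-
      Advisory check pipeline for projects we test but do not gate
      (e.g. ansible/ansible on GitHub).
    manager: independent
    trigger:
      github:
        - event: pull_request
          action:
            - opened
            - changed
            - reopened
        - event: pull_request
          action: comment
          comment: (?i)^\s*recheck\s*$
    success:
      github:
        status: 'success'
    failure:
      github:
        status: 'failure'
    # Deliberately no merge-failure reporter: when a PR cannot be merged,
    # GitHub's own conflict indication is relied on instead of a Zuul comment.
```

The meeting also leaned toward giving such a pipeline the same precedence as the check pipeline, so no precedence override is shown here.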
19:26:01 so it won't spam all changes
19:26:11 it'll just do something worse :)
19:26:27 we had talked about ignoring projects that don't have working configs on start too, right?
19:26:31 both of those things need fixing before we roll this out too much.
19:26:33 clarkb: ya
19:26:34 fungi: I mean: according to grafana we are currently waiting for 163 nodes - which is unusually low. We will not be able to launch new nodes for testing immediately. Should we give it the same prio as the check pipeline?
19:26:48 but i think we can continue with ansible testing for now since we have no immediate plans for .zuul.yaml there
19:26:59 AJaeger: i assumed similar priority to check, yes
19:27:29 i lean toward trying the same prio as check for now, and decreasing later if necessary
19:27:34 wfm
19:27:48 wfm
19:27:58 let's try - and monitor our workloads...
19:28:42 sounds like we know what the issues are and have a reasonable path forward.
19:28:48 anything else zuul related we want to go over?
19:29:00 I want to say pabelanger had something yesterday?
19:29:05 clarkb: and as a worst-case scenario, Shrews and I have commit rights in ansible/ansible, so we could just back out a broken .zuul.yaml addition
19:29:14 clarkb: yah, I can do it now or in open topic
19:29:30 ok, let's try to get to open topic
19:29:56 because I do think we should talk about job stability and make sure we aren't missing something important
19:30:12 #topic General Topics
19:31:13 The last week or so has been rough for jobs in openstack ci. We lost the log server filesystem, which eventually led to replacing it. We have current mirror issues in rax dfw. There are also changes to job configs to address test timeouts (I guess being blamed on meltdown)
19:31:33 For the logs server, is there anything outstanding that we need to do?
19:31:45 I guess decide if we want to keep the old logs around and if not delete those volumes?
19:31:48 I think all test timeout increases have merged, I'm not aware of open ones
19:32:20 also, if you are familiar with fallout 3's soundtrack, I currently have that playing for this topic >_>
19:32:24 clarkb: we need to decide if we want to rsync old logs back to logs.o.o, with the right permissions
19:32:32 clarkb: nice choice!
19:32:38 I'm not sure anybody has even asked for old logs yet
19:33:01 AJaeger: I think we need to get stable branches for nova still
19:33:09 i didn't know we could ask for old logs, i've certainly noticed them missing
19:33:11 considering the amount of time required to sync the logs, I'm leaning towards just rolling forward with new logs
19:33:17 pabelanger: ah, might be...
19:33:18 cmurphy: that's good info though
19:33:34 not that big a deal to me if they don't come back
19:33:38 pabelanger: maybe we should start with a sync for just 00 and make sure we got it all right, then do the rest
19:33:47 longer-term, i suppose we should revisit redesigning the logs service
19:33:52 clarkb: sure
19:34:11 fungi: yes, I added a topic for PTG. I've seen a few plans floating around recently
19:34:19 that would be a great ptg topic
19:34:40 pabelanger: yeah, we had a list of some half-dozen options last time the topic came up
19:34:56 i think there's an etherpad somewhere
19:35:02 pabelanger: I'm not going to volunteer for the sync task simply because I'm being pulled in a million directions this week, but am happy to help with that whenever I am at a keyboard
19:35:02 ++
19:35:16 * mordred has been poking at what our options are re: swift - should have things to report back to folks soonish, but definitely by the PTG
19:35:59 the other outstanding issue then is the dfw mirror
19:36:08 fungi: ^ you said it hasn't reduced its bw usage
19:36:16 maybe we should build a new server and see if it works better
19:37:00 actually, bandwidth utilization has started subsiding in recent minutes
19:37:07 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64259&rra_id=all
19:37:09 fungi: but we still have a cap of 100mbps, ya?
19:37:16 I'm trying to help rdoproject get onto their own AFS mirror; it is online but will take some time to rebuild some images.
19:37:22 right, and no word back from rax on the ticket for that yet
19:37:25 also happy to try a new mirror in dfw too
19:37:39 I'm just thinking a new mirror in dfw may be the way to address the 100mbps problem
19:37:45 especially if it is a hypervisor issue
19:38:01 yah, rebuilding mirrors has been generally good for us
19:38:25 there's also nothing stopping us from doing more than one with a load-balancer - given how they work
19:38:37 but doing one new one as a first step seems like a great next thing to do
19:38:50 ok, anyone want to volunteer for that?
19:38:50 can we dns bounce between mirror01 / 02?
19:38:53 though also, i wouldn't be surprised to discover a bulk of bandwidth utilization there is from rdo and not our jobs, so it may drop well below current levels once they stop using this mirror for their ci system
19:38:53 didn't we discuss that at the last PTG too?
19:39:00 fungi: agree
19:39:03 clarkb: I can, if nobody else wants to
19:39:09 pabelanger: thanks
19:40:09 would be unfortunate to go to the effort of building out a load-balancing solution only to end up not needing nearly that much bandwidth in a few hours/days
19:40:13 in this case dns round robin may not be so great if the available resources are not symmetric
19:41:01 it looks like with rax cloud dns you just put in two of the same A records for the same host
19:41:14 though i suppose dns round-robin doesn't really take any extra effort beyond the second server build
19:41:17 ianw: ya, we've done it for git.o.o before. It will round robin
19:41:25 i mean, i could try that if we like over the next few hours
19:41:27 my only concern with it is you can't weight it
19:41:44 so a 100mbps and an 800mbps node may still be an issue
19:41:47 the mirror hostname is a cname
19:42:00 so we would need to change that to A records
19:42:22 right
19:42:24 also, odds of the rdo cloud sharing a recursive dns resolver and all going to the same server...
19:42:30 in any case, building a new server sounds like step 0, then we can decide if we want to keep the old server
19:42:38 however, the host isn't supposed to be limited to 100mbps, right?
19:42:49 corvus: yes exactly. I think it should be 800mbps
19:42:51 so regardless of who's using the bandwidth, there's still a problem, yeah?
19:42:57 iirc we get half the rxtx factor for the public interface
19:43:00 yep
19:43:06 which is why I think building a new server is what we want
19:43:16 sounds fine to me
19:43:24 yeah, maybe do that after a reasonable period waiting on info via the ticket?
19:43:36 just in case it's "oops, we fixed it". may save some work
19:43:37 corvus: wfm
19:43:52 like maybe 2 more hours or something :)
19:43:56 sure
19:44:20 bw is down around 60Mbps now anyway, so i expect we're in the clear for a little while
19:44:26 alright, are there any other known issues with CI?
19:45:05 (we also had ubuntu mirror issues which have since been corrected)
19:45:21 yes
19:45:32 devstack jobs are still using rax dfw for the UCA mirror
19:45:38 reprepro is working and nodepool-builder is back to using AFS mirrors for ubuntu
19:45:41 for reasons that surpass my understanding
19:45:53 mordred: pabelanger indicated we reverted the change to fix that?
19:46:03 no, I was confused
19:46:21 ah ok
19:46:26 so that's something else to look into
19:46:35 OH
19:46:41 I think I just grokked why
19:46:46 the change in question was on stable/pike
19:46:46 yay
19:46:51 that will do it
19:46:53 yup
19:47:22 ah
19:47:32 ok, before we run out of time I was also super curious to catch up on arm64 and gpt and uefi and all that
19:47:49 ianw: ^ my skimming of chat logs and email seems to tell me that things are going well?
19:47:59 yep
19:48:24 gpt i plan on merging
19:48:28 #link https://review.openstack.org/533490
19:48:33 You had the question of whether or not we should default to gpt on our images. Is that still an outstanding question?
19:48:36 and will spend a little more time with the uefi bits
19:49:14 let's come back to that ... we may default to a 3-partition layout that is suitable for everything
19:49:27 mbr boot and uefi boot
19:49:31 hax
19:49:42 i have brought up mirror.cn1.linaro.openstack.org
19:49:52 it's currently a little bit of a bespoke server
19:50:10 did afs build alright on the arm kernel?
19:50:10 #link https://review.openstack.org/#/c/539083/
19:50:19 no, that's my next thing
19:50:36 all the other puppet bits worked (modulo that change for the arch-specific sources)
19:51:10 not too bad then I guess
19:51:12 i found some worryingly old patches but will have to dig into it. it might end up similar to centos where we keep our own little mirror
19:51:21 yah, that's good news
19:51:41 we will definitely want the mirror up however; the network is a little flaky in cn1, i've noticed
19:52:43 so that's about it ... just keep picking off things one by one till we can run a node there :)
19:52:53 exciting
19:53:04 frickler: had an agenda entry for review spam
19:53:21 I think this was part of last week's meeting. Anything else we need to cover on that?
19:53:21 clarkb: we discussed last week, can be removed
19:53:24 AJaeger: thanks
19:53:33 #topic Open Discussion
19:53:50 * AJaeger asks for some more config-core review
19:53:59 #link https://review.openstack.org/539248 change to work around nodepool pool deadlocking
19:54:05 tripleo-test-cloud-rh1 is in the process of being removed from nodepool, only a few more patches to remove jobs from zuul
19:54:15 those jobs now run in rdocloud (OVB)
19:54:18 I would appreciate it if others could regularly review open changes
19:55:03 pabelanger and clarkb found a neat nodepool issue yesterday while I was having dinner. I have yet to come up with a programmatic way to deal with it, but a change to our configs to use max-ready-age helps curb the issue until then.
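The max-ready-age workaround discussed in https://review.openstack.org/539248 refers to nodepool's per-label setting that deletes nodes which have sat in the ready state unused for longer than a given number of seconds. A minimal sketch of the relevant nodepool.yaml section follows; the label name and values are illustrative assumptions, not the contents of that change.

```yaml
# Illustrative values only; the label name and ages are not taken from 539248.
labels:
  - name: ubuntu-xenial
    # Keep a small pool of pre-booted nodes available for requests.
    min-ready: 10
    # Delete ready-but-unassigned nodes after an hour so they cannot be
    # held indefinitely by a stuck request handler.
    max-ready-age: 3600
```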
19:55:07 we also lost console streaming on ze01 (maybe more) due to the OOM killer. Wanted to see if we'd like zuul-executor to have a lower OOM score, or to increase the score for the ansible-playbook processes? The only way to restore streaming now is to restart the executor
19:55:17 AJaeger: this is specifically project-config?
19:55:21 since frickler is now infra-root and yolanda hasn't been reviewing for a few weeks, we're missing some regular config-core reviewers. So, how do we get more?
19:55:30 anything to help not kill zuul-executor helps
19:55:31 clarkb: project-config and openstack-zuul-jobs especially
19:55:48 clarkb: I'm sure there are more that are neglected, but those two I care about ;)
19:56:04 AJaeger: yah, I'll increase my reviews, I've been lacking in recent months
19:56:11 me too - sorry about that
19:56:18 #info project-config and openstack-zuul-jobs need review help. Please give them attention as you are able. Also, if anyone is interested in becoming config-core, reviews on those are a great place to start
19:56:51 ++ good reminder
19:57:03 thanks!
19:57:16 I know I've not been able to give gerrit the attention I'd like for the last week and a half :(
19:57:53 the last days were far too much firefighting anyhow...
19:58:15 pabelanger: I don't have any immediate good ideas for that other than the governor idea
19:58:55 I was hoping we could discuss whether or not the change in https://review.openstack.org/539248 is acceptable, or if we want a different timeout, or just want to wait on a real fix (no idea on that one yet)
19:59:01 yah, I have to loop back to adding tests for that, but haven't had time to do so
19:59:18 but, alas, we are out of time :(
19:59:37 Shrews: I'm sceptical - as commented - about some corner cases. But happy to try...
19:59:43 I'm mostly afk all week so it doesn't help -- sorry about that
19:59:45 Shrews: fwiw I think it's an acceptable workaround for the deadlock issue
19:59:47 Shrews: I'm okay with the workaround if others are
20:00:14 and yup, we are out of time
20:00:26 Thank you everyone, and sorry I've not been around as much as I'd like recently
20:00:28 #endmeeting