19:01:14 #startmeeting infra
19:01:15 Meeting started Tue Jul 2 19:01:14 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 The meeting name has been set to 'infra'
19:01:29 #link http://lists.openstack.org/pipermail/openstack-infra/2019-July/006409.html Meeting Agenda
19:01:37 #topic Announcements
19:01:43 * fungi is sorta here
19:02:07 This week coincides with a major US holiday (I think with a Canadian holiday too) so we can probably expect it to be a slow weird one
19:02:24 Mostly just a heads up that those of us in the USA are likely to disappear thursday and maybe friday
19:02:47 i'm also basically gone all week
19:03:00 (just happen to be taking a break near a computer at the moment)
19:03:16 fungi: was that break timed strategically?
19:03:21 #topic Actions from last meeting
19:03:28 entirely accidental
19:03:28 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-06-25-19.01.txt minutes from last meeting
19:03:54 mordred: you took actions a little while back to set up an opendevadmin account on github and to clean up the github openstack-infra org
19:04:24 mordred: have you managed to make progress on that yet? if not I'll tack it back onto the list and I'll bet you are able to get more work done now that you have settled into the nomad life
19:04:48 clarkb: I have not
19:04:59 clarkb: but yes - I should be much more able to actually make progress on it
19:05:01 #action mordred Set up opendevadmin github account
19:05:12 #action mordred clean up openstack-infra github org
19:05:40 mordred: if only wifi worked underwater right?
19:05:48 clarkb: one day!
19:06:04 #topic Priority Efforts
19:06:11 #topic Update Config Management
19:06:39 ianw: I think you are largely driving the efforts here at this point. Want to update us on the mirror-update ansiblification?
19:08:12 i've noticed the kafs-based mirror in iad hasn't crashed in over 5 days now
19:08:21 yeah, so that really came from wanting to export the logs
19:09:11 (mirror-update) ... rather than hacking puppet to do it, seemed better to start an opendev.org mirror-update; initially with the rsync updates because it's easy
19:09:32 that's in progress, i have the host up and reviews done, just need to babysit it
19:10:15 as mentioned, we have two kafs based mirrors running now, rax.iad & gra1.ovh. details in https://etherpad.openstack.org/p/opendev-mirror-afs
19:10:15 ya there is some coordinating between the two hosts that needs doing right?
19:10:23 basically disable all the crons on the old host, then enable on the new host?
19:10:50 clarkb: yep, that's it, and make sure it's actually working :)
19:12:09 kafs doesn't have any blocking issues afaik. however we seem to have narrowed down some of the problems with retry logic when we do "vos release"
19:12:26 ianw: the cachefs crashes are still expected to happen though?
19:12:58 surprising we can go nearly a week without one but then sometimes see it thrice in a day
19:13:35 that's ... undetermined. dhowells is aware of it (links to mail list posts in that etherpad, although there hasn't been on-list discussion)
19:14:14 gotcha, so unsure of root cause or when/why it may happen then
19:14:44 ianw: I was enjoying the scrollback between you and auristor from my last evening
19:15:02 seems like we can skip the agenda entry to talk about the mirrors if we cover at least one more item.
That is, the new mirror should be built with bionic + openafs 1.8.3 (via our ppa) and the ansible is all in place to do that by default?
19:15:23 i think dhowells has a pretty good idea of what's going on, it is in bits of code he wants to rewrite. i need to sync on what the medium term plans are, because i know significant work won't happen for a few weeks at least
19:16:41 clarkb: i think so; the dfw.rax mirror has been going OK AFAIK? we got a report last night about the ord.rax mirror and "hash sum mismatch", which we just turned on the other day ... i need to investigate that
19:17:11 ianw: that reminds me of the problems we had on default bionic openafs
19:17:19 (perhaps we didn't get the ppa version there?)
19:17:36 mordred: yes, auristor has been (and continues to be) super helpful with our AFS infrastructure :) it's like a direct line to the afs gods :)
19:17:43 ianw: +100
19:18:53 Alright, before we move on does anyone else have puppet conversion irons in the fire?
19:19:02 (I'm not aware of any so speak up if I've missed them)
19:19:10 there is....
19:19:47 we should remember that further work on the gerrit-gitea-zuul complex is pending a move of the project creation ansible into python (to speed that up)
19:20:41 i think that was on mordred's plate... mordred is that still something you want to do, or would you like me to see if i can scrounge up some time for it?
19:20:50 (i think this week is spoken for for me, but maybe next)
19:21:17 https://review.opendev.org/#/c/651390/ is related to that if we want to poke at a slightly less complex version of that CD thing first
19:21:43 corvus: I ... yeah - I think this next week is spoken for for me too
19:21:56 I'm still game to do it - but if someone else wants to hit it, I won't be offended
19:21:57 Maybe that's a resync next week after the holiday thing then?
19:22:10 sounds good, we'll see where we are then
19:22:15 I'm happy to help as well but ya holiday makes this week a hard one
19:22:33 there's a holiday? I haven't heard mention of that where I am ...
19:22:35 and yeah, that NS change can go in parallel
19:22:53 mordred: there is probably an expat community that will have a BBQ type thing
19:22:58 mordred: we had that growing up
19:22:58 oh mordred should be all caught up by next week then, since he moved *out* of a holiday zone :)
19:23:21 mordred: (you're supposed to nomad into places right before major holidays ;)
19:23:23 maybe I can celebrate king's day late or something instead
19:23:30 corvus: point taken
19:24:07 #topic OpenDev
19:24:16 Last week I replaced gitea06 finally
19:24:24 clarkb: huzzah, thanks!
19:24:25 The new server has a larger uncorrupted rott disk
19:24:30 *root
19:24:36 heh, it had the disk rot
19:24:40 indeed
19:24:52 I also wrote down the process for replacing the gitea nodes in our docs
19:25:00 so if we have to do that again hopefully it is nice and easy
19:25:15 (we should consider replacing all of them actually to get bigger ext4 journals and larger disks overall)
19:25:28 I don't think ^ is urgent but something to keep in mind as we have time
19:25:42 and inoculate the rest of them against diskrot
19:26:15 Semi-related to that, it's been pointed out that our larger repos like nova perform poorly in gitea
19:26:38 I responded to the openstack-discuss thread with pointers on how people might help us debug that, but haven't seen a response
19:26:59 yeah, that's a little disappointing
19:27:53 Any other opendev related business?
19:27:53 I imagine our gitea friends would be interested in making improvements if we can collect the debug info from people
19:28:06 mordred: ya I mentioned that upstream is really helpful and would likely want to fix it too
19:28:12 we heard a lot of feedback that the complex puppet system we had made it hard for people to contribute casually; it's interesting that a shiny new simple dockerfile is not sufficient enticement.
19:29:36 corvus: dockerfiles are too complicated - if we could just hand out root accounts to developers, then I'm sure people would help
19:29:58 mordred: i guess our standards are too high :)
19:30:40 if anyone is listening in I'm happy to assist someone that would like to debug that in the nearer future, I just have no bw to do it myself currently
19:32:05 Ok sounds like that may be it for opendev. Moving on
19:32:08 #topic Storyboard
19:32:18 fungi: diablo_rojo_phon ^ any updates?
19:32:43 My IRC client decided that reloading plugins meant reloading all channel connections, and I've apparently not set up #storyboard again
19:32:54 there was another bug triage on friday
19:33:00 nothing too exciting
19:33:17 most of the active stories in the storyboard project group are now tagged appropriately
19:33:35 that's the only update i'm aware of
19:33:49 #topic General Topics
19:33:57 #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:34:07 fungi: ^ any new progress on the wiki before stepping out for vacation?
19:34:16 and are there changes we should be looking at?
19:36:38 i'll know more after https://review.opendev.org/666162 merges
19:36:46 i had to recheck it for a beaker problem, apparently
19:37:05 ya that looks consistent
19:37:15 may need debugging
19:37:37 The last item on the Agenda (we covered mirrors previously) is a cloud resource status update
19:37:47 right, not sure why the same patchset was passing a week ago
19:38:06 donnyd has set up users for us in the fortnebula cloud which I'm working through sanity checking right now
19:38:11 (tempest is running as we speak)
19:38:27 and apparently tempest needs a boatload of ram nowadays?
19:38:43 yup, tempest/devstack is digging into swap about 389MB deep on our 8GB ram nodes
19:39:20 should we try a larger flavor... maybe like 10 or 12 G of memory
19:39:27 is that due to increased pressure from devstack?
19:39:39 It might just eat all the rams and swap anyways
19:39:42 donnyd: we've long pushed for the 8GB number because it is a reasonable number for people to have on their own machines
19:39:54 hrm, we should have the peak memory dumps going on? does it give a clue what's growing?
19:39:58 donnyd: imo the issue here is openstack is using way too much memory
19:40:20 and that should be fixed rather than increasing memory per node (also increasing memory per node would decrease our total number of test instances)
19:40:46 true, but if it makes it through the tests faster it's a wash
19:40:52 ianw: I haven't dug into it much beyond discovering I had to enable swap to get tempest to pass on these nodes
19:41:08 not disagreeing, just willing to test it out
19:42:22 donnyd: it seems to have run within similar time bounds as other clouds as is, so I think we may be fine, then push back on openstack to try and improve memory usage
19:42:43 donnyd: it runs devstack faster than other clouds and tempest slightly slower, so resulting job runtime should be about the same
19:42:50 maybe even a little quicker
19:42:58 right; the one to look for is like -> http://logs.openstack.org/68/668668/2/check/tempest-full-py3/65eee4c/controller/logs/screen-peakmem_tracker.txt.gz
19:43:31 the idea is that whenever free ram is seen to drop, it logs a new dump of what's taking it up
19:43:38 in any case, once tempest is done I'll post up numbers and if others agree we are in the same region of time I'll build the mirror when donnyd is done tweaking things, then we can have nodepool start using the cloud
19:44:25 I also wanted to do a quick check on the other two clouds where things have been moving. Inap was turned off because they were upgrading the cloud right? Any word on whether or not we'll be able to turn that back on at some point?
19:45:23 i haven't heard anything new from mgagne on that
19:45:26 Looks like pabelanger and ajaeger may have worked with mgagne to disable that
19:45:31 https://review.opendev.org/#/c/655924/
19:45:35 I am pretty much done
19:45:49 ok I can reach out to mgagne and ask if there is any news
19:45:57 the last cloud was limestone which has network trouble currently
19:45:58 just minor sysctls at this point to get the last few cycles out
19:46:23 clarkb, fungi: Sorry, no news for now. A coworker is working on that one. I've been busy working on something else. I can ask him asap.
19:46:37 mgagne: ok not a rush, but didn't want it falling through the cracks
19:46:39 mgagne: thank you
19:46:53 ++
19:47:02 fungi: you were working with logan on limestone right? is the way to monitor that on our end the cacti gaps with our mirror there?
19:47:34 I guess we can continue to monitor that and if it stops having gaps put the cloud back into service
19:48:20 Alright that was all I had
19:48:24 #topic Open Discussion
19:48:40 Anything else?
19:48:48 I'm happy to end the meeting early if not
19:49:27 yeah that's what i did previously
19:50:03 i think logan- hasn't had a chance to look at it again after it went back to having connectivity issues
19:51:39 sounds like that may be it. Thank you everyone
19:51:50 For those of you celebrating holidays have fun!
19:51:52 #endmeeting