19:01:14 #startmeeting infra
19:01:15 Meeting started Tue Jul 2 19:01:14 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 The meeting name has been set to 'infra'
19:01:29 #link http://lists.openstack.org/pipermail/openstack-infra/2019-July/006409.html Meeting Agenda
19:01:37 #topic Announcements
19:01:43 * fungi is sorta here
19:02:07 This week coincides with a major US holiday (I think with a Canadian holiday too) so we can probably expect it to be a slow weird one
19:02:24 Mostly just a heads up that those of us in the USA are likely to disappear thursday and maybe friday
19:02:47 i'm also basically gone all week
19:03:00 (just happen to be taking a break near a computer at the moment)
19:03:16 fungi: was that break timed strategically?
19:03:21 #topic Actions from last meeting
19:03:28 entirely accidental
19:03:28 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-06-25-19.01.txt minutes from last meeting
19:03:54 mordred: you took actions a little while back to set up an opendevadmin account on github and to clean up the github openstack-infra org
19:04:24 mordred: have you managed to make progress on that yet? if not I'll tack it back onto the list and I'll bet you are able to get more work done now that you have settled into the nomad life
19:04:48 clarkb: I have not
19:04:59 clarkb: but yes - I should be much more able to actually make progress on it
19:05:01 #action mordred Set up opendevadmin github account
19:05:12 #action mordred clean up openstack-infra github org
19:05:40 mordred: if only wifi worked underwater right?
19:05:48 clarkb: one day!
19:06:04 #topic Priority Efforts
19:06:11 #topic Update Config Management
19:06:39 ianw: I think you are largely driving the efforts here at this point. Want to update us on the mirror-update ansiblification?
19:08:12 i've noticed the kafs-based mirror in iad hasn't crashed in over 5 days now
19:08:21 yeah, so that really came from wanting to export the logs
19:09:11 (mirror-update) ... rather than hacking puppet to do it, seemed better to start an opendev.org mirror-update; initially with the rsync updates because it's easy
19:09:32 that's in progress, i have the host up and reviews done, just need to babysit it
19:10:15 as mentioned, we have two kafs based mirrors running now, rax.iad & gra1.ovh. details in https://etherpad.openstack.org/p/opendev-mirror-afs
19:10:15 ya there is some coordinating between the two hosts that needs doing right?
19:10:23 basically disable all the crons on the old host, then enable on the new host?
19:10:50 clarkb: yep, that's it, and make sure it's actually working :)
19:12:09 kafs doesn't have any blocking issues afaik. however we seem to have narrowed down some of the problems with retry logic when we do "vos release"
19:12:26 ianw: the cachefs crashes are still expected to happen though?
19:12:58 surprising we can go nearly a week without one but then sometimes see it thrice in a day
19:13:35 that's ... undetermined. dhowells is aware of it (links to mail list posts in that etherpad, although there hasn't been on-list discussion)
19:14:14 gotcha, so unsure of root cause or when/why it may happen then
19:14:44 ianw: I was enjoying the scrollback between you and auristor from my last evening
19:15:02 seems like we can skip the agenda entry to talk about the mirrors if we cover at least one more item.
That is, the new mirror should be built with bionic + openafs 1.8.3 (via our ppa) and the ansible is all in place to do that by default?
19:15:23 i think dhowells has a pretty good idea of what's going on, it is in bits of code he wants to rewrite. i need to sync on what the medium term plans are, because i know significant work won't happen for a few weeks at least
19:16:41 clarkb: i think so; the dfw.rax mirror has been going OK AFAIK? we got a report last night about the ord.rax mirror and "hash sum mismatch", which we just turned on the other day ... i need to investigate that
19:17:11 ianw: that reminds me of the problems we had on default bionic openafs
19:17:19 (perhaps we didn't get the ppa version there?)
19:17:36 mordred: yes, auristor has been (and continues to be) super helpful with our AFS infrastructure :) it's like a direct line to the afs gods :)
19:17:43 ianw: +100
19:18:53 Alright, before we move on does anyone else have puppet conversion irons in the fire?
19:19:02 (I'm not aware of any so speak up if I've missed them)
19:19:10 there is....
19:19:47 we should remember that further work on the gerrit-gitea-zuul complex is pending a move of the project creation ansible into python (to speed that up)
19:20:41 i think that was on mordred's plate... mordred is that still something you want to do, or would you like me to see if i can scrounge up some time for it?
19:20:50 (i think this week is spoken for for me, but maybe next)
19:21:17 https://review.opendev.org/#/c/651390/ is related to that if we want to poke at a slightly less complex version of that CD thing first
19:21:43 corvus: I ... yeah - I think this next week is spoken for for me too
19:21:56 I'm still game to do it - but if someone else wants to hit it, I won't be offended
19:21:57 Maybe that's a resync next week after the holiday thing then?
19:22:10 sounds good, we'll see where we are then
19:22:15 I'm happy to help as well but ya holiday makes this week a hard one
19:22:33 there's a holiday? I haven't heard mention of that where I am ...
19:22:35 and yeah, that NS change can go in parallel
19:22:53 mordred: there is probably an expat community that will have a BBQ type thing
19:22:58 mordred: we had that growing up
19:22:58 oh mordred should be all caught up by next week then, since he moved *out* of a holiday zone :)
19:23:21 mordred: (you're supposed to nomad into places right before major holidays ;)
19:23:23 maybe I can celebrate king's day late or something instead
19:23:30 corvus: point taken
19:24:07 #topic OpenDev
19:24:16 Last week I replaced gitea06 finally
19:24:24 clarkb: huzzah, thanks!
19:24:25 The new server has a larger uncorrupted rott disk
19:24:30 *root
19:24:36 heh, it had the disk rot
19:24:40 indeed
19:24:52 I also wrote down the process for replacing the gitea nodes in our docs
19:25:00 so if we have to do that again hopefully it is nice and easy
19:25:15 (we should consider replacing all of them actually to get bigger ext4 journals and larger disks overall)
19:25:28 I don't think ^ is urgent but something to keep in mind as we have time
19:25:42 and inoculate the rest of them against diskrot
19:26:15 Semi-related to that, it's been pointed out that our larger repos like nova perform poorly in gitea
19:26:38 I responded to the openstack-discuss thread with pointers on how people might help us debug that, but haven't seen a response
19:26:59 yeah, that's a little disappointing
19:27:53 Any other opendev related business?
19:27:53 I imagine our gitea friends would be interested in making improvements if we can collect the debug info from people
19:28:06 mordred: ya I mentioned that upstream is really helpful and would likely want to fix it too
19:28:12 we heard a lot of feedback that the complex puppet system we had made it hard for people to contribute casually; it's interesting that a shiny new simple dockerfile is not sufficient enticement.
19:29:36 corvus: dockerfiles are too complicated - if we could just hand out root accounts to developers, then I'm sure people would help
19:29:58 mordred: i guess our standards are too high :)
19:30:40 if anyone is listening in I'm happy to assist someone that would like to debug that in the nearer future, I just have no bw to do it myself currently
19:32:05 Ok sounds like that may be it for opendev. Moving on
19:32:08 #topic Storyboard
19:32:18 fungi: diablo_rojo_phon ^ any updates?
19:32:43 My IRC client decided that reloading plugins meant reloading all channel connections, and I've apparently not set up #storyboard again
19:32:54 there was another bug triage on friday
19:33:00 nothing too exciting
19:33:17 most of the active stories in the storyboard project group are now tagged appropriately
19:33:35 that's the only update i'm aware of
19:33:49 #topic General Topics
19:33:57 #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:34:07 fungi: ^ any new progress on the wiki before stepping out for vacation?
19:34:16 and are there changes we should be looking at?
19:36:38 i'll know more after https://review.opendev.org/666162 merges
19:36:46 i had to recheck it for a beaker problem, apparently
19:37:05 ya that looks consistent
19:37:15 may need debugging
19:37:37 The last item on the Agenda (we covered mirrors previously) is a cloud resource status update
19:37:47 right, not sure why the same patchset was passing a week ago
19:38:06 donnyd has set up users for us in the fortnebula cloud which I'm working through sanity checking right now
19:38:11 (tempest is running as we speak)
19:38:27 and apparently tempest needs a boatload of ram nowadays?
19:38:43 yup, tempest/devstack is digging into swap about 389MB deep on our 8GB ram nodes
19:39:20 should we try a larger flavor... maybe like 10 or 12 G of memory
19:39:27 is that due to increased pressure from devstack?
19:39:39 It might just eat all the rams and swap anyways
19:39:42 donnyd: we've long pushed for the 8GB number because it is a reasonable number for people to have on their own machines
19:39:54 hrm, we should have the peak memory dumps going on? does it give a clue what's growing?
19:39:58 donnyd: imo the issue here is openstack is using way too much memory
19:40:20 and that should be fixed rather than increasing memory per node (also increasing memory per node would decrease our total number of test instances)
19:40:46 true, but if it makes it through the tests faster it's a wash
19:40:52 ianw: I haven't dug into it much beyond discovering I had to enable swap to get tempest to pass on these nodes
19:41:08 not disagreeing, just willing to test it out
19:42:22 donnyd: it seems to have run within similar time bounds as other clouds as is, so I think we may be fine, then push back on openstack to try and improve memory usage
19:42:43 donnyd: it runs devstack faster than other clouds and tempest slightly slower, so resulting job runtime should be about the same
19:42:50 maybe even a little quicker
19:42:58 right; the one to look for is like -> http://logs.openstack.org/68/668668/2/check/tempest-full-py3/65eee4c/controller/logs/screen-peakmem_tracker.txt.gz
19:43:31 the idea is that whenever free ram is seen to drop, it logs a new dump of what's taking it up
19:43:38 in any case, once tempest is done I'll post up numbers and if others agree we are in the same region of time I'll build the mirror when donnyd is done tweaking things, then we can have nodepool start using the cloud
19:44:25 I also wanted to do a quick check on the other two clouds where things have been moving. Inap was turned off because they were upgrading the cloud right? Any word on whether or not we'll be able to turn that back on at some point?
19:45:23 i haven't heard anything new from mgagne on that
19:45:26 Looks like pabelanger and ajaeger may have worked with mgagne to disable that
19:45:31 https://review.opendev.org/#/c/655924/
19:45:35 I am pretty much done
19:45:49 ok I can reach out to mgagne and ask if there is any news
19:45:57 the last cloud was limestone which has network trouble currently
19:45:58 just minor sysctls at this point to get the last few cycles out
19:46:23 clarkb, fungi: Sorry, no news for now. A coworker is working on that one. I've been busy working on something else. I can ask him asap.
19:46:37 mgagne: ok not a rush, but didn't want it falling through the cracks
19:46:39 mgagne: thank you
19:46:53 ++
19:47:02 fungi: you were working with logan on limestone right? is the way to monitor that on our end the cacti gaps with our mirror there?
19:47:34 I guess we can continue to monitor that and if it stops having gaps put the cloud back into service
19:48:20 Alright that was all I had
19:48:24 #topic Open Discussion
19:48:40 Anything else?
19:48:48 I'm happy to end the meeting early if not
19:49:27 yeah that's what i did previously
19:50:03 i think logan- hasn't had a chance to look at it again after it went back to having connectivity issues
19:51:39 sounds like that may be it. Thank you everyone
19:51:50 For those of you celebrating holidays have fun!
19:51:52 #endmeeting