19:01:37 #startmeeting infra
19:01:38 Meeting started Tue Jul 23 19:01:37 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:39 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:41 The meeting name has been set to 'infra'
19:01:44 #link http://lists.openstack.org/pipermail/openstack-infra/2019-July/006425.html
19:01:48 aloha
19:02:56 #topic Announcements
19:03:13 There weren't any recorded on the agenda
19:03:20 #topic Actions from last meeting
19:03:26 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-07-16-19.01.txt minutes from last meeting
19:03:34 mordred: any updates on the github cleanup work?
19:05:54 clarkb: nope. totally been doing gerrit instead. also, I keep hitting github rate limits and then doing something else
19:06:09 mordred: we need to convert you into a github app then
19:06:16 #action mordred clean up openstack-infra github org
19:06:25 #action mordred create opendevadmin account on github
19:07:43 #topic Priority Efforts
19:07:52 #topic OpenDev
19:08:04 Let's start here since a fair bit has gone on with gitea in the last week or so
19:08:13 indeed it has
19:08:38 We've discovered that gitea services can cause the OOM killer to be invoked. This often targets git processes for killing. If this happens when gerrit is replicating to gitea we can lose that replication event
19:09:08 synopsis: corrupt git repo or missing objects
19:09:19 I've put 1GB of swap (via swapfile) on gitea01-05,07-08 and 8GB on gitea06. The reason for the difference in size is that 06 was rebuilt with a much larger root disk and has space for the swapfile; the others are smaller and don't have much room
19:09:28 1GB of swap is not sufficient to avoid these errors
19:09:49 But 8GB appears to have been sufficient, so we are rebuilding gitea01 to be like gitea06 and will likely roll through all the other backends to do the same thing
19:10:24 In this process we've discovered a few deficiencies with how we deploy gitea and manage haproxy. Fixes for those are all merged now, with the exception of how to gracefully restart haproxy when haproxy's image updates
19:10:31 old instance/volume deleted, new bfv instance exists now and needs data copied over
19:10:42 I don't know how to handle that last case and will need to think about it more (and read up on docker-compose)
19:12:05 clarkb: Is there an agenda?
19:12:12 yeah, manuals say sigusr1 is for that
19:12:19 donnyd: ya http://lists.openstack.org/pipermail/openstack-infra/2019-July/006425.html
19:12:24 if i'm reading correctly
19:12:36 Sorry. I didn't scroll all the way down.
19:12:37 fungi: ya so it will depend on whether or not we can have docker-compose somehow send signals when it restarts stuff
19:12:53 fungi: we may have to break this out of docker-compose? or accept that images don't update often
19:12:57 (and haproxy restarts are quick)
19:13:06 need to do more research
19:13:42 I mean - I think we always have ansible running docker-compose things
19:13:56 so it's reasonable for ansible to tell docker-compose to send a signal or whatever
19:14:06 maybe?
19:14:07 http://cavaliercoder.com/blog/restarting-services-in-docker-compose.html ??
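[A minimal sketch of the approach being discussed: ansible telling docker-compose to signal the running haproxy container before it is replaced. The "haproxy" service name and the exact signal are illustrative assumptions, not the deployed configuration.]
    # ask the old haproxy for a graceful ("soft") stop; SIGUSR1 is haproxy's
    # soft-stop signal, letting existing connections finish
    docker-compose kill -s SIGUSR1 haproxy
    # then recreate the service container from the freshly pulled image
    docker-compose up -d haproxy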
19:14:13 mordred: right but the image replacement is all a bit automagic in docker compose
19:14:14 * fungi just wrote a handler which does that
19:14:36 clarkb: ah yes - that's an excellent point
19:14:54 mordred: so either we can hook into that or we stop relying on it and manually break out the steps
19:14:57 we don't actually want to do this on every ansible run - only if the result of the pull would cause a restart
19:15:04 correct
19:15:06 yeah, image updates will require more than just sending a signal to the running daemon
19:15:39 happily, haproxy is generally pretty boring and stable
19:15:55 corvus: yup (today is an exception and there should be a new image soon, but in general that is true)
19:16:22 I think that is the least urgent item to sort out for impactless updates
19:16:29 the urgent ones have all been addressed, thank you
19:16:47 We should keep an eye out for unexpected behavior going forward, as digging into the problems today was really helpful
19:17:00 Are there any other opendev-related changes we should be aware of?
19:18:18 Sounds like no. Onward
19:18:23 #topic Update Config Management
19:18:35 mordred has been working on getting gerrit into docker
19:18:53 \o/
19:18:54 mordred: anything you want to call out for that activity?
19:18:58 yes. I just pushed up a new rev
19:19:07 uhm - mainly it's definitely ready for review now
19:19:18 and clarkb just found a good pile of gotchas for 2.13
19:19:26 so review is much appreciated
19:19:32 #link https://review.opendev.org/671457 Gerrit docker image builds ready for review
19:19:37 i think when that's in place https://review.opendev.org/630406 is going to show us any problems with it
19:19:46 the general idea is making a 2.13 image that works pretty much just like our current 2.13 install
19:19:56 there's currently a problem with the last image we built (6mo ago); i doubt it's fixed, but we'll be able to iterate then
19:20:02 corvus: agree - although there are a few things, like heapSize, that we'll want to think about?
19:20:06 corvus: ++
19:20:13 and by "problem" i mean things like file ownership, paths, etc
19:20:16 yeah
19:20:30 most of the things should pretty immediately break
19:20:44 mordred: heapSize is something we can pass in via an env var, right?
19:20:49 it is now :)
19:20:54 (so we can have a test and prod value)
19:20:55 cool
19:21:27 we WILL be losing the ability to set those things in gerrit.config unless we make things more complex
19:21:40 mordred: could we run an image that ran the gerrit init script?
19:21:55 not really - it forks gerrit into the background
19:22:08 right, but aren't there docker images that know how to manage that?
19:22:23 I mean - we COULD - but I'd rather fix this to not be wonky like that
19:22:27 and it's not far off
19:22:35 ok
19:23:08 we should be able to deploy it on review-dev and be pretty happy with the results before we commit to production too
19:23:13 ++
19:23:28 as long as gerrit upstream also doesn't intend it to be wonky like that
19:23:38 otherwise it seems a bit Sisyphean
19:23:56 I get the sense that gerrit upstream has sort of ignored these problems with their images?
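[A rough illustration of the heapSize-via-environment idea above, written as a container entrypoint fragment; GERRIT_HEAP_LIMIT and the /var/gerrit paths are hypothetical names for illustration, not necessarily what the opendev image uses.]
    # run gerrit in the foreground, turning an optional env var into -Xmx so
    # test and prod can set different heap sizes without editing gerrit.config
    exec java ${GERRIT_HEAP_LIMIT:+-Xmx${GERRIT_HEAP_LIMIT}} \
        -jar /var/gerrit/bin/gerrit.war daemon -d /var/gerrit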
19:24:06 (I seem to recall it relying on h2 among other things)
19:24:23 which i expect is fine as long as they don't intentionally make it worse
19:24:25 hrm, i think the upstream images are decently constructed
19:24:34 they have volumes in the right places, so you don't have to use h2
19:24:37 ah
19:24:38 yah
19:24:50 they don't work for us because we want to be able to patch :)
19:25:00 most of the stuff going on in the init script is not actually necessary
19:25:23 it elides out pretty quickly once you're making images, because you know where all the things are
19:25:35 (there's a TON of logic for finding where your files might be, for instance)
19:25:50 it seems like they completely ignore things like java heap sizes though?
19:26:33 oh, maybe not https://github.com/GerritCodeReview/docker-gerrit/blob/master/ubuntu/18/entrypoint.sh#L16
19:27:07 they are running the init script
19:27:22 yeah. I mean - we could do that if people want - I just don't think we need to
19:28:02 My biggest concern with not doing that is that we might miss important changes to java configs or other settings that happen in the runtime (not gerrit itself) as new versions of gerrit or java come out
19:28:10 but you are right that that is less ideal for docker
19:28:38 and maybe an opportunity to collaborate
19:28:49 there is currently exactly one setting we set that results in a java cli option being pulled out by the init script :)
19:29:18 mordred: we also set the timeout option, but that one doesn't make sense with docker
19:29:32 yah
19:30:02 not much preexisting cause for worry then
19:30:08 current approach should be fine
19:33:04 Sounds like that may be it on this topic then?
19:33:18 #topic Storyboard
19:33:32 fungi: diablo_rojo_phon how are things?
19:33:56 we had a good session on Friday grooming feature requests into our main priorities board
19:34:48 Gonna meet this week
19:34:48 #link https://storyboard.openstack.org/#!/board/115 StoryBoard Team Dashboard
19:34:55 Talk about PTG and forum things
19:35:19 Onboarding maybe.
19:35:35 curious to see what interest we can drum up in Shanghai
19:35:56 Same
19:36:23 that's all i can recall
19:36:35 Yeah that's it for now
19:36:40 Thanks!
19:36:44 #topic General Topics
19:36:51 oh, i've pushed some changes to get python-storyboardclient testable again
19:37:06 Aside from begging mordred to do some SQL things in hopes of improving search
19:37:20 Thanks fungi!
19:37:24 fungi: for the wiki upgrade I think last week the suspicion was git submodules
19:37:45 have we been able to take that suspicion and make progress with it yet? (also I know I said I would try to help and then got sucked into making gitea betterer)
19:38:29 nope, i had weekend guests and so less time on hand than hoped
19:38:42 ok, I'll actually try to take a look this week
19:38:43 i've been working on making progress on the zuul log handling; while we can technically switch to swift log storage at any time, the experience will degrade if we do it now, but could be better if we do it after we add some things to zuul
19:38:52 #link https://zuul-ci.org/docs/zuul/developer/specs/logs.html zuul log handling spec
19:39:01 submodules or similar (git subtree?) theories are possible avenues of investigation for the mediawiki situation
19:39:05 corvus: is the log manifest bit the piece to reduce degradation?
19:39:37 i've pestered mordred and clarkb for reviews of blockers so far. i think i'm about at the end of that, and with the (yes) zuul-manifest stuff in place, we should be able to see the javascript work in action with the preview builds
19:40:12 yeah, the short version is the manifest lets the web app handle indexes, so we don't have to pre-generate them, and we can also display logs in the web app itself
19:40:31 which means we can have javascript do the OSLA bits (which otherwise we would also have to pre-generate)
19:40:43 that sounds like a great approach
19:40:53 then we can delete the log server
19:40:58 and there will be much rejoicing
19:41:01 \o/
19:41:03 saw some of those changes float by and will try to take a closer look
19:41:32 it'll get interesting once the manifest stuff is in place and i can rebase for it. hopefully today
19:41:48 exciting progress
19:41:58 the kafs mirror servers are out of rotation for now; there are some changes queued in the afs-next branch which it would be good for us to test before they are sent to Linus
19:42:33 however the tree currently doesn't build, but when it does it would be good to put it into rotation for a while to confirm. the fscache issues are unresolved, however
19:42:35 i find it awesome that we're testing things before they're sent to Linus
19:42:39 (just an aside)
19:42:48 fungi: ++
19:43:14 ianw: good to know, let us know how we can help I suppose (reviewing changes to flip around the mirror that is used?)
19:43:28 ianw: does afs-next get patches via mailing list?
19:44:14 seems so, yes
19:44:17 (i've been itching to write an imap driver for zuul; wonder if this would be a practical use)
19:44:18 corvus: there is a mailing list, but things also pop in and out as dhowells works on things
19:44:35 i will *so* review an imap/smtp zuul driver
19:45:05 i mean, i guess the smtp reporter is already there? ;)
19:45:22 * corvus read "tree currently doesn't build" and got really confused and sad
19:46:08 Intel has been doing a bunch of CI for the kernel recently I think
19:46:16 that might be another avenue for collaboration potentially
19:46:30 As a time check we have ~14 minutes left and a couple more items to get through, so let's continue on
19:46:40 I did want to do a cloud status check-in
19:46:42 an nntp driver would be cool, but convincing lkml to return their focus to usenet might be an uphill battle
19:47:01 The FortNebula cloud is pretty well stabilized at this point. Thank you donnyd
19:47:11 thanks donnyd!!!
19:47:16 there may be one or two corner cases that need further investigating (ipv6 related maybe?)
19:47:18 yea it seems to be working well atm
19:47:29 but overall it is doing great
19:47:36 Yea I am not sure why there seem to be just 3 jobs that time out
19:47:57 which three jobs?
19:48:03 Hopefully the right storage gear will *actually* show up tomorrow
19:48:21 i hear there's a generator backup in the works too
19:48:24 fungi: I will get you the list, but it seems to be fairly consistent
19:48:38 yea, it's been sitting outside for 6 weeks
19:49:05 just got a pad poured and now I am waiting for the gas to be hooked up to it
19:49:05 mordred: for MOC is there anything we can be doing or is it still in that weird "make accounts and get people to trust us" situation :)
19:49:26 and my UPSes have been refreshed, so they should handle the load till the genny takes over
19:49:27 I have not heard anything new about the new linaro cloud since we last had this meeting
19:49:46 hah. i need to get a generator for here. most of my neighbors have them so i'm feeling rather naked (though i need a >=6' platform to support it)
19:50:01 fungi: is that something you can put on the roof? that should be high enough
19:50:13 i'd need to reinforce the roof for that
19:50:15 clarkb: lemme check to see if it's been fixed
19:50:50 Is there any other fun experiments we can do with FN?
19:50:55 are
19:51:18 this experiment isn't enough? (just kidding!)
19:51:25 donnyd: I think if we get a serious group together to start working out nested virt issues (another potential avenue for feedback to the kernel) that would be super helpful
19:51:26 LOL
19:51:41 yeah, that's been a frequent request
19:51:47 donnyd: what we've found in the past is that debugging those issues requires involvement from all layers of the stack, and what we've traditionally lacked is insight into the cloud
19:51:53 yea, I have seen more than one request for it
19:52:17 clarkb: would ssh access to the hypervisors help?
19:52:30 johnsom and kashyap are your workload and kernel people, and if we can get them insight into the hypervisors we may start getting the ball rolling
19:52:33 if it came with people to ssh in and troubleshoot them ;)
19:52:38 clarkb: MOC still waiting on app credentials to be enabled in their keystone
19:52:40 donnyd: I'm not sure they need to ssh in as much as just candid data capture
19:52:42 * mordred pings knikolla ^^
19:52:53 donnyd: cpu and microcode version and kernel versions and modules loaded and so on
19:52:59 mordred: k
19:53:14 sure... this is the only workload this thing is doing... I can make logs public-facing without issue
19:53:20 donnyd: kashyap would be the best person to engage about what the needs are
19:53:25 on the cloud providers topic, there's a change posted to switch the main flavor in the lon1 aarch64 cloud to 16 cpus and 16gb ram because of resource demands from kolla arm64 jobs
19:53:59 fungi: I'm a bit wary of increasing the memory footprint just because, but considering it is a different cpu arch increasing the cpu count seems reasonable
19:54:00 the arm jobs require more ram?
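[The sort of hypervisor-side data capture being suggested above, as generic example commands rather than an agreed-on collection script; the kvm_intel path assumes Intel hosts (kvm_amd on AMD).]
    uname -r                                      # kernel version
    grep -m1 'model name' /proc/cpuinfo           # cpu model
    grep -m1 microcode /proc/cpuinfo              # microcode revision
    lsmod | grep kvm                              # virtualization modules loaded
    cat /sys/module/kvm_intel/parameters/nested   # is nested virt enabled?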
19:54:05 #link https://review.opendev.org/671445 Linaro London: use new bigger flavour
19:54:20 clarkb: corvus: these are questions i too have asked
19:54:32 in defense of the request, devstack + tempest does swap now
19:54:46 please follow up on that change so i'm not the only one ;)
19:54:47 and I'm sure that is part of the slowness, but the fixing should involve figuring out why we've used so much more memory than in the past
19:54:50 fungi: will do
19:55:00 really quickly before we run out of time
19:55:10 I submitted PTG surveys for the opendev infra team and gitea as separate requests
19:55:40 That means if you are going to be in Shanghai some group of us likely will be as well
19:55:54 #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019 Start planning the next PTG
19:56:11 That has no content yet, but it is there for people to start putting ideas up (I know it is early so no pressure)
19:56:17 * mordred will be in Shanghai
19:56:23 "TODO"
19:57:17 i'm so scattered i honestly can't recall whether i've filled out the opendev infra survey
19:57:36 fungi: it's for me to fill out and you not to worry about
19:57:41 ahh, got i
19:57:43 y
19:57:45 t
19:57:46 fungi: basically our request to the foundation that we want space at the PTG
19:57:52 * fungi gives up on typing again
19:57:53 #topic Open Discussion
19:58:03 We have a couple minutes for anything else that may have been missed
19:59:16 Thank you for your time and we'll see you next week
19:59:23 thanks clarkb!
19:59:25 my typing is suffering now too :)
20:00:05 #endmeeting