19:01:07 <clarkb> #startmeeting infra
19:01:08 <openstack> Meeting started Tue Aug  6 19:01:07 2019 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:22 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2019-August/006437.html Today's Agenda
19:02:06 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting Edit the agenda at least 24 hours before our scheduled meeting to get items on the agenda
19:02:34 <clarkb> #topic Announcements
19:03:13 <clarkb> Next week I will be attending foundation staff meetings and will not be able to run our weekly meeting. I expect fungi is in the same boat. We will need a volunteer other than clarkb or fungi to chair the meeting
19:03:20 <clarkb> or we can decide to skip it if people prefer that
19:03:27 <fungi> yup
19:03:41 <clarkb> Also expect that I won't be much help next week in general
19:03:58 <fungi> the boat is probably a metaphor
19:04:19 <fungi> i won't be bringing any fishing gear
19:04:45 <clarkb> #topic Actions from last meeting
19:04:54 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-07-30-19.01.txt minutes from last meeting
19:05:03 <clarkb> I think mordred did github things last week
19:05:16 <fungi> he did indeed
19:05:19 <clarkb> github.com/openstack-infra repos should all be updated with a note on where they can now be found as well as archived
19:05:29 <diablo_rojo> o/
19:05:30 <clarkb> mordred: was the opendev admin account created too?
19:05:31 <fungi> s/updated with/replaced by/
19:06:19 <corvus> i think i saw mordred say he did that
19:06:29 <clarkb> awesome
19:06:47 <clarkb> The other action listed was updating the gitea sshd container to log its sshd logs
19:07:03 <clarkb> I do not think this happened; however, I can take a look at that today so I'll assign the action to myself
19:07:18 <clarkb> #action clarkb Have gitea sshd logs recorded somewhere
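One possible way to close out that action item, sketched on the assumption that the gitea sshd service runs under docker-compose (the service name and syslog tag below are illustrative, not the production values), is to switch the container to the syslog logging driver so sshd output lands in the host's syslog:

    services:
      gitea-ssh:
        logging:
          driver: syslog
          options:
            tag: gitea-ssh   # messages then appear in the host's /var/log/syslog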
19:08:45 <clarkb> #topic Priority Efforts
19:08:52 <mordred> o/
19:08:56 <clarkb> #topic OpenDev
19:09:05 <clarkb> That is a good jump into recent opendev things
19:09:07 <mordred> yes - I did github things
19:09:17 <clarkb> we do still have the OOM problem however it seems less prevalent
19:09:18 <mordred> and the opendevadmin account
19:09:21 <clarkb> mordred: tyty
19:09:35 <clarkb> #link https://etherpad.openstack.org/p/debugging-gitea08-OOM
19:09:50 <clarkb> Last week I dug into that a bit and tried to collect my thoughts there
19:10:09 <clarkb> it would probably be good if someone else could review that and see if I missed anything obvious and consider my ideas there
19:10:22 <corvus> we have no reason to think that gitea 1.9.0 will improve anything, but mordred has a patch to upgrade; assuming we move forward that will be a variable change
19:10:26 <mordred> clarkb: I pushed up a patch just a little bit ago to upgrade us to gitea 1.9 - it's also possible that 1.9 magically fixes the oom problems
19:10:40 <mordred> corvus: yes - I agree - I have no reason to believe it fixes anything
19:11:04 <fungi> right, it's also possible magical elves will fix the oom ;)
19:11:05 <mordred> so we could also hold off so as not to move variables
19:11:15 <clarkb> ya I think the memory issues largely come down to big git repos being a problem and gitea holding open requests against big git repos for significant periods of time so they pile up
19:11:15 <corvus> yeah, but unfounded optimism is great, i endorse it :)
19:11:41 <mordred> \o/
19:11:44 <corvus> i don't think we need to hold back for further debugging; i think we should do the 1.9.0 upgrade and just be aware of the change
19:11:56 <clarkb> we have a 2 minute haproxy timeout (and haproxy seemed to time out these requests because i could not map them to gitea logs based on timestamps) but gitea logs show requests going on for hours
19:11:58 <clarkb> corvus: ++
19:12:16 <clarkb> one idea I had was maybe digging into having gitea time out requests because a single 500 error is better than OOMing and crashing gitea
19:12:20 <fungi> an option there, just spitballing, might be to use an haproxy health check which includes some resource metrics like memory
19:12:22 <corvus> clarkb: oh that's an interesting data point i missed
19:12:30 <clarkb> (I have not done that yet, but possibly can later this week)
19:12:31 <fungi> probably involves running a check agent on the gitea servers though
19:12:43 <corvus> clarkb: if the remote side has hung up, i don't see why gitea should continue whatever it was doing
19:12:46 <clarkb> corvus: ya that is at the bottom of the etherpad
19:12:48 <clarkb> corvus: exactly
19:13:03 <fungi> but that would force additional requests to get redistributed if there's a pileup on one of the backends
19:13:14 <fungi> on the other hand it could just end up taking the entire pool offline
19:13:38 <clarkb> fungi: ya I think if this continues to be a problem (we've failed at making gitea/git more efficient first) then improving haproxy load balancing methods is our next step
19:14:39 <clarkb> corvus: you have some grasp of the gitea code base maybe I can take a look at it later this week and if I run into problems ask for help?
19:14:49 <corvus> clarkb: absolutely
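To make the two ideas above concrete (capping request lifetimes inside gitea, and guarding against memory pileups at the load balancer), here is a minimal sketch; the timeout values, server names, and agent port are assumptions rather than the current production config:

    # gitea app.ini: cap how long git subprocesses may run (seconds)
    [git.timeout]
    DEFAULT = 120
    CLONE   = 300
    PULL    = 300

    # haproxy backend: an agent-check lets a small agent on each gitea host
    # report memory pressure and drain itself before the backend OOMs
    backend balance_git_https
        balance leastconn
        timeout server 2m
        server gitea01 gitea01.opendev.org:3000 check agent-check agent-port 9123 agent-inter 5s
        server gitea02 gitea02.opendev.org:3000 check agent-check agent-port 9123 agent-inter 5s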
19:15:30 <clarkb> great. Any other opendev related business before we move on?
19:15:52 <corvus> oh one thing
19:16:24 <corvus> i think tobiash identified the underlying cause of the zuul executors ooming; the fix has merged and if we restart them, things should be better
19:16:31 <corvus> this is the thing ianw discovered
19:16:42 <clarkb> corvus: is that related to the executor gearman worker class fix?
19:16:44 <corvus> https://review.opendev.org/674762 is the fix
19:16:45 <corvus> yes
19:17:08 <corvus> uneven distribution of jobs makes executors use too much memory and either the log streaming process gets killed, or the executor itself (making the problem worse)
19:17:36 <corvus> what's really cool is --
19:17:53 <corvus> if you take a look at the graphs right now, you can actually see that some of them are graphs of noisy neighbors
19:18:02 <corvus> (they have oscillations which have no relationship to zuul itself)
19:18:29 <corvus> because absent our leveling algorithm, external inputs like hypervisor load and network topology have an outsize effect
19:18:58 <clarkb> because gearman is a "you get jobs as quickly as you can service them" system
19:19:07 <corvus> yep
19:19:37 <corvus> nanoseconds count
19:20:04 <clarkb> #topic Update Config Management
19:20:16 <clarkb> ianw has changes up to deploy an ansible based/managed backup server
19:20:20 * clarkb finds links
19:20:33 <ianw> #link https://review.opendev.org/674549
19:20:39 <clarkb> ianw wins
19:20:42 <ianw> #link https://review.opendev.org/674550
19:20:59 <ianw> i can fiddle with that today if i can get some eyes and see if we can get something backing up to it
19:21:26 <ianw> should run in parallel with existing backups, so no flag day etc
19:22:04 <clarkb> ianw: ya the original backup design had us backing up to two locations anyway
19:22:09 <ianw> (well, no client is opted into it yet either, the first one i'll babysit closely)
19:22:12 <clarkb> I expect that the puppetry will handle that fine
19:23:01 <clarkb> I think we are in a good spot to start pushing on the CD stuff too?
19:23:18 <clarkb> corvus: ^ I unfortunately tend to page that out more than I should. You probably know what the next step is there
19:23:38 <corvus> for jobs triggered by changes to system-config, yes
19:23:45 <corvus> for the dns stuff, no
19:24:02 <corvus> https://review.opendev.org/671637 is the current hangup for that
19:24:19 <clarkb> #link https://review.opendev.org/671637 Next step for CD'ing changes to system-config
19:24:32 <corvus> that's how i wanted to solve the problem, but logan pointed out a potentially serious problem
19:25:14 <corvus> so we either need to put more brainpower into that, or adopt one of our secondary plans (such as, opendev takes over the zuul-ci.org zone from the zuul project).  that could be a temporary thing until we have the brainpower to solve it better.
19:26:01 <clarkb> er that is for the dns stuff not system-config right?
19:26:03 <clarkb> #undo
19:26:04 <openstack> Removing item from minutes: #link https://review.opendev.org/671637
19:26:10 <corvus> clarkb: correct
19:26:11 <clarkb> #link https://review.opendev.org/671637 Next step for CD'ing changes to DNS zones
19:26:27 <corvus> no known obstacles to triggering cd jobs from changes to system-config
19:26:31 <clarkb> got it
19:26:56 <clarkb> Anything else on this subject?
19:27:10 <corvus> (project-config is probably ok too)
19:28:20 <clarkb> #topic Storyboard
19:28:36 <fungi> no updates for sb this week that i'm aware of
19:28:42 <clarkb> fungi: I know mnaser reported some slowness with the dev server, but I believe that was tracked back to sql queries actually being slow?
19:28:48 * mordred did not help on the sql this past week
19:28:51 <clarkb> (so there isn't an operational change we need to be aware of?)
19:29:00 * mordred will endeavor to do so again this week
19:29:17 <fungi> it seemed to be the same behavior, yes. if i tested the same query i saw mysql occupy 100% of a vcpu until the query returned
19:29:27 <fungi> (api query, i mean)
19:29:45 <clarkb> diablo_rojo: Do you have anything to add?
19:29:47 <diablo_rojo> mordred, should I just actively bother you in like...two days or something? Would that be helpful?
19:30:10 <fungi> so the bulk of the wait was in one or more database queries presumably
19:30:25 <diablo_rojo> clarkb, the only other thing we did during the meeting last week was start to talk about how we want to do onboarding in Shanghai: one session for users and another for contributors
19:30:29 <diablo_rojo> That's all.
19:30:58 <clarkb> diablo_rojo: related to that can I assume that you have or will handle space allocation for storyboard? or should I make a formal request similar to what I did for infra?
19:31:01 <fungi> ahh, yep, and covered that trying to facilitate remote participation in shanghai might be harder than normal
19:31:07 <mordred> diablo_rojo: yeah - actually - if you don't mind
19:31:34 <mordred> diablo_rojo: I keep remembering on tuesday morning - when I look at the schedule and think "oh, infra meeting"
19:31:40 <diablo_rojo> mordred, happy to be annoying ;) I'll try to find/create a quality gif or meme for your reminder
19:31:45 <mordred> hah
19:31:50 <diablo_rojo> mordred, lol
19:32:14 <clarkb> diablo_rojo: just let me know if I need to do anything official like for storyboard presence in shanghai. Happy to do so
19:32:15 <diablo_rojo> clarkb, I will handle space for StoryBoard :)
19:32:19 <clarkb> awesome
19:32:24 <diablo_rojo> clarkb, I know the person with the form ;)
19:32:32 <clarkb> indeed
19:32:37 <diablo_rojo> ^^ bad joke I will continue to make
19:33:00 <clarkb> #topic General Topics
19:33:07 <clarkb> First up is trusty server replacements.
19:33:15 <clarkb> fungi: are you planning to do the testing of wiki-dev02?
19:33:29 <clarkb> iirc planned next steps was to redeploy it to make sure puppet works from scratch?
19:33:48 <fungi> yep, have been sidetracked by other responsibilities unfortunately, but that's still high on my to do list
19:34:16 <fungi> wiki-dev02 can simply be deleted and re-launched at any time
19:34:26 <fungi> nothing is using it
19:34:29 <fungi> i'll try to get to that this week
19:34:32 <clarkb> thank you
19:34:53 <clarkb> corvus has also made great progress with the swift log storage (which means we can possibly get rid of logs.openstack.org)
19:35:05 <clarkb> corvus: at this point you are working through testing of individual cloud behaviors?
19:35:57 <corvus> clarkb: yes, i believe rax and vexxhost are ready, confirming ovh now (i expect it's good)
19:36:09 <corvus> so we'll be able to randomly store logs in one of six regions
19:36:32 <clarkb> and I know you intended to switch over to the zuul logs tab with logs.o.o backing it first. Are we ready to start planning that move or do we want to have the swift stuff ready to happen shortly after?
19:36:48 <corvus> (job/log region proximity would be nice, but not relevant at the moment since our logs still go through the executor)
19:37:14 <corvus> yeah, we're currently waiting out a deprecation period for one of the roles which ends monday
19:37:52 <clarkb> exciting we might be switched over next week then?
19:37:59 <corvus> after that, i think we can switch to zuul build page as the reporting target (but we need a change to zuul to enable that behavior)
19:38:20 <corvus> and then i think we'll have the swift stuff ready almost immediately after that
19:38:45 <clarkb> that is great news
19:38:49 <corvus> maybe we plan for a week between the two changes, just to give time for issues to shake out
19:38:55 <clarkb> wfm
19:39:02 <mordred> ++
19:39:11 <fungi> yes, it's timely, given we've had something like 3 disruptions to the current log storage in a couple weeks time
19:39:25 <corvus> though... hrm, timing might be tight on that cause i leave for gerrit user summit soon
19:40:14 <corvus> i leave on aug 22
19:40:21 <clarkb> we can probably take our time then and do the switches we are comfortable with bit by bit as people are around to monitor
19:40:28 <corvus> assuming we don't want to merge it the day before i leave, we really only have next week to work with
19:40:43 <corvus> i return sept 3
19:40:47 <clarkb> k
19:41:06 <corvus> so we either do both things next week, or build page next week and swift in september
19:41:36 * mnaser curious which swifts are being used
19:41:39 <clarkb> and we can probably decide on when to do swift based on how smoothly the build logs tag change goes?
19:41:55 <clarkb> s/tag/tab/
19:42:04 <corvus> mnaser: vexxhost, rax, ovh
19:42:20 <mnaser> just wondering how much data is expected to be likely hosted?
19:42:28 <clarkb> and fortnebula has hinted that a swift install there might happen too
19:42:57 <corvus> mnaser: i think we're currently estimating about 2GB for starters (much less than when we initially discussed it with you)
19:43:05 <corvus> er 2TB
19:43:31 <mnaser> cool, thank you for that info
19:43:33 * mnaser hides again
19:43:37 <clarkb> Which is a good lead into the next agenda item. State of the clouds
19:43:55 <clarkb> I wanted to quickly give a status update on fn and was hoping mordred could fill us in on any changes with MOC
19:44:07 <clarkb> fn is now providing 100 test instances and we seem to be quite stable there now
19:44:21 <clarkb> We have noticed odd mirror throughput when pulling things from afs
19:44:46 <mordred> app-creds are working in moc now - so next steps are getting the second account created and creating the mirror node
19:44:49 <clarkb> if we manually pull cold files we get about 1MBps and if we pull a warm file we get about 270MBps. But yum installing packages reports 12MBps
19:45:08 <clarkb> I am not sure that the afs mirror performance behavior is a major issue as the impact on job runtimes is low
19:45:12 <clarkb> but something I wanted to make note of
19:45:17 <clarkb> mordred: exciting
19:45:37 <corvus> clarkb: only yum?
19:45:49 <mnaser> yum being slow is nothing new :(
19:45:50 <clarkb> corvus: I haven't looked at the other package managers yet, but the examples donnyd dug up were yum
19:45:57 <clarkb> corvus: but that is a good point we should check apt-get too
19:46:07 <mnaser> OSA's centos jobs take almost twice as long and there isn't a lot of different things happening
19:46:11 <mnaser> for context
19:46:21 <clarkb> mnaser: good to know
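For reference, the cold/warm numbers above come from manual fetches; something like the following (the mirror URL and package path are placeholders) reproduces the measurement and is easy to compare against the rates apt or yum report:

    $ curl -sf -o /dev/null -w 'downloaded at %{speed_download} bytes/sec\n' \
        https://<mirror>/fedora/releases/30/Everything/x86_64/os/Packages/<some-package>.rpm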
19:46:45 <clarkb> mordred: do you need anything to push MOC along or is that mostly you filing a ticket/request for the second account?
19:47:13 <mordred> nope- just filing a ticket
19:47:25 <donnyd> I'm a little late to the party, but swift will surely be happening. Just a matter of when
19:48:09 <clarkb> great
19:48:28 <clarkb> next up is making note of a couple of our distro mirrors' recent struggles
19:48:33 <ianw> re mirrors i think overall some macro apache throughput stats would be useful, for this and also for kafs comparisons.  working on some ideas
19:48:42 <clarkb> ianw: thanks!
19:49:02 <clarkb> fungi has found reprepro won't create a repo until it has packages (even if an empty repo exists upstream)
19:49:15 <clarkb> this is causing problems for debian buster jobs as buster updates does not exist
19:49:27 <clarkb> fungi: ^ have we managed to work around that yet?
19:49:47 <ianw> oh ... maybe a skip like we added for the security when it wasn't there?
19:49:51 <fungi> yeah, the first buster stable point release is scheduled to happen a month from tomorrow, so the buster-updates suite won't exist until then
19:50:00 <fungi> or rather it will have no packages in it until then
19:50:16 <fungi> ianw: well, except we also add it to sources.list on test nodes
19:50:22 <fungi> so would need to actually omit that
19:50:58 <fungi> or find a way to convince reprepro to generate an empty suite, but i haven't been able to identify a solution in that direction
19:51:05 <ianw> ahh .. can we just fake an empty something?
19:51:18 <clarkb> fungi: if we touch a couple files does that result in a valid empty repo or is it more involved than that?
19:51:39 <clarkb> or maybe even mirror empty repos exactly as upstream rather than building from scratch
19:51:47 <fungi> it's mostly that...
19:52:37 <fungi> i mean, sure we could fake one but need to then find a way to prevent reprepro from removing it
19:52:48 <clarkb> ah
19:52:50 <fungi> since it's a suite in an existing mirror
19:53:11 <fungi> existing package repository i mean
19:53:34 <fungi> so not as simple as something like debian-security which is a different repository we're mirroring separately
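A sketch of one short-term workaround discussed above (omitting the suite from the test-node sources.list until the first point release actually populates it); the mirror hostname and component list are placeholders:

    deb http://<mirror>/debian buster main
    deb http://<mirror>/debian-security buster/updates main
    # re-enable once the 10.1 point release exists and the mirror has content for it:
    # deb http://<mirror>/debian buster-updates main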
19:53:58 <clarkb> ok something to dig into more outside of the meeting I guess
19:54:06 <clarkb> we are almost at time and have a few more things to bring up really quickly
19:54:30 <fungi> yeah, we can move on
19:54:33 <clarkb> the fedora mirror has also been struggling. It did not update for about a month because a vos release timed out (presumably that is why the lock on the volume was held)
19:54:49 <clarkb> I have since manually updated it and returned the responsibility for updates to the mirror update server.
19:55:11 <clarkb> One thing I did though was to reduce the size of that mirror by removing virtualbox and vagrant image files, old atomic release files, and power pc files
19:55:25 <clarkb> That dropped repo size by about 200GB which should make vos releases quicker
19:55:28 <fungi> yeah, the debian mirror was similarly a month stale until we worked out which keys we should be verifying buster-backports with
19:55:43 <clarkb> that said it is still a large repo and we may want to further exclude things we don't need
19:55:55 <ianw> thanks; i haven't quite got f30 working which is why i guess nobody noticed ... we should be able to drop f28 then
19:56:00 <clarkb> I'm watching it now to make sure automatic updates work
19:56:15 <clarkb> ianw: I think tripleo depends on 28 to stand in for rhel8/centos8
19:56:28 <clarkb> ianw: so we might not be able to drop 28 until they also drop it, but ya that will also reduce the size
19:56:38 <fungi> also... magnum? uses f27 still right?
19:56:44 <clarkb> fungi: the atomic image only
19:56:46 <fungi> aha
19:56:50 <clarkb> fungi: which I don't think uses our mirrors
19:56:54 <fungi> got it
19:57:07 <clarkb> And finally we have PTG prep as a topic
19:57:16 <clarkb> friendly reminder we can start brainstorming topics if we have them
19:57:19 <clarkb> #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019
19:58:00 <clarkb> #topic Open Discussion
19:58:07 <clarkb> we have a couple minutes for any remaining items
19:58:17 <clarkb> I will be doing family things tomorrow so won't be around
19:58:19 <Shrews> fwiw, i think i've identified the sdk bug that is causing us to leak swift objects from uploading images to rax. if we fail to upload the final manifest after the segments, we don't retry and don't clean up after ourselves. seems to happen at least once every few days or so according to what logs we have.
19:58:53 <clarkb> Shrews: the manifest is the special object that tells swift about the multiobject file?
19:59:12 <mordred> yah
19:59:14 <Shrews> clarkb: correct. uploaded last
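For context, a minimal sketch of the segmented-upload pattern being described, using python-swiftclient; the container and object names are illustrative and this is not the actual sdk/nodepool code path:

    from swiftclient.client import Connection

    def read_chunks(path, size=512 * 1024 * 1024):
        with open(path, 'rb') as f:
            while True:
                data = f.read(size)
                if not data:
                    return
                yield data

    # placeholder credentials
    conn = Connection(authurl='https://identity.example.com/v2.0',
                      user='tenant:user', key='secret')

    # upload the large image as individual segment objects
    for i, chunk in enumerate(read_chunks('image.vhd')):
        conn.put_object('images_segments', 'image.vhd/%08d' % i, contents=chunk)

    # the manifest is a zero-byte object whose header points at the segment
    # prefix; until it is written there is no single downloadable object, and
    # if this step fails without a retry the segments are left orphaned
    conn.put_object('images', 'image.vhd', contents=b'',
                    headers={'X-Object-Manifest': 'images_segments/image.vhd/'})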
19:59:51 <corvus> \o/
20:00:20 <clarkb> and we are at time. That is an excellent find re image uploads. Thank you everyone!
20:00:23 <clarkb> #endmeeting