19:01:07 #startmeeting infra
19:01:08 Meeting started Tue Aug 6 19:01:07 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 The meeting name has been set to 'infra'
19:01:22 #link http://lists.openstack.org/pipermail/openstack-infra/2019-August/006437.html Today's Agenda
19:02:06 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting Edit the agenda at least 24 hours before our scheduled meeting to get items on the agenda
19:02:34 #topic Announcements
19:03:13 Next week I will be attending foundation staff meetings and will not be able to run our weekly meeting. I expect fungi is in the same boat. We will need a volunteer other than clarkb or fungi to chair the meeting
19:03:20 or we can decide to skip it if people prefer that
19:03:27 yup
19:03:41 Also expect that I won't be much help next week in general
19:03:58 the boat is probably a metaphor
19:04:19 i won't be bringing any fishing gear
19:04:45 #topic Actions from last meeting
19:04:54 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-07-30-19.01.txt minutes from last meeting
19:05:03 I think mordred did github things last week
19:05:16 he did indeed
19:05:19 github.com/openstack-infra repos should all be updated with a note on where they can now be found as well as archived
19:05:29 o/
19:05:30 mordred: was the opendev admin account created too?
19:05:31 s/updated with/replaced by/
19:06:19 i think i saw mordred say he did that
19:06:29 awesome
19:06:47 The other action listed was updating the gitea sshd container to log its sshd logs
19:07:03 I do not think this happened; however, I can take a look at that today so I'll assign the action to myself
19:07:18 #action clarkb Have gitea sshd logs recorded somewhere
19:08:45 #topic Priority Efforts
19:08:52 o/
19:08:56 #topic OpenDev
19:09:05 That is a good jump into recent opendev things
19:09:07 yes - I did github things
19:09:18 and the opendevadmin account
19:09:17 we do still have the OOM problem; however, it seems less prevalent
19:09:21 mordred: tyty
19:09:35 #link https://etherpad.openstack.org/p/debugging-gitea08-OOM
19:09:50 Last week I dug into that a bit and tried to collect my thoughts there
19:10:09 it would probably be good if someone else could review that, see if I missed anything obvious, and consider my ideas there
19:10:22 we have no reason to think that gitea 1.9.0 will improve anything, but mordred has a patch to upgrade; assuming we move forward that will be one more changed variable
19:10:26 clarkb: I pushed up a patch just a little bit ago to upgrade us to gitea 1.9 - it's also possible that 1.9 magically fixes the oom problems
19:10:40 corvus: yes - I agree - I have no reason to believe it fixes anything
19:11:04 right, it's also possible magical elves will fix the oom ;)
19:11:05 so we could also hold off so as not to move variables
19:11:15 ya I think the memory issues largely come down to big git repos being a problem and gitea holding open requests against big git repos for significant periods of time so they pile up
19:11:15 yeah, but unfounded optimism is great, i endorse it :)
19:11:41 \o/
19:11:44 i don't think we need to hold back for further debugging; i think we should do the 1.9.0 upgrade and just be aware of the change
19:11:56 we have a 2 minute haproxy timeout (and haproxy seemed to time out these requests because i could not map them to gitea logs based on timestamps) but gitea logs show requests going on for hours
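For reference, a minimal haproxy sketch of the pieces under discussion; the backend name, server names, addresses and ports are placeholders rather than the real load balancer config, and the agent-check options illustrate the resource-aware health check spitballed just below (it would require a small agent on each gitea backend reporting something like memory pressure).

    # illustrative only; names, addresses and ports are made up
    defaults
        mode    tcp
        timeout connect 10s
        timeout client  2m
        timeout server  2m   # the two minute cutoff mentioned above

    backend balance-git-https
        balance source
        # agent-check polls an external agent on each backend, which can
        # return a weight (or mark the server down) based on local metrics
        server gitea01 198.51.100.11:3081 check agent-check agent-port 9999 agent-inter 5s
        server gitea02 198.51.100.12:3081 check agent-check agent-port 9999 agent-inter 5s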
19:11:58 corvus: ++
19:12:16 one idea I had was maybe digging into having gitea time out requests, because a single 500 error is better than OOMing and crashing gitea
19:12:20 an option there, just spitballing, might be to use an haproxy health check which includes some resource metrics like memory
19:12:22 clarkb: oh that's an interesting data point i missed
19:12:30 (I have not done that yet, but possibly can later this week)
19:12:31 probably involves running a check agent on the gitea servers though
19:12:43 clarkb: if the remote side has hung up, i don't see why gitea should continue whatever it was doing
19:12:46 corvus: ya that is at the bottom of the etherpad
19:12:48 corvus: exactly
19:13:03 but that would force additional requests to get redistributed if there's a pileup on one of the backends
19:13:14 on the other hand it could just end up taking the entire pool offline
19:13:38 fungi: ya I think if this continues to be a problem (meaning we've failed at making gitea/git more efficient first) then improving haproxy load balancing methods is our next step
19:14:39 corvus: you have some grasp of the gitea code base; maybe I can take a look at it later this week and ask for help if I run into problems?
19:14:49 clarkb: absolutely
19:15:30 great. Any other opendev related business before we move on?
19:15:52 oh one thing
19:16:24 i think tobiash identified the underlying cause of the zuul executors ooming; the fix has merged and if we restart them, things should be better
19:16:31 this is the thing ianw discovered
19:16:42 corvus: is that related to the executor gearman worker class fix?
19:16:44 https://review.opendev.org/674762 is the fix
19:16:45 yes
19:17:08 uneven distribution of jobs makes executors use too much memory and either the log streaming process gets killed, or the executor itself (making the problem worse)
19:17:36 what's really cool is --
19:17:53 if you take a look at the graphs right now, you can actually see that some of them are graphs of noisy neighbors
19:18:02 (they have oscillations which have no relationship to zuul itself)
19:18:29 because absent our leveling algorithm, external inputs like hypervisor load and network topology have an outsize effect
19:18:58 because gearman is a "you get jobs as quickly as you can service them" system
19:19:07 yep
19:19:37 nanoseconds count
19:20:04 #topic Update Config Management
19:20:16 ianw has changes up to deploy an ansible-based/managed backup server
19:20:20 * clarkb finds links
19:20:33 #link https://review.opendev.org/674549
19:20:39 ianw wins
19:20:42 #link https://review.opendev.org/674550
19:20:59 i can fiddle that today if i can get some eyes and see if we can get something backing up to it
19:21:26 should run in parallel with existing backups, so no flag day etc
19:22:04 ianw: ya the original backup design had us backing up to two locations anyway
19:22:09 (well, no client is opted into it yet either, the first one i'll babysit closely)
19:22:12 I expect that the puppetry will handle that fine
19:23:01 I think we are in a good spot to start pushing on the CD stuff too?
19:23:18 corvus: ^ I unfortunately tend to page that out more than I should. You probably know what the next step is there
19:23:38 for jobs triggered by changes to system-config, yes
19:23:45 for the dns stuff, no
19:24:02 https://review.opendev.org/671637 is the current hangup for that
19:24:19 #link https://review.opendev.org/671637 Next step for CD'ing changes to system-config
19:24:32 that's how i wanted to solve the problem, but logan pointed out a potentially serious issue
19:25:14 so we either need to put more brainpower into that, or adopt one of our secondary plans (such as opendev taking over the zuul-ci.org zone from the zuul project). that could be a temporary thing until we have the brainpower to solve it better.
19:26:01 er that is for the dns stuff not system-config right?
19:26:03 #undo
19:26:04 Removing item from minutes: #link https://review.opendev.org/671637
19:26:10 clarkb: correct
19:26:11 #link https://review.opendev.org/671637 Next step for CD'ing changes to DNS zones
19:26:27 no known obstacles to triggering cd jobs from changes to system-config
19:26:31 got it
19:26:56 Anything else on this subject?
19:27:10 (project-config is probably ok too)
19:28:20 #topic Storyboard
19:28:36 no updates for sb this week that i'm aware of
19:28:42 fungi: I know mnaser reported some slowness with the dev server, but I believe that was tracked back to sql queries actually being slow?
19:28:48 * mordred did not help on the sql this past week
19:28:51 (so there isn't an operational change we need to be aware of?)
19:29:00 * mordred will endeavor to do so again this week
19:29:17 it seemed to be the same behavior, yes. when i tested the same query i saw mysql occupy 100% of a vcpu until the query returned
19:29:27 (api query, i mean)
19:29:45 diablo_rojo: Do you have anything to add?
19:29:47 mordred, should I just actively bother you in like... two days or something? Would that be helpful?
19:30:10 so the bulk of the wait was presumably in one or more database queries
19:30:25 clarkb, the only other thing we did during the meeting last week was start to talk about how we want to do onboarding in Shanghai: one session for users and another for contributors
19:30:29 That's all.
19:30:58 diablo_rojo: related to that, can I assume that you have or will handle space allocation for storyboard? or should I make a formal request similar to what I did for infra?
19:31:01 ahh, yep, and covered that trying to facilitate remote participation in shanghai might be harder than normal
19:31:07 diablo_rojo: yeah - actually - if you don't mind
19:31:34 diablo_rojo: I keep remembering on tuesday morning - when I look at the schedule and think "oh, infra meeting"
19:31:40 mordred, happy to be annoying ;) I'll try to find/create a quality gif or meme for your reminder
19:31:45 hah
19:31:50 mordred, lol
19:32:14 diablo_rojo: just let me know if I need to do anything official for storyboard presence in shanghai. Happy to do so
19:32:15 clarkb, I will handle space for StoryBoard :)
19:32:19 awesome
19:32:24 clarkb, I know the person with the form ;)
19:32:32 indeed
19:32:37 ^^ bad joke I will continue to make
19:33:00 #topic General Topics
19:33:07 First up is trusty server replacements.
19:33:15 fungi: are you planning to do the testing of wiki-dev02?
19:33:29 iirc the planned next step was to redeploy it to make sure puppet works from scratch?
19:33:48 yep, have been sidetracked by other responsibilities unfortunately, but that's still high on my to-do list
19:34:16 wiki-dev02 can simply be deleted and re-launched at any time
19:34:26 nothing is using it
19:34:29 i'll try to get to that this week
19:34:32 thank you
19:34:53 corvus has also made great progress with the swift log storage (which means we can possibly get rid of logs.openstack.org)
19:35:05 corvus: at this point you are working through testing of individual cloud behaviors?
19:35:57 clarkb: yes, i believe rax and vexxhost are ready, confirming ovh now (i expect it's good)
19:36:09 so we'll be able to randomly store logs in one of six regions
19:36:32 and I know you intended to switch over to the zuul logs tab with logs.o.o backing it first. Are we ready to start planning that move or do we want to have the swift stuff ready to happen shortly after?
19:36:48 (job/log region proximity would be nice, but not relevant at the moment since our logs still go through the executor)
19:37:14 yeah, we're currently waiting out a deprecation period for one of the roles which ends monday
19:37:52 exciting, we might be switched over next week then?
19:37:59 after that, i think we can switch to the zuul build page as the reporting target (but we need a change to zuul to enable that behavior)
19:38:20 and then i think we'll have the swift stuff ready almost immediately after that
19:38:45 that is great news
19:38:49 maybe we plan for a week between the two changes, just to give time for issues to shake out
19:38:55 wfm
19:39:02 ++
19:39:11 yes, it's timely, given we've had something like 3 disruptions to the current log storage in a couple weeks' time
19:39:25 though... hrm, timing might be tight on that 'cause i leave for gerrit user summit soon
19:40:14 i leave on aug 22
19:40:21 we can probably take our time then and do the switches we are comfortable with bit by bit as people are around to monitor
19:40:28 assuming we don't want to merge it the day before i leave, we really only have next week to work with
19:40:43 i return sept 3
19:40:47 k
19:41:06 so we either do both things next week, or build page next week and swift in september
19:41:36 * mnaser curious which swifts are being used
19:41:39 and we can probably decide on when to do swift based on how smoothly the build logs tag change goes?
19:41:55 s/tag/tab/
19:42:04 mnaser: vexxhost, rax, ovh
19:42:20 just wondering how much data is likely to be hosted?
19:42:28 and fortnebula has hinted that a swift install there might happen too
19:42:57 mnaser: i think we're currently estimating about 2GB for starters (much less than when we initially discussed it with you)
19:43:05 er 2TB
19:43:31 cool, thank you for that info
19:43:33 * mnaser hides again
19:43:37 Which is a good lead into the next agenda item: state of the clouds
19:43:55 I wanted to quickly give a status update on fn and was hoping mordred could fill us in on any changes with MOC
19:44:07 fn is now providing 100 test instances and we seem to be quite stable there now
19:44:21 We have noticed odd mirror throughput when pulling things from afs
19:44:46 app-creds are working in moc now - so next steps are getting the second account created and creating the mirror node
19:44:49 if we manually pull cold files we get about 1MBps and if we pull a warm file we get about 270MBps. But yum installing packages reports 12MBps
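A rough way to reproduce that cold-versus-warm comparison from a test node (a sketch only; the mirror hostname and package path are made up):

    # Fetch the same file twice from the mirror and report throughput; the
    # first request lands on a cold AFS cache, the second on a warm one.
    import time
    import requests

    URL = ("https://mirror.example.opendev.org/fedora/releases/30/"
           "Everything/x86_64/os/Packages/k/kernel-core.rpm")

    for label in ("cold", "warm"):
        start = time.monotonic()
        size = len(requests.get(URL).content)
        elapsed = time.monotonic() - start
        print("%s: %.1f MB/s" % (label, size / elapsed / 1e6))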
19:45:08 I am not sure that the afs mirror performance behavior is a major issue as the impact on job runtimes is low
19:45:12 but something I wanted to make note of
19:45:17 mordred: exciting
19:45:37 clarkb: only yum?
19:45:49 yum being slow is nothing new :(
19:45:50 corvus: I haven't looked at the other package managers yet, but the examples donnyd dug up were yum
19:45:57 corvus: but that is a good point; we should check apt-get too
19:46:07 OSA's centos jobs take almost twice as long and there aren't a lot of different things happening
19:46:11 for context
19:46:21 mnaser: good to know
19:46:45 mordred: do you need anything to push MOC along or is that mostly you filing a ticket/request for the second account?
19:47:13 nope - just filing a ticket
19:47:25 I'm a little late to the party, but swift will surely be happening. Just a matter of when
19:48:09 great
19:48:28 next up is making note of a couple of our distro mirrors' recent struggles
19:48:33 re mirrors i think overall some macro apache throughput stats would be useful, for this and also for kafs comparisons. working on some ideas
19:48:42 ianw: thanks!
19:49:02 fungi has found reprepro won't create a repo until it has packages (even if an empty repo exists upstream)
19:49:15 this is causing problems for debian buster jobs as buster-updates does not exist
19:49:27 fungi: ^ have we managed to work around that yet?
19:49:47 oh ... maybe a skip like we added for the security repo when it wasn't there?
19:49:51 yeah, the first buster stable point release is scheduled to happen a month from tomorrow, so the buster-updates suite won't exist until then
19:50:00 or rather it will have no packages in it until then
19:50:16 ianw: well, except we also add it to sources.list on test nodes
19:50:22 so we would need to actually omit that
19:50:58 or find a way to convince reprepro to generate an empty suite, but i haven't been able to identify a solution in that direction
19:51:05 ahh .. can we just fake an empty something?
19:51:18 fungi: if we touch a couple files does that result in a valid empty repo or is it more involved than that?
19:51:39 or maybe even mirror empty repos exactly as upstream rather than building from scratch
19:51:47 it's mostly that...
19:52:37 i mean, sure we could fake one but we'd then need to find a way to prevent reprepro from removing it
19:52:48 ah
19:52:50 since it's a suite in an existing mirror
19:53:11 existing package repository i mean
19:53:34 so not as simple as something like debian-security which is a different repository we're mirroring separately
19:53:58 ok, something to dig into more outside of the meeting I guess
19:54:06 we are almost at time and have a few more things to bring up really quickly
19:54:30 yeah, we can move on
19:54:33 the fedora mirror has also been struggling. It did not update for about a month because a vos release timed out (presumably that is why the lock on the volume was held)
19:54:49 I have since manually updated it and returned the responsibility for updates to the mirror update server.
19:55:11 One thing I did though was to reduce the size of that mirror by removing virtualbox and vagrant image files, old atomic release files, and powerpc files
19:55:25 That dropped the repo size by about 200GB which should make vos releases quicker
19:55:28 yeah, the debian mirror was similarly a month stale until we worked out which keys we should be verifying buster-backports with
19:55:43 that said it is still a large repo and we may want to further exclude things we don't need
19:55:55 thanks; i haven't quite got f30 working, which is why i guess nobody noticed ... we should be able to drop f28 then
19:56:00 I'm watching it now to make sure automatic updates work
19:56:15 ianw: I think tripleo depends on 28 to stand in for rhel8/centos8
19:56:28 ianw: so we might not be able to drop 28 until they also drop it, but ya that will also reduce the size
19:56:38 also... magnum? uses f27 still right?
19:56:44 fungi: the atomic image only
19:56:46 aha
19:56:50 fungi: which I don't think uses our mirrors
19:56:54 got it
19:57:07 And finally we have PTG prep as a topic
19:57:16 friendly reminder we can start brainstorming topics if we have them
19:57:19 #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019
19:58:00 #topic Open Discussion
19:58:07 we have a couple minutes for any remaining items
19:58:17 I will be doing family things tomorrow so won't be around
19:58:19 fwiw, i think i've identified the sdk bug that is causing us to leak swift objects from uploading images to rax. if we fail to upload the final manifest after the segments, we don't retry and don't clean up after ourselves. seems to happen at least once every few days or so according to what logs we have.
19:58:53 Shrews: the manifest is the special object that tells swift about the multi-object file?
19:59:12 yah
19:59:14 clarkb: correct. uploaded last
19:59:51 \o/
20:00:20 and we are at time. That is an excellent find re image uploads. Thank you everyone!
20:00:23 #endmeeting
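As a footnote on the swift object leak Shrews describes above: these image uploads store the data as numbered segment objects first and write a small manifest object last. A rough python-swiftclient sketch of that final step and the cleanup the bug reportedly skips (the container and object names here are hypothetical, and the real code path is in openstacksdk rather than this library):

    from swiftclient.exceptions import ClientException

    def upload_segmented_image(conn, chunks):
        """conn is a swiftclient Connection; chunks yields the image data."""
        prefix = "image-1234/"                  # hypothetical naming scheme
        segment_names = []
        for i, chunk in enumerate(chunks):
            name = "%s%08d" % (prefix, i)
            conn.put_object("images_segments", name, contents=chunk)
            segment_names.append(name)

        try:
            # The zero-byte manifest object goes in last; readers of
            # images/image-1234 then get the concatenated segments back.
            conn.put_object(
                "images", "image-1234", contents=b"",
                headers={"X-Object-Manifest": "images_segments/" + prefix})
        except ClientException:
            # If the manifest upload fails and nothing retries it or deletes
            # the segments, the segments are left behind as leaked objects.
            for name in segment_names:
                conn.delete_object("images_segments", name)
            raise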