19:01:37 #startmeeting infra
19:01:38 Meeting started Tue Jul 23 19:01:37 2019 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:39 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:41 The meeting name has been set to 'infra'
19:01:44 #link http://lists.openstack.org/pipermail/openstack-infra/2019-July/006425.html
19:01:48 aloha
19:02:56 #topic Announcements
19:03:13 There weren't any recorded on the agenda
19:03:20 #topic Actions from last meeting
19:03:26 #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-07-16-19.01.txt minutes from last meeting
19:03:34 mordred: any updates on the github cleanup work?
19:05:54 clarkb: nope. totally been doing gerrit instead. also, I keep hitting github rate limits and then doing something else
19:06:09 mordred: we need to convert you into a github app then
19:06:16 #action mordred clean up openstack-infra github org
19:06:25 #action mordred create opendevadmin account on github
19:07:43 #topic Priority Efforts
19:07:52 #topic OpenDev
19:08:04 Let's start here since a fair bit has gone on with gitea in the last week or so
19:08:13 indeed it has
19:08:38 We've discovered that gitea services can cause the OOM killer to be invoked. This often targets git processes for killing. If this happens when gerrit is replicating to gitea we can lose that replication event
19:09:08 synopsis: corrupt git repo or missing objects
19:09:19 I've put 1GB of swap (via swapfile) on gitea01-05,07-08 and 8GB on gitea06. The reason for the difference in size is that 06 was rebuilt with a much larger root disk and has space for the swapfile; the others are smaller and don't have much room
19:09:28 1GB of swap is not sufficient to avoid these errors
19:09:49 But 8GB appears to have been sufficient, so we are rebuilding gitea01 to be like gitea06 and will likely roll through all the other backends to do the same thing
19:10:24 In this process we've discovered a few deficiencies with how we deploy gitea and manage haproxy. Fixes for those are all merged now, with the exception of how to gracefully restart haproxy when haproxy's image updates
19:10:31 old instance/volume deleted, new bfv instance exists now and needs data copied over
19:10:42 I don't know how to handle that last case and will need to think about it more (and read up on docker-compose)
19:12:05 clarkb: Is there an agenda?
19:12:12 yeah, manuals say sigusr1 is for that
19:12:19 donnyd: ya http://lists.openstack.org/pipermail/openstack-infra/2019-July/006425.html
19:12:24 if i'm reading correctly
19:12:36 Sorry. I didn't scroll all the way down.
19:12:37 fungi: ya so it will depend on whether or not we can have docker-compose somehow send signals when it restarts stuff
19:12:53 fungi: we may have to break this out of docker-compose? or accept that images don't update often
19:12:57 (and haproxy restarts are quick)
19:13:06 need to do more research
19:13:42 I mean - I think we always have ansible running docker-compose things
19:13:56 so it's reasonable for ansible to tell docker-compose to send a signal or whatever
19:14:06 maybe?
19:14:07 http://cavaliercoder.com/blog/restarting-services-in-docker-compose.html ??
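[A minimal sketch of the approach being discussed: ansible telling docker-compose to signal the running haproxy container before it is replaced. The "haproxy" service name and the exact signal are illustrative assumptions, not the deployed configuration.]
    # ask the old haproxy for a graceful ("soft") stop; SIGUSR1 is haproxy's
    # soft-stop signal, letting existing connections finish
    docker-compose kill -s SIGUSR1 haproxy
    # then recreate the service container from the freshly pulled image
    docker-compose up -d haproxy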
19:14:13 mordred: right but the image replacement is all a bit automagic in docker compose
19:14:14 * fungi just wrote a handler which does that
19:14:36 clarkb: ah yes - that's an excellent point
19:14:54 mordred: so either we can hook into that or we stop relying on it and manually break out the steps
19:14:57 we don't actually want to do this on every ansible run - only if the result of the pull would cause a restart
19:15:04 correct
19:15:06 yeah, image updates will require more than just sending a signal to the running daemon
19:15:39 happily, haproxy is generally pretty boring and stable
19:15:55 corvus: yup (today is an exception and there should be a new image soon, but in general that is true)
19:16:22 I think that is the least urgent item to sort out for impactless updates
19:16:29 the urgent ones have all been addressed, thank you
19:16:47 We should keep an eye out for unexpected behavior going forward, as digging into the problems today was really helpful
19:17:00 Are there any other opendev-related changes we should be aware of?
19:18:18 Sounds like no. Onward
19:18:23 #topic Update Config Management
19:18:35 mordred has been working on getting gerrit into docker
19:18:53 \o/
19:18:54 mordred: anything you want to call out for that activity?
19:18:58 yes. I just pushed up a new rev
19:19:07 uhm - mainly it's definitely ready for review now
19:19:18 and clarkb just found a good pile of gotchas for 2.13
19:19:26 so review is much appreciated
19:19:32 #link https://review.opendev.org/671457 Gerrit docker image builds ready for review
19:19:37 i think when that's in place https://review.opendev.org/630406 is going to show us any problems with it
19:19:46 the general idea is making a 2.13 image that works pretty much just like our current 2.13 install
19:19:56 there's currently a problem with the last image we built (6mo ago); i doubt it's fixed, but we'll be able to iterate then
19:20:02 corvus: agree - although there are a few things, like heapSize, that we'll want to think about?
19:20:06 corvus: ++
19:20:13 and by "problem" i mean things like file ownership, paths, etc
19:20:16 yeah
19:20:30 most of the things should pretty immediately break
19:20:44 mordred: heapSize is something we can pass in via an env var, right?
19:20:49 it is now :)
19:20:54 (so we can have a test and prod value)
19:20:55 cool
19:21:27 we WILL be losing the ability to set those things in gerrit.config unless we make things more complex
19:21:40 mordred: could we run an image that ran the gerrit init script?
19:21:55 not really - it forks gerrit into the background
19:22:08 right, but aren't there docker images that know how to manage that?
19:22:23 I mean - we COULD - but I'd rather fix this to not be wonky like that
19:22:27 and it's not far off
19:22:35 ok
19:23:08 we should be able to deploy it on review-dev and be pretty happy with the results before we commit to production too
19:23:13 ++
19:23:28 as long as gerrit upstream also doesn't intend it to be wonky like that
19:23:38 otherwise it seems a bit Sisyphean
19:23:56 I get the sense that gerrit upstream has sort of ignored these problems with their images?
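[A rough illustration of the heapSize-via-environment idea above, written as a container entrypoint fragment; GERRIT_HEAP_LIMIT and the /var/gerrit paths are hypothetical names for illustration, not necessarily what the opendev image uses.]
    # run gerrit in the foreground, turning an optional env var into -Xmx so
    # test and prod can set different heap sizes without editing gerrit.config
    exec java ${GERRIT_HEAP_LIMIT:+-Xmx${GERRIT_HEAP_LIMIT}} \
        -jar /var/gerrit/bin/gerrit.war daemon -d /var/gerrit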
19:24:06 (I seem to recall it relying on h2 among other things)
19:24:23 which i expect is fine as long as they don't intentionally make it worse
19:24:25 hrm, i think the upstream images are decently constructed
19:24:34 they have volumes in the right places, so you don't have to use h2
19:24:37 ah
19:24:38 yah
19:24:50 they don't work for us because we want to be able to patch :)
19:25:00 most of the stuff going on in the init script is not actually necessary
19:25:23 it elides out pretty quickly once you're making images, because you know where all the things are
19:25:35 (there's a TON of logic for finding where your files might be, for instance)
19:25:50 it seems like they completely ignore things like java heap sizes though?
19:26:33 oh, maybe not https://github.com/GerritCodeReview/docker-gerrit/blob/master/ubuntu/18/entrypoint.sh#L16
19:27:07 they are running the init script
19:27:22 yeah. I mean - we could do that if people want - I just don't think we need to
19:28:02 My biggest concern with not doing that is that we might miss important changes to java configs or other settings that happen in the runtime (not gerrit itself) as new versions of gerrit or java come out
19:28:10 but you are right that that is less ideal for docker
19:28:38 and maybe an opportunity to collaborate
19:28:49 there is currently exactly one setting we set that results in a java cli option being pulled out by the init script :)
19:29:18 mordred: we also set the timeout option, but that one doesn't make sense with docker
19:29:32 yah
19:30:02 not much preexisting cause for worry then
19:30:08 current approach should be fine
19:33:04 Sounds like that may be it on this topic then?
19:33:18 #topic Storyboard
19:33:32 fungi: diablo_rojo_phon how are things?
19:33:56 we had a good session on Friday grooming feature requests into our main priorities board
19:34:48 Gonna meet this week
19:34:48 #link https://storyboard.openstack.org/#!/board/115 StoryBoard Team Dashboard
19:34:55 Talk about PTG and forum things
19:35:19 Onboarding maybe.
19:35:35 curious to see what interest we can drum up in Shanghai
19:35:56 Same
19:36:23 that's all i can recall
19:36:35 Yeah that's it for now
19:36:40 Thanks!
19:36:44 #topic General Topics
19:36:51 oh, i've pushed some changes to get python-storyboardclient testable again
19:37:06 Aside from begging mordred to do some SQL things in hopes of improving search
19:37:20 Thanks fungi!
19:37:24 fungi: for the wiki upgrade I think last week the suspicion was git submodules
19:37:45 have we been able to take that suspicion and make progress with it yet? (also I know I said I would try to help and then got sucked into making gitea betterer)
19:38:29 nope, i had weekend guests and so less time on hand than hoped
19:38:42 ok, I'll actually try to take a look this week
19:38:43 i've been working on making progress on the zuul log handling; while we can technically switch to swift log storage at any time, the experience will degrade if we do it now, but could be better if we do it after we add some things to zuul
19:38:52 #link https://zuul-ci.org/docs/zuul/developer/specs/logs.html zuul log handling spec
19:39:01 submodules or similar (git subtree?) theories are possible avenues of investigation for the mediawiki situation
19:39:05 corvus: is the log manifest bit the piece to reduce degradation?
19:39:37 i've pestered mordred and clarkb for reviews of blockers so far. i think i'm about at the end of that, and with the (yes) zuul-manifest stuff in place, we should be able to see the javascript work in action with the preview builds
19:40:12 yeah, the short version is the manifest lets the web app handle indexes, so we don't have to pre-generate them, and we can also display logs in the web app itself
19:40:31 which means we can have javascript do the OSLA bits (which otherwise we would also have to pre-generate)
19:40:43 that sounds like a great approach
19:40:53 then we can delete the log server
19:40:58 and there will be much rejoicing
19:41:01 \o/
19:41:03 saw some of those changes float by and will try to take a closer look
19:41:32 it'll get interesting once the manifest stuff is in place and i can rebase for it. hopefully today
19:41:48 exciting progress
19:41:58 the kafs mirror servers are out of rotation for now; there are some changes queued in the afs-next branch which it would be good for us to test before they are sent to Linus
19:42:33 however the tree currently doesn't build, but when it does it would be good to put it into rotation for a while to confirm. the fscache issues are unresolved, however
19:42:35 i find it awesome that we're testing things before they're sent to Linus
19:42:39 (just an aside)
19:42:48 fungi: ++
19:43:14 ianw: good to know, let us know how we can help I suppose (reviewing changes to flip around the mirror that is used?)
19:43:28 ianw: does afs-next get patches via mailing list?
19:44:14 seems so, yes
19:44:17 (i've been itching to write an imap driver for zuul; wonder if this would be a practical use)
19:44:18 corvus: there is a mailing list, but things also pop in and out as dhowells works on things
19:44:35 i will *so* review an imap/smtp zuul driver
19:45:05 i mean, i guess the smtp reporter is already there? ;)
19:45:22 * corvus read "tree currently doesn't build" and got really confused and sad
19:46:08 Intel has been doing a bunch of CI for the kernel recently I think
19:46:16 that might be another avenue for collaboration potentially
19:46:30 As a time check we have ~14 minutes left and a couple more items to get through, so let's continue on
19:46:40 I did want to do a cloud status check-in
19:46:42 an nntp driver would be cool, but convincing lkml to return their focus to usenet might be an uphill battle
19:47:01 The FortNebula cloud is pretty well stabilized at this point. Thank you donnyd
19:47:11 thanks donnyd!!!
19:47:16 there may be one or two corner cases that need further investigating (ipv6 related maybe?)
19:47:18 yea it seems to be working well atm
19:47:29 but overall it is doing great
19:47:36 Yea I am not sure why there seem to be just 3 jobs that time out
19:47:57 which three jobs?
19:48:03 Hopefully the right storage gear will *actually* show up tomorrow
19:48:21 i hear there's a generator backup in the works too
19:48:24 fungi: I will get you the list, but it seems to be fairly consistent
19:48:38 yea, it's been sitting outside for 6 weeks
19:49:05 just got a pad poured and now I am waiting for the gas to be hooked up to it
19:49:05 mordred: for MOC is there anything we can be doing or is it still in that weird "make accounts and get people to trust us" situation :)
19:49:26 and my UPSes have been refreshed, so they should handle the load till the genny takes over
19:49:27 I have not heard anything new about the new linaro cloud since we last had this meeting
19:49:46 hah. i need to get a generator for here. most of my neighbors have them so i'm feeling rather naked (though i need a >=6' platform to support it)
19:50:01 fungi: is that something you can put on the roof? that should be high enough
19:50:13 i'd need to reinforce the roof for that
19:50:15 clarkb: lemme check to see if it's been fixed
19:50:50 Is there any other fun experiments we can do with FN?
19:50:55 are
19:51:18 this experiment isn't enough? (just kidding!)
19:51:25 donnyd: I think if we get a serious group together to start working out nested virt issues (another potential avenue for feedback to the kernel) that would be super helpful
19:51:26 LOL
19:51:41 yeah, that's been a frequent request
19:51:47 donnyd: what we've found in the past is that debugging those issues requires involvement from all layers of the stack, and what we've traditionally lacked is insight into the cloud
19:51:53 yea, I have seen more than one request for it
19:52:17 clarkb: would ssh access to the hypervisors help?
19:52:30 johnsom and kashyap are your workload and kernel people, and if we can get them insight into the hypervisors we may start getting the ball rolling
19:52:33 if it came with people to ssh in and troubleshoot them ;)
19:52:38 clarkb: MOC still waiting on app credentials to be enabled in their keystone
19:52:40 donnyd: I'm not sure they need to ssh in as much as just candid data capture
19:52:42 * mordred pings knikolla ^^
19:52:53 donnyd: cpu and microcode version and kernel versions and modules loaded and so on
19:52:59 mordred: k
19:53:14 sure... this is the only workload this thing is doing... I can make logs public-facing without issue
19:53:20 donnyd: kashyap would be the best person to engage about what the needs are
19:53:25 on the cloud providers topic, there's a change posted to switch the main flavor in the lon1 aarch64 cloud to 16 cpus and 16gb ram because of resource demands from kolla arm64 jobs
19:53:59 fungi: I'm a bit wary of increasing the memory footprint just because, but considering it is a different cpu arch increasing the cpu count seems reasonable
19:54:00 the arm jobs require more ram?
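[The sort of hypervisor-side data capture being suggested above, as generic example commands rather than an agreed-on collection script; the kvm_intel path assumes Intel hosts (kvm_amd on AMD).]
    uname -r                                      # kernel version
    grep -m1 'model name' /proc/cpuinfo           # cpu model
    grep -m1 microcode /proc/cpuinfo              # microcode revision
    lsmod | grep kvm                              # virtualization modules loaded
    cat /sys/module/kvm_intel/parameters/nested   # is nested virt enabled?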
19:54:05 #link https://review.opendev.org/671445 Linaro London: use new bigger flavour
19:54:20 clarkb: corvus: these are questions i too have asked
19:54:32 in defense of the request, devstack + tempest does swap now
19:54:46 please follow up on that change so i'm not the only one ;)
19:54:47 and I'm sure that is part of the slowness, but the fixing should involve figuring out why we've used so much more memory than in the past
19:54:50 fungi: will do
19:55:00 really quickly before we run out of time
19:55:10 I submitted PTG surveys for the opendev infra team and gitea as separate requests
19:55:40 That means if you are going to be in Shanghai some group of us likely will be as well
19:55:54 #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019 Start planning the next PTG
19:56:11 That has no content yet, but it is there for people to start putting ideas up (I know it is early so no pressure)
19:56:17 * mordred will be in Shanghai
19:56:23 "TODO"
19:57:17 i'm so scattered i honestly can't recall whether i've filled out the opendev infra survey
19:57:36 fungi: it's for me to fill out and you not to worry about
19:57:41 ahh, got i
19:57:43 y
19:57:45 t
19:57:46 fungi: basically our request to the foundation that we want space at the PTG
19:57:52 * fungi gives up on typing again
19:57:53 #topic Open Discussion
19:58:03 We have a couple minutes for anything else that may have been missed
19:59:16 Thank you for your time and we'll see you next week
19:59:23 thanks clarkb!
19:59:25 my typing is suffering now too :)
20:00:05 #endmeeting