19:01:14 #startmeeting infra
19:01:15 Meeting started Tue Dec 8 19:01:14 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:18 The meeting name has been set to 'infra'
19:01:22 #link http://lists.opendev.org/pipermail/service-discuss/2020-December/000151.html Our Agenda
19:01:32 #topic Announcements
19:02:12 o/
19:02:24 I intend to be far away from keyboards next week. I'll also be doing school duties to give my wife a break from that so will be distracted either way. This means that we will need a meeting chair volunteer or we can cancel the next meeting
19:02:46 then for the 22nd and 29th I figured we'd play it more by ear as others are also likely taking time?
19:03:00 clarkb: are you away all next week?
19:03:12 with things getting quiet, having fewer meetings might also just be nice
19:03:28 corvus: ya sorry, trying to take the week off and get some rest/reset
19:03:35 don't be sorry :)
19:03:44 i heard he's taking a week-long trip to oregon
19:03:52 (just wanted to be clear if it was a day or a week)
19:03:56 ++ sounds good :)
19:04:13 i'll be around through the 23rd, then not around
19:04:27 if you will be around and want to chair either let me know or maybe just send out a meeting agenda email on monday
19:05:08 #topic Actions from last meeting
19:05:12 i'm in favor of just having fewer meetings and handling things as they come up
19:05:18 fungi: that wfm
19:05:27 #undo
19:05:28 Removing item from minutes: #topic Actions from last meeting
19:05:37 yep
19:05:45 in that case why don't we consider the meeting cancelled and we can schedule meetings as necessary instead with those who happen to be around
19:05:55 seconded
19:05:59 and apply similar logic to the 22nd and 29th
19:06:40 #topic Actions from last meeting
19:06:43 #link
http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-12-01-19.01.txt minutes from last meeting
19:06:53 there were no actions recorded so let's just dive in
19:06:58 #topic Priority Efforts
19:07:03 #topic OpenDev
19:07:24 On the gerrit side of things I listed out a few items for further tuning consideration
19:07:56 last night (relative to me) ianw ended up restarting gerrit as it became non responsive. I think that possibly the lack of memory headroom with java 11 may be related to that? I've pushed up a change to reduce allowed heap size to 44g from 48g
19:08:06 I think java 11's non heap space is larger than java 8's
19:08:24 #link https://review.opendev.org/c/opendev/system-config/+/766020 reduce java heap size on review.o.o
19:08:35 that should give us more room for things like apache, git gc, backups, and so on
19:08:37 this all seems reasonable. i've +2'd but not approved the changes you recommended for the next restart
19:08:55 even if the memory wasn't at fault for the issue last night I think we're seeing swapping and should avoid it if possible
19:09:10 #link https://review.opendev.org/c/opendev/system-config/+/765867 Put more jgit configs in jgit.config
19:09:37 This change is the other one that I think we should consider for the next restart. Reading the gerrit documentation it is completely confusing as to whether our preexisting jgit tunables in gerrit.config apply anymore or if they need to be in jgit.config
19:10:00 I tried poking around the source to figure it out but failed doing that as well. Instead I'm thinking let's put the config options in both files and see if we get a difference in behavior
19:10:25 yeah, the host was responsive, but gerrit was not.
my debugging was not extensive unfortunately
19:10:29 There are two children changes of ^ this change which are worth consideration too, but likely need more eyeballs and should go in on later restarts so we can tell what helps and what doesn't
19:10:49 specifically using packedGitUseStrongRefs and enabling git protocol 2
19:11:19 for the strong refs the idea there reading stuff from matthias upstream is that when garbage collection happens it has a tendency to flush out jgit caches which jgit then immediately refills and this thrashing can lead to a sad gerrit
19:11:42 the strong refs makes the garbage collector stop doing that. My concern with this change is that I'm not sure if the garbage collector can ever clean strong refs if it needed to?
19:12:02 (strong refs are not eligible for garbage collection)
19:12:24 I think if we land this particular change we should do so when we can monitor it over a long period of time just to keep an eye on memory use
19:12:55 for git protocol v2, the idea there is it is much more efficient for git client operations when dealing with repos that have a lot of refs (like our gerrit repos)
19:13:20 the client must also support it but current git clients default to v2 aiui so as systems update we would get more and more use out of that?
19:13:58 anyway those first two changes should be much safer than the latter two. And if we can get the first two in and restart with them that would probably be good
19:14:49 sounds great
19:14:54 The other tunable that I discovered is that gerrit allows you to split its thread resources into batch and interactive user sets. The idea here is that things like CI systems could have dedicated thread resources.
I'm not sure if this would help us or not but I noticed it was something called out in tuning discussions
19:15:14 i can do another gerrit restart later in my evening for the initial changes you mentioned
19:15:21 if others have time to look into ^ that would probably be good (even if it is to say "no we don't want this as it will starve regular users")
19:15:27 fungi: thanks!
19:15:57 that was all I had for tunables. ianw want to update us on the ci results table progress?
19:16:35 for a quick look at what i've got see https://104.130.172.52/c/openstack/diskimage-builder/+/554002
19:16:41 there's a tab
19:17:07 ooh I like that
19:17:24 ++ that looks great
19:17:29 this is just all very very simple plain javascript @ https://github.com/ianw/gerrit-zuul-summary-status/blob/main/gr-zuul-summary-status/gr-zuul-summary-status-view.js#L87
19:18:09 are comment tags available to the js?
19:18:16 (so we could act on that instead of author name?)
19:18:29 i guess we call it a "zuul summary" because it's based on parsing zuul's standard comment format, even though it may include results from other non-zuul ci systems reporting in a similar format?
19:19:09 corvus: hrm, whatever is in https://gerrit-review.googlesource.com/Documentation/rest-api-changes.html#change-info i guess
19:19:28 fungi: yeah, i mean that's up for debate i guess
19:19:59 i think i can probably get it down to be simple enough to be a single file
19:21:05 i am starting to wonder about pushing it upstream, it might feel more at home even as a contrib/ in zuul
19:21:17 * diablo_rojo sneaks in late
19:21:43 but it installs as a pg plugin?
19:21:46 the big upside to pushing it upstream is that we know there are other zuul users out there with gerrit and they may be more likely to find these things on the gerrit side (as it is a gerrit modification)?
19:21:46 ianw: https://gerrit-review.googlesource.com/Documentation/rest-api-changes.html#change-message-info 'tag' field
19:22:18 yeah, and we might be able to take advantage of the gerrit plugin ecosystem
19:22:20 in fact a zuul user I didn't recognize (sorry if I should've) caught the stream events thing on gerrit 3.3.0 (which we will talk about in a bit) and sent mail about it to the repo discuss list
19:22:31 (like i think there's a pluginmanager plugin or something where you can click-to-install gerrit plugins)
19:22:59 ya there is
19:23:05 (we don't have it enabled on our setup)
19:23:21 so i'd be in favor of putting that in the upstream gerrit for that and community relations purposes :)
19:23:53 corvus: i'll look into it. basically the plugin gets called with a changeinfo object for the current change. my debug method is "console.log" so that's how i inspect what's going on :)
19:24:27 yeah, i do have it building via bazel ATM
19:24:39 and there's testing frameworks for polymer
19:24:49 and a zuul to run those tests :)
19:25:30 anything else to add on this?
19:25:46 nope, i'll just keep plugging away on it
19:25:47 ianw: if you're okay with pushing that to gerrit's gerrit, i think the next step is to send an email to repo-discuss requesting the repo creation; i can help with that if you want
19:26:19 corvus: thanks. i will clean it up a bit more and get back to you
19:27:02 fwiw, i think it looks good enough to start iterating on things in parallel :)
19:27:39 Next up is the built in WIP status for changes on newer gerrit. We had hacked in WIP support by adding a -1 approval category that change owners and cores could toggle, but now gerrit supports it directly (for change owners at least).
People have started asking about using the actual WIP status instead of the approval category
19:28:05 but I think just now we have discovered that zuul doesn't yet know about the built in wip status and should be updated before we recommend our users use the built in wip status
19:28:28 corvus: fungi zbr any other specifics to call out on that? sounds like work will start soon on addressing that in zuul
19:29:21 zbr volunteered to work on a change tomorrow
19:29:22 i had nothing to add
19:29:57 ok, I figure once zuul is updated we'll do more testing then we can decide if we want to clean up the old approval hack or not (or at least offer that as an option to users)
19:30:15 just be aware it will cause top-of-queue gate resets for now if people accidentally approve wip state changes
19:30:32 oh ya because submit will fail which zuul will think is a merge failure
19:30:38 zuul will get as far as trying to submit, right
19:31:03 that is a good point particularly since we have seen deep gate queues in some projects recently (there has been a lot of python trouble with pip lately)
19:31:36 python comics, issue #473: the trouble with pip
19:31:51 Last up on the Gerrit OpenDev topic was calling out that Gerrit 3.3.0's event stream implementation breaks zuul's ability to take action on comment contents (think recheck comments)
19:32:09 corvus: ^ are any other zuul interactions with gerrit known to be affected?
19:33:23 calling this out because upstream is aware of the issue and is working on addressing it, but we should avoid upgrading to 3.3 until it is fixed
19:33:39 can we insert after this subtopic the jeepyb lp bug/bp hook scripts?
i wanted to know if anyone has made progress on those or if i should try to pick them up next myself tomorrow-ish
19:34:02 sure I think that was all I had on it (basically upgrade to 3.3.0 has found a blocker)
19:34:04 clarkb: that's all i'm aware of
19:34:19 fungi: I am not aware of anyone working on them yet
19:34:25 latest on the stream-events thing is luca is going to rage code a bunch of tests :)
19:34:39 cool, mostly just trying to prioritize the stuff we've been accumulating on the post-upgrade etherpad
19:34:45 fungi: ianw had looked at them briefly pre upgrade iirc, but that was the last I heard
19:34:49 fungi: ++ and thank you
19:34:57 he said something about "if it's not tested it's broken"
19:35:09 and bug/bp integration seems to be next on the painpoints after/alongside ci results table
19:35:30 corvus: i feel like i've heard that somewhere before
19:35:47 yeah, i hadn't really got that far with them, but now we have the actual REST API to play against i think we can iterate on it faster
19:36:21 Anything else on the subject of gerrit and or opendev?
19:36:44 one quick thing on the system-config gate test for gerrit/review ... what does the review-dev node test over just the review node?
19:36:55 i'm wondering if we can prune that to just the one node?
19:37:22 it's been pointed out that we may be invalidating gerrit logins more quickly than (we think) we've configured, so i'll also test whether my restart later today invalidates my webui session
19:37:22 ianw: ya I think the idea before we realized that we really need something like a prod alike is that we might have -dev and prod in different stages of upgrades
19:37:46 ianw: since we're doing pre merge testing anyway I think we can probably have a single node that just does the thing we want prod to look like and use it that way
19:37:55 ianw: mordred may remember if there was any better reason than that though
19:38:02 fungi: oh good idea
19:38:14 what did I do?
19:38:32 mordred: basically in the system-config job for gerrit we have a review.o.o and review-dev.o.o fake tests nodes separated I think
19:38:41 i feel like we should rip out review-dev at this point (and keep in mind when we're ready to also tear down review-test in favor of held job nodes)
19:38:48 fungi: ++
19:39:00 fungi: re the logout, i was not logged out when i restarted it last night my time
19:39:02 yeah- I think it was just because they were a bit different
19:39:15 at this point it's an artifact of how we didn't have great testing for gerrit and now we can make that better with testing that looks like prod
19:39:26 so I think re-collapsing those at this point is ... yup
19:39:27 ianw: thanks, that's also a useful data point
19:40:06 ok, i will propose that. it makes it a bit simpler doing a full gerrit initialisation and pushing changes in the job
19:40:43 #topic Update Configuration Management
19:41:11 Has there been any movement on this topic in the last week (sorry gerrit has been overly consuming)
19:42:07 this might be the place to remind folks we're running into dockerhub rate limits on our containerized service test jobs
19:42:35 no easy answers at this point though
19:42:37 is it enough we're ready to decide we want to do something about it?
19:42:59 corvus: probably not? It is just infrequent enough that I haven't rage fixed it :)
19:43:01 it hasn't been particularly crippling yet
19:43:23 but worth keeping an eye on in case it escalates quickly
19:43:54 it may be worth setting up a job to publish to quay just to see if that works?
19:43:57 iirc, we're thinking if it is annoying enough, we should start by looking into squid, and if that fails, we could look at a smart proxy based on zuul-registry but that's high-effort. that still a decent summary?
19:43:58 since that may be an easy out
19:44:13 corvus: yup I think as far as proper fixing goes that is a good summary
19:44:34 (my quay comment is more that "maybe this is an easy half measure to consider alongside ^ and we'd still want to cache for quay anyway")
19:44:42 yeah, that's still the latest thinking as far as i'm aware
19:46:10 #topic General Topics
19:46:20 #topic Bup and Borg Backups
19:46:40 ianw: this is still on the agenda mostly as a reminder that we should look out for your bup removal change and +2 that after we've verified borg backups?
19:46:46 ianw: is that change up yet?
19:48:04 thanks, that reminds me to actually add restoration docs exercising to my to do list
19:48:37 no, it is not. i'll get to it so we can hopefully sort it by year end
19:48:42 thanks
19:48:49 there's no rush
19:48:50 #topic OpenStackID hosting
19:49:42 This I failed to add to the agenda but the foundation sysadmins have started to think about what a more ideal hosting situation looks like (ignoring who is hosting it) which I think is a good first step in figuring out how we collaborate (if at all) in hosting it
19:50:13 basically taking another look at service needs and requirements and work out how to deploy it well
19:50:18 (this hasn't been forgotten)
19:51:31 Then for remaining topics I may have to declare bankruptcy on ptg followups or at least in the way I've done them before.
Meetpad testing with users in china has not happened yet (I'd still like to coordinate that though), and puppet job splitting hasn't happened as far as I can tell
19:51:38 #topic Open Discussion
19:51:40 thinking back, the reason we insisted on hosting it in opendev previously was that we were tying the gerrit contact store api to it, so new contributors at the time couldn't agree to the (then mandatory for basically all projects) osf icla if openstackid was down, but also we were looking at depending on it for authenticating users to various services
19:51:46 #undo
19:51:46 Removing item from minutes: #topic Open Discussion
19:52:06 gerrit since removed the contact store feature entirely so that's no longer a concern
19:53:10 that is a good point, the requirements/needs on our end have shifted too
19:53:21 and the only services we set up authenticating against it were translate (openstack-only abandonware which needs to be replaced soonish), refstack (also openstack-only, tied to foundation trademark programs), and survey (beta which never really gained traction)
19:54:35 alright I'll open it up now as we only have a few minutes left
19:54:38 #topic Open Discussion
19:54:43 Anything else to call out really quickly?
19:55:54 Nothing from me.
19:56:51 i've promised to spend less time on the computer this month. i expect to still be around some of the time but will also be taking more time away as i can to work on some projects around the house. also probably for the last week-ish of the month i may not be around much at all
19:57:15 ya I'll be trying to take it easy around the holidays though in and out
19:57:37 for me that probably translates to fixing emergency fires but maybe not much progress on longer term efforts
19:59:06 s/fixing/fueling/ ? ;)
19:59:29 heh
19:59:36 anyway we are about at time now. Thanks everyone!
19:59:38 #endmeeting
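
Note on the ci results table discussion (comment 'tag' field vs. author name): a minimal Python sketch of the idea, not the plugin's actual code (which is JavaScript). It picks the newest change message whose tag marks it as a CI report and parses the per-job result lines. The tag prefix "autogenerated:zuul", the "- <job> <url> : <RESULT> in <duration>" line shape, the assumption that messages arrive oldest first, and the names ci_results, ZUUL_TAG_PREFIX and RESULT_LINE are all assumptions made for illustration; check them against a real change before relying on them.

```python
import re

# Assumption: CI report comments carry a tag starting with this prefix.
ZUUL_TAG_PREFIX = "autogenerated:zuul"

# Assumption: one result line looks like
#   - <job> <url> : <RESULT> in <duration>
RESULT_LINE = re.compile(
    r"^- (?P<job>\S+) (?P<url>\S+) : (?P<result>[A-Z_]+)"
    r"(?: in (?P<duration>.+))?$"
)


def ci_results(messages, tag_prefix=ZUUL_TAG_PREFIX):
    """Parse job results from the newest message whose tag matches.

    `messages` is a list of dicts shaped like Gerrit's ChangeMessageInfo
    (only the optional 'tag' and the 'message' keys are used), assumed to
    be ordered oldest first.
    """
    tagged = [m for m in messages if (m.get("tag") or "").startswith(tag_prefix)]
    if not tagged:
        return []
    results = []
    for line in tagged[-1]["message"].splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results.append(match.groupdict())
    return results
```

Selecting on the tag rather than the author name means a human comment that happens to contain a "- job url : RESULT" line is ignored, while any reporter using the same tag prefix and line format is picked up.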