19:01:13 #startmeeting infra
19:01:13 Meeting started Tue Dec 1 19:01:13 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 The meeting name has been set to 'infra'
19:01:23 #link http://lists.opendev.org/pipermail/service-discuss/2020-November/000137.html Our Agenda
19:01:33 We have an agenda, trying to get things back to normal after an eventful few weeks
19:01:42 #topic Announcements
19:01:47 Wallaby cycle signing key has been activated https://review.opendev.org/760364
19:01:52 Please sign if you haven't yet https://docs.opendev.org/opendev/system-config/latest/signing.html
19:02:03 at this point this is there mostly as a reminder for myself as I have failed to sign it so far :(
19:02:32 #topic Actions from last meeting
19:02:37 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.txt minutes from last meeting
19:02:49 Last meeting we didn't have a formal agenda and instead went through gerrit upgrade related items.
19:03:00 There are still a few of those to talk through which we will get to shortly
19:03:05 #topic Priority Efforts
19:03:10 #topic OpenDev
19:03:52 We've been working through the debugging of system load on Gerrit. We've had a few good leads so far but nothing that has made it go completely away
19:03:59 ohai
19:04:10 In particular someone else on the Gerrit mailing list was struggling with similar on Gerrit 2.16 and the discussion there pointed to caches
19:04:14 o/
19:04:26 fungi and I have since been trying to tune our cache sizes based on the info that ssh review gerrit show-caches gives us
19:04:38 I think this has helped but it hasn't completely made things happy yet
19:05:10 worth noting, we think there is correlation between the "missing tree" errors people get on push and the elevated system load
19:05:19 We also noticed that there is a jgit receive.autogc setting that runs git gc when code is pushed. We set that but literally just 5 minutes ago I realized we set it in the wrong config file
19:05:31 there is not a jgit config file so I imagine next up is getting that moved into the correct config file
19:05:34 though it's so far only been observed on large repos, generally while or immediately after pushing change series
19:05:59 which could be related to the autogc thing maybe? the gerrit docs note that disabling it is recommended (despite being enabled by default) due to the load impact it has
19:06:20 yeah, i'll work on adding the jgit.config after this meeting
19:06:23 3.3.0 release notes imply that it will not be disabled by default though so maybe they decided making things bad by default was not recommended
19:06:53 the release note on that is a little vague/confusing, to be perfectly honest
19:06:56 In conversation with Luca on the Gerrit slack he says that Java 11 is likely also to have some performance benefits. Gerrithub has been running 3.2 on java 11 since the beginning of this release
19:06:58 fungi: yup
19:07:19 I think this means we should also look at landing java 11 support in our images then switch over prod to java 11. Fungi switched review-test over to java 11 this morning
19:07:38 yeah, if folks want to beat on review-test at all, that's helpful
19:07:52 And I'm hoping that this afternoon I'll have time to update the image jobs to build a 3.3 as well
19:08:01 i feel like we should land the openjdk 11 patch before trying the upgrade to gerrit 3.3, fwiw
19:08:07 I agree
19:08:22 that way if we see new issues we have a better idea of what brought them in
19:08:39 ++
19:08:53 not to derail, but how important do we think upgrading the host from xenial is too?
19:09:01 also we need to do openjdk 11 before we upgrade to (not yet existent) gerrit 3.4
19:09:17 since they're planning to drop support for <11 at that release
19:09:22 ianw: I think that is reasonably important, but not urgent. eg we should be able to schedule that and warn people of the upcoming new IP address
19:09:47 if someone wants to start looking at what that would require I would be grateful :)
19:10:02 my understanding is it should only be important for OS support reasons
19:10:14 not for java version/performance reasons
19:10:21 is that correct or do we think there's a perf benefit?
19:10:37 corvus: generally linux benchmarking gets worse as you get newer kernels
19:10:54 I would actually expect a performance impact (if I had to guess without testing)
19:11:04 yeah, i think the os upgrade would just be more because xenial reaches eol in a few months
19:11:41 phoronix does generic benchmarking of linux over time if people want to see why I would assume that
19:11:52 yep, and also if you're spending time debugging things and it does get down to the kernel/container-ish layer better to be debugging something current
19:12:00 ianw: ya thats true
19:12:06 and on that note, sometime soon we should also talk out a plan for how we would actually do the upgrading to focal... options are to build a new vm and then we have new ip addresses to warn folks about (given how many we know are stuck behind corporate firewalls with special rules allowing 29418/tcp to our server's current address) or do in-place upgrades
19:12:30 I think I still strongly prefer the new host method
19:12:36 i feel like last time we went with in-place
19:12:40 it sounds like it's a wildcard and could go either way, so i'd lean towards deferring os upgrade until we've stabilized or run out of other things
19:12:44 i do too, but in that case we need to decide on a communication schedule
19:12:46 corvus: ++
19:13:28 corvus: yes, i agree we should hold off the os upgrade until we have known performance for the container on the current os version
19:13:42 do we want to see about putting together an http-only recommendation for third-party ci before host replacement?
19:13:55 The other thing I wanted to bring up is tristanC has done some plugin work to do zuul results table rendering. I've been too distracted by other things, but do others think that is in a place that we should consume it? I think if I had any concerns its that it is written in another esoteric language that compiles to js/java aiui
19:14:17 corvus: based on some of the responses I've gotten so far I think a lot of third party CIs would struggle with that
19:14:24 a non zero number are still stuck on zuul v2
19:14:42 they would have a choice about what kind of struggle
19:15:03 also how would http-only work? are we planning to add the checks plugin?
19:15:04 fight internal network rules or upgrade software to supported versions
19:15:23 fungi: that is a good question
19:15:49 fungi: that's the question; i'm not sure checks has a long-term future, but it does exist and has no limitations for the third-party ci use-case (it does for a full gating system); an alternative may be webhooks.
19:15:56 right now we're not offering them an alternative for the stream-events cli
19:16:20 corvus: is webhooks another plugin option?
19:16:23 yep
19:16:32 so while i think http-only sounds great, we'd probably need to decide what that looks like and get it available first
19:16:36 afaik, its supporters do have a long-term interest
19:16:56 fungi: ya sounds like something to do more investigating for
19:17:05 there's also now the "findings" tab? if i've understood, you're supposed to put "autogenerated" on your review comment to be in there?
19:17:07 fungi: agreed (is why i raised it -- do we want to look into setting that as a goal?)
19:17:25 ianw: I think zuul is doing that?
19:17:31 yes has been for some time
19:17:53 i believe findings are different (at least, last time i was exposed to the design doc)
19:18:01 ianw: robot comments are toggleable, zuul has done that by default for ~ a year
19:18:17 and yes, robot comments and findings are separate things
19:18:41 i haven't yet managed to find the documentation on how to get anything into "findings"
19:18:51 ianw: are you suggesting findings tab as alternative to results table rendering?
19:18:52 the checks plugin puts things in findings
19:20:02 corvus: not really as i don't understand it, but i mean it does seem like a summary of the latest zuul results is a "finding"
19:20:28 clarkb: i haven't seen tristanC's table; is there a ml message or other link or something?
19:20:41 ianw: have a link to an example?
19:20:56 ianw: what "robot comments" (autogenerated) do is hide things when you switch the "only comments" slider in the "change log" section of the change view
19:21:11 corvus: yes, let me did, it was rolled out on a test instance
19:21:15 dig
19:21:16 corvus: https://review.opendev.org/c/opendev/system-config/+/763891 is the change
19:21:34 and ya the job that test gerrit installation on ^ was held aiui for people to test it
19:22:21 fungi: (at some point i understood "robot comments" to be a new type of comment associated with checks plugin vs the "regular old comments" which may or may not have the 'autogenerated' tag
19:22:24 https://104.130.172.52/c/openstack/diskimage-builder/+/554002
19:23:19 are we onto talking about the table? because i'd like to run some things about gerrit gate testing by the peanut gallery
19:23:33 corvus: oh, interesting, it's possible i've confused them but i kept seeing them mentioned as the same thing
19:23:35 clarkb, tristanC: there are 2 zuul plugins for gerrit
19:23:51 clarkb, tristanC: is there any way maybe we could contribute to one or more of those?
19:24:19 corvus: yes I strongly encouraged tristanC to do so, but was told there is no interest in learning java or js
19:24:34 i believe tristanC knows js
19:24:40 unless tristanC forgot js?
19:24:42 which is one of my concerns with using the sf thing, its in a random language that tristanC finds acceptable rather than the upstream tooling
19:24:49 corvus: I dunno that is just what I was told last week when it came up
19:25:06 I believe this particular plugin is written in some language that compiles to js
19:25:22 yeah, but "javascript" these days is similar to assembly language really
19:25:27 the main thing i've wondered about scope-wise is whether a pg plugin for displaying a summary table of arbitrary third-party ci comments/votes is relevant to the zuul plug-in, but maybe if zuul is the reference for the comment format then it could be
19:25:33 ianw: i think that's a bit of a stretch
19:25:53 fungi: displaying zuul results is absolutely relevant
19:26:23 yep, and so if other ci systems leave comments which look like zuul results, then supporting that as part of the zuul plugin seems sane enough
19:26:33 https://gerrit.googlesource.com/plugins/zuul-status/
19:26:34 corvus: maybe, but i mean https://104.130.172.52/plugins/zuul-results/static/zuul-results.js
19:26:36 Displays zuul status on PolyGerrit change
19:26:45 corvus: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-11-23.log.html#t2020-11-23T15:10:55
19:27:09 ianw: i'm not sure what the point you're making is
19:27:26 ianw: that is clearly a minimized and obfuscated file; i don't deny the existence of such things
19:27:44 i only say that plenty of people write javascript as the input to creating such files
19:28:43 the fact that there are minimized js files doesn't mean we need to learn new languages; the upstream polygerrit plugins are written in something resembling js, right? so collaboration with others could be done that way, and since we've managed to teach some zuul devs how to do some basic js, they may be able to contribute too
19:29:28 corvus: yup agreed. Maybe the best thing here is to hold out and see if we can upstream support for this into an existing plugin first
19:29:37 right, anyway i guess the exact point at hand is that this concrete proposal for adding this table is written in https://reasonml.github.io/ and we probably have to decide if we want to incorporate that
19:29:48 clarkb: based on that convo, it seems like we're saying "someone needs to learn polygerrit" vs "someone needs to learn reasonml"
19:30:26 in terms of the bigger picture, of testing plugins, i think we should do some work there too. fungi suggested on-list that we should hold a node to test the plugins, which sort of works
19:30:29 right which I still think would be better if that is the toolchain gerrit has attached to
19:30:33 ianw: i've already -2d one change to add reasonml to zuul based on the lack of support for our last experiment with an esoteric language
19:30:53 because then we're collaborating in that ecosystem rather than setting off on our own and being different
19:30:53 however, getting reviews into that held gerrit that look useful enough to test the plugin is a bit of a pain
19:31:03 ianw: yeah, what i didn't consider at the time was that we also need to get some representative content into the held gerrit somehow
19:31:25 ianw: fungi could we autogenerate some content?
19:31:30 we could instead demo things on review-test for now, i suppose, and hold off deleting it
19:31:30 i mean, i like playing with esoteric functional languages, don't get me wrong, but as a group we don't have the best track record there, whereas i think there's a bigger chance we can get more long-term collaboration/support by sticking with how upstream does plugins
19:31:34 make a project, push some changes, merge a change or two, etc
19:31:46 clarkb: ++ 'collaborating in that ecosystem'
19:31:53 clarkb: yes, i think so ... but we need to figure out adding the first admin user automatically
19:32:08 ianw: the zuul all in one stuff does that, I bet we can reuse it
19:32:09 ianw: i have that figured out
19:32:10 you just need to leave a comment to test this, right?
19:32:33 corvus: ya a zuul formatted comment I think (maybe the username matters too? I'm not sure)
19:32:51 there are probably multiple ways to create an initial admin account, but one is to use the gerrit cli with the built-in "gerrit code review" user
19:32:54 fungi: ok, i think we should go through together out of meeting maybe, and see if we can get the test job doing it
19:32:54 for hideci, yes; but hopefully we can omit that in the future -- comment tags are a thing :)
19:33:08 i think the zuul quickstart just uses become auth right?
19:33:17 been a while since i looked at that bit
19:33:30 at that point, it seems like it would also be easy to use a headless browser to take a screenshot of a review, which would make it easy to have an artifact confirming plugins working
19:33:47 and we can also hold the node for manual fiddling
19:33:50 that also sounds really awesome
19:34:31 there's some flag, DEVELOPMENT_BECOME_ANY_ACCOUNT which i didn't fully get to understanding last week
19:35:11 ianw: the alternative is the mechanism i describe in the gerrit admins section of our system-config docs. that works even on a gerrit with no existing accounts
19:35:21 also, ftr, i suspect it's perfectly fine to make a new plugin if this doesn't fit with zuul-status; i don't get the impression that lots of small plugins are necessarily bad.
19:35:45 corvus: that is a good point. It seems the more important bit is using the toolchains upstream is using then they may get involved and help us
19:35:52 fungi: ok, that was what i was trying but wasn't getting an admin account. i think we should try again
19:36:00 I think the gerrit maintainers do actually do a reasonable amount of plugin work to keep them working as things change in gerrit
19:36:07 supporting that work would be a good idea imo
19:36:33 i've already engaged on the thread; i can write a summary to respond if we like
19:36:55 that sounds like a good way to recap this discussion for those who may not be hear
19:36:57 that reminds me, paladox contributed an opendev theme override (with light and dark mode support) as what i think is a pg plugin, but it's just an sgml/html blob in a paste. i was going to try to learn how to integrate that
19:36:58 s/hear/here/
19:37:19 it sounds like basically a) we're not currently convinced on the separate project, especially in a language that doesn't have a lot of exposure, and would like to investigate integrating with upstream more
19:37:34 and b) we'd like to expand the overall plugin testing environment to make it easier
19:37:39 ianw: ++
19:37:52 i'll draft something and loop people back
19:37:55 thank you
19:38:01 also if that discussion thread wasn't on service-discuss, could it be redirected there?
19:38:15 i have a feeling it might have ended up on openstack-discuss
19:38:18 i can cc, i think it was openstack discuss only
19:38:31 yeah, fwiw i have no idea what thread is being discussed :(
19:39:11 also, friendly reminder that there is a zuul running for the purposes of testing plugins in the upstream gerrit; i have no idea what testing means for polygerrit plugins; that may be interesting to learn
19:39:17 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-November/019051.html
19:39:18 i do recall replying on it, but in retrospect i should have asked people to follow up to service-discuss
19:39:20 for reference
19:39:27 (it's mostly testing java plugins)
19:39:42 thanks ianw
19:40:23 alright anything else on Gerrit before we move on?
19:40:29 part of the problem is i subscribe to lots of mailing lists and dump them into the same folder, so sometimes it's not immediately apparent to me if people have started discussions in the wrong ml
19:41:05 maybe we should agree to move forward with the jdk update asap?
19:41:27 other than that, no i think we've got things pretty well covered
19:41:30 I'm on board, its being tested on review-test. If others can give that a quick check then we're probably good to proceed on that
19:41:50 thinking out loud here: do the jgit autogc config first maybe? then do java 11 next?
19:41:58 just to do one thing at a time and autogc fix seems simpler
19:42:07 yeah, i'll push that change up after the meeting
19:42:11 thanks
19:42:22 #topic Update Config Management
19:43:30 Is there anything new on this effort to call out? I don't think so but I'm double checking
19:44:12 the codesearch rebuild maybe?
19:44:26 oh ya ianw ^ is that complete at this point?
19:44:26 we have two servers at the moment still, right?
19:44:43 oh, actually it's a cname now
19:44:45 no i cleaned the old one up, that should be all finished now
19:44:52 awesome, thanks!
19:44:58 nobody has complained so i assume it's working perfectly :)
19:45:05 excellent
19:45:18 yes, i was making a point to use the opendev one so i would test it
19:45:24 and have had no problems
19:45:49 #topic General topics
19:45:56 #topic Bup and Borg Backups
19:45:57 ianw: ++ thanks!
19:46:17 I think we're getting more and more comfortable with borg? I've unfortunately had little time to interact with it more recently
19:46:34 ianw: I know at some point you wanted to do verification then drop bup?
19:46:37 i should practice with restoring something i guess
19:46:51 maybe a good thing to try and do before dropping bup is having other admins do things like ^
19:46:55 yeah, i was thinking what i'll do is a config change to remove the bup cron jobs; people can audit the borg changes and approve that when happy
19:47:07 ianw: that sounds like a reasonable plan
19:47:13 i'm down
19:47:33 and that is important for the focal upgrades we were talking about earlier too
19:47:35 then we can kill all the puppet bits and maybe just attach the old backup volumes to the new server for a bit
19:47:37 since bup and python3 don't mix
19:48:22 thank you for getting this moving and doing all that work, really appreciated
19:48:37 #topic Docker Rate Limits are Being Seen in CI
19:49:01 This is mostly a heads up/fyi
19:49:15 mostly in NAT environments?
19:49:17 jobs particularly those running on limestone seem to hit this
19:49:36 ianw: ya, though I would've expected it to hit all environments fairly equally due to our use of mirrors? But maybe we aren't using the mirrors the way I thought we were
19:49:44 yeah, we're not seeing it so much on our proxies as on limestone nat for jobs not using the proxy
19:50:16 a few weeks back I pushed up changes to switch our zuul mirror config for docker over to just using the host addrs rather than the mirror. I don't think we need to land those yet since its NAT getting us
19:50:39 but something to be aware of and maybe we need to bring that conversation for getting our images open source specialled again
19:50:53 jbryce was going to look at the agreement in more detail and get back to us but I think like us has been busy
19:51:03 another option is to use quay which does not rate limit
19:51:09 but does have outages when aws east goes down
19:51:23 I don't have answers, just info for people to digest :)
19:51:40 or make a new kind of pass-through proxy/mirror
19:52:00 ya one that understands it needs to be a sort of lru cache
19:52:21 yup; i believe that's doable and much of the code in zuul-registry can be repurposed for that
19:52:37 (but still, it's not a trivial project, so one that we should deliberately choose)
19:52:52 also possibly a more useful effort than trying to bend something like squid to cache "authenticated" requests
19:53:09 so that would authenticate with a higher-limited key, and transparently pass through all our requests?
19:53:10 though possibly squid would be better for our http caching we do on those hosts in general
19:53:21 since in theory it can be more flexible than what apache is currently doing
19:53:24 ianw: or even anonymously but just stay under the limit?
19:53:42 ya docker hub sends the required cache control headers to cache publicly those manifests
19:53:54 the issue is that apache will not cache any authenticated request even with those headers
19:54:00 there's no "anonymously" really though right?
19:54:01 we believe squid can be convinced to do so though
19:54:02 clarkb: do you think the squid approach will work with all the weird auth stuff?
19:54:28 corvus: I think so if we can make it respect the cache-control: public or whatever header it is that is sent back by docker hub
19:54:28 fungi: in the way i intended to use it, yes (an auth credential obtained with no identifying information)
19:54:35 what we have not traditionally done is limit our mirrors to only be connectable from their respective clouds; we might want to think about that if we're using an opendev specific key
19:54:44 fungi: (authz without authn i guess?)
19:55:16 it's been years since i've done esoteric things with squid (including trivially patching it to ignore some things which would cause it not to cache but that it lacked configuration for), so it would need a poc regardless
19:55:17 I think the major issue with apache as we use it for this problem space is that it will never cache a request that had an authorization header even if cache control says it is ok to do so
19:55:48 if apache could be convinced to do ^ it would probably be fine too. Since it is now the manifest data that we need to cache
19:56:22 based on my estimation of effort, it sounds like spending a couple of days attempting to get squid to work should take precedence over a couple of weeks to implement a smart registry proxy
19:56:48 (or, you know, convince everyone to use quay.io :)
19:57:06 yeah, like i said, there have been times when i had to patch and recompile squid to get it to cache some stuff too, so i don't want to say it's necessarily better than apache mod_proxy, and i don't personally think being stuck maintaining our own patched build of either of those is particularly wise
19:58:15 the first thing it really needs is exploration
19:58:29 ++
19:58:41 part of the issue in the past is the info from docker was a bit vague
19:58:55 but now we've got a bit more real world data and we should be able to work with that to find a reasonable solution
19:59:18 alright we are just about at time so I'll call it here
19:59:20 thanks everyone
19:59:21 the other part of the issue was that it was an advance warning about stuff they weren't actually doing yet, yeah
19:59:30 now it's observable and testable at least
19:59:33 feel free to continue any/all of these conversations in #opendev
19:59:39 #endmeeting
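
Note on the 19:52-19:55 discussion: the caching behaviour being asked for (an lru cache that honours "cache-control: public" on the response even when the request carried an authorization header) can be sketched roughly as below. This is only an illustration of the idea, not zuul-registry, squid or apache code; the names LRUResponseCache and is_publicly_cacheable are made up for the example, headers are assumed to be a plain dict keyed exactly "Cache-Control", and expiry (max-age) and other directives are ignored.

```python
from collections import OrderedDict


def is_publicly_cacheable(response_headers):
    """Return True when the Cache-Control value contains the 'public' directive.

    Assumption: headers arrive as a plain dict with the exact key
    "Cache-Control"; a real proxy would need case-insensitive lookup.
    """
    value = response_headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in value.split(",")]
    return "public" in directives


class LRUResponseCache:
    """Hypothetical in-memory LRU cache keyed by URL."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, url):
        """Return the cached body, or None on a miss."""
        if url not in self._entries:
            return None
        # Mark this entry as most recently used.
        self._entries.move_to_end(url)
        return self._entries[url]

    def put(self, url, response_headers, body):
        """Store the body if the response says it is publicly cacheable.

        Whether the request had an Authorization header is deliberately
        not consulted; only the response's Cache-Control decides.
        """
        if not is_publicly_cacheable(response_headers):
            return False
        self._entries[url] = body
        self._entries.move_to_end(url)
        # Evict least recently used entries beyond the size limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return True
```

A proxy built this way would call get() before contacting the upstream registry and put() after each upstream response; whether that is less effort than configuring squid is the open question from the meeting.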