#opendev-meeting log

19:01:13 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Dec  1 19:01:13 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:23 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-November/000137.html Our Agenda
19:01:33 <clarkb> We have an agenda, trying to get things back to normal after an eventful few weeks
19:01:42 <clarkb> #topic Announcements
19:01:47 <clarkb> Wallaby cycle signing key has been activated https://review.opendev.org/760364
19:01:52 <clarkb> Please sign if you haven't yet https://docs.opendev.org/opendev/system-config/latest/signing.html
19:02:03 <clarkb> at this point this is there mostly as a reminder for myself as I have failed to sign it so far :(
19:02:32 <clarkb> #topic Actions from last meeting
19:02:37 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.txt minutes from last meeting
19:02:49 <clarkb> Last meeting we didn't have a formal agenda and instead went through gerrit upgrade related items.
19:03:00 <clarkb> There are still a few of those to talk through which we will get to shortly
19:03:05 <clarkb> #topic Priority Efforts
19:03:10 <clarkb> #topic OpenDev
19:03:52 <clarkb> We've been working through the debugging of system load on Gerrit. We've had a few good leads so far but nothing that has made it go completely away
19:03:59 <fungi> ohai
19:04:10 <clarkb> In particular someone else on the Gerrit mailing list was struggling with similar on Gerrit 2.16 and the discussion there pointed to caches
19:04:14 <corvus> o/
19:04:26 <clarkb> fungi and I have since been trying to tune our cache sizes based on the info that ssh review gerrit show-caches gives us
19:04:38 <clarkb> I think this has helped but it hasn't completely made things happy yet
19:05:10 <fungi> worth noting, we think there is correlation between the "missing tree" errors people get on push and the elevated system load
19:05:19 <clarkb> We also noticed that there is a jgit receive.autogc setting that runs git gc when code is pushed. We set that but literally just 5 minutes ago I realized we set it in the wrong config file
19:05:31 <clarkb> there is not a jgit config file so I imagine next up is getting that moved into the correct config file
19:05:34 <fungi> though it's so far only been observed on large repos, generally while or immediately after pushing change series
19:05:59 <clarkb> which could be related to the autogc thing maybe? the gerrit docs note that disabling it is recommended (despite being enabled by default) due to the load impact it has
19:06:20 <fungi> yeah, i'll work on adding the jgit.config after this meeting
19:06:23 <clarkb> 3.3.0 release notes imply that it will now be disabled by default though so maybe they decided making things bad by default was not recommended
19:06:53 <fungi> the release note on that is a little vague/confusing, to be perfectly honest
19:06:56 <clarkb> In conversation with Luca on the Gerrit slack he says that Java 11 is likely also to have some performance benefits. Gerrithub has been running 3.2 on java 11 since the beginning of this release
19:06:58 <clarkb> fungi: yup
19:07:19 <clarkb> I think this means we should also look at landing java 11 support in our images then switch over prod to java 11. Fungi switched review-test over to java 11 this morning
19:07:38 <fungi> yeah, if folks want to beat on review-test at all, that's helpful
19:07:52 <clarkb> And I'm hoping that this afternoon I'll have time to update the image jobs to build a 3.3 as well
19:08:01 <fungi> i feel like we should land the openjdk 11 patch before trying the upgrade to gerrit 3.3, fwiw
19:08:07 <clarkb> I agree
19:08:22 <fungi> that way if we see new issues we have a better idea of what brought them in
19:08:39 <ianw> ++
19:08:53 <ianw> not to derail, but how important do we think upgrading the host from xenial is too?
19:09:01 <fungi> also we need to do openjdk 11 before we upgrade to (not yet existent) gerrit 3.4
19:09:17 <fungi> since they're planning to drop support for <11 at that release
19:09:22 <clarkb> ianw: I think that is reasonably important, but not urgent. eg we should be able to schedule that and warn people of the upcoming new IP address
19:09:47 <clarkb> if someone wants to start looking at what that would require I would be grateful :)
19:10:02 <corvus> my understanding is it should only be important for OS support reasons
19:10:14 <corvus> not for java version/performance reasons
19:10:21 <corvus> is that correct or do we think there's a perf benefit?
19:10:37 <clarkb> corvus: generally linux benchmarking gets worse as you get newer kernels
19:10:54 <clarkb> I would actually expect a performance impact (if I had to guess without testing)
19:11:04 <fungi> yeah, i think the os upgrade would just be more because xenial reaches eol in a few months
19:11:41 <clarkb> phoronix does generic benchmarking of linux over time if people want to see why I would assume that
19:11:52 <ianw> yep, and also if you're spending time debugging things and it does get down to the kernel/container-ish layer better to be debugging something current
19:12:00 <clarkb> ianw: ya thats true
19:12:06 <fungi> and on that note, sometime soon we should also talk out a plan for how we would actually do the upgrading to focal... options are to build a new vm and then we have new ip addresses to warn folks about (given how many we know are stuck behind corporate firewalls with special rules allowing 29418/tcp to our server's current address) or do in-place upgrades
19:12:30 <clarkb> I think I still strongly prefer the new host method
19:12:36 <ianw> i feel like last time we went with in-place
19:12:40 <corvus> it sounds like it's a wildcard and could go either way, so i'd lean towards deferring os upgrade until we've stabilized or run out of other things
19:12:44 <fungi> i do too, but in that case we need to decide on a communication schedule
19:12:46 <clarkb> corvus: ++
19:13:28 <fungi> corvus: yes, i agree we should hold off the os upgrade until we have known performance for the container on the current os version
19:13:42 <corvus> do we want to see about putting together an http-only recommendation for third-party ci before host replacement?
19:13:55 <clarkb> The other thing I wanted to bring up is tristanC has done some plugin work to do zuul results table rendering. I've been too distracted by other things, but do others think that is in a place that we should consume it? I think if I had any concerns it's that it is written in another esoteric language that compiles to js/java aiui
19:14:17 <clarkb> corvus: based on some of the responses I've gotten so far I think a lot of third party CIs would struggle with that
19:14:24 <clarkb> a non zero number are still stuck on zuul v2
19:14:42 <corvus> they would have a choice about what kind of struggle
19:15:03 <fungi> also how would http-only work? are we planning to add the checks plugin?
19:15:04 <corvus> fight internal network rules or upgrade software to supported versions
19:15:23 <clarkb> fungi: that is a good question
19:15:49 <corvus> fungi: that's the question; i'm not sure checks has a long-term future, but it does exist and has no limitations for the third-party ci use-case (it does for a full gating system); an alternative may be webhooks.
19:15:56 <fungi> right now we're not offering them an alternative for the stream-events cli
19:16:20 <clarkb> corvus: is webhooks another plugin option?
19:16:23 <corvus> yep
19:16:32 <fungi> so while i think http-only sounds great, we'd probably need to decide what that looks like and get it available first
19:16:36 <corvus> afaik, its supporters do have a long-term interest
19:16:56 <clarkb> fungi: ya sounds like something to do more investigating for
19:17:05 <ianw> there's also now the "findings" tab?  if i've understood, you're supposed to put "autogenerated" on your review comment to be in there?
19:17:07 <corvus> fungi: agreed (is why i raised it -- do we want to look into setting that as a goal?)
19:17:25 <clarkb> ianw: I think zuul is doing that?
19:17:31 <corvus> yes has been for some time
19:17:53 <corvus> i believe findings are different (at least, last time i was exposed to the design doc)
19:18:01 <fungi> ianw: robot comments are toggleable, zuul has done that by default for ~ a year
19:18:17 <fungi> and yes, robot comments and findings are separate things
19:18:41 <ianw> i haven't yet managed to find the documentation on how to get anything into "findings"
19:18:51 <corvus> ianw: are you suggesting findings tab as alternative to results table rendering?
19:18:52 <fungi> the checks plugin puts things in findings
19:20:02 <ianw> corvus: not really as i don't understand it, but i mean it does seem like a summary of the latest zuul results is a "finding"
19:20:28 <corvus> clarkb: i haven't seen tristanC's table; is there a ml message or other link or something?
19:20:41 <corvus> ianw: have a link to an example?
19:20:56 <fungi> ianw: what "robot comments" (autogenerated) do is hide things when you switch the "only comments" slider in the "change log" section of the change view
19:21:11 <ianw> corvus: yes, let me did, it was rolled out on a test instance
19:21:15 <ianw> dig
19:21:16 <clarkb> corvus: https://review.opendev.org/c/opendev/system-config/+/763891 is the change
19:21:34 <clarkb> and ya the job that test gerrit installation on ^ was held aiui for people to test it
19:22:21 <corvus> fungi: (at some point i understood "robot comments" to be a new type of comment associated with checks plugin vs the "regular old comments" which may or may not have the 'autogenerated' tag)
19:22:24 <ianw> https://104.130.172.52/c/openstack/diskimage-builder/+/554002
19:23:19 <ianw> are we onto talking about the table?  because i'd like to run some things about gerrit gate testing by the peanut gallery
19:23:33 <fungi> corvus: oh, interesting, it's possible i've confused them but i kept seeing them mentioned as the same thing
19:23:35 <corvus> clarkb, tristanC: there are 2 zuul plugins for gerrit
19:23:51 <corvus> clarkb, tristanC: is there any way maybe we could contribute to one or more of those?
19:24:19 <clarkb> corvus: yes I strongly encouraged tristanC to do so, but was told there is no interest in learning java or js
19:24:34 <corvus> i believe tristanC knows js
19:24:40 <corvus> unless tristanC forgot js?
19:24:42 <clarkb> which is one of my concerns with using the sf thing, its in a random language that tristanC finds acceptable rather than the upstream tooling
19:24:49 <clarkb> corvus: I dunno that is just what I was told last week when it came up
19:25:06 <clarkb> I believe this particular plugin is written in some language that compiles to js
19:25:22 <ianw> yeah, but "javascript" these days is similar to assembly language really
19:25:27 <fungi> the main thing i've wondered about scope-wise is whether a pg plugin for displaying a summary table of arbitrary third-party ci comments/votes is relevant to the zuul plug-in, but maybe if zuul is the reference for the comment format then it could be
19:25:33 <corvus> ianw: i think that's a bit of a stretch
19:25:53 <corvus> fungi: displaying zuul results is absolutely relevant
19:26:23 <fungi> yep, and so if other ci systems leave comments which look like zuul results, then supporting that as part of the zuul plugin seems sane enough
19:26:33 <corvus> https://gerrit.googlesource.com/plugins/zuul-status/
19:26:34 <ianw> corvus: maybe, but i mean https://104.130.172.52/plugins/zuul-results/static/zuul-results.js
19:26:36 <corvus> Displays zuul status on PolyGerrit change
19:26:45 <clarkb> corvus: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-11-23.log.html#t2020-11-23T15:10:55
19:27:09 <corvus> ianw: i'm not sure what the point you're making is
19:27:26 <corvus> ianw: that is clearly a minimized and obfuscated file; i don't deny the existence of such things
19:27:44 <corvus> i only say that plenty of people write javascript as the input to creating such files
19:28:43 <corvus> the fact that there are minimized js files doesn't mean we need to learn new languages; the upstream polygerrit plugins are written in something resembling js, right?  so collaboration with others could be done that way, and since we've managed to teach some zuul devs how to do some basic js, they may be able to contribute too
19:29:28 <clarkb> corvus: yup agreed. Maybe the best thing here is to hold out and see if we can upstream support for this into an existing plugin first
19:29:37 <ianw> right, anyway i guess the exact point at hand is that this concrete proposal for adding this table is written in https://reasonml.github.io/ and we probably have to decide if we want to incorporate that
19:29:48 <corvus> clarkb: based on that convo, it seems like we're saying "someone needs to learn polygerrit" vs "someone needs to learn reasonml"
19:30:26 <ianw> in terms of the bigger picture, of testing plugins, i think we should do some work there too.  fungi suggested on-list that we should hold a node to test the plugins, which sort of works
19:30:29 <clarkb> right which I still think would be better if that is the toolchain gerrit has attached to
19:30:33 <corvus> ianw: i've already -2d one change to add reasonml to zuul based on the lack of support for our last experiment with an esoteric language
19:30:53 <clarkb> because then we're collaborating in that ecosystem rather than setting off on our own and being different
19:30:53 <ianw> however, getting reviews into that held gerrit that look useful enough to test the plugin is a bit of a pain
19:31:03 <fungi> ianw: yeah, what i didn't consider at the time was that we also need to get some representative content into the held gerrit somehow
19:31:25 <clarkb> ianw: fungi could we autogenerate some content?
19:31:30 <fungi> we could instead demo things on review-test for now, i suppose, and hold off deleting it
19:31:30 <corvus> i mean, i like playing with esoteric functional languages, don't get me wrong, but as a group we don't have the best track record there, whereas i think there's a bigger chance we can get more long-term collaboration/support by sticking with how upstream does plugins
19:31:34 <clarkb> make a project, push some changes, merge a change or two, etc
19:31:46 <corvus> clarkb: ++ 'collaborating in that ecosystem'
19:31:53 <ianw> clarkb: yes, i think so ... but we need to figure out adding the first admin user automatically
19:32:08 <clarkb> ianw: the zuul all in one stuff does that, I bet we can reuse it
19:32:09 <fungi> ianw: i have that figured out
19:32:10 <corvus> you just need to leave a comment to test this, right?
19:32:33 <clarkb> corvus: ya a zuul formatted comment I think (maybe the username matters too? I'm not sure)
19:32:51 <fungi> there are probably multiple ways to create an initial admin account, but one is to use the gerrit cli with the built-in "gerrit code review" user
19:32:54 <ianw> fungi: ok, i think we should go through together out of meeting maybe, and see if we can get the test job doing it
19:32:54 <corvus> for hideci, yes; but hopefully we can omit that in the future -- comment tags are a thing :)
19:33:08 <fungi> i think the zuul quickstart just uses become auth right?
19:33:17 <fungi> been a while since i looked at that bit
19:33:30 <ianw> at that point, it seems like it would also be easy to use a headless browser to take a screenshot of a review, which would make it easy to have an artifact confirming plugins working
19:33:47 <ianw> and we can also hold the node for manual fiddling
19:33:50 <fungi> that also sounds really awesome
19:34:31 <ianw> there's some flag, DEVELOPMENT_BECOME_ANY_ACCOUNT which i didn't fully get to understanding last week
19:35:11 <fungi> ianw: the alternative is the mechanism i describe in the gerrit admins section of our system-config docs. that works even on a gerrit with no existing accounts
19:35:21 <corvus> also, ftr, i suspect it's perfectly fine to make a new plugin if this doesn't fit with zuul-status; i don't get the impression that lots of small plugins are necessarily bad.
19:35:45 <clarkb> corvus: that is a good point. It seems the more important bit is using the toolchains upstream is using then they may get involved and help us
19:35:52 <ianw> fungi: ok, that was what i was trying but wasn't getting an admin account.  i think we should try again
19:36:00 <clarkb> I think the gerrit maintainers do actually do a reasonable amount of plugin work to keep them working as things change in gerrit
19:36:07 <clarkb> supporting that work would be a good idea imo
19:36:33 <ianw> i've already engaged on the thread; i can write a summary to respond if we like
19:36:55 <clarkb> that sounds like a good way to recap this discussion for those who may not be hear
19:36:57 <fungi> that reminds me, paladox contributed an opendev theme override (with light and dark mode support) as what i think is a pg plugin, but it's just an sgml/html blob in a paste. i was going to try to learn how to integrate that
19:36:58 <clarkb> s/hear/here/
19:37:19 <ianw> it sounds like basically a) we're not currently convinced on the separate project, especially in a language that doesn't have a lot of exposure, and would like to investigate integrating with upstream more
19:37:34 <ianw> and b) we'd like to expand the overall plugin testing environment to make it easier
19:37:39 <clarkb> ianw: ++
19:37:52 <ianw> i'll draft something and loop people back
19:37:55 <clarkb> thank you
19:38:01 <fungi> also if that discussion thread wasn't on service-discuss, could it be redirected there?
19:38:15 <fungi> i have a feeling it might have ended up on openstack-discuss
19:38:18 <ianw> i can cc, i think it was openstack discuss only
19:38:31 <corvus> yeah, fwiw i have no idea what thread is being discussed :(
19:39:11 <corvus> also, friendly reminder that there is a zuul running for the purposes of testing plugins in the upstream gerrit; i have no idea what testing means for polygerrit plugins; that may be interesting to learn
19:39:17 <ianw> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-November/019051.html
19:39:18 <fungi> i do recall replying on it, but in retrospect i should have asked people to follow up to service-discuss
19:39:20 <ianw> for reference
19:39:27 <corvus> (it's mostly testing java plugins)
19:39:42 <fungi> thanks ianw
19:40:23 <clarkb> alright anything else on Gerrit before we move on?
19:40:29 <fungi> part of the problem is i subscribe to lots of mailing lists and dump them into the same folder, so sometimes it's not immediately apparent to me if people have started discussions in the wrong ml
19:41:05 <fungi> maybe we should agree to move forward with the jdk update asap?
19:41:27 <fungi> other than that, no i think we've got things pretty well covered
19:41:30 <clarkb> I'm on board, its being tested on review-test. If others can give that a quick check then we're probably good to proceed on that
19:41:50 <clarkb> thinking out loud here: do the jgit autogc config first maybe? then do java 11 next?
19:41:58 <clarkb> just to do one thing at a time and autogc fix seems simpler
19:42:07 <fungi> yeah, i'll push that change up after the meeting
19:42:11 <clarkb> thanks
19:42:22 <clarkb> #topic Update Config Management
19:43:30 <clarkb> Is there anything new on this effort to call out? I don't think so but I'm double checking
19:44:12 <fungi> the codesearch rebuild maybe?
19:44:26 <clarkb> oh ya ianw ^ is that complete at this point?
19:44:26 <fungi> we have two servers at the moment still, right?
19:44:43 <fungi> oh, actually it's a cname now
19:44:45 <ianw> no i cleaned the old one up, that should be all finished now
19:44:52 <fungi> awesome, thanks!
19:44:58 <ianw> nobody has complained so i assume it's working perfectly :)
19:45:05 <clarkb> excellent
19:45:18 <fungi> yes, i was making a point to use the opendev one so i would test it
19:45:24 <fungi> and have had no problems
19:45:49 <clarkb> #topic General topics
19:45:56 <clarkb> #topic Bup and Borg Backups
19:45:57 <corvus> ianw: ++ thanks!
19:46:17 <clarkb> I think we're getting more and more comfortable with borg? I've unfortunately had little time to interact with it more recently
19:46:34 <clarkb> ianw: I know at some point you wanted to do verification then drop bup?
19:46:37 <fungi> i should practice with restoring something i guess
19:46:51 <clarkb> maybe a good thing to try and do before dropping bup is having other admins do things like ^
19:46:55 <ianw> yeah, i was thinking what i'll do is a config change to remove the bup cron jobs; people can audit the borg changes and approve that when happy
19:47:07 <clarkb> ianw: that sounds like a reasonable plan
19:47:13 <fungi> i'm down
19:47:33 <clarkb> and that is important for the focal upgrades we were talking about earlier too
19:47:35 <ianw> then we can kill all the puppet bits and maybe just attach the old backup volumes to the new server for a bit
19:47:37 <clarkb> since bup and python3 don't mix
19:48:22 <clarkb> thank you for getting this moving and doing all that work, really appreciated
19:48:37 <clarkb> #topic Docker Rate Limits are Being Seen in CI
19:49:01 <clarkb> This is mostly a heads up/fyi
19:49:15 <ianw> mostly in NAT environments?
19:49:17 <clarkb> jobs particularly those running on limestone seem to hit this
19:49:36 <clarkb> ianw: ya, though I would've expected it to hit all environments fairly equally due to our use of mirrors? But maybe we aren't using the mirrors the way I thought we were
19:49:44 <fungi> yeah, we're not seeing it so much on our proxies as on limestone nat for jobs not using the proxy
19:50:16 <clarkb> a few weeks back I pushed up changes to switch our zuul mirror config for docker over to just using the host addrs rather than the mirror. I don't think we need to land those yet since its NAT getting us
19:50:39 <clarkb> but something to be aware of and maybe we need to bring that conversation for getting our images open source specialled again
19:50:53 <clarkb> jbryce was going to look at the agreement in more detail and get back to us but I think like us has been busy
19:51:03 <clarkb> another option is to use quay which does not rate limit
19:51:09 <clarkb> but does have outages when aws east goes down
19:51:23 <clarkb> I don't have answers, just info for people to digest :)
19:51:40 <corvus> or make a new kind of pass-through proxy/mirror
19:52:00 <clarkb> ya one that understands it needs to be a sort of lru cache
19:52:21 <corvus> yup; i believe that's doable and much of the code in zuul-registry can be repurposed for that
19:52:37 <corvus> (but still, it's not a trivial project, so one that we should deliberately choose)
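
The pass-through proxy idea above (one that "understands it needs to be a sort of lru cache") can be sketched roughly as follows. This is only an illustration of the caching behaviour, not zuul-registry code: the class name `PassThroughLRUCache`, the `fetch_upstream` callable, and the byte budget are all hypothetical, and a real proxy would also have to deal with registry auth, manifests versus blobs, and on-disk storage.

```python
from collections import OrderedDict
from typing import Callable


class PassThroughLRUCache:
    """Serve items from a size-bounded cache, fetching upstream on a miss.

    Hypothetical sketch: fetch_upstream stands in for whatever would
    actually talk to the upstream registry.
    """

    def __init__(self, fetch_upstream: Callable[[str], bytes], max_bytes: int) -> None:
        self._fetch = fetch_upstream
        self._max_bytes = max_bytes
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._size = 0
        # Counts requests that count against the upstream rate limit.
        self.upstream_requests = 0

    def get(self, key: str) -> bytes:
        if key in self._entries:
            # Cache hit: mark as most recently used, no upstream traffic.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: pass the request through and remember the result.
        self.upstream_requests += 1
        data = self._fetch(key)
        self._store(key, data)
        return data

    def _store(self, key: str, data: bytes) -> None:
        if len(data) > self._max_bytes:
            # Larger than the whole budget: pass through without caching.
            return
        self._entries[key] = data
        self._size += len(data)
        # Evict least recently used entries until we fit the budget again.
        while self._size > self._max_bytes:
            _, evicted = self._entries.popitem(last=False)
            self._size -= len(evicted)
```

Repeated `get()` calls for the same key only reach upstream once while the entry stays in the cache, which is the property that would keep a shared mirror under a per-address pull limit.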
19:52:52 <fungi> also possibly a more useful effort than trying to bend something like squid to cache "authenticated" requests
19:53:09 <ianw> so that would authenticate with a higher-limited key, and transparently pass through all our requests?
19:53:10 <clarkb> though possibly squid would be better for our http caching we do on those hosts in general
19:53:21 <clarkb> since in theory it can be more flexible than what apache is currently doing
19:53:24 <corvus> ianw: or even anonymously but just stay under the limit?
19:53:42 <clarkb> ya docker hub sends the required cache control headers to cache publicly those manifests
19:53:54 <clarkb> the issue is that apache will not cache any authenticated request even with those headers
19:54:00 <fungi> there's no "anonymously" really though, right?
19:54:01 <clarkb> we believe squid can be convinced to do so though
19:54:02 <corvus> clarkb: do you think the squid approach will work with all the weird auth stuff?
19:54:28 <clarkb> corvus: I think so if we can make it respect the cache-control: public or whatever header it is that is sent back by docker hub
19:54:28 <corvus> fungi: in the way i intended to use it, yes (an auth credential obtained with no identifying information)
19:54:35 <ianw> what we have not traditionally done is limit our mirrors to only be connectable from their respective clouds; we might want to think about that if we're using an opendev-specific key
19:54:44 <corvus> fungi: (authz without authn i guess?)
19:55:16 <fungi> it's been years since i've done esoteric things with squid (including trivially patching it to ignore some things which would cause it not to cache but that it lacked configuration for), so it would need a poc regardless
19:55:17 <clarkb> I think the major issue with apache as we use it for this problem space is that it will never cache a request that had an authorization header even if cache control says it is ok to do so
19:55:48 <clarkb> if apache could be convinced to do ^ it would probably be fine too. Since it is now the manifest data that we need to cache
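
The caching question discussed above comes down to one rule from the HTTP caching specification: a shared cache may store the response to a request that carried an Authorization header only if the response explicitly allows it (for example with `public`, `s-maxage`, or `must-revalidate`). A minimal sketch of that decision follows. The function name and the `honor_auth_exceptions` switch are hypothetical; setting the switch to False models the behaviour the discussion attributes to Apache as deployed here, and freshness and status-code checks are deliberately left out.

```python
from typing import Mapping, Set


def _directives(cache_control: str) -> Set[str]:
    """Return the lower-cased directive names from a Cache-Control value."""
    return {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }


def shared_cache_may_store(
    request_headers: Mapping[str, str],
    response_headers: Mapping[str, str],
    honor_auth_exceptions: bool = True,
) -> bool:
    """Simplified check of whether a shared cache may store a response."""
    # Header names are case-insensitive, so normalise them first.
    req = {name.lower(): value for name, value in request_headers.items()}
    resp = {name.lower(): value for name, value in response_headers.items()}
    directives = _directives(resp.get("cache-control", ""))

    # These directives forbid storage in a shared cache outright.
    if "no-store" in directives or "private" in directives:
        return False

    if "authorization" in req:
        if not honor_auth_exceptions:
            # Never cache authenticated requests, whatever the response says.
            return False
        # Authenticated requests need explicit permission from the origin.
        return bool(directives & {"public", "s-maxage", "must-revalidate"})

    # Simplification: anything else is treated as storable.
    return True
```

With `honor_auth_exceptions=True`, a response with `Cache-Control: public, max-age=3600` to a token-bearing request is storable; with it set to False the same response is always refetched, which is why every job pull would count against the rate limit.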
19:56:22 <corvus> based on my estimation of effort, it sounds like spending a couple of days attempting to get squid to work should take precedence over a couple of weeks to implement a smart registry proxy
19:56:48 <corvus> (or, you know, convince everyone to use quay.io :)
19:57:06 <fungi> yeah, like i said, there have been times when i had to patch and recompile squid to get it to cache some stuff too, so i don't want to say it's necessarily better than apache mod_proxy, and i don't personally think being stuck maintaining our own patched build of either of those is particularly wise
19:58:15 <fungi> the first thing it really needs is exploration
19:58:29 <clarkb> ++
19:58:41 <clarkb> part of the issue in the past is the info from docker was a bit vague
19:58:55 <clarkb> but now we've got a bit more real world data and we should be able to work with that to find a reasonable solution
19:59:18 <clarkb> alright we are just about at time so I'll call it here
19:59:20 <clarkb> thanks everyone
19:59:21 <fungi> the other part of the issue was that it was an advance warning about stuff they weren't actually doing yet, yeah
19:59:30 <fungi> now it's observable and testable at least
19:59:33 <clarkb> feel free to continue any/all of these conversations in #opendev
19:59:39 <clarkb> #endmeeting