19:01:19 <clarkb> #startmeeting infra
19:01:19 <opendevmeet> Meeting started Tue May 23 19:01:19 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:19 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 <opendevmeet> The meeting name has been set to 'infra'
19:01:30 <clarkb> Hello everyone (I expect a small group today, that's fine)
19:01:47 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda
19:01:52 <clarkb> #topic Announcements
19:02:08 <clarkb> I guess a friendly reminder that the Open Infra summit is coming up in a few weeks
19:02:13 <clarkb> less than a month away now (like 3 weeks?)
19:02:32 * fungi sticks his head in the sand
19:02:46 <corvus> 2 more weeks till time to panic
19:03:22 <clarkb> #topic Migrating container images to quay.io
19:03:55 <clarkb> After our discussion last week I dove in and started poking at running services with podman in system-config and leaned on Zuul jobs to give us info back
19:04:13 <clarkb> The good news is that podman and buildkit seem to solve the problem with speculative testing of container images
19:04:19 <clarkb> (this was expected but good to confirm)
19:04:43 <clarkb> The bad news is that switching to podman isn't super straightforward, due to a number of smaller issues that add up in my opinion
19:06:05 <clarkb> Hurdles include: packaging for podman and friends isn't great until you get to Ubuntu Jammy or newer; podman doesn't support syslog logging, so we need to switch all our logging over to journald; and in addition to simply checking that services can run under podman, we need to transition to podman, which means stopping docker services and starting podman services in some sort of
19:06:07 <clarkb> coordinated fashion. There are also questions about whether or not we should change which user runs services when moving to podman
19:06:26 <clarkb> I have yet to find a single blocker that would prevent us from doing the move, but I'm not confident we can do it in a short period of time
19:07:10 <clarkb> For this reason I brought this up yesterday in #opendev and basically said I think we should go with the skopeo workaround or revert to docker hub. In both cases we could then work forward to move services onto podman and either remove the skopeo workaround or migrate to quay.io at a later date
19:07:40 <clarkb> During that discussion the consensus seemed to be that we preferred reverting back to docker hub. Then we can move to podman, then move to quay.io, and everything should be happy, unlike the current state
19:08:06 <fungi> that matches my takeaway
19:08:08 <clarkb> I wanted to bring this up more formally in the meeting before proceeding with that plan. Has anyone changed their mind or have new input etc?
19:09:30 <clarkb> if we proceed with that plan I think it will look very similar to the quay.io move. We want to move the base images first so that when we move the other images back they rebuild and see the current base images
19:09:53 <clarkb> One thing that makes this tricky is the sync of container images back to docker hub since I don't think I can use a personal account for that like I did with quay.io. But that is solvable
19:10:24 <clarkb> I'll also draft up a document like I did for the move to quay.io so that we can have a fairly complete list of tasks and keep track of it all.
19:11:28 <clarkb> I'm not hearing any objections or new input. In that case I plan to start on this tomorrow (I'd like to be pretty heads down on it, at least to start, to make sure I don't miss anything important)
19:11:38 <corvus> clarkb: sounds good to me
19:11:46 <fungi> thanks!
19:11:51 <tonyb> sounds good.  Thank you clarkb for pulling all that together
19:11:58 <corvus> clarkb: i think you can just manually use the opendev zuul creds on docker?
19:12:03 <clarkb> corvus: yup
19:12:24 <clarkb> should be able to docker login with them during the period of time I need them
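(For reference, a rough sketch of what that one-time sync back to Docker Hub could look like, assuming skopeo is installed; the image list and the credential environment variable names below are hypothetical placeholders, not the actual opendevorg image list or the real account:)

    #!/usr/bin/env python3
    """Sketch: copy container images from quay.io back to Docker Hub with skopeo.

    Hypothetical example only; a real sync would iterate over the actual
    opendevorg image list and use the opendev zuul Docker Hub credentials.
    """
    import os
    import subprocess

    # Hypothetical image list; base images would be copied first.
    IMAGES = ["python-base:3.11-bookworm", "gitea:latest"]

    def main():
        # Placeholder environment variable names for the Docker Hub creds.
        creds = f"{os.environ['DOCKERHUB_USER']}:{os.environ['DOCKERHUB_PASSWORD']}"
        for image in IMAGES:
            subprocess.run(
                [
                    "skopeo", "copy",
                    "--all",                 # keep every architecture in a manifest list
                    "--dest-creds", creds,   # credentials for the destination registry
                    f"docker://quay.io/opendevorg/{image}",
                    f"docker://docker.io/opendevorg/{image}",
                ],
                check=True,
            )

    if __name__ == "__main__":
        main()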
19:12:30 <corvus> also, it seems like zuul can continue moving forward with quay.io
19:12:45 <corvus> since it's isolated from the opendev/openstack tenant
19:12:53 <corvus> so it can be the advance party
19:13:08 <clarkb> agreed and the podman moves there are isolated to CI jobs which are a lot easier to transition. They either work or they don't (and as of a few minutes ago I think they are working)
19:13:18 <corvus> i'm in favor of zuul continuing with that, iff the plan is for opendev to eventually move.
19:13:27 <corvus> i'd like us to be aligned long-term, one way or the other
19:13:56 <clarkb> I'd personally like to see opendev move to podman and quay.io. I think for long term viability the extra functionality is going to be useful
19:14:12 <corvus> ++
19:14:15 <fungi> same
19:14:16 <clarkb> both tools have extra features that enable things like per image access controls in quay.io, speculative gating etc
19:14:25 <clarkb> I'm just acknowledging it will take time
19:14:56 <corvus> also, we should be able to move services to podman one at a time, then once everything is on podman, make the quay switch
19:15:48 <clarkb> ++
19:16:07 <clarkb> servers currently on Jammy would be a good place to start since jammy has podman packages built in
19:16:15 <clarkb> currently I think that is gitea, gitea-lb and etherpad?
19:16:51 <clarkb> alright I probably won't get to that today since I've got other stuff in flight already, but hope to focus on this tomorrow. I'll ping for reviews :)
19:17:25 <clarkb> #topic Bastion Host Changes
19:17:37 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:17:43 <clarkb> this topic still needs reviews if anyone has time
19:17:48 <clarkb> otherwise I am not aware of anything new
19:18:03 <fungi> lists01 is also jammy
19:18:27 <clarkb> ah yup
19:18:35 <clarkb> #topic Mailman 3
19:18:44 <clarkb> fungi: any new movement with mailman3 things, speaking of lists01?
19:18:56 <fungi> i'm fiddling with it now, current held node is 173.231.255.71
19:19:24 <clarkb> anything we can do to help?
19:19:31 <fungi> with the current stack the default mailman hostname is no longer one of the individual list hostnames, which gives us some additional flexibility
19:20:08 <clarkb> fungi: we probably want to test the transition from our current state to that state as well?
19:20:15 <clarkb> (I'm not sure if that has been looked at yet)
19:20:27 <fungi> i'm still messing with the django commands to see how we might automate the additional django site to postorius/hyperkitty hostname mappings
19:20:44 <fungi> but yes, that should be fairly testable manually
19:21:09 <fungi> for a one-time transition it probably isn't all that useful to build automated testing
19:21:21 <fungi> but ultimately it's just a few db entries changing
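(As a rough illustration of the "few db entries" in question, here is a minimal sketch run from the mailman-web Django shell; it assumes the stock django.contrib.sites Site model plus the MailDomain model from django_mailman3, and the hostname used is only an example:)

    # Sketch only: run via `python manage.py shell` in the mailman-web env.
    # Model names and hostname here are assumptions, not the actual migration.
    from django.contrib.sites.models import Site
    from django_mailman3.models import MailDomain  # assumed django_mailman3 model

    # Create (or reuse) a Django Site for the list hostname.
    site, _ = Site.objects.get_or_create(
        domain="lists.opendev.org", defaults={"name": "lists.opendev.org"}
    )

    # Map the Mailman mail domain onto that Site so Postorius/HyperKitty
    # serve it under the expected hostname.
    MailDomain.objects.update_or_create(
        mail_domain="lists.opendev.org", defaults={"site": site}
    )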
19:21:58 <clarkb> ya not sure it needs to be automated. More just tested
19:22:02 <fungi> agreed
19:22:06 <clarkb> similar to how we've checked the gerrit upgrade rollback procedures
19:22:21 <clarkb> Sounds good. Let us know if we can help
19:22:48 <fungi> folks are welcome to override their /etc/hosts to point to the held node and poke around the webui, though this one needs data imported still
19:23:07 <fungi> and there are still things i haven't set, so that's probably premature anyway
19:23:26 <clarkb> ok I've scribbled a note to do that if I have time. Can't hurt anyway
19:23:39 <fungi> exactly
19:23:42 <fungi> thanks!
19:24:19 <clarkb> #topic Gerrit leaked replication task files
19:24:39 <clarkb> This is ongoing. No movement upstream in the issues I've filed. The number of files is growing at a steady but manageable rate
19:24:57 <clarkb> I'm honestly tempted to undo the bind mount for this directory and go back to potentially lossy replications though
19:25:21 <clarkb> I've been swamped with other stuff though and since the rate hasn't grown in a scary way I've been happy to deprioritize looking into this further
19:25:55 <clarkb> Maybe late this week / early next week I can put a java developer hat on and see about fixing it though
19:26:19 <clarkb> #topic Upgrading Old Servers
19:26:40 <clarkb> similar to the last item I haven't had time to look at this recently. I don't think anyone else has either based on what I've seen in gerrit/irc/email
19:26:56 <fungi> i have not, sorry
19:27:00 <clarkb> This should probably be a higher priority than playing java dev though so maybe I start here when I dig out of my hole
19:27:37 <clarkb> if anyone else ends up with time please jump on one of these. It is a big help
19:27:42 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:27:49 <clarkb> #topic Fedora Cleanup
19:28:14 <clarkb> This topic was the old openafs utilization topic. But I pushed a change to remove unused fedora mirror content, which freed up about 200GB, and now we are below 90% utilization
19:28:25 <tonyb> yay
19:28:26 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IOYIYWGTZW3TM4TR2N47XY6X7EB2W2A6/ Proposed removal of fedora mirrors
19:28:39 <clarkb> No one has said please keep fedora we need it for X, Y or Z
19:28:57 <clarkb> corvus: ^ I actually recently remembered that the nodepool openshift testing uses fedora but I don't think it needs to
19:29:22 <clarkb> openshift itself is installed on centos and then nodepool is run on fedora. It could probably even be a single node job, or the fedora node should be able to be replaced with anything else
19:29:42 <clarkb> either way I think the next step here is removing fedora mirror config from our jobs so that jobs on fedora talk upstream
19:29:53 <tonyb> I started looking at pulling the setup of f36 out of zuul, but doing that without breaking existing users outside of here is a little bigger than expected
19:30:10 <clarkb> then we can remove all fedora mirroring and then we can work on clearing out fedora jobs and the nodes themselves
19:30:11 <fungi> ianw has a (failing/wip) change to remove our ci mirrors from fedora builds in dib
19:30:13 <clarkb> tonyb: oh cool
19:30:21 <fungi> #link https://review.opendev.org/883798 "fedora: don't use CI mirrors"
19:30:26 <clarkb> fungi: that change should pass as soon as the nodepool podman change merges
19:30:27 <corvus> clarkb: i agree, it should be able to be replaced
19:30:54 <clarkb> tonyb: is the issue that we assume all nodes should have a local mirror configured? I wonder how we are handling that with rocky. Maybe we just ignored rocky in that role?
19:31:09 <fungi> clarkb: there are fedora-specific tests in dib which are currently in need of removal for that change to pass, it looked like to me
19:31:18 <tonyb> yeah rocky is ignored
19:32:20 <tonyb> my initial plan was to just pull fedora but then I realised if there were non OpenDev users of zuul that did care about fedora mirroring they get broken
19:32:38 <clarkb> fungi: ah
19:32:53 <tonyb> so now I'm working on adding a flag to the base zuul jobs to say ... skip fedora
19:33:00 <fungi> oh, never mind. now the failures i'm looking at are about finding files on disk, so maybe. (or maybe i'm looking at different errors now)
19:33:10 <clarkb> tonyb: got it. That does pose a problem. If you add flags it might be good to add flags for everything too for consistency
19:33:21 <tonyb> but trying to do that well and struggling with perfect being the enemy of good
19:33:32 <clarkb> tonyb: but getting something up so that the zuul community can weigh in is probably best before over engineering
19:33:36 <clarkb> ++
19:33:51 <tonyb> clarkb: yeah doing it for everything was the plan
19:34:05 <fungi> but yeah, the dib-functests failure doesn't appear on the surface to be container-toolchain-related
19:34:23 <tonyb> kinda adding a distro-enable-mirror style flag
19:34:46 <tonyb> I'll post a Wip for a more concrete discussion
19:35:07 <clarkb> sounds good
19:35:16 <clarkb> #topic Quo Vadis Storyboard
19:35:22 <clarkb> Anything new here?
19:35:31 <clarkb> I don't have anything new. ETOOMANYTHINGS
19:36:04 <fungi> nor i
19:36:15 <fungi> well, the spamhaus blocklist thing
19:36:47 <fungi> apparently spamhaus css has listed the /64 containing storyboard01's ipv6 address
19:36:48 <clarkb> oh ya we should probably talk to openstack about that. tl;dr is cloud providers (particularly public cloud providers) should assign no less than a /64 to each instance
19:37:03 <fungi> due to some other tenant in rackspace sending spam from a different address in that same network
19:37:24 <fungi> they processed my removal request, but then immediately re-added it because of the problem machine that isn't ours
19:37:43 <fungi> #link https://storyboard.openstack.org/#!/story/2010689 "Improve email deliverabilty for storyboard@storyboard.openstack.org, some emails are considered SPAM"
19:38:21 <fungi> there was also a related ml thread on openstack-discuss, which one of my story comments links to a message in
19:38:24 <corvus> i know you can tell exim not to use v6 for outgoing; it may be possible to tell it to prefer v4...
19:38:57 <fungi> right, that's a variation of what i was thinking in my last comment on the story
19:39:22 <fungi> and less hacky
19:39:29 <fungi> though still definitely hacky
19:39:42 <fungi> anyway, that's all i really had storyboard-related this week
19:39:54 <clarkb> thanks!
19:39:54 <clarkb> #topic Open Discussion
19:39:54 <clarkb> Anything else?
19:40:22 <fungi> there were a slew of rackspace tickets earlier today about server outages... pulling up the details
19:41:19 <fungi> zm06, nl04, paste01,
19:41:50 <fungi> looks like they were from right around midnight utc
19:41:57 <fungi> er, no, 04:00 utc
19:42:36 <fungi> anyway, if anyone spots problems with those three servers, that's likely related
19:43:12 <clarkb> only paste01 is a singleton that would really show up as a problem. Might be worth double checking the other two
19:43:45 <fungi> all three servers have a ~16-hour uptime right now
19:43:59 <fungi> so they at least came back up and are responding to ssh
19:44:40 <fungi> host became unresponsive and was rebooted in all three cases (likely all on the same host)
19:45:00 <clarkb> zm06 did not restart its merger
19:45:05 <clarkb> nl04 did restart its launcher
19:45:18 <clarkb> I can restart zm06 after lunch
19:45:22 <clarkb> thank you for calling that out
19:45:54 <corvus> we have "restart: on-failure"...
19:46:02 <fungi> i'll go ahead and close out those tickets on the rackspace end
19:46:19 <corvus> and services normally restart after boots
19:46:22 <fungi> also they worked two tickets i opened yesterday about undeletable servers, most were in the iad region
19:46:25 <corvus> so i wonder why zm06 didn't?
19:46:37 <fungi> in total, freed up 118 nodes worth of capacity
19:46:48 <clarkb> corvus: maybe a hard reboot from under the host doesn't count as a failure?
19:47:19 <fungi> in the twisted land of systemd
19:47:35 <corvus> May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.636033310Z" level=info msg="Removing stale sandbox 88813ad85aa8c751aa92b684e64e1ea7f2e9f2e9c8209ce79bfcf9fe18ee77e7 (05ce3156e52feef4e8e78ad8015aabba477f7d926635e8bf59534a8294d44559)"
19:47:35 <corvus> May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.638772820Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint f128a3c81341d323dfbcbb367224049ec2aac64f10af50e702b3f546f8f09a6c 6e0ac20c216668af07c98111a59723329046fab81fde48b45369a1f7d088ffeb], retrying...."
19:47:40 <corvus> i have no idea what either of those mean
19:48:07 <corvus> other than that, i don't see any logs about docker's decision making there
19:49:12 <clarkb> probably fine to start things back up again then and see if it happens again in the future? I guess if we really want we can do a graceful reboot and see if it comes back up
19:49:50 <clarkb> Anything else? Last call. I can give you all 10 minutes back for breakfast/lunch/dinner :)
19:50:15 <fungi> i'm set
19:50:58 <clarkb> Thank you everyone. I expect we'll be back here next week and the week after. But then we'll skip on june 13th due to the summit
19:51:03 <clarkb> #endmeeting