Tuesday, 2023-05-23

clarkbJust about meeting time18:59
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue May 23 19:01:19 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkbHello everyone (I expect a small group today, that's fine)19:01
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbI guess a friendly reminder that the Open Infra summit is coming up in a few weeks19:02
clarkbless than a month away now (like 3 weeks?)19:02
* fungi sticks his head in the sand19:02
corvus2 more weeks till time to panic19:02
clarkb#topic Migrating container images to quay.io19:03
clarkbAfter our discussion last week I dove in and started poking at running services with podman in system-config and leaned on Zuul jobs to give us info back19:03
clarkbThe good news is that podman and buildkit seem to solve the problem with speculative testing of container images19:04
clarkb(this was expected but good to confirm)19:04
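One relevant capability here (not spelled out in the log) is that podman honors per-registry mirrors in registries.conf, which is what lets a CI job redirect quay.io pulls to its buildset registry for speculative image testing. A minimal sketch, with a placeholder mirror address:

    # Hypothetical drop-in a job could write so that quay.io pulls try the buildset registry first.
    cat <<'EOF' | sudo tee /etc/containers/registries.conf.d/buildset.conf
    [[registry]]
    location = "quay.io"

    [[registry.mirror]]
    location = "buildset-registry.example.org:5000"
    insecure = true
    EOF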
clarkbThe bad news is that switching to podman isn't super straightforward due to a number of smaller issues that add up, in my opinion19:04
clarkbHurdles include: packaging for podman and friends isn't great until you get on Ubuntu Jammy or newer; podman doesn't support syslog logging, so we need to switch all our logging over to journald; and in addition to simply checking whether services can run under podman, we need to transition to podman, which means stopping docker services and starting podman services in some sort of19:06
clarkbcoordinated fashion. There are also questions about whether or not we should change which user runs services when moving to podman19:06
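A rough illustration of the syslog-to-journald switch being described, assuming a service run directly with podman (the image name is only illustrative):

    # Log to journald instead of syslog, then read the container's logs back from the journal.
    podman run -d --name etherpad --log-driver=journald quay.io/opendevorg/etherpad:latest
    journalctl CONTAINER_NAME=etherpad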
clarkbI have yet to find a single blocker that would prevent us from doing the move, but I'm not confident we can do it in a short period of time19:06
clarkbFor this reason I brought this up yesterday in #opendev and basically said I think we should go with the skopeo workaround or revert to docker hub. In both cases we could then work forward to move services onto podman and either remove the skopeo workaround or migrate to quay.io at a later date19:07
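For reference, the copy underlying the skopeo workaround would look roughly like this (the repository paths are illustrative, not taken from the meeting):

    # Copy an image between registries without involving a local docker daemon.
    skopeo copy \
        docker://quay.io/opendevorg/python-builder:3.11-bullseye \
        docker://docker.io/opendevorg/python-builder:3.11-bullseye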
clarkbDuring that discussion the consensus seemed to be that we preferred reverting back to docker hub. Then we can move to podman then we can move to quay.io and everything should be happy with us unlike the current state19:07
fungithat matches my takeaway19:08
clarkbI wanted to bring this up more formally in the meeting before proceeding with that plan. Has anyone changed their mind or have new input etc?19:08
clarkbif we proceed with that plan I think it will look very similar to the quay.io move. We want to move the base images first so that when we move the other images back they rebuild and see the current base images19:09
clarkbOne thing that makes this tricky is the sync of container images back to docker hub since I don't think I can use a personal account for that like I did with quay.io. But that is solvable19:09
clarkbI'll also draft up a document like I did for the move to quay.io so that we can have a fairly complete list of tasks and keep track of it all.19:10
clarkbI'm not hearing any objections or new input. In that case I plan to start on this tomorrow (I'd like to be pretty heads-down on it, at least to start, to make sure I don't miss anything important)19:11
corvusclarkb: sounds good to me19:11
fungithanks!19:11
tonybsounds good.  Thank you clarkb for pulling all that together 19:11
corvusclarkb: i think you can just manually use the opendev zuul creds on docker?19:11
clarkbcorvus: yup19:12
clarkbshould be able to docker login with them during the period of time I need them19:12
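In practice that amounts to something like the following (the credential variables and image path are placeholders):

    # Temporarily authenticate with the project's Docker Hub credentials, push, then log out.
    echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USER" --password-stdin docker.io
    docker push docker.io/opendevorg/python-base:3.11-bullseye
    docker logout docker.io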
corvusalso, it seems like zuul can continue moving forward with quay.io19:12
corvussince it's isolated from the opendev/openstack tenant19:12
corvusso it can be the advance party19:12
clarkbagreed and the podman moves there are isolated to CI jobs which are a lot easier to transition. They either work or they don't (and as of a few minutes ago I think they are working)19:13
corvusi'm in favor of zuul continuing with that, iff the plan is for opendev to eventually move.19:13
corvusi'd like us to be aligned long-term, one way or the other19:13
clarkbI'd personally like to see opendev move to podman and quay.io. I think for long term viability the extra functionality is going to be useful19:13
corvus++19:14
fungisame19:14
clarkbboth tools have extra features that enable things like per image access controls in quay.io, speculative gating etc19:14
clarkbI'm just acknowledging it will take time19:14
corvusalso, we should be able to move services to podman one at a time, then once everything is on podman, make the quay switch19:14
clarkb++19:15
clarkbservers currently on Jammy would be a good place to start since jammy has podman packages built in19:16
clarkbcurrently I think that is gitea, gitea-lb and etherpad?19:16
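For reference, on Jammy the packages come straight from the Ubuntu archive, so getting podman onto one of those hosts is just (a sketch, not a command run in the meeting):

    # podman is in the Ubuntu 22.04 (Jammy) universe repository; no third-party repo needed.
    sudo apt-get update && sudo apt-get install -y podman
    podman --version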
clarkbalright I probably won't get to that today since I've got other stuff in flight already, but hope to focus on this tomorrow. I'll ping for reviews :)19:16
clarkb#topic Bastion Host Changes19:17
clarkb#link https://review.opendev.org/q/topic:bridge-backups19:17
clarkbthis topic still needs reviews if anyone has time19:17
clarkbotherwise I am not aware of anything new19:17
fungilists01 is also jammy19:18
clarkbah yup19:18
clarkb#topic Mailman 319:18
clarkbfungi: any new movement with mailman3 things, speaking of lists01?19:18
fungii'm fiddling with it now, current held node is 173.231.255.7119:18
clarkbanything we can do to help?19:19
fungiwith the current stack the default mailman hostname is no longer one of the individual list hostnames, which gives us some additional flexibility19:19
clarkbfungi: we probably want to test the transition from our current state to that state as well?19:20
clarkb(I'm not sure if that has been looked at yet)19:20
fungii'm still messing with the django commands to see how we might automate the additional django site to postorius/hyperkitty hostname mappings19:20
fungibut yes, that should be fairly testable manually19:20
fungifor a one-time transition it probably isn't all that useful to build automated testing19:21
fungibut ultimately it's just a few db entries changing19:21
clarkbya not sure it needs to be automated. More just tested19:21
fungiagreed19:22
clarkbsimilar to how we've checked the gerrit upgrade rollback procedures19:22
clarkbSounds good. Let us know if we can help19:22
fungifolks are welcome to override their /etc/hosts to point to the held node and poke around the webui, though this one needs data imported still19:22
fungiand there are still things i haven't set, so that's probably premature anyway19:23
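The override fungi suggests would look something like this (the list hostnames are assumed; only the held node IP comes from the log):

    # Point the production list hostnames at the held mailman3 test node.
    echo "173.231.255.71 lists.opendev.org lists.openstack.org" | sudo tee -a /etc/hosts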
clarkbok I've scribbled a note to do that if I have time. Can't hurt anyway19:23
fungiexactly19:23
fungithanks!19:23
clarkb#topic Gerrit leaked replication task files19:24
clarkbThis is ongoing. No movement upstream in the issues I've filed. The number of files is growing at a steady but manageable rate19:24
clarkbI'm honestly tempted to undo the bind mount for this directory and go back to potentially lossy replications though19:24
clarkbI've been swamped with other stuff though and since the rate hasn't grown in a scary way I've been happy to deprioritize looking into this further19:25
clarkbMaybe late this week / early next week I can put a java developer hat on and see about fixing it though19:25
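For anyone wanting to watch the growth rate, the check is just a file count; a hedged example (the path is an assumption based on the replication plugin's usual layout, not quoted from the log):

    # Count persisted replication task files under the Gerrit site directory.
    find /home/gerrit2/review_site/data/replication/ref-updates -type f | wc -l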
clarkb#topic Upgrading Old Servers19:26
clarkbsimilar to the last item I haven't had time to look at this recently. I don't think anyone else has either based on what I've seen in gerrit/irc/email19:26
fungii have not, sorry19:26
clarkbThis should probably be a higher priority than playing java dev though so maybe I start here when I dig out of my hole19:27
clarkbif anyone else ends up with time please jump on one of these. It is a big help19:27
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes19:27
clarkb#topic Fedora Cleanup19:27
clarkbThis topic was the old openafs utilization topic. But I pushed a change to remove unused fedora mirror content and that freed up about 200GB and now we are below 90% utilization19:28
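For context, the utilization being referred to is the AFS mirror storage; a quick way to eyeball a single mirror volume is something like this (the volume path is assumed):

    # Show quota and usage for the fedora mirror volume in AFS.
    fs listquota /afs/openstack.org/mirror/fedora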
tonybyay19:28
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IOYIYWGTZW3TM4TR2N47XY6X7EB2W2A6/ Proposed removal of fedora mirrors19:28
clarkbNo one has said please keep fedora we need it for X, Y or Z19:28
clarkbcorvus: ^ I actually recently remembered that the nodepool openshift testing uses fedora but I don't think it needs to19:28
clarkbopenshift itself is installed on centos and then nodepool is run on fedora. It could probably even be a single-node job, or the fedora node should be replaceable with anything else19:29
clarkbeither way I think the next step here is removing fedora mirror config from our jobs so that jobs on fedora talk upstream19:29
tonybI started looking at pulling the setup of f36 out of zuul, but that's a little bigger than expected if we want to avoid breaking existing users outside of here19:29
clarkbthen we can remove all fedora mirroring and then we can work on clearing out fedora jobs and the nodes themselves19:30
fungiianw has a (failing/wip) change to remove our ci mirrors from fedora builds in dib19:30
clarkbtonyb: oh cool19:30
fungi#link https://review.opendev.org/883798 "fedora: don't use CI mirrors"19:30
clarkbfungi: that change should pass as soon as the nodepool podman change merges19:30
corvusclarkb: i agree, it should be able to be replaced19:30
clarkbtonyb: is the issue that we assume all nodes should have a local mirror configured? I wonder how we are handling that with rocky. Maybe we just ignored rocky in that role?19:30
fungiclarkb: it looked to me like there are fedora-specific tests in dib that need to be removed for that change to pass19:31
tonybyeah rocky is ignored 19:31
tonybmy initial plan was to just pull fedora but then I realised if there were non OpenDev users of zuul that did care about fedora mirroring they get broken19:32
clarkbfungi: ah19:32
tonybso now I'm working on adding a flag to the base zuul jobs to say ... skip fedora 19:32
fungioh, never mind. now the failures i'm looking at are about finding files on disk, so maybe. (or maybe i'm looking at different errors now)19:33
clarkbtonyb: got it. That does pose a problem. If you add flags it might be good to add flags for everything too for consistency19:33
tonybbut trying to do that well and struggling with perfect being the enemy of good19:33
clarkbtonyb: but getting something up so that the zuul community can weigh in is probably best before over engineering19:33
clarkb++19:33
tonybclarkb: yeah doing it for everything was the plan19:33
fungibut yeah, the dib-functtests failure doesn't appear on the surface to be container-toolchain-related19:34
tonybkinda adding a distro-enable-mirror style flag19:34
tonybI'll post a Wip for a more concrete discussion 19:34
clarkbsounds good19:35
clarkb#topic Quo Vadis Storyboard19:35
clarkbAnything new here?19:35
clarkbI don't have anything new. ETOOMANYTHINGS19:35
funginor i19:36
fungiwell, the spamhaus blocklist thing19:36
fungiapparently spamhaus css has listed the /64 containing storyboard01's ipv6 address19:36
clarkboh ya we should probably talk to openstack about that. tl;dr is cloud providers (particularly public cloud providers) should assign no less than a /64 to each instance19:36
fungidue to some other tenant in rackspace sending spam from a different address in that same network19:37
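For reference, a listing like that can be confirmed straight over DNS; a hedged sketch using a placeholder address rather than storyboard01's real one:

    # Spamhaus lookups reverse the 32 nibbles of the IPv6 address and append the zone.
    # Placeholder 2001:db8::1 expands and reverses to the label below; a 127.0.0.x answer means listed.
    dig +short 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.zen.spamhaus.org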
fungithey processed my removal request, but then immediately re-added it because of the problem machine that isn't ours19:37
fungi#link https://storyboard.openstack.org/#!/story/2010689 "Improve email deliverabilty for storyboard@storyboard.openstack.org, some emails are considered SPAM"19:37
fungithere was also a related ml thread on openstack-discuss, which one of my story comments links to a message in19:38
corvusi know you can tell exim not to use v6 for outgoing; it may be possible to tell it to prefer v4...19:38
fungiright, that's a variation of what i was thinking in my last comment on the story19:38
fungiand less hacky19:39
fungithough still definitely hacky19:39
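The main-config knob corvus mentions does exist; a hedged sketch of the blunt variant (disabling v6 outright rather than merely preferring v4, with a Debian split-config path assumed):

    # Tell exim not to use IPv6 at all, then regenerate and reload the Debian exim4 config.
    echo 'disable_ipv6 = true' | sudo tee /etc/exim4/conf.d/main/00_local_disable_ipv6
    sudo update-exim4.conf && sudo systemctl reload exim4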
fungianyway, that's all i really had storyboard-related this week19:39
clarkbthanks!19:39
clarkb#topic Open Discussion19:39
clarkbAnything else?19:39
fungithere were a slew of rackspace tickets early today about server outages... pulling up the details19:40
fungizm06, nl04, paste01, 19:41
fungilooks like they were from right around midnight utc19:41
fungier, no, 04:00 utc19:41
fungianyway, if anyone spots problems with those three servers, that's likely related19:42
clarkbonly paste01 is a singleton that would really show up as a problem. Might be worth double checking the other two19:43
fungiall three servers have a ~16-hour uptime right now19:43
fungiso they at least came back up and are responding to ssh19:43
fungihost became unresponsive and was rebooted in all three cases (likely all on the same host)19:44
clarkbzm06 did not restart its merger19:45
clarkbnl04 did restart its launcher19:45
clarkbI can restart zm06 after lunch19:45
clarkbthank you for calling that out19:45
corvuswe have "restart: on-failure"...19:45
fungii'll go ahead and close out those tickets on the rackspace end19:46
corvusand services normally restart after boots19:46
fungialso they worked two tickets i opened yesterday about undeletable servers, most were in the iad region19:46
corvusso i wonder why zm06 didn't?19:46
fungiin total, freed up 118 nodes worth of capacity19:46
clarkbcorvus: maybe hard reboot under the host doesn't count as failure?19:46
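Worth noting that docker's on-failure policy only restarts a container that exits nonzero and, per the docker documentation, does not restart it when the daemon itself comes back up, which fits clarkb's guess. One way to inspect and adjust (the container name here is a guess):

    # Check the restart policy and current state of the merger container after the reboot...
    docker inspect -f '{{.HostConfig.RestartPolicy.Name}} {{.State.Status}}' zuul-merger
    # ...then start it again, or switch to a policy that survives host reboots.
    docker start zuul-merger
    docker update --restart unless-stopped zuul-merger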
fungiin the twisted land of systemd19:47
corvusMay 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.636033310Z" level=info msg="Removing stale sandbox 88813ad85aa8c751aa92b684e64e1ea7f2e9f2e9c8209ce79bfcf9fe18ee77e7 (05ce3156e52feef4e8e78ad8015aabba477f7d926635e8bf59534a8294d44559)"19:47
corvusMay 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.638772820Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint f128a3c81341d323dfbcbb367224049ec2aac64f10af50e702b3f546f8f09a6c 6e0ac20c216668af07c98111a59723329046fab81fde48b45369a1f7d088ffeb], retrying...."19:47
corvusi have no idea what either of those mean19:47
corvusother than that, i don't see any logs about docker's decision making there19:48
clarkbprobably fine to start things back up again then and see if it happens again in the future? I guess if we really want we can do a graceful reboot and see if it comes back up19:49
clarkbAnything else? Last call I can give you all 10 minutes back for breakfast/lunch/dinner :)19:49
fungii'm set19:50
clarkbThank you everyone. I expect we'll be back here next week and the week after. But then we'll skip on june 13th due to the summit19:50
clarkb#endmeeting19:51
opendevmeetMeeting ended Tue May 23 19:51:03 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:51
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.html19:51
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.txt19:51
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.log.html19:51
fungithanks clarkb!19:52
