19:01:19 #startmeeting infra
19:01:19 Meeting started Tue May 23 19:01:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 The meeting name has been set to 'infra'
19:01:30 Hello everyone (I expect a small group today, that's fine)
19:01:44 link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda
19:01:47 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda
19:01:52 #topic Announcements
19:02:08 I guess a friendly reminder that the Open Infra summit is coming up in a few weeks
19:02:13 less than a month away now (like 3 weeks?)
19:02:32 * fungi sticks his head in the sand
19:02:46 2 more weeks till time to panic
19:03:22 #topic Migrating container images to quay.io
19:03:55 After our discussion last week I dove in and started poking at running services with podman in system-config and leaned on Zuul jobs to give us info back
19:04:13 The good news is that podman and buildkit seem to solve the problem with speculative testing of container images
19:04:19 (this was expected but good to confirm)
19:04:43 The bad news is that switching to podman isn't super straightforward, due to a number of smaller issues that add up in my opinion
19:06:05 Hurdles include: packaging for podman and friends isn't great until you get on Ubuntu Jammy or newer; podman doesn't support syslog logging, so we need to switch all our logging over to journald; and in addition to simply checking whether services can run under podman, we need to transition to podman, which means stopping docker services and starting podman services in some sort of
19:06:07 coordinated fashion. There are also questions about whether or not we should change which user runs services when moving to podman
19:06:26 I have yet to find a single blocker that would prevent us from doing the move, but I'm not confident we can do it in a short period of time
19:07:10 For this reason I brought this up yesterday in #opendev and basically said I think we should go with the skopeo workaround or revert to docker hub. In both cases we could then work forward to move services onto podman and either remove the skopeo workaround or migrate to quay.io at a later date
19:07:40 During that discussion the consensus seemed to be that we preferred reverting back to docker hub. Then we can move to podman, then we can move to quay.io, and everything should be happy with us, unlike the current state
19:08:06 that matches my takeaway
19:08:08 I wanted to bring this up more formally in the meeting before proceeding with that plan. Has anyone changed their mind or have new input etc?
19:09:30 if we proceed with that plan I think it will look very similar to the quay.io move. We want to move the base images first so that when we move the other images back they rebuild and see the current base images
19:09:53 One thing that makes this tricky is the sync of container images back to docker hub, since I don't think I can use a personal account for that like I did with quay.io. But that is solvable
19:10:24 I'll also draft up a document like I did for the move to quay.io so that we can have a fairly complete list of tasks and keep track of it all.
19:11:28 I'm not hearing any objections or new input.
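For reference, the "skopeo workaround" and the later sync of images back to docker hub both boil down to copying images between registries with skopeo; a minimal sketch, with placeholder org/image names (not the actual opendev images or credentials):

    # Sketch only: org/image names below are placeholders, not the real
    # opendev images. Log in to both registries, then copy the image.
    skopeo login quay.io
    skopeo login docker.io
    skopeo copy \
        docker://quay.io/example-org/example-image:latest \
        docker://docker.io/example-org/example-image:latest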
In that case I plan to start on this tomorrow (I'd like to be pretty heads-down on it, at least to start, to make sure I don't miss anything important)
19:11:38 clarkb: sounds good to me
19:11:46 thanks!
19:11:51 sounds good. Thank you clarkb for pulling all that together
19:11:58 clarkb: i think you can just manually use the opendev zuul creds on docker?
19:12:03 corvus: yup
19:12:24 should be able to docker login with them during the period of time I need them
19:12:30 also, it seems like zuul can continue moving forward with quay.io
19:12:45 since it's isolated from the opendev/openstack tenant
19:12:53 so it can be the advance party
19:13:08 agreed, and the podman moves there are isolated to CI jobs which are a lot easier to transition. They either work or they don't (and as of a few minutes ago I think they are working)
19:13:18 i'm in favor of zuul continuing with that, iff the plan is for opendev to eventually move.
19:13:27 i'd like us to be aligned long-term, one way or the other
19:13:56 I'd personally like to see opendev move to podman and quay.io. I think for long term viability the extra functionality is going to be useful
19:14:12 ++
19:14:15 same
19:14:16 both tools have extra features that enable things like per image access controls in quay.io, speculative gating etc
19:14:25 I'm just acknowledging it will take time
19:14:56 also, we should be able to move services to podman one at a time, then once everything is on podman, make the quay switch
19:15:48 ++
19:16:07 servers currently on Jammy would be a good place to start since Jammy has podman packages built in
19:16:15 currently I think that is gitea, gitea-lb and etherpad?
19:16:51 alright, I probably won't get to that today since I've got other stuff in flight already, but hope to focus on this tomorrow. I'll ping for reviews :)
19:17:25 #topic Bastion Host Changes
19:17:37 #link https://review.opendev.org/q/topic:bridge-backups
19:17:43 this topic still needs reviews if anyone has time
19:17:48 otherwise I am not aware of anything new
19:18:03 lists01 is also jammy
19:18:27 ah, yup
19:18:35 #topic Mailman 3
19:18:44 fungi: any new movement with mailman3 things, speaking of lists01?
19:18:56 i'm fiddling with it now, current held node is 173.231.255.71
19:19:24 anything we can do to help?
19:19:31 with the current stack the default mailman hostname is no longer one of the individual list hostnames, which gives us some additional flexibility
19:20:08 fungi: we probably want to test the transition from our current state to that state as well?
19:20:15 (I'm not sure if that has been looked at yet)
19:20:27 i'm still messing with the django commands to see how we might automate the additional django site to postorius/hyperkitty hostname mappings
19:20:44 but yes, that should be fairly testable manually
19:21:09 for a one-time transition it probably isn't all that useful to build automated testing
19:21:21 but ultimately it's just a few db entries changing
19:21:58 ya, not sure it needs to be automated. More just tested
19:22:02 agreed
19:22:06 similar to how we've checked the gerrit upgrade rollback procedures
19:22:21 Sounds good. Let us know if we can help
19:22:48 folks are welcome to override their /etc/hosts to point to the held node and poke around the webui, though this one needs data imported still
19:23:07 and there are still things i haven't set, so that's probably premature anyway
19:23:26 ok, I've scribbled a note to do that if I have time. Can't hurt anyway
19:23:39 exactly
19:23:42 thanks!
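For reference, the /etc/hosts override mentioned for poking at the held mailman node amounts to pointing a list hostname at that node; a minimal sketch, where the hostname is just an example and only the IP comes from the discussion above:

    # Sketch only: lists.opendev.org is used as an example hostname;
    # 173.231.255.71 is the held node mentioned above.
    # Remove the entry again after testing.
    echo "173.231.255.71 lists.opendev.org" | sudo tee -a /etc/hosts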
19:24:19 #topic Gerrit leaked replication task files
19:24:39 This is ongoing. No movement upstream in the issues I've filed. The number of files is growing at a steady but manageable rate
19:24:57 I'm honestly tempted to undo the bind mount for this directory and go back to potentially lossy replications though
19:25:21 I've been swamped with other stuff though, and since the rate hasn't grown in a scary way I've been happy to deprioritize looking into this further
19:25:55 Maybe late this week / early next week I can put a java developer hat on and see about fixing it though
19:26:19 #topic Upgrading Old Servers
19:26:40 similar to the last item, I haven't had time to look at this recently. I don't think anyone else has either, based on what I've seen in gerrit/irc/email
19:26:56 i have not, sorry
19:27:00 This should probably be a higher priority than playing java dev though, so maybe I start here when I dig out of my hole
19:27:37 if anyone else ends up with time please jump on one of these. It is a big help
19:27:42 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:27:49 #topic Fedora Cleanup
19:28:14 This topic was the old openafs utilization topic. But I pushed a change to remove unused fedora mirror content, which freed up about 200GB, and now we are below 90% utilization
19:28:25 yay
19:28:26 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IOYIYWGTZW3TM4TR2N47XY6X7EB2W2A6/ Proposed removal of fedora mirrors
19:28:39 No one has said "please keep fedora, we need it for X, Y or Z"
19:28:57 corvus: ^ I actually recently remembered that the nodepool openshift testing uses fedora, but I don't think it needs to
19:29:22 openshift itself is installed on centos and then nodepool is run on fedora. It could probably be a single node job, or the fedora node should be able to be replaced with anything else
19:29:42 either way I think the next step here is removing fedora mirror config from our jobs so that jobs on fedora talk directly to upstream
19:29:53 I started looking at pulling the setup of f36 from zuul, but that's a little bigger than expected so that we don't break existing users outside of here
19:30:10 then we can remove all fedora mirroring, and then we can work on clearing out fedora jobs and the nodes themselves
19:30:11 ianw has a (failing/wip) change to remove our ci mirrors from fedora builds in dib
19:30:13 tonyb: oh cool
19:30:21 #link https://review.opendev.org/883798 "fedora: don't use CI mirrors"
19:30:26 fungi: that change should pass as soon as the nodepool podman change merges
19:30:27 clarkb: i agree, it should be able to be replaced
19:30:54 tonyb: is the issue that we assume all nodes should have a local mirror configured? I wonder how we are handling that with rocky. Maybe we just ignored rocky in that role?
19:31:09 clarkb: there are fedora-specific tests in dib which are currently in need of removal for that change to pass, it looked like to me
19:31:18 yeah, rocky is ignored
19:32:20 my initial plan was to just pull fedora, but then I realised that if there were non-OpenDev users of zuul that did care about fedora mirroring they would get broken
19:32:38 fungi: ah
19:32:53 so now I'm working on adding a flag to the base zuul jobs to say ... skip fedora
19:33:00 oh, never mind. now the failures i'm looking at are about finding files on disk, so maybe. (or maybe i'm looking at different errors now)
19:33:10 tonyb: got it. That does pose a problem.
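For reference, the openafs mirror utilization mentioned under the Fedora Cleanup topic can be checked with the OpenAFS fs tooling; a minimal sketch, assuming the mirror volume lives at the path shown (the real path may differ):

    # Sketch only: the AFS path is an assumption about where the fedora
    # mirror content lives; adjust to the actual mirror location.
    # Prints the volume's quota, usage, and percent used.
    fs listquota /afs/openstack.org/mirror/fedora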
If you add flags it might be good to add flags for everything too, for consistency
19:33:21 but I'm trying to do that well and struggling with perfect being the enemy of good
19:33:32 tonyb: but getting something up so that the zuul community can weigh in is probably best before over engineering
19:33:36 ++
19:33:51 clarkb: yeah, doing it for everything was the plan
19:34:05 but yeah, the dib-functtests failure doesn't appear on the surface to be container-toolchain-related
19:34:23 kinda adding a distro-enable-mirror style flag
19:34:46 I'll post a WIP for a more concrete discussion
19:35:07 sounds good
19:35:16 #topic Quo Vadis Storyboard
19:35:22 Anything new here?
19:35:31 I don't have anything new. ETOOMANYTHINGS
19:36:04 nor i
19:36:15 well, the spamhaus blocklist thing
19:36:47 apparently spamhaus css has listed the /64 containing storyboard01's ipv6 address
19:36:48 oh ya, we should probably talk to openstack about that. tl;dr is cloud providers (particularly public cloud providers) should assign no less than a /64 to each instance
19:37:03 due to some other tenant in rackspace sending spam from a different address in that same network
19:37:24 they processed my removal request, but then immediately re-added it because of the problem machine that isn't ours
19:37:43 #link https://storyboard.openstack.org/#!/story/2010689 "Improve email deliverabilty for storyboard@storyboard.openstack.org, some emails are considered SPAM"
19:38:21 there was also a related ml thread on openstack-discuss, which one of my story comments links to a message in
19:38:24 i know you can tell exim not to use v6 for outgoing; it may be possible to tell it to prefer v4...
19:38:57 right, that's a variation of what i was thinking in my last comment on the story
19:39:22 and less hacky
19:39:29 though still definitely hacky
19:39:42 anyway, that's all i really had storyboard-related this week
19:39:54 thanks!
19:39:54 #topic Open Discussion
19:39:54 Anything else?
19:40:22 there were a slew of rackspace tickets earlier today about server outages... pulling up the details
19:41:19 zm06, nl04, paste01
19:41:50 looks like they were from right around midnight utc
19:41:57 er, no, 04:00 utc
19:42:36 anyway, if anyone spots problems with those three servers, that's likely related
19:43:12 only paste01 is a singleton that would really show up as a problem. Might be worth double checking the other two
19:43:45 all three servers have a ~16-hour uptime right now
19:43:59 so they at least came back up and are responding to ssh
19:44:40 host became unresponsive and was rebooted in all three cases (likely all on the same host)
19:45:00 zm06 did not restart its merger
19:45:05 nl04 did restart its launcher
19:45:18 I can restart zm06 after lunch
19:45:22 thank you for calling that out
19:45:54 we have "restart: on-failure"...
19:46:02 i'll go ahead and close out those tickets on the rackspace end
19:46:19 and services normally restart after boots
19:46:22 also they worked two tickets i opened yesterday about undeletable servers, most were in the iad region
19:46:25 so i wonder why zm06 didn't?
19:46:37 in total, that freed up 118 nodes worth of capacity
19:46:48 corvus: maybe a hard reboot under the host doesn't count as failure?
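For reference on the "restart: on-failure" question above, the restart policy docker actually recorded for a container can be checked directly; a minimal sketch, with a placeholder container name:

    # Sketch only: "zuul-merger" is a placeholder container name.
    docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' zuul-merger
    # Per the docker docs, "on-failure" only restarts a container after a
    # non-zero exit and generally does not bring it back when the daemon
    # itself restarts (e.g. after a host reboot), whereas "always" and
    # "unless-stopped" do.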
19:47:19 in the twisted land of systemd
19:47:35 May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.636033310Z" level=info msg="Removing stale sandbox 88813ad85aa8c751aa92b684e64e1ea7f2e9f2e9c8209ce79bfcf9fe18ee77e7 (05ce3156e52feef4e8e78ad8015aabba477f7d926635e8bf59534a8294d44559)"
19:47:35 May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.638772820Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint f128a3c81341d323dfbcbb367224049ec2aac64f10af50e702b3f546f8f09a6c 6e0ac20c216668af07c98111a59723329046fab81fde48b45369a1f7d088ffeb], retrying...."
19:47:40 i have no idea what either of those mean
19:48:07 other than that, i don't see any logs about docker's decision making there
19:49:12 probably fine to start things back up again then and see if it happens again in the future? I guess if we really want we can do a graceful reboot and see if it comes back up
19:49:50 Anything else? Last call. I can give you all 10 minutes back for breakfast/lunch/dinner :)
19:50:15 i'm set
19:50:58 Thank you everyone. I expect we'll be back here next week and the week after. But then we'll skip on June 13th due to the summit
19:51:03 #endmeeting