19:01:19 <clarkb> #startmeeting infra
19:01:19 <opendevmeet> Meeting started Tue Aug  3 19:01:19 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:19 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 <opendevmeet> The meeting name has been set to 'infra'
19:01:25 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000272.html Our Agenda
19:01:49 <ianw> o/
19:01:58 <fungi> ohai
19:02:36 <clarkb> Hello
19:02:41 <clarkb> #topic Announcements
19:02:45 <clarkb> I had no announcements
19:02:50 <clarkb> #topic Actions from last meeting
19:02:55 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-07-27-19.01.txt minutes from last meeting
19:02:59 <clarkb> #action someone write spec to replace Cacti with Prometheus
19:03:20 <clarkb> tristanC has some prometheus checks in the matrix gerritbot. I'm still hopeful I may have time this week to start a draft of this spec
19:03:47 <clarkb> considering that people want to add this to new software and some of our existing software has prometheus integration (gitea), getting this moving forward seems like a good idea
19:04:04 <corvus> Zuul has it too
19:04:38 <clarkb> ah neat
19:05:05 <corvus> (For health, not job data)
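(Context note: a minimal Prometheus scrape configuration for gitea's built-in /metrics endpoint might look like the sketch below; the job name, target host, port, and scheme are assumptions for illustration rather than the actual opendev deployment, which the spec would still need to define.)

    # prometheus.yml (sketch) -- scrape gitea's built-in metrics endpoint
    scrape_configs:
      - job_name: gitea
        metrics_path: /metrics
        scheme: https
        static_configs:
          - targets: ['gitea01.opendev.org:3000']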
19:06:05 <clarkb> #topic Topics
19:06:14 <fungi> the topics topic!
19:06:16 <clarkb> #topic Service Coordinator Election
19:06:25 <clarkb> fungi: I kept it because you enjoy it so much :)
19:06:33 * fungi blushes
19:06:59 <clarkb> As mentioned in the email sent out last week, nominations will run for another week. I'm really encouraging someone else to take this on now so I can not do it :)
19:07:23 <clarkb> frickler ianw fungi corvus ^ I know we're all busy but if you're interested please feel free to volunteer :)
19:08:13 <clarkb> #topic Review Upgrade
19:08:30 <clarkb> I expect that we are close to being able to drop this topic from the meeting agenda, but wanted to keep it for at least this week
19:08:48 <clarkb> We just merged a change to bump the review.opendev.org dns record ttl up to one hour from 5 minutes
19:09:19 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803372 Stack of gerrit fixes
19:09:45 <clarkb> I've also got this stack of changes which should pull in the mariadb and openid fixes that we landed upstream, as well as switch us over to using the mariadb connector and then remove the mysql connector from our images
19:10:12 <clarkb> They are split out this way because I suspect that we'll want to land the first two, do a gerrit restart, then after we have a happy gerrit for a bit we can land the third
19:10:41 <clarkb> ianw: The other major followup is the old server cleanups?
19:10:59 <clarkb> I think frickler ack'd the file preservation question. Do we want to set a date for cleaning those up?
19:11:15 <ianw> yep, i think i can probably just go ahead with that at this point
19:11:27 <fungi> agreed
19:11:46 <clarkb> cool and thank you for doing a ton of work to make this happen
19:13:34 <clarkb> #topic Gerrit User Cleanups
19:13:56 <clarkb> I put together another set of proposed user cleanups. There are about 103 remaining conflicts. I think I've got 73 proposed for cleanup. That will take us to 30 remaining
19:14:25 <clarkb> My goal here is that we'll do the 73, then the 30 will be manageable as a single commit pushed back to All-Users and we won't have to do the more hacky retire and delete external ids thing we have been doing
19:14:36 <clarkb> I plan to reach out to those ~30 when we get there as well
19:15:47 <clarkb> fungi has reviewed this list. If anyone has time to take a look that is appreciated. I basically put together an audit file with a bunch of user info and then separately indicate which of those we'll be retiring and cleaning up
19:15:58 <clarkb> But I plan to try and run the cleanup scripts against that list tomorrow
19:16:24 <clarkb> #topic Project Renames
19:16:39 <clarkb> I'll pull this topic off the agenda for next week but wanted to follow up on how the renames went last week
19:16:47 <clarkb> They went really well. Having the testing is great
19:16:49 <fungi> smoothly, i thought
19:16:58 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/802922 Need to manage ssh known_hosts for review02 users
19:17:00 <fungi> yeah, the testing was a huge help
19:17:06 <clarkb> I think ^ was the major thing that we worked around by hand
19:17:27 <fungi> and we predicted that ahead of time so it didn't disrupt the maintenance
19:17:37 <clarkb> if others can review that and indicate the change looks good I'll add the host var data to prod host vars and then I can approve that
19:17:43 <fungi> the only surprise was that it choked on storyboard-dev being unreachable
19:17:48 <clarkb> yup
19:17:50 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803266 Further testing improvements
19:17:59 <clarkb> is another related change which aims to make the testing we've added even more useful
19:18:04 <fungi> that will hopefully merge in the next hour or so
19:18:15 <ianw> ahh sorry yep i looked at that yesterday
19:18:28 <ianw> that doesn't add the host key for giteas?
19:18:39 <ianw> was that intentional?
19:18:50 <clarkb> ianw: yes. The idea is we'll do it via private vars.
19:19:11 <clarkb> ianw: the reason for this is the previous patchset tried to add the giteas and the localhost known_hosts but failed because that overrode the testing group vars
19:19:27 <clarkb> ianw: this meant the known hosts entry we had for the testing review was actually the prod review and host key verification failed
19:19:48 <clarkb> ianw: to work around this the plan is to just set those values in the private host vars which I'll do as soon as we think that change is mergeable.
19:20:05 <clarkb> The content I'll add is at https://review.opendev.org/c/opendev/system-config/+/802922/6/inventory/service/host_vars/review02.opendev.org.yaml
19:21:15 <clarkb> ianw: if you think a different approach makes sense feel free to leave that in review. There are a few options there but fungi and corvus  felt that keeping the pubkey and private key stuff close together in private vars made sense
19:21:41 <ianw> ok, i feel like we could grab the keys from gitea servers directly but i'll rethink on it
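(Context note: the private host vars approach amounts to pinning the known_hosts content per host; a rough sketch follows with a hypothetical variable name and truncated placeholder keys -- the real names and values are in change 802922 and the private hostvars.)

    # inventory/service/host_vars/review02.opendev.org.yaml (sketch only;
    # the variable name and key material below are placeholders)
    gerrit_known_hosts: |
      [localhost]:29418 ssh-rsa AAAA...placeholder...
      gitea01.opendev.org ssh-ed25519 AAAA...placeholder...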
19:22:33 <clarkb> #topic Matrix eavesdrop and gerritbot bots
19:23:03 <clarkb> corvus and tristanC have been working to get these bots deployed. Testing has shown a couple of issues. Specifically creating log dirs properly and not using fedora as the base image so that ssh works for gerritbot
19:23:23 <clarkb> I think fixes for both of those items are on their way in, which means I expect we can interact with those bots in the opendev test room
19:23:39 <clarkb> corvus: ^ is there anything else to be aware of as the matrix bots get deployed?
19:24:11 <corvus> Nope just waiting to check back
19:24:32 <clarkb> #topic gitea01 backups
19:25:03 <clarkb> gitea01 backups are still sad. This is more important now that we have done a project rename. We do have backups to the other host working. If we have to restore the gitea database from backup we need to be careful to use the up to date backup
19:25:17 <clarkb> backups are listed by date so this should be pretty apparent but calling it out here so that others see it
19:25:53 <clarkb> One option available to us is to drop the AAAA record for the backup server in vexxhost. Now that we actually want backups to update I think we should consider this but I know others feel more strongly about ipv6 than I do (I don't have local ipv6 yet :( )
19:26:19 <ianw> note we do have daily copies going to rax, so we're not too bad
19:26:26 <clarkb> yup
19:27:13 <ianw> i dunno, perhaps email is the next step.  i haven't heard any more on the underlying issue
19:27:25 <clarkb> that seems reasonable to me
19:27:42 <ianw> i'm open to ideas but i feel like i've more or less exhausted my options for fiddling things client side
19:28:56 <clarkb> yup I think we need to rely on the cloud to correct it. The only thing we can do now is workaround it by dropping the aaaa record
19:29:00 <clarkb> I'm happy if we start with email
19:29:08 <ianw> i can do that
19:29:11 <clarkb> thanks
19:29:21 <clarkb> #topic Gitea 1.15.0 upgrade prep
19:29:37 <clarkb> When I did the gitea 1.14 upgrade recently I noticed there was a release candidate for gitea 1.15.
19:30:48 <clarkb> This turned into a thing yesterday where 1.15.0 didn't work for a number of reasons
19:31:20 <clarkb> The first is that nodejs 16 and gitea are no longer compatible despite their release notes saying this is the default nodejs to build with. We have since (just a few minutes ago) landed a change to use nodejs 14 to work around that, as 14 is happy with old and new gitea
19:31:48 <clarkb> Next we've discovered that some of our old web ui based project management for things like project settings updates and renames doesn't necessarily work with new gitea (specifically the renames)
19:32:07 <clarkb> I took this as a good indication we should rewrite to use the proper rest api as the api does support these actions now
19:32:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803366/ Update project renaming and its child
19:32:24 <clarkb> Again that should be safe with older and newer gitea
19:32:31 <fungi> i expect we didn't use the api for those in the past because there weren't methods/objects to accomplish that?
19:32:38 <ianw> oh nice, and gate tested with the new playbooks?
19:32:41 <clarkb> fungi: yup that is what corvus thought
19:32:43 <clarkb> ianw: yup
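(Context note: Gitea's REST API can rename a repository by PATCHing the repo object at /api/v1/repos/{owner}/{repo}; a hedged sketch of driving that from Ansible's uri module follows -- the URL, credential variable, and project names are placeholders, and the actual implementation is in change 803366.)

    # Sketch: rename a repo within the same org via the Gitea REST API.
    # The names, URL, and password variable here are hypothetical placeholders.
    - name: Rename repository via Gitea REST API
      uri:
        url: "https://localhost:3000/api/v1/repos/example-org/old-name"
        method: PATCH
        user: root
        password: "{{ gitea_root_password }}"
        force_basic_auth: true
        validate_certs: false
        body_format: json
        body:
          name: new-name
        status_code: 200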
19:33:29 <clarkb> then at the end of the stack I've placed a WIP change for gitea v1.15.0-rc2 that does a bunch of v1.15.0 specific things around templates and file paths and some config
19:33:49 <clarkb> I don't want to land the last change until we can do so with a real release. But we should be in good shape to deploy that release when it is ready
19:34:36 <clarkb> I'm happy I caught this early because there are a number of things that all needed updating to make this work right. Also when we get a real release we can hold nodes and double check it all looks good (I did that with the rc recently and it was mostly ok other than the rename not working)
19:35:48 <clarkb> #topic Mailman Ansible and Server Upgrades
19:36:00 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/803263 fix newlist command
19:36:23 <clarkb> corvus: ianw ^ if one of you can review that it would be great. It seems small on the surface but I think it gets deep into how ansible handles commands and how newlist reads prompts
19:36:25 <fungi> that was a fun subtle "we're not testing *exactly* like production" bug
19:36:32 <clarkb> otherwise I'd go ahead and approve it
19:36:51 <clarkb> fungi did test redirecting a file with only a newline in it to newlist on a held test node and that worked
19:37:03 <clarkb> so we expect it will work but we got this wrong once already so eyeballs are a good thing :)
19:37:23 <fungi> i guess the way we could have caught that was if we'd not used --quiet in the newlist test and instead blackholed delivery in exim
19:37:58 <clarkb> ya I suppose that is still an option but someone that groks exim better than me may want to take a shot at it
19:38:20 <clarkb> I've also got a todo I need to get back to which is "upgrade the list serve servers"
19:39:07 <clarkb> It occurred to me that the lists.katacontainers.io server is a good guinea pig. In part because I expect it to be easier to snapshot and therefore easier to iterate on for testing. But also because if it goes really sideways we can emergency migrate those lists onto lists.openstack.org (long planned anyway) and then figure out why it was so broken
19:39:29 <clarkb> fungi: ^ I'll probably bug you for a second set of eyeballs as I try to bootstrap that testing
19:39:46 <ianw> hrm, i wonder if running it under nohup might work too?
19:39:48 <fungi> oh, on a related note, a user approached me privately today to find out why they'd stopped receiving posts from one of the lists, and in researching i discovered it was because the ipv6 address for lists.o.o ended up on the spamhaus xbl. as of ~16:30 deliveries were working to the problem destinations again
19:40:51 <clarkb> ianw: ya that might work too. We went with the ansible version to avoid needing to use the shell module, but nohup should do that as well
19:41:10 <clarkb> if we feel strongly about it I expect fungi can test newlist on the test server again using nohup
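(Context note: the command module's stdin parameter is one way to answer newlist's "Hit enter to notify ..." prompt without shell redirection; a sketch with a hypothetical list name, owner, and password follows. The nohup idea would instead wrap the call via the shell module.)

    # Sketch: create a list non-interactively. The command module appends a
    # trailing newline to stdin by default, which satisfies newlist's prompt.
    # The list name, owner address, and password are placeholders.
    - name: Create mailing list
      command: newlist example-list listowner@example.org examplepassword
      args:
        stdin: ""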
19:41:40 <fungi> rough timeframe on the xbl listing is it likely got added around july 21
19:42:01 <clarkb> fungi: fun. I guess I solve that problem by not having aaaa records on my mailservers
19:43:44 <clarkb> #topic Open Discussion
19:43:51 <clarkb> That was what I had on the agenda.
19:44:01 <clarkb> Apologies for the ton of various docker image and gitea and gerrit and so on changes
19:44:22 <ianw> i'm hopeful i can remove debian-stretch soon
19:44:25 <clarkb> I just started looking at this as I did gitea things and the thread was very long. I think it does get us to a better spot overall once we get through them
19:44:43 <clarkb> ianw: jayf noticed the fedora 32 mirror removal. dib functests stopped working
19:44:51 <corvus> we can configure exim to not use ipv6 for outgoing if we want
19:44:53 <ianw> yeah, mea culpa for not updating that
19:44:59 <ianw> #link https://review.opendev.org/c/zuul/zuul-jobs/+/802981
19:45:01 <clarkb> ianw: https://review.opendev.org/c/openstack/diskimage-builder/+/799341 that apparently fixes it, which I'll review after the meeting
19:45:12 <fungi> i have working ipv6 on my mta and spamassassin was logging that the RCVD_IN_XBL test was matching on messages from the lists, but it wasn't subtracting enough from the score to send them to my spam inbox
19:45:23 <fungi> so i didn't spot it
19:45:24 <ianw> removes centos-8-stream from using centos-8 wheels
19:45:46 <ianw> i'm working on getting centos-8-wheels built; it basically works
19:45:49 <ianw> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/802988
19:45:54 <ianw> i just need to get the publish jobs working
19:46:02 <ianw> i have added some volumes
19:46:47 <clarkb> ianw: will that change fix it? I think we still want to use the pypi mirrors but not the wheel
19:46:48 <ianw> fungi: if you could confirm on https://review.opendev.org/c/opendev/base-jobs/+/802639 removal is the plan that would be great
19:47:23 <clarkb> ianw: that role seems to only do the distro mirrors though
19:47:38 <ianw> clarkb: my intention was to turn off just the wheel bits but i may have missed the mark
19:48:14 <clarkb> ianw: ya the tasks/mirror.yaml top level tasks list does the pypi mirror config
19:48:34 <fungi> can do
19:48:53 <clarkb> ianw: ianw  left a note
19:48:58 <ianw> ok will cycle back
19:49:05 <clarkb> tab completion fail
19:49:20 <ianw> oh and clarkb you mentioned backup pruning too?
19:49:52 <clarkb> ianw: oh ya I mentioned it yesterday since we're getting the emails warning us of disk space
19:49:56 <clarkb> ianw: did that get run?
19:50:08 <clarkb> seems we still got the warning email overnight
19:50:10 <ianw> no i didn't yet.  how about i clean up review-test etc. and then i'll do that
19:50:17 <clarkb> sounds good thanks
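(Context note, assuming the borg-based backups: pruning generally means running borg prune against each backup repository with a retention policy; the sketch below uses made-up retention numbers and a made-up repository path, not the actual opendev settings or prune scripts.)

    # Sketch: prune old archives for one backed-up host's repository.
    # Retention counts and the repo path are illustrative assumptions only.
    - name: Prune old borg archives
      command: >
        borg prune
        --keep-daily 7
        --keep-weekly 4
        --keep-monthly 6
        /opt/backups/borg-gitea01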
19:51:51 <clarkb> sounds like that may be it. As always feel free to bring up discussion on IRC or on the mailing list
19:51:57 <clarkb> we aren't limited to this hour block
19:52:00 <clarkb> Thanks everyone!
19:52:03 <clarkb> #endmeeting