19:01:19 #startmeeting infra
19:01:19 Meeting started Tue Aug 3 19:01:19 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 The meeting name has been set to 'infra'
19:01:25 #link http://lists.opendev.org/pipermail/service-discuss/2021-August/000272.html Our Agenda
19:01:49 o/
19:01:58 ohai
19:02:36 Hello
19:02:41 #topic Announcements
19:02:45 I had no announcements
19:02:50 #topic Actions from last meeting
19:02:55 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-07-27-19.01.txt minutes from last meeting
19:02:59 #action someone write spec to replace Cacti with Prometheus
19:03:20 tristanC has some prometheus checks in the matrix gerritbot. I'm still hopeful I may have time this week to start a draft of this spec
19:03:47 considering that people want to add this to new software and some of our existing software has prometheus integration (gitea), getting this moving forward seems like a good idea
19:04:04 Zuul has it too
19:04:38 ah neat
19:05:05 (For health, not job data)
19:06:05 #topic Topics
19:06:14 the topics topic!
19:06:16 #topic Service Coordinator Election
19:06:25 fungi: I kept it because you enjoy it so much :)
19:06:33 * fungi blushes
19:06:59 As mentioned in the email sent out last week, nominations will run for another week. I'm really encouraging someone else to take this on now so I can not do it :)
19:07:23 frickler ianw fungi corvus ^ I know we're all busy but if you're interested please feel free to volunteer :)
19:08:13 #topic Review Upgrade
19:08:30 I expect that we are close to being able to drop this topic from the meeting agenda, but wanted to keep it for at least this week
19:08:48 We just merged a change to bump the review.opendev.org dns record ttl up to one hour from 5 minutes
19:09:19 #link https://review.opendev.org/c/opendev/system-config/+/803372 Stack of gerrit fixes
19:09:45 I've also got this stack of changes which should pull in the mariadb and openid fixes that we landed upstream, as well as switch us over to using the mariadb connector and then remove the mysql connector from our images
19:10:12 They are split out this way because I suspect we'll want to land the first two, do a gerrit restart, then after we have a happy gerrit for a bit we can land the third
19:10:41 ianw: The other major followup is the old server cleanups?
19:10:59 I think frickler ack'd the file preservation question. Do we want to set a date for cleaning those up?
19:11:15 yep, i think i can probably just go ahead with that at this point
19:11:27 agreed
19:11:46 cool, and thank you for doing a ton of work to make this happen
19:13:34 #topic Gerrit User Cleanups
19:13:56 I put together another set of proposed user cleanups. There are about 103 remaining conflicts. I think I've got 73 proposed for cleanup. That will take us to 30 remaining
19:14:25 My goal here is that we'll do the 73, then the 30 will be manageable as a single commit pushed back to All-Users and we won't have to do the more hacky retire and delete external ids thing we have been doing
19:14:36 I plan to reach out to those ~30 when we get there as well
19:15:47 fungi has reviewed this list. If anyone has time to take a look that is appreciated. I basically put together an audit file with a bunch of user info and then separately indicate which of those we'll be retiring and cleaning up
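
For context on what the audit covers: the external ids behind each conflicting account can be listed through Gerrit's REST API. This is only a rough sketch of that kind of query; the account id, credentials, and required permissions are illustrative placeholders rather than details from the meeting, and it is not the actual cleanup script.

    # Rough sketch: list the external ids attached to one Gerrit account so
    # duplicate/conflicting identities can be audited. The host is real
    # (review.opendev.org); the account id and credentials are placeholders.
    import json
    import requests

    GERRIT = "https://review.opendev.org"
    ACCOUNT_ID = "1000001"  # hypothetical account under review

    resp = requests.get(
        f"{GERRIT}/a/accounts/{ACCOUNT_ID}/external.ids",
        auth=("admin-user", "http-password"),  # needs rights to view external ids
    )
    resp.raise_for_status()

    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; drop that line.
    external_ids = json.loads(resp.text.split("\n", 1)[1])
    for eid in external_ids:
        print(eid["identity"], eid.get("email_address", ""))
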
19:15:58 But I plan to try and run the cleanup scripts against that list tomorrow
19:16:24 #topic Project Renames
19:16:39 I'll pull this topic off the agenda for next week but wanted to follow up on how the renames went last week
19:16:47 They went really well. Having the testing is great
19:16:49 smoothly, i thought
19:16:58 #link https://review.opendev.org/c/opendev/system-config/+/802922 Need to manage ssh known_hosts for review02 users
19:17:00 yeah, the testing was a huge help
19:17:06 I think ^ was the major thing that we worked around by hand
19:17:27 and we predicted that ahead of time so it didn't disrupt the maintenance
19:17:37 if others can review that and indicate the change looks good, I'll add the host var data to prod host vars and then I can approve it
19:17:43 the only surprise was that it choked on storyboard-dev being unreachable
19:17:48 yup
19:17:50 #link https://review.opendev.org/c/opendev/system-config/+/803266 Further testing improvements
19:17:59 is another related change which aims to make the testing we've added even more useful
19:18:04 that will hopefully merge in the next hour or so
19:18:15 ahh sorry yep i looked at that yesterday
19:18:28 that doesn't add the host key for giteas?
19:18:39 was that intentional?
19:18:50 ianw: yes. The idea is we'll do it via private vars.
19:19:11 ianw: the reason for this is the previous patchset tried to add the giteas and the localhost known_hosts but failed because that overrode the testing group vars
19:19:27 ianw: this meant the known_hosts entry we had for the testing review was actually the prod review and host key verification failed
19:19:48 ianw: to work around this the plan is to just set those values in the private host vars, which I'll do as soon as we think that change is mergeable.
19:20:05 The content I'll add is at https://review.opendev.org/c/opendev/system-config/+/802922/6/inventory/service/host_vars/review02.opendev.org.yaml
19:21:15 ianw: if you think a different approach makes sense feel free to leave that in review. There are a few options there but fungi and corvus felt that keeping the pubkey and private key stuff close together in private vars made sense
19:21:41 ok, i feel like we could grab the keys from gitea servers directly but i'll think on it
19:22:33 #topic Matrix eavesdrop and gerritbot bots
19:23:03 corvus and tristanC have been working to get these bots deployed. Testing has shown a couple of issues, specifically creating log dirs properly and not using fedora as the base image so that ssh works for gerritbot
19:23:23 I think fixes for both of those items are on their way in, which means I expect we can interact with those bots in the opendev test room
19:23:39 corvus: ^ is there anything else to be aware of as the matrix bots get deployed?
19:24:11 Nope just waiting to check back
19:24:32 #topic gitea01 backups
19:25:03 gitea01 backups are still sad. This is more important now that we have done a project rename. We do have backups to the other host working. If we have to restore the gitea database from backup we need to be careful to use the up to date backup
19:25:17 backups are listed by date so this should be pretty apparent, but calling it out here so that others see it
19:25:53 One option available to us is to drop the AAAA record for the backup server in vexxhost. Now that we actually want backups to update I think we should consider this, but I know others feel more strongly about ipv6 than I do (I don't have local ipv6 yet :( )
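
A note on why dropping the AAAA record is an effective workaround: the backup client connects to whatever addresses DNS returns and generally prefers IPv6 when an AAAA record exists, so publishing only an A record forces it onto IPv4. A tiny illustration of that lookup, using a placeholder hostname rather than the real backup server name:

    # Show which addresses (and address families) a client would try for the
    # backup server. With an AAAA record published, IPv6 appears here and is
    # usually preferred; without it only IPv4 remains.
    import socket

    for family, _, _, _, sockaddr in socket.getaddrinfo(
            "backup.example.org", 22, type=socket.SOCK_STREAM):
        print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
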
19:26:19 note we do have daily copies going to rax, so we're not too bad
19:26:26 yup
19:27:13 i dunno, perhaps email is the next step. i haven't heard any more on the underlying issue
19:27:25 that seems reasonable to me
19:27:42 i'm open to ideas but i feel like i've more or less exhausted my options for fiddling things client side
19:28:56 yup, I think we need to rely on the cloud to correct it. The only thing we can do now is work around it by dropping the aaaa record
19:29:00 I'm happy if we start with email
19:29:08 i can do that
19:29:11 thanks
19:29:21 #topic Gitea 1.15.0 upgrade prep
19:29:37 When I did the gitea 1.14 upgrade recently I noticed there was a release candidate for gitea 1.15.
19:30:48 This turned into a thing yesterday where 1.15.0 didn't work for a number of reasons
19:31:20 The first is nodejs 16 and gitea are no longer compatible despite their release notes saying this is the default nodejs to build with. We have since (just a few minutes ago) landed a change to use nodejs 14 to work around that, as 14 is happy with old and new gitea
19:31:48 Next we discovered that some of our old web ui based project management for things like project settings updates and renames doesn't necessarily work with new gitea (specifically the renames)
19:32:07 I took this as a good indication we should rewrite to use the proper rest api, as the api does support these actions now
19:32:15 #link https://review.opendev.org/c/opendev/system-config/+/803366/ Update project renaming and its child
19:32:24 Again that should be safe with older and newer gitea
19:32:31 i expect we didn't use the api for those in the past because there weren't methods/objects to accomplish that?
19:32:38 oh nice, and gate tested with the new playbooks?
19:32:41 fungi: yup that is what corvus thought
19:32:43 ianw: yup
19:33:29 then at the end of the stack I've placed a WIP change for gitea v1.15.0-rc2 that does a bunch of v1.15.0 specific things around templates and file paths and some config
19:33:49 I don't want to land the last change until we can do so with a real release. But we should be in good shape to deploy that release when it is ready
19:34:36 I'm happy I caught this early because there are a number of things that all needed updating to make this work right. Also when we get a real release we can hold nodes and double check it all looks good (I did that with the rc recently and it was mostly ok other than the rename not working)
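
For reference, the rest api call behind a rename is Gitea's repository edit endpoint. A minimal sketch of driving it follows; the host, token, and project names are placeholders, and this is not the actual system-config playbook code.

    # Minimal sketch of renaming a Gitea project via the REST API rather than
    # the web ui. All names and the token are placeholders.
    import requests

    GITEA = "https://gitea01.example.org:3000"
    HEADERS = {"Authorization": "token replace-with-admin-api-token"}

    # PATCH /api/v1/repos/{owner}/{repo} with a new "name" renames the repo.
    resp = requests.patch(
        f"{GITEA}/api/v1/repos/example-org/old-project",
        json={"name": "new-project"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    print(resp.json()["full_name"])

Moving a project between organizations uses the separate repo transfer endpoint, so a full opendev-style rename likely involves more than this single call.
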
19:35:48 #topic Mailman Ansible and Server Upgrades
19:36:00 #link https://review.opendev.org/c/opendev/system-config/+/803263 fix newlist command
19:36:23 corvus: ianw ^ if one of you can review that it would be great. It seems small on the surface but I think it gets deep into how ansible handles commands and how newlist reads prompts
19:36:25 that was a fun subtle "we're not testing *exactly* like production" bug
19:36:32 otherwise I'd go ahead and approve it
19:36:51 fungi did test redirecting a file with only a newline in it to newlist on a held test node and that worked
19:37:03 so we expect it will work, but we got this wrong once already so eyeballs are a good thing :)
19:37:23 i guess the way we could have caught that was if we'd not used --quiet in the newlist test and instead blackholed delivery in exim
19:37:58 ya, I suppose that is still an option but someone that groks exim better than me may want to take a shot at it
19:38:20 I've also got a todo I need to get back to, which is "upgrade the mailing list servers"
19:39:07 It occurred to me that the lists.katacontainers.io server is a good guinea pig. In part because I expect it to be easier to snapshot and therefore easier to iterate on for testing. But also because if it goes really sideways we can emergency migrate those lists onto lists.openstack.org (long planned anyway) and then figure out why it was so broken
19:39:29 fungi: ^ I'll probably bug you for a second set of eyeballs as I try to bootstrap that testing
19:39:46 hrm, i wonder if running it under nohup might work too?
19:39:48 oh, on a related note, a user approached me privately today to find out why they'd stopped receiving posts from one of the lists, and in researching i discovered it was because the ipv6 address for lists.o.o ended up on the spamhaus xbl. as of ~16:30 deliveries were working to the problem destinations again
19:40:51 ianw: ya that might work too. We went with the ansible version to avoid needing to use the shell module, but nohup should do that as well
19:41:10 if we feel strongly about it I expect fungi can test newlist on the test server again using nohup
19:41:40 rough timeframe on the xbl listing is it likely got added around july 21
19:42:01 fungi: fun. I guess I solve that problem by not having aaaa records on my mailservers
19:43:44 #topic Open Discussion
19:43:51 That was what I had on the agenda.
19:44:01 Apologies for the ton of various docker image and gitea and gerrit and so on changes
19:44:22 i'm hopeful i can remove debian-stretch soon
19:44:25 I just started looking at this as I did gitea things and the thread was very long. I think it does get us to a better spot overall once we get through them
19:44:43 ianw: jayf noticed the fedora 32 mirror removal. dib functests stopped working
19:44:51 we can configure exim to not use ipv6 for outgoing if we want
19:44:53 yeah, mea culpa for not updating that
19:44:59 #link https://review.opendev.org/c/zuul/zuul-jobs/+/802981
19:45:01 ianw: https://review.opendev.org/c/openstack/diskimage-builder/+/799341 apparently fixes it, which I'll review after the meeting
19:45:12 i have working ipv6 on my mta and spamassassin was logging that the RCVD_IN_XBL test was matching on messages from the lists, but it wasn't subtracting enough from the score to send them to my spam inbox
19:45:23 so i didn't spot it
19:45:24 removes centos-8-stream from using centos-8 wheels
19:45:46 i'm working on getting centos-8-wheels built; it basically works
19:45:49 #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/802988
19:45:54 i just need to get the publish jobs working
19:46:02 i have added some volumes
19:46:47 ianw: will that change fix it? I think we still want to use the pypi mirrors but not the wheel mirror
19:46:48 fungi: if you could confirm on https://review.opendev.org/c/opendev/base-jobs/+/802639 that removal is the plan, that would be great
19:47:23 ianw: that role seems to only do the distro mirrors though
19:47:38 clarkb: my intention was to turn off just the wheel bits but i may have missed the mark
19:48:14 ianw: ya, the tasks/mirror.yaml top level task list does the pypi mirror config
19:48:34 can do
19:48:53 ianw: ianw left a note
19:48:58 ok will cycle back
19:49:05 tab completion fail
19:49:20 oh and clarkb you mentioned backup pruning too?
19:49:52 ianw: oh ya, I mentioned it yesterday since we're getting the emails warning us of disk space
19:49:56 ianw: did that get run?
19:50:08 seems we still got the warning email overnight
19:50:10 no i didn't yet. how about i clean up review-test etc. and then i'll do that
19:50:17 sounds good, thanks
19:51:51 sounds like that may be it. As always feel free to bring up discussion on IRC or on the mailing list
19:51:57 we aren't limited to this hour block
19:52:00 Thanks everyone!
19:52:03 #endmeeting