19:01:06 #startmeeting infra
19:01:06 Meeting started Tue Jun 29 19:01:06 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 The meeting name has been set to 'infra'
19:01:17 #link http://lists.opendev.org/pipermail/service-discuss/2021-June/000262.html Our Agenda
19:01:22 #topic Announcements
19:01:54 No real announcements other than that my life is returning to its normally scheduled day-to-day, so I'll be around at typical times now
19:02:06 The one exception to that is that Monday is apparently an observed holiday here
19:02:31 yes, the one where citizens endeavor to celebrate the independence of their nation by blowing up a small piece of it
19:02:51 o/
19:02:51 always a fun occasion
19:02:53 fungi: yup, but also this year I think we are declaring the pandemic is over here and we should remove all precautions
19:03:07 blowing up in more ways than one, in that case
19:03:35 But I'll be around Tuesday and we'll have a meeting as usual
19:03:41 #topic Actions from last meeting
19:03:47 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-06-15-19.01.txt minutes from last meeting
19:04:30 I have not brought up the ELK situation with openstack leadership yet. diablo_rojo, fyi I intend to do that when I find time in the near future. Mostly just to plan out what we are doing next as far as wind-down goes
19:04:40 #action clarkb Follow up with OpenStack on ELK retirement
19:04:50 ianw: have ppc packages been cleaned up from centos mirrors?
19:04:58 though they're presenting a call to action for it to the board of directors tomorrow
19:05:07 makes sense to me.
19:05:16 (the elastic recheck support request, i mean)
19:05:30 fungi: yup, I don't think we have to say "it's turning off tomorrow"; it's more of a "we are doing these things, you are doing those things, when is a reasonable time to say it's dead or not"
19:05:44 and start to create the longer-term expectations
19:05:46 clarkb: yep, that was done https://review.opendev.org/c/opendev/system-config/+/797365
19:06:04 ianw: excellent, thanks!
19:06:34 and I don't think a spec for a Prometheus replacement for Cacti has been written yet either. I'm mostly keeping this on the list because i think it is a good idea and keeping it visible can only help make it happen :)
19:06:44 #action someone write spec to replace Cacti with Prometheus
19:07:02 also, while it didn't get flagged as an action item, it effectively was one:
19:07:05 #link https://review.opendev.org/797990 Stop updating Gerrit RDBMS for repo renames
19:07:33 now i can stop forgetting to remember to do that
19:07:42 fungi: great, I'll have to give that a review (I've been on a review push myself the last few days trying to catch up on all the awesome work everyone has been doing)
19:08:15 #topic Topics
19:08:22 #topic Eavesdrop and Limnoria
19:08:47 We discovered there was a bug in the channel log conversion from raw text logs to html that may have explained the lag people noticed in those files
19:09:01 basically we ran the conversion once an hour instead of every 15 minutes. Fungi wrote a fix for that.
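[The lag above turned out to be the HTML conversion running hourly instead of every 15 minutes. As a rough way to spot that kind of publication lag again, here is a minimal Python sketch that compares the newest timestamp in a raw channel log with the modification time of its rendered HTML copy; both file paths and the timestamp format are illustrative assumptions, not the actual eavesdrop layout.]

    # Hypothetical lag check; the paths and timestamp format below are
    # assumptions for illustration, not the real eavesdrop file layout.
    import datetime
    import pathlib
    import re

    raw_log = pathlib.Path("/var/lib/limnoria/logs/#opendev.log")   # assumed location
    html_log = pathlib.Path("/srv/irclogs/%23opendev.log.html")     # assumed location

    last_ts = None
    for line in raw_log.read_text(errors="replace").splitlines():
        match = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})", line)
        if match:
            last_ts = datetime.datetime.fromisoformat(match.group(1))

    rendered_at = datetime.datetime.fromtimestamp(html_log.stat().st_mtime)
    if last_ts is not None:
        print("HTML copy lags the raw log by:", rendered_at - last_ts)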
19:09:08 and it merged
19:09:20 so should be back to behaving normally now
19:09:29 Would be good to keep an eye out for any new reports of lag in those logs, but I think we can call it fixed now based on what we saw timestamp-wise yesterday
19:09:32 ++
19:09:57 sorry about that, missed a * from the old job :/
19:10:00 that was the new lag, by the way; the old lag before that was related to flushing files
19:10:12 so we actually had two lag sources playing off one another
19:10:44 ah cool, I wasn't sure if we saw lag in the text files previously or only in the html
19:10:55 text files were happy yesterday when we looked, at least, and then we fixed the html side
19:11:33 #topic Gerrit Account Cleanup
19:11:51 I'm hoping to find time for this among everything else and deactivate those accounts whose external ids we'll delete later
19:12:07 fungi: you started to look at that more closely; have you had a chance to do a sufficient sampling to be comfortable with the list?
19:13:02 yes, my spot-checking didn't turn up any concerns
19:13:36 great, I'll try to pencil this in for the end of the week and do the account retirement/deactivation; then in a few weeks we can do the external id deletions for all those that don't complain (and none should)
19:14:04 #topic Review Upgrade
19:14:15 #link https://etherpad.opendev.org/p/gerrit-upgrade-2021 Upgrade Checklist
19:14:26 The agenda says this document is ready for review. infra-root please take a look at it
19:14:42 ianw: does the ipv6 problem that recently happened put a pause on this while we sort that out?
19:15:46 i'm not sure, i rebooted the host and it came back
19:16:09 i know there was some known issue at some point that required a reboot previously; it might have been broken then
19:16:24 ianw: the issue that happened on the cloud side?
19:16:58 Considering that we do have the option of removing the AAAA record from DNS temporarily if necessary, I suspect this isn't critical. But others may feel more strongly about ipv6
19:17:19 there was a host outage and reboot/migration forced at one point, but i don't recall how long ago
19:17:40 and probably didn't track it closely since the server was not yet in production
19:17:41 right, that feels like the sort of thing that duplicate addresses might pop up in
19:17:46 it happened the weekend after I did all those focal reboots
19:17:58 I remember because I delayed review02's reboot and then vexxhost took care of it for me :)
19:18:12 ahh, right, mnaser let us know about it; we could find a more precise time in the irc logs
19:18:18 and ya that seems like a possibility if there was a migration with two instances out there fighting over arp
19:18:38 (or even just not properly flushing the router's tables first)
19:19:08 well, DAD (duplicate address detection) operates on the server seeing evidence of a conflict
19:19:29 so presumably there really were two systems trying to use the same v6 address at the same moment
19:19:32 got it
19:19:40 anyway, if we can work on that checklist, i'm happy to maybe do this on a .au monday morning. that's usually a very quiet time
19:20:07 i'm not sure if we could be ready for the 5th, but that would be even quieter
19:20:08 will do, thanks!
19:20:18 ianw: yup, I'll need to add that to my list of reviews for today. And I can do .au morning as well usually, since that overlaps with my afternoon/evening without too much pain
19:20:37 ianw: I think your suggested date of the 19th is probably reasonable
19:20:57 that way we can announce it with a couple of weeks of notice too (so that firewall rules can be updated in various places if necessary)
19:21:12 maybe plan to send that out in a couple of days, after we have a chance to double-check your checklist
19:21:17 the 12th maybe too, although i'll be out a day or two before that (still deciding on plans wrt. lockdowns, etc.)
19:22:31 I like giving a bit of notice for this, and the 19th feels like a good balance between too little and too much
19:22:41 infra-root ^ feel free to weigh in though
19:23:07 in the past we've announced the new ip addresses somewhat in advance
19:23:33 yes, in the past we've tried to do ~4 weeks iirc
19:23:37 since a number of companies maintain firewall exceptions allowing their employees or ci systems to connect
19:23:48 but we also had more companies with strict firewall rules than we have today (or at least they don't complain as much anymore)
19:23:56 ok, i can construct a notification for that soon then, as i don't see any reason we'll change the ip, and reverse dns is set up too
19:24:04 right, i do think 4 weeks is probably excessive today
19:24:26 but if we can give them a heads-up, sooner would be better than later
19:24:30 ++
19:24:58 we could even advertise the new IPs with a "no sooner than X" date
19:25:07 then they can add firewall rules and keep the old one in place until we do the switch
19:25:17 but the 19th seems like a good option to me.
19:25:48 should cross-check against release schedules for various projects, but I think that is a relatively quiet time
19:25:56 Anything else on the review upgrade topic?
19:26:26 not really, i just want to get the checklist as detailed as possible
19:26:26 i got nothin'
19:26:45 thanks for organizing this, ianw!
19:26:46 #topic Listserv upgrades
19:26:49 ++ thanks!
19:27:04 I've somewhat stalled out on this and worry I've got a number of other tasks that are just as important or more so fighting for time
19:27:29 If anyone else wants to boot the test node and run through an upgrade on it, I've already started notes on an etherpad somewhere that I should dig up again. But if not I'll keep this on my list and try to get to it when I can
19:28:00 Mostly this is a heads-up that I'm probably not getting to it this week. Hopefully next
19:28:26 #topic Draft matrix spec
19:28:36 #link https://review.opendev.org/796156 Draft matrix spec
19:28:51 I reached out to EMS (Element Matrix Services) today through a contact that corvus had
19:29:10 Their day was largely already over, but they said they will try to schedule a call with me tomorrow.
19:29:56 I suspect that corvus would be interested in being on that call. Is anyone else interested too? We'll be overlapping with the Pacific timezone and Europe, so the window for that isn't very large
19:30:40 thanks! i'm hoping we can narrow the options down and revise the spec with something more concrete there
19:30:41 I suspect this initial conversation will be super high-level and not incredibly important for everyone to be on. But I'm happy to include others if there is interest
19:30:52 corvus: ++
19:31:03 i can be on the call, but am happy to entrust the discussion to the two of you
19:32:08 alright, I'll see what they say schedule-wise tomorrow
19:32:12 #topic gitea01 backups
19:32:28 Not sure if anyone has looked into this yet, but gitea01 seems to be failing to back up to one of our two backup targets
19:32:54 is it somewhat random?
19:33:05 Thought I would bring it up here to ensure it wasn't forgotten. I don't think this is super urgent as we haven't made any recent project renames (which would update the db tables that we want to back up)
19:33:08 i haven't checked the logs, just noticed the notifications to the root inbox
19:33:11 ianw: no, it seems to happen consistently each day
19:33:13 seems like it's consistently every day
19:33:24 the consistency is why I believe only one backup target is affected
19:33:30 (otherwise we'd see multiple timestamps?)
19:33:39 i'm sure it's mysql dropping, right?
19:33:44 appears to have started on 2021-06-12
19:34:34 ianw: I haven't even dug in that far, but probably a good guess
19:34:45 clarkb: (sorry, I'd also love to be on the matrix call, but obviously don't block on me)
19:34:53 http://paste.openstack.org/show/807046/
19:35:28 socket timeouts maybe?
19:35:41 mordred: noted
19:35:43 i wonder if the connection goes idle waiting on the query to complete
19:35:56 but only to the vexxhost backup
19:36:12 which implies some router in that path dropping state prematurely
19:36:27 or nat if we're doing a vip
19:36:29 and this runs in vexxhost, right? so the external, further-away rax backup is working
19:36:40 yup, gitea01 is in sjc vexxhost
19:36:44 and the mysql is localhost
19:37:11 oh, it's vexx-to-vexx dropping? hmm... yeah that's strange
19:37:23 and same region presumably
19:37:36 64 bytes from 2604:e100:1:0:f816:3eff:fe83:a5e5 (2604:e100:1:0:f816:3eff:fe83:a5e5): icmp_seq=6 ttl=47 time=72.0 ms
19:37:44 64 bytes from backup01.ord.rax.opendev.org (2001:4801:7825:103:be76:4eff:fe10:1b1): icmp_seq=3 ttl=52 time=49.9 ms
19:37:51 the ping to rax seems lower
19:37:57 also surprising
19:38:16 if the backup server is in montreal then that would make sense
19:38:25 since ord is slightly closer to sjc than montreal is
19:38:47 anyway, we don't have to do live debugging in the meeting. I just wanted to bring it up as a not-super-urgent issue, but one that should probably be addressed
19:39:00 (the db backups in both sites should be complete until we do a project rename)
19:39:05 i thought he was saying that higher rtt was locally within vexxhost
19:39:15 but yeah, we can dig into it after the meeting
19:39:21 as it is project renames that update the redirects which live in the db
19:39:27 this streams the output of mysqldump directly to the server
19:39:52 #topic Scheduling Project Renames
19:40:04 so if anyone knows any timeout options for that, let me know :)
19:40:08 Let's move on and then we can discuss further at the end, or eat lunch/breakfast/dinner :)
19:40:21 in theory we can "just do it" now that the rename playbook no longer tries to update the nonexistent mysql db
19:40:44 For project renames, do we want to try to incorporate that into the server move? My preference would be that maybe we do the renames the week after, once we're settled into the new server, and not try to overdo it
19:40:57 i don't think we had any other pending blockers besides actual scheduling anyway
19:40:59 fungi: linked one of the changes we need to do renames
19:41:04 #link https://review.opendev.org/c/opendev/system-config/+/797990/
19:41:20 yeah, once that merges i mean
19:41:47 Anyone have a concern with doing the renames a week after the move?
19:42:05 That should probably be enough time to be settled in on the new server, and if not we can always reschedule
19:42:11 ++
19:42:12 wfm
19:42:15 but that gives us a time frame to tell people to get their requests in by
19:42:23 great
19:42:39 and also a window to do any non-urgent post-move config tweaks
19:42:44 ++
19:42:58 in case we spot things which need adjusting
19:43:47 #topic Open Discussion
19:44:21 Anything else to bring up?
19:44:41 I think I have the container mostly set up for the ptgbot?
19:44:53 diablo_rojo: oh cool, are there changes that need review?
19:44:54 oh. failing zuul though.
19:45:23 on the oftc migration wrap-up, i have an infra manual change which needs reviewing:
19:45:24 clarkb, just the one, kinda? I haven't written the role for it yet. Started with setting up the container
19:45:25 #link https://review.opendev.org/797531 Switch docs from referencing Freenode to OFTC
19:45:42 diablo_rojo: have a link?
19:45:59 https://review.opendev.org/c/openstack/ptgbot/+/798025
19:46:29 great, I'll try to take a look at that change too. Feel free to reach out about the failures too
19:47:10 fungi: that looks like a good one to get in ASAP to avoid any additional confusion that it may be causing
19:47:36 there was some discussion between other reviewers about adjustments, so more feedback on preferences for those would be appreciated
19:48:04 diablo_rojo: i think you've got an openstack that should be an opendev at first glance: FileNotFoundError: [Errno 2] No such file or directory: '/home/zuul/src/openstack.org/opendev/ptgbot'
19:48:46 Oh, I thought I had that as opendev originally.
19:48:53 I can change that back
19:49:15 i think it has a high chance of working with that
19:49:21 Sweet.
19:49:24 Will do that now.
19:49:36 speaking of building images for external projects
19:49:38 #link https://review.opendev.org/c/openstack/project-config/+/798413
19:50:03 is there a reason lodgeit isn't in the openstack tenant? i can't reference its image build jobs from system-config jobs, so can't do a speculative build of the image
19:50:09 yeah, the ptgbot repo is openstack/ptgbot
19:50:25 the puppet-ptgbot repo we'll be retiring is opendev/puppet-ptgbot
19:50:29 different namespaces
19:50:32 yeah, i think "opendev.org/openstack/ptgbot" is the path
19:50:32 ianw: no, I think it was one of the very first moves out to opendev and we probably just figured it was fine to be completely separate
19:50:45 ianw: we've learned some stuff since then
19:51:13 ok, if we could add it with that review, that would be helpful :)
19:51:15 #link https://review.opendev.org/c/opendev/system-config/+/798400
19:51:16 ianw: you may need a null include for that repo though
19:51:26 ianw: since its jobs are expected to be handled in the opendev tenant
19:51:45 include: [] is what we do for gerrit just above in your change
19:51:57 corvus: ^ can probably confirm that
19:52:00 yeah, i think the expectation was that the rest would be moving to the opendev tenant in time, and then we could interlink them
19:52:16 it is working: https://104.130.239.208/ is a held node
19:52:35 i've managed to move some more leaf repos into the opendev tenant, but things heavily integrated with system-config or openstack/project-config are harder
19:53:24 but there is some sort of db timeout weirdness. when you submit, you can see in the network window it gets redirected to the new paste, but then it seems to take 60s for the query to return
19:53:43 i'm not yet sure if it's my janky hand-crafted server there or something systematic
19:53:52 suggestions welcome
19:54:00 ianw: if you hack /etc/hosts locally, wouldn't that avoid any redirect problems?
19:54:27 might help isolate things a bit. But I doubt that is a solution
19:55:04 i don't think it is name resolution; it really seems like the db, or something in sqlalchemy, takes that long to return
19:55:17 but then it does, and further queries work fine
19:55:34 it only happens the first time?
19:56:43 We are just about at time. I need lunch and then I have a large stack of changes and etherpads to review :) Thank you everyone! We'll be back here same time and place next week. As always, feel free to reach out to us anytime on the mailing list or in #opendev
19:56:55 when you paste a new ... paste. anyway, yeah, chat in #opendev
19:57:06 thanks clarkb!
19:57:26 ya sorry, realized we should move along (not going to lie, in part because I am now very hungry :) )
19:57:29 #endmeeting
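[On the 60-second first-query delay discussed near the end of the log: one common cause of a delay that only hits the first query after the service has sat idle is a pooled database connection that has silently gone stale, so the client waits on TCP before the pool recovers. Below is a minimal sketch of the usual SQLAlchemy-level mitigations, assuming a standard engine setup; the connection URL and values are placeholders, not the actual lodgeit configuration.]

    # Hypothetical SQLAlchemy engine settings guarding against stale pooled
    # connections; the URL and numbers are placeholders, not the real paste
    # service configuration.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "mysql+pymysql://paste:secret@localhost/lodgeit",  # placeholder URL
        pool_pre_ping=True,   # test each pooled connection with a cheap ping before reuse
        pool_recycle=300,     # drop pooled connections that have been open more than 5 minutes
    )

    with engine.connect() as conn:
        # With pre-ping enabled, a dead idle connection is detected and replaced
        # here instead of the first real query hanging until TCP gives up.
        print(conn.execute(text("SELECT 1")).scalar())

[If the service is on an older SQLAlchemy release without pool_pre_ping, a short pool_recycle value alone tends to have a similar effect.]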