19:01:18 #startmeeting infra
19:01:20 Meeting started Tue Mar 30 19:01:18 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:23 The meeting name has been set to 'infra'
19:01:27 #link http://lists.opendev.org/pipermail/service-discuss/2021-March/000199.html Our Agenda
19:01:57 o/
19:01:57 I wasn't around last week, but will do my best :) feel free to jump in and help keep things going in the right direction
19:02:02 o/
19:02:59 #topic Announcements
19:03:33 I didn't have any. Do others?
19:03:52 i don't think so
19:03:59 gitea was upgraded
19:04:07 keep an eye out for oddities?
19:04:27 ++
19:04:27 zuul was recently updated to move internal scheduler state into zookeeper
19:04:37 keep an eye on that too
19:05:11 #topic Actions from last meeting
19:05:19 #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-03-23-19.01.txt minutes from last meeting
19:05:32 ianw had an action to start asterisk retirement. I saw an email to service-discuss about it.
19:05:53 no response on that, so i guess i'll propose the changes soon
19:06:06 ianw do you want to keep the action around until the changes are up and/or landed? seems to be moving along at least
19:06:20 sure, make sure i don't forget :)
19:06:36 #action ianw Propose changes for asterisk retirement
19:06:47 #topic Priority Efforts
19:06:54 #topic OpenDev
19:07:03 as mentioned we upgraded gitea from 1.13.1 to 1.13.6
19:07:31 keep an eye out for weirdness.
19:07:48 Do we also want to reenable project description updates and see if 1.13.6 handles that better? or maybe get the token usage change in first?
19:08:34 tokens seem to maybe isolate us from any future hashing changes, but either way i think we can
19:09:04 ianw: maybe I should push up the description update change again and then compare dstat results with and without the token use.
19:09:20 that should give us a good indication of whether or not 1.13.6 has improved hashing enough?
19:09:25 maybe
19:09:54 #link https://review.opendev.org/c/opendev/system-config/+/782887
19:09:54 it was never a complete smoking gun that project management changes triggered the cpu load
19:09:54 for anyone reading without context :)
19:10:09 they would sometimes overload *a* gitea backend and the rest would be perfectly happy
19:10:25 ya I suspect it has to do with background load as well
19:10:32 so if we want to experiment in that direction, we'll need to leave it in that state for a while and it's not a surety
19:10:36 due to the way we load balance we don't necessarily get a very balanced load
19:11:50 I also made some new progress on the gerrit account classification process before taking time off
19:12:10 if you can review the groups in review:~clarkb/gerrit_user_cleanups/notes.20210315 and determine if they can be safely cleaned up like previous groups, that would be great
19:12:23 I'll pick that up again once others have had a chance to cross check my work
19:12:29 #link https://review.opendev.org/c/opendev/system-config/+/780663 more user auditing improvements
19:12:41 that is a related scripting improvement. Looks like I have one +2 so I may just approve it today
19:13:01 essentially I had the scripts collect a bunch of data into yaml, then I could run "queries" against it to see different angles
19:13:12 the different angles are written down in the file above and can be cross checked
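
For readers without the audit script in front of them, the "collect into yaml, then query it" workflow described above follows roughly the pattern below. This is only a sketch: the file name and record fields are assumptions for illustration, not the actual output of the system-config audit scripts.

    # Hypothetical example of querying collected gerrit account data from one "angle".
    import yaml

    with open("gerrit_user_audit.yaml") as f:  # assumed file name
        accounts = yaml.safe_load(f)

    # e.g. inactive accounts with no external ids, candidates for safe cleanup
    candidates = [
        a for a in accounts
        if not a.get("active") and not a.get("external_ids")
    ]
    for account in candidates:
        print(account["id"], account.get("email", "<no email>"))
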
19:14:31 #topic Update Configuration Management
19:14:42 Any new config mgmt updates we should be aware of/review?
19:16:08 i don't think so
19:16:19 #topic General Topics
19:16:30 #topic Server Upgrades
19:16:59 I did end up completing the upgrades for zuul executors and mergers and nodepool launchers
19:17:09 That leaves us with the zookeeper cluster and the scheduler itself
19:17:25 I have started looking at the zk upgrade and writing notes on an etherpad
19:17:26 #link https://etherpad.opendev.org/p/opendev-zookeeper-upgrade-2021
19:18:02 that etherpad proposes two options we could take to do the upgrade. If y'all can review it and make sure the plans are complete and/or express an opinion on which path you would like to take, I can boot instances and keep pushing on that
19:20:02 #topic Deploy new refstack server
19:20:10 #link https://review.opendev.org/c/opendev/system-config/+/781593
19:20:26 this change merged yesterday. ianw should I go ahead and remove this item from the meeting agenda?
19:20:51 yep, deployment job ran so i'm not aware of anything else to do there
19:21:01 cool I'll get that cleaned up
19:22:15 #topic PTG Planning
19:22:31 I did submit a survey and put us on the schedule last week
19:22:52 the event runs April 19-23 and I selected Thursday April 22 1400-1600 UTC and 2200-0000 UTC for us
19:23:16 the first time should hopefully work for those in EU timezones and the second for those in asia/pacific/australia
19:23:45 my thought on that was we could do office hours and try to help some of our new project-config reviewers get up to speed or help other projects with infra related items
19:24:31 if the times just don't work or you think we need more or less, let me know. I indicated we may need to rearrange scheduling when I filled out the survey
19:24:51 #topic docs-old volume cleanup
19:25:14 not sure if this is still current but it was on the agenda so here it is :)
19:25:52 oh it was from when i was clearing out space the other day
19:26:05 do we still need docs-old?
19:26:39 we do not
19:26:47 is docs-old where we stashed the really old openstack documentation so that it could be found if people have really old installations but otherwise wouldn't show up in google results?
19:27:12 that was kept around for people to manually copy things from if we failed to rebuild them during the transition to zuul v3
19:27:39 i think anything we weren't actively building but was relevant was manually copied to the docs volume
19:27:47 clarkb: yeah, the concern was it leaking into google via https://static.opendev.org/docs-old/ which i guess has nothing to stop that
19:28:08 ok, well it sounds like i can remove it then
19:28:10 we should probably add a robots.txt to exclude spiders from the whole static vhost
19:28:38 would it make sense to see if Ajaeger has an opinion?
19:28:46 since Ajaeger was pretty involved in that at the time iirc
19:29:44 fungi: yeah, i can propose that. everything visible there should have a "real" front-end i guess
19:31:20 I don't have enough of the historical context to make a decision. I'll defer to others, but suggest maybe double checking with ajaeger if we can
19:31:57 ok, i can ask, don't want to bother him with too much old cruft these days :)
19:32:21 ya I don't think ajaeger needs to help with cleanup or backups or anything, just indicate if he thinks any of it is worth saving
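
If the whole static vhost is to be excluded from indexing as suggested above, the robots.txt served at the root of static.opendev.org could be as simple as the following. This is a sketch only; the actual change might instead scope the exclusion to /docs-old/ or other specific paths.

    User-agent: *
    Disallow: /
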
19:32:51 #topic planet.openstack.org
19:33:05 Another one I don't have a ton of background on, but I see a retire it option and I like the sound of that >_>
19:33:23 looks like the aggregator software is not being maintained anymore which puts us in a weird spot doing server updates
19:33:26 yeah, linux australia retired their planet which made me think of it
19:33:40 i guess we should probably at least let the folks using it know somehow
19:33:45 like make an announcement
19:33:58 ++ and probably send that one to openstack-discuss given the service utilization
19:34:00 i did poke at aggregation software, i can't see any that look python3 and maintained
19:34:00 i could get the foundation to include a link to the announcement in a newsletter
19:34:16 basically say the software is not maintained and we can't find alternatives. We will retire the service as a result.
19:34:23 i thought we could replace it with a site on static that has an OPML of the existing blogs if we like
19:34:37 these days, an RSS to twitter feed would probably be more relevant anyway
19:34:38 or if the foundation sees benefit in it, they may have a different way they would want to do something similar anyway
19:34:55 yeah
19:35:27 microblogging sites have really become the modern blog aggregators anyway
19:35:57 (i did actually look for an rss to twitter thing too, thinking that would be more relevant. nothing immediately jumped out, a bunch of SaaS type things)
19:36:05 ya twitter, hacker news, reddit etc seem to be the modern tools
19:36:24 and authors just send out links from their accounts on those platforms
19:36:26 vale RSS, RIP with google reader
19:37:10 maybe give me an action item to remember and i can send that mail and start the process
19:38:31 #action ianw Announce planet.o.o retirement
19:38:42 i am old enough to remember when jdub wrote and released the original "planet" and we all thought that was super cool and created a bunch of planets
19:39:02 #topic Tarballs ORD replication
19:39:26 ok, last one, again from clearing out things earlier in the week
19:40:04 of the things we might want to keep if a datacentre burns down, i think tarballs is pretty much the only one not replicated?
19:40:10 #link https://etherpad.opendev.org/p/gjzssFmxw48Nn3_SBVo6
19:40:13 that's the list
19:41:09 docs is already replicated
19:41:15 ++ I think the biggest consideration has been that the vos release to a remote site of large sets of data isn't quick
19:41:23 I think tarballs is not as large as our mirrors but bigger than docs?
19:41:33 I also suspect that we can set it up and see how bad it is and go from there?
19:41:37 yeah, in that ballpark
19:41:51 also the churn is not bad as it's mostly append-only
19:42:05 or at least that's the impression i have
19:42:17 i guess we'll find out if that's really true
19:42:36 yeah, i don't think it's day-to-day operation; just recovery situations
19:42:39 which happen more than you'd hope
19:43:05 but still, i'd hate to feel silly if something happened and we just didn't have a copy of it
19:44:07 ya I think this is the sort of thing where we can make the change, monitor it to see if it is unhappy and go from there
19:44:12 ORD has plenty of space. we can always drop the RO there in a recovery situation i guess too, if we need
19:44:27 alright, i'll set that up. lmk if you think anything else in that list is similar
19:44:29 I want to say the newer openafs version we upgraded to is better about higher latency links?
19:44:59 apparently, but still there's only so fast data gets between the two when it's a full replication scenario
19:45:40 ianw: maybe do all the project.* volumes?
19:46:08 I think those host docs for various things like zuul and starlingx
19:46:25 mirror.* shouldn't matter and is likely to be the most impacted by latency
19:46:46 yeah, probably a good idea. i can update the docs for volume creation because we've sometimes done it and sometimes not, it seems
19:46:55 ++
19:47:24 sure, small volumes are probably good to mirror more widely if for no other reason than we can, and they're one less thing we might lose in a disaster
19:48:12 yeah, it all seems theoretical, but then ... fires do happen! :)
19:49:29 indeed
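
For those unfamiliar with the AFS side of the replication discussed above, adding a read-only replica in ORD and pushing data to it boils down to a vos addsite followed by a vos release. The server, partition, and volume names below are placeholders for illustration, not the exact names used in production:

    # add a read-only site for the volume on an ORD fileserver (names assumed)
    vos addsite -server afs01.ord.openstack.org -partition vicepa -id project.tarballs
    # push the current read-write contents out to all read-only sites
    vos release project.tarballs -verbose

As noted in the discussion, releasing a large volume over a high latency link can take a while, which is why the plan is to set it up and watch how it behaves.
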
19:49:37 #topic Open Discussion
19:49:47 That was all on the published agenda
19:49:59 i have a couple of easy ones from things that popped up
19:50:02 worth noting we think we have identified a zuul memory leak which is causing zk disconnects
19:50:11 #link https://review.opendev.org/c/opendev/system-config/+/782868
19:50:15 stops dstat output to syslog
19:50:24 fungi was going to restart the scheduler to reset the leak and keep us limping along. corvus mentioned being able to actually debug tomorrow
19:50:31 #link https://review.opendev.org/c/opendev/system-config/+/783120
19:50:40 puts haproxy logs into our standard container locations
19:50:59 #link https://review.opendev.org/c/opendev/system-config/+/782898
19:51:06 ianw: the dstat thing is unexpected but change lgtm
19:51:09 allows us to boot very large servers when they are donated to us :)
19:51:28 ha on that last one
19:52:10 yeah, we're a few minutes out from being able to restart the scheduler without worrying about openstack release impact
19:52:26 i'm just waiting for one build to finish updating the releases site
19:52:34 is it helpful to restart with a debugger or anything for the leak?
19:53:10 oh, clarkb, that oddity we were looking at with stale gerritlib used in a jeepyb job? it happened again when i rechecked
19:53:17 clarkb: yeah, i was like "i'm sure i provided a reasonable size for boot from volume ... is growroot failing, etc. etc." :)
19:53:33 ianw: I want to say we already have a hook to run profiling on object counts
19:53:42 ianw: but that is a good question and we should confirm with corvus before we restart
19:53:49 i have not previously used a debugger when debugging a zuul memory leak; only the repl and siguser
19:54:07 i'm always open to new suggestions on debugging memleaks though :)
19:54:20 seems like the repl stuff and getting object counts has been really helpful in the past at least
19:56:38 corvus: when I've tried in the past it's been "fun" to figure out adding debugging symbols and all that. I suspect that since we use a compiled python via docker this may be even more fun?
19:56:49 we can't just install the debugger symbols package from debian
19:57:05 (sorting that out may be a fun exercise for someone with free time though, as it may be useful generally)
19:57:25 sounds like this may be about it. I can end here and we can go have breakfast/lunch/dinner :)
19:57:29 thank you everyone!
19:57:31 #endmeeting
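
As a footnote on the "object counts" approach mentioned in the closing discussion: the general technique is to snapshot counts of live objects by type and compare snapshots over time to see which types keep growing. A minimal stdlib-only sketch of that idea (generic Python, not Zuul's actual repl or signal-handler tooling):

    import gc
    from collections import Counter

    def top_object_types(limit=20):
        # count live objects tracked by the garbage collector, grouped by type name
        counts = Counter(type(obj).__name__ for obj in gc.get_objects())
        return counts.most_common(limit)

    for name, count in top_object_types():
        print(f"{count:10d}  {name}")
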