19:01:06 <clarkb> #startmeeting infra
19:01:07 <openstack> Meeting started Tue Jun 16 19:01:06 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-June/000039.html Our Agenda
19:01:26 <clarkb> #topic Announcements
19:01:31 <clarkb> I didn't have any announcements
19:02:00 <clarkb> #topic Actions from last meeting
19:02:07 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-09-19.01.txt minutes from last meeting
19:02:31 <clarkb> no actions recorded, but it is feeling like things are returning to normal after the PTG. Oddly it seemed like we still had the quiet week last week even though people didn't need to travel
19:02:45 <clarkb> (maybe that was just my perception)
19:02:46 <clarkb> #topic Specs approval
19:02:55 <clarkb> #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:03:06 <clarkb> This isn't ready for approval yet, but wanted to keep pointing eyeballs towards it
19:03:09 <mordred> o/
19:03:13 <clarkb> fungi: ^ anything else to say about this spec?
19:03:16 <fungi> i haven't updated it yet, had a "quiet week" ;)
19:03:29 <fungi> more comments appreciated though
19:03:39 <corvus> o/
19:03:40 * mordred looks forward to fungi's updates
19:04:56 <clarkb> #topic Priority Efforts
19:05:04 <clarkb> #topic Update Config Management
19:05:45 <clarkb> I'm not aware of a ton of changes here recently. Anyone have topics to bring up?
19:06:07 <mordred> corvus improved the disable-ansible script
19:06:08 <clarkb> (I'm moving somewhat quickly because our general topics list is pretty long this week and want to be sure we get there but feel free to bring stuff up under priority topics if relevant)
19:06:22 <mordred> nope. /me shuts up
19:06:27 <fungi> i've got a half-baked change to move the rest of our repo mirroring configuration from puppet to ansible
19:06:34 <clarkb> mordred: that's a good call out. We should get into the habit of providing detailed reasons for disabling ansible there
19:06:35 <mordred> \o/
19:07:07 <clarkb> fungi: is that ready for review yet?
19:07:21 <fungi> clarkb: it's ready for suggestions, but no it's not ready to merge yet
19:07:41 <fungi> it's a lot of me learning some ansible and jinja concepts for the first time
19:07:44 <clarkb> #link https://review.opendev.org/#/c/735406/ Ansiblify reprepro configs. It is WIP, comments welcome
19:08:40 <fungi> it will be a pretty massive diff
19:08:47 <fungi> (once complete)
19:09:18 <clarkb> it should be a pretty safe transition too as we can avoid releasing volumes until we are happy with the end results?
19:09:23 <clarkb> thanks for working on that
19:09:35 <fungi> yep, once i rework it following your earlier suggestion
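A minimal sketch of the kind of Ansible task such a conversion involves, assuming a role that templates the reprepro configuration; the template name and destination path below are illustrative, not taken from the actual change:

    # Sketch only: render the reprepro distributions file from a Jinja2
    # template (src/dest are hypothetical placeholders)
    - name: Write reprepro distributions config
      template:
        src: distributions.j2
        dest: /etc/reprepro/ubuntu/distributions
        owner: root
        group: root
        mode: '0644'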
19:10:30 <clarkb> #topic OpenDev
19:10:51 <clarkb> I don't have much to add here. I've completely failed at sending reminder emails about the advisory board but mnaser has responded. Thank you!
19:10:57 <clarkb> I'll really try to get to that this week
19:10:59 <mnaser> \o/
19:12:12 * mordred hands mnaser an orange
19:12:41 <clarkb> Anything else to add re OpenDev?
19:13:34 <mordred> oh -
19:13:59 <mordred> this is really minor - but I snagged the opendev freenode nick yesterday and put it in our secrets file (thanks for the suggestion mnaser)
19:14:31 <mordred> in case we want to use it for opendev-branded bots
19:14:57 <clarkb> thanks
19:15:35 <clarkb> #topic General Topics
19:15:53 <clarkb> #topic pip-and-virtualenv
19:16:09 <clarkb> This change has landed and we're starting to see more and more fallout from it. Nothing unexpected yet I don't think
19:16:41 <clarkb> possibly even a new case of it in #zuul right now  too :)
19:16:58 <clarkb> Keep an eye out for problems and thank you to AJaeger and mordred for jumping on fixes
19:17:30 <clarkb> ianw: where are we with considering the spec completed? and cleanup in the nodepool configs? can we start on that or should we wait a bit longer?
19:17:49 <ianw> i have pushed changes to do cleanup of the old jobs and images
19:18:17 <ianw> i guess it's not too high a priority right now, the changes are there and i'll keep on them as appropriate
19:19:31 <clarkb> thanks. Should I push up a change to mark the spec completed? or wait a bit more on cleanup for that?
19:20:26 <ianw> i guess we can mark it complete, if we consider dealing with the fallout complete :)
19:20:47 <ianw> i thought it was too quiet yesterday, so i'll try to catch up on anything i've missed
19:21:18 <clarkb> ianw: I think a lot of people may have had friday and monday off or something
19:21:23 <clarkb> because yeah, it definitely seems to be picking up now
19:21:56 <corvus> 40% of vacation days are fridays or mondays
19:23:16 <clarkb> #topic Zookeeper TLS
19:23:30 <clarkb> This is the thing that led to the ansible limbo
19:23:37 <clarkb> corvus: want to walk us through this?
19:23:43 <corvus> this topic was *supposed* to be about scheduling some downtime for friday to switch to zk tls
19:23:56 <corvus> but as it turns out, this morning we switched to tls and switched back already
19:24:13 <corvus> the short version is that yesterday our self-signed gearman certs expired
19:24:32 <corvus> well, technically just the ca cert
19:25:06 <corvus> which means that no zuul component could connect to gearman.  so we lost the use of the zuul cli, and if any component were restarted for any reason, it would be unable to connect, so the system would decay
19:25:25 <corvus> correcting that required a full restart, as did the zk tls work, so we decided to combine them
19:25:53 <corvus> unfortunately, shortly after starting the nodepool launchers, we ran into a bug
19:25:58 <corvus> #link kazoo bug https://github.com/python-zk/kazoo/issues/587
19:26:19 <corvus> so we manually reverted the tls change (leaving the new gear certs in place)
19:26:25 <corvus> and everything is running again.
19:26:46 <corvus> next steps: make sure this is merged:
19:26:51 <corvus> #link revert zk tls https://review.opendev.org/735990
19:27:02 <corvus> then when it is, we can clear the disable-ansible file and resume speed
19:27:20 <corvus> after that, i'm going to look into running zk in a mode where it can accept tls and plain connections
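For reference, one way ZooKeeper (3.5.5+) can accept both plaintext and TLS clients is to configure clientPort alongside secureClientPort with the Netty connection factory; a rough sketch of managing that with Ansible, with the file path and port numbers illustrative only:

    # Sketch only: enable dual plaintext/TLS client listeners in zoo.cfg
    - name: Allow both plain and TLS client connections
      lineinfile:
        path: /etc/zookeeper/conf/zoo.cfg
        line: "{{ item }}"
      loop:
        - serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
        - clientPort=2181
        - secureClientPort=2281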
19:27:36 <clarkb> and once disable ansible is cleared we'll get updates to a few docker images, apply dns zone backups, and change rsync flags for centos/fedora mirrors
19:27:36 <corvus> if that's possible, i'd like to restart the zk cluster with that, and then try to repro the bug against production
19:27:58 <clarkb> calling that out so others are aware there will be a few changes landing once ansible is reenabled
19:28:05 <clarkb> corvus: ++ I like that plan
19:28:08 <mordred> ++
19:28:11 <corvus> based on info from tobiash, we suspect it may have to do with response size, so it may help to get a repro case out of production data
19:28:25 <clarkb> corvus: we should be able to easily switch over a single builder or launcher without major impact to production to help sort out what is going on
19:28:48 <corvus> clarkb: agreed
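Pointing a single builder at the TLS port would roughly amount to a nodepool.yaml stanza like the following, assuming nodepool's zookeeper-tls settings; the hostname, port, and certificate paths are illustrative:

    # Sketch only: connect one nodepool builder to ZooKeeper over TLS
    zookeeper-servers:
      - host: zk01.opendev.org
        port: 2281
    zookeeper-tls:
      cert: /etc/nodepool/certs/client.pem
      key: /etc/nodepool/certs/client.key
      ca: /etc/nodepool/certs/ca.pem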
19:29:41 <corvus> eot
19:29:48 <clarkb> #topic DNS Cleanup
19:30:06 <clarkb> The change to implement the recording of zone data has landed and should apply to bridge when ansible starts rerunning
19:30:08 <clarkb> ianw: ^ fyi
19:30:22 <clarkb> probably want to make sure that is working properly once it goes in?
19:30:28 <ianw> yeah, i added the credentials so should be ok
19:31:06 <clarkb> I annotated the etherpad with notes on things I think we can clean up
19:31:23 <clarkb> what are we thinking about for cleanup? wait for backups to run with existing records first so we've got that info recorded then do cleanup?
19:31:30 <clarkb> (that is sort of what I thought would be a good process)
19:31:43 <mordred> yeah
19:32:03 <clarkb> k I can help with the button clicking to clean up records once we're at that point
19:32:23 <clarkb> ianw: anything else worth mentioning on this topic?
19:32:40 <ianw> yeah, we can iterate on it a bit then too, as the list gets shorter it's easier to see what can go :)
19:32:50 <ianw> nope, thanks
19:32:53 <clarkb> sounds good, thanks for putting this together
19:32:56 <clarkb> #topic Etherpad Upgrade to 1.8.4 or 1.8.5
19:33:13 <clarkb> Fungi did some work to get our etherpad server upgraded to 1.8.4 (from 1.8.0)
19:33:21 <clarkb> we then noticed that there was a UI rendering bug when testing that
19:33:29 <clarkb> #link https://review.opendev.org/#/c/729029/ Upgrade Etherpad
19:33:52 <fungi> not dissimilar from some of the weirdness we noticed with author colors when going through meetpad
19:33:54 <clarkb> this change now addresses that with a simple css hack that I came up with. Upstream they've fixed this differently with a fairly large css refactor and we should see that in the next release (1.8.5?)
19:34:17 <fungi> i wonder if that will also resolve it for meetpad uses
19:34:25 <clarkb> the question I've got is do we think we should upgrade with the workaround as 1.8.4 includes a bug fix around db writes? or wait for 1.8.5 to avoid potential UI weirdness
19:35:01 <fungi> i'm in no huge hurry. i'm excited for the potential fix for perpetually "loading..." pads, but other than that there's no urgency
19:35:24 <mordred> there's no urgency, but the workaround isn't super onerous either
19:35:41 <mordred> so I'm ok rolling it out in that form - or with waiting
19:35:57 <mordred> we can remove the sed from our dockerfile when we bump the version
19:36:24 <mordred> (it's not like one of those "in a few months we're not going to be paying attention and our local hack is going to bork us" situations)
19:36:40 <corvus> clarkb: what's the workaround?
19:36:48 <corvus> oh i see it now
19:36:51 <corvus> sorry, buried in the run cmd
19:36:52 <clarkb> corvus: changing the padding between the spans that contain lines
19:37:13 <clarkb> the way the padding was set up before caused the lines to overlap so their colors successively covered each other
19:37:42 <corvus> i agree it seems safe to move forward with 029
19:37:50 <fungi> you tracked that back to a particular change between 1.8.1 and 1.8.3 yeah?
19:38:00 <clarkb> the bug is also purely cosmetic so shouldn't affect content directly, just how we see it
19:38:13 <clarkb> fungi: ya it's in 1.8.3 (there was no 1.8.1 or 1.8.2 iirc)
19:39:09 <fungi> yeah, there was at least no 1.8.2, for sure
19:39:22 <clarkb> I think what I'm taking away from this is if everything else calms down (uwsgi, pip/virtualenv, dns, zk tls, etc) then we can go ahead with this and watch it
19:39:39 <fungi> sounds fine to me
19:39:45 <clarkb> thanks for the feedback
19:40:03 <clarkb> and if 1.8.5 happens before then we can drop the workaround
19:40:36 <clarkb> #topic Getting more stuff off of python2
19:41:15 <clarkb> One of the things that came out of the PTG was it would be useful for those a bit more familiar with our systems to do an audit of where we stand with python2 usage. This way others can dive in and port or switch runtimes
19:41:17 <clarkb> #link https://etherpad.opendev.org/p/opendev-tools-still-running-python2
19:42:06 <clarkb> I've started this audit in that etherpad. It is nowhere near complete. One thing that I have discovered is that a lot of our software is python3 capable but running under python2. We'll want to keep that in mind as we update configuration management: a good next step is to switch the runtime too
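For the python3-capable-but-still-on-python2 cases, the fix is often just being explicit about the interpreter when the service is installed; a hedged sketch of what that might look like, with the package name and path purely illustrative:

    # Sketch only: install a service into a python3 venv instead of
    # relying on the system default interpreter
    - name: Install service under python3 (hypothetical package/path)
      pip:
        name: some-service
        virtualenv: /opt/some-service
        virtualenv_command: python3 -m venv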
19:42:38 <clarkb> I have also found a couple cases that are definitely python2 only right now. Our use of supybot for meetbot and the logstash data pipeline. For meetbot we have a spec to replace it already which is noted on the etherpad
19:42:58 <clarkb> if you can think of other tools we need to be checking on feel free to add them to the list and I can dig in further as I have time
19:43:08 <clarkb> The goal here isn't really to fix everything as much as to be aware of what needs fixing
19:45:13 <fungi> as we move services and automation to platforms without python 2.7, we can fix things where needed
19:45:27 <fungi> if they don't become urgent before that
19:45:37 <clarkb> yup and it gives people a list of things they can pick off over time if they want to help out
19:45:46 <mordred> yeah - most of the things are easy enough to work on - but are pretty opaque that they need to be worked on
19:45:49 <mordred> what clarkb said
19:45:54 <fungi> useful to know where we expect the pain points for such moves to be though
19:46:57 <clarkb> #topic Trusty Upgrades
19:47:29 <clarkb> I don't have much to add on this topic but did want to point out it seems that osf's interop working group is picking up some steam. I'm hoping that may translate into some better interest/support for refstack
19:47:54 <clarkb> and we can maybe channel that into a refstack server upgrade. The docker work I did is actually really close. It's mostly a matter of having someone test it now (which I'm hoping the interop wg can do)
19:48:47 <clarkb> #topic Open Discussion
19:49:02 <clarkb> Anything else to bring up before we end the meeting?
19:51:06 <fungi> it's just come to light in #opendev that the zuul restart switched our ansible default from 2.8 to 2.9, so expect possible additional behavior changes
19:51:36 <clarkb> fungi: have we confirmed that (I theorized it and wouldn't be surprised)
19:51:59 <fungi> ianw: compared logs which showed passing results used 2.8 and the failure observed used 2.9
19:53:02 <corvus> yes, latest zuul master uses 2.9 by default
19:53:10 <corvus> we can pin the openstack tenant to 2.8 if we need to
19:53:34 <clarkb> so far it's only popped up as a single issue which has a fix
19:53:40 <clarkb> I guess if it gets worse we can pin
19:53:41 <fungi> i don't know yet if that's warranted, maybe the issues are small enough we can just fix them
19:53:44 <corvus> though istr we did some testing around this and didn't see a lot of issues
19:54:00 <fungi> apparently match as a filter is one
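For context, Ansible 2.9 removed the old "tests used as filters" syntax, so conditions written with the filter pipe need the test form instead; an illustrative before/after (task and variable names are made up):

    # Illustrative only: 2.8-era filter syntax that breaks on 2.9
    #   when: output.stdout | match('^ok')
    # 2.9-compatible test syntax:
    - name: Example conditional using match as a test
      debug:
        msg: matched
      when: output.stdout is match('^ok')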
19:54:02 <clarkb> corvus: ya we tested with a lot of general playbooks in zuul-jobs
19:54:04 <corvus> so yeah, i think it's probably best to roll forward
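If pinning does become necessary, Zuul's tenant configuration has a default-ansible-version setting for this; a minimal sketch, with the rest of the real tenant config omitted:

    # Sketch only: pin a tenant's default Ansible version in main.yaml
    - tenant:
        name: openstack
        default-ansible-version: '2.8'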
19:54:13 <frickler> also devstack is broken by a new uwsgi release and still needs fixes
19:54:40 <frickler> and neutron-grenade-multinode seems to suffer from the venv removal
19:55:37 <clarkb> frickler: ya I meant to look into why multinode was sad about that
19:56:40 <clarkb> the last day and a half have been distracting :) After lunch today I'll try to be useful for all the things
19:58:16 <clarkb> sounds like that may be it for our meeting
19:58:18 <clarkb> thank you everyone!
19:58:21 <clarkb> #endmeeting