19:01:20 <clarkb> #startmeeting infra
19:01:21 <openstack> Meeting started Tue Sep 15 19:01:20 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 <openstack> The meeting name has been set to 'infra'
19:01:50 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000097.html Our Agenda
19:01:58 <clarkb> #topic Announcements
19:02:19 <clarkb> If our air clears out before the end of the week I intend to take a day off to get out of the house. This is day 7 or something of not going outside
19:02:48 <clarkb> but the forecasts have a really hard time predicting that and it may not happen :( just a heads up that I may pop out to go outside if circumstances allow
19:03:02 <frickler> good luck with that
19:03:20 <ianw> o/
19:03:31 <clarkb> #topic Actions from last meeting
19:03:37 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-08-19.01.txt minutes from last meeting
19:03:43 <clarkb> There were no recorded actions
19:04:20 <clarkb> #topic Priority Efforts
19:04:26 <clarkb> #topic Update Config Management
19:04:53 <clarkb> nb03.opendev.org has replaced nb03.openstack.org. We are also trying to delete nb04.opendev.org but it has built our only tumbleweed images and there are issues building new ones
19:05:05 <clarkb> Overall though I think that went pretty well. One less thing running puppet
19:06:01 <clarkb> to fix the nb04.opendev.org thing I think we want https://review.opendev.org/#/c/751919/ and its parent in a dib release
19:06:17 <clarkb> then nb01 and nb02 can build the tumbleweed images and nb04 can be deleted
19:06:28 <clarkb> ianw: ^ fyi that seemed to fix the tumbleweed job on dib itself if you have a chance to review it
19:06:50 <clarkb> there are pypi problems which we'll talk about in a bit
19:06:51 <ianw> ok, there were a lot of gate issues yesterday, but they might have all been fixed
19:06:58 <ianw> or, not :)
19:06:59 <fungi> the priority for containerized storyboard deployment has increased a bit, i think... just noticed that we're no longer deploying new commits to production because it now requires python>=3.6 which is not satisfiable on xenial, and if we're going to redeploy anyway i'd rather not spend time trying to hack up a solution in puppet for that
19:07:29 <clarkb> fungi: any idea why we bumped the python version if the deployment doesn't support it? just a miss?
19:07:40 <corvus> o/
19:07:49 <clarkb> fungi: I agree that switching to a container build makes sense
19:08:37 <fungi> it was partly cleanup i think, but maybe also dependencies which were dropping python2.7/3.5
19:09:18 <fungi> storyboard has some openstackish deps, like oslo.db
19:09:31 <fungi> we'd have started to need to pin a bunch of those
19:09:56 <clarkb> note that pip should handle those for us
19:10:06 <clarkb> because openstack libs properly set python version metadata
19:10:11 <clarkb> (no pins required)
19:10:15 <fungi> but not the version of pip shipped in xenial ;)
19:10:16 <clarkb> but we would get old libs
19:10:40 <clarkb> fungi: we install a newer pip so I think it would work
19:11:10 <fungi> yeah, we do, though looks like it's pip 19.0.3 on that server so maybe still too old
19:11:26 <fungi> could likely be upgraded
19:11:44 <fungi> i'm betting the pip version installed is contemporary with when the server was built
19:11:50 <clarkb> ya
19:11:58 <clarkb> in any case a container sounds good
19:12:15 <clarkb> any other config management issues to bring up?
19:13:05 <clarkb> #topic OpenDev
19:13:20 <clarkb> I've not made progress on Gerrit things recently. Way too many other fires and distractions :(
19:13:32 <clarkb> ttx's opendev.org front page update has landed and deployed though
19:13:41 <clarkb> fungi: are there followups for minor fixes that you need us to review?
19:13:45 <fungi> yeah, i still have some follow-up edits for that on my to do list
19:13:51 <fungi> haven't pushed yet, no
19:13:58 <clarkb> ok, feel free to ping me when you do and I'll review them
19:14:08 <fungi> gladly, thanks
19:15:17 <clarkb> I didn't really have anything else to bring up here? Anyone else have something?
19:15:57 <fungi> seems like we can probably dive into the ci issues
19:16:13 <clarkb> #topic General Topics
19:16:38 <clarkb> really quickly want to go over two issues that I think we've mitigated, then we can dig into new ones
19:16:42 <clarkb> #topic Recurring bogus IPv6 addresses on mirror01.ca-ymq-1.vexxhost.opendev.org
19:17:02 <clarkb> We did end up setting a netplan config file to statically configure what vexxhost was setting via RAs on this server
19:17:12 <clarkb> since we've done that I've not seen anyone complain about broken networking on this server
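A minimal sketch of the kind of netplan override discussed here, assuming the usual /etc/netplan drop-in mechanism; the interface name and addresses are placeholders, not the mirror's real values:

    $ cat /etc/netplan/50-static-ipv6.yaml
    network:
      version: 2
      ethernets:
        ens3:                      # placeholder interface name
          dhcp4: true
          accept-ra: false         # ignore the stray RAs entirely
          addresses:
            - "2001:db8::10/64"    # placeholder address
          gateway6: "2001:db8::1"  # placeholder gateway
    $ sudo netplan apply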
19:17:25 <fungi> we should probably link the open neutron bug here for traceability
19:17:42 <fungi> though i don't have the bug number for that handy
19:17:50 <frickler> I've started my tcpdump again earlier to double check what happens when we see another stray RA
19:18:04 <clarkb> #link https://bugs.launchpad.net/bugs/1844712 This is the bug we think causes the problems on mirror01.ca-ymq-1.vexxhost.opendev.org
19:18:06 <openstack> Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete]
19:18:11 <fungi> thanks, perfect
19:18:12 <clarkb> fungi: ^ there it is
19:18:16 <frickler> would also be good to hear back from mnaser whether we should still collect data to trace things
19:19:20 <clarkb> mnaser: ^ resyncing on that would be good. Maybe not in the meeting as I don't think you're here, but at some point
19:19:35 <clarkb> #topic Zuul web performance issues
19:20:18 <clarkb> Last week users noticed that the zuul web ui was really slow. corvus identified a bug that caused zuul's status dashboard to always fetch the status json blob even when tabs were not foregrounded
19:20:29 <clarkb> and ianw sorted out why we weren't caching things properly in apache
19:20:43 <clarkb> it seems now like we are stable (at least as a zuul web user I don't have current complaints)
19:20:56 <fungi> brief anecdata, the zuul-web process is currently using around a third of a vcpu according to top, rather than pegged at 100%+ like it was last week
19:21:28 <clarkb> I also have a third change up to run more zuul web processes. It seems we don't need that anymore so I may rewrite it to be a switch from mod rewrite proxying to mod proxy proxying as that should perform better in apache too
19:22:04 <clarkb> thank you for all the help on that one
19:22:17 <corvus> clarkb: why would it perform better?
19:22:25 <corvus> clarkb: (i think mod_rewrite uses mod_proxy with [P])
19:22:29 <clarkb> corvus: that was something that ianw found when digging into the cache behavior
19:22:47 <clarkb> ianw: ^ did you have more details? I think it may be because the rules matching is simpler?
19:23:02 <ianw> well the manual goes into it, i'll have to find the page i read
19:23:17 <corvus> so something like "mod_proxy runs less regex matching code?"
19:23:39 <clarkb> "Using this flag triggers the use of mod_proxy, without handling of persistent connections. This means the performance of your proxy will be better if you set it up with ProxyPass or ProxyPassMatch"
19:23:55 <clarkb> I guess its connection handling that is the difference
19:23:57 <ianw> no something about the way it's dispatched with threads or something
19:24:07 <ianw> or yeah, what clarkb wrote :) ^
19:24:15 <clarkb> from http://httpd.apache.org/docs/current/rewrite/flags.html#flag_p
19:24:22 <ianw> that's it :)
19:24:23 <corvus> does that affect us?
19:25:13 <clarkb> I could see that being part of the problem with many status requests
19:25:20 <clarkb> if a new tcp connection has to be spun up for each one
19:25:22 <fungi> that probably won't change the load on zuul-web, only maybe the apache workers' cycles?
19:25:35 <clarkb> fungi: but ya the impact would be on the apache side not the zuul-web side
19:25:43 <corvus> i'm not sure we ever evaluated whether zuul-web is well behaved with persistent connections
19:27:21 <clarkb> corvus: something to keep in mind if we change it I guess. Would you prefer we leave it with mod rewrite?
19:27:55 <corvus> anyway, if we don't need mod_rewrite, i'm fine changing it.  maybe we'll learn something.  i normally prefer the flexibility of rewrite.
19:27:58 <fungi> are the connections tied to a thread, or are the descriptors for them passed around between threads? if the former, then i suppose that could impact thread reaping/recycling too
19:29:49 <fungi> (making stale data a concern)
19:30:13 <clarkb> I think connection == thread and they have a ttl
19:30:24 <clarkb> or can have a ttl
19:30:37 <fungi> that's all handled by cherrypy?
19:30:42 <clarkb> no thats in apache
19:30:50 <clarkb> I don't know what cherrypy does
19:31:19 <clarkb> its less critical anyway since rewrite seems to work now
19:31:29 <fungi> oh, sorry, i meant the zuul-web threads. anyway yeah no need to dig into that in the meeting
19:31:33 <clarkb> it was just part of my update to do multiple backends, because I don't think rewrite can do multiple backends
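A hedged sketch of the vhost change being floated: swapping the per-request mod_rewrite [P] proxying for mod_proxy's ProxyPass, which can keep persistent backend connections. The filename, paths, and port are illustrative, not the real zuul vhost config:

    $ cat /etc/apache2/sites-enabled/50-zuul.conf   # hypothetical path
    # before: proxying via mod_rewrite, no connection reuse
    RewriteRule ^/api/(.*)$ http://127.0.0.1:9000/api/$1 [P,L]

    # after: mod_proxy with keepalive to the zuul-web backend
    ProxyPass        /api/ http://127.0.0.1:9000/api/ keepalive=On
    ProxyPassReverse /api/ http://127.0.0.1:9000/api/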
19:31:48 <clarkb> #topic PyPI serving stale package indexes
19:32:07 <clarkb> This has become the major CI issue for all of our python based things in the last day or so
19:32:36 <clarkb> the general behavior is that a project using constraints pinned to a recent (and latest) package version fails because only versions prior to that latest version are present in the index served to pip
19:33:03 <clarkb> there has been a lot of confusion about this from people, from thinking that 404s are the problem to blaming AFS
19:33:30 <clarkb> AFS is not involved and there are no 404s. We appear to be getting back a valid index because pip says "here is the giant list of things I can install that doesn't include the version you want"
19:33:50 <clarkb> projects that don't use constraints (like zuul) may be installing prior versions of things occasionally since that won't error
19:34:02 <clarkb> but I expect they are mostly happy as a result (just something to keep in mind)
19:34:16 <frickler> I was wondering if we could keep a local cache of u-c pkgs in images, just like we do for git repos
19:34:18 <clarkb> we have seen this happen for different packages across our mirror nodes
19:34:26 <frickler> #link https://pip.pypa.io/en/latest/user_guide/#installing-from-local-packages
19:34:32 <clarkb> frickler: I think pip will still check pypi
19:34:42 <clarkb> though it may not error if pypi only has older versions
19:34:54 <ianw> i wonder if other people are seeing it, but without such strict constraints as you say, don't notice
19:35:01 <clarkb> ianw: ya that is my hunch
19:35:05 <frickler> running the pip download with master u-r gives me about 300M of cache, so that would sound feasible
19:35:39 <frickler> IIUC the "--no-index --find-links" should assure a pure local install
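A rough sketch of the image-cache idea, assuming a constraints file and a wheel directory at hypothetical paths baked into the image; pip download pre-populates the cache at image build time and --no-index --find-links installs purely from it at run time:

    $ pip download -c upper-constraints.txt -r requirements.txt -d /opt/pip-cache/wheels
    $ pip install --no-index --find-links /opt/pip-cache/wheels -r requirements.txt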
19:35:43 <clarkb> the other thing I notice is that it seems to be openstack built packages
19:36:16 <clarkb> frickler: that becomes tricky because it means we'd have to update our images before openstack requirements can update constraints
19:36:45 <clarkb> frickler: I'd like to avoid that tight coupling if possible as it will become painful if image builds have separate problems but openstack needs a bug fix python lib
19:36:56 <clarkb> it's possible we could use an image cache while still checking indexes though
19:37:01 <fungi> current release workflow is that as soon as a new release is uploaded to pypi a change is pushed to the requirements repo to bump that constraint entry
19:37:16 <ianw> fungi: i noticed in scrollback you're running some grabbing scripts; i had one going in gra1 yesterday too, with no luck catching an error
19:37:52 <fungi> ianw: yep, we're evolving that to try to make it more pip-like in hopes of triggering maybe pip-specific behaviors from fastly
19:38:09 <fungi> i'm currently struggling to mimic the compression behavior though
19:38:27 <clarkb> https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ version 4.1.0 is one that just failed a couple minutes ago at https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b5f/751040/4/check/openstack-tox-pep8/b5ff50c/job-output.txt
19:38:33 <ianw> i've definitely managed to see it happen from fastly directly ... but i guess we can't rule out apache?
19:38:33 <clarkb> but if you load it 4.1.0 is there :/
19:38:45 <fungi> wget manpage on the mirror server says it supports a --compression option but then wget itself complains about it being an unknown option
19:39:06 <clarkb> ianw: ya its possible that apache is serving old content for some reason,
19:39:15 <clarkb> pip sets cache-control: max-age=0
19:39:33 <clarkb> that should force apache to check with fastly that the content it has cached is current for every pip index request though
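One way to spot-check that revalidation behavior, using the mirror URL from the failure linked above: send the same Cache-Control header pip sends and look at whichever cache-related response headers come back from Apache and Fastly:

    $ curl -s -o /dev/null -D - -H 'Cache-Control: max-age=0' \
        https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ \
      | grep -iE '^(age|date|last-modified|x-cache|x-served-by):'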
19:40:05 <clarkb> we could try disabling the pypi proxies in our base job
19:40:10 <clarkb> and see if things are still broken
19:40:39 <clarkb> I expect they will be because ya a few months back it seemed like we could reproduce from fastly occasionally
19:41:18 <clarkb> I also wonder if it could be a pip bug
19:41:22 <fungi> aha, compression option confusion resolved... it will only be supported if wget is built with zlib, and ubuntu 19.04 is the earliest they added zlib to its build-deps
19:41:38 <clarkb> perhaps in pip's python version checking of packages it is excluding results for some reason
19:42:36 <clarkb> but I haven't been able to reproduce that either (was fiddling with pip installs locally against our mirrors and same version of pip here is fine with it)
19:43:43 <clarkb> the other thing I looked into was whether or not pip debug logging would help and I don't think it will
19:43:45 <corvus> fungi: looks like curl has an option
19:43:56 <clarkb> at least for successful runs it doesn't seem to log index content
19:44:17 <clarkb> `pip --log /path/to/file install foo` is one way to set that up which we could add to say devstack and have good coverage
19:44:23 <clarkb> but I doubt it will help
19:44:48 <corvus> fungi: "curl -V" says "Features: ... libz" for me
19:45:06 <fungi> corvus: yep, i'm working on rewriting with curl instead of wget
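A sketch of the kind of curl probe being discussed: fetch the simple index roughly the way pip does (gzip accepted, max-age=0, a pip-ish user agent) and count how often the missing release appears. The user-agent string here is a simplified stand-in for pip's real one:

    $ curl -s --compressed \
        -H 'Cache-Control: max-age=0' \
        -A 'pip/20.2.3' \
        https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ \
      | grep -c '4\.1\.0'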
19:46:03 <clarkb> not having issues with common third party packages does make me wonder if it is something with how we build and release packages
19:46:57 <clarkb> at least I've not seen anything like that fail. six, cryptography, coverage, eventlet, paramiko, etc: why don't they do this too?
19:47:47 <clarkb> other than six they all have releases within the last month too
19:48:49 <clarkb> I've largely been stumped. Maybe we should start to reach out to pypi even though what we've got is minimal
19:49:04 <clarkb> Anything else to add to this? or should we move on?
19:49:06 <fungi> do all the things we've seen errors for have a common-ish release window?
19:49:21 <clarkb> fungi: the oldest I've seen is oslo.log August 26
19:49:31 <clarkb> fungi: but it has happened for newer packages too
19:49:36 <fungi> okay, so the missing versions weren't all pushed on the same day
19:49:49 <clarkb> correct
19:50:36 <ianw> yeah, reproducing will be 99% of the battle i'm sure
19:50:39 <fungi> i suppose we could blow away apache's cache and restart it on all the mirror servers, just to rule that layer out
19:50:54 <fungi> though that seems like an unfortunate measure
19:51:04 <clarkb> fungi: ya its hard to only delete the pypi cache stuff
19:51:05 <ianw> i was just looking if we could grep through, to see if we have something that looks like an old index there
19:51:22 <clarkb> ianw: we can but I think things are hashed in weird ways, its doable just painful
19:51:29 <ianw> that might clue us in if it *is* apache serving something old
19:52:29 <clarkb> ya that may be a good next step then if we sort out apache cache structure we might be able to do specific pruning
19:52:42 <clarkb> if we decide that is necessary
19:52:57 <frickler> but do we see 100% failure now? I was assuming some jobs would still pass
19:53:01 <fungi> also if the problem isn't related to our apache layer (likely) but happens to clear up around the same time as we reset everything, we can't really know if what we did was the resolution
19:53:09 <ianw> looks like we can zcat the .data files in /var/cache/apache2/proxy
19:53:32 <ianw> i'll poke a bit and see if i can find a smoking gun old index, for say python-designateclient in mtl-01 (the failure linked by clarkb)
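One possible way to hunt for a stale cached index along those lines; the cache root is the Ubuntu default mod_cache_disk location and may differ on the mirrors:

    $ sudo find /var/cache/apache2/proxy -name '*.data' | while read -r f; do
        zcat -f "$f" 2>/dev/null | grep -q 'python-designateclient' && echo "$f"
      done
    # then zcat any hits and check whether 4.1.0 is present in the cached index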
19:53:33 <clarkb> frickler: correct, many jobs seem fine
19:53:47 <frickler> also, not sure if you saw my comment earlier, I did see the same issue with local devstack w/o opendev mirror involved
19:53:55 <clarkb> frickler: it's just enough of them to be noticed and cause problems for developers. But if you look at it on an individual job basis most are passing I think
19:54:11 <frickler> so I don't see how apache2 could be the cause
19:54:11 <fungi> and we're sure that the python version metadata says the constrained version is appropriate for the interpreter version pip is running under?
19:54:14 <clarkb> frickler: oh I hadn't seen that. Good to confirm that pypi itself seems to exhibit it in some cases
19:54:19 <clarkb> frickler: agreed
19:54:34 <clarkb> fungi: as best as I can tell yes
19:54:49 <fungi> just making sure it's not a case of openstack requirements suddenly leaking packages installed with newer interpreters into constraints entries for older interpreters which can't use them
19:55:00 <clarkb> fungi: in part because those restrictions are now old enough that the previous version that pip says is valid also has the same restriction on the interpreter
19:55:21 <clarkb> fungi: so it will say I can't install foo==1.2.3 but I can install foo==1.2.2 and they both have the same interpreter restriction in the index.html
19:55:21 <fungi> k
19:55:40 <clarkb> however; double double checking that would be a good idea
19:55:51 <clarkb> since our packages tend to be more restricted than say cryptography
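A quick way to do that double check, assuming the simple index exposes the interpreter restriction as the usual data-requires-python attribute: pull the anchor for the constrained version and compare its attribute with the one on the older version pip offered instead:

    $ curl -s https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ \
      | grep -o '<a [^>]*>[^<]*4\.1\.0[^<]*</a>'
    # repeat the grep for the older version pip fell back to and compare
    # the data-requires-python values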
19:56:24 <clarkb> anyway we're just about at time. I expect this to consume most of the rest of my time today.
19:56:35 <clarkb> we can coordinate further in #opendev and take it from there
19:56:41 <clarkb> #topic Open Discussion
19:56:50 <clarkb> Any other items that we want to call out before we run out of time?
19:57:52 <fungi> the upcoming cinder volume maintenance in rax-dfw next month should no longer impact any of our servers
19:58:10 <clarkb> thank you for handling that
19:58:13 <fungi> i went ahead and replaced or deleted all of them, with the exception of nb04 which is still pending deletion
19:58:23 <ianw> ++ thanks fungi!
19:58:23 <fungi> they've also cleaned up all our old error_deleting volumes now too
19:58:48 <fungi> note however they've also warned us about an upcoming trove maintenance. databases for some services are going to be briefly unreachable
19:59:39 <frickler> maybe double check our backups of those are in good shape, just in case?
19:59:48 <fungi> #info provider maintenance 2020-09-30 01:00-05:00 utc involving ~5-minute outages for databases used by cacti, refstack, translate, translate-dev, wiki, wiki-dev
19:59:51 <clarkb> I think ianw did just check those but ya double checking them is a good idea
20:00:04 <clarkb> and we are at time.
20:00:06 <clarkb> Thank you everyone!
20:00:09 <clarkb> #endmeeting