19:01:20 #startmeeting infra
19:01:21 Meeting started Tue Sep 15 19:01:20 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 The meeting name has been set to 'infra'
19:01:50 #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000097.html Our Agenda
19:01:58 #topic Announcements
19:02:19 If our air clears out before the end of the week I intend on taking a day off to get out of the house. This is day 7 or something of not going outside
19:02:48 but the forecasts have a really hard time predicting that and it may not happen :( just a heads up that I may pop out to go outside if circumstances allow
19:03:02 good luck with that
19:03:20 o/
19:03:31 #topic Actions from last meeting
19:03:37 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-08-19.01.txt minutes from last meeting
19:03:43 There were no recorded actions
19:04:20 #topic Priority Efforts
19:04:26 #topic Update Config Management
19:04:53 nb03.opendev.org has replaced nb03.openstack.org. We are also trying to delete nb04.opendev.org but it has built our only tumbleweed images and there are issues building new ones
19:05:05 Overall though I think that went pretty well. One less thing running puppet
19:06:01 to fix the nb04.opendev.org thing I think we want https://review.opendev.org/#/c/751919/ and its parent in a dib release
19:06:17 then nb01 and nb02 can build the tumbleweed images and nb04 can be deleted
19:06:28 ianw: ^ fyi that seemed to fix the tumbleweed job on dib itself if you have a chance to review it
19:06:50 there are pypi problems which we'll talk about in a bit
19:06:51 ok, there were a lot of gate issues yesterday, but they might have all been fixed
19:06:58 or, not :)
19:06:59 the priority for containerized storyboard deployment has increased a bit, i think... just noticed that we're no longer deploying new commits to production because it now requires python>=3.6 which is not satisfiable on xenial, and if we're going to redeploy anyway i'd rather not spend time trying to hack up a solution in puppet for that
19:07:29 fungi: any idea why we bumped the python version if the deployment doesn't support it? just a miss?
19:07:40 o/
19:07:49 fungi: I agree that switching to a container build makes sense
19:08:37 it was partly cleanup i think, but maybe also dependencies which were dropping python2.7/3.5
19:09:18 storyboard has some openstackish deps, like oslo.db
19:09:31 we'd have started to need to pin a bunch of those
19:09:56 note that pip should handle those for us
19:10:06 because openstack libs properly set python version metadata
19:10:11 (no pins required)
19:10:15 but not the version of pip shipped in xenial ;)
19:10:16 but we would get old libs
19:10:40 fungi: we install a newer pip so I think it would work
19:11:10 yeah, we do, though it looks like it's pip 19.0.3 on that server so maybe still too old
19:11:26 could likely be upgraded
19:11:44 i'm betting the pip version installed is contemporary with when the server was built
19:11:50 ya
19:11:58 in any case a container sounds good
19:12:15 any other config management issues to bring up?
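
As context for the python version metadata point above: a sufficiently new pip (9.0 or later) reads each release's Requires-Python marker and skips versions the local interpreter cannot use, which is why no explicit pins would be needed. A minimal sketch of one way to inspect that marker via the PyPI JSON API; oslo.db is only used as an example package here:

  # Hypothetical check: show the interpreter requirement PyPI records for the
  # latest release of a package (oslo.db is just an example).
  curl -s https://pypi.org/pypi/oslo.db/json \
    | python3 -c 'import json,sys; info = json.load(sys.stdin)["info"]; print(info["name"], info["version"], "requires_python:", info["requires_python"])'

The distro pip on xenial is old enough to ignore that marker, which is the "old libs" concern raised above; the locally installed pip 19.0.3 should already honor it.
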
19:13:05 #topic OpenDev
19:13:20 I've not made progress on Gerrit things recently. Way too many other fires and distractions :(
19:13:32 ttx's opendev.org front page update has landed and deployed though
19:13:41 fungi: are there followups for minor fixes that you need us to review?
19:13:45 yeah, i still have some follow-up edits for that on my to do list
19:13:51 haven't pushed yet, no
19:13:58 ok, feel free to ping me when you do and I'll review them
19:14:08 gladly, thanks
19:15:17 I didn't really have anything else to bring up here? Anyone else have something?
19:15:57 seems like we can probably dive into the ci issues
19:16:13 #topic General Topics
19:16:38 really quickly want to go over two issues that I think we've mitigated, then we can dig into new ones
19:16:42 #topic Recurring bogus IPv6 addresses on mirror01.ca-ymq-1.vexxhost.opendev.org
19:17:02 We did end up setting a netplan config file to statically configure what vexxhost was setting via RAs on this server
19:17:12 since we've done that I've not seen anyone complain about broken networking on this server
19:17:25 we should probably link the open neutron bug here for traceability
19:17:42 though i don't have the bug number for that handy
19:17:50 I started my tcpdump again earlier to double check what happens when we see another stray RA
19:18:04 #link https://bugs.launchpad.net/bugs/1844712 This is the bug we think causes the problems on mirror01.ca-ymq-1.vexxhost.opendev.org
19:18:06 Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete]
19:18:11 thanks, perfect
19:18:12 fungi: ^ there it is
19:18:16 would also be good to hear back from mnaser on whether we should still collect data to trace things
19:19:20 mnaser: ^ resyncing on that would be good. Maybe not in the meeting as I don't think you're here, but at some point
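
For reference, the kind of netplan override described above generally looks like the sketch below; the file name, interface name, addresses, and gateway are placeholders, not the mirror's actual values:

  # /etc/netplan/60-static-v6.yaml (illustrative only; names and addresses are placeholders)
  network:
    version: 2
    ethernets:
      ens3:
        accept-ra: false        # ignore router advertisements, including rogue ones
        dhcp6: false
        addresses:
          - "2001:db8::10/64"   # statically configure what SLAAC had been providing
        gateway6: "2001:db8::1"

Applied with `netplan apply`, this keeps the kernel from picking up addresses or routes from stray RAs like the ones tracked in the bug above.
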
19:19:35 #topic Zuul web performance issues
19:20:18 Last week users noticed that the zuul web ui was really slow. corvus identified a bug that caused zuul's status dashboard to always fetch the status json blob even when tabs were not foregrounded
19:20:29 and ianw sorted out why we weren't caching things properly in apache
19:20:43 it seems now like we are stable (at least as a zuul web user I don't have current complaints)
19:20:56 brief anecdata, the zuul-web process is currently using around a third of a vcpu according to top, rather than pegged at 100%+ like it was last week
19:21:28 I also have a third change up to run more zuul web processes. It seems we don't need that anymore so I may rewrite it to be a switch from mod_rewrite proxying to mod_proxy proxying as that should perform better in apache too
19:22:04 thank you for all the help on that one
19:22:17 clarkb: why would it perform better?
19:22:25 clarkb: (i think mod_rewrite uses mod_proxy with [P])
19:22:29 corvus: that was something that ianw found when digging into the cache behavior
19:22:47 ianw: ^ did you have more details? I think it may be because the rules matching is simpler?
19:23:02 well the manual goes into it, i'll have to find the page i read
19:23:17 so something like "mod_proxy runs less regex matching code?"
19:23:39 "Using this flag triggers the use of mod_proxy, without handling of persistent connections. This means the performance of your proxy will be better if you set it up with ProxyPass or ProxyPassMatch"
19:23:55 I guess it's connection handling that is the difference
19:23:57 no, something about the way it's dispatched with threads or something
19:24:07 or yeah, what clarkb wrote :) ^
19:24:15 from http://httpd.apache.org/docs/current/rewrite/flags.html#flag_p
19:24:22 that's it :)
19:24:23 does that affect us?
19:25:13 I could see that being part of the problem with many status requests
19:25:20 if a new tcp connection has to be spun up for each one
19:25:22 that probably won't change the load on zuul-web, only maybe the apache workers' cycles?
19:25:35 fungi: but ya the impact would be on the apache side not the zuul-web side
19:25:43 i'm not sure we ever evaluated whether zuul-web is well behaved with persistent connections
19:27:21 corvus: something to keep in mind if we change it I guess. Would you prefer we leave it with mod_rewrite?
19:27:55 anyway, if we don't need mod_rewrite, i'm fine changing it. maybe we'll learn something. i normally prefer the flexibility of rewrite.
19:27:58 are the connections tied to a thread, or are the descriptors for them passed around between threads? if the former, then i suppose that could impact thread reaping/recycling too
19:29:49 (making stale data a concern)
19:30:13 I think connection == thread and they have a ttl
19:30:24 or can have a ttl
19:30:37 that's all handled by cherrypy?
19:30:42 no, that's in apache
19:30:50 I don't know what cherrypy does
19:31:19 it's less critical anyway since rewrite seems to work now
19:31:29 oh, sorry, i meant the zuul-web threads. anyway yeah no need to dig into that in the meeting
19:31:33 it was just part of my update to do multiple backends because rewrite can't do multiple backends I don't think
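
To make the comparison above concrete, a minimal sketch of the two proxying styles follows; the path and backend address are placeholders, not the actual zuul-web vhost configuration:

  # mod_rewrite proxying: the rewrite engine matches each request and hands it
  # to mod_proxy without persistent backend connections
  RewriteEngine On
  RewriteRule "^/api/(.*)$" "http://127.0.0.1:9000/api/$1" [P,L]

  # mod_proxy directly: ProxyPass can reuse a pool of backend connections,
  # which is the performance difference the manual text quoted above describes
  ProxyPass        "/api/" "http://127.0.0.1:9000/api/"
  ProxyPassReverse "/api/" "http://127.0.0.1:9000/api/"
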
19:31:48 #topic PyPI serving stale package indexes
19:32:07 This has become the major CI issue for all of our python based things in the last day or so
19:32:36 the general behavior is that a project using constraints pinned to a recent (and latest) package version fails because only versions prior to that latest version are present in the index served to pip
19:33:03 there has been a lot of confusion about this from people, from thinking that 404s are the problem to blaming afs
19:33:30 AFS is not involved and there are no 404s. We appear to be getting back a valid index, because pip says "here is the giant list of things I can install that doesn't include the version you want"
19:33:50 projects that don't use constraints (like zuul) may be installing prior versions of things occasionally since that won't error
19:34:02 but I expect they are mostly happy as a result (just something to keep in mind)
19:34:16 I was wondering if we could keep a local cache of u-c pkgs in images, just like we do for git repos
19:34:18 we have seen this happen for different packages across our mirror nodes
19:34:26 #link https://pip.pypa.io/en/latest/user_guide/#installing-from-local-packages
19:34:32 frickler: I think pip will still check pypi
19:34:42 though it may not error if pypi only has older versions
19:34:54 i wonder if other people are seeing it, but without such strict constraints as you say, don't notice
19:35:01 ianw: ya that is my hunch
19:35:05 running the pip download with master u-r gives me about 300M of cache, so that would sound feasible
19:35:39 IIUC the "--no-index --find-links" should assure a pure local install
19:35:43 the other thing I notice is that it seems to be openstack built packages
19:36:16 frickler: that becomes tricky because it means we'd have to update our images before openstack requirements can update constraints
19:36:45 frickler: I'd like to avoid that tight coupling if possible as it will become painful if image builds have separate problems but openstack needs a bug fix python lib
19:36:56 it's possible we could use an image cache while still checking indexes though
19:37:01 current release workflow is that as soon as a new release is uploaded to pypi a change is pushed to the requirements repo to bump that constraint entry
19:37:16 fungi: i noticed in scrollback you're running some grabbing scripts; i had one going in gra1 yesterday too with no luck catching an error
19:37:52 ianw: yep, we're evolving that to try to make it more pip-like in hopes of triggering maybe pip-specific behaviors from fastly
19:38:09 i'm currently struggling to mimic the compression behavior though
19:38:27 https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ version 4.1.0 is one that just failed a couple minutes ago at https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b5f/751040/4/check/openstack-tox-pep8/b5ff50c/job-output.txt
19:38:33 i've definitely managed to see it happen from fastly directly ... but i guess we can't rule out apache?
19:38:33 but if you load it 4.1.0 is there :/
19:38:45 wget manpage on the mirror server says it supports a --compression option but then wget itself complains about it being an unknown option
19:39:06 ianw: ya it's possible that apache is serving old content for some reason
19:39:15 pip sets cache-control: max-age=0
19:39:33 that should force apache to check with fastly that the content it has cached is current for every pip index request though
19:40:05 we could try disabling the pypi proxies in our base job
19:40:10 and see if things are still broken
19:40:39 I expect they will be because ya a few months back it seemed like we could reproduce from fastly occasionally
19:41:18 I also wonder if it could be a pip bug
19:41:22 aha, compression option confusion resolved... it will only be supported if wget is built with zlib, and ubuntu 19.04 is the earliest they added zlib to its build-deps
19:41:38 perhaps in pip's python version checking of packages it is excluding results for some reason
19:42:36 but I haven't been able to reproduce that either (was fiddling with pip installs locally against our mirrors and same version of pip here is fine with it)
19:43:43 the other thing I looked into was whether or not pip debug logging would help and I don't think it will
19:43:45 fungi: looks like curl has an option
19:43:56 at least for successful runs it doesn't seem to log index content
19:44:17 `pip --log /path/to/file install foo` is one way to set that up which we could add to say devstack and have good coverage
19:44:23 but I doubt it will help
19:44:48 fungi: "curl -V" says "Features: ... libz" for me
19:45:06 corvus: yep, i'm working on rewriting with curl instead of wget
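
A rough sketch of the kind of curl-based probe being put together above, assuming the goal is to mimic pip's compressed, cache-bypassing index request; the mirror URL and version come from the python-designateclient failure linked at 19:38:27, and the header set is only an approximation of what pip sends:

  # Poll a mirror's simple index roughly the way pip does (compressed transfer,
  # forced cache revalidation) and log whenever the expected release is missing.
  while :; do
    if ! curl -sS --compressed \
           -H 'Cache-Control: max-age=0' -H 'Accept: text/html' \
           https://mirror.mtl01.inap.opendev.org/pypi/simple/python-designateclient/ \
         | grep -q '4\.1\.0'; then
      echo "$(date -u) stale index: 4.1.0 missing"
    fi
    sleep 30
  done
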
19:46:03 not having issues with common third party packages does make me wonder if it is something with how we build and release packages
19:46:57 at least I've not seen anything like that fail. six, cryptography, coverage, eventlet, paramiko, etc why don't they do this too
19:47:47 other than six they all have releases within the last month too
19:48:49 I've largely been stumped. Maybe we should start to reach out to pypi even though what we've got is minimal
19:49:04 Anything else to add to this? or should we move on?
19:49:06 do all the things we've seen errors for have a common-ish release window?
19:49:21 fungi: the oldest I've seen is oslo.log August 26
19:49:31 fungi: but it has happened for newer packages too
19:49:36 okay, so the missing versions weren't all pushed on the same day
19:49:49 correct
19:50:36 yeah, reproducing will be 99% of the battle i'm sure
19:50:39 i suppose we could blow away apache's cache and restart it on all the mirror servers, just to rule that layer out
19:50:54 though that seems like an unfortunate measure
19:51:04 fungi: ya it's hard to only delete the pypi cache stuff
19:51:05 i was just looking if we could grep through, to see if we have something that looks like an old index there
19:51:22 ianw: we can but I think things are hashed in weird ways, it's doable just painful
19:51:29 that might clue us in if it *is* apache serving something old
19:52:29 ya that may be a good next step then; if we sort out the apache cache structure we might be able to do specific pruning
19:52:42 if we decide that is necessary
19:52:57 but do we see 100% failure now? I was assuming some jobs would still pass
19:53:01 also if the problem isn't related to our apache layer (likely) but happens to clear up around the same time as we reset everything, we can't really know if what we did was the resolution
19:53:09 looks like we can zcat the .data files in /var/cache/apache2/proxy
19:53:32 i'll poke a bit and see if i can find a smoking gun old index, for say python-designateclient in mtl-01 (the failure linked by clarkb)
19:53:33 frickler: correct, many jobs seem fine
19:53:47 also, not sure if you saw my comment earlier, I did see the same issue with local devstack w/o opendev mirror involved
19:53:55 frickler: it's just enough of them to be noticed and cause problems for developers. But if you look at it on an individual job basis most are passing I think
19:54:11 so I don't see how apache2 could be the cause
19:54:11 and we're sure that the python version metadata says the constrained version is appropriate for the interpreter version pip is running under?
19:54:14 frickler: oh I hadn't seen that. Good to confirm that pypi itself seems to exhibit it in some cases
19:54:19 frickler: agreed
19:54:34 fungi: as best as I can tell yes
19:54:49 just making sure it's not a case of openstack requirements suddenly leaking packages installed with newer interpreters into constraints entries for older interpreters which can't use them
19:55:00 fungi: in part because those restrictions are now old enough that the previous version that pip says is valid also has the same restriction on the interpreter
19:55:21 fungi: so it will say I can't install foo==1.2.3 but I can install foo==1.2.2 and they both have the same interpreter restriction in the index.html
19:55:21 k
19:55:40 however, double checking that would be a good idea
19:55:51 since our packages tend to be more restricted than say cryptography
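
One way to eyeball the interpreter restrictions discussed just above is to pull the simple index page pip parses and look at its data-requires-python attributes; oslo.log is just an example package here, and the same check could be pointed at a mirror instead of pypi.org:

  # Summarize the Requires-Python markers advertised in the simple index that
  # pip reads; oslo.log is only an example, any affected package works.
  curl -s https://pypi.org/simple/oslo.log/ \
    | grep -o 'data-requires-python="[^"]*"' | sort | uniq -c

If the constrained version and its predecessor carry the same marker, pip's interpreter filtering would not explain why only the newer one disappears.
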
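And for the cache-grepping idea raised around 19:51-19:53, a brute-force sketch; the cache path is the one mentioned above, but treat the layout as an assumption since mod_cache_disk stores entries under hashed names (the .data files appear to be gzip-compressed, hence zcat):

  # Look for cached copies of a package's index in apache's disk cache on a
  # mirror; entries are hashed, so just scan every .data file.
  find /var/cache/apache2/proxy -name '*.data' -print0 \
    | while IFS= read -r -d '' f; do
        if zcat -f "$f" 2>/dev/null | grep -q 'python-designateclient'; then
          echo "possible cached index: $f"
        fi
      done
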
19:56:24 anyway we're just about at time. I expect this to consume most of the rest of my time today.
19:56:35 we can coordinate further in #opendev and take it from there
19:56:41 #topic Open Discussion
19:56:50 Any other items that we want to call out before we run out of time?
19:57:52 the upcoming cinder volume maintenance in rax-dfw next month should no longer impact any of our servers
19:58:10 thank you for handling that
19:58:13 i went ahead and replaced or deleted all of them, with the exception of nb04 which is still pending deletion
19:58:23 ++ thanks fungi!
19:58:23 they've also cleaned up all our old error_deleting volumes now too
19:58:48 note however they've also warned us about an upcoming trove maintenance. databases for some services are going to be briefly unreachable
19:59:39 maybe double check our backups of those are in good shape, just in case?
19:59:48 #info provider maintenance 2020-09-30 01:00-05:00 utc involving ~5-minute outages for databases used by cacti, refstack, translate, translate-dev, wiki, wiki-dev
19:59:51 I think ianw did just check those but ya double checking them is a good idea
20:00:04 and we are at time.
20:00:06 Thank you everyone!
20:00:09 #endmeeting