19:01:11 <clarkb> #startmeeting infra
19:01:12 <openstack> Meeting started Tue Aug 20 19:01:11 2019 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:12 <ianw> o/
19:01:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:15 <openstack> The meeting name has been set to 'infra'
19:01:20 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2019-August/006452.html Our Agenda
19:01:41 <clarkb> thank you ianw for running the meeting last week
19:02:07 <clarkb> #topic Announcements
19:02:31 <clarkb> This wasn't on the agenda but you have a handful of hours left to vote on the openstack U naming poll if you would like to do so before it ends
19:03:06 <clarkb> #topic Actions from last meeting
19:03:12 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2019/infra.2019-08-13-19.01.txt minutes from last meeting
19:03:32 <clarkb> I didn't see any actions in the meeting notes
19:03:39 <clarkb> ianw: ^ anything to point out here before we move on?
19:03:56 <ianw> no it was fairly quiet
19:04:22 <clarkb> #topic Priority Efforts
19:04:34 <clarkb> #topic OpenDev
19:04:54 <clarkb> I've made some minor progress on having gitea time out requests.
19:05:00 <clarkb> #link https://github.com/cboylan/gitea/commit/d11d4dab34f769f3ba4589bb938a2dbd09ff8b3a
19:05:34 <clarkb> It turns out that gitea's http framework is not directly compatible with golang's http lib because they use a context type that doesn't conform to the standard
19:05:55 <clarkb> They do drag along the underlying http stdlib request's context though, so we can update that and get it to do things
19:06:44 <corvus> oh "neat"
19:06:47 <clarkb> However, the http.TimeoutHandler is a bit more robust than what I have there, and I'm not sure how much of that I can replicate within the macaron framework, so this might get clunky (probably largely due to my lack of go knowledge)
19:07:21 <clarkb> in any case that now builds and seems to work in the job we have. Next I need to exercise it to confirm that it times out long requests as expected
19:07:41 * diablo_rojo sneaks in late
19:08:07 <clarkb> corvus: ya my understanding of the correct way to implement that is to have a context type that implements the standard interface while adding the bits you want in addition to that
19:08:20 <clarkb> corvus: then you can use stdlib handlers like the timeout handler but also track application specific info
19:08:42 <clarkb> instead they track the request as an attribute of their context object, and that request object carries the standard stdlib context
19:09:50 <clarkb> I was also thinking that I might want to file an issue with them and share what I have so far and see if they can point me in a better direction if one exists
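For reference, a minimal sketch (plain net/http rather than gitea's macaron stack; the handler names and the 30s budget are illustrative assumptions) of the two stdlib pieces discussed above: wrapping a handler in http.TimeoutHandler, and attaching a deadline to the request's stdlib context so downstream code can observe it.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// slowHandler stands in for an expensive git operation; it stops early if the
// request's context is cancelled (by a timeout or client disconnect).
func slowHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(2 * time.Minute):
		w.Write([]byte("done"))
	case <-r.Context().Done():
		return
	}
}

func main() {
	base := http.HandlerFunc(slowHandler)

	// stdlib approach: TimeoutHandler replies 503 once the time budget is
	// spent and cancels the wrapped request's context.
	wrapped := http.TimeoutHandler(base, 30*time.Second, "request timed out")

	// the approach that fits a framework carrying its own context type:
	// swap in a request whose stdlib context has a deadline.
	withDeadline := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
		defer cancel()
		base.ServeHTTP(w, r.WithContext(ctx))
	})

	http.Handle("/timeout-handler/", wrapped)
	http.Handle("/context-deadline/", withDeadline)
	http.ListenAndServe(":8080", nil)
}
```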
19:10:33 <clarkb> any other opendev specific items to talk about before we move on?
19:10:47 <fungi> too bad github doesn't have a "wip" flag for pull requests
19:11:29 <corvus> the gitea folks use [wip] in the summary (much like we do)
19:11:35 <fungi> ahh, heh
19:12:08 <fungi> then yeah, seeking their input with what you already have sounds like a great option
19:14:24 <clarkb> #topic Update Config Management
19:15:04 <clarkb> corvus: should we talk about the intermediate registry here?
19:15:20 <clarkb> aiui the swift backend for the intermediate docker registry loses blobs
19:15:35 <clarkb> and then our jobs that rely on working docker images fail because they can't get the layer(s) they need
19:15:49 <corvus> oh yeah
19:16:06 <corvus> we're seeing this even in moderate use of the intermediate registry
19:16:16 <corvus> especially if there's a patch series
19:16:29 <corvus> (it could be happening with image promotion too, we may just not notice as much)
19:16:53 <corvus> the logical thing to do would be intensive debugging of the problem resulting in a patch to docker
19:17:08 <fungi> so the registry itself is losing track of the swift objects? i'd be surprised if swift is losing those itself
19:17:24 <corvus> afaict, the registry is inserting 0-byte objects in swift
19:17:30 <fungi> ouch
19:17:33 <corvus> no idea what's up with that
19:17:55 <corvus> the nice thing is it's easy to verify we're seeing the same problem (all zero byte objects have the same sha256sum value :)
19:18:15 <fungi> indeed
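As a small illustration of the "easy to verify" point above: every zero-byte blob hashes to the same well-known SHA-256 digest, so truncated layers in the registry's storage all show up under one digest. A minimal Go check (the constant is the standard digest of empty input):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// emptySHA256 is the SHA-256 digest of zero bytes; any truncated 0-byte blob
// in the registry backend will report exactly this value.
const emptySHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

func main() {
	sum := sha256.Sum256(nil) // digest of zero bytes
	fmt.Println(hex.EncodeToString(sum[:]) == emptySHA256) // prints: true
}
```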
19:18:40 <corvus> there's a lot of things we would like the registry to do which it doesn't -- authentication only for writes, pass-through to dockerhub, support for pass-through to multiple registries...
19:19:17 <corvus> so i'm seriously inclined to solve this by writing a new registry shadowing system from scratch
19:19:43 <clarkb> I remember pulp saying they support docker image registries as one of their archives
19:19:50 <clarkb> (that might be another option to look at)
19:20:07 <corvus> ooh, well remembered
19:20:23 <clarkb> https://pulpproject.org/ for those that may not be familiar
19:21:06 <clarkb> https://docs.pulpproject.org/plugins/crane/index.html is something pulp points at
19:21:16 <clarkb> that may be too simple for what we want though (crane not pulp)
19:21:26 <corvus> yeah, we do need to write to it
19:22:43 <clarkb> if anyone else knows of alternative options they are probably worth sharing. Like what does openshift run?
19:24:05 <corvus> i believe we learned that running openshift container registry does require running it in openshift
19:24:10 <clarkb> ah
19:24:46 <corvus> https://docs.openshift.com/container-platform/3.5/install_config/install/stand_alone_registry.html#install-config-installing-stand-alone-registry
19:25:16 <clarkb> Alright we probably won't solve that problem in the meeting. But wanted to call it out as it seems like the kind of problem where someone might already know of a preexisting solution (surely we aren't the only people that want to run a docker registry that reliably serves data)
19:25:44 <corvus> i mean, it'd be better for everyone if docker registry did support swift without flaws
19:25:50 <clarkb> ++
19:26:13 <corvus> but my enthusiasm for fixing that bug without addressing all the other things is limited
19:27:03 <corvus> (also, to be fair, i'm not sure we've eliminated the possibility that skopeo is to blame)
19:27:15 <corvus> (it seems highly unlikely though)
19:27:46 <corvus> i don't think the logging available is adequate
19:29:19 <clarkb> Anything else on this topic or should we move on?
19:30:16 <clarkb> sounds like that is it.
19:30:20 <clarkb> #topic Storyboard
19:30:45 <clarkb> diablo_rojo: fungi: you were both distracted with meetings all last week, but anything to bring up re storyboard
19:31:07 <fungi> i got nuthin'
19:32:06 <diablo_rojo> I remembered to bother mordred twice about db stuff but I don't think he had time to do anything yet
19:32:11 <diablo_rojo> mordred, shall I keep poking?
19:32:46 <diablo_rojo> I didn't have anything else
19:33:06 <clarkb> #topic General Topics
19:33:20 <clarkb> #link https://etherpad.openstack.org/p/201808-infra-server-upgrades-and-cleanup
19:33:29 <clarkb> job logs are now in swift
19:34:05 <clarkb> I think that leaves tarballs on static.o.o ?
19:34:20 <clarkb> corvus had set up afs based tarballs.opendev.org
19:34:22 <fungi> well, it also leaves a bunch of our static sites content
19:34:30 <fungi> security, governance, releases, and so on
19:34:32 <clarkb> fungi: oh I thought that was all on files.openstack.org now
19:34:44 <clarkb> DNS says I'm wrong
19:34:50 <fungi> governance.openstack.org is an alias for static.openstack.org.
19:34:57 <fungi> et cetera
19:35:29 <fungi> also we still have some logs on static.o.o until they age out
19:35:38 <clarkb> /srv/static has ci  election  governance  logs  lost+found  mirror  old-docs-draft  old-pypi  release  release.new  releases  reviewday  security  service-types  sigs  specs  status  tarballs  tc  uc
19:36:07 <clarkb> fungi: ya ~4 weeks iirc
19:36:40 <clarkb> One option is to upgrade the server with a much simpler lvm setup as we'll not need to worry about massive impact to job results
19:36:51 <clarkb> or we can try to push that content onto files.o.o instead
19:37:13 <fungi> some of those are dead (mirror, old-pypi, old-docs-draft, ...), some are just redirects (ci), some are mapped into subtrees of the same vhosts (governance, sigs, tc, uc)
19:37:34 <fungi> so the list looks more daunting than it really is
19:37:59 <ianw> i don't mind taking an action item to audit it all and report back with a list of work?
19:38:05 <fungi> also, yeah, we can pvmove everything left onto a single volume, for starters, and then swap it for a smaller volume if we want
19:38:06 <clarkb> ianw: that would be great
19:38:21 <fungi> thanks ianw!
19:38:32 <corvus> also, i think all the mechanics are worked out, so moving them to afs probably wouldn't be too difficult
19:38:58 <clarkb> #action ianw audit static.openstack.org webserver content and create a list of work to either get off the server or upgrade the server now that job logs are not hosted there (or won't be in 4 weeks)
19:39:33 <ianw> ++ that was what i thought when i had a quick poke a couple of weeks ago, but will take a more systematic look
19:39:34 <fungi> right, the fiddly bits will be redirecting or mapping openstack tarballs since they're published without a namespace prefix, doing something akin to root-marker with stuff like the governance site which is published by stitching together multiple repos...
19:40:36 <clarkb> The other related item on this is getting wiki-dev working
19:40:47 <clarkb> fungi: ^ I doubt you had much time for that last week. Anything to point out there?
19:40:52 <fungi> ianw: if you want to start by just tossing it all into an etherpad i'm happy to flag some stuff in the list too
19:41:07 <fungi> yeah, there's some wiki-dev updates actually
19:41:46 <fungi> first, i've updated the cname for wiki-dev.openstack.org to point to the new wiki-dev03 server for ease of testing. it's not like the old one was in perfect shape either
19:42:10 <fungi> (also because i got tired of editing my /etc/hosts on multiple clients)
19:42:28 <fungi> i thought openid was broken but it's actually working better than it did on the old wiki-dev01
19:43:14 <fungi> the reason i didn't realize that is that up in the top-right corner of https://wiki-dev.openstack.org/ the drop-down is doing language selection instead of login
19:43:31 <fungi> so need to figure out how to get that to point to the right thing
19:43:44 <clarkb> oh a theming bug I guess?
19:43:48 <fungi> #link https://wiki-dev.openstack.org/wiki/Special:OpenIDLogin
19:43:59 <fungi> well, so it may be related to the next thing
19:44:25 <fungi> #link https://review.opendev.org/675713 Put image data in a parallel path to source code
19:44:53 <fungi> the old wiki-dev deployment was puppeting the installation onto a cinder volume we'd attached
19:45:21 <fungi> and then having mediawiki store its static content (image uploads and so on) into subtrees of that installation path
19:45:26 <ianw> ahh, was going to say i had a few broken images, so that fixes that?
19:45:38 <fungi> well, it's the first step in fixing it, yeah
19:45:56 <fungi> we were essentially mixing configuration-managed stuff with mw-managed stuff in the same tree
19:47:01 <fungi> so with the proposed change i want mediawiki to start managing its persistent file content into a separate path (which we can put in a cinder volume) and not have to cart around cruft from configuration-managed bits previously on other machines
19:47:14 <clarkb> sounds like a great idea
19:47:29 <fungi> i've already got the new volume in place and formatted/mounted
19:47:35 <fungi> on the new wiki-dev03
19:47:53 <fungi> so once that change lands i'll rsync over the images tree and whatever else needs to go there
19:48:25 <fungi> and that will allow us to completely blow away the config-managed stuff any time we want and redeploy without risking loss of precious files
19:48:43 <fungi> also a related change...
19:49:00 <fungi> #link https://review.opendev.org/675733 Update to 1.28.x branch
19:49:33 <fungi> we'd already manually upgraded production wiki.o.o to 1.28
19:49:44 <clarkb> seems like config management should reflect that then
19:49:46 <fungi> so this brings the wiki-dev configuration management in line with production
19:50:03 <fungi> which also both badly need to be upgraded some more, but...
19:50:08 <clarkb> as a timecheck we have about 10 minutes left and a few more items on the agenda. Anything else urgent on this subject before we move on?
19:50:23 <fungi> having -dev at least on the same version will facilitate that
19:50:25 <fungi> yeah, we can move on
19:50:34 <fungi> anyway, reviews of those two changes most welcome
19:50:42 <clarkb> We've been having flaky afs publishing from mirror-update.opendev.org
19:50:58 <clarkb> ianw it occurred to me that that host may be running the bionic openafs build which we know is broken?
19:51:06 <clarkb> ianw: maybe the proper fix there is to install from our ppa if we haven't already?
19:52:05 <ianw> it's worth checking but i think it is using the latest
19:52:32 <clarkb> ok for those that might not be aware we've had vos releases fail or run for weeks at a time resulting in slow updates to the rsync'd mirrors we have
19:52:45 <clarkb> ianw and I were working on it yesterday. I need to catch up on that after the meeting
19:53:00 <ianw> ii  openafs-client                       1.8.3-1~bionic
19:53:15 <fungi> not the broken prerelease then
19:53:51 <clarkb> rules that out then
19:54:00 <clarkb> Other items really quickly:
19:54:03 <clarkb> #link https://review.opendev.org/#/c/675537 New backup server
19:54:04 <ianw> i think we keep an eye on it at this point; it might have been around the time i was restarting to get some audit data
19:54:20 <clarkb> ianw ^ has a new backup server ready to go if we can get reviews on that
19:54:36 <ianw> yeah, just wanted eyes because the semantics are non-regular with that host being kept out of ansible, as we discussed
19:54:53 <clarkb> ianw also updated dib with what we think will fix limestone ipv4 issues
19:54:59 <clarkb> #link https://review.opendev.org/#/c/677410/ DIB fix for limestone ipv4 issues
19:55:11 <ianw> i can do a release and rebuild in a bit for that
19:55:19 <clarkb> ianw: note there was a tripleo job failing on that change
19:55:25 <clarkb> and apparently it was a known failure they had yet to fix
19:55:31 <clarkb> we might consider making that job non voting?
19:55:57 <ianw> if it doesn't work, there are options, but i think they'll require us updating how glean writes config in one way or another; see the comments in that change
19:56:10 <clarkb> And finally feel free to start adding ideas for PTG topics on the etherpad
19:56:12 <clarkb> #link https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019
19:56:24 <ianw> (other than, you know, intense debugging of legacy areas of networkmanager)
19:56:25 <clarkb> I expect planning for that will start in earnest much closer to the event
19:56:51 <clarkb> ianw: it is odd that the bug has been around for so long too
19:57:02 <clarkb> it's clearly a fairly major problem they've had for a long time
19:57:15 <clarkb> I guess if the kernel is managing the interface NM doesn't want to step on its toes
19:57:24 <clarkb> #topic Open Discussion
19:57:29 <clarkb> we have about 2.5 minutes for anything else
19:57:29 <ianw> speaking of the afs thing before
19:57:32 <ianw> #link https://review.opendev.org/#/q/status:open+topic:openafs-reccomends
19:58:17 <shadiakiki> o/  hey there. I had sent a few emails on the mailing list about server sizing
19:58:18 <ianw> that was related to us not installing the correct openafs packages on new servers
19:59:01 <clarkb> shadiakiki: hello. I've been trying to keep up with responses (as has ianw looks like)
19:59:02 <shadiakiki> Just want to ask if it's a subject that's of interest for you in terms of cost savings
19:59:37 <clarkb> shadiakiki: as ianw pointed out I think we tend to end up undersizing servers more than we oversize them
19:59:39 <shadiakiki> Thanks Clark. You guys have been very responsive. It's fantastic
19:59:56 <ianw> shadiakiki: hello! for mine, i'd say we're always open to contribution :)
20:00:31 <clarkb> our gitea backends and zuul executors could all be bigger probably. Thinking out loud here, it might be more useful for us to see where we should run larger instances
20:00:31 <ianw> in this respect, i think that with our new trend towards containerising that might be the best place to start looking at this
20:00:45 <shadiakiki> Awesome! I founded my startup a few weeks ago to solve the issue of sizing for large infra. It'll be great if I can communicate with you guys from time to time
20:01:00 <clarkb> and we are at time
20:01:03 <clarkb> Thank you everyone
20:01:08 <fungi> it's good to note that while we try to be careful and cost-wise in how we use donated server resources, we're also a very small team with limited time to invest in complex solutions if the return on investment is minor
20:01:13 <fungi> thanks clarkb!
20:01:14 <clarkb> feel free to continue discussion in #openstack-infra
20:01:17 <clarkb> #endmeeting