19:01:06 #startmeeting infra
19:01:07 Meeting started Tue Nov 3 19:01:06 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 o/
19:01:11 The meeting name has been set to 'infra'
19:01:19 link http://lists.opendev.org/pipermail/service-discuss/2020-November/000123.html Our Agenda
19:01:22 #link http://lists.opendev.org/pipermail/service-discuss/2020-November/000123.html Our Agenda
19:01:32 #topic Announcements
19:01:40 Wallaby cycle signing key has been activated https://review.opendev.org/760364
19:01:48 Please sign if you haven't yet https://docs.opendev.org/opendev/system-config/latest/signing.html
19:02:01 this is fungi's semi-annual reminder that we should verify and sign the contents of that key
19:02:11 fungi: ^ anything else to add on that topic?
19:02:53 not really, it's in place now
19:03:17 The other announcement I had was that much of the world has recently ended, or is soon going to end/start, summer time
19:03:24 eventually i'd like to look into some opendev-specific signing keys, but haven't had time to plan how we'll handle the e-mail address yet
19:03:42 double check your meetings against your local timezone as things may be offset by an hour from where they were the last ~6 months
19:04:58 #topic Actions from last meeting
19:05:05 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-10-13-19.01.txt minutes from last meeting
19:05:26 I don't see any recorded actions, but it has been a while. Was there anything from previous meetings we should call out quickly?
19:06:17 nothing comes to mind
19:06:42 #topic Priority Efforts
19:06:50 #topic Update Config Management
19:07:10 One thing to call out here is that docker's new rate limiting has gone into effect (or should have)
19:07:21 I've yet to see catastrophic results from that for our jobs (and zuul's)
19:07:25 but we should keep an eye on it.
19:07:55 If things do get really sad I've pushed up changes that will stop us funneling traffic through our caching proxies, which will diversify the source addresses and should reduce the impact of the rate limiting
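
For keeping an eye on those limits, a minimal sketch of the anonymous rate-limit check Docker documents; the ratelimitpreview/test repository and the ratelimit-* response headers come from Docker's download-rate-limit docs, and this is an illustration rather than anything we currently run:

    # Sketch: query Docker Hub's documented rate-limit headers with an
    # anonymous pull token. The ratelimitpreview/test repo and the header
    # names come from Docker's download-rate-limit docs; adjust if they change.
    import requests

    TOKEN_URL = ("https://auth.docker.io/token"
                 "?service=registry.docker.io"
                 "&scope=repository:ratelimitpreview/test:pull")
    MANIFEST_URL = ("https://registry-1.docker.io/v2/"
                    "ratelimitpreview/test/manifests/latest")

    token = requests.get(TOKEN_URL, timeout=30).json()["token"]
    # Docker's docs use a HEAD request here so the check itself should not
    # count as a pull; worth verifying that still holds before automating it.
    resp = requests.head(MANIFEST_URL,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=30)
    print("ratelimit-limit:    ", resp.headers.get("ratelimit-limit"))
    print("ratelimit-remaining:", resp.headers.get("ratelimit-remaining"))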
19:08:21 frickler also reached out to them about their open source project support and they will give us rate-limit-free images, but we have to agree to a bunch of terms which we may not be super thrilled about
19:08:43 in particular one that worries me is that we can't use third-party container tools? something like that
19:09:08 fungi: do you think we should reach out to jbryce about those terms and see what he thinks about them and go from there?
19:09:32 (I mean other opinions are good too but jbryce tends to have a good grasp on those types of use agreements)
19:09:36 well, it was more like we can't imply that unofficial tools are supported for retrieving and running those images, it seemed like
19:10:00 i.e. podman, etc. is what that means?
19:10:06 right, but we actively use skopeo for our image jobs ...
19:10:12 maybe we should reply to docker and ask what they really mean by all that
19:10:19 also it's specifically about the images we publish, not about how many images we can retrieve which are published by others
19:10:25 ianw: that was how I read it and ya, clarification on that point may be worthwhile too
19:12:12 for those who weren't forwarded a copy, here's the specific requirement: "...the Publisher agrees to...Document that Docker Engine or Docker Desktop are required to run their whitelisted images"
19:13:03 the good news so far is that our volume seems to be low enough that we haven't hit immediate problems. And fungi and I can see if jbryce has any specific concerns about their agreement (we can have our concerns too)?
19:13:16 ya I wasn't sure whether the mail should be considered confidential, but I think I could paste it into an etherpad to let us agree on a reply?
19:13:21 the other requirements were mainly about participating in press releases and marketing materials for docker inc
19:13:51 which while maybe distasteful are probably not as hard to agree to do if we decide this is important
19:14:29 ya and may not even be the worst thing if we end up talking about how we build images and do the speculative builds and all that
19:14:48 it might also be interesting to find out whether base images like python+ubuntu might already be under the free program
19:15:12 frickler: that's a good point too because if our base images aren't then we are only solving half the problem
19:15:16 I wonder if there is a way to check
19:15:20 which might imply that we don't have a lot of issues anyway, yes
19:15:37 try to retrieve it 101 times in an afternoon? ;)
19:16:12 we could ask about that in our reply, too. do we have a (nearly) complete list of namespaces we use images from?
19:16:36 frickler: you can probably do a search on codesearch for Dockerfile and FROM lines to get a representative sample?
19:16:45 also, do we have a list of "opendev" namespaces? I know about zuul only
19:16:54 we have opendevorg and zuul
19:16:56 opendevorg
19:17:00 well zuul has zuul and opendev has opendevorg
19:17:33 i think "we" "have" openstack too
19:17:37 do we talk for both or would we let zuul do a different contact
19:18:00 frickler: for now it is probably best to stick to opendevorg and figure out what the rules are, then we can look at expanding from there?
19:18:10 clarkb: ++
19:18:19 zuul may not be comfortable with all the same rules we may be comfortable with (or vice versa). Starting small seems like a good thing
19:18:54 kolla also publishes images to their own namespace i think, loci may as well?
19:19:04 but yeah, i would start with one
19:19:47 alright anything else on this topic or should we move on?
19:21:04 possible we could publish our images to more than one registry and then consume from one which isn't dockerhub, though that may encounter similar rate limits
19:21:13 https://etherpad.opendev.org/p/EAfLWowNY8N96APS1XXM
19:21:24 yes I seem to recall tripleo ruled out quay as a quick fix because they have rate limits too
19:21:40 I think figuring out how we can run a caching proxy of some sort would still be great (possibly a version of zuul-registry)
19:21:42 sadly that doesn't include the embedded links, I'll add those later, likely tomorrow
19:21:52 frickler: thanks
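
Referring back to the namespace question above, a rough local stand-in for the codesearch suggestion: scan a tree of checked-out repos for Dockerfile FROM lines and tally the namespaces they reference. The parsing details (for example mapping bare image names to Docker Hub's library namespace) are assumptions for illustration only:

    # Sketch: tally the registry namespaces referenced by FROM lines in
    # Dockerfiles under a checked-out tree of repos. A rough approximation of
    # the codesearch query; bare image names are counted as Docker Hub's
    # "library" namespace, and multi-stage build stage names will show up as
    # false positives.
    import pathlib
    import re
    from collections import Counter

    FROM_RE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE)

    def namespaces(root="."):
        counts = Counter()
        for path in pathlib.Path(root).rglob("Dockerfile*"):
            for line in path.read_text(errors="ignore").splitlines():
                match = FROM_RE.match(line)
                if not match or match.group(1).lower() == "scratch":
                    continue
                ref = match.group(1).split("@")[0]      # drop any digest
                parts = ref.split("/")
                parts[-1] = parts[-1].split(":")[0]     # drop any :tag
                counts[parts[0] if len(parts) > 1 else "library"] += 1
        return counts

    if __name__ == "__main__":
        for namespace, count in namespaces().most_common():
            print(f"{count:4d}  {namespace}")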
19:22:50 #topic OpenDev
19:23:11 The work to upgrade Gerrit continues. I announced to service-announce@lists.opendev.org that this is going to happen November 20-22
19:23:23 fungi and I will be driving that but others are more than welcome to help out too :)
19:23:47 on the prep and testing side of things we need to spin review-test back up on 2.13 with an up to date prod state and re-upgrade it
19:23:49 yep, the more the merrier
19:24:06 we're also investigating mnaser's idea for using a surrogate gerrit on a performant vexxhost flavor
19:24:17 but I think we'll test that from a 2.13 review-test clone
19:24:31 fungi: do you think that is something we can start in the next day or two
19:24:33 ?
19:24:58 yeah, i was hoping to have time for it today, but i'm still catching my breath and catching up after the past few weeks
19:25:15 cool I'm hoping for time tomorrow at this point myself
19:25:48 ianw: any new news on the jeepyb side of things where the db access will go away?
19:27:10 https://review.opendev.org/758595 is an unrelated bug in jeepyb that I caught during upgrade testing if people have time for that one
19:27:41 clarkb: sorry no didn't get to that yet, although we poked at some api bits
19:28:01 no worries, I think we're all getting back into the swing of things after an eventful couple of weeks
19:28:13 it seems what we need is in later gerrits (ability to look up ids and emails)
19:28:42 but not in current gerrit, which makes it a bit annoying that i guess we can't pre-deploy things
19:28:44 oh right the api exposes that. I think the thing we need to check next on that is what perms are required to do that, and we can look at that once review-test is upgraded again
19:29:55 we can definitely use review-test to dig into that more hopefully soon
19:30:15 anything else on the gerrit upgrade? or other opendev related topics?
19:31:36 we can probably discuss outside of the meeting, but i did just see that we got an email from the person presumably crawling gerrit and causing a few slowdowns recently
19:31:41 yeah, i figured we'll leave review-test up again after the upgrade test for developing things like that against more easily
19:32:01 ianw: yeah, they replied to the ml too, and i've responded to them on-list
19:32:19 i dropped you from the explicit cc since you read the ml
19:32:45 oh, ok haven't got that far yet :)
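
On the jeepyb account-lookup point above, a minimal sketch of what that could look like against the accounts REST endpoint on the upgraded review-test; the host, credentials, and especially the required permissions are exactly the unknowns still to be verified, so treat them all as placeholders:

    # Sketch: look up a Gerrit account by email through the REST API, the
    # capability newer Gerrit exposes that jeepyb could use instead of direct
    # DB access. Host and credentials are placeholders, and which permissions
    # the calling account needs is what still has to be checked on review-test
    # after the re-upgrade.
    import json
    import requests

    GERRIT = "https://review-test.opendev.org"      # assumed test host
    AUTH = ("some-bot", "its-http-password")        # placeholder credentials

    def account_by_email(email):
        # Authenticated endpoints live under /a/; Gerrit prefixes JSON
        # responses with the )]}' XSSI guard, which has to be stripped.
        resp = requests.get(f"{GERRIT}/a/accounts/",
                            params={"q": f"email:{email}", "o": "DETAILS"},
                            auth=AUTH, timeout=30)
        resp.raise_for_status()
        return json.loads(resp.text.split("\n", 1)[1])

    print(account_by_email("someone@example.com"))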
19:33:23 #topic General Topics
19:33:53 Quick note that I intend to put together a PTG followup email in the near future too. Just many things to catch up on and that has been lagging
19:35:21 #topic Meetpad Access issues from China
19:35:27 frickler: you added this one so feel free to jump in
19:35:42 It is my understanding that it appears either corporate networks or the great firewall are blocking access to meetpad
19:35:51 this caused neutron (and possibly others) to fall back to zoom
19:36:02 yeah, so I just saw that one person had difficulty joining the neutron meetpad
19:36:28 and I was thinking that it would be good if we could solve that issue
19:36:43 any idea which it was? businesses/isps blocking web-rtc at their network borders, or the national firewall?
19:36:49 but it would likely need cooperation with someone on the "inside"
19:36:56 can we characterize that issue? (yes what fungi said)
19:37:14 he said that he could only listen to audio
19:37:20 (was it even webrtc being blocked or...)
19:37:55 is there more than one report?
19:37:58 yes, i would say first we should see if we can find someone who is on a "normal" (not corporate) network in mainland china who is able to access meetpad successfully (if possible), and then try to figure out what's different for people who can't
19:38:08 i should say not corporate and not vpn
19:38:42 there are also people outside china who can't seem to get meetpad to work for various reasons, so i would hate to imply that it's a "china problem"
19:39:39 maybe we can see if horace has time to do a test call with us?
19:39:42 then work from there?
19:40:49 ftr there were also people having issues with zoom
19:41:20 I'll try to reach out to horace later today local time (horace's morning) and see if that is something we can test out
19:41:25 and even some for whom meetpad seemed to work better than zoom, so not a general issue in one direction
19:41:59 see the feedback etherpad https://etherpad.opendev.org/p/October2020-PTG-Feedback
19:42:09 yes, meetpad works marginally better for me than zoom's webclient (i'm not brave enough nor foolhardy enough to try zoom's binary clients)
19:43:12 anything else on this subject? sounds like we need to gather more data
19:43:29 another, likely unrelated issue, was that meetpad was dropping the etherpad window at times when someone with video enabled was talking
19:43:51 i also had a number of meetpad sessions where the embedded etherpad stayed up the whole time, so i still am not quite sure what sometimes causes it to keep getting replaced by camera streams
19:44:39 though yeah, maybe it's that those were sessions where nobody turned their cameras on
19:44:49 i didn't consider that possibility
19:44:55 may be worth filing an upstream bug on that one
19:45:13 we're also behind on the js client; upstream hasn't merged my pr
19:45:16 I did briefly look at the js when it was happening for me and I couldn't figure it out
19:45:29 corvus: ah, maybe we should rebase and deploy a new image and see if it persists?
19:45:35 maybe it's worth a rebase/update before the next event
19:45:39 ++
19:46:00 sometime in the next few weeks might be good for that matter
19:46:20 something happening in a few weeks?
19:46:36 fungi is gonna use meetpad for socially distant thanksgiving?
19:46:50 nah, just figure that gives us lots of time to work out any new issues
19:47:24 ah yep. well before the next event would be best i agree
19:47:33 ok we've got ~13 minutes left and a couple topics I wanted to bring up. We can swing back around to this if we have time
19:47:39 #topic Bup and Borg Backups
19:47:53 ianw: I think you've made progress on this but wanted to check in on it to be sure
19:48:17 there's https://review.opendev.org/#/c/760497/ to bring in the second borg backup server
19:48:30 that should be ready to go, the server is up with storage attached
19:49:08 so basically i'd like to get ethercalc backing up to both borg servers, then stage in more servers, until the point all are borg-ing, then we can stop bup
19:49:08 any changes yet to add the fuse support deps?
19:49:19 todo is the fuse bits
19:49:19 k
19:49:19 that's all :)
19:49:27 thank you for pushing on that
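
On the outstanding fuse bits above: the fuse dependencies matter mainly so that archives can be mounted for restore spot-checks. A minimal sketch of such a check, assuming borg is installed with its fuse support and using placeholder repository and mountpoint paths:

    # Sketch: mount the newest borg archive read-only to spot-check a backup,
    # which is the part that needs the fuse/llfuse dependencies installed.
    # Repository and mountpoint paths are placeholders.
    import subprocess

    REPO = "borg@backup02.example.opendev.org:/opt/backups/ethercalc"  # placeholder
    MOUNTPOINT = "/mnt/borg-verify"

    # List archive names, newest last.
    archives = subprocess.run(
        ["borg", "list", "--short", REPO],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    latest = archives[-1]

    # "borg mount" is what requires fuse; clean up with "borg umount".
    subprocess.run(["borg", "mount", f"{REPO}::{latest}", MOUNTPOINT], check=True)
    print(f"mounted {latest} at {MOUNTPOINT}; run 'borg umount {MOUNTPOINT}' when done")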
19:49:38 #topic Long term plans for running openstackid.org
19:49:53 Among the recent fires was that openstackid melted down during the virtual summit
19:50:15 it turned out there were caching problems which caused basically all the requests to retry auth and that caused openstackid to break
19:50:51 we were asked to scale up openstackid's deployment, which fungi and I did. What we discovered doing that is that if we had to rebuild or redeploy the service we wouldn't be able to do so successfully without intervention from the foundation sysadmins due to firewalls
19:51:23 I'd like to work with them to sort out what the best options are for hosting the service and it is feeling like we may not be it. But I want to see if others have strong feelings
19:51:48 they did mention they have docker image stuff now so we could convert them to our ansible + docker compose stuff if we wanted to keep running it
19:54:14 for background, we stood up the openstackid.org deployment initially because there was a desire from the oif (then osf) for us to switch to using it, and we said that for such a change to even be on the table we'd need it to be run within our infrastructure and processes. in the years since, it's become clear that if we do integrate it in some way it will be as an identity option for our users so not
19:54:16 something we need to retain control over
19:55:17 currently i think translate.openstack.org, refstack.openstack.org and survey.openstack.org are the only services we operate which rely on it for authentication
19:56:19 of those, two can probably go away (translate is running abandonware, and survey is barely used), the other could perhaps also be handed off to the oif
19:56:24 ya no decisions made yet, just wanted to call that out as a thing that is going on
19:56:38 we are just about at time now so I'll open it up to any other items really quick
19:56:41 #topic Open Discussion
19:57:52 we had a spate of ethercalc crashes over the weekend. i narrowed it down to a corrupt/broken spreadsheet
19:58:14 i'll not link it here, but in short any client pulling up that spreadsheet will cause the service to crash
19:58:27 can/should we delete it?
19:58:44 (the ethercalc which must not be named)
19:58:46 and the webclient helpfully keeps retrying to access it for as long as you have the tab/window open, so it re-crashes the service again as soon as you start it back up
19:59:04 if we are able to delete it that seems like a reasonable thing to do
19:59:12 yeah, i looked into how to do deletes, there's a rest api and the documentation for it mentions a method to delete a "room"
19:59:15 it's a redis data store so not sure what that looks like if there isn't an api for it
19:59:46 i'm still not quite sure how you auth to the api, suspect it might work like etherpad's
20:00:11 and now we are at time
20:00:17 fungi: yup they are pretty similar that way iirc
20:00:21 thank you everyone!
20:00:23 #endmeeting
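
As a possible follow-up to the ethercalc discussion above, a sketch of deleting a room through the REST API mentioned there, run locally on the ethercalc server; the DELETE /_/<room> route, the local port, and the (apparently still unresolved) auth handling are all assumptions, and the room name is a placeholder:

    # Sketch: delete an ethercalc "room" via its REST API, run locally on the
    # ethercalc server. The DELETE /_/<room> route is the one the API docs
    # appear to describe; how (or whether) it wants auth is still an open
    # question per the discussion above. Port and room name are placeholders.
    import requests

    ETHERCALC = "http://localhost:8000"
    ROOM = "the-spreadsheet-which-must-not-be-named"

    resp = requests.delete(f"{ETHERCALC}/_/{ROOM}", timeout=30)
    print(resp.status_code, resp.text)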