19:01:08 #startmeeting infra
19:01:09 Meeting started Tue Jun 26 19:01:08 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:12 The meeting name has been set to 'infra'
19:01:22 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:23 o/
19:02:00 The agenda hasn't been updated for today, as I'm somewhat underprepared chasing a bunch of stuff this morning as well as watching WC. That said, it's not too terribly wrong
19:02:16 #topic Announcements
19:02:56 As mentioned, world cup is happening. Big game right now :)
19:03:09 Other than that, don't forget to take the openstack user survey if you run or use an openstack cloud
19:03:25 it provides valuable feedback to openstack on what is important and what can be improved
19:03:38 * mordred can't take the survey - too much football on tv
19:04:15 #topic Specs Approval
19:04:44 Monty's spec for future config mgmt is up and no longer WIP. I don't think it is ready for approval, but we should all review it this week if we can find time
19:04:46 #link https://review.openstack.org/#/c/565550/ config mgmt and containers
19:04:47 patch 565550 - openstack-infra/infra-specs - Update config management for the Infra Control Plane
19:05:08 thanks mordred!
19:05:44 notmyname: ^ can you clear out patchbot please?
19:05:53 If possible I'd like next week's meeting to have a detailed discussion about it if necessary, and we can work toward merging that spec the week after (or next week if everyone agrees with what monty has already written)
19:05:56 \o/
19:05:56 tl;dr please review :)
19:06:33 though fungi and I will be traveling the week after
19:06:34 I tried to list all the things - but to leave some bits open for impl - definitely feedback welcome if I missed something or am just dumb
19:06:51 yup
19:07:17 * fungi is disappearing a lot next month
19:08:09 as for working on this at the PTG, I've heard rumor we'll have ~3 days of room space as well as a day or two of general help room
19:08:11 * mordred throws paint on fungi to try to defeat the invisibility
19:08:18 I think that will work well for digging into this in denver
19:08:40 yah. if people are happy with the direction - there are several tasks I think I should be able to knock out by denver
19:08:59 i don't expect us to have any dedicated zuul time at this ptg, so those of us with multiple hats should be able to focus on this too
19:09:04 I should get an etherpad for the PTG going so that we can start coordinating what happens there vs what happens prior
19:09:11 * clarkb makes a note to get that going today
19:09:36 GOL! (sorry)
19:10:01 note that there will, again, be a cross-project "helproom" for a couple days where we can still help people with questions about zuul job configuration and the like
19:10:03 (though getting this done unlocks a bunch of neato zuul stuff, like running zuul from containers and more cd of zuul)
19:10:07 fungi: ++
19:10:07 clarkb: I think we're watching different games
19:10:53 the futbol is multi-threaded
19:11:01 mordred: possibly, I was leaving that info out for people that might be avoiding spoilers
19:11:21 corvus: yup, I think this opens a few exciting followup threads once we get things rolling
19:11:26 highly parallel gol processing
19:11:51 #topic Priority Efforts
19:12:03 Exciting updates have happened on the storyboard database query side of things
19:12:14 you might notice that storyboard boards are much quicker now thanks to dhellmann
19:12:19 mostly thanks to dhellmann
19:12:24 thanks again dhellmann!
19:12:43 it's amazing how much adding an index on a column can speed stuff up ;)
19:12:51 knowing we were missing an index made that pretty easy to figure out
19:13:03 * mordred hands dhellmann a fluffy bunny rabbit
19:13:12 * dhellmann gets out the stew pot
19:13:21 mmm. rabbit stew
19:13:24 mmm
19:13:44 fungi: any other storyboard related items worth bringing up?
19:13:56 we're still struggling a bit on the name-based project urls... something's not quite right with decoding url-escaped slashes in project names
19:14:25 fungi: does it work without apache in front of it? apache likes to mangle those
19:14:27 see discussion in #storyboard for current status
19:14:40 yeah, the pecan-based dev server works fine
19:14:52 but when running the api server through apache it's a problem
19:15:38 * mordred is very excited about the future of name-based urls
19:15:46 i spent a good chunk of last night experimenting but couldn't come up with a workaround, though i also don't fully grasp the api routing in sb
19:16:25 so adding debugging wasn't easy
19:16:51 in other news, vitrage migrated all their deliverables from lp to sb on friday and that seems to be going well
19:17:32 oh, also important, ianw noticed that the occasional rash of 500 internal server errors from write operations may be related to rabbitmq disconnects causing the socket to it to get blocked
19:17:50 fungi: rabbitmq runs on the same instance though, right? odd for there to be disconnects
19:18:00 at first glance at the code, it looked like it was trying to handle that
19:18:02 yeah, unless it gets restarted or something i suppose
19:18:18 but something in pika seemed to get itself stuck, if i had to guess
19:19:04 * ianw pats myself on the back for such an excellent, helpful bug report :)
19:19:43 it's more detail than we've gathered on the situation to date
19:19:58 so thanks for the insightful observation
19:20:56 anything else re storyboard?
19:20:58 no other storyboard news afaik
19:21:35 Ok, and to follow up on the config management proposed changes, please go review https://review.openstack.org/#/c/565550/ and we'll catch up on that in more detail next week.
19:21:42 #topic General Topics
19:21:58 Why don't we start with a packethost cloud update. We've set max servers to 0 due to the mtu problem.
19:22:17 I've got a couple changes up to address that in zuul-jobs and devstack-gate for our network overlay setup
19:22:31 where "the mtu problem" is simply that the interface mtu on instances there is 1450
19:22:45 right, it is smaller than we assume (1500) in a few places
19:22:56 (same as in our linaro arm64 cloud, too)
19:23:08 #link https://review.openstack.org/578146 handle small mtu in devstack-gate
19:23:24 #link https://review.openstack.org/#/c/578153/ handle small mtu in zuul-jobs
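(For context, a minimal sketch of the kind of MTU-aware sizing those reviews are about; this is an illustration only, not what 578146/578153 literally implement, and the interface lookup and the 50-byte VXLAN overhead are assumptions:)

    #!/bin/bash
    # Sketch: derive an overlay MTU from the instance's actual MTU rather
    # than assuming 1500. Interface discovery and the VXLAN overhead value
    # are illustrative assumptions.
    PHYS_IF=$(ip route show default | awk '/^default/ {print $5; exit}')
    PHYS_MTU=$(cat "/sys/class/net/${PHYS_IF}/mtu")   # 1450 on packethost, 1500 in most clouds
    VXLAN_OVERHEAD=50                                 # VXLAN-over-IPv4 encapsulation cost
    OVERLAY_MTU=$((PHYS_MTU - VXLAN_OVERHEAD))
    echo "physical ${PHYS_IF} mtu=${PHYS_MTU} -> overlay mtu=${OVERLAY_MTU}"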
19:23:43 if we can get those in and osa isn't strongly opposed, I'd like to turn packethost back on and see if we get more reliable results
19:25:36 ++
19:26:15 I should also write an email to the dev list explaining we can't make these assumptions anymore
19:26:23 that looks like a good solution and backwards compat, so merging into zuul-jobs should be fine
19:26:32 at least, backwards compat for working setups :)
19:26:36 corvus: yup, it should be backward compat if you already had a working setup, yes
19:27:29 kolla and tripleo are the other two I'm semi worried about, but I think tripleo has done a decent job reconsuming our overlay tooling
19:27:37 I expect they will get a working setup if we update our tools
19:28:39 The other packethost issue is a bit more osa specific according to logstash and has to do with tcp checksum errors. Possibly related to mtus, but osa also explicitly sets some iptables checksum options
19:29:22 this might be worth a quick email then, just to make sure kolla/tripleo/other folks see the changes?
19:29:41 yup, I'll write one up explaining we have more than one cloud with smaller MTUs now and we have to stop assuming 1500
19:30:31 hopefully it isn't too controversial, as it is a problem we've made for ourselves via neutron overlay networking :)
19:30:34 and we could stand to update the testing environment document accordingly
19:30:43 fungi: I have a patch up for that
19:30:53 ahh, see what i miss when i go to get lunch?
19:31:00 looks like it must've merged, it's not in my open queue anymore
19:31:24 i'm sure i was +2 on that in spirit
19:31:25 https://review.openstack.org/578159
19:33:00 On winterstack naming, I followed up with jbryce last week really quickly and the plan he has proposed is that he wants to double check a couple foundation people have had a chance to look it over, then he will request we start +1'ing or similar to whittle our list down. He and a bunch of the foundation staff are traveling in APAC right now, so that may not happen this week
19:33:11 feel free to continue adding suggestions if you have new ideas
19:35:16 SSL certs for just about everything were updated late last week. Rooters are no longer getting all those emails every day :)
19:36:00 One thing we/I discovered in that process is that we can no longer rely on email based verification for the signing requests
19:36:12 oh?
19:36:32 yup, GDPR fallout is that our registrar for .orgs is not publishing contact info in whois anymore
19:36:50 is there no forwarder?
19:37:10 and despite asking them fairly directly to publish that data, they don't seem interested. This wasn't a problem for openstack.org, as namecheap lets you use hostmaster@openstack.org, but for openstackid.org it was more problematic
19:37:16 corvus: not that I could determine
19:37:28 it appears that the .org tld registry is special in this case
19:37:29 The solution I used was to use DNS record based verification
19:37:45 I created a random string CNAME to a comodo name and then after 20 minutes they checked and found it there and signed the cert
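(A rough illustration of that kind of DNS-based validation check; the token and target below are made-up placeholders, not the actual values used:)

    #!/bin/bash
    # Sketch: publish the CA-supplied validation token as a CNAME, then
    # confirm it has propagated before asking the CA to re-check.
    # Token and target are placeholders, not the real values.
    DOMAIN="openstackid.org"
    TOKEN="a1b2c3d4e5f6"                  # random label supplied by the CA
    TARGET="a1b2c3d4e5f6.comodoca.com."   # validation target supplied by the CA
    # After creating "${TOKEN}.${DOMAIN}. CNAME ${TARGET}" in DNS:
    echo "expecting ${TARGET}"
    dig +short "${TOKEN}.${DOMAIN}" CNAME @8.8.8.8
    # Let's Encrypt's DNS-01 check is similar, but looks for a TXT record
    # at _acme-challenge.<domain> instead of a CNAME.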
19:38:05 is one cname sufficient for all certs, or did you have to do that for each?
19:38:07 .com and .net (for example) are still publishing technical, anuse and administrative contact e-mail addresses and mailing addresses in the public 43/tcp whois
19:38:14 corvus: one for each name in a cert
19:38:16 s/anuse/abuse/
19:38:21 blech
19:38:43 informally, what are our latest thoughts about exploring letsencrypt?
19:39:03 any questions/problems i can help answer/research?
19:39:11 fungi has apparently been poking at it. The process for verifying certs with them is basically the same if using DNS
19:39:33 i've gotten over my organizational concerns with the isrg members and am trying it out on some of my personal systems. at this point i'm more concerned about the bootstrapping problem
19:39:34 you use a TXT record instead of a cname though iirc
19:40:39 most of the tooling that seems to exist today assumes root access, which bugs me (but is necessary if listening on a privileged port for verification)
19:40:41 yeah, the acme rfc draft has details, but basically it's a specially-formatted txt record in your domain, _or_ serving a string from a webserver already present at the dns name in question
19:40:41 i've been using apache
19:41:13 any reason not to use http instead of dns?
19:41:30 corvus: mostly my concern with it needing root to perform config management tasks out of band of config management
19:41:33 the dns solution presumes orchestrating dns, and the www mechanism assumes you'll do multi-stage deployment initially with no ssl cert
19:42:04 separately, I also think the cert limit per domain could bite us with letsencrypt
19:42:11 fungi: true, it is difficult to convince apache to start with no cert... but maybe we could have config management handle that? if no cert, don't add the ssl hosts?
19:42:13 the www solution can also be worked around with a proxy sort of setup, but then boils back down to dns orchestration
19:42:17 a wildcard is an option but that reduces segmentation
19:42:55 though unrelated, we noticed that *.openstack.org is already in use for a cert with the caching proxy service the foundation is contracting for www.openstack.org
19:43:20 we could put a handler in all of our vhosts for /.well-known/acme-challenge/
19:43:22 somewhat worrisome, and makes me eager to be on our own domain for important things
19:43:31 fungi: yeah
19:43:48 corvus: ya, we can negotiate the verification exchange ourselves rather than use the published tools
19:43:51 i mean, we put in a lot of work to avoid doing that and then poof
19:44:07 the ubiquitous acme handler solution is compelling, though still means eventually-consistent since the first pass of configuration management won't bring up working https
19:44:33 clarkb: oh, i just meant that all we need from apache is a non-ssl vhost which supports serving static files in /.well-known/acme-challenge/. then we can use certbot
19:44:59 oh, will certbot operate without opening the port itself?
19:45:13 last I looked that was a requirement iirc, and jamielennox had written a tool to avoid that
19:45:16 it operates over https (80/tcp)
19:45:19 fungi: right, so it's mostly config-management complication. i haven't thought about what that would mean in the new ansible-container space
19:45:20 er, over http
19:45:34 i suppose we could temporarily copy snakeoil into the certbot cert/key path until it updates those files
19:45:49 fungi: that's an option too
19:45:52 so allowing apache to listen on 443/tcp from the start
19:46:21 clarkb: aiui, it will write files to, eg, /var/www/www.example.com/.well-known/acme-challenge/, poke its server, get a cert, then write out the key
19:46:21 and just make sure any generic 80->443 redirect we configure has a carve-out for .well-known/acme-challenge
19:46:42 er, write out the signed cert
19:46:49 fungi: ya
19:47:03 clarkb: "certbot-auto certonly --webroot -w /var/www/www.example.com/ -d www.example.com -d example.com" makes it do that
19:47:30 corvus: cool, good to know
19:47:42 though on my newer systems the updated apache certbot module is working fine and basically provides the same
19:47:49 (this is how i use it without giving it any special access, other than write to that directory and /etc/letsencrypt)
19:48:00 fungi: oh neat, didn't know about that
19:48:15 other than that, my only other real concern is rate limits and quota limits
19:48:41 I know when they first started it was quite limited, but I think they are a lot less limited now
19:49:39 looks like the limit is 20 certs per week now and each cert can have up to 100 names
19:49:41 20 certs/week now according to https://letsencrypt.org/docs/rate-limits/
19:50:08 it's more making sure that since we're in the 20 certs range, we don't try to renew them all in the same week
19:50:19 apparently there's an exemption for renewing :)
19:50:45 those limits should be fine given what I just refreshed
19:50:55 I did 17 certs, with one having 8 names and the rest having 1 name
19:51:10 and we have a handful of certs that weren't expiring soon enough to worry about
19:51:37 le didn't allow san at one time, not sure if they've started to do so
19:51:52 fungi: the link above implies they do now, up to 100 per cert
19:51:54 but could imply we need 24 at this point if all distinct certs?
19:51:57 ahh, nice
19:52:00 i missed that
19:52:59 if renewing doesn't count (which is how i read it), we're effectively unlimited. it's a growth limit, not a size limit.
19:53:10 er, growth rate
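(Pulling the webroot flow and the rate-limit concern together, a hedged sketch; the host and paths are placeholders, and --staging/--dry-run point certbot at Let's Encrypt's staging environment so experiments don't count against the production 20-certs/week limit:)

    #!/bin/bash
    # Placeholder host/webroot; a sketch of exercising the webroot flow safely.
    HOST="www.example.com"
    WEBROOT="/var/www/${HOST}"

    # 1. Confirm the plain-http carve-out survives any generic 80->443 redirect.
    mkdir -p "${WEBROOT}/.well-known/acme-challenge"
    echo ok > "${WEBROOT}/.well-known/acme-challenge/ping"
    curl -fsS "http://${HOST}/.well-known/acme-challenge/ping"

    # 2. Issue against the staging CA first so mistakes don't eat rate limits.
    certbot certonly --webroot -w "${WEBROOT}" -d "${HOST}" --staging

    # 3. Renewals are exempt from the new-certificate limit; dry-run to verify.
    certbot renew --dry-run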
19:53:29 Before we run out of time I wanted to give a quick kata + zuul update. We've got a fedora 28 job running now that sometimes crashes (presumably due to nested virt problems). I'm also sorting through what they feel are requirements from their end after a bug update last night
19:53:32 oh, neat
19:53:59 from my perspective we haven't run into any deal breakers; things that are broken were either already broken or never running in jenkins in the first place
19:54:05 clarkb: where's the f28 job running?
19:54:18 corvus: against kata-containers/proxy on a vexxhost VM
19:54:37 so it still sometimes crashes even in vexxhost?
19:54:46 (i thought vexxhost had the right settings or something)
19:54:52 corvus: yes, apparently they have issues with their Jenkins running ubuntu 17.10 as well
19:54:59 mnaser thinks it is related to kernel versions in the guest
19:55:07 (newer kernels less stable)
19:55:26 we're running the newest centos kernel on the hypervisor
19:55:27 he is working with them to debug that since it affects jenkins just as much as zuul
19:55:40 and noticed 16.04 never crashed, but 17.10+ was crashing
19:55:47 salvador from kata came up with a reproducer too
19:56:47 i gave all the tracebacks and the info necessary, but this isn't as much of a zuul/infra issue as it is a kernel issue; i'm open to providing whatever they need to solve things :)
19:57:05 downgrading kernel versions is an option too
19:57:13 (i mean, in the job as a pre-playbook)
19:57:26 since zuul *can* handle in-job test-node reboots
19:57:33 really quickly before we run out of time, is there anything else anyone had to bring up?
19:57:42 #topic Open Discussion
19:57:49 (happy to follow up on kata in -infra)
19:57:53 or in kata-dev
19:58:54 i gave a talk on zuul at openinfradays china. it was well attended.
19:59:26 thanks for doing that!
19:59:39 my pleasure :)
20:00:23 We are out of time. Thank you everyone. I am going to grab lunch, then will follow up to the mailing lists with the mtu topic and infra ptg item
20:00:26 #endmeeting