19:03:48 #startmeeting infra
19:03:49 Meeting started Tue Jun 20 19:03:48 2017 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:50 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:52 The meeting name has been set to 'infra'
19:03:55 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:04:01 #topic Announcements
19:04:04 #info Don't forget to register for the PTG if you're planning to attend!
19:04:09 #link https://www.openstack.org/ptg/ PTG September 11-15 in Denver, CO, USA
19:04:13 as always, feel free to hit me up with announcements you want included in future meetings
19:04:19 #topic Actions from last meeting
19:04:26 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-06-13-19.02.html Minutes from last meeting
19:04:38 clarkb finish writing up an upgrade doc for gerrit 2.11 to 2.13
19:04:41 #link https://etherpad.openstack.org/p/gerrit-2.13.-upgrade-steps upgrade doc for gerrit 2.11 to 2.13
19:04:44 still available to work on testing that on review-dev in a couple hours?
19:04:49 i'll be around to help however you need
19:05:00 o/
19:05:02 yup
19:05:10 though i may still be dialled into the board meeting and listening at the same time, depending on how long that ends up running
19:05:16 if others are able to give that a look over in the next few hours that would be nice
19:05:44 we can also talk through it during open discussion if we want
19:05:47 fungi start an infra ml thread about puppet 4, beaker jobs and the future of infra configuration management
19:05:50 #link http://lists.openstack.org/pipermail/openstack-infra/2017-June/005454.html Puppet 4, beaker jobs and the future of our config management
19:05:56 sorry that took a couple weeks, but everyone interested please follow up there
19:06:11 #action ianw abandon pholio spec and shut down pholio.openstack.org server
19:06:14 (carrying that over so we don't forget)
19:06:29 not yet, sorry ... gate issues have had my attention
19:06:36 perfectly fine!
19:06:43 it's not a hurry, just cleanup
19:06:50 #topic Specs approval: PROPOSED PTG Bot (fungi)
19:06:53 #link https://review.openstack.org/473582 "PTG Bot" spec proposal
19:06:59 #info Council voting is open for the "PTG Bot" spec proposal until 19:00 UTC on Thursday, June 22.
19:07:02 just a reminder, i gave that one the extra week since i only proposed it just before the meeting last week
19:07:21 i also have some initial changes proposed under that review topic i'll un-wip after it gets approved
19:07:33 #topic Specs approval: PROPOSED Provide a translation check site for translators (eumel8, ianychoi, fungi)
19:07:37 #link https://review.openstack.org/440825 "Provide a translation check site for translators" spec proposal
19:07:44 i added this following conversation with ianychoi during the open discussion period in last week's meeting
19:08:21 it seems to be ready enough for a vote; it's mainly just a change of direction on an already approved spec which ended up being untenable
19:08:40 any objections to putting it up for council vote until thursday?
19:09:23 no objection by me
19:10:19 #info Council voting is open for the "Provide a translation check site for translators" spec proposal until 19:00 UTC on Thursday, June 22.
19:10:31 #topic Priority Efforts
19:11:01 clarkb: any interest in talking more about the gerrit upgrade plan during the meeting?
19:11:08 sure
19:11:21 #topic Priority Efforts: Gerrit 2.13 Upgrade
19:11:43 worth noting, this is a version skip. always fun
19:11:43 I've got the rough plan sketched out in that etherpad for upgrading review-dev to 2.13.7.ourlocalbuild
19:12:05 yes because it is a version skip we cannot do online reindexing after upgrade, we have to do a full offline reindex before starting the service
19:12:11 and per an earlier meeting, we're choosing to roll forward with 2.13.x instead of 2.14.x for now
19:12:38 so fungi and I will walk through that on review-dev in order to watch the reindex process
19:12:59 timeframe for offline reindexing in 2.10 and online in 2.11 suggests we should budget at least 4 hours of downtime
19:13:01 once that is done we will want to test services and scripts against 2.13. Particularly zuul as there may be new event types that we don't handle or otherwise want to handle better
19:13:31 (4 hours of downtime for the production review.o.o reindex i mean)
19:13:40 ya don't expect review-dev to take that long
19:13:41 review-dev will likely be far faster
19:13:43 fungi: I know we discussed not-14 before - but I think I thought that was partially because going to 2.14 was going to be significantly more expensive ...
19:13:44 right
19:13:48 as reindexing is on a thread per repo basis
19:14:07 mordred: it would require java 8 which requires xenial (or at least not trusty)
19:14:08 fungi: if we're going to have to do an offline reindex in this case, is it worth reconsidering that?
19:14:16 mordred: yes, some much more significant changes in 2.14 which we didn't want to complicate the current progress with
19:14:17 clarkb: nod
19:14:22 kk. just making sure
19:14:37 mordred: if we then go from 2.13 to 2.14 we should be able to do online reindex as part of that upgrade
19:14:51 and also because we already missed the boat on the 2.12 upgrade by deciding to refocus on 2.13
19:14:56 mordred: I think it is a good idea to separate the distro upgrade from the 2.14 upgrade as a result
19:14:57 clarkb: ah - ok . so this should be the last offline reindex we need to eat
19:15:01 clarkb: ++
19:15:02 mordred: hopefully
19:15:04 definitely agree
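For reference, a minimal sketch of the offline reindex step discussed above, assuming a conventional Gerrit site layout and init script; the site path, service name and thread count are illustrative placeholders, not the agreed procedure.

    # Gerrit must be stopped for an offline reindex (service name is an assumption)
    sudo service gerrit stop

    # Rebuild the secondary index with the new 2.13 war before first start;
    # --threads controls per-project parallelism, tune it to the host
    sudo -u gerrit2 java -jar /home/gerrit2/review_site/bin/gerrit.war \
        reindex --threads 4 -d /home/gerrit2/review_site

    sudo service gerrit start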
19:15:30 so given how long it takes us to prepare for gerrit upgrades compared to their frequency of major releases, if we keep revising our plan to be whatever the latest major release is we may never upgrade
19:16:07 and since we already have a lot of progress and 2.13 acceptance testing behind us, i'd rather not lose that momentum
19:16:39 I also think that 2.13 has had a chance to mature (7 point releases) whereas 2.14 not so much yet
19:16:51 also there are a number of useful things we can do with 2.13 (and could have done with 2.12) that make the upgrade worthwhile even if it means we immediately begin planning for the next upgrade
19:17:12 ++
19:17:40 so ya hopefully after today we can start poking at testing with zuul against review-dev and our hook scripts and the election roll generation and all that
19:17:41 like, enabling individual teams or the stable team to take care of the eol process, or simplifying the release automation
19:17:56 then maybe in a week or two we can schedule an upgrade in production
19:18:06 (trouble is we are getting to the fun part of the release cycle)
19:19:04 i don't mind if we get the upgrade details worked out and then have to put it on ice until a lull in release activity, even if that means between the ptg and summit
19:19:39 yup especially because momentum on process is here now
19:19:44 yah
19:19:49 just something to be aware of as we get closer to being ready to upgrade production
19:20:08 #link https://releases.openstack.org/pike/schedule.html Pike Release Schedule
19:20:47 _if_ we can swing it this cycle, it'll probably be in the next ~3 weeks
19:21:32 which is theoretically doable if we can get people testing it out on review-dev and making whatever changes we need to address bugs
19:22:34 given what we don't yet know and may uncover while running through this, i'm hesitant to commit to being able to upgrade before we get into the library final release window and things start picking up
19:23:00 so while it would be nice if it works out, i'm not going to get my hopes up
19:24:05 basically we'd need the details ironed out in the next 2 weeks and then a maintenance announcement with a week of advance warning since the outage (for the entire ci system) will be pretty lengthy
19:24:37 ya
19:24:44 I think better to not kill ourselves with that effort
19:24:52 and instead be thorough
19:24:56 anyway, let's see what we figure out this week and i'll make sure we touch on the updates status in next week's meeting too at which point we may have a better idea as to how feasible it is
19:25:11 er, updated status
19:26:48 sounds good
19:27:01 having read through the etherpad, i'm wondering whether we need to disable puppet for gerrit?
19:27:13 (and disable puppet globally when we do the real maintenance)
19:27:33 fungi: my concern there is that puppet could update the war under us while we are in progress
19:27:44 so basically we want to deactivate puppet there until we get the change merged to reflect the right war
19:28:02 no, that's what i meant, but i missed that you already have it as the first step there
19:28:19 I don't think I have the step of merging the change to reflect the war though
19:28:58 good point, and that probably has to be done after disabling puppet but certainly before stopping any of the other services
19:31:00 well before starting puppet again at least
19:31:23 oh, sure, can be done at either end
19:32:00 after makes the most sense i guess, since it's less to revert if we need to roll back
19:33:41 our gerrit fork's tags are lagging behind too...
19:34:10 #link https://gerrit.googlesource.com/gerrit/+/v2.13.8 Gerrit 2.13.8 stable point release from April 26, 2017
19:34:28 might make sense to confirm that still builds for us when we get time
19:34:53 oh hrm do we want to push the upgrade a day and get ^ built
19:35:03 I thought I double checked for tags and 2.13.7 was latest
19:35:33 i'm fine doing it that way too. i have even more time tomorrow to help (fewer meetings)
19:36:15 I'm trying to get a changelog to see what 2.13.8 adds
19:36:23 i think we need to rebase all our tags onto that if we do
19:36:32 er, s/tags/backports/
19:36:47 https://www.gerritcodereview.com/releases/2.13.md#2.13.8
19:36:54 yes we'll need to rebase and merge the ~4 changes
19:37:24 2.13.8 includes jgit and performance fixes
19:37:31 my hunch is we probably do want those?
19:37:42 there are a few additional patches in the stable branch on top of that tag too
19:38:06 #link https://git.openstack.org/cgit/openstack-infra/gerrit/log/?h=upstream/stable-2.13 our fork of Gerrit stable-2.13
19:38:33 including some which look like bug fixes
19:38:50 most recent is 2 days old
19:39:52 one thing is that if we have to delay the prod upgrade we may end up doing a point release upgrade on review-dev anyway just to get the latest?
19:40:36 right, i would be cool doing one last minor update on review-dev before the production maintenance just to vet the latest stable state
19:40:42 perhaps we should just bake that into our thinking for this upgrade process. Go to 2.13.7 now, start testing stuff like zuul against it. Then do an upgrade to 2.13.8/9/whatever closer to the production upgrade, then do 2.13.8/9/whatever in production
19:40:51 sure, sgtm
19:41:01 I actually like that as I think the jump to 2.13 and testing that is the biggest concern right now
19:41:10 then it will be easy to add on a different point release in the future.
19:41:15 upstream/stable-2.13 branch tip maybe
19:41:20 ya
19:41:40 since they seem to apply fixes there far more often than they tag
19:42:00 in that case let's stick with the original plan for review-dev for now
19:42:04 that gets us moving on the testing front
19:42:20 then we can incorporate a minor bump down the line when things are more firmed up for production
19:42:43 #agreed proceed with testing gerrit-v2.13.7.4.988b40f.war today, update to upstream/stable-2.13 branch tip and briefly re-test shortly before production upgrade maintenance
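A rough sketch of the puppet/war coordination discussed at the start of this topic, assuming a stock puppet agent on the host; infra actually drives puppet runs via ansible, so the real disable mechanism may differ, and the ordering below simply mirrors the "disable first, merge the war pin before re-enabling" conclusion above.

    # Keep puppet from swapping the war back mid-maintenance
    sudo puppet agent --disable "gerrit 2.13 upgrade maintenance"

    # ... stop gerrit, install the new war, run the offline reindex,
    # start gerrit (see the reindex sketch earlier) ...

    # Merge the config change pointing at the new war *before* this,
    # so the next puppet run converges on 2.13 instead of reverting it
    sudo puppet agent --enable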
19:43:16 #topic Open discussion
19:43:35 DNS is really hurting us in osic.
19:43:36 we've still got about 15 minutes before the tc and board meetings start if anyone has anything else to bring up
19:43:49 (message:"failed: Temporary failure in name resolution." OR message:"Temporary failure resolving" OR message:"wget: unable to resolve" OR message:"Could not resolve host") AND NOT message:"Could not resolve host: fake" AND tags:"console" is my logstash query
19:43:50 and yeah, sad dns
19:44:23 I think the problem is that unbound's list of forwarders doesn't have priority/ordering, so we are using both ipv4 and ipv6 resolvers in osic
19:44:50 to hit ipv4 resolvers we have to go through PAT/NAT which I think is likely the cause of our troubles
19:44:55 did you consider my suggestion to have a udev rule rejigger unbound as soon as you get a v6 default route?
19:45:07 I haven't yet
19:45:18 * fungi has no idea how terrible that might be
19:45:33 unbound-control does allow you to configure those things on the fly
19:45:47 so we could have it remove resolvers from the existing list based on ipv6 coming up
19:45:52 oh, there is a period of no dns if you start all v6?
19:46:21 given the async nature of boot in general and v6 autoconfig in particular, i think triggering reconfiguration off kernel events is about the fastest solution you're going to get there
19:46:23 ianw: more background on that is my systemd unit file hack doesn't work because network-online comes up before we have working ipv6 because ipv4 is up
19:46:51 ianw: so my check of "do we have ipv6" isn't working right
19:47:02 ahh, ok ... that makes sense, i guess :/
19:47:21 and we can't stop ipv4ing because github and gems and other things
19:47:51 another option is to set it in a nodepool ready script based on whether or not ipv6 is present at that point (it should be because nodepool will prefer ipv6)
19:48:09 ahh ... hence the discussions with mordred i'm guessing
19:48:11 fungi: I think I have a preference for ^ because it is simple and straightforward
19:48:20 fungi: though udev is likely workable as well
19:48:42 yah - I think doing it in a nodepool ready script is a great idea
19:48:50 udev addresses your concern of "how do you do this without a ready script and without zuul"
19:49:40 most of the other discussion is more for "how do we do this action in v3" - I think we've got great information at ready-script/pre-playbook time to do this well
19:51:58 * ianw has no strong opinions, but is glad for such tenacity from clarkb investigating it!
19:52:22 I'll push up an update to ready-script it as that is quick and easy
19:52:30 well needs new images I guess
19:52:34 but otherwise is quick and easy :)
19:54:20 okay, i'm ending the meeting 5 minutes early to give anyone who wants it time to dial into the board of directors conference call and/or grab popcorn before the tc meeting in here on queens goals refinement
19:54:52 fungi: I'm landing ... which is going to make my participation in the TC meeting a bit difficult
19:54:57 5 mins is not enough to get popcorn ;/
19:55:24 thanks everyone!
19:55:26 #endmeeting
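For reference, a minimal sketch of the ready-script approach from the open discussion, assuming unbound is the node's local resolver with remote control enabled; the IPv6 resolver addresses are placeholders, not osic's actual forwarders, and the actual nodepool change may look different.

    #!/bin/bash
    # Ready-script sketch: if the node booted with a global IPv6 default
    # route, restrict unbound's forwarders to IPv6 resolvers so lookups
    # never traverse the provider's PAT/NAT path.
    V6_RESOLVERS="2001:db8::1 2001:db8::2"   # placeholder addresses

    # A v6 default route via a gateway indicates working IPv6 connectivity
    if ip -6 route list default | grep -q via; then
        # Replace the forwarder list for the root zone on the fly;
        # requires unbound remote-control to be configured
        unbound-control forward $V6_RESOLVERS
    fi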