19:04:40 #startmeeting infra
19:04:40 Meeting started Tue Aug 9 19:04:40 2016 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:04:41 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:04:43 The meeting name has been set to 'infra'
19:04:46 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:04:52 #topic Announcements
19:04:57 #info New IRC channel #zuul for discussions on development of Zuul
19:04:59 #link http://lists.openstack.org/pipermail/openstack-infra/2016-August/004614.html
19:05:03 #info New IRC channel #openstack-shade for discussions on development of the Shade library
19:05:04 #link http://lists.openstack.org/pipermail/openstack-infra/2016-August/004616.html
19:05:10 apparently the #openstack-infra show has two exciting new spin-offs
19:05:17 hopefully these won't be like laverne and shirley
19:05:19 in stereo?
19:05:21 unless maybe it's the lesser known animated laverne and shirley. that might be okay
19:05:27 except we hope these channels will last more than one season
19:05:30 I liked laverne and shirley
19:05:33 should that be #openstack-zuul? or is it just #zuul?
19:05:40 just #zuul
19:05:42 just #zuul, as it's not openstack-specific
19:05:52 gotcha
19:05:55 looks like #zuul was created in 2009
19:06:00 which predates openstack
19:06:05 anteaya: we're really good at time travel
19:06:10 zuul gets used far and wide beyond our community borders, so trying to do what we can to avoid making it look like it's openstack-centric
19:06:10 you are so
19:06:27 there is also now #openstack-dib for diskimage-builder which may sometimes be relevant to infra concerns
19:06:32 Just a quick announcement - if we need more time, we can discuss later:
19:06:36 bindep: We've changed the default file name from other-requirements.txt to bindep.txt and once we have done a new bindep release, I suggest we advocate usage of it more.
I plan to send an email to the mailing list about this.
19:06:43 constraints: All tox based jobs can now use constraints in the gate. I'm currently struggling with a translation change (waiting for merge) and still test a bit but plan to send an announcement to the list about this in the next days.
19:07:00 AJaeger: yeah, i left the bindep one out of announcements for today, we still need a release, image updates...
19:07:15 fungi, yeah...
19:07:19 AJaeger: i have your items as meeting topics
19:07:24 so we can discuss
19:07:26 #info Results Presentation: Operator Information Needs
19:07:28 Piet has invited infra team members (at least those okay with proprietary videoconferencing systems) to see a presentation summarizing feedback from operator interviews
19:07:34 #link http://lists.openstack.org/pipermail/openstack-infra/2016-August/004611.html
19:07:35 fungi, ok
19:07:44 also the obligatory...
19:07:49 #info Reminder: late-cycle joint Infra/QA get together to be held September 19-21 (CW38) at SAP offices in Walldorf, DE
19:07:50 #link https://wiki.openstack.org/wiki/Sprints/QAInfraNewtonSprint
19:07:56 #topic Actions from last meeting
19:08:02 #link http://eavesdrop.openstack.org/meetings/infra/2016/infra.2016-08-02-19.04.html
19:08:05 "1. (none)"
19:08:20 #topic Specs approval
19:08:26 #info APPROVED: Pholio Service Installation
19:08:32 #link http://specs.openstack.org/openstack-infra/infra-specs/specs/pholio.html
19:08:39 #topic Priority Efforts: A Task Tracker for OpenStack (zaro)
19:08:45 zaro: you have some changes linked on the agenda for gerrit 2.11.4 upgrades and its-storyboard fixes, which i approved earlier today
19:08:46 need to discuss anything specific about these?
19:09:07 yes, one more change for its-storyboard #link https://review.openstack.org/#/c/353046
19:09:27 also for gerrit upgrade. thanks for approving those.
19:09:31 FYI, there are two changes up to fix warnings/errors building specs: 352215 and 352218
19:09:40 i'm going to test the features it brings today.
19:10:05 i'm wondering if we need to schedule an upgrade time/date?
19:10:18 i mean to review.o.o
19:10:43 or should i just work with an infra-root to approve and make sure it doesn't fail
19:10:45 sounds good. as soon as you're comfortable with the state of review-dev, we should decide on a convenient window (or have you already checked out review-dev at this point?)
19:11:08 have not tested on review-dev yet.
19:11:16 we are in R-8, feature freeze is R-5
19:11:25 since it's a minor update and we don't need to do any offline reindex, we should be able to schedule a brief outage on short notice
19:11:28 will do today but shouldn't take too long
19:12:01 * mordred is excited
19:12:07 but yeah, we're ramping up to high volume time for our whole ci, so soon it'll be harder to do much with it
19:12:25 =D I've tested the storyboard changes on review-dev and they look good
19:12:36 It feels already very high volume ;(
19:12:43 zaro: i could help with the upgrade-related restart on this friday if all goes well. just let me know later today and i'll send out a maintenance announcement
19:12:49 AJaeger: huge backlog at the very least
19:13:00 \o/
19:13:01 fungi: cool
19:13:21 #topic Switch to chrony (ianw)
19:13:27 ianw: there's a novel on the agenda for this topic. care to summarize?
19:13:57 fungi: ntpd has been giving us some issues since we tried to deprecate ntpdate
19:14:05 who is us?
19:14:17 our (openstack's) ci system
19:14:35 though it sounds like it primarily hit tripleo
19:14:39 or their workers
19:14:46 if i put up changes to switch everything to chronyd, will people have issues with this?
19:15:17 I commented on the review, if we are going to do it, we'd need to create puppet-chrony as a drop in replacement for puppetlabs-ntp
19:15:17 it seems chronyd is better suited to our needs than ntpd, as discussed (as noted, at some length) in the attached bug
19:15:21 what about ntpd for servers and chrony for zuul worker nodes?
19:15:39 pabelanger: yep, happy to do that, if we agree to follow that path
19:15:42 ianw: my take was that since chrony was what red hat is replacing ntpd with instead of rolling forward to a default ntpd that has deprecated ntpdate, we should switch to chronyd on red hat based platforms
19:16:01 so i just wondered if it's worth keeping two things running in our puppet, etc
19:16:07 (could be an opportunity to move ntp out of the worker template and into a dib element)
19:16:09 why i was proposing a global change
19:16:25 jeblair: or ansible?
19:16:27 as debian/ubuntu are sticking with ntpd as their default time sync implementation, we continue to use ntpd there
19:16:48 pabelanger: just so we're really, really, clear -- we are NOT using ansible to build images
19:16:56 every week we seem to have a conversation about that
19:17:05 and this is the last one i'm participating in
19:17:15 jeblair: right, but have DIB elements to run ansible (over puppet)
19:17:36 pabelanger: which dib elements are running ansible?
19:17:55 pabelanger: i don't have an opinion on that
19:18:00 I'm pulling the topic sideways, but none today
19:18:03 i thought we were trying to get configuration management/orchestration services out of our dib elements over time
19:18:27 so that we can avoid having them installed on our images
19:18:34 but yes, different topic
19:19:30 I'm happy to talk more about it after the meeting
19:19:47 ianw: anyway, we know that separating long-running server infra from our worker images is something we want to do anyway, so pulling ntp from the worker puppet (but keeping it in server puppet), then doing something to install chrony in the images via an element seems like a good step
19:19:48 so to bring this back around, are we saying switch from ntpdate/ntpd to chronyd on centos and fedora single-use job nodes, but stick with ntp-wait/ntpd on debian and ubuntu? or switch to chronyd on all single-use job nodes regardless of distro?
19:20:30 do we have the same ntp-wait problem on debuntu?
19:20:35 well i would rather not have two branches of time-keeping logic, so we should choose one or the other
19:20:53 jeblair: in the minutes there is a review where someone is having issues, they are using a snapshot though
19:21:19 to my knowledge, the symptom has so far only been observed on centos/fedora dib images booted in tripleo's environment
19:21:33 so, replacing ntpd seems strange to me BUT - if it's a thing we need to do, I tend to fall in favor of doing it everywhere rather than just half of the places
19:21:33 tripleo-test-cloud-rh2 was a networking issue
19:21:42 tripleo-test-cloud-rh1 is actually working today
19:21:43 https://review.openstack.org/352621
19:21:47 fungi: oh. hrm.
19:21:54 okay, what's really going on then?
19:22:05 I'm in favour of the solution addressing the problem
19:22:11 mordred: yes, that was my feeling, if we could get agreement
19:22:16 ianw: ++
19:22:17 i just worry that switching all our distros to chronyd flip-flops us from using a non-default time sync implementation on one family of distros to using a non-default time sync implementation on the other family of nodes
19:22:46 i'm more in favor of sticking with whatever each distro's default recommendation is, even if they disagree
19:22:57 fungi: i lean toward that way of thinking as well
19:23:16 * crinkle too
19:23:24 (unless we have a Good Reason(TM))
19:24:03 debian and ubuntu seem to be moving forward with the assumption that ntpdate should no longer be used and ntpd needs a rapid quiescence at boot option, while red hat seems to say you should continue to use ntpdate, or switch to chrony because that's what they plan to default to
19:24:19 ianw: is 352621 an issue anymore now that ntpdate.service was enabled?
19:24:49 pabelanger: i think it might be a separate issue due to the use of snapshots
19:25:01 but my interpretation of the bug ianw linked is that red hat's package maintainers have little interest in doing away with ntpdate, and would rather just drop ntpd altogether
19:25:26 if the choice is to maintain two time-sync paths, then i think i'd prefer to just have a decree that infra uses ntpd and we just suck up any issues
19:25:30 also, we don't use snapshots
19:25:46 (and we're not planning on supporting them in nodepool in the future)
19:26:02 jeblair: sure, but my thought is that we might make life easier for everyone
19:26:23 if chronyd is making better choices about keeping the time in sync for our situation
19:26:26 yeah, just wanted to throw that out as not being a primary use case
19:26:26 ianw: well, if support for ntpd on red hat is planned to evaporate, it doesn't sound like we should continue using it there
19:27:21 does chronyd do a good job of keeping subsecond synchronization
over the course of a long-running job, or is it going to be more subject to skips and jumps?
19:27:33 (and do we care, i guess?)
19:28:30 fungi: i think for the most part, we're syncing the time in devstack-gate, and from that point we're not going to be jumping much
19:28:45 of course we could move that sync out of d-g into a more generic part
19:30:13 i guess the ultimate requirements are, 1. make sure we have a fairly high-precision global synchronization of time on our job workers, and 2. make sure it's actually in sync prior to starting job payloads
19:30:24 for something like this, are people happy with a "if redhat include ::chrony else include ::ntp" ? http://git.openstack.org/cgit/openstack-infra/puppet-kerberos/tree/manifests/client.pp
19:31:07 really, that seems to want to say "kerberos needs the time in sync", but how that happens seems orthogonal
19:31:21 i'm still at: chrony in dib on rh, ntp in dib on debuntu (support both in d-g), and ntp in puppet for servers
19:31:37 jeblair: what about rh servers?
19:31:55 mordred: do we have any?
19:31:59 git*.o.o
19:32:03 gits
19:32:03 sure, our git servers at least
19:32:03 the git farm runs centos
19:32:09 and pbx servers
19:32:12 pleia2: centos6 isn't it?
19:32:16 I mean, it's old centos so does not have a problem currently
19:32:17 or did we switch those to trusty?
19:32:35 mordred: ah, thanks
19:32:35 mordred: git servers are centos 7.x... not sure how old you mean
19:33:06 me either
19:33:21 if we want to advance the state of our puppet to be able to run servers on fedora, i'm not opposed to that conditional.
19:34:14 yeah, they're on 7.2
19:34:17 ianw: what was the bugzilla link for that ntpd systemd element dependency bug pabelanger opened and the maintainer closed?
19:35:07 er, systemd unit
19:35:23 fungi: not sure i saw that one. was it to wait for the network to be up before ntpd starts?
19:35:54 yeah, i thought that's the one where they had said red hat was going to be defaulting to chrony anyway
19:36:04 cause it kind of deliberately does *not* do that ... ntpd itself picks up when interfaces appear
19:36:04 fungi: I haven't created that one etc. It was a dependency issue on networking, which ntpdate.service provided
19:36:17 pabelanger: right, that's the one i'm referring to
19:36:47 fungi: That fixed the need to restart ntpd to get ntp-wait working
19:37:21 yeah, it is the default for rhel and fedora https://bugzilla.redhat.com/show_bug.cgi?id=1361382#c7
19:37:21 #link https://bugzilla.redhat.com/show_bug.cgi?id=1361382#c3
19:37:21 bugzilla.redhat.com bug 1361382 in ntp "ntp-wait hangs after boot for a long time, unless ntpd is restarted" [Unspecified,Closed: notabug] - Assigned to mlichvar
19:37:34 just found it in the channel logs
19:37:49 fungi: thanks for the link
19:37:54 ah, fungi-the-irc-search-engine ;)
19:38:14 alright, well in conclusion, not much interest in switching to chronyd everywhere
19:38:42 "As a long-term solution it's probably best to switch to chrony. It's the default NTP client in Fedora/RHEL [...]"
19:38:43 i can see what it looks like to just switch centos/fedora, but i'm not really sold on keeping both paths tbh
19:38:47 that's the quote i was looking for
19:39:20 so based on that i can see switching to chrony as our default ntp client in fedora/centos
19:39:34 because we're running a non-default on those distros right now
19:40:01 and our usual preference is to operate our servers and run our jobs on as default an upstream choice of options as possible
19:40:07 we have a lot of switches like that in puppet, fwiw. comes with the territory i think. we could abstract it with a new module so we only have the switch in one place and use that module everywhere in puppet.
19:40:16 I believe that discussion could also happen upstream with puppetlabs-ntp too, improving support for RHEL / fedora
19:40:19 (but i still think dib is the way to go if we want to focus on worker nodes first)
19:40:25 ++
19:40:31 I think worker nodes are the most important part of this
19:40:48 i concur
19:42:22 #agreed We should take a phased approach switching CentOS/Fedora DIB images from ntpdate and ntpd to chronyd, and then consider whether to also switch to chronyd on our long-running CentOS servers and with conditionals in any of our various puppet modules
19:42:45 ok, thanks, onward ...
19:42:51 that a close enough summary?
19:42:56 looks good to me
19:42:57 thanks ianw!
19:43:12 #topic Bindep uses now bindep.txt by default (AJaeger)
19:43:18 \o/
19:43:27 AJaeger just wanted to mention that some possible mass change proposals are coming, i think
19:43:34 at least after we get a new bindep release tagged, and then confirmed present on all our images
19:43:37 yeah, exactly
19:43:38 only like 150 projects
19:43:52 yeah, only ;(
19:44:11 i'm happy to submit that change when the time comes, if you want
19:44:12 If anybody wants to vote in all PTL elections, please tell me and you can do the changes ;)
19:44:22 I actually am surprised these days if I see an AJaeger patch that isn't a mass change
19:44:32 ha ha ha
19:44:39 heh, AJaeger has a lot of good non-mass changes ;)
19:44:53 true
19:45:01 fungi, I'm fine doing it myself - like to complete this...
19:45:09 anyway, for those that missed the discussion, rationale is in the review and commit message
19:45:43 #link https://review.openstack.org/350184
19:45:46 there we go
19:45:59 #topic Constraints can be used in tox (AJaeger)
19:46:12 All tox based jobs can now use constraints in the gate. I'm currently struggling with a translation change (waiting for merge) and still test a bit but plan to send an announcement to the list about this in the next days.
19:46:16 this was a ton of work, thanks for pushing through it AJaeger
19:46:31 \o/
19:46:32 thank you AJaeger
19:46:37 ++
19:46:42 A lot of work by many different people - implementation, review, design discussions, ....
19:46:54 thanks for shepherding it
19:46:55 * mordred suggests AJaeger celebrate at White Trash Fast Food (it's what I would do if I were anywhere near Berlin)
19:47:06 ha ha ha
19:47:15 AJaeger: are you near Berlin?
19:47:23 or just closer than the rest of us?
19:47:24 * AJaeger will drink a glass of wine after the meeting ;)
19:47:41 anteaya: closer than the rest of you ;) 300 miles away
19:47:46 ha ha ha
19:47:56 anteaya: :)
19:48:05 day trip for some fast food ha ha ha
19:48:16 wow, now that i've searched for that venue, i'm all up for catching a punk show there
19:48:27 A plea for everybody: The tricky part are the post and release jobs with constraints. If anything unusual pops up, please carefully evaluate!
19:48:55 fungi: it's like the best place ever
19:49:04 right, those have been switched the most recently, so not entirely certain all the bugs have shaken out
19:49:30 and those are also the ones that most people do not check ;(
19:49:35 yep
19:50:09 okay, anything else on constraints?
19:50:11 nothing further from me on this unless there are questions
19:50:31 #topic wiki status update (jpmaxman, Krenair, fungi)
19:50:57 jpmaxman has done some awesome work getting through an in-place upgrade to mediawiki 1.27 with a copy of our production data
19:51:18 \o/
19:51:29 that demo is up for all to poke at and see if they can spot anything broken, particularly plugins/extensions we may have forgotten about
19:51:34 #link https://wiki-upgrade-test.openstack.org/
19:51:56 thank you jpmaxman
19:51:58 it's set up with the newer "nocaptcha" recaptcha which is supposed to be a lot better at thwarting spammers
19:52:08 * mordred hands jpmaxman a cat that isn't too angry
19:52:31 also Krenair has worked out the bits needed to puppet a mw 1.27 deployment on ubuntu trusty
19:52:39 #link https://review.openstack.org/#/q/topic:wiki-upgrade+is:open
19:52:45 * AJaeger hands jpeeler a glass of wine and says thanks!
19:52:55 argh, completion ;(
19:53:06 * AJaeger hands jpmaxman a glass of wine and says thanks!
19:53:40 thanks krenair
19:53:42 fungi: also important to note: "ReCaptcha module will be removed in the near future" https://lists.wikimedia.org/pipermail/mediawiki-announce/2016-August/000193.html so it's good we're switching
19:53:50 i've also added a trove database running mysql 5.6 and our "sane" configuration defaults which jpmaxman is working out some tests of our production data in next
19:53:53 hah hah
19:54:22 after the meeting i'll be following up to the -dev ml with a call for testing on the wiki-upgrade-test site as well
19:54:31 oh good
19:54:54 hopefully the keep the wiki folks will show up in droves
19:55:28 * jpmaxman blushes
19:55:28 worth noting, new account creation has been open again on the production wiki for a few weeks at this point (i'm just manually deleting spam and blocking spammer accounts a few times a day for now), but file/image uploads have been disabled since the older recaptcha plugin didn't cover them
19:55:33 super glad to have Krenair's help on this, lots of great patches
19:56:10 i don't think there's been any testing yet of whether the new nocaptcha confirmedit plugin covers file uploads but we can try that out after we switch production to 1.27
19:56:11 happy to help hopefully we can bring this home
19:56:17 will new wiki allow file/image uploads?
19:56:40 anteaya: right now it is disabled
19:56:47 yeah, "to be determined"
19:56:50 jpmaxman: thank you
19:57:05 there are also some other extensions we can try adding to further thwart spammers, if need be
19:57:26 yeah there will be work still to be done for sure - the objective is a feature limited spam free wiki to start ;)
19:57:27 as well as the somewhat nuclear option of adding a robots.txt to stop search engines indexing wiki content entirely
19:57:41 if we need to
19:57:53 fungi: I got word we can do that as a short term solution if needed
19:58:17 fungi, we could also only blacklist certain file types or all uploads from indexing...
19:58:35 I think the foundation would rather that than the google juice be diluted for all of www.o.o - but that ideally people would be able to find content on the wiki in the future so it should be a short term fix only
19:58:37 AJaeger: yeah, we did that already. blacklisted pdfs and they switched to uploading images
19:58:43 ;(
19:58:50 what do you mean you got word?
19:59:35 anteaya: briefly discussed with jamesmcarthur and sparkycollier
19:59:41 he checked with the foundation site maintainers as to whether they thought it would help mitigate their concerns about the wiki spam influencing keyword rankings for the www site
20:00:15 anyway, we're out of time
20:00:20 thank you fungi
20:00:21 thanks everyone!
20:00:25 #endmeeting
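
Editor's note on the "Switch to chrony" topic: the #agreed item (chronyd on CentOS/Fedora job images, ntpd kept on Debian/Ubuntu) and the stated requirement that clocks be in sync before job payloads start can be illustrated roughly as below. This is only a sketch under assumptions: the function names and the retry count are hypothetical and not taken from any infra repository. The `chronyc waitsync` subcommand and the `ntp-wait -n` option do exist in chrony and ntp respectively, but how devstack-gate or the DIB elements would actually invoke them was not decided in the meeting.

```python
from typing import List

# Distro families as discussed in the meeting; the set names are illustrative.
CHRONY_DISTROS = {"centos", "fedora"}
NTPD_DISTROS = {"debian", "ubuntu"}


def time_sync_daemon(distro: str) -> str:
    """Return the distro-default time sync daemon for a job node (hypothetical helper)."""
    name = distro.lower()
    if name in CHRONY_DISTROS:
        return "chronyd"
    if name in NTPD_DISTROS:
        return "ntpd"
    raise ValueError("unknown distro: %r" % distro)


def wait_for_sync_command(distro: str, tries: int = 30) -> List[str]:
    """Build (without running) a command that blocks until the clock is synchronized."""
    if time_sync_daemon(distro) == "chronyd":
        # chronyc waitsync <max-tries>: poll chronyd until it reports sync
        return ["chronyc", "waitsync", str(tries)]
    # ntp-wait -n <tries>: poll ntpd until it reports sync
    return ["ntp-wait", "-n", str(tries)]
```

A job setup step could pass the returned list to `subprocess.run(..., check=True)` before starting the payload; the sketch deliberately stops short of executing anything so it stays safe to import on any host.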