19:03:42 #startmeeting infra
19:03:43 Meeting started Tue Sep 6 19:03:42 2016 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:44 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:46 The meeting name has been set to 'infra'
19:03:47 o/
19:03:49 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:52 o/
19:03:52 o/
19:03:55 #topic Announcements
19:04:00 #info Reminder: late-cycle joint Infra/QA get together to be held September 19-21 (CW38) at SAP offices in Walldorf, DE
19:04:02 #link https://wiki.openstack.org/wiki/Sprints/QAInfraNewtonSprint
19:04:09 #topic Actions from last meeting
19:04:13 #link http://eavesdrop.openstack.org/meetings/infra/2016/infra.2016-08-30-19.02.html
19:04:19 pleia2 set up sprint booking for infra bug day
19:04:24 last week we settled on the 12th, but it looks like that's taken for an upstream training sprint in #openstack-sprint
19:04:32 yeah
19:04:32 #link https://wiki.openstack.org/wiki/VirtualSprints#Upstream_training_Sprint
19:04:37 I am ok moving to tuesday or wednesday
19:04:50 wfm
19:05:06 wednesday
19:05:11 fewer meetings
19:05:17 yeah, Tuesday is pretty meeting heavy
19:05:19 either day works for me though wednesday is slightly better
19:05:19 anybody who wanted to participate in the infra bug sprint have issues with or a preference for wednesday?
19:05:29 sgtm
19:05:44 tuesday similarly heavy with meetings *cough*
19:05:47 and yeah, i'll echo the concern about meetings consuming most of tuesday
19:05:52 ok, I'll get that firmed up today w/ wiki+announcement
19:06:10 pleia2: thank you
19:06:11 either day wfm
19:06:16 (weds is storyboard meeting so... don't do anything exciting at 15:00 UTC)
19:06:25 #agreed Infra bug sprint is being moved to Wednesday, September 14
19:06:38 * fungi double checks calendars
19:06:49 yeah, that looks right
19:06:57 yup, the 14th is the wed
19:07:02 #action pleia2 set up sprint booking for infra bug day
19:07:38 anything else on this? maybe at 15:00-16:00 utc during the sprint, we can take the hour to crash^H^H^H^H^Hpitch in on the storyboard meeting
19:08:07 :D
19:08:26 #topic Specs approval
19:08:27 or that can be when we begin
19:08:33 #info APPROVED: Docs Publishing via AFS
19:08:39 #link http://specs.openstack.org/openstack-infra/infra-specs/specs/doc-publishing.html Docs Publishing via AFS
19:08:51 #topic Specs approval: PROPOSED "Zuulv3: drop variable interpolation and add nodesets"
19:09:00 #link https://review.openstack.org/361463 Zuulv3: drop variable interpolation and add nodesets
19:09:28 this change seems to have superseded the one we talked about last week, and appears to have some mindshare at this point
19:09:47 sharing minds is important
19:09:50 so i'm going to guess it's jeblair's intent that we move forward with it instead
19:09:56 fungi: you added this one to the agenda though right? not jeblair?
19:10:01 that is my understanding
19:10:32 anteaya: yeah, well he asked last week that we vote on its predecessor, but then he superseded it with this new alternative before the voting period ended
19:10:44 fungi: ah okay thank you
19:10:44 I thought we talked about this one last week?
19:11:00 fungi: you explained before in -infra, just wanted the flow again for the logs
19:11:10 fungi: thanks for being willing to repeat yourself
19:11:10 hrm, lemme go find it again
19:11:27 oh! yes
19:11:38 there were two versions, he had linked the earlier one
19:11:55 okay, so i'm going ahead and approving 361463 now
19:12:05 oh okay that was easy
19:12:24 i forgot that the wrong (old) one had been linked in the agenda
19:12:30 ah
19:13:15 #info APPROVED "Zuulv3: drop variable interpolation and add nodesets"
19:13:45 #topic Priority Efforts
19:14:28 we have some time, if there's anybody who had something in the priority efforts list they wanted to cover but were hesitant to put on the agenda
19:14:57 oh, i know, we were going to discuss making the "Docs Publishing via AFS" a priority spec
19:15:10 +1 priority spec
19:15:28 I started the newton-xenialing late last week as we seem to be stabilizing our image build/new cloud provider situation. I think nova is all done at this point. So the migration is getting there
19:15:37 clarkb: yay
19:15:39 i think we're needing some additional volunteers to help with the docs one
19:15:51 I should be able to start work on the AFS docs spec this week
19:16:16 so feel free to assign me some tasks
19:16:23 last meeting mordred and pabelanger volunteered
19:16:31 yup
19:16:32 so that's probably plenty
19:16:40 I think we can make the progresses
19:17:01 i'll propose a change to add the two of you as additional assignees and make it a priority spec. just a sec
19:18:22 huh, it also needs a review topic
19:19:14 afs-docs ?
19:19:39 yep
19:20:16 what a guess
19:20:38 great minds
19:20:55 :)
19:21:29 can we have more meetings like this?
19:21:36 this is great
19:21:56 sorry for the delay
19:22:07 #link https://review.openstack.org/366303 Prioritize Docs Publishing via AFS
19:22:38 #info Council voting will remain open on the "Prioritize Docs Publishing via AFS" change until 19:00 UTC on Thursday, September 8
19:22:53 no no, I loved it
19:23:17 a moment to actually relax with the team
19:23:26 #topic Priority Efforts: Newton testing on Xenial
19:23:32 clarkb: you were saying...?
19:24:39 just that I started picking this work up again
19:24:48 "nova is all done" meaning their unit tests, devstack jobs, everything?
19:24:49 since the more pressing feature freeze "make clouds work" stuff has slowed down
19:25:19 fungi: ya for a master/newton change against nova their check/gate jobs appear to all have been xenialed where necessary according to zuul status
19:25:22 i've been a little too out of touch with this one, so couldn't recall what was already running on xenial
19:25:37 basically unit tests, docs, pep8, functional tests, and integration tests
19:25:56 then grenade will happen when we switch grenade to doing newton -> master instead of mitaka -> master
19:26:36 I think I am going to focus on using the zuul status page to identify the current gaps as that shows me projects that are active and haven't been xenialed yet
19:26:39 for other projects generally, what's the status? are most of them doing non-integration jobs on xenial already but not devstack-based?
19:26:48 there is some work in progress to conservatively test some of the stragglers for neutron for example
19:27:15 fungi: correct. Basically the common "core" of python-jobs docs and pep8 unittests etc is mostly done.
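
[Editor's note: a minimal sketch of the zuul-status-page sampling clarkb mentions above (and fungi's status.json suggestion just below). It assumes the status endpoint at http://zuul.openstack.org/status.json and the pipelines -> change_queues -> heads -> changes layout of that era; the "no ubuntu-xenial in the job name" check is only a rough heuristic for spotting projects that have not been xenialed yet.]

#!/usr/bin/env python3
# Hypothetical helper: sample Zuul's status.json once and list, per project,
# the currently running jobs whose names carry no ubuntu-xenial variant.
# Assumes the pipelines -> change_queues -> heads -> changes -> jobs layout.
import collections
import json
import urllib.request

STATUS_URL = 'http://zuul.openstack.org/status.json'  # assumed endpoint

def sample_jobs(url=STATUS_URL):
    with urllib.request.urlopen(url) as resp:
        status = json.loads(resp.read().decode('utf-8'))
    jobs_by_project = collections.defaultdict(set)
    for pipeline in status.get('pipelines', []):
        for queue in pipeline.get('change_queues', []):
            for head in queue.get('heads', []):
                for change in head:
                    for job in change.get('jobs', []):
                        jobs_by_project[change.get('project')].add(job.get('name'))
    return jobs_by_project

if __name__ == '__main__':
    for project, jobs in sorted(sample_jobs().items()):
        stale = sorted(j for j in jobs if j and 'ubuntu-xenial' not in j)
        if stale:
            print(project, stale)

[Sampling this repeatedly over a day or two gives a rough picture of which active projects still have gaps, which is the point made in the discussion.]
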
19:27:19 and yeah, periodically sampling the zuul status.json is a great way to see what's running most often (though i guess you could parse it out of zuul's logs)
19:27:32 then any projects using the common integration tests also are done
19:27:46 where the biggest gap seems to be now is where we have all of the one off jobs
19:28:01 oslo messaging for example has like 30 jobs for this massive matrix of testing against different backends
19:28:12 and ironic had >100 integration tests that need modifying last I counted
19:28:16 oh, or i guess you could get it from grpahite too
19:28:34 graphite
19:29:06 ironic isn't actually _running_ all those jobs though, right?
19:29:28 like they just did some sort of matric expansion template and only run a handful of the resulting set?
19:29:34 matrix
19:29:44 yes, though it's a fairly large subset of that matrix
19:29:56 * fungi is typing worse than usual today
19:29:59 it's complicated enough that I handed off to ironic a few weeks back and asked them to sort out what they want to do there
19:30:12 got it
19:30:38 any specific bits of this you need volunteers to jump in and help on besides reviewing?
19:31:26 would also help if others could keep pushing patches to split the jobs between trusty and xenial as well
19:31:50 it's just going to be slow, methodical work
19:32:47 I should probably also send mail to the dev list asking projects that have a lot of one off stuff like oslo.messaging and ironic and so on to do their own evaluation from their side
19:32:52 cool--hopefully we have people interested enough in this (as it's a team priority) who might at least be comfortable with being able to follow the recipe from your previous changes to propose some more
19:33:10 but yes, an ml thread covering that would be helpful
19:34:06 #info Assistance welcome proposing incremental patches to split jobs between trusty (for <=stable/mitaka) and xenial (for >=stable/newton) welcome
19:34:10 er
19:34:12 #undo
19:34:13 Removing item from minutes:
19:34:18 #info Assistance welcome proposing incremental patches to split jobs between trusty (for <=stable/mitaka) and xenial (for >=stable/newton)
19:34:30 don't want people to feel doubly-welcomed
19:34:38 that's just a bit too much welcome, even for us
19:35:05 still not seeing any last-minute additions to the agenda, so...
19:35:10 #topic Open discussion
19:35:19 anyone have anything else of a general nature to bring up?
19:35:35 we have a lot of clouds now
19:35:40 i've noticed
19:35:48 turned up osic-cloud8 today
19:36:00 while bumping internap-mtl01 to 150
19:36:02 between the additional quota and feature freeze, zuul is hardly breaking a sweat
19:36:21 nice
19:36:26 Ya, things look real good
19:36:33 pabelanger: really great work on all this, I know you've been putting in a lot
19:36:50 we ran out of disk on our log server and only have 2 months of log retention right now so I am poking around to see if I can find where all of that disk went
19:36:56 I hope to bring rax-iad back online now that we have a new glean release too
19:36:57 oh, right, that
19:37:01 pabelanger: yes excellent work pushing on the clouds
19:37:04 pabelanger: nice, thank you osic and internap for the additional nodes
19:37:04 ++
19:37:05 pleia2: np, happy to do the work
19:37:24 looks like gerrit still has the memory leak issue, anybody have any more ideas on tweaking? or suggest another course of action?
19:37:41 i have a feeling one part of it is related just to general increase in activity. would be interesting to see if the upward curve in log volume started around the same time as an upward curve in our job volume, for example
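
[Editor's note: on the graphite suggestion above, the render API can return JSON series that make the "did job volume grow when log volume grew" comparison fairly easy. The sketch below is illustrative only; the statsd metric name is a guess and would need to be checked against what zuul actually emits.]

#!/usr/bin/env python3
# Hypothetical sketch: pull a daily job-count series from graphite's render
# API so it can be compared with log-volume growth over the same window.
import json
import urllib.parse
import urllib.request

GRAPHITE = 'http://graphite.openstack.org/render'
TARGET = 'summarize(stats_counts.zuul.pipeline.gate.all_jobs,"1d","sum")'  # assumed metric key

def fetch_series(target=TARGET, since='-90days'):
    query = urllib.parse.urlencode({'target': target, 'from': since, 'format': 'json'})
    with urllib.request.urlopen('%s?%s' % (GRAPHITE, query)) as resp:
        return json.loads(resp.read().decode('utf-8'))

if __name__ == '__main__':
    for series in fetch_series():
        for value, timestamp in series.get('datapoints', []):
            if value is not None:
                print(timestamp, value)
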
19:38:21 i thought we were being defensive on bumping nodes prior to release
19:38:22 zaro: what sort of details do you think we should be gathering to provide to upstream to help track down the source of the leak?
19:38:24 re: https://review.openstack.org/#/c/364101/
19:38:37 should we wait to merge this?
19:38:39 fungi: ya but I don't think we doubled our job activity. The first major thing I notice is that a lot of jobs don't compress their logs before uploading them
19:39:03 and we don't appear to be compressing console logs anymore?
19:39:11 clarkb: oh, right, and we only do a post-upload compression pass weekly
19:39:25 oh that would explain why I don't see compressed console logs from yesterday
19:39:37 compressing those upfront might be a good feature for zuul-launcher
19:40:01 right, that's done each sunday by log_archive_maintenance.sh
19:40:10 rcarrillocruz: we should figure out why infra cloud can't handle 10 servers before we bump to 50
19:40:11 fungi: i've tried to provide all info i know that might help diagnose but we haven't gotten much help from upstream. https://groups.google.com/d/msg/repo-discuss/oj3h3JdioGs/37HTjJieBwAJ
19:40:13 #link http://git.openstack.org/cgit/openstack-infra/puppet-openstackci/tree/files/log_archive_maintenance.sh log_archive_maintenance.sh
19:40:17 rcarrillocruz: So, I noticed some ConnectionTimeout exceptions on the controller, we should figure out what is causing that first
19:40:18 rcarrillocruz: that timeout is likely to only get worse if we bump it
19:41:02 right, let's talk about that in -infra, because i don't know if it was due to sqlite or if you found the root cause already
19:41:05 fungi: i'm open to suggestions on providing anything else that may help but i'm not sure i know what other info to add.
19:41:25 #link https://groups.google.com/d/msg/repo-discuss/oj3h3JdioGs/37HTjJieBwAJ
19:43:32 the recent tweak with httpd threads did seem to help though, so i'm open to trying other tweaks if that's the path we want to take.
19:43:33 zaro: skimming that, i wonder if reducing our cache retention would help
19:44:50 zaro: do you happen to know what gerrit release gerrithub.io is on?
19:44:54 hmm, i think a cache flush should lower memory usage if that's the issue
19:45:24 ahh, there we go. looks like they're running 2.12-1426-g7e45a46
19:46:00 yeah, different version.
19:46:09 zaro: good point. i don't know that we've explicitly tested a cache flush to see what happens with memory usage inside the jvm. i'll give that a shot in a bit
19:46:14 i'm wondering if we should put effort into upgrading instead of tweaking?
19:46:27 zaro: only if the upgrade is better ;)
19:46:48 zaro: it's sort of hard to test the "make gerrit crash after 2 weeks of use" scenario though
19:46:50 zaro: well, upgrading _is_ something we should put effort into, but we may still find ourselves in need of tweaking after that's done
19:46:52 polygerrit UI is still not ready though
19:48:36 i'm cool either way, just wondering which way you think we should go.
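
[Editor's note: a rough sketch of the cache-flush experiment discussed above and captured as an action item below. Gerrit's SSH admin interface provides the show-caches and flush-caches commands, so capturing show-caches output before and after a flush gives something concrete to compare. The host, port, and admin access are assumptions.]

#!/usr/bin/env python3
# Sketch of the flush-caches experiment: record `gerrit show-caches` output
# before and after `gerrit flush-caches --all` so the reported statistics can
# be compared.  Assumes SSH admin access on the standard Gerrit port.
import subprocess

GERRIT_SSH = ['ssh', '-p', '29418', 'review.openstack.org', 'gerrit']

def gerrit(*args):
    return subprocess.run(GERRIT_SSH + list(args),
                          check=True, capture_output=True, text=True).stdout

if __name__ == '__main__':
    before = gerrit('show-caches')
    gerrit('flush-caches', '--all')
    after = gerrit('show-caches')
    print('--- before ---\n%s\n--- after ---\n%s' % (before, after))
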
19:48:56 I will admit that I am extremely wary of gerrit upgrades at this point
19:49:05 #action fungi Test whether gerrit flush-caches significantly reduces memory utilization in Gerrit's JVM
19:49:17 the last couple we have done appear to have caught critical bugs that no one else reported and upstream was unwilling to fix for significant amounts of time
19:49:53 wondering how we can do better testing on our end since upstream isn't able to shake the issues out
19:50:03 i would recommend skipping 2.12 and going to 2.13 but that's only in rc at this point. so maybe upgrade is premature at this point.
19:50:25 we're unlikely to want to upgrade between now and release day anyway
19:50:42 zaro: is there a reason?
19:50:56 so maybe continue to look for a memory fix until 2.13 shakes out.
19:51:11 ah okay
19:51:31 i assume the suggestion for 2.13 is that at least some of the backports we're running aren't present in 2.12 but only in 2.13?
19:52:07 do they release on a schedule, or have any eta on 2.13?
19:52:08 yes, i believe 2.13 has more fixes we need.
19:52:19 I wonder if this would be a use case for a read-only slave gerrit
19:52:19 no schedule.
19:52:36 basically have it chase upstream gerrit master and we can see if it breaks more or less awesome than the RW master
19:52:49 (that's probably a really complicated setup to keep running though)
19:53:04 i expect that unless people were using it as actively as the r+w master, we wouldn't really know anyway
19:53:05 db migrations etc, not sure how you resolve that
19:53:11 fungi: ya that too
19:53:15 clarkb: can do, but yeah more complex setup
19:53:41 gerritforge does that for active/pass setup
19:53:48 pass/passive
19:54:02 for example, the big mess where we had to roll back (was it 2.10? 2.9?) only showed up once activity ramped up
19:54:51 so we could very easily have not seen it at all even on a full copy of our production data, because it wouldn't have a full copy of our user volume
19:54:57 fungi: yes though it was trivially reproducible on the dev server once we understood the behavior
19:55:28 right. just pointing out that we'd need a load generator that mostly mimics our usage
19:55:41 no idea how hard that would be to build
19:56:44 though it's worth noting that, at least for now, review-dev has the same size server as production, so we could just leave it that way and let people try to build a load tester that exercises it
19:57:17 ok. well maybe think about tweaks we can make to avoid the memory leak until it's a viable option to discuss another gerrit upgrade?
19:57:18 back when our git-review testing ran against review-dev, i was sort of able to generate some of that kind of load
19:57:34 just by firing up multiple instances of the test script
19:57:57 review-dev load is very basic. not at all like review at this point.
19:58:13 though that script obviously only tested git-review relevant interactions, and certainly not browsery things
19:59:24 ya it's always difficult to mimic your users
19:59:26 zaro: i think we should plan for another gerrit upgrade regardless. there is a released version newer than what we're running, and we know from experience that running an extremely outdated version only delays the inevitable upgrade pain
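
[Editor's note: to make the load-generator idea above concrete, a crude stand-in for the old git-review test script could be several workers running read-only git and REST requests against review-dev in parallel. Everything here is illustrative: the hostname, the anonymous HTTP git path, the project name, and the request mix are all assumptions, and a realistic tester would also need to cover pushes and browser-style traffic.]

#!/usr/bin/env python3
# Illustrative load-generator sketch only: a handful of workers issue
# read-only requests against review-dev.  Host, paths and rates are assumed.
import subprocess
import threading
import time
import urllib.request

HOST = 'review-dev.openstack.org'   # assumed dev server
PROJECT = 'openstack-dev/sandbox'   # assumed test project

def worker(iterations=20):
    for _ in range(iterations):
        # Exercise the git-over-HTTP path (assumed anonymous read access).
        subprocess.run(['git', 'ls-remote', 'https://%s/%s' % (HOST, PROJECT)],
                       stdout=subprocess.DEVNULL, check=False)
        # Exercise the REST API the way dashboards do.
        urllib.request.urlopen('https://%s/changes/?q=status:open&n=25' % HOST).read()
        time.sleep(1)

if __name__ == '__main__':
    threads = [threading.Thread(target=worker) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
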
19:59:39 especially since gerrit has bits of the ui still not exposed via the rest api iirc
19:59:54 we resolved, after we got off our 2.4 fork, to try and keep up with upstream releases
19:59:55 I wonder if there's some open source test tooling for testing browser load, there must be
20:00:01 worth looking into
20:00:04 I might poke around that
20:00:08 hey! we managed to fill up the hour
20:00:12 thanks fungi
20:00:14 :)
20:00:15 thanks everyone!
20:00:19 #endmeeting
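
[Editor's note: on the open-source load-tooling question at the end, one option worth evaluating is Locust, which scripts simulated users in Python. The sketch below only drives a couple of the REST calls the Gerrit web UI relies on, not the JavaScript UI itself (exercising that would need browser automation such as Selenium); the endpoints, weights, and wait times are illustrative and use Locust's current HttpUser API.]

# locustfile.py -- illustrative sketch only; endpoints and weights are assumptions.
from locust import HttpUser, between, task

class GerritDashboardUser(HttpUser):
    # Wait a few seconds between simulated user actions.
    wait_time = between(2, 10)

    @task(3)
    def open_changes(self):
        # The query behind a typical "open changes" dashboard view.
        self.client.get("/changes/?q=status:open&n=25")

    @task(1)
    def change_detail(self):
        # Detail JSON for a single change; the change number is a placeholder.
        self.client.get("/changes/366303/detail", name="/changes/[id]/detail")

[Something like "locust -f locustfile.py --host https://review-dev.openstack.org" would start it against the dev server, with the simulated user count ramped up from Locust's own UI.]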