19:03:42 <fungi> #startmeeting infra
19:03:43 <openstack> Meeting started Tue Sep  6 19:03:42 2016 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:44 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:46 <openstack> The meeting name has been set to 'infra'
19:03:47 <bkero> o/
19:03:49 <fungi> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:03:52 <hrybacki> o/
19:03:52 <crinkle> o/
19:03:55 <fungi> #topic Announcements
19:04:00 <fungi> #info Reminder: late-cycle joint Infra/QA get-together to be held September 19-21 (CW38) at SAP offices in Walldorf, DE
19:04:02 <fungi> #link https://wiki.openstack.org/wiki/Sprints/QAInfraNewtonSprint
19:04:09 <fungi> #topic Actions from last meeting
19:04:13 <fungi> #link http://eavesdrop.openstack.org/meetings/infra/2016/infra.2016-08-30-19.02.html
19:04:19 <fungi> pleia2 set up sprint booking for infra bug day
19:04:24 <fungi> last week we settled on the 12th, but it looks like that's taken for an upstream training sprint in #openstack-sprint
19:04:32 <pleia2> yeah
19:04:32 <fungi> #link https://wiki.openstack.org/wiki/VirtualSprints#Upstream_training_Sprint
19:04:37 <pleia2> I am ok moving to tuesday or wednesday
19:04:50 <fungi> wfm
19:05:06 <anteaya> wednesday
19:05:11 <anteaya> fewer meetings
19:05:17 <pleia2> yeah, Tuesday is pretty meeting heavy
19:05:19 <clarkb> either day works for me though wednesday is slightly better
19:05:19 <fungi> anybody who wanted to participate in the infra bug sprint have issues with or a preference for wednesday?
19:05:29 <bkero> sgtm
19:05:44 <bkero> tuesday similarly heavy with meetings *cough*
19:05:47 <fungi> and yeah, i'll echo the concern about meetings consuming most of tuesday
19:05:52 <pleia2> ok, I'll get that firmed up today w/ wiki+announcement
19:06:10 <anteaya> pleia2: thank you
19:06:11 <zaro> either day wfm
19:06:16 <Zara> (weds is storyboard meeting so... don't do anything exciting at 15:00 UTC)
19:06:25 <fungi> #agreed Infra bug sprint is being moved to Wednesday, September 14
19:06:38 * fungi double checks calendars
19:06:49 <fungi> yeah, that looks right
19:06:57 <anteaya> yup, the 14th is the wed
19:07:02 <fungi> #action pleia2 set up sprint booking for infra bug day
19:07:38 <fungi> anything else on this? maybe at 15:00-16:00 utc during the sprint, we can take the hour to crash^H^H^H^H^Hpitch in on the storyboard meeting
19:08:07 <Zara> :D
19:08:26 <fungi> #topic Specs approval
19:08:27 <pleia2> or that can be when we begin
19:08:33 <fungi> #info APPROVED: Docs Publishing via AFS
19:08:39 <fungi> #link http://specs.openstack.org/openstack-infra/infra-specs/specs/doc-publishing.html Docs Publishing via AFS
19:08:51 <fungi> #topic Specs approval: PROPOSED "Zuulv3: drop variable interpolation and add nodesets"
19:09:00 <fungi> #link https://review.openstack.org/361463 Zuulv3: drop variable interpolation and add nodesets
19:09:28 <fungi> this change seems to have superseded the one we talked about last week, and appears to have some mindshare at this point
19:09:47 <mordred> sharing minds is important
19:09:50 <fungi> so i'm going to guess it's jeblair's intent that we move forward with it instead
19:09:56 <anteaya> fungi: you added this one to the agenda though right? not jeblair?
19:10:01 <mordred> that is my understanding
19:10:32 <fungi> anteaya: yeah, well he asked last week that we vote on its predecessor, but then he superseded it with this new alternative before the voting period ended
19:10:44 <anteaya> fungi: ah okay thank you
19:10:44 <clarkb> I thouight we talked about this one last week?
19:11:00 <anteaya> fungi: you explained before in -infra, just wanted the flow again for the logs
19:11:10 <anteaya> fungi: thanks for being willing to repeat yourself
19:11:10 <fungi> hrm, lemme go find it again
19:11:27 <fungi> oh! yes
19:11:38 <fungi> there were two versions, he had linked the earlier one
19:11:55 <fungi> okay, so i'm going ahead and approving 361463 now
19:12:05 <anteaya> oh okay that was easy
19:12:24 <fungi> i forgot that the wrong (old) one had been linked in the agenda
19:12:30 <anteaya> ah
19:13:15 <fungi> #info APPROVED "Zuulv3: drop variable interpolation and add nodesets"
19:13:45 <fungi> #topic Priority Efforts
19:14:28 <fungi> we have some time, if there's anybody who had something in the priority efforts list they wanted to cover but were hesitant to put on the agenda
19:14:57 <fungi> oh, i know, we were going to discuss making the "Docs Publishing via AFS" a priority spec
19:15:10 <anteaya> +1 priority spec
19:15:28 <clarkb> I started the newton-xenialing late last week as we seem to be stabilizing our image build/new cloud provider situation. I think nova is all done at this point. So the migration is getting there
19:15:37 <anteaya> clarkb: yay
19:15:39 <fungi> i think we're needing some additional volunteers to help with the docs one
19:15:51 <pabelanger> I should be able to start work on the AFS docs spec this week
19:16:16 <pabelanger> so feel free to assign me some tasks
19:16:23 <fungi> last meeting mordred and pabelanger volunteered
19:16:31 <mordred> yup
19:16:32 <fungi> so that's probably plenty
19:16:40 <mordred> I think we can make the progresses
19:17:01 <fungi> i'll propose a change to add the two of you as additional assignees and make it a priority spec. just a sec
19:18:22 <fungi> huh, it also needs a review topic
19:19:14 <anteaya> afs-docs ?
19:19:39 <fungi> yep
19:20:16 <anteaya> what a guess
19:20:38 <pleia2> great minds
19:20:55 <anteaya> :)
19:21:29 <anteaya> can we have more meetings like this?
19:21:36 <anteaya> this is great
19:21:56 <fungi> sorry for the delay
19:22:07 <fungi> #link https://review.openstack.org/366303 Prioritize Docs Publishing via AFS
19:22:38 <fungi> #info Council voting will remain open on the "Prioritize Docs Publishing via AFS" change until 19:00 UTC on Thursday, September 8
19:22:53 <anteaya> no no, I loved it
19:23:17 <anteaya> a moment to actually relax with the team
19:23:26 <fungi> #topic Priority Efforts: Newton testing on Xenial
19:23:32 <fungi> clarkb: you were saying...?
19:24:39 <clarkb> just that I started picking this work up again
19:24:48 <fungi> "nova is all done" meaning their unit tests, devstack jobs, everything?
19:24:49 <clarkb> since the more pressing make-clouds-work-before-feature-freeze stuff has slowed down
19:25:19 <clarkb> fungi: ya for a master/newton change against nova their check/gate jobs appear to all have been xenialed where necessary according to zuul status
19:25:22 <fungi> i've been a little too out of touch with this one, so couldn't recall what was already running on xenial
19:25:37 <clarkb> basically unit tests, docs, pep8, functional tests, and integration tests
19:25:56 <clarkb> then grenade will happen when we switch grenade to doing newton -> master instead of mitaka -> master
19:26:36 <clarkb> I think I am going to focus on using the zuul status page to identify the current gaps as that shows me projects that are active and haven't been xenialed yet
19:26:39 <fungi> for other projects generally, what's the status? are most of them doing non-integration jobs on xenial already but not devstack-based?
19:26:48 <clarkb> there is some work in progress to conservatively test some of the stragglers for neutron for example
19:27:15 <clarkb> fungi: correct. Basically the common "core" of python-jobs (docs, pep8, unit tests, etc.) is mostly done.
19:27:19 <fungi> and yeah, periodically sampling the zuul status.json is a great way to see what's running most often (though i guess you could parse it out of zuul's logs)
19:27:32 <clarkb> then any projects using the common integration tests also are done
19:27:46 <clarkb> where the biggest gap seems to be now is where we have all of the one off jobs
19:28:01 <clarkb> oslo messaging for example has like 30 jobs for this massive matrix of testing against different backends
19:28:12 <clarkb> and ironic had >100 integration tests that need modifying last I counted
19:28:16 <fungi> oh, or i guess you could get it from grpahite too
19:28:34 <fungi> graphite
19:29:06 <fungi> ironic isn't actually _running_ all those jobs though, right?
19:29:28 <fungi> like they just did some sort of matric expansion template and only run a handful of the resulting set?
19:29:34 <fungi> matrix
19:29:44 <clarkb> yes though its a fairly large subset of that matrix
19:29:56 * fungi it typing worse than usual today
19:29:59 <clarkb> its complicated enough that I handed off to ironic a few weeks back and asked them to sort out what they want to do there
19:30:12 <fungi> got it
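A minimal sketch of the status.json sampling clarkb and fungi describe above. The endpoint URL, the Zuul v2 layout of pipelines -> change_queues -> heads, and the -ubuntu-trusty/-ubuntu-xenial job-name suffixes are assumptions here, not details confirmed in the meeting:

    # Tally currently-running jobs by the platform suffix in their names,
    # using Zuul's public status.json (URL and JSON layout assumed).
    import collections
    import json
    import urllib2

    STATUS_URL = 'http://zuul.openstack.org/status.json'  # assumed endpoint

    def sample_platforms(url=STATUS_URL):
        counts = collections.Counter()
        status = json.load(urllib2.urlopen(url))
        for pipeline in status.get('pipelines', []):
            for queue in pipeline.get('change_queues', []):
                for head in queue.get('heads', []):
                    for change in head:
                        for job in change.get('jobs', []):
                            name = job.get('name', '')
                            if name.endswith('-ubuntu-xenial'):
                                counts['xenial'] += 1
                            elif name.endswith('-ubuntu-trusty'):
                                counts['trusty'] += 1
                            else:
                                counts['other'] += 1
        return counts

    if __name__ == '__main__':
        print(sample_platforms())

Sampling this periodically over a day or so gives a rough picture of which job variants are actually being run, which is the "projects that are active and haven't been xenialed yet" view clarkb mentions.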
19:30:38 <fungi> any specific bits of this you need volunteers to jump in and help on besides reviewing?
19:31:26 <clarkb> it would also help if others could keep pushing patches to split the jobs between trusty and xenial
19:31:50 <clarkb> its just going to be slow methodical work
19:32:47 <clarkb> I should probably also send mail to the dev list asking projects that have a lot of one off stuff like oslo.messaging and ironic and so on to do their own evaluation from their side
19:32:52 <fungi> cool--hopefully we have people interested enough in this (as it's a team priority) who are at least comfortable following the recipe from your previous changes to propose some more
19:33:10 <fungi> but yes, an ml thread covering that would be helpful
19:34:06 <fungi> #info Assistance welcome proposing incremental patches to split jobs between trusty (for <=stable/mitaka) and xenial (for >=stable/newton) welcome
19:34:10 <fungi> er
19:34:12 <fungi> #undo
19:34:13 <openstack> Removing item from minutes: <ircmeeting.items.Info object at 0x7f5bbff76b50>
19:34:18 <fungi> #info Assistance welcome proposing incremental patches to split jobs between trusty (for <=stable/mitaka) and xenial (for >=stable/newton)
19:34:30 <fungi> don't want people to feel doubly-welcomed
19:34:38 <fungi> that's just a bit too much welcome, even for us
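A toy illustration of the split rule recorded in the #info above, assuming the usual stable branch names; the real split is expressed in project-config job and layout definitions rather than in code like this:

    # Toy sketch: which node platform a job variant should target per branch,
    # per the rule in the #info above (<=stable/mitaka on trusty, anything
    # newer on xenial). Branch names listed here are assumed for illustration.
    import re

    TRUSTY_BRANCHES = re.compile(r'^stable/(kilo|liberty|mitaka)$')

    def node_label_for(branch):
        if TRUSTY_BRANCHES.match(branch):
            return 'ubuntu-trusty'
        # master and stable/newton (and anything newer) move to xenial
        return 'ubuntu-xenial'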
19:35:05 <fungi> still not seeing any last-minute additions to the agenda, so...
19:35:10 <fungi> #topic Open discussion
19:35:19 <fungi> anyone have anything else of a general nature to bring up?
19:35:35 <pabelanger> we have a lot of clouds now
19:35:40 <fungi> i've noticed
19:35:48 <pabelanger> turned up osic-cloud8 today
19:36:00 <pabelanger> while bumping internap-mtl01 to 150
19:36:02 <fungi> between the additional quota and feature freeze, zuul is hardly breaking a sweat
19:36:21 <pleia2> nice
19:36:26 <pabelanger> Ya, things look real good
19:36:33 <pleia2> pabelanger: really great work on all this, I know you've been putting in a lot
19:36:50 <clarkb> we ran out of disk on our log server and only have 2 months of log retention right now, so I am poking around to see if I can find where all of that disk went
19:36:56 <pabelanger> I hope to turn on rax-iad back online now that we have a new glean release too
19:36:57 <fungi> oh, right, that
19:37:01 <clarkb> pabelanger: yes excellent work pushing on the clouds
19:37:04 <anteaya> pabelanger: nice, thank you osic and internap for the additional nodes
19:37:04 <mordred> ++
19:37:05 <pabelanger> pleia2: np, happy to do the work
19:37:24 <zaro> looks like gerrit still has a memory leak issue, anybody have any more ideas for tweaking, or suggestions for another course of action?
19:37:41 <fungi> i have a feeling one part of it is related just to general increase in activity. would be interesting to see if the upward curve in log volume started around the same time as an upward curve in our job volume, for example
19:38:21 <rcarrillocruz> i thought we were being defensive about bumping nodes prior to release
19:38:22 <fungi> zaro: what sort of details do you think we should be gathering to provide to upstream to help track down the source of the leak?
19:38:24 <rcarrillocruz> re: https://review.openstack.org/#/c/364101/
19:38:37 <rcarrillocruz> should we wait for merging this?
19:38:39 <clarkb> fungi: ya but I don't think we doubled our job activity. The first major thing I notice is that a lot of jobs don't compress their logs before uploading them
19:39:03 <clarkb> and we don't appear to be compressing console logs anymore?
19:39:11 <fungi> clarkb: oh, right, and we only do a post-upload compression pass weekly
19:39:25 <clarkb> oh that would explain why I don't see compressed console logs from yesterday
19:39:37 <clarkb> compressing those upfront might be a good feature for zuul-launcher
19:40:01 <fungi> right, that's done each sunday by log_archive_maintenance.sh
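A rough sketch of the kind of poking around clarkb mentions, assuming job logs live under a path like /srv/static/logs and that uncompressed files (anything not already .gz) are the main suspects:

    # Sum bytes of uncompressed files per top-level directory under the log
    # root, to show where the disk went. The log path is an assumption.
    import os

    LOG_ROOT = '/srv/static/logs'  # assumed location on the log server

    def uncompressed_usage(root=LOG_ROOT):
        totals = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            top = os.path.relpath(dirpath, root).split(os.sep)[0]
            for name in filenames:
                if name.endswith('.gz'):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    totals[top] = totals.get(top, 0) + os.path.getsize(path)
                except OSError:
                    pass  # file vanished or unreadable; skip it
        return totals

    if __name__ == '__main__':
        usage = sorted(uncompressed_usage().items(),
                       key=lambda kv: kv[1], reverse=True)
        for top, size in usage[:20]:
            print('%8d MiB  %s' % (size // (1024 * 1024), top))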
19:40:10 <clarkb> rcarrillocruz: we should figure out why infra cloud can't handle 10 servers before we bump to 50
19:40:11 <zaro> fungi: i've tried to provide all info i know that might help diagnose but we haven't gotten much help from upstream. https://groups.google.com/d/msg/repo-discuss/oj3h3JdioGs/37HTjJieBwAJ
19:40:13 <fungi> #link http://git.openstack.org/cgit/openstack-infra/puppet-openstackci/tree/files/log_archive_maintenance.sh log_archive_maintenance.sh
19:40:17 <pabelanger> rcarrillocruz: So, I noticed some ConnectionTimeout exceptions on the controller, we should figure out what is causing that first
19:40:18 <clarkb> rcarrillocruz: that timeout is likely to only get worse if we bump it
19:41:02 <rcarrillocruz> right, let's talk about that in -infra, because i don't know if it was due to sqlite or if you found the root cause already
19:41:05 <zaro> fungi: i'm open to suggestions on providing anything else that may help, but i'm not sure what other info to add.
19:41:25 <fungi> #link https://groups.google.com/d/msg/repo-discuss/oj3h3JdioGs/37HTjJieBwAJ
19:43:32 <zaro> the recent tweak with httpd threads did seem to help though, so i'm open to trying other tweaks if that's the path we want to take.
19:43:33 <fungi> zaro: skimming that, i wonder if reducing our cache retention would help
19:44:50 <fungi> zaro: do you happen to know what gerrit release gerrithub.io is on?
19:44:54 <zaro> hmm, i think flush cache should lower memory usage if that's the issue
19:45:24 <fungi> ahh, there we go. looks like they're running 2.12-1426-g7e45a46
19:46:00 <zaro> yeah, different version.
19:46:09 <fungi> zaro: good point. i don't know that we've explicitly tested a cache flush to see what happens with memory usage inside the jvm. i'll give that a shot in a bit
19:46:14 <zaro> i'm wondering if we should put effort into upgrading instead of tweaking?
19:46:27 <clarkb> zaro: only if the upgrade is better ;)
19:46:48 <clarkb> zaro: its sort of hard to test the "make gerrit crash after 2 weeks of use" scenario though
19:46:50 <fungi> zaro: well, upgrading _is_ something we should put effort into, but we may still find ourselves in need of tweaking after that's done
19:46:52 <zaro> polygerrit UI is still not ready though
19:48:36 <zaro> i'm cool either way, just wondering which way you think we should go.
19:48:56 <clarkb> I will admit that I am extremely wary of gerrit upgrades at this point
19:49:05 <fungi> #action fungi Test whether gerrit flush-caches significantly reduces memory utilization in Gerrit's JVM
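One way the flush-caches test in that #action might be run, over Gerrit's SSH interface. The show-caches/flush-caches command names are standard Gerrit 2.x; the host, port, and reliance on --show-jvm to report JVM memory figures are assumptions to verify before using:

    # Snapshot JVM/cache stats, flush all caches, then snapshot again so the
    # before/after memory figures can be compared. Host/port are assumed.
    import subprocess

    GERRIT_SSH = ['ssh', '-p', '29418', 'review.openstack.org', 'gerrit']

    def gerrit(*args):
        return subprocess.check_output(GERRIT_SSH + list(args))

    if __name__ == '__main__':
        print('--- before flush ---')
        print(gerrit('show-caches', '--show-jvm'))
        gerrit('flush-caches', '--all')
        print('--- after flush ---')
        print(gerrit('show-caches', '--show-jvm'))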
19:49:17 <clarkb> the last couple we have done appear to have caught critical bugs that no one else reported and upstream was unwilling to fix for significant amounts of time
19:49:53 <clarkb> wondering how we can do better testing on our end since upstream isn't able to shake the issues out
19:50:03 <zaro> i would recommend skipping 2.12 and going to 2.13, but that's only at rc right now. so maybe an upgrade is premature at this point.
19:50:25 <fungi> we're unlikely to want to upgrade between now and release day anyway
19:50:42 <anteaya> zaro: is there a reason?
19:50:56 <zaro> so maybe continue to look for memory fix until 2.13 shakes out.
19:51:11 <anteaya> ah okay
19:51:31 <fungi> i assume the suggestion for 2.13 is that at least some of the backports we're running aren't present in 2.12 but only in 2.13?
19:52:07 <fungi> do they release on a schedule, or have any eta on 2.13?
19:52:08 <zaro> yes, i believe 2.13 has more fixes we need.
19:52:19 <clarkb> I wonder if this would be a use case for a read only slave gerrit
19:52:19 <zaro> no schedule.
19:52:36 <clarkb> basically have it chase upstream gerrit master and we can see if it breaks more or less awesome than the RW master
19:52:49 <clarkb> (thats probably a really complicated setup to keep running though)
19:53:04 <fungi> i expect that unless people were using it as actively as the r+w master, we wouldn't really know anyway
19:53:05 <clarkb> db migrations etc, not sure how you resolve that
19:53:11 <clarkb> fungi: ya that too
19:53:15 <zaro> clarkb: can do, but yeah more complex setup
19:53:41 <zaro> gerritforge does that for active/passive setup
19:54:02 <fungi> for example, the big mess where we had to roll back (was it 2.10? 2.9?) only showed up once activity ramped up
19:54:51 <fungi> so we could very easily have not seen it at all even on a full copy of our production data, because it wouldn't have a full copy of our user volume
19:54:57 <clarkb> fungi: yes though it was trivially reproducible on the dev server once we understood the behavior
19:55:28 <fungi> right. just pointing out that we'd need a load generator that mostly mimics our usage
19:55:41 <fungi> no idea how hard that would be to build
19:56:44 <fungi> though it's worth noting that, at least for now, review-dev has the same size server as production, so we could just leave it that way and let people try to build a load tester that exercises it
19:57:17 <zaro> ok. well maybe think about tweaks we can make to avoid the memory leak until it's a viable option to discuss another gerrit upgrade?
19:57:18 <fungi> back when our git-review testing ran against review-dev, i was sort of able to generate some of that kind of load
19:57:34 <fungi> just by firing up multiple instances of the test script
19:57:57 <zaro> review-dev load is very basic. not at all like review at this point.
19:58:13 <fungi> though that script obviously only tested git-review relevant interactions, and certainly not browsery things
19:59:24 <clarkb> ya its always difficult to mimic your users
19:59:26 <fungi> zaro: i think we should plan for another gerrit upgrade regardless. there is a released version newer than what we're running, and we know from experience that the pain from running an extremely outdated version only delays the inevitable upgrade pain
19:59:39 <clarkb> especially since gerrit has bits of the ui still not exposed to the rest api iirc
19:59:54 <fungi> we resolved after we got off our 2.4 fork to try and keep up with upstream releases
19:59:55 <pleia2> I wonder if there's some open source test tooling for testing browser load, there must be
20:00:01 <fungi> worth looking into
20:00:04 <pleia2> I might poke around that
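One low-tech starting point for the load-tester idea, hitting only Gerrit's documented /changes/ REST query endpoint against a test server; the URL, query, and concurrency numbers are placeholders, and browser-level load (the "browsery things" above) would still need separate tooling:

    # Fire a fixed number of concurrent REST queries at a test Gerrit such as
    # review-dev. URL, query, and thread/request counts are placeholders.
    import threading
    import urllib2

    BASE_URL = 'https://review-dev.openstack.org'  # assumed test server
    QUERY = '/changes/?q=status:open&n=25'
    THREADS = 10
    REQUESTS_PER_THREAD = 100

    def worker():
        for _ in range(REQUESTS_PER_THREAD):
            try:
                urllib2.urlopen(BASE_URL + QUERY).read()
            except urllib2.URLError:
                pass  # keep going even if some requests fail

    if __name__ == '__main__':
        threads = [threading.Thread(target=worker) for _ in range(THREADS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()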
20:00:08 <fungi> hey! we managed to fill up the hour
20:00:12 <pleia2> thanks fungi
20:00:14 <clarkb> :)
20:00:15 <fungi> thanks everyone!
20:00:19 <fungi> #endmeeting