19:00:45 #startmeeting infra
19:00:46 Meeting started Tue Nov 17 19:00:45 2015 UTC and is due to finish in 60 minutes. The chair is pleia2. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:46 howdy
19:00:47 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:49 The meeting name has been set to 'infra'
19:00:51 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:00 o/
19:01:01 o/
19:01:05 o/
19:01:09 o/
19:01:12 alright, happy meeting time!
19:01:26 o/
19:01:26 #topic Announcements
19:01:42 o/
19:01:43 anyone have any announcements?
19:01:53 glad you're back?
19:01:54 o/
19:01:59 anteaya: thanks :)
19:02:11 pleia2: elasticsearch upgrade is done done
19:02:19 yay!
19:02:20 clarkb: cool
19:02:21 now moving on to logstash and kibana upgrades
19:02:27 whoo
19:02:33 * pleia2 nods
19:02:37 (if anyone wants to learn more about how we logstash I can walk people through testing)
19:02:54 clarkb: is the page showing unclassified errors now only showing logs from gate runs?
19:03:04 anteaya: no, logstash upgrade will address that
19:03:10 clarkb: okay thanks
19:03:19 #info elasticsearch upgrade is done, clarkb now moving on to logstash and kibana upgrades (and he's offered to help teach others about logstash)
19:03:31 ok, I think we can move on
19:03:33 #topic Actions from last meeting
19:03:43 jeblair investigate whether 209906 is needed for gerrit 2.11
19:03:46 jeblair: how'd that go?
19:04:07 short version: yes :)
19:04:12 #link https://review.openstack.org/#/c/209906
19:04:30 so we should add that into the list of things we need to do before the gerrit upgrade
19:04:47 should we change the topic on that patch to gerrit-upgrade?
19:04:55 anteaya: yes
19:05:03 ok, so the patch looks good to folks who have looked at it, mordred has asked for jeblair's eyes before merging
19:05:08 o/
19:05:25 done
19:05:29 we'll dive more into the migration later in the meeting, so we can move on from this
19:05:36 nibalizer send one-week reminder for scheduled maintenance on the 18th
19:05:43 I didn't see this, but I could have missed it, nibalizer?
19:06:24 o/
19:06:35 o/
19:06:37 o/
19:06:40 i didn't see the reminder either.
19:06:44 if it wasn't sent out, it might be worth sending one that says "tomorrow" :)
19:06:59 i don't think i did that
19:07:02 sorry
19:07:08 nibalizer: want to send one now-ish?
19:07:13 sure
19:07:17 cool, thanks
19:07:26 reply to the original announcement or a new message?
19:07:31 reply
19:07:33 I think reply to original
19:07:38 and the last action from the last meeting:
19:07:40 o/
19:07:43 clarkb double-check git-review interactions with gerrit 2.11 on review-dev.openstack.org
19:07:49 clarkb: how did that go?
19:08:26 works fine
19:08:30 let me get my test change
19:08:32 \o/
19:08:51 https://review-dev.openstack.org/#/c/5383/
19:09:19 wonderful
19:09:22 great
19:09:35 #topic Specs approval
19:10:01 so the only one this week is carried over from last, phschwartz has a spec proposed to add extension to openstackci for next phase of work
19:10:05 #link https://review.openstack.org/#/c/239810/
19:10:25 yolanda had some comments on the patch and asselin__ agreed
19:10:41 \o
19:11:10 that's due to the discussion that originated on the graphite wrapper patch, and the conversation that followed in the infra channel
19:11:15 yes, I think the spec needs a bit more clarification.
Based on discussion, it doesn't seem to be just an extension
19:11:57 actually there were some comments that clarified the issue, but they were made on the graphite change
19:12:04 i asked to add these comments into the spec
19:12:11 makes sense
19:12:16 ++
19:12:38 so it sounds like folks will keep moving forward on this, and we can revisit again next week
19:12:43 +1
19:12:49 #topic Priority Effort: Gerrit 2.11 Upgrade
19:13:13 big event for the week, the agenda has our series of etherpads
19:13:28 #link https://etherpad.openstack.org/p/test-gerrit-2.11
19:13:35 #link https://etherpad.openstack.org/p/gerrit-2.11-upgrade
19:13:43 thanks zaro
19:13:47 so patches are up for review.
19:14:04 i think we are still working thru this one #link https://review.openstack.org/#/c/243879/
19:14:31 clarkb says the change there is not correct
19:14:41 we'll need to fix that.
19:15:21 can I ask who will be available tomorrow to help with the upgrade?
19:15:21 also rollback is probably a no go.. #link https://review.openstack.org/#/c/245598/
19:15:25 ok, so we should make sure we focus on that this afternoon
19:15:26 Big shout out to Zaro and jeblair for representing us at Gerrit Summit again! I know Zaro goes most years. :) Really enjoyed reading the write up.
19:15:44 cody-somerville: thanks
19:15:48 anteaya: I will be around
19:15:58 I will be
19:16:01 i will
19:16:05 me too
19:16:15 of course me as well
19:16:33 how do we feel so far about the upgrade plan?
19:16:41 I can if needed
19:16:48 does the etherpad capture the steps needed?
19:16:51 zaro: so that patch means we don't have a good way prepared to roll back the database from 2.11 to 2.8 if things go sideways
19:17:02 anteaya: I added my comments to the upgrade plan, otherwise it looked straightforward
19:17:15 pleia2: that is a correct assessment.
19:17:19 pabelanger: okay what we need most from non-root folks is reviewing patches and testing things on review-dev, happy to have your help
19:17:31 ok, how do we feel about not having a database rollback plan?
19:17:36 clarkb: thanks
19:17:42 pleia2: we can only go back to backup.
19:17:43 pleia2: I thought we had one?
19:17:54 pleia2: well we can keep the old data but we lose new data
19:17:59 zaro: ah, restore from backups, no rollback
19:18:06 there is a rollback if required change
19:18:14 what am I missing?
19:18:19 but the rollback loses new data
19:18:19 clarkb: zaro said it doesn't work https://review.openstack.org/#/c/245598/
19:18:20 clarkb: i was planning to investigate further today but the rollback script doesn't work as is
19:19:00 yah, cannot access new changes after rolling back.
19:19:08 gotcha
19:19:13 is mordred aware?
19:19:28 yes we talked about this this morning in irc
19:19:36 yes, there was discussion in infra channel
19:20:30 other than what i have reported, i'm not aware that there's anything else that's needed for the upgrade to happen.
19:21:19 thanks to all the peeps that tested review-dev!
19:21:29 i think as long as we are able to merge the zuul change before the gerrit upgrade, we can just let the zuul upgrade happen with the restart that will go along with the gerrit upgrade.
19:21:53 given how poorly the last upgrade went I am not sure how comfortable I am without a rollback
19:22:01 #link http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2015-11-17.log.html#t2015-11-17T16:30:01
19:22:05 jeblair: can we add that point to https://etherpad.openstack.org/p/gerrit-2.11-upgrade ?
19:22:07 that seems to be where the discussion began
19:22:29 Not having a roll back makes me rather uneasy.
19:22:29 (if something fails with the zuul change, it is most likely to affect only mergeability checks, so it's not super-critical)
19:22:38 I would also worry about git repo confusion if we revert and then a change with the same change number ends up in a repo that already has it
19:22:44 Although I'm also not able to help at the time.
19:23:21 clarkb: agreed, i think if we are unable to rollback with new data, we need to revert both the db and the git repos
19:23:49 so new changes would completely disappear and we would be back to the pre-upgrade state
19:24:40 Could new changes be set aside somewhere, so that authors could be notified?
19:24:43 so i think our choices are: (a) delay pending a rollback procedure that can backport new data, or (b) say our rollback plan is restore both the db _and_ the filesystem from pre-upgrade state
19:25:31 Either of those is a reasonable response, depending on the change rate and how many might be lost with (b)
19:25:47 (b) has some serious problems if we merge changes to any repos -- we will be rewinding branch tips.
19:26:04 persia: I think if gerrit treated our data so well that we could do that, I doubt we would need to downgrade
19:26:09 potentially making repos non-fast-forwardable
19:26:23 so (b) is not a _good_ rollback plan.
19:26:32 jeblair: agreed
19:26:42 jeblair: agreed
19:26:44 (b) is only realistic on a short turnaround time
19:26:53 yep
19:27:05 a few hours or a day sure, but a week or something it's a giant problem
19:27:43 the prior problem was observed after a few hours
19:27:56 but we didn't rollback until monday
19:28:04 after a saturday upgrade
19:28:18 we didn't understand the scope of the issue until monday rush
19:29:12 long enough that if it happened again, we would have some non-ff-able branches, but perhaps not too many
19:30:58 do we think we can find a solution to make (a) reasonable? (and is there anyone to work on it?)
19:31:07 aside from zaro
19:31:18 the suspicion is that data is being stored in git
19:31:29 since reflogs are being created for everything
19:31:42 I don't know what the solution is to restore that
19:31:56 so as it stands, if we proceed, our plan is "if we notice a failure immediately, we can roll back fs and db; if we notice a failure after a delay, try _really hard_ to fix it moving forward; if that fails, we can roll back fs and db but it will be a *big deal* and cause mass confusion"
19:32:18 * pleia2 nods
19:32:22 anteaya: i think (a) has not been fully investigated
19:32:27 that seems to be about it
19:32:37 jeblair: that is true
19:32:47 i think the investigation has gotten to "something might be happening in git" but we don't understand what yet, so it's hard to reason about
19:33:15 well as far as there is code that appears to be creating reflogs
19:33:26 sure, i don't know what that means though :)
19:33:30 me either
19:34:11 I don't know who makes a decision here, perhaps see if we can gather mordred and zaro and others later this afternoon?
19:34:16 My gut feel (having contributed nothing to the work :-( ) is that it sounds like we're not quite ready.
19:34:23 make a decision before 00:00 utc
19:34:33 ruagair: but I don't know when we are going to be readier
19:34:39 ya I want zaro and mordred to weigh in
19:34:44 anteaya: nobody does :\
19:34:52 right
19:35:05 so waiting for us to be readier might delay us a while
19:35:11 pleia2: i think that plan sounds good -- we need input from mordred and zaro
19:35:11 ok, let's see what they think this afternoon about the reasonable solvability of this problem
19:35:13 so i don't know about comfort level but nobody else ever does rollbacks in the gerrit community.
19:35:25 i've asked
19:35:30 any other concerns, etc about the upgrade?
19:35:31 zaro: they also run with major server-breaking bugs
19:35:38 clarkb: heh, right
19:35:40 zaro: as we discovered
19:35:47 zaro: we never did either until it completely failed
19:36:20 and the fix is brand new and took something like 8 attempts
19:37:41 nibalizer: so you'll hold your announcement until we make a decision by 00:00 utc?
19:38:16 #agreed gather thoughts from mordred and zaro, make go/no go decision about gerrit upgrade by 00:00 UTC
19:38:19 ok
19:38:38 #topic maniphest migration
19:38:47 I don't know if this is left over from last week (there was discussion last week)
19:39:00 and it's early for ruagair :)
19:39:00 I've got no updates I've note.
19:39:03 oh, hi!
19:39:06 of note.
19:39:17 I've hacked further on cauth, made small progress.
19:39:24 ruagair: I found docs for doing mod auth openid and LP, don't have the link handy though :(
19:39:26 ruagair: ok to remove the details from the agenda for the next meeting?
19:39:36 it does not work with openstackid yet
19:39:40 clarkb: link would be nice, I'd like to look at that.
19:39:48 yes pleia2. Thank you.
19:39:51 ruagair: I will dig it up
19:40:05 Thanks clarkb
19:40:05 the openstackid team is easy to work with, we made changes to it in order to support zanata's auth
19:40:12 so hopefully getting whatever we need won't be much of a barrier
19:40:45 #link http://www.keypressure.com/blog/modauthopenid-and-ubuntu-sso/
19:40:51 ruagair: ^
19:41:08 I am working with smarcet et al to get it working with openstackid too
19:41:09 Got it.
19:41:25 (openstackid isn't completely spec compliant when it comes to uri discovery)
19:41:31 heh
19:41:34 * pleia2 nods
19:42:03 I'll give that a run today clarkb
19:42:06 thanks for your work on this, ruagair
19:42:10 #topic Open discussion
19:42:26 You're welcome. Sorry about getting hung up on cauth.
19:42:43 ruagair: having been through this with another project, I totally understand
19:42:55 and I newly understand
19:42:59 turns out federated auth is hard
19:43:02 and they don't make debugging easy
19:43:10 YES.
19:43:15 "no more idp endpoints"
19:43:19 What does that mean?
19:43:33 and "this OP is not authorized to assert stuff about identity"
19:43:36 or something
19:43:47 I'd just like to ask the folks who felt we weren't ready for the gerrit upgrade to go over some of the etherpads and patches and help to get us to the point where they feel we are ready
19:43:50 clarkb: so would you mind investigating the '//' on review-dev further while i take some time to investigate the rollback error?
19:43:54 anteaya: ++
19:44:04 Accepted anteaya. :-D
19:44:05 zaro: yes I can continue to work on that if someone can add my key to my user
19:44:16 otherwise I have to go get my old key out of cold storage
19:44:21 ruagair: thank you, would be grateful to have your input on the data issue
19:44:23 which I can do if it is easier
19:44:44 clarkb: i can do but don't know the mechanics of that.
19:45:50 clarkb: i'll fix
19:45:56 ok thanks
19:46:39 done
19:46:43 * clarkb advertises logstash local test setup and learning to anyone interested
19:46:54 to happen after tomorrow
19:46:59 jeblair: I am in
19:47:10 I wouldn't mind talking about stackalytics.o.o. The site has been live for a week, however I think we need to talk about maintenance. In all that time we have not restarted wsgi, so we are not running the latest version of stackalytics.
19:47:13 * ruagair is interested clarkb
19:47:32 we'll need to decide how to reload wsgi, since it takes more than 15 mins to bring the process back up
19:47:42 ruagair: cool we can figure out details after meeting
19:48:19 pabelanger: seems like we might want to cron that for a slow time? either that or make it an ha system and alternate upgrades.
19:49:00 (one of those sounds notably easier than the other)
19:49:23 jeblair: right.
slow time is how it is done today on stackalytics.com. So, if people are fine with that (15-20 mins potential outage), I can set up cron to do that. But I agree, something HA might be better
19:49:46 clarkb: where do I need to show up to learn about logstash local setup?
19:50:22 mmedvede: mostly just trying to figure out who is interested, then we can decide on details
19:50:30 so ping me and after meeting we can all sort out details
19:51:07 o/ I had a Q about the status of publishing docs.o.o from swift, spec is here: http://specs.openstack.org/openstack-infra/infra-specs/specs/doc-publishing.html
19:51:11 pabelanger: the slow cron sounds like a great start
19:51:26 jeblair: other option is what greghaynes suggested, using wsgi maximum-requests to restart processes
19:51:41 Looks like it's not a priority, but I also noticed the spec itself doesn't address one need that we have which is https. Does anyone have a thought/comment?
19:51:54 but honestly, need more help with wsgi to understand how it works
19:52:00 annegentle: i think https would be easy to add
19:52:10 pabelanger: Is it possible to bring up the upgraded one in a separate process, and redirect wsgi when it has finished the memory load?
19:52:11 err, uwsgi
19:52:30 persia: not sure, but I do like that approach
19:52:43 jeblair: does the spec need updating? We're also looking at writing a spec for developer.o.o publishing
19:52:48 persia: our current issue, we only have so much RAM left, and are limited to 2 processes
19:52:54 jeblair: and wondered if a 2nd spec is the way to go there
19:52:55 pabelanger: Ah, good point.
19:53:14 re: slow startup - we should have a way to preload what is needed before uwsgi considers the process 'up', that requires some knowledge of the app though
19:53:17 annegentle: but yeah, we haven't started working on that yet, maybe we can find someone to start on that soon.
19:53:20 that is the typical way that problem is solved
19:53:28 before finishing the meeting, i'd like to share progress about infra-cloud
19:53:36 greghaynes_: ya, I have been meaning to look into that
19:53:39 ricky and myself have been working on us east
19:53:46 yolanda: oh good
19:53:50 if you do that, then the max-lifetime will 'just work' because when it restarts the wsgi process it will wait for the preload to finish before it adds it into the worker pool
19:54:11 we've hit a problem with bifrost, when disks are > 2TB. I filed a change here https://review.openstack.org/#/c/246253, seems to have acceptance
19:54:12 jeblair: I'm a wee bit concerned about timing since it has been approved 18 months now? And we'll need to slice it in well before an actual release. So a target would be lovely, such as mitaka-2 or some such?
19:54:12 annegentle: if dev.o.o can use the same mechanism, then we can just update the existing spec;
19:54:20 greghaynes_: Ah, perfect
19:54:40 jeblair: it _can_ though we'd prefer to have a "real" server for developer.openstack.org for more dynamic content, so that sounds like a separate spec.
19:54:52 I am PTOing (in theory) today, but later in the week I could help look at it
19:54:52 jeblair: but wanted to ask about both here
19:55:03 annegentle: like you want developer.o.o to be a web app?
19:55:25 jeblair: ideally so we can more dynamically serve API content
19:55:33 jeblair: instead of static files
19:55:35 yolanda: sounds like you're on your way to a fix
19:55:53 greghaynes, want to take a look? https://review.openstack.org/#/c/246253 - we also hit some problem with bifrost and duplicated templates https://review.openstack.org/244061
19:55:54 annegentle: i thought all the api content could be generated statically? i mean it doesn't change from moment to moment, right?
19:56:13 this one had a -1 from Deva, seems we need to clarify which is the right template
19:56:14 jeblair: it can be, sure, but precludes the use of swagger-ui or any sort of sandbox
19:56:32 jeblair: I think a spec would be a good way to keep the idea fresh and get reviews, does that sound right to you?
19:56:43 jeblair: we didn't really get a meeting with mordred though we tried :)
19:56:50 annegentle: oh, last i heard i thought there was talk about using swagger or something like it to generate the static stuff. either way, yeah, sounds like speccing would be good.
19:56:55 pleia2, i wanted to raise the need for tracking progress properly on that... currently it is quite messy. We pass info internally on mails, jira tickets on HP, storyboard...
19:57:02 * anteaya gives annegentle her club jacket
19:57:12 and etherpads!
19:57:15 jeblair: yes, we're definitely going static for now, but facing 18+ months for even that I want to be sure we keep moving things forward incrementally
19:57:19 annegentle: oh, i see -- yeah, i was probably thinking of that mordred idea but i guess that hasn't happened.
19:57:38 yolanda: do you have any thoughts as to how to make it better, or just mentioning the problem?
19:57:41 jeblair: yeah. so better for me to at least make sure we have the idea on paper. bits/bytes paper anyway :)
19:57:47 yolanda: I'll have a look at it in a bit, trying to family time today :)
19:57:51 feels more responsible anyway
19:57:52 Oh. If people could also look at https://review.openstack.org/#/c/205596/ for an example of using puppetlabs-apache. I am wanting to start work on the vhost::custom upstream and want to make sure everybody is happy with how it will look for use. While not 100%, the concept will be the same.
19:57:53 yolanda: I'd be happy to help find a solution
19:58:00 yolanda: talk to Clint about tracking progress
19:58:08 pleia2, i wanted to put my hope on phabricator, but this seems to be WIP
19:58:15 that is the work yolanda was looking into to port forward to a newer version of puppet-apache
19:58:15 oh yes, that's a great one for Clint
19:58:17 * Clint grunts.
19:58:23 heh
19:58:28 o/ Clint
19:59:05 problem is that storyboard is not complete enough to track everything. JIRA is something internal...
19:59:10 so what do we have?
19:59:26 yolanda: honestly we sketch out work in etherpads
19:59:48 yep, but in terms of task tracking for example. Or managing the CMDB...
19:59:54 any news on a spec for task tracking?
20:00:02 all the upgrade stuff we're doing with gerrit should be in a proper task tracker
20:00:02 we currently have this on excel sheets and emails, not good
20:00:14 time
20:00:16 Zara: earlier in the meeting ruagair shared some progress
20:00:21 thanks everyone!
20:00:25 yolanda: this was intended to be useful for something: https://review.openstack.org/#/c/219372/
20:00:27 #endmeeting