19:06:16 #startmeeting infra
19:06:17 Meeting started Tue Jan 2 19:06:16 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:06:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:06:20 The meeting name has been set to 'infra'
19:06:34 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:06:51 #topic Announcements
19:07:39 I don't know of any announcements. Anything I'm forgetting, since it's the first day of work back from the holidays and all that?
19:07:51 o/
19:09:01 Oh I know. Feature freeze is coming up for openstack
19:09:27 that happens the week of January 22nd, ~3 weeks from now
19:09:44 #topic Actions from last meeting
19:09:57 #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-12-19-19.01.log.txt Log from last meeting
19:10:19 That is the log from last meeting, not the minutes, as I think we had a meetbot restart in the middle of the meeting? There are no minutes
19:10:26 but grepping the log I don't find any actions \o/
19:10:39 #topic Specs approval
19:11:20 I don't think there are any specs ready for approval, but in looking over the specs I noticed that fungi's removal of old jenkins' votes has bubbled up some old specs to the top of the Gerrit list
19:12:03 I'd like to go through and resurrect any that need attention and abandon those that are no longer relevant
19:12:31 ++
19:12:33 if you have any specs that you own, rebasing them as necessary or abandoning them if they're no longer something we need would be great
19:12:44 * mordred has been enjoying that side effect of the remove-jenkins patches
19:12:45 but I'll hopefully be going through the list myself sometime soon and do that as well
19:14:00 mordred: yes, it's been nice to wholesale avoid problems related to old software :)
19:14:30 I'll put together a summary for the mailing list that people can review once I'm done, so that you can make sure I haven't abandoned something important
19:14:38 #topic Priority Efforts
19:14:47 #topic Zuul v3
19:15:49 We still need to migrate off of the zuul v3 issues etherpad and into the bug tracker. I haven't gone over that list myself as we had problems with nodepool before the holidays that ended up taking my attention, but if we can empty the zuulv3-issues etherpad soon that would be great
19:16:09 Anything else we need to talk about from a zuul perspective? I think shrews found a new fun corner case of branch behavior today
19:16:45 we're bringing the new finger gateway online
19:17:02 remote: https://review.openstack.org/530789 Open finger port for zuulv3
19:17:07 ^^^ last thing needed
19:17:19 was reading IRC backlogs over the break, seems we had another zuul restart due to memory usage. Is that something we should discuss?
19:17:23 that's a new process running on zuulv3.o.o. once it's running, we can switch the executors to an unprivileged port
19:18:06 and we'll make finger urls point to zuulv3.o.o instead of the executors.
19:18:17 pabelanger: any idea why?
19:18:22 also noticed the OOM killer running on zuul-executors still. Wasn't sure if we had some patches up for limiting memory, I think from SpamapS
19:18:31 also regarding restarts, is the procedure for saving and restoring the queue documented somewhere?
19:19:07 the first oom on the 22nd seemed to have been caused by a stack of about 50 .zuul.yaml changes
19:19:17 pabelanger: yeah, i think the memory governor is the next step in the executor oom.
19:19:17 corvus: I believe there was an influx of new .zuul.yaml files in patchsets, which pushed zuul over to swapping
19:19:22 frickler: the scheduler oom'd?
19:19:52 frickler: https://docs.openstack.org/infra/system-config/zuulv3.html#restarting-the-scheduler is what I usually follow
19:19:55 corvus: we were 8+ GB in swap
19:20:00 corvus: no, the scheduler was stalled due to swapping
19:20:09 okay, that's what i thought.
19:20:45 pabelanger: thx, will try to use that next time
19:21:37 frickler: https://review.openstack.org/522678/ is actually the updated docs
19:22:22 #link https://review.openstack.org/530789 Open finger port for zuulv3
19:22:36 #link https://review.openstack.org/522678/ update zuul v3 queue saving docs
19:25:00 corvus: is this OOM problem something we could maybe use review-dev to help diagnose? just push a bunch of .zuul.yaml updates to it?
19:25:35 clarkb: i don't believe there is a bug there, the configuration is just very large.
19:26:38 it's possible that with the new late-binding inheritance approach we have, we may be able to reduce the memory usage on speculative layouts by sharing more objects between them. let me know if anyone wants to work on that. :)
19:27:11 so we should get a node with more ram and all will be fine?
19:27:49 i would prefer that we avoid uploading 50 .zuul.yaml changes at once until someone has a chance to further improve this.
19:28:43 maybe we can send mail to the dev list about ^
19:28:52 but if folks would like to add more ram, that's certainly an option
19:28:59 I think a few people have learned not to do that the hard way, but I'm not sure we've communicated it broadly
19:29:58 it's also aggregate, yeah? so it may not be actionable for a person to avoid lots of zuul.yaml changes at once, since it could be 50 different people pushing zuul.yaml changes?
19:30:14 (mostly pondering what our communication to the dev list would tell people to do)
19:30:18 yah, it seems mostly to happen once projects start moving existing jobs in-tree. So the migration docs could be updated also, I can propose a patch for that
19:30:26 mordred: ya, though I think we've really only seen it when a single person pushes a stack of ~50 changes
19:30:52 which project(s) caused the latest issue?
19:31:21 clarkb: ah - ok. that's much easier to tell people to avoid :)
19:31:27 frickler: ^ do you know?
19:31:35 I believe openstack-puppet, confirming
19:31:53 ya, we decided EmilienM was guilty ;)
19:32:09 tripleo, openstack-puppet, openstack-ansible, and openstack-infra are really the only ones likely to run into this
19:32:15 wat
19:32:23 ah
19:32:37 yeah sorry again, I stopped doing that
19:32:41 http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-12-22.log.html#t2017-12-22T06:40:52
19:32:50 so we could narrow the communication target a bit if we wanted
19:32:52 was what AJaeger noticed
19:33:09 (i'm not seeking blame, just wanted to find out if my list was incomplete)
19:33:48 obviously zuul should be robust against dos. it's just not yet. :)
19:33:54 agree, I think the larger deployment projects are primarily the trigger, from what I have been seeing
19:36:34 ok, so there are ways to improve memory use; talk to corvus if you want to dig into python memory usage and zuul. And while we work towards improving zuul, beware of large stacks of zuul config updates all at once
19:36:49 any other zuul items?
19:37:26 don't think so... we'll have the first zuul meeting of the new year next week. we may be better organized by then. :)
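(A rough sketch of the "save the queues" step from the restart procedure linked above -- the same idea as the zuul-changes.py helper those docs reference. The status URL, tenant name, and JSON field names below are assumptions on my part; treat the linked documentation and the actual helper script as authoritative.)

    #!/usr/bin/env python3
    # Dump "zuul enqueue" commands for the changes currently in a pipeline so
    # they can be replayed after a scheduler restart. The status URL and field
    # names here are assumptions; see the restart docs linked above.
    import json
    import urllib.request

    STATUS_URL = 'http://zuulv3.openstack.org/api/tenant/openstack/status'  # assumed

    def dump_enqueue_commands(pipeline_name):
        raw = urllib.request.urlopen(STATUS_URL).read().decode('utf-8')
        status = json.loads(raw)
        for pipeline in status['pipelines']:
            if pipeline['name'] != pipeline_name:
                continue
            for queue in pipeline['change_queues']:
                for head in queue['heads']:
                    for item in head:
                        if not item.get('live', True):
                            continue
                        # One command per live change, to run after the restart.
                        print('zuul enqueue --tenant openstack --trigger gerrit'
                              ' --pipeline %s --project %s --change %s' % (
                                  pipeline_name, item['project'], item['id']))

    if __name__ == '__main__':
        for name in ('gate', 'check'):
            dump_enqueue_commands(name)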
19:37:56 #topic General Topics
19:38:13 dmsimard had one on the agenda. The freenode spamming we've seen recently
19:38:28 I guess we've mitigated that by temporarily +r'ing channels, which requires users to be registered to join them
19:39:08 hi, sorry
19:39:24 got sidetracked and missed the meeting until now ..
19:39:28 that's not in place now, though, right? (we cleared that after things subsided?)
19:39:59 corvus: ya, my client doesn't report +r on channels currently
19:40:24 we removed +r, and actually it's kinda awkward to add/remove it through chanserv
19:40:31 do we want to add flood detection to the openstack bot? at least to kick offending users with some sort of message? though that doesn't stop it if the bot is set up to autojoin on kick
19:40:41 (i'm personally +r, so will miss direct messages from unauthed folks)
19:40:43 fungi and I had discussed handling it through accessbot
19:41:01 yah, I am +r now myself too
19:41:11 corvus: +r for channels prevents unregistered people from joining and might be a hindrance for people who already find IRC complicated
19:41:26 dmsimard: i understand that.... why was that directed to me?
19:41:52 no particular reason, I don't have backlog from the meeting :/
19:42:07 I guess what I'm saying is that +r for users and +r for channels are different things
19:42:35 from the sounds of it, some channels did have +r set? Or did I misread that in the backscroll
19:42:37 yes. i only mentioned that as a helpful parenthetical aside. i now believe it was not helpful for me to mention it and regret doing so.
19:43:22 so -- chanserv can't really set +r, however, it can set mlock +r
19:43:35 mlock prevents anyone (even channel operators) from changing the mode
19:44:12 so to add +r it actually goes like this: set mlock +r (chanserv adds +r if it's not already set)
19:45:19 but to remove it -- you send an empty set mlock command (confirmed with #freenode)
19:45:51 it's kind of awkward because if we actually had mlocks set, it would have wiped them (we spot-checked a few channels and they had none)
19:46:41 I'm not even sure if removing the mlock removes the +r, I think it doesn't.. but I forget how I ended up removing the mode everywhere
19:47:18 we should only have mlock settings on channels which have been renamed; presumably they aren't on whatever list was being used. but it's worth keeping in mind in the future -- we don't want to accidentally unset forwards on #shade or whatever.
19:47:57 anyway, the spam is often offensive, and I would like in the best case to prevent it from occurring in the first place, or in the worst case, to be able to react to it quickly across our 186 channels
19:48:27 looking at one of the floods from yesterday, they do seem to be throttling their messages to about one per 5 seconds
19:48:33 so simple flood protections likely won't help
19:48:50 we can probably implement something like a nickname ping limit
19:48:52 yes, freenode, as we've found out, is pretty good about flood detection.
19:49:20 like, you're not supposed to be pinging 100+ people in a short timespan
19:49:42 that sounds like a reasonable heuristic
19:50:07 it might be cat-and-mouse until they find another way, but it doesn't force us to +r the channels
19:50:21 it's more reactive than preventive, but it's something
19:50:24 it's worth noting that afaik, we still have the 100 channel limit, and no one has created an openstack2 bot, so we don't have a single bot in all channels at the moment.
19:50:39 right -- that's why I used chanserv
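(A rough sketch of the "nickname ping limit" heuristic discussed above: flag a sender who mentions an unusually large number of distinct channel members within a short window. Purely illustrative -- the threshold, the window, and how a bot would obtain the channel member list are all assumptions, and nothing like this is implemented in our bots.)

    # Hypothetical ping-flood heuristic for an IRC bot; thresholds are guesses.
    import time
    from collections import defaultdict, deque

    PING_LIMIT = 100   # distinct nicks mentioned...
    WINDOW = 30        # ...within this many seconds

    # sender -> deque of (timestamp, mentioned nick)
    _recent_mentions = defaultdict(deque)

    def is_ping_flood(sender, message, channel_members):
        """Return True if sender has pinged too many distinct nicks recently."""
        now = time.time()
        words = {w.strip(':,') for w in message.split()}
        mentioned = words & set(channel_members)
        history = _recent_mentions[sender]
        for nick in mentioned:
            history.append((now, nick))
        # Forget mentions that fell outside the window.
        while history and history[0][0] < now - WINDOW:
            history.popleft()
        distinct = {nick for _, nick in history}
        return len(distinct) > PING_LIMIT

A bot using something like this would call is_ping_flood() on every channel message and kick (or at least report) the sender when it returns True.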
19:50:52 also, either the floods are reduced in frequency or freenode is catching more of them, because I don't see them happening as frequently
19:51:24 but ya, I think if we can do something that avoids +r that would be good
19:51:41 (as I think we do have quite a few unregistered users, based on the number of _ and ` and 1s out there)
19:51:58 it's worth noting that adding +r doesn't kick unregistered people out
19:52:08 but it prevents new ones from joining
19:52:35 * mnaser apologizes for doing the UTC math incorrectly for the meeting
19:52:43 * mordred waves at mnaser
19:53:06 if we want to make it easier to flip +r, we could use statusbot and/or accessbot. statusbot joins/leaves channels and so beats the 100 channel limit.
19:53:08 dmsimard: unregistered users are less likely to have persistent clients, so I think that is still a problem that should be avoided if we can avoid it
19:53:28 hiya mordred
19:53:41 corvus: accessbot seems like a sane choice since it already modifies various channel modes, doesn't it?
19:53:43 clarkb: yeah, +r should not be enabled all the time, perhaps only when there is a spam wave
19:53:48 accessbot isn't a real bot, it's just a script. if we wanted it to do things in real time (avoiding the time delay for landing a change in git), we'd need to make it a long-lived process.
19:54:05 fyi: if we're still on the spam topic, I just wanted to inform the infra team that I did self-approve a patch to give me operator-level access to be able to do cleanups over the holidays .. just wanted to make sure folks knew in case they didn't see the patches merging
19:54:23 mnaser: thanks for the heads up
19:54:24 if landing a change to project-config is okay to flip +r, then it shouldn't be hard to do as it stands.
19:54:30 mnaser: I thought that was during last week's meeting for some reason -- I deleted your note on the agenda, sorry
19:54:46 no worries, i added it when i did it over the holidays
19:55:00 also, flipping +r might take a little while as I guess it needs to wait for a puppet run
19:55:09 mnaser: i think that was a good call, and I'm happy for you to make it and make sure everyone knows about it after the fact. :)
19:55:55 cool fun weekend project: use zuul and change accessbot to run when accessbot changes merge in project-config
19:56:01 corvus: so when we do a #status command, statusbot will leave and join channels as required?
19:56:03 so that way we can instantly get +r once it lands
19:56:24 dmsimard: oh, er, sorry, i'm wrong about that. gerritbot does that, not statusbot.
19:56:45 corvus: my question remains, though -- are not all channels notified?
19:57:01 dmsimard: only the ones listed in statusbot's config
19:57:06 okay
19:57:24 dmsimard: if there are 186 channels, then... certainly at least 86 are not notified.
19:57:38 I think I got my list of channels from accessbot
19:57:46 not sure how many there are for statusbot
19:57:48 all of them should be, however. no one has implemented it yet.
19:59:06 since accessbot doesn't join any channels, it doesn't have a limit.
19:59:27 ^ i learned more about accessbot when trying to get access -- it's not so much a bot as a one-time script
19:59:34 yeah
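(A minimal sketch of the "flip +r everywhere via an accessbot/statusbot-like process" idea above, using the chanserv mlock commands dmsimard described earlier. The python irc library, the exact ChanServ syntax, and the channel list are assumptions on my part -- check /msg ChanServ HELP SET MLOCK, and the real channel list lives in project-config.)

    # Hypothetical helper to set or clear mlock +r on a list of channels via
    # ChanServ. Library usage and ChanServ syntax are assumptions; illustrative only.
    import irc.bot

    CHANNELS = ['#openstack-infra', '#openstack-dev']  # placeholder list

    class ModeFlipper(irc.bot.SingleServerIRCBot):
        def __init__(self, server, nick, password, set_restricted=True):
            irc.bot.SingleServerIRCBot.__init__(self, [(server, 6667)], nick, nick)
            self.password = password
            self.set_restricted = set_restricted

        def on_welcome(self, connection, event):
            connection.privmsg('NickServ', 'IDENTIFY %s' % self.password)
            for channel in CHANNELS:
                if self.set_restricted:
                    # ChanServ adds +r if it is not already set when mlocking it.
                    connection.privmsg('ChanServ', 'SET %s MLOCK +r' % channel)
                else:
                    # An empty mlock clears the lock; +r itself may still need
                    # to be dropped by an op, per the discussion above.
                    connection.privmsg('ChanServ', 'SET %s MLOCK' % channel)
            self.die('done')

    if __name__ == '__main__':
        ModeFlipper('chat.freenode.net', 'openstackinfra', 'secret').start()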
19:59:51 I guess we're running out of time, but I'd like to carry on the conversation, perhaps in -infra
20:00:01 (we got a late start, so we can go a few minutes over time if necessary, assuming no TC meeting)
20:00:16 dmsimard: ya, I think it's worth sorting out, especially given the removal of shade's forward
20:00:23 but let's continue in -infra
20:00:31 #topic Open Discussion
20:00:36 any last-minute items?
20:02:05 rumors all over the internet of a cpu bug having a big impact on VMs/hypervisors
20:02:25 so uh, don't be surprised if our clouds reboot everything at some point
20:02:56 good to know
20:03:07 https://lwn.net/Articles/742404/
20:03:28 and when they do, expect a 5-50% performance hit
20:03:59 proportional to your use of syscalls
20:04:05 aiui
20:05:08 ok thanks everyone.
20:05:13 #endmeeting