19:06:16 <clarkb> #startmeeting infra
19:06:17 <openstack> Meeting started Tue Jan  2 19:06:16 2018 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:06:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:06:20 <openstack> The meeting name has been set to 'infra'
19:06:34 <clarkb> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:06:51 <clarkb> #topic Announcements
19:07:39 <clarkb> I don't know of any announcements, anything I'm forgetting because it's the first day of work back from holidays and all that?
19:07:51 <pabelanger> o/
19:09:01 <clarkb> Oh I know. Feature freeze is coming up for openstack
19:09:27 <clarkb> that happens week of January 22nd, ~3 weeks from now
19:09:44 <clarkb> #topic Actions from last meeting
19:09:57 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2017/infra.2017-12-19-19.01.log.txt Log from last meeting
19:10:19 <clarkb> That is the log from last meeting, not the minutes, as I think we had a meetbot restart in the middle of the meeting? There are no minutes
19:10:26 <clarkb> but grepping the log I don't find any actions \o/
19:10:39 <clarkb> #topic Specs approval
19:11:20 <clarkb> I don't think there are any specs ready for approval, but in looking over the specs I noticed that fungi's removal of old jenkins' votes has bubbled up some old specs to the top of the Gerrit list
19:12:03 <clarkb> I'd like to go through and resurrect any that need attention and abandon those that are no longer relevant
19:12:31 <mordred> ++
19:12:33 <clarkb> if you have any specs that you own, rebasing them as necessary or abandoning them if they're no longer something we need would be great
19:12:44 * mordred has been enjoying that side effect of the remove-jenkins patches
19:12:45 <clarkb> but I'll be going through the list sometime soon myself hopefully and do that as well
19:14:00 <clarkb> mordred: yes, it's been nice to wholesale avoid problems related to old software :)
19:14:30 <clarkb> I'll put together a summary for the mailing list that people can review once I'm done so that you can make sure I haven't abandoned something important
19:14:38 <clarkb> #topic Priority Efforts
19:14:47 <clarkb> #topic Zuul v3
19:15:49 <clarkb> We still need to migrate off of the zuul v3 issues etherpad and into the bug tracker. I haven't gone over that list myself as we had problems with nodepool before the holidays that ended up taking my attention, but if we can empty the zuulv3-issues etherpad soon that would be great
19:16:09 <clarkb> Anything else we need to talk about from a zuul perspective? I think shrews found a new fun corner case of branch behavior today
19:16:45 <corvus> we're bringing the new finger gateway online
19:17:02 <Shrews> remote:   https://review.openstack.org/530789 Open finger port for zuulv3
19:17:07 <Shrews> ^^^ last thing needed
19:17:19 <pabelanger> was reading IRC backlogs over break, seems we had another zuul restart due to memory usage. Is that something we should discuss?
19:17:23 <corvus> that's a new process running on zuulv3.o.o.  once it's running, we can switch the executors to an unprivileged port
19:18:06 <corvus> and we'll make finger urls point to zuulv3.o.o instead of the executors.
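For context, the finger gateway speaks the plain finger protocol: a client opens TCP port 79, sends the build UUID followed by CRLF, and reads the streamed console log back. A minimal sketch of such a client; the host and UUID below are placeholders, not a real build:

```python
import socket

def stream_console_log(host, build_uuid):
    """Connect to a Zuul finger gateway and print the streamed console log."""
    with socket.create_connection((host, 79)) as sock:
        # A finger request is just the "user" part of user@host plus CRLF;
        # for Zuul that user part is the build UUID.
        sock.sendall((build_uuid + "\r\n").encode())
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            print(chunk.decode(errors="replace"), end="")

# Usage (hypothetical UUID): stream_console_log("zuulv3.openstack.org", "abc123")
```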
19:18:17 <corvus> pabelanger: any idea why?
19:18:22 <pabelanger> also noticed OOM killer running on zuul-executors still. Wasn't sure if we had some patches up to limit memory, I think SpamapS had some
19:18:31 <frickler> also regarding restarts, is the procedure for saving and restoring the queue documented somewhere?
19:19:07 <frickler> the first oom on the 22nd seemed to have been caused by a stack of about 50 .zuul.yaml changes
19:19:17 <corvus> pabelanger: yeah, i think the memory governor is the next step in the executor oom.
19:19:17 <pabelanger> corvus: I believe there was an influx of new .zuul.yaml files in patchsets, which pushed zuul over to swapping
19:19:22 <corvus> frickler: the scheduler oom'd?
19:19:52 <pabelanger> frickler: https://docs.openstack.org/infra/system-config/zuulv3.html#restarting-the-scheduler is what I usually follow
19:19:55 <AJaeger> corvus: we were 8+ GB in swap
19:20:00 <frickler> corvus: no, the scheduler was stalled due to swapping
19:20:09 <corvus> okay, that's what i thought.
19:20:45 <frickler> pabelanger: thx, will try to use that next time
19:21:37 <pabelanger> frickler: https://review.openstack.org/522678/ is actually the updated docs
19:22:22 <clarkb> #link https://review.openstack.org/530789 Open finger port for zuulv3
19:22:36 <clarkb> #link https://review.openstack.org/522678/ update zuul v3 queue saving docs
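For anyone who hasn't done this before, the gist of the documented procedure is to dump the pipeline contents from the scheduler's status.json before restarting and replay them with zuul enqueue afterwards. A rough sketch of that dump step, not our actual tooling; the status.json field names and zuul enqueue flags here are from memory and may differ from the real layout:

```python
import json
import urllib.request

# Assumed endpoint and tenant; adjust for the actual deployment.
STATUS_URL = "http://zuulv3.openstack.org/status.json"
TENANT = "openstack"

status = json.load(urllib.request.urlopen(STATUS_URL))
for pipeline in status.get("pipelines", []):
    for queue in pipeline.get("change_queues", []):
        for head in queue.get("heads", []):
            for item in head:
                # Only live items should be re-enqueued after the restart.
                if not item.get("live", True):
                    continue
                print("zuul enqueue --tenant {} --trigger gerrit "
                      "--pipeline {} --project {} --change {}".format(
                          TENANT, pipeline["name"],
                          item["project"], item["id"]))
```

Running this before the restart gives a list of commands to paste back in once the scheduler is up again.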
19:25:00 <clarkb> corvus: is this OOM problem something we could maybe use review-dev to help diagnose? just push a bunch of .zuul.yaml updates to it?
19:25:35 <corvus> clarkb: i don't believe there is a bug there, the configuration is just very large.
19:26:38 <corvus> it's possible that with the new late-binding inheritance approach we have, we may be able to reduce the memory usage on speculative layouts by sharing more objects between them.  let me know if anyone wants to work on that.  :)
19:27:11 <frickler> so we should get a node with more ram and all will be fine?
19:27:49 <corvus> i would prefer that we avoid uploading 50 .zuul.yaml changes at once until someone has a chance to further improve this.
19:28:43 <clarkb> maybe we can send mail to the dev list about ^
19:28:52 <corvus> but if folks would like to add more ram, that's certainly an option
19:28:59 <clarkb> I think a few people have learned not to do that the hard way but I'm not sure we've communicated it broadly
19:29:58 <mordred> it's also aggregate, yeah? so it may not be actionable for a person to avoid lots of zuul.yaml changes at once, since it could be 50 different people pushing zuul.yaml changes?
19:30:14 <mordred> (mostly pondering what our communication to the dev list would tell people to do)
19:30:18 <pabelanger> yah, seems mostly to happen once projects start moving existing jobs in-tree. So migration docs could be updated also, I can propose a patch for that
19:30:26 <clarkb> mordred: ya though I think we've really only seen it when a single person pushes a 50-change stack
19:30:52 <corvus> which project(s) caused the latest issue?
19:31:21 <mordred> clarkb: ah - ok. that's much easier to tell people to avoid :)
19:31:27 <clarkb> frickler: ^ do you know?
19:31:35 <pabelanger> I believe openstack-puppet, confirming
19:31:53 <frickler> ya, we decided EmilienM was guilty ;)
19:32:09 <corvus> tripleo, openstack-puppet, openstack-ansible, and openstack-infra are really the only ones likely to run into this
19:32:15 <EmilienM> wat
19:32:23 <EmilienM> ah
19:32:37 <EmilienM> yeah sorry again, I stopped doing that
19:32:41 <pabelanger> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-12-22.log.html#t2017-12-22T06:40:52
19:32:50 <corvus> so we could narrow the communication target a bit if we wanted
19:32:52 <pabelanger> was what AJaeger noticed
19:33:09 <corvus> (i'm not seeking blame, just wanted to find out if my list was incomplete)
19:33:48 <corvus> obviously zuul should be robust against dos.  it's just not yet.  :)
19:33:54 <pabelanger> agree, I think the larger deployment projects are primarily the trigger, from what I have been seeing
19:36:34 <clarkb> ok so there are ways to improve memory use, talk to corvus if you want to dig into python memory usage and zuul. And while we work towards improving zuul beware large stacks of zuul config updates all at once
19:36:49 <clarkb> any other zuul items?
19:37:26 <corvus> don't think so... we'll have the first zuul meeting of the new year next week.  we may be better organized by then.  :)
19:37:56 <clarkb> #topic General Topics
19:38:13 <clarkb> dmsimard had one on the agenda. The freenode spamming we've seen recently
19:38:28 <clarkb> I guess we've mitigated that by temporarily +r'ing channels, which requires users to be registered to join them
19:39:08 <dmsimard> hi, sorry
19:39:24 <dmsimard> got sidetracked and missed the meeting until now ..
19:39:28 <corvus> that's not in place now, though, right?  (we cleared that after things subsided?)
19:39:59 <clarkb> corvus: ya my client doesn't report +r on channels currently
19:40:24 <dmsimard> we removed +r and actually it's kinda awkward to add/remove it through chanserv
19:40:31 <pabelanger> do we want to add flood detection into the openstack bot? at least to kick offending users with some sort of message? However that doesn't stop them if the spamming bot is set up to autojoin on kick
19:40:41 <corvus> (i'm personally +r, so will miss direct messages from unauthed folks)
19:40:43 <dmsimard> fungi and I had discussed handling it through accessbot
19:41:01 <pabelanger> yah, I am +r now myself too
19:41:11 <dmsimard> corvus: +r for channels prevents unregistered people from joining and might be a hindrance for people finding IRC already complicated
19:41:26 <corvus> dmsimard: i understand that.... why was that directed to me?
19:41:52 <dmsimard> no particular reason, I don't have backlog from meeting :/
19:42:07 <dmsimard> I guess what I'm saying is that +r for users and channels is different
19:42:35 <pabelanger> from the sounds of it, some channels did have +r set? Or did I misread that in backscroll
19:42:37 <corvus> yes.  i only mentioned that as a helpful parenthetical aside.  i now believe it was not helpful for me to mention it and regret doing so.
19:43:22 <dmsimard> so -- chanserv can't really set +r, however, it can set mlock +r
19:43:35 <dmsimard> mlock prevents anyone (even channel operators) from changing the mode
19:44:12 <dmsimard> so to add +r it actually goes like this: set mlock +r (chanserv adds +r if it's not already set)
19:45:19 <dmsimard> but to remove it -- you set an empty set mlock command (confirmed with #freenode)
19:45:51 <dmsimard> it's kind of awkward because if we actually had mlock settings, that would have wiped them (we spot checked a few channels and they had none)
19:46:41 <dmsimard> I'm not even sure if removing the mlock removes the +r, I think it doesn't.. but I forget how I ended up removing the mode everywhere
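For reference, the ChanServ exchange described above boils down to two raw IRC messages. A minimal sketch, assuming an already-connected and identified socket; this is not existing infra tooling:

```python
def set_channel_restricted(sock, channel, enable):
    """Ask ChanServ to lock or unlock +r on a channel via MLOCK."""
    if enable:
        # "SET <chan> MLOCK +r" -- ChanServ also applies +r if it isn't set yet.
        command = "SET {} MLOCK +r".format(channel)
    else:
        # An empty MLOCK clears the lock; per the discussion above it may not
        # remove an already-applied +r, so a follow-up "MODE <chan> -r" is
        # likely still needed.
        command = "SET {} MLOCK".format(channel)
    sock.sendall("PRIVMSG ChanServ :{}\r\n".format(command).encode())
```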
19:47:18 <corvus> we should only have mlock settings on channels which have been renamed; presumably they aren't on whatever list was being used.  but it's worth keeping in mind in the future -- we don't want to accidentally unset forwards on #shade or whatever.
19:47:57 <dmsimard> anyway, the spam is often offensive and I would like, in the best case, to prevent it from occurring in the first place or, in the worst case, to be able to react to it quickly across our 186 channels
19:48:27 <clarkb> looking at one of the floods from yesterday they do seem to be throttling the messages at about one per 5 seconds
19:48:33 <clarkb> so simple flood protections likely won't help
19:48:50 <dmsimard> we can probably implement something like a nickname ping limit
19:48:52 <corvus> yes, freenode, as we've found out, is pretty good about flood detection.
19:49:20 <dmsimard> like, you're not supposed to be pinging 100+ people in a short timespan
19:49:42 <corvus> that sounds like a reasonable heuristic
19:50:07 <dmsimard> it might be a cat/mouse until they find another way but it doesn't force us to +r the channels
19:50:21 <dmsimard> it's more reactive than it is preventing spam, but it's something
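A back-of-the-envelope sketch of the heuristic being discussed: flag any single message that mentions an unusually large number of nicks currently in the channel. The threshold and helper names here are hypothetical, not an existing bot feature:

```python
import re

MAX_MENTIONS = 10  # hypothetical threshold; tune to avoid false positives

def looks_like_mass_ping(message, channel_nicks):
    """Return True if one message mentions an unusually large number of
    nicks that are currently present in the channel."""
    # Tokenize on characters legal in IRC nicks, then intersect with the
    # channel's member list.
    words = set(re.findall(r"[A-Za-z0-9\[\]\\`_^{|}\-]+", message))
    return len(words & set(channel_nicks)) > MAX_MENTIONS
```

A bot that already tracks channel membership could kick or quiet the sender when this returns True, without needing +r on the channel.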
19:50:24 <corvus> it's worth noting that afaik, we still have the 100 channel limit, and no one has created an openstack2 bot, so we don't have a single bot in all channels at the moment.
19:50:39 <dmsimard> right -- that's why I used chanserv
19:50:52 <clarkb> also either the floods are reduced in frequency or freenode is catching more of them, because I don't see them happening as frequently
19:51:24 <clarkb> but ya I think if we can do something that avoids +r that would be good
19:51:41 <clarkb> (as I think we do have quite a few unregistered users based on the number of _ and ` and 1s out there)
19:51:58 <dmsimard> it's worth noting that adding +r doesn't kick unregistered people out
19:52:08 <dmsimard> but it prevents new ones from joining
19:52:35 * mnaser apologizes for doing the UTC math incorrectly for the meeting
19:52:43 * mordred waves at mnaser
19:53:06 <corvus> if we want to make it easier to flip +r, we could use statusbot and/or accessbot.  statusbot joins/leaves channels and so beats the 100 channel limit.
19:53:08 <clarkb> dmsimard: unregistered users are less likely to have persistent clients so I think that is still a problem that should be avoided if we can avoid it
19:53:28 <mnaser> hiya mordred
19:53:41 <clarkb> corvus: accessbot seems like a sane choice since it already modifies various channel modes doesn't it?
19:53:43 <dmsimard> clarkb: yeah, +r should not be enabled all the time, perhaps only when there is a spam wave
19:53:48 <corvus> accessbot isn't a real bot, it's just a script.  if we wanted it to do things real time (avoiding the time delay for landing a change in git), we'd need to make it a long-lived process.
19:54:05 <mnaser> fyi: if we're still on the spam topic, I just wanted to inform the infra team that I did self-approve a patch to give me operator level access to be able to do clean ups over the holidays .. just wanted to make sure folks knew in case they didn't see the patches merging
19:54:23 <clarkb> mnaser: thanks for the heads up
19:54:24 <corvus> if landing a change to project-config is okay to flip +r, then it shouldn't be hard to do as it stands.
19:54:30 <dmsimard> mnaser: I thought that was during last week's meeting for some reason -- I deleted your note on the agenda, sorry
19:54:46 <mnaser> no worries, i added it when i did it over the holidays
19:55:00 <mnaser> also, flipping +r might take a little while as I guess it needs to wait for a puppet run
19:55:09 <corvus> mnaser: i think that was a good call, and happy for you to make it and make sure everyone knows about it after the fact.  :)
19:55:55 <mnaser> cool fun weekend project: use zuul and change accessbot to run on project-config accessbot change merges
19:56:01 <dmsimard> corvus: so when we do a #status command, statusbot will leave and join channels as required ?
19:56:03 <mnaser> so that way we can instantly get +r once it lands
19:56:24 <corvus> dmsimard: oh, er, sorry i'm wrong about that.  gerritbot does that, not statusbot.
19:56:45 <dmsimard> corvus: my question remains, though -- aren't all channels notified?
19:57:01 <corvus> dmsimard: only ones listed in statusbot's config
19:57:06 <dmsimard> okay
19:57:24 <corvus> dmsimard: if there are 186 channels, then... certainly at least 86 are not notified.
19:57:38 <dmsimard> I think I got my list of channels from accessbot
19:57:46 <dmsimard> not sure how many there are for statusbot
19:57:48 <corvus> all of them should be, however.  no one has implemented it yet.
19:59:06 <corvus> since accessbot doesn't join any channels, it doesn't have a limit.
19:59:27 <mnaser> ^ i learned more about accessbot when trying to get access, it's not as much of a bot as a one-time script
19:59:34 <dmsimard> yeah
19:59:51 <dmsimard> I guess we're running out of time, but I'd like to carry the conversation perhaps in -infra
20:00:01 <clarkb> (we got a late start so can go a few minutes over time if necessary assuming no TC meeting)
20:00:16 <clarkb> dmsimard: ya I think it's worth sorting out especially given the removal of shade's forward
20:00:23 <clarkb> but lets continue in -infra
20:00:31 <clarkb> #topic Open Discussion
20:00:36 <clarkb> any last minute items?
20:02:05 <clarkb> rumors all over the internet of a cpu bug having a big impact on VMs/hypervisors
20:02:25 <clarkb> so uh don't be surprised if our clouds reboot everything at some point
20:02:56 <pabelanger> good to know
20:03:07 <corvus> https://lwn.net/Articles/742404/
20:03:28 <corvus> and when they do, expect a 5-50% performance hit
20:03:59 <clarkb> proportional to your use of syscalls
20:04:05 <clarkb> aiui
20:05:08 <clarkb> ok thanks everyone.
20:05:13 <clarkb> #endmeeting