19:01:12 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Jan 28 19:01:12 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:23 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-January/006583.html Our Agenda
19:01:24 <zbr> o/
19:01:29 <clarkb> #topic Announcements
19:01:37 <clarkb> I did not have any announcements to announce
19:02:07 <corvus> clarkb: nice announcement
19:02:24 <clarkb> #topic Actions from last meeting
19:02:30 <clarkb> corvus: it was a good one
19:02:33 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-01-21-19.01.txt minutes from last meeting
19:02:42 <clarkb> There were no actions recorded in the last meeting
19:03:08 <clarkb> #topic Priority Efforts
19:03:13 <clarkb> Let's dive right in then
19:03:17 <clarkb> #topic OpenDev
19:03:35 <clarkb> #link https://review.opendev.org/#/c/703134/ Split OpenDev out of OpenStack Governance
19:03:41 <clarkb> #link https://review.opendev.org/#/c/703488/ Update OpenDev docs with new Governance
19:04:02 <clarkb> I think these two changes are just about ready to go in. At least I haven't seen much new feedback recently on the first one
19:04:25 <clarkb> I'll bring it up with the TC to see what the next steps are from their side to keep it moving
19:04:34 <clarkb> but if you've got any input now would be a great time to record it
19:05:48 <clarkb> Are there any questions about this move to bring up here?
19:07:25 <clarkb> The other opendev item I wanted to bring up was that we had been experiencing a ddos from huawei cloud against our gitea servers
19:07:37 <corvus> (likely unintentional)
19:07:53 <clarkb> correct
19:08:15 <clarkb> I ended up emailing our OSF board member from huawei and they said they were customer IPs so couldn't put us directly in touch but did bring it up with the customer
19:08:21 <clarkb> since then I've not noticed similar behavior
19:08:43 <clarkb> If we continue to see similar OOMing behavior though we should likely strongly consider gitea hosts with more memory.
19:09:18 <clarkb> fungi mentioned rate limiting requests, but I think that may make it worse because the git processes would stay around longer as the requests would take more time. And it is the git processes loading a bunch of info into memory that causes the problems
19:09:34 <clarkb> (another option may be more gitea hosts, as that would let the lb distribute the load better)
19:09:57 <clarkb> something to monitor but for now it is no longer an emergency
19:09:58 <corvus> in the long run, if we can get to a single distributed instance, we should be able to handle that better with load balancing
19:10:24 <clarkb> ++
19:10:32 <corvus> but yeah, until that happens, it seems like either more ram or more hosts are the best short term options
19:10:34 <fungi> yes, if the gitea servers all shared a coherent backend filesystem view that would be much easier to absorb and scale
19:11:05 <corvus> (more hosts should work in this case since we're talking about multiple source ips)
19:11:44 <fungi> right now we can't even guarantee that two gitea servers have the exact same commits at any given point in time, due to gerrit scheduling replication to them independently
19:11:55 <ianw> are they "legitimate" requests, or more looking like scripts gone wild?
19:12:14 <clarkb> ianw: it looked like periodic CI jobs that cloned everything from scratch each day
19:12:18 <fungi> ianw: best guess is a ci system which doesn't cache repositories locally and clones them all for every build
19:12:20 <clarkb> from hundreds of hosts
19:13:02 <fungi> timing also could be related to the recent cinder announcement about drivers without third-party ci getting marked as unsupported
19:13:11 <clarkb> that is possible
19:13:44 <ianw> even if we could infinitely scale, perhaps rate limiting that sort of thing is best for everyone anyway
19:13:49 <fungi> huawei representatives did say it wasn't a huawei system though, just some customer in their public cloud
19:14:19 <fungi> the challenge is how to rate-limit it so that it doesn't make the matter worse, as clarkb points out
19:14:21 <clarkb> if we wanted to go the rate limit route we'd have to limit requests before git gets forked
19:14:30 <clarkb> it is doable but a naive approach would likely not work well
19:15:02 <fungi> i suspect the only sane limits would have to be implemented within gitea itself
19:15:44 <fungi> for example, to allow the host to stop serving new requests once system resource utilization reaches some defined thresholds
19:16:05 <fungi> so that the load balancer starts sending subsequent requests to other hosts in the pool
19:16:35 <fungi> could also implement that with a health check agent reporting host info in haproxy, but that's complex to set up
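A minimal sketch of the kind of agent fungi describes above, purely illustrative and not anything currently deployed: a tiny TCP responder that haproxy's agent-check could poll, replying "drain" when the gitea host is low on memory. The port, memory threshold, and use of /proc/meminfo are assumptions.

```python
#!/usr/bin/env python3
# Hypothetical haproxy agent-check responder for a gitea backend.
# haproxy would point an "agent-check agent-port 9999" at this script,
# which replies "up" or "drain" based on available memory.
import socketserver

MIN_FREE_KB = 2 * 1024 * 1024  # assumed threshold: 2 GiB available


def mem_available_kb():
    # Read MemAvailable from /proc/meminfo (Linux only).
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    return 0


class AgentHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # haproxy expects a short ASCII status word terminated by a newline.
        state = 'up' if mem_available_kb() > MIN_FREE_KB else 'drain'
        self.request.sendall((state + '\n').encode('ascii'))


if __name__ == '__main__':
    with socketserver.TCPServer(('0.0.0.0', 9999), AgentHandler) as server:
        server.serve_forever()
```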
19:19:23 <ianw> can haproxy "count" how many connections have been made, and cut you off?  or are the ips spread out enough that it would get under that but still cause problems?
19:20:04 <clarkb> I think the ips were spread out enough in this case
19:20:05 <fungi> you could do that with iptables/conntrack actually
19:20:26 <clarkb> which is why a consumption monitoring system might be necessary
19:20:37 <clarkb> cloning nova requires just over a gig of memory
19:20:45 <clarkb> have enough of those (and not very many) and you run out of memory
19:21:06 <fungi> yeah, it's less about the request volume and more about the impact of specific requests
19:22:20 <fungi> and somewhat, though not directly, correlated to the data transferred
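Rough back-of-the-envelope for the memory point above; the "just over a gig" per nova clone figure comes from the discussion, while the host size and reserved headroom are assumed purely for illustration.

```python
# Back-of-envelope: how many concurrent "clone nova" requests fit on one host?
# ~1 GiB per clone is the figure from the discussion; host RAM and the amount
# reserved for gitea, the database, and the OS are assumptions.
GIB = 1024 ** 3
host_ram = 8 * GIB          # assumed gitea host size
reserved = 2 * GIB          # assumed headroom for gitea, db, kernel, caches
per_clone = 1.1 * GIB       # "just over a gig" per nova clone

max_clones = int((host_ram - reserved) // per_clone)
print(f"~{max_clones} concurrent nova clones before the host starts to OOM")
# => ~5, so a few hundred CI nodes cloning at once easily overwhelms the pool
```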
19:23:10 <clarkb> having a proper gitea cluster should largely mitigate this as long as our k8s cluster has enough headroom
19:23:18 <clarkb> as new giteas can be spun up to meet increases in demand
19:23:27 <clarkb> it's possible we may just want to focus on making ^ a reality
19:24:37 <corvus> i haven't checked in on the status of elasticsearch support recently
19:25:52 <corvus> progress is being made https://github.com/go-gitea/gitea/pull/9428
19:26:08 <corvus> but we still need code search for that
19:26:25 <clarkb> exciting, maybe we keep that as the focus as it solves other problems too. And keep the other option in mind if the ddos problems comes back or gets worse
19:27:02 <clarkb> Anything else on this topic or should we move on?
19:27:37 <ianw> is this the type of thing we could have a story to track?
19:28:00 <ianw> i sort of worry that there's a lot of investigation that isn't consolidated anywhere
19:28:19 <clarkb> ianw: ++ (too easy to forget when things are broken)
19:28:35 <clarkb> #action clarkb write up story on gitea OOMs and DDoS
19:28:40 <clarkb> I'll fix that
19:29:35 <ianw> thanks, that will be great as we track it over time
19:30:04 <clarkb> #topic Update Config Management
19:30:23 <clarkb> I don't think mordred has made much progress on the dockerification of gerrit
19:30:31 <clarkb> he has been busy with travel and such
19:30:41 <clarkb> Does anyone else have config management updates to bring up?
19:31:18 <ianw> i really want to get back to nodepool builder from containers very soon
19:32:14 <clarkb> ianw: is that covered by testing on the nodepool side yet?
19:34:05 <ianw> ummm ... not sure
19:34:47 <clarkb> no worries, was just curious
19:34:54 <clarkb> #topic General Topics
19:35:23 <clarkb> I/we have been semi-formally pushing on getting rid of Trusty once and for all over the last few days
19:35:26 <ianw> (re: nodepool yes we have container tests merged)
19:36:05 <clarkb> I've been working on a static.openstack.org replacement which needs a new gerritlib release which I'll push after the meeting
19:36:19 <fungi> status not static, right?
19:36:20 <clarkb> once that is in I half expect it to be functional, I'll confirm that then update dns
19:36:24 <clarkb> er yes status.
19:36:49 <clarkb> As part of this I realized we have a lack of testing around jeepyb and gerritlib so I've been working on a new job this morning to do an integration test with a running gerrit
19:37:17 <clarkb> this should give us a lot more confidence in jeepyb and gerritlib changes which are likely to become important as we tool up some of the opendev self serve stuff. If nothing else we can reuse the test platform and build different tools
19:37:40 <clarkb> I'm close to having that stack ready for review and will bring it up after the meeting once I have the ready for review commits pushed
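Not the actual job stack clarkb describes, just a sketch of the shape such an integration test could take: drive a throwaway Gerrit with gerritlib the way jeepyb would. The gerritlib Gerrit class and the createProject/listProjects calls are written from memory, and the host, user, and key path are assumptions.

```python
# Hypothetical smoke test against a disposable Gerrit started by the job.
# gerritlib's Gerrit class and the createProject/listProjects method names
# are from memory; treat them and the connection details as assumptions.
from gerritlib import gerrit

GERRIT_HOST = 'localhost'             # assumed: Gerrit container on the test node
GERRIT_USER = 'admin'
GERRIT_KEY = '/tmp/gerrit-admin-key'  # assumed: key injected by the job


def main():
    client = gerrit.Gerrit(GERRIT_HOST, GERRIT_USER, 29418, GERRIT_KEY)

    # Create a project the way jeepyb's manage-projects would, then make
    # sure it shows up when we list projects back out of Gerrit.
    client.createProject('opendev/sandbox-test')
    projects = client.listProjects()
    assert 'opendev/sandbox-test' in projects, projects
    print('gerritlib round trip OK:', len(projects), 'projects visible')


if __name__ == '__main__':
    main()
```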
19:37:51 <clarkb> ianw: any progress on static? I think the change to add the host landed yesterday
19:38:19 <ianw> if i could get some eyes on
19:38:28 <ianw> #link https://review.opendev.org/704523
19:38:43 <ianw> hit a small snag deploying openafs
19:39:05 <ianw> but, after that, we should be able to test out governance and security sites with local host updates
19:39:33 <ianw> if we're happy, we can switch dns at some point after that
19:39:58 <ianw> with that POC the other bits should proceed quickly as we get them published to afs
19:40:34 <clarkb> sounds good, I'll review that after the meeting
19:40:50 <clarkb> wiki was the other big host in this list, fungi have you had a chance to push on it yet?
19:41:08 <fungi> unfortunately no. trying to get freed up to do that today
19:41:14 <clarkb> ok let us know if we can help
19:41:26 <zbr> can i add something about our static websites?
19:41:30 <fungi> have to run some errands after the meeting and will then figure out where i last left off
19:41:34 <clarkb> zbr: sure
19:41:59 <zbr> we should enable google site tools in order to know who is linking to us and be able to reindex quickly when we update stuff
19:42:15 <zbr> or find links that are broken
19:42:44 <ianw> that's usually dropping a randomly named html file in the root, right?
19:42:46 <zbr> enabling it requires only some kind of site verification, no JS mess is needed.
19:42:49 <zbr> exactly
19:43:09 <corvus> i don't think that's something that the infra/opendev team needs to do
19:43:14 <zbr> when zuul docs got broken I raised https://review.opendev.org/#/c/702888/ for enabling it.
19:43:19 <corvus> that should be up to the individual projects
19:43:27 <clarkb> individual projects should be able to verify themselves if its based on content in the site
19:44:08 <fungi> i'd be very uncomfortable forcing a tacit endorsement of some proprietary third-party service on all sites opendev hosts
19:44:14 <corvus> my read is that the zuul project is not thrilled about the idea
19:44:56 <fungi> an alternative is to do what we did for docs.openstack.org and provide 404 reporting scraped from the apache error logs
19:45:04 <zbr> so we provide a worsened experience to the users just to spite proprietary 3rd parties?
19:45:41 <clarkb> zbr: I think the goal would be to address the problem without relying on a proprietary service. The preexisting 404 scanner tool is one such method (as fungi points out)
19:45:43 <zbr> clearly the docs were broken for days before we fixed them
19:45:56 <clarkb> and if projects wish to opt into those proprietary tools they can do so without our input aiui
19:46:31 <fungi> for the record, zuul's docs were not broken for days. external search engines were serving stale cached links
19:47:00 <fungi> zuul's documentation provides its own index which is required to be consistent by the tool which builds it
19:47:29 <fungi> i'll grant that the integrated keyword searching for sphinx is not great compared to what external services provide, but that's independent of the documentation index itself
19:47:58 <clarkb> zbr: but yes, one of the explicit goals here is to push viable project hosting via open source tooling
19:48:19 <fungi> and to improve those open tools where necessary
19:48:25 <ianw> is that hash in the file name based on the URL, or based on the user account requesting to add the site?
19:48:33 <zbr> ok, i just wanted to state that we should not ignore the UX
19:48:57 <corvus> ianw: that's for a user account
19:49:17 <corvus> https://review.opendev.org/702888 would give zbr administrative control of the zuul-ci.org domain in google webmaster tools
19:49:23 <ianw> ok, so it's not like you add one file and everyone/anyone could verify it
19:49:49 <zbr> true, but someone has to do it, it does not have to be me.
19:50:03 <corvus> in fact, no one has to do it
19:50:30 <fungi> as evidenced by the fact that no one has so far
19:50:48 <zbr> so it's better to do nothing just because we don't want to give someone the permission to do SEO?
19:51:16 <corvus> i feel like the conversation has looped back to 19:45
19:51:25 <Shrews> the fact that folks find it better to use an external search engine to find the docs they need points to the fact that we should improve the layout of our current docs to make things easier to find. the reorg was (hopefully) the beginning of that effort
19:51:31 <clarkb> I think this is a question for the zuul project not opendev, but I also don't think anyone is saying we have to give up. Simply that Zuul would like to use open source tools to address this problem
19:51:49 <clarkb> we only have about 8 minutes left and there are a couple more topics, I think we should move on
19:52:07 <corvus> Shrews, clarkb: ++
19:52:08 <clarkb> I wanted to point out we have deployed a new (arm64) cloud to production in nodepool
19:52:28 <clarkb> Unfortunately there is some weird network behavior between nb03 and the cloud apis so we have been unable to upload images
19:52:49 <ianw> also the mirror has disappeared unexpectedly, and the api hasn't helped determine why
19:52:54 <clarkb> we've reached out to kevinz on this but it is the chinese new year so expect it might be a little while before that gets fixed
19:53:04 <ianw> tracking things in
19:53:06 <ianw> #link https://storyboard.openstack.org/#!/story/2007195
19:53:30 <clarkb> this new cloud will give us like 5 times the capacity for arm64 jobs
19:53:33 <clarkb> it is very exciting
19:54:00 <ianw> yep, and we'll for sure help sort out any stability issues, we always do :)
19:54:36 <clarkb> And finally I set up a followup call with the airship team to talk about any new questions about adding the ericsson cloud to opendev nodepool. That happens tomorrow at 1600UTC on jitsi meet.
19:54:48 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-January/006578.html Airship CI Meeting details here
19:55:16 <clarkb> I ended up using jitsi meet for something else recently and it worked really well so wanted to give it a go. It is shaping up to be an open source alternative to google hangouts and zoom etc
19:55:43 <clarkb> y'all are welcome to join. The timing is probably bad for ianw, you should sleep in instead :)
19:56:22 <clarkb> And with that we have ~3 minutes for anything else
19:56:25 <clarkb> #topic Open Discussion
19:57:35 <fungi> zbr: i've been evaluating our various options for open tools to perform analysis of our web activity in socially-conscious ways, producing reports which avoid any use of pii so they can be provided publicly. so far each of the classic tools i've evaluated has had one problem or another, but this one came to my attention last week and i'm curious to try it:
19:57:40 <fungi> #link https://www.goatcounter.com/
19:58:28 <zbr> thanks, still think we are missing the point.
19:58:54 <fungi> i apparently am
19:59:20 <zbr> you cannot force google to reindex your site with 3rd party tools, neither to convince them to tell you about incoming links, broken stuff and so on.
19:59:36 <zbr> this is not about analytics stuff
20:00:02 <clarkb> zbr: right, but if we properly add redirects then we don't need to force reindexing
20:00:04 <fungi> you can however find out what links are "broken" by seeing what urls folks request which return errors and add corresponding redirects
20:00:08 <clarkb> we can instead rely on their periodic reindexing
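A minimal sketch of the kind of 404 reporting fungi mentions, to show the idea rather than the existing docs.openstack.org tooling: scan Apache combined-format access logs for 404 responses and rank the missing URLs so redirects can be added. The log path and output format are assumptions.

```python
# Summarize which missing URLs are requested most, so redirects can be added.
# Log path and combined log format are assumptions for illustration.
import re
from collections import Counter

LOG_FILE = '/var/log/apache2/access.log'   # assumed location
# combined format: host ident user [time] "METHOD path PROTO" status size ...
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) \S+" 404 ')


def top_404s(path, limit=20):
    counts = Counter()
    with open(path, errors='replace') as f:
        for line in f:
            match = LINE_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts.most_common(limit)


if __name__ == '__main__':
    for url, hits in top_404s(LOG_FILE):
        print(f'{hits:6d}  {url}')
```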
20:00:21 <clarkb> and we are at time
20:00:28 <clarkb> thank you everyone
20:00:30 <clarkb> #endmeeting