19:01:32 <clarkb> #startmeeting infra
19:01:33 <openstack> Meeting started Tue Jun 30 19:01:32 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:36 <openstack> The meeting name has been set to 'infra'
19:01:55 <clarkb> #topic Announcements
19:02:18 <ianw_pto> o/
19:02:20 <clarkb> If you hadn't noticed, our gitea installation was being ddos'd; it's under control now but only because we're blocking all of china unicom
19:02:31 <clarkb> we can talk more about this shortly
19:02:46 <clarkb> The other thing I wanted to mention is I'm taking next week off and unlike ianw_pto I don't intend to be here for the meeting :)
19:03:00 <clarkb> If we're going to have a meeting next week we'll need someone to volunteer for running it
19:03:28 <clarkb> #topic Actions from last meeting
19:03:35 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-23-19.01.txt minutes from last meeting
19:04:23 <clarkb> There were none
19:04:28 <clarkb> #topic Specs approval
19:04:40 <clarkb> ianw: oh no did the pto end?
19:04:46 <clarkb> #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:05:17 <clarkb> Going to continue to call this out and we did get a new patchset
19:05:19 <clarkb> I should read it
19:05:20 <ianw> heh yes was just yesterday
19:05:55 <clarkb> #topic Priority Efforts
19:05:59 <clarkb> #topic Opendev
19:06:04 <clarkb> Let's dive right in
19:06:34 <clarkb> Before we talk about the ddos I wanted to remind people that the advisory board will start moving forward at the end of this week
19:06:39 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-May/000026.html Advisory Board thread.
19:06:47 <clarkb> we've got a number of volunteers which is exciting
19:07:16 <clarkb> Also we had a gitea api issue with the v1.12.0 release
19:07:36 <clarkb> long story short listing repos requires pagination now but the way the repos are listed from the db doesn't consistently produce a complete list
19:08:00 <clarkb> we worked around that with https://review.opendev.org/#/c/738109/ and I proposed an upstream change at https://github.com/go-gitea/gitea/pull/12057 which seems to fix it as well
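For illustration, walking the paginated listing looks roughly like the sketch below against gitea's documented /api/v1/repos/search endpoint (the page size and jq filter are arbitrary choices); the bug being worked around was that successive pages didn't add up to a consistent, complete list:

```
# minimal sketch of paging through the repo list via the gitea API;
# requires curl and jq, values here are illustrative only
page=1
while :; do
    count=$(curl -s "https://opendev.org/api/v1/repos/search?page=${page}&limit=50" | jq '.data | length')
    [ "${count}" -eq 0 ] && break
    echo "page ${page}: ${count} repos"
    page=$((page + 1))
done
```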
19:08:29 <clarkb> For today's gitea troubles http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is a good illustration of what we saw
19:08:49 <clarkb> basically at ~midnight UTC today we immediately spiked to our haproxy connection limit
19:09:23 <clarkb> after digging around in gitea and haproxy logs it appears that there is a botnet doing a crawl of our gitea installation from many many many IP addresses, most of which belong to chinese ISPs
19:10:05 <clarkb> while doing that I noticed it appeared we had headroom to accept more connections, so I proposed bumping that limit from 4k to 16k in haproxy (note the cacti number is 2x the haproxy number because haproxy has a connection to both the frontend and the backend for each logical connection)
19:10:28 <clarkb> unfortunately our backends couldn't handle the new connections (we seemed to peak at about 8k logical connections)
19:11:11 <fungi> this may be in part due to specific characteristics of the requests we were being hit with
19:11:19 <clarkb> we went from having slowness and the occasional error to more persistent errors as the giteas ran out of memory. I manually reverted the maxconn change and https://review.opendev.org/#/c/738679/1 is in the gate to revert it properly. Then I restarted all the giteas and things got better.
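For context, the connection cap being tuned lives in the haproxy config roughly as shown below; the section name and values are illustrative rather than a copy of the production config:

```
# illustrative only -- not the production haproxy config
global
    maxconn 4096      # total concurrent connections haproxy will accept

frontend balance_git_https
    maxconn 4000      # per-frontend cap, kept under the global limit
```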
19:11:37 <clarkb> As part of recovery we also blocked all IPv4 ranges for china unicom on the haproxy load balancer
19:11:51 <clarkb> if we want to undo those drop rules we can restart the netfilter-persistent service on that host
19:12:17 <clarkb> yes, the requests are looking at specific files and commits and checking them across the different localizations that gitea offers
19:12:41 <clarkb> it's basically doing a proper web crawl, but not throttling itself, and the way it does it causes us problems
19:13:04 <clarkb> We appear to be stable right now even though the crawler seems to still be running from other IPs
19:13:16 * diablo_rojo sneaks in late
19:13:25 <clarkb> we're under that 4k connection limit and giteas seem happy.
19:13:52 <clarkb> The problem we're now faced with is how to address this properly so that people who just want to clone nova from china aren't blocked
19:13:54 <ianw> so it's currently manually applied config on haproxy node?
19:14:11 <clarkb> ianw: ya I did a for loop of iptables -I -j DROP -s $prefix
19:14:43 <clarkb> so a reboot or restart of our netfilter-persistent service will reset to our normal iptables ruleset
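The manual block amounts to roughly the following; the prefix list file name is hypothetical, and as noted above restarting netfilter-persistent reloads the managed ruleset, dropping the hand-added rules:

```
# roughly what was applied by hand on the load balancer; the prefixes
# file (china unicom CIDR ranges) is a hypothetical name for illustration
while read -r prefix; do
    iptables -I INPUT -s "$prefix" -j DROP
done < china-unicom-prefixes.txt

# undo: reload the managed ruleset, discarding the hand-added DROP rules
systemctl restart netfilter-persistent
```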
19:14:51 <ianw> cool; and does this thing have a specific UA string?
19:14:58 <fungi> we have no idea
19:14:59 <clarkb> ianw: good question
19:15:04 <clarkb> unfortunately gitea does not log UAs
19:15:10 <fungi> and haproxy can't see them
19:15:27 <clarkb> one idea I had was to tcpdump and then decrypt on gitea0X and see if we can sort that out
19:15:43 <clarkb> but was just trying to fight the fire earlier and haven't had time to really try ^
19:15:57 <clarkb> because ya if this is a well behaved bot maybe we can update/set robots.txt and be on our way
19:16:43 <corvus> i'll look into gitea options to log uas
19:16:43 <ianw> ok, i can probably help
19:16:48 <clarkb> https://docs.gitea.io/en-us/logging-configuration/#the-access_log_template implies we may be able to get that out of gitea actually
19:16:49 <fungi> it's worth checking, but my suspicion is that it's not going to be well-behaved or else it wouldn't be sourced from thousands of addresses across multiple service providers
19:16:52 <clarkb> corvus: thanks
19:17:55 <ianw> the traffic goes directly into gitea doesn't it, not via a reverse proxy?
19:18:01 <fungi> it acts like some crawler implemented on top of a botnet of compromised machines
19:18:03 <clarkb> corvus: reading that really quickly I think we want to change from default logger to access logger
19:18:12 <corvus> clarkb: i agree
19:18:16 <fungi> ianw: it's a layer 4 proxy
19:18:18 <clarkb> ianw: no its all through the load balancer
19:19:04 <fungi> ianw: oh, you mean at the backend... right, gitea's listening on the server's ip address directly, there's no apache handing off those connections via loopback
19:19:04 <clarkb> thinking out loud here: I think that while we're stable we should do the logging switch as that gives us more data
19:19:05 <ianw> sorry i'm thinking that we could put apache infront of gitea on each gitea node, and filter at that level
19:19:14 <corvus> filter how?
19:19:22 <clarkb> corvus: mod rewrite based on UA?
19:19:22 <corvus> (i mean, based on what criteria)
19:19:30 <ianw> via UA, if we find it misbehaving
19:19:31 <clarkb> assuming the UA is discernable
19:19:41 <ianw> yeah, and not robots obeying
19:20:32 <fungi> i've seen discussions about similar crawlers, and if they're not obeying robots.txt they also are quite likely to use a random assortment of popular browser agent strings too
19:20:33 <clarkb> I like that. Basically improve our logging to check whether a robots.txt fix would work. If not, that will tell us if the UA is filterable, and if so we could add an apache to front the giteas
19:21:00 <clarkb> and that is all a reason to not further filter IPs since we're under the limits and happy but still have enough of those requests to be able to debug them further
19:21:10 <clarkb> then make decisions based on whatever that tells us
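If the UA does turn out to be filterable, the apache-in-front-of-gitea idea would look roughly like this; the agent pattern, backend port, and vhost details are placeholders (TLS directives omitted), not an actual deployment config:

```
# hypothetical apache vhost in front of a gitea backend
<VirtualHost *:443>
    ServerName opendev.org

    RewriteEngine On
    # reject the misbehaving crawler by user agent, if one is identifiable
    RewriteCond %{HTTP_USER_AGENT} "BadCrawler" [NC]
    RewriteRule ^ - [F,L]

    # hand everything else to the local gitea process
    ProxyPass        / http://127.0.0.1:3000/
    ProxyPassReverse / http://127.0.0.1:3000/
</VirtualHost>
```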
19:22:09 <corvus> johsom also mentioned we can limit by ip in haproxy
19:22:23 <fungi> and yes, improved logging out of gitea would be lovely. out of haproxy too... if we knew the ephemeral port haproxy sourced each forwarded socket from, we could map those to log entries from gitea
19:22:25 <corvus> so if none of the above works, doing that might be a slightly better alternative to iptables
19:22:58 <ianw> ++ would be good to encode in haproxy config
19:23:08 <fungi> currently haproxy doesn't tell us what the source port for its forwarded socket was, just the client's source port, so we've got a blind spot even with improved gitea logging
19:23:49 <ianw> what is our robots.txt situation; i get a 404 for https://opendev.org/robots.txt
19:23:49 <clarkb> fungi: https://www.haproxy.com/blog/haproxy-log-customization/ we can do that too looks like
19:24:33 <clarkb> ianw: I want to say its part of our docker image?
19:25:27 <clarkb> ah I think we can set it in our custom dir and it would serve it, but we must not be doing that currently
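If it is a well-behaved crawler, a robots.txt along these lines served from the custom dir could throttle it without blocking normal users; the paths and delay are guesses, and Crawl-delay is a non-standard directive only some crawlers honour:

```
# hypothetical robots.txt -- paths and delay are illustrative
User-agent: *
Crawl-delay: 10
# keep crawlers out of the expensive per-commit and raw views
Disallow: /*/*/commit/
Disallow: /*/*/raw/
```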
19:25:33 <fungi> %bi provides "backend source IP (HAProxy connects with)" but maybe that includes the source port number
19:25:34 <corvus> #link https://review.opendev.org/738684 Enable access log in gitea
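For reference, the switch that change makes is roughly the following in gitea's app.ini; the exact template fields for including the user agent are in the logging docs linked above:

```
; minimal sketch: enable gitea's access logger
[log]
ENABLE_ACCESS_LOG = true
; ACCESS_LOG_TEMPLATE can be customized to include the user agent;
; see the access_log_template section of the gitea logging docs
```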
19:25:43 <clarkb> fungi: %bp is the port
19:26:05 <fungi> oh, duh, that was the next line below %bi and i totally missed it
19:26:09 <fungi> thanks
19:27:01 <clarkb> corvus: note we may need log rotation for those files
19:27:17 <clarkb> corvus: looks like we could have it interleave with the regular log if we want (then journald/dockerd deal with rotation?)
19:27:29 <fungi> yeah, so with the added logging in gitea and haproxy we'll be able to map any request back to an actual client ip address
19:27:39 <fungi> that will be a huge help
19:27:41 <clarkb> ++
19:27:54 <clarkb> fungi: would you like to do the haproxy side or should we find another volunteer?
19:28:03 <fungi> i'm already looking into it
19:28:07 <clarkb> awesome, thanks
19:28:20 <clarkb> Anything else we want to bring up on the subject of gitea, haproxy, or opendev?
19:28:34 <clarkb> I think this gives us a number of good next steps but am open to more ideas. Otherwise we can continue the meeting
19:29:09 <fungi> i just want to make it clear that even though we blocked access from china unicom's address space, we don't have any reason to believe they're a responsible party in this situation
19:30:15 <fungi> they're a popular isp who happens to have many customers in a place where pirated operating systems which can never receive security fixes are standard protocol, and so the majority of compromised hosts in large botnets tend to be on ip addresses of such isps
19:31:55 <clarkb> #topic Update Config Management
19:32:15 <clarkb> we've been iterating on having ze01 run off of the zuul-executor docker image
19:32:32 <clarkb> frickler turned it off again today for a reason I've yet to fully look into due to the gitea issues
19:32:49 <fungi> i saw some mention of newly discovered problems, yeah, but got sideswiped by other bonfires
19:33:22 <clarkb> looks like it was some sort of iptables issue. We've actually seen that issue before on non container executor jobs as well I think
19:33:29 <clarkb> but in this case they were all on ze01 so it was thought we should turn it off
19:33:47 <ianw> i had a quick look at that ... it was very weird and an ansible error that "stdout was not available in the dict instance"
19:33:50 <clarkb> we attempt to persist firewall rules on the remote host and do an iptables save for that
19:34:03 <clarkb> ianw: ya we've had that error before then it went away
19:34:04 <frickler> there were "MODULE FAILURE" errors in the job logs
19:34:20 <clarkb> I'm guessing some sort of ansible/iptables bug and maybe the container is able to reproduce it reliably
19:34:26 <ianw> basically a variable made with register: on a command: somehow seemed to not have stdout
19:34:27 <clarkb> (due to a timing issue or set of tooling etc)
19:34:37 <frickler> and then I found lots of "ModuleNotFoundError: No module named \'gear\'" on ze01
19:34:53 <frickler> and assumed some relation, though I didn't dig further
19:35:07 <clarkb> got it. So possibly two separate or related issues that should be looked into
19:35:11 <clarkb> thanks for the notes
19:35:16 <ianw> yeah, it was like it was somehow running in a different process or something
19:36:02 <clarkb> mordred isn't around today otherwise he'd probably have ideas. Maybe we can sit on this for a bit until mordred can debug?
19:36:10 <clarkb> though if someone else would like to feel free
19:36:28 <frickler> the stdout error was just because it was trying to look at the output of the failed module
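The pattern being described is roughly the sketch below (task names and paths are illustrative, not the actual persistent-firewall role): a command result is registered, and when the module itself crashes the registered dict has no stdout, so any later reference to it produces the error ianw quoted:

```
# illustrative only -- not the actual persistent-firewall role tasks
- name: List current ipv4 rules
  command: iptables-save
  register: iptables_rules
  # if the module itself crashes (MODULE FAILURE) rather than returning
  # rc != 0, the registered result has no .stdout key at all

- name: Persist the rules we saw
  copy:
    content: "{{ iptables_rules.stdout }}"
    dest: /etc/iptables/rules.v4
  # a defensive guard avoids the confusing secondary
  # "stdout was not available in the dict instance" error
  when: iptables_rules.stdout is defined
```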
19:36:59 <corvus> is there a pointer to the error somewhere
19:37:13 <frickler> see the logs in #openstack-infra
19:37:33 <clarkb> https://fa41114c73dc4ffe3f14-2bb0e09cfc1bf1e619272dff8ccf0e99.ssl.cf2.rackcdn.com/738557/2/check/tripleo-ci-centos-8-containers-multinode/7cdd1b2/job-output.txt was linked there
19:37:39 <clarkb> and shows the module failure for iptables saving
19:37:53 <corvus> clarkb: thanks.  i do a lot better with "here's a link to a problem we don't understand"
19:39:39 <frickler> and the failure on ze01 appeared very often
19:39:42 <ianw> frickler: oh interesting; "persistent-firewall: List current ipv4 rules" shows up as OK in the console log, but seems like it was not OK
19:40:15 <frickler> ianw: there were two nodes, one passed the other failed
19:40:57 <corvus> why do we think that's related to the executor?
19:41:17 <ianw> to my eye, they both look OK in the output @  https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json
19:41:31 <frickler> corvus: because of "ModuleNotFoundError: No module named \'gear\'" in the executor log
19:41:46 <frickler> corvus: that may be a different thing, but it looked similar
19:41:47 <ianw> https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#34529 in particular
19:42:08 <clarkb> ya I think we may have two separate issues. The gear thing is probably related to the container image but the iptables thing I'm not sure
19:42:13 <frickler> corvus: together with this seeming to be a new issue and ze01 being changed yesterday, that was enough hints for me
19:42:35 <corvus> the gear issue didn't cause that job to fail though, right?
19:42:51 <clarkb> corvus: unless that causes post_failure? I'm not sure if the role is set up to fail on that or not
19:43:17 <corvus> clarkb: that was a retry_limit: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6
19:43:20 <corvus> centos8
19:43:51 <frickler> ianw: the failure is later: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#76130
19:44:24 <corvus> it sounds like there's perhaps a non-critical error on the executor with a missing gear package, but i don't think that should cause jobs to fail
19:44:51 <corvus> separately, there are lots of jobs retrying because of the centos8-tripleo issues
19:45:40 <ianw> frickler: yeah @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#37597
19:45:58 <ianw> but all the expected output is there
19:46:07 <ianw> anyway, we can probably debug outside the meeting
19:46:23 <clarkb> ++ lets continue afterwards
19:46:39 <clarkb> #topic General Topics
19:46:47 <clarkb> #topic DNS Cleanup
19:46:51 <corvus> https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary
19:46:58 <corvus> that's the task that caused the stdout error
19:47:07 <corvus> before we move on
19:47:21 <corvus> i'd like to understand what are the blockers for the executor
19:47:34 <clarkb> #undo
19:47:34 <openstack> Removing item from minutes: #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary
19:47:49 <clarkb> wait, what? that's not what I expected to be undone
19:47:50 <corvus> is it agreed that the only executor-related error is the (suspected non-fatal) missing gear package?
19:47:55 <clarkb> #undo
19:47:56 <openstack> Removing item from minutes: #topic DNS Cleanup
19:47:58 <clarkb> #undo
19:47:59 <openstack> Removing item from minutes: #topic General Topics
19:48:08 <clarkb> #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary caused iptables failure
19:48:09 <corvus> or am i missing something?
19:48:22 <clarkb> corvus: that is my understanding
19:48:30 <clarkb> gear is what needs addressing then we can turn ze01 back on?
19:48:42 <corvus> i suspect that would just cause us not to submit logstash jobs
19:48:57 <clarkb> that is my understanding as well
19:49:06 <corvus> cool, i'll work on adding that
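A hedged sketch of the kind of fix, assuming it is done in the image build; the actual change may instead add gear to the zuul image's requirements, and the base image contents (pip availability, tag) are assumptions:

```
# hypothetical image tweak only -- the real fix may add gear to the zuul
# image requirements rather than layering on top like this
FROM zuul/zuul-executor:latest
RUN python3 -m pip install gear
```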
19:49:22 <clarkb> #topic General Topics
19:49:26 <clarkb> #topic DNS Cleanup
19:49:43 <clarkb> I kept this on the agenda as a reminder that I meant to do a second pass of record removals and have not done that yet and things have been busy with fires
19:49:47 <clarkb> nothing else to add with this though
19:50:12 <clarkb> #topic Time to retire openstack-infra mailing list?
19:50:21 <clarkb> fungi: this was your topic want to quickly go over it ?
19:50:47 <clarkb> The last email to that list was on june 2
19:50:59 <fungi> sure, just noting that the infra team has been supplanted by the tact sig, which claims (currently) to use the openstack-discuss ml like other sigs
19:51:04 <clarkb> and was from zbr who we can probably convince to email service-discuss or openstack-discuss depending on the context
19:51:14 <fungi> and as you've observed, communication levels on it are already low
19:51:51 <fungi> we've likely still got the address embedded in various places, like pypi package metadata in older releases at the very least, so if we do decide it's time to close it down i would forward that address to the openstack-discuss ml
19:52:26 <clarkb> I'm good with shutting it down and setting up the forward
19:52:38 <clarkb> it was never a very busy list anyway so unlikely to cause problems with the forward
19:52:40 <fungi> this was mainly an informal addition to the meeting topic just to get a feel for whether there are strong objections, it's not time yet, whatever
19:53:12 <fungi> next step would be for me to post to that ml with a proposed end date (maybe august 1?) and make sure there are no objections from subscribers
19:53:20 <clarkb> fungi: maybe send an email to that list with a proposed date a week or two in the future then just do it?
19:53:28 <clarkb> that way anyone still subbed will get a notification first
19:53:34 <frickler> seems fine for me
19:53:45 <fungi> ahh, okay, sure i could maybe say july 15
19:54:13 <fungi> if folks don't think that's too quick
19:54:17 <clarkb> works for me
19:54:36 <fungi> anyway, not hearing objections, i'll go forth with the (hopefully final) ml thread
19:54:43 <clarkb> thanks!
19:55:01 <clarkb> #topic Grafana deployments from containers
19:55:02 <diablo_rojo> thanks!
19:55:05 <clarkb> #link https://review.opendev.org/#/q/status:open+topic:grafana-container
19:55:24 <clarkb> ianw: want to quickly update us on this subject? I know you need reviews (sorry too many fires)
19:55:54 <fungi> yes, i stuck it on the top of my review stack when i went to bed last night, and it only got buried as soon as i woke up :/
19:56:18 <ianw> sorry, yeah basically grafana and graphite containers
19:56:31 <ianw> if people want to review, then i can try deploying them
19:56:48 <ianw> grafana should be fine, graphite i'll have to think about data migration
19:56:53 <clarkb> cool, thanks for working on that. It's on my todo list for when I get out from under my fires backlog
19:57:19 <ianw> (but it can sit as graphite.opendev.org for testing and while we do that, and then just switch dns at an appropriate time)
19:58:02 <clarkb> and that's basically all we had time for.
19:58:16 <clarkb> We didn't manage to get to every item on the agenda but the gitea brainstorm was really useful
19:58:21 <clarkb> Thanks everyone
19:58:28 <clarkb> feel free to bring up anything we missed in #opendev
19:58:31 <clarkb> #endmeeting