19:01:32 #startmeeting infra
19:01:33 Meeting started Tue Jun 30 19:01:32 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:36 The meeting name has been set to 'infra'
19:01:55 #topic Announcements
19:02:18 o/
19:02:20 If you hadn't noticed, our gitea installation was being ddos'd; it's under control now, but only because we're blocking all of China Unicom
19:02:31 we can talk more about this shortly
19:02:46 The other thing I wanted to mention is I'm taking next week off, and unlike ianw_pto I don't intend to be here for the meeting :)
19:03:00 If we're going to have a meeting next week we'll need someone to volunteer to run it
19:03:28 #topic Actions from last meeting
19:03:35 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-23-19.01.txt minutes from last meeting
19:04:23 There were none
19:04:28 #topic Specs approval
19:04:40 ianw: oh no, did the pto end?
19:04:46 #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:05:17 Going to continue to call this out, and we did get a new patchset
19:05:19 I should read it
19:05:20 heh yes, it was just yesterday
19:05:55 #topic Priority Efforts
19:05:59 #topic Opendev
19:06:04 Let's dive right in
19:06:34 Before we talk about the ddos I wanted to remind people that the advisory board will start moving forward at the end of this week
19:06:39 #link http://lists.opendev.org/pipermail/service-discuss/2020-May/000026.html Advisory Board thread.
19:06:47 we've got a number of volunteers, which is exciting
19:07:16 Also, we had a gitea API issue with the v1.12.0 release
19:07:36 long story short: listing repos requires pagination now, but the way the repos are listed from the db doesn't consistently produce a complete list
19:08:00 we worked around that with https://review.opendev.org/#/c/738109/ and I proposed an upstream change at https://github.com/go-gitea/gitea/pull/12057 which seems to fix it as well
19:08:29 For today's gitea troubles, http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is a good illustration of what we saw
19:08:49 basically at ~midnight UTC today we immediately spiked to our haproxy connection limit
19:09:23 after digging around in gitea and haproxy logs, it appears that there is a botnet doing a crawl of our gitea installation from many, many, many IP addresses, most of which belong to Chinese ISPs
19:10:05 while doing that I noticed it appeared we had headroom to accept more connections, so I proposed bumping that limit from 4k to 16k in haproxy (note the cacti number is 2x the haproxy number because haproxy has a connection to the frontend and to the backend for each logical connection)
19:10:28 unfortunately our backends couldn't handle the new connections (we seemed to peak at about 8k logical connections)
19:11:11 this may be in part due to specific characteristics of the requests we were being hit with
19:11:19 we went from having slowness and the occasional error to more persistent errors as the giteas ran out of memory. I manually reverted the maxconn change and https://review.opendev.org/#/c/738679/1 is in the gate to revert it properly. Then I restarted all the giteas and things got better.
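To make the connection-limit discussion above concrete, here is a minimal, illustrative haproxy.cfg fragment showing where such a cap lives; the section names, ports, and values are assumptions for the sketch, not the actual opendev configuration. The point is that raising maxconn on the balancer only moves the pressure onto the gitea backends if they cannot absorb the extra requests.

    global
        # hard cap on concurrent connections haproxy itself will hold open
        maxconn 4096

    frontend git-https
        bind :443
        mode tcp
        # per-frontend cap; bumping this (e.g. to 16384) lets more clients
        # in but does nothing to protect the backends behind it
        maxconn 4000
        default_backend gitea-https

    backend gitea-https
        mode tcp
        # illustrative backend entry; the real pool has several gitea hosts
        server gitea01 gitea01.opendev.org:3081 check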
19:11:37 As part of recovery we also blocked all IPv4 ranges for China Unicom on the haproxy load balancer
19:11:51 if we want to undo those drop rules we can restart the netfilter-persistent service on that host
19:12:17 yes, the requests are looking at specific files and commits and checking them across the different localizations that gitea offers
19:12:41 it's basically doing a proper web crawl, but it isn't throttling itself, and the way it does it causes us problems
19:13:04 We appear to be stable right now even though the crawler seems to still be running from other IPs
19:13:16 * diablo_rojo sneaks in late
19:13:25 we're under that 4k connection limit and the giteas seem happy.
19:13:52 The problem we're now faced with is how to address this more properly so that people who just want to clone nova from China aren't blocked
19:13:54 so it's currently manually applied config on the haproxy node?
19:14:11 ianw: ya, I did a for loop of iptables -I -j DROP -s $prefix
19:14:43 so a reboot or restart of our netfilter-persistent service will reset to our normal iptables ruleset
19:14:51 cool; and does this thing have a specific UA string?
19:14:58 we have no idea
19:14:59 ianw: good question
19:15:04 unfortunately gitea does not log UAs
19:15:10 and haproxy can't see them
19:15:27 one idea I had was to tcpdump and then decrypt on gitea0X and see if we can sort that out
19:15:43 but I was just trying to fight the fire earlier and haven't had time to really try ^
19:15:57 because ya, if this is a well behaved bot maybe we can update/set robots.txt and be on our way
19:16:43 i'll look into gitea options to log UAs
19:16:43 ok, i can probably help
19:16:48 https://docs.gitea.io/en-us/logging-configuration/#the-access_log_template implies we may be able to get that out of gitea actually
19:16:49 it's worth checking, but my suspicion is that it's not going to be well-behaved, or else it wouldn't be sourced from thousands of addresses across multiple service providers
19:16:52 corvus: thanks
19:17:55 the traffic goes directly into gitea, doesn't it, not via a reverse proxy?
19:18:01 it acts like some crawler implemented on top of a botnet of compromised machines
19:18:03 corvus: reading that really quickly, I think we want to change from the default logger to the access logger
19:18:12 clarkb: i agree
19:18:16 ianw: it's a layer 4 proxy
19:18:18 ianw: no, it's all through the load balancer
19:19:04 ianw: oh, you mean at the backend... right, gitea's listening on the server's ip address directly, there's no apache handing off those connections via loopback
19:19:04 thinking out loud here: I think that while we're stable we should do the logging switch, as that gives us more data
19:19:05 sorry, i'm thinking that we could put apache in front of gitea on each gitea node, and filter at that level
19:19:14 filter how?
19:19:22 corvus: mod_rewrite based on UA?
19:19:22 (i mean, based on what criteria)
19:19:30 via UA, if we find it misbehaving
19:19:31 assuming the UA is discernible
19:19:41 yeah, and not obeying robots.txt
19:20:32 i've seen discussions about similar crawlers, and if they're not obeying robots.txt they are also quite likely to use a random assortment of popular browser agent strings
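The apache-in-front idea floated above would look roughly like the following sketch. It is hypothetical: no apache currently fronts the giteas, the vhost and ports are placeholders, and "BadBotUA" stands in for whatever agent string the improved logging might eventually reveal. It assumes mod_rewrite and mod_proxy_http are enabled.

    <VirtualHost *:3081>
        # hypothetical reverse proxy on the gitea host itself, handing
        # requests to gitea's default web port (assumed 3000 here)
        RewriteEngine On
        # reject requests whose User-Agent matches the (as yet unknown)
        # crawler signature with a 403
        RewriteCond %{HTTP_USER_AGENT} "BadBotUA" [NC]
        RewriteRule .* - [F,L]

        ProxyPass        / http://127.0.0.1:3000/
        ProxyPassReverse / http://127.0.0.1:3000/
    </VirtualHost>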
19:20:33 I like that. Basically improve our logging to check if it is a robots.txt fix. If not, that will tell us if the UA is filterable, and if so we could add an apache to front the giteas
19:21:00 and that is all a reason to not further filter IPs, since we're under the limits and happy but still have enough of those requests to be able to debug them further
19:21:10 then make decisions based on whatever that tells us
19:22:09 johnsom also mentioned we can limit by IP in haproxy
19:22:23 and yes, improved logging out of gitea would be lovely. out of haproxy too... if we knew the ephemeral port haproxy sourced each forwarded socket from, we could map those to log entries from gitea
19:22:25 so if none of the above works, doing that might be a slightly better alternative to iptables
19:22:58 ++ would be good to encode in haproxy config
19:23:08 currently haproxy doesn't tell us what the source port for its forwarded socket was, just the client's source port, so we've got a blind spot even with improved gitea logging
19:23:49 what is our robots.txt situation? i get a 404 for https://opendev.org/robots.txt
19:23:49 fungi: https://www.haproxy.com/blog/haproxy-log-customization/ we can do that too, looks like
19:24:33 ianw: I want to say it's part of our docker image?
19:25:27 ah, I think we can set it in our custom dir and it would serve it, but we must not be doing that
19:25:33 %bi provides "backend source IP (HAProxy connects with)" but maybe that includes the source port number
19:25:34 #link https://review.opendev.org/738684 Enable access log in gitea
19:25:43 fungi: %bp is the port
19:26:05 oh, duh, that was the next line below %bi and i totally missed it
19:26:09 thanks
19:27:01 corvus: note we may need log rotation for those files
19:27:17 corvus: looks like we could have it interleave with the regular log if we want (then journald/dockerd deal with rotation?)
19:27:29 yeah, so with the added logging in gitea and haproxy we'll be able to map any request back to an actual client ip address
19:27:39 that will be a huge help
19:27:41 ++
19:27:54 fungi: would you like to do the haproxy side or should we find another volunteer?
19:28:03 i'm already looking into it
19:28:07 awesome, thanks
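For the two logging changes discussed above, the gist is roughly as follows. Both snippets are illustrative sketches rather than the exact contents of https://review.opendev.org/738684 or the eventual haproxy change; the field layout and placement are assumptions.

    ; gitea app.ini: turn on the access logger, whose default template
    ; (per the linked gitea docs) already includes the request's User-Agent
    [log]
    ENABLE_ACCESS_LOG = true
    ; ACCESS_LOG_TEMPLATE can be customized if more fields are needed

    # haproxy: a tcp-mode log-format recording %bi/%bp, the source
    # address/port haproxy uses toward the backend, so each haproxy log
    # line can be matched against gitea's own access log entries
    defaults
        mode tcp
        log-format "%ci:%cp -> %bi:%bp -> %si:%sp [%t] %ft %b/%s"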
19:28:20 Anything else we want to bring up on the subject of gitea, haproxy, or opendev?
19:28:34 I think this gives us a number of good next steps, but I'm open to more ideas. Otherwise we can continue the meeting
19:29:09 i just want to make it clear that even though we blocked access from china unicom's address space, we don't have any reason to believe they're a responsible party in this situation
19:30:15 they're a popular isp who happens to have many customers in a place where pirated operating systems which can never receive security fixes are the norm, and so the majority of compromised hosts in large botnets tend to be on ip addresses of such isps
19:31:55 #topic Update Config Management
19:32:15 we've been iterating on having ze01 run off of the zuul-executor docker image
19:32:32 frickler turned it off again today for a reason I've yet to fully look into due to the gitea issues
19:32:49 i saw some mention of newly discovered problems, yeah, but got sideswiped by other bonfires
19:33:22 looks like it was some sort of iptables issue. We've actually seen that issue before on non-container executor jobs as well, I think
19:33:29 but in this case they were all on ze01, so it was thought we should turn it off
19:33:47 i had a quick look at that ... it was very weird, an ansible error that "stdout was not available in the dict instance"
19:33:50 we attempt to persist firewall rules on the remote host and do an iptables save for that
19:34:03 ianw: ya, we've had that error before, then it went away
19:34:04 there were "MODULE FAILURE" errors in the job logs
19:34:20 I'm guessing some sort of ansible/iptables bug, and maybe the container is able to reproduce it reliably
19:34:26 basically a variable made with register: on a command: somehow seemed to not have stdout
19:34:27 (due to a timing issue or set of tooling etc)
19:34:37 and then I found lots of "ModuleNotFoundError: No module named \'gear\'" on ze01
19:34:53 and assumed some relation, though I didn't dig further
19:35:07 got it. So possibly two separate or related issues that should be looked into
19:35:11 thanks for the notes
19:35:16 yeah, it was like it was somehow running in a different process or something
19:36:02 mordred isn't around today, otherwise he'd probably have ideas. Maybe we can sit on this for a bit until mordred can debug?
19:36:10 though if someone else would like to, feel free
19:36:28 the stdout error was just because it was trying to look at the output of the failed module
19:36:59 is there a pointer to the error somewhere?
19:37:13 see the logs in #openstack-infra
19:37:33 https://fa41114c73dc4ffe3f14-2bb0e09cfc1bf1e619272dff8ccf0e99.ssl.cf2.rackcdn.com/738557/2/check/tripleo-ci-centos-8-containers-multinode/7cdd1b2/job-output.txt was linked there
19:37:39 and shows the module failure for iptables saving
19:37:53 clarkb: thanks. i do a lot better with "here's a link to a problem we don't understand"
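A minimal sketch of the register-on-a-command pattern being debugged here, using a simplified stand-in for the persistent-firewall tasks; the task names, destination path, and save step are illustrative assumptions, not the role's actual implementation. The reported failure mode is the registered variable coming back from a MODULE FAILURE without a stdout key, which then breaks later templating with "stdout was not available in the dict instance".

    - name: List current ipv4 rules
      # the registered result normally carries .stdout; if the module
      # itself fails on the remote node, that key never appears
      command: iptables-save
      register: iptables_rules
      become: yes

    - name: Persist the rules on the remote host
      copy:
        content: "{{ iptables_rules.stdout }}\n"
        dest: /etc/iptables/rules.v4
      become: yes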
19:39:39 and the failure on ze01 appeared very often
19:39:42 frickler: oh interesting; "persistent-firewall: List current ipv4 rules" shows up as OK in the console log, but it seems like it was not OK
19:40:15 ianw: there were two nodes, one passed and the other failed
19:40:57 why do we think that's related to the executor?
19:41:17 to my eye, they both look OK in the output @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json
19:41:31 corvus: because of "ModuleNotFoundError: No module named \'gear\'" in the executor log
19:41:46 corvus: that may be a different thing, but it looked similar
19:41:47 https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#34529 in particular
19:42:08 ya, I think we may have two separate issues. The gear thing is probably related to the container image, but the iptables thing I'm not sure about
19:42:13 corvus: together with this seeming to be a new issue and ze01 being changed yesterday, that was enough hints for me
19:42:35 the gear issue didn't cause that job to fail though, right?
19:42:51 corvus: unless that causes post_failure? I'm not sure if the role is set up to fail on that or not
19:43:17 clarkb: that was a retry_limit: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6
19:43:20 centos8
19:43:51 ianw: the failure is later: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#76130
19:44:24 it sounds like there's perhaps a non-critical error on the executor with a missing gear package, but i don't think that should cause jobs to fail
19:44:51 separately, there are lots of jobs retrying because of the centos8-tripleo issues
19:45:40 frickler: yeah @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#37597
19:45:58 but all the expected output is there
19:46:07 anyway, we can probably debug outside the meeting
19:46:23 ++ let's continue afterwards
19:46:39 #topic General Topics
19:46:47 #topic DNS Cleanup
19:46:51 https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary
19:46:58 that's the task that caused the stdout error
19:47:07 before we move on
19:47:21 i'd like to understand what the blockers for the executor are
19:47:34 #undo
19:47:34 Removing item from minutes: #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary
19:47:49 wait, what? that's not what I expected to be undone
19:47:50 is it agreed that the only executor-related error is the (suspected non-fatal) missing gear package?
19:47:55 #undo
19:47:56 Removing item from minutes: #topic DNS Cleanup
19:47:58 #undo
19:47:59 Removing item from minutes: #topic General Topics
19:48:08 #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary caused iptables failure
19:48:09 or am i missing something?
19:48:22 corvus: that is my understanding
19:48:30 gear is what needs addressing, then we can turn ze01 back on?
19:48:42 i suspect that would just cause us not to submit logstash jobs
19:48:57 that is my understanding as well
19:49:06 cool, i'll work on adding that
19:49:22 #topic General Topics
19:49:26 #topic DNS Cleanup
19:49:43 I kept this on the agenda as a reminder that I meant to do a second pass of record removals and have not done that yet; things have been busy with fires
19:49:47 nothing else to add on this though
19:50:12 #topic Time to retire openstack-infra mailing list?
19:50:21 fungi: this was your topic, want to quickly go over it?
19:50:47 The last email to that list was on June 2
19:50:59 sure, just noting that the infra team has been supplanted by the tact sig, which claims (currently) to use the openstack-discuss ml like other sigs
19:51:04 and was from zbr, who we can probably convince to email service-discuss or openstack-discuss depending on the context
19:51:14 and as you've observed, communication levels on it are already low
19:51:51 we've likely still got the address embedded in various places, like pypi package metadata in older releases at the very least, so if we do decide it's time to close it down i would forward that address to the openstack-discuss ml
19:52:26 I'm good with shutting it down and setting up the forward
19:52:38 it was never a very busy list anyway, so it's unlikely to cause problems with the forward
19:52:40 this was mainly an informal addition to the meeting topic just to get a feel for whether there are strong objections, it's not time yet, whatever
19:53:12 next step would be for me to post to that ml with a proposed end date (maybe august 1?) and make sure there are no objections from subscribers
19:53:20 fungi: maybe send an email to that list with a proposed date a week or two in the future, then just do it?
19:53:28 that way anyone still subscribed will get a notification first
19:53:34 seems fine to me
19:53:45 ahh, okay, sure, i could maybe say july 15
19:54:13 if folks don't think that's too quick
19:54:17 works for me
19:54:36 anyway, not hearing objections, i'll go forth with the (hopefully final) ml thread
19:54:43 thanks!
19:55:01 #topic Grafana deployments from containers
19:55:02 thanks!
19:55:05 #link https://review.opendev.org/#/q/status:open+topic:grafana-container
19:55:24 ianw: want to quickly update us on this subject? I know you need reviews (sorry, too many fires)
19:55:54 yes, i stuck it on the top of my review stack when i went to bed last night, and it only got buried as soon as i woke up :/
19:56:18 sorry, yeah, basically grafana and graphite containers
19:56:31 if people want to review them, then i can try deploying them
19:56:48 grafana should be fine; for graphite i'll have to think about data migration
19:56:53 cool, thanks for working on that. It's on my todo list for when I get out from under my fires backlog
19:57:19 (but it can sit as graphite.opendev.org for testing while we do that, and then just switch dns at an appropriate time)
19:58:02 and that's basically all we had time for.
19:58:16 We didn't manage to get to every item on the agenda, but the gitea brainstorm was really useful
19:58:21 Thanks everyone
19:58:28 feel free to bring up anything we missed in #opendev
19:58:31 #endmeeting