19:01:15 #startmeeting infra
19:01:16 Meeting started Tue Nov 24 19:01:15 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:19 The meeting name has been set to 'infra'
19:01:40 I didn't send out an email agenda because I figured we'd use this time to write down some Gerrit situation updates
19:01:48 that way people don't have to dig through as much scrollback
19:01:56 o/
19:02:18 situation normal: all functioning usually?
19:02:22 #link https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes continues to capture a good chunk of info, particularly from users
19:02:40 if you need to catch up on anything, starting there seems like a good place
19:03:13 i've been doing my best to direct users there or to the maintenance completion announcement which links to it
19:03:43 we should definitely make sure to add links to fixes or upstream bug reports, and cross out stuff which gets solved
19:03:46 There were 6 major items I managed to scrape out of recent events though: the Gerritbot situation, high system load this morning, the /x/* bug, the account index lock bug, the project watches bug, and the openstack releases situation
19:04:08 Why don't we start with ^ that list then continue with anything else?
19:04:20 #topic Gerritbot connection issues to gerrit 3.2
19:04:33 ianw: fungi: I've not managed to keep up on this topic, can you fill us in?
19:05:11 in short, i think it is not retrying correctly when the connection drops
19:05:42 and somehow the behavior there changed coincident with the upgrade (either because of the upgrade or for some other reason we haven't identified)
19:05:45 ... sorry, have changes but just have to re-log in
19:05:59 i approved the last of them i think
19:06:20 including the one to switch our gerritbot container to use the master branch of gerritlib instead of consuming releases
19:06:31 ok, https://review.opendev.org/c/opendev/gerritlib/+/763892 should fix the retry loop
19:06:47 a version of that is actually running in a screen on eavesdrop now, manually edited
19:07:10 oh, right, i had read through that one and forgot to leave a review. approved now
19:07:38 https://review.opendev.org/c/opendev/gerritbot/+/763927 as mentioned builds the gerritbot image using master of gerritlib, it looks like it failed
19:07:42 it was partially duplicative of the change i had pushed up before, which is since abandoned
19:07:48 gerritbot-upload-opendev-image https://zuul.opendev.org/t/openstack/build/28ddd61a8f024791880517f4b2be97de : POST_FAILURE in 5m 34s
19:07:50 will debug
19:08:22 i can babysit that and keep debugging if there are more issues
19:08:29 cool, so we just need some changes to land? anything else on this topic?
19:08:48 the other thing i thought we should do is use the python3.8 base container too, just to keep up to date
19:09:04 if it works that seems like a reasonable change
19:09:09 sounds good to me
19:09:12 but yeah, i have that somewhat under control and will keep on it
19:09:19 thanks!
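For context, this is a minimal sketch of the kind of reconnect-with-backoff loop the gerritlib fix linked above (763892) is aimed at; it is illustrative only, not the actual patch, and connect_and_stream() is a hypothetical stand-in for opening the Gerrit SSH connection and consuming stream-events:

```python
import logging
import time

def run_event_loop(connect_and_stream, max_delay=300):
    """Keep the Gerrit event stream alive across dropped connections."""
    delay = 1
    while True:
        try:
            connect_and_stream()  # blocks until the SSH connection drops
            delay = 1             # healthy run, reset the backoff
        except Exception:
            logging.exception("gerrit event stream connection lost")
        # Sleep and retry instead of letting the consumer die silently,
        # which is roughly the failure mode described above.
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
```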
19:09:42 #topic The gerrit /x/* namespace conflict
19:10:00 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13721
19:10:21 The good news is upstream has responded to the bug and they think there are only about three plugins that actually conflict; only two are open source and I don't think we run either
19:10:25 so our fix should be really safe for now
19:10:47 The less good news is they suggested changing the path and updating those plugins instead of fixing the bigger issue, which means we have to check for conflicts when adding new namespaces or upgrading
19:11:01 fungi: ^ you responded to them asking about the bigger issue right?
19:11:09 yep
19:11:22 it's down there at the bottom
19:11:31 ok, I think we can likely sit on this one while we sort it out with upstream, particularly now that we have more confirmation that the conflicts are minimal
19:12:54 sounds like that may be it on this one
19:13:03 #topic Excessive change emails for some users
19:13:12 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13733
19:13:33 We tracked this down to overly greedy/buggy project watch rules. The bug has details on how to work around it (the user can update their settings)
19:14:23 I think it is a bug because the rules really mean "send me change notifications for things I own or have reviewed" but you end up getting all the changes
19:14:36 I was able to reproduce with my own user and then confirm the fix worked for me too
19:14:45 Just be aware of that if people complain about spam
19:14:59 #topic Loss of account index filesystem lock
19:15:04 easy workaround is to remove your watch on all-projects
19:15:09 yup
19:15:13 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=13726
19:15:52 Yesterday a user mentioned they got a 500 error when trying to reset their http password
19:16:15 examining the gerrit error_log we found that the tracebacks related to that showed the gerrit server had lost its linux fs lock on the accounts index lock file
19:16:25 sudo lslocks on review confirmed it had no lock for the file in question
19:16:51 After a bit of debugging we decided the best thing to do was to restart gerrit, which allowed it to reclaim the lock, and things have appeared happy since
19:16:58 which leads us to the next topic
19:17:11 #topic High Gerrit server load with low cpu utilization and no iowait
19:17:39 Today users were complaining about slowness in gerrit. cacti and melody confirmed it was a busy server based on load, but other resources were fine (memory, cpu, io, etc)
19:18:02 digging into the melody thread listing we noticed two things: first, we only had one email send thread and had started to back up our queues for email sending
19:18:22 second, many zuul ssh queries (typical zuul things to get info about changes) were taking significant time
19:18:41 We updated gerrit to use 4 threads to send email instead of 1 in case this was the issue. After restarting gerrit the problem came back.
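For the record, the email sender thread count is a gerrit.config knob; a sketch of the change, assuming sendemail.threadPoolSize (default 1) is the option behind "4 threads to send email":

```
# gerrit.config sketch -- assuming sendemail.threadPoolSize (default 1)
# is the setting meant above; verify against the config-gerrit docs
# before applying, then restart or reload gerrit
[sendemail]
    threadPoolSize = 4
```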
19:19:18 Looking closer at the other issue, we identified that the stack traces from melody showed many of the ssh queries by zuul were looking up account details via jgit, bypassing both the lucene index and the cache
19:19:49 From this we theorized that perhaps the account index lock failure meant our index was incomplete, and that was forcing gerrit to go straight to the source, which is slow
19:20:25 in particular it almost looked like the slowness had to do with locking, like each ssh query was waiting for the jgit backend lock so they wouldn't be reading out of sync (but I haven't confirmed this, it is a hunch based on low cpu, low io, but high load)
19:20:47 fungi triggered an online reindex of accounts with --force and since that completed things have been happier
19:21:11 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=26&rra_id=all shows the fall off in load
19:21:25 but it's hard to know for sure if that "fixed" it or was merely coincident timing
19:21:35 yup, though the correlation definitely seems strong
19:21:35 so we need to keep a close eye on this
19:22:01 if we have to restart again due to index lock failures we should probably consider reindexing as well
19:22:21 anything else to add on this one?
19:23:10 i don't think so
19:23:19 oh wait, I did have one other thing
19:23:22 ... sounds good
19:23:39 Our zuul didn't seem to have these issues. It uses the http api for querying change info
19:23:58 to get zuul to do that you have to set the gerrit http password setting
19:24:34 It is possible that the http side of things is going to perform better for that sort of stuff (maybe it doesn't fall back to git as aggressively) and we should consider encouraging ci operators to switch over
19:24:46 sean-k-mooney did it today at our request to test things out and it seems to have gone well
19:25:21 #topic OpenStack Release tooling changes to accommodate new Gerrit
19:25:32 fungi: I have also not kept up to date on this one
19:25:34 if they're running relatively recent zuul v3 releases (like from ~this year) then they should be able to just add an http password and make sure they're set for basic auth
19:25:39 is there any way we can ... strongly suggest that via disabling something?
19:25:46 #undo
19:25:47 Removing item from minutes: #topic OpenStack Release tooling changes to accommodate new Gerrit
19:26:00 ianw: yes we could disable their accounts and force them to talk to us
19:26:02 ianw: we can individually disable their accounts
19:26:06 maybe do that as a last resort after emailing them first
19:26:32 thing is, zuul still needs ssh access for the event stream (and i think it uses that for git fetches as well)
19:26:43 ok, yeah. if over the next few days things go crazy and everyone is off doing other things, it might be an option
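As a rough illustration of what "add an http password and make sure they're set for basic auth" looks like for a third-party CI, a zuul.conf sketch with placeholder values, assuming a recent Zuul v3 gerrit driver (ssh stays configured because the event stream still needs it):

```
# zuul.conf sketch -- values are placeholders, not a real account
[connection gerrit]
driver=gerrit
server=review.opendev.org
user=example-third-party-ci
# ssh is still required for stream-events
sshkey=/var/lib/zuul/ssh/id_rsa
# setting an HTTP password moves change queries onto the REST API
password=http-password-generated-in-gerrit-settings
auth_type=basic
```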
19:26:48 fungi: yes, but neither of those seems to be a load burden based on show-queue info
19:27:09 yep, but it does mean that we can't just turn off ssh and leave rest api access for them
19:27:10 I think it's specifically the change info queries, because those pull comments and votes which need account info
19:27:15 fungi: ah yup
19:27:38 fwiw load is spiking right now and it's ci ssh queries; if you do a show-queue you'll see it
19:27:53 (so maybe we haven't completely addressed this)
19:28:46 some 20+ ci accounts have ssh tasks in the queue but it's like one per account
19:28:52 https://review.opendev.org/c/openstack/magnum/+/763997/ is what they are fetching (for ps2 I think, then they will do 3 and 4)
19:29:07 looks like mostly cinder third-party ci
19:29:24 maybe this happens whenever someone pushes a stack of cinder changes
19:29:43 but if you look at our zuul it already has 763997 in the dashboard and is running jobs
19:29:51 this is why I'm fairly confident the http lookups are better
19:30:21 maybe when corvus gets back from vacation he can look at this from the zuul perspective and see if we need to file bugs upstream or if we can make the ssh stuff better or something
19:30:42 ok, let's talk release stuff now
19:30:48 #topic OpenStack Release tooling changes to accommodate new Gerrit
19:31:30 this is the part where i say there were problems, all available fixes have been merged, and as of a moment ago we're testing it again to see what else breaks
19:31:45 cool, so nothing currently outstanding on this subject
19:32:01 this is more generally a problem for jobs using existing methods to push tags or propose new changes in gerrit
19:32:26 one factor in this is that our default nodeset is still ubuntu-bionic which carries a too-old git-review version
19:32:51 specifying a newer ubuntu-focal nodeset gets us a new enough git-review package to interact with our gerrit
19:32:52 an update for git-review is in progress there
19:33:11 https://bugs.launchpad.net/ubuntu/+source/git-review/+bug/1905282
19:33:14 Launchpad bug 1905282 in git-review (Ubuntu Bionic) "[SRU] git-review>=1.27 for updated opendev gerrit" [High,Triaged]
19:33:45 another item is that new gerrit added and reordered its ssh host keys, so the ssh-rsa key we were pre-adding into known_hosts was not the first key ssh saw
19:34:14 this got addressed by adding all of gerrit's current host keys into known_hosts
19:34:49 fungi: was it causing failures or just excessive logging?
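A hedged sketch of what pre-seeding all of gerrit's host keys amounts to (the actual zuul-jobs change may do this differently), using ssh-keyscan against the review server's ssh port:

```
# Sketch only; the real job change may differ. Collect every host key type
# the Gerrit sshd offers (not just ssh-rsa) and pre-seed known_hosts with them.
ssh-keyscan -p 29418 review.opendev.org >> ~/.ssh/known_hosts
```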
19:34:56 another problem is that the launchpad creds role in zuul-jobs relied on the python 2.7 python-launchpadlib package, which was dropped in focal
19:35:06 clarkb: the host key problem was causing failures
19:35:35 new ssh is compatible with the first host key gerrit serves, but it's not in known_hosts, so ssh just errors out with an unrecognized host key
19:35:59 huh, I thought it did more negotiating than that
19:36:26 basically whatever the earliest host key the sshd presents that your openssh client supports needs to be recognized
19:36:33 got it
19:36:49 so there is negotiation between what key types are present and what ones the client supports
19:37:07 but if it reaches one which is there and supported by the client, that's the one it insists on using
19:37:36 so the fact that gerrit 3.2 puts a new key type sooner than the rsa host key means the new type has to be accepted by the client if it's supported by it
19:38:21 and yeah, the other thing was updating zuul-jobs to install python3-launchpadlib, which we're testing now to see if that worked or was hiding yet more problems
19:39:22 also zuul-jobs had some transitive test requirements which dropped python2 support recently and needed to be addressed before we could merge the launchpadlib change
19:39:38 so it's been sort of involved
19:40:12 and i think we can close this topic out because the latest tag-releases i re-enqueued has succeeded
19:40:50 that is great news
19:40:54 #topic Open Discussion
19:41:12 That concluded the items I had identified. Are there others to bring up?
19:41:55 oh, the java 11 upgrade you posted
19:42:04 do you think that's worth pushing on?
19:42:16 ianw: ish? the main reason gerrit recommends java 11 is better GC performance
19:42:29 we don't seem to be having GC issues currently so I don't think it is urgent, but it would be good to do at some point
19:42:45 yeah, but we certainly have hit issues with that previously
19:42:52 though also they're getting ready to drop support for <11
19:43:12 so we'd need to do it anyway to keep upgrading gerrit past some point in the near future
19:43:15 i noticed they had a java 15 issue with jgit, and then identified they didn't have 15 CI
19:43:30 so, going that far seems like a bad idea
19:43:42 ya, we need to drop java 8 before we upgrade to 3.4
19:43:49 I think doing it earlier is fine, just calling it out as not strictly urgent
19:44:02 ianw: yes, Gerrit publishes which javas they support
19:44:16 for 3.2 it is java 8 and 11. 3.3 is 8 and 11, and 3.4 will be just 11 I think
19:44:27 might be nice to upgrade to 11 while not doing it at the same time as a gerrit upgrade
19:44:31 fungi: ++
19:44:44 just so we can rule out problems as being one or the other
19:45:01 there are failing dashboards, not sure how urgent we want to fix those
19:45:22 frickler: I noted on the etherpad that I don't see any method for configuring the query terms limit via configuration
19:45:23 also dashboards not working when not logged in
19:45:30 frickler: I also suggested bugs be filed for those items
19:45:41 (I was hoping that users hitting the problems would file the bugs as I've been so swamped with other things)
19:45:41 yeah, those both seem like good candidates for upstream bugs
19:46:04 oh, and also tristanC's plugin
19:46:08 I tried to update the etherpad where I thought filing bugs upstream was appropriate and asked reporters to do that, though I suspect that etherpad is largely write-only
19:46:08 we've been filing bugs for the problems we're working on, but not generally acting as a bug forwarder for users
19:46:35 re performance things, I'm noticing there are a number of tunables at https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#core
19:46:57 it might be worth a thread to the gerrit mailing list or to luca to ask about how we might modify our tunables now that we have real-world data
19:55:58 As a heads up, I'm going to try and start winding down my week. I have a few more things I want to get done, but I'm finding I really need a break and will endeavor to take one
19:56:14 and it sounds like everyone else may be done based on the lack of new conversation here
19:56:17 thanks everyone!
19:56:21 #endmeeting