Tuesday, 2020-11-24

*** sboyron__ has joined #opendev-meeting07:58
*** sboyron__ is now known as sboyron08:12
*** hashar has joined #opendev-meeting12:00
*** gouthamr_ has quit IRC14:36
*** hashar is now known as hasharAway16:11
-openstackstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly to troubleshoot an SMTP queuing backlog, downtime should be less than 5 minutes16:41
*** hasharAway is now known as hashar16:46
*** timburke has quit IRC17:00
clarkbAnyone else here for the infra meeting? we'll get started in a couple of minutes18:59
fungiohai19:00
clarkb#startmeeting infra19:01
openstackMeeting started Tue Nov 24 19:01:15 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
*** openstack changes topic to " (Meeting topic: infra)"19:01
openstackThe meeting name has been set to 'infra'19:01
clarkbI didn't send out an email agenda because I figured we'd use this time to write down some Gerrit situation updates19:01
clarkbthat way people don't have to dig through as much scrollback19:01
ianwo/19:01
fungisituation normal: all functioning usually?19:02
clarkb#link https://etherpad.opendev.org/p/gerrit-3.2-post-upgrade-notes continues to capture a good chunk of info particularly from users19:02
clarkbif you're needing to catch up on anything starting there seems like a good place19:02
fungii've been doing my best to direct users there or to the maintenance completion announcement which links to it19:03
fungiwe should definitely make sure to add links to fixes or upstream bug reports, and cross out stuff which gets solved19:03
clarkbThere were 6 major items I managed to scrape out of recent events though: Gerritbot situation, High system load this morning, the /x/* bug, the account index log bug, the project watches bug, and openstack releases situation19:03
clarkbWhy don't we start with ^ that list then continue with anything else?19:04
clarkb#topic Gerritbot connection issues to gerrit 3.219:04
*** openstack changes topic to "Gerritbot connection issues to gerrit 3.2 (Meeting topic: infra)"19:04
clarkbianw: fungi: I've not managed to keep up on this topic, can you fill us in?19:04
ianwin short, i think it is not retrying correctly when the connection drops19:05
fungiand somehow the behavior there changed coincident with the upgrade (either because of the upgrade or for some other reason we haven't identified)19:05
ianw... sorry, have changes but just have to re-log in19:05
fungii approved the last of them i think19:05
fungiincluding the one to switch our gerritbot container to use master branch of gerritlib instead of consuming releases19:06
ianwok, https://review.opendev.org/c/opendev/gerritlib/+/763892 should fix the retry loop19:06
ianwa version of that is actually running in a screen on eavesdrop now, manually edited19:06
fungioh, right, i had read through that one and forgot to leave a review. approved now19:07
ianwhttps://review.opendev.org/c/opendev/gerritbot/+/763927 as mentioned builds the gerritbot image using master of gerritlib, it looks like it failed19:07
fungiit was partially duplicative of the change i had pushed up before, which is since abandoned19:07
ianwgerritbot-upload-opendev-image https://zuul.opendev.org/t/openstack/build/28ddd61a8f024791880517f4b2be97de : POST_FAILURE in 5m 34s19:07
ianwwill debug19:07
ianwi can babysit that and keep debugging if there's more issues19:08
clarkbcool so just need some changes to land? anything else on this topic?19:08
ianwthe other thing i thought we should do is use the python3.8 base container too, just to keep up to date19:08
clarkbif it works that seems like a reasonable change19:09
fungisounds good to me19:09
ianwbut yeah, i have that somewhat under control and will keep on it19:09
clarkbthanks!19:09
clarkb#topic The gerrit /x/* namespace conflict19:09
*** openstack changes topic to "The gerrit /x/* namespace conflict (Meeting topic: infra)"19:09
clarkb#link https://bugs.chromium.org/p/gerrit/issues/detail?id=1372119:10
clarkbThe good news is upstream has responded to the bug and they think there are only like three plugins that actually conflict and only two are open source and I don't think we run either19:10
clarkbso our fix should be really safe for now19:10
clarkbThe less good news is they suggested changing the path and updating those plugins instead of fixing the bigger issue, which means we have to check for conflicts when adding new namespaces or upgrading19:10
clarkbfungi: ^ you responded to them asking about the bigger issue right?19:11
fungiyep19:11
fungiit's down there at the bottom19:11
clarkbok, I think we can likely sit on this one while we sort it out with upstream, particularly now that we have more confirmation that the conflicts are minimal19:11
clarkbsounds like that may be it on this one19:12
clarkb#topic Excessive change emails for some users19:13
*** openstack changes topic to "Excessive change emails for some users (Meeting topic: infra)"19:13
clarkb#link https://bugs.chromium.org/p/gerrit/issues/detail?id=1373319:13
clarkbWe tracked this down to overly greedy/buggy project watch rules. The bug has details on how to work around it (the user can update their settings)19:13
clarkbI think it is a bug because the rules really mean "send me change notifications for things I own or have reviewed" but you end up getting all the changes19:14
clarkbI was able to reproduce with my own user and then confirm the fix worked for me too19:14
clarkbJust be aware of that if people complain about spam19:14
clarkb#topic Loss of account index filesystem lock19:14
*** openstack changes topic to "Loss of account index filesystem lock (Meeting topic: infra)"19:15
fungieasy workaround is to remove your watch on all-projects19:15
clarkbyup19:15
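For anyone scripting fungi's workaround rather than clicking through the settings UI, Gerrit's REST API can delete a watch entry. The sketch below is illustrative only and assumes the user has generated an HTTP password; the same thing can be done under Settings -> Notifications in the web UI.

    # Sketch: drop the All-Projects watch for your own account via the REST API
    curl -X POST --user "$USER:$HTTP_PASSWORD" \
      -H 'Content-Type: application/json' \
      -d '[{"project": "All-Projects"}]' \
      https://review.opendev.org/a/accounts/self/watched.projects:delete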
clarkb#link https://bugs.chromium.org/p/gerrit/issues/detail?id=1372619:15
clarkbYesterday a user mentioned they got a 500 error when trying to reset their http password19:15
clarkbexamining the gerrit error_log we found that the tracebacks related to that showed the gerrit server had lost its linux fs lock on the accounts index lock file19:16
clarkbsudo lslocks on review confirmed it had no lock for the file in question19:16
clarkbAfter a bit of debugging we decided the best thing to do was to restart gerrit which allowed it to reclaim the lock and things appeared happy since19:16
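As an aside, the lock check referred to above is a one-liner; roughly something like the following, where the grep pattern is illustrative and the exact index lock file path depends on the Gerrit site directory.

    # Sketch: list POSIX file locks and look for the accounts index lock file
    sudo lslocks | grep -i accounts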
clarkbwhich leads us to the next topic19:16
clarkb#topic High Gerrit server load with low cpu utilization and no iowait19:17
*** openstack changes topic to "High Gerrit server load with low cpu utilization and no iowait (Meeting topic: infra)"19:17
clarkbToday users were complaining about slowness in gerrit. cacti and melody confirmed it was a busy server based on load but other resources were fine (memory, cpu, io, etc)19:17
clarkbdigging into the melody thread listing we noticed two things: first we only had one email send thread and had started to back up our queues for email sending19:18
clarkbsecond many zuul ssh queries (typical zuul things to get info about changes) were taking significant time19:18
clarkbWe updated gerrit to use 4 threads to send email instead of 1 in case this was the issue. After restarting gerrit the problem came back19:18
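The thread count change described here is presumably the sendemail thread pool setting in gerrit.config; a minimal sketch of that change, with the rest of the section omitted:

    # gerrit.config (sketch): raise the email sender thread pool from 1 to 4
    [sendemail]
      threadPoolSize = 4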
clarkbLooking closer at the other issue we identified that the stacktraces from melody showed many of the ssh queries by zuul were looking up account details via jgit, bypassing both the lucene index and the cache19:19
clarkbFrom this we theorized that perhaps the account index lock failure meant our index was incomplete and that was forcing gerrit to go straight to the source which is slow19:19
clarkbin particular it almost looked like the slowness had to do with locking, like each ssh query was waiting for the jgit backend lock so they wouldn't be reading out of sync (but I haven't confirmed this, it is a hunch based on low cpu, low io, but high load)19:20
clarkbfungi triggered an online reindex of accounts with --force and since that completed things have been happier19:20
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=26&rra_id=all shows the fall off in load19:21
fungibut it's hard to know for sure if that "fixed" it or was merely coincident timing19:21
clarkbyup, though definitely correlation seems strong19:21
fungiso we need to keep a close eye on this19:21
clarkbif we have to restart again due to index lock failures we should probably consider reindexing as well19:22
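For reference, an online reindex of the accounts index can be triggered over Gerrit's ssh admin interface with something like the sketch below; the admin user name is a placeholder.

    # Sketch: force an online rebuild of the accounts index
    ssh -p 29418 admin@review.opendev.org gerrit index start accounts --force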
clarkbanything else to add on this one?19:22
fungii don't think so19:23
clarkboh wait, I did have one other thing19:23
ianw... sounds good19:23
clarkbOur zuul didn't seem to have these issues. It uses the http api for querying change info19:23
clarkbto get zuul to do that you have to set the gerrit http password setting19:23
clarkbIt is possible that the http side of things is going to be better performing for that sort of stuff (maybe it doesn't fall back to git as aggressively) and we should consider encouraging ci operators to switch over19:24
clarkbsean-k-mooney did it today at our request to test things out and it seems to have gone well19:24
clarkb#topic OpenStack Release tooling changes to accommodate new Gerrit19:25
*** openstack changes topic to "OpenStack Release tooling changes to accommodate new Gerrit (Meeting topic: infra)"19:25
clarkbfungi: I have also not kept up to date on this one19:25
fungiif they're running relatively recent zuul v3 releases (like from ~this year) then they should be able to just add an http password and make sure they're set for basic auth19:25
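For CI operators following along, the switch described here is a connection-level setting in zuul.conf; the sketch below uses placeholder values and assumes a reasonably recent Zuul v3.

    # zuul.conf (sketch): let Zuul query Gerrit change info over HTTP
    [connection gerrit]
    driver=gerrit
    server=review.opendev.org
    user=my-ci-account
    sshkey=/var/lib/zuul/ssh/id_rsa
    password=HTTP_PASSWORD_FROM_GERRIT_SETTINGS
    auth_type=basic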
ianwis there any way we can ... strongly suggest that via disabling something?19:25
clarkb#undo19:25
openstackRemoving item from minutes: #topic OpenStack Release tooling changes to accommodate new Gerrit19:25
clarkbianw: yes we could disable their accounts and force them to talk to us19:26
fungiianw: we can individually disable their accounts19:26
clarkbmaybe do that as a last resort after emailing them first19:26
fungithing is, zuul still needs ssh access for the event stream (and i think it uses that for git fetches as well)19:26
ianwok, yeah.  if over the next few days things go crazy and everyone is off doing other things, it might be an option19:26
clarkbfungi: yes, but neither of those seem to be a load burden based on show-queue info19:26
fungiyep, but it does mean that we can't just turn off ssh and leave rest api access for them19:27
clarkbI think its specifically the change info queries because that pulls comments and votes which needs account info19:27
clarkbfungi: ah yup19:27
clarkbfwiw load is spiking right now and it's ci ssh queries; if you do a show queue you'll see it19:27
clarkb(so maybe we haven't completely addressed this)19:27
fungisome 20+ ci accounts have ssh tasks in the queue but it's like one per account19:28
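For anyone reproducing this, the queue listing referenced here is Gerrit's show-queue ssh command; roughly something like the sketch below, run with an account that is allowed to view the queue.

    # Sketch: show Gerrit's task queue, wide output grouped by queue
    ssh -p 29418 admin@review.opendev.org gerrit show-queue -w -q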
clarkbhttps://review.opendev.org/c/openstack/magnum/+/763997/ is what they are fetching (for ps2 I think then they will do 3 and 4)19:28
fungilooks like mostly cinder third-party ci19:29
fungimaybe this happens whenever someone pushes a stack of cinder changes19:29
clarkbbut if you look at our zuul it already has 763997 in the dashboard and is running jobs19:29
clarkbthis is why I'm fairly confident the http lookups are better19:29
clarkbmaybe when corvus gets back from vacation he can look at this from the zuul perspective and see if we need to file bugs upstream or if we can make the ssh stuff better or something19:30
clarkbok lets talk release stuff now19:30
clarkb#topic OpenStack Release tooling changes to accommodate new Gerrit19:30
*** openstack changes topic to "OpenStack Release tooling changes to accommodate new Gerrit (Meeting topic: infra)"19:30
fungithis is the part where i say there were problems, all available fixes have been merged and as of a moment ago we're testing it again to see what else breaks19:31
clarkbcool so nothing currently outstanding on this subject19:31
fungithis is more generally a problem for jobs using existing methods to push tags or propose new changes in gerrit19:32
fungione factor in this is that our default nodeset is still ubuntu-bionic which carries a too-old git-review version19:32
fungispecifying a newer ubuntu-focal nodeset gets us a new enough git-review package to interact with our gerrit19:32
frickleran update for git-review is in progress there19:32
fricklerhttps://bugs.launchpad.net/ubuntu/+source/git-review/+bug/190528219:33
openstackLaunchpad bug 1905282 in git-review (Ubuntu Bionic) "[SRU] git-review>=1.27 for updated opendev gerrit" [High,Triaged]19:33
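Until that SRU lands, the focal workaround fungi mentions above is just a nodeset override on the affected jobs; the job name below is illustrative.

    # Sketch: pin a release job to focal so it picks up a newer git-review
    - job:
        name: example-release-job
        nodeset: ubuntu-focal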
fungianother item is that new gerrit added and reordered its ssh host keys, so the ssh-rsa key we were pre-adding into known_hosts was not the first key ssh saw19:33
fungithis got addressed by adding all gerrit's current host keys into known_hosts19:34
clarkbfungi: was it causing failures or just excessive logging?19:34
fungianother problem is that the launchpad creds role in zuul-jobs relied on the python 2.7 python-launchpadlib package which was dropped in focal19:34
fungiclarkb: the host key problem was causing failures19:35
funginew ssh is compatible with the first host key gerrit serves, but it's not in known_hosts, so ssh just errors out with an unrecognized host key19:35
clarkbhuh I thought it did more negotiating than that19:35
fungibasically whatever the earliest host key the sshd presents that your openssh client supports needs to be recognized19:36
clarkbgot it19:36
fungiso there is negotiation between what key types are present and what ones the client supports19:36
fungibut if it reaches one which is there and supported by the client that's the one it insists on using19:37
fungiso the fact that gerrit 3.2 puts a new key type sooner than the rsa host key means the new type has to be accepted by the client if it's supported by it19:37
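One way to pre-populate known_hosts with every host key Gerrit currently serves, rather than only the ssh-rsa key, is an ssh-keyscan along these lines; the destination file is illustrative.

    # Sketch: record all of review.opendev.org's current ssh host keys
    ssh-keyscan -p 29418 review.opendev.org >> ~/.ssh/known_hosts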
fungiand yeah, the other thing was updating zuul-jobs to install python3-launchpadlib which we're testing now to see if that worked or was hiding yet more problems19:38
fungialso zuul-jobs had some transitive test requirements which dropped python2 support recently and needed to be addressed before we could merge the launchpadlib change19:39
fungiso it's been sort of involved19:39
fungiand i think we can close this topic out because the latest tag-releases i reenqueued has succeeded19:40
clarkbthat is great news19:40
clarkb#topic Open Discussion19:40
*** openstack changes topic to "Open Discussion (Meeting topic: infra)"19:40
clarkbThat concluded the items I had identified. Are there others to bring up?19:41
ianwoh, the java 11 upgrade you posted19:41
ianwdo you think that's worth pushing on?19:42
clarkbianw: ish? the main reason gerrit recommends java 11 is better GC performance19:42
clarkbwe don't seem to be having GC issues currently so I don't think it is urgent, but it would be good to do at some point19:42
ianwyeah, but we certainly have hit issues with that previously19:42
fungithough also they're getting ready to drop support for <1119:42
fungiso we'd need to do it anyway to keep upgrading gerrit past some point in the near future19:43
ianwi noticed they had a java 15 issue with jgit, and then identified they didn't have 15 CI19:43
ianwso, going that far seems like a bad idea19:43
clarkbya we need to drop java 8 before we upgrade to 3.419:43
clarkbI think doing it earlier is fine, just calling it out as not strictly urgent19:43
clarkbianw: yes Gerrit publishes which javas they support19:44
clarkbfor 3.2 it is java 8 and 11. 3.3 is 8 and 11 and 3.4 will be just 11 I think19:44
fungimight be nice to upgrade to 11 while not doing it at the same time as a gerrit upgrade19:44
clarkbfungi: ++19:44
fungijust so we can rule out problems as being one or the other19:44
fricklerthere's failing dashboards, not sure how urgent we want to fix those19:45
clarkbfrickler: I noted on the etherpad that I don't see any method for configuring the query terms limit via configuration19:45
frickleralso dashboards not working when not logged in19:45
clarkbfrickler: I also suggested bugs be filed for those items19:45
clarkb(I was hoping that users hitting the problems would file the bugs as I've been so swamped with other things)19:45
fungiyeah, those both seem like good candidates for upstream bugs19:45
ianwoh, and also tristanC's plugin19:46
clarkbI tried to update the etherpad where I thought filing bugs upstream was appropriate and asked reporters to do that, though I suspect that etherpad is largely write only19:46
fungiwe've been filing bugs for the problems we're working on, but not generally acting as a bug forwarder for users19:46
clarkbre performance things I'm noticing there are a number of tunables at https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#core19:46
clarkbit might be worth a thread to the gerrit mailing list or to luca to ask about how we might modify our tunables now that we have real world data19:46
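As an example of what is on that page, the [core] section exposes jgit buffer and pack cache knobs like the ones below; the values are illustrative, not a recommendation for review.opendev.org.

    # gerrit.config (sketch): jgit-related tunables from the core section
    [core]
      packedGitLimit = 1g
      packedGitOpenFiles = 4096
      packedGitWindowSize = 16k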
clarkbAs a heads up I'm going to try and start winding down my week. I have a few more things I want to get done but I'm finding I really need a break and will endeavor to do so19:55
clarkband sounds like everyone else may be done based on lack of new conversation here19:56
clarkbthanks everyone!19:56
clarkb#endmeeting19:56
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"19:56
openstackMeeting ended Tue Nov 24 19:56:21 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:56
openstackMinutes:        http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.html19:56
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.txt19:56
openstackLog:            http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-11-24-19.01.log.html19:56
diablo_rojothanks clarkb!19:56
*** sboyron has quit IRC20:09
*** hamalq has joined #opendev-meeting20:16
*** sboyron has joined #opendev-meeting20:18
*** hamalq has quit IRC20:55
*** hashar has quit IRC20:59
*** hamalq has joined #opendev-meeting21:10
*** hamalq has quit IRC21:15
*** sboyron has quit IRC21:34
*** sboyron has joined #opendev-meeting21:34
*** sboyron has quit IRC21:58
*** sboyron has joined #opendev-meeting21:59
*** sboyron has quit IRC22:11
*** sboyron has joined #opendev-meeting22:11
*** sboyron has quit IRC22:12
*** sboyron has joined #opendev-meeting22:13
*** jentoio has quit IRC22:14
*** sboyron has quit IRC22:15
*** sboyron has joined #opendev-meeting22:16
*** sboyron has quit IRC22:20
*** sboyron has joined #opendev-meeting22:20
*** sboyron has quit IRC22:25
*** sboyron has joined #opendev-meeting22:26
*** jmorgan has joined #opendev-meeting22:29
*** sboyron has quit IRC22:33
*** sboyron has joined #opendev-meeting22:34
*** sboyron has quit IRC22:40
*** sboyron has joined #opendev-meeting22:41
*** sboyron has quit IRC22:48
*** sboyron has joined #opendev-meeting22:48
*** sboyron has quit IRC22:57
*** sboyron has joined #opendev-meeting23:06
*** hamalq has joined #opendev-meeting23:37
*** hamalq has quit IRC23:41
*** hamalq has joined #opendev-meeting23:52
*** hamalq has quit IRC23:57

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!