Tuesday, 2022-06-28

clarkbAlmost meeting time18:59
fungiahoy!19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Jun 28 19:01:10 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-June/000341.html Our Agenda19:01
clarkb#topic Announcements19:01
ianwo/19:01
clarkbNext week Monday is a big holiday for a few of us. I would expect it to be quiet-ish early next week.19:01
clarkbAdditionally I very likely won't be able to make the meeting two weeks from today19:02
clarkbMore than happy to skip that week or have someone else run the meeting (it's July 12, 2022)19:02
ianwi can do 12th july if there's interest19:03
clarkbfigured I'd let people know early then we can organize with plenty of time19:03
clarkbAny other announcements?19:03
clarkb#topic Topics19:05
clarkb#topic Improving CD throughput19:05
clarkbThere was a bug in the flock path for the zuul auto upgrade playbook which unfortunately caused last weekends upgrade and reboots to fail19:05
clarkbThat issue has since been fixed so the next pass should run19:05
clarkbThis is the downside to only trying to run it once a week.19:05
clarkbBut we can always manually run it if necessary at an earlier date. I'm also hoping that I'll be feeling much better next weekend and can pay attention to it as it runs. (I missed the last one because I wasn't feeling well)19:06
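
For context, a minimal sketch of how a flock-guarded weekly playbook run like this can be wired up with Ansible's cron module; the playbook path, schedule, lock file, and timeout below are illustrative assumptions rather than the actual production values.

    # Sketch: a weekly cron entry on the bastion that runs the Zuul
    # upgrade/reboot playbook under flock so overlapping runs are skipped.
    # All paths and values here are illustrative assumptions.
    - name: Install weekly zuul upgrade and reboot cron
      cron:
        name: zuul-upgrade-and-reboot
        user: root
        weekday: "6"
        hour: "2"
        minute: "0"
        job: >-
          flock -n /var/run/zuul_reboot.lock
          timeout 14400 ansible-playbook -f 20
          /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml
          >> /var/log/ansible/zuul_reboot.log 2>&1
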
clarkbSlow progress, but that still counts :)19:07
clarkbAnything else on this topic?19:08
clarkb#topic Gerrit 3.5 upgrade19:09
clarkb#link https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 WorkInProgress always treated as merge conflict19:09
clarkbI did some investigating of this problem that frickler called out.19:09
clarkbI thought I would dig into that more today and try to write a patch, but what I've realized since is that there isn't a great solution here since WIP changes are not mergeable. But Gerrit overloads mergeable to indicate there is a conflict (which isn't necessarily true in the WIP case)19:10
clarkbso now I'm thinking I'll wait a bit and see if any upstream devs have some hints for how we might address this. Maybe it is ok to drop merge conflict in the case of all wips. Or maybe we need a better distinction between the three states and use something other than a binary value19:10
clarkbIf the latter option then that may require them to write a change as I think it requires a new index version19:11
clarkbBut I do think I understand this enough to say it is largely a non-issue. It looks weird in the UI, but it doesn't indicate a bug in merge conflict checking or the index itself19:11
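
For reference, the field under discussion is the one Gerrit exposes through its mergeable REST endpoint; a quick way to inspect it for a given change might look like the following sketch (the change number variable is a placeholder).

    # Sketch: fetch the mergeable status Gerrit reports for a change.
    # Note Gerrit prefixes JSON responses with )]}' so the body is shown
    # raw here instead of being parsed.
    - name: Fetch mergeable status for a change
      uri:
        url: "https://review.opendev.org/changes/{{ change_number }}/revisions/current/mergeable"
        return_content: true
      register: mergeable_response

    - name: Show the raw mergeable info
      debug:
        msg: "{{ mergeable_response.content }}"
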
clarkbWhich means I think it is fairly low priority and we can focus effort elsewhere19:12
ianw(clarkb becoming dangerously close to a java developer again :)19:12
clarkbThe other item I wanted to bring up here is whether or not we think we are ready to drop 3.4 images and add 3.6 as well as testing19:12
clarkbhttps://review.opendev.org/q/topic:gerrit-3.4-cleanups19:12
clarkbIf so there are three changes ^ there that need review. The first one drops 3.4, second adds 3.6, and last one adds 3.5 -> 3.6 upgrade testing. That last one is a bit complicated as there are steps we have to take on 3.5 before upgrading to 3.6 and the old test system for that wasn't able to do that19:13
clarkbConsidering it has been a week and of the two discovered issues one has already been fixed and the other is basically just a UI thing I'm comfortable saying it is unlikely we'll revert at this point19:14
clarkbmemory usage has also looked good19:14
ianw++ I think so, I can't imagine we'd go back at this point, but we can always revert19:14
clarkbya our 3.4 images will stay on docker hub for a bit and we can revert without reinstating all the machinery to build new ones19:15
fungion the merge conflict front, maybe just changing the displayed label to be more accurate would suffice?19:15
clarkblooks like ianw has already reviewed those changes. Maybe fungi and/or frickler can take a second look. Particularly the changes that add 3.6 just to make sure we don't miss anything.19:15
clarkbfungi: that is possible but merge conflict relies on mergeable: false even though it can also mean wip. So it becomes tricky to not break the merge conflict reporting on non-wip changes19:16
clarkbBut ya maybe we just remove the merge conflict tag entirely on wip things in the UI19:16
clarkbthat is relatively straightforward19:16
clarkbmaybe upstream will have a good idea and we can fix it some way I haven't considered19:17
clarkbAnything else on this subject? I think we're just about at a place where we can drop it off the schedule (once 3.4 images are removed)19:17
clarkbs/schedule/agenda/19:18
fungiwell, s/merge conflict/unmergeable/ would be more accurate to display19:18
fungisince it's not always a git merge conflict causing it to be marked as such19:18
fricklerin particular the "needs rebase" msg is wrong19:18
clarkbfungi: but that is only true for wip changes aiui19:18
fungiright19:19
clarkbbut ya maybe clarifying that in the case of wip changes is a way to go: "unmergeable due to the wip state"19:19
fungiwell, also changes with outdated parents get marked as being in merge conflict even if they're technically not (though in those cases, rebases are warranted)19:19
clarkboh that is news to me but after reading the code is not unexpected. Making note of that on the bug I filed would be good19:20
fungialso possible i've imagined that, i'll have to double-check19:21
clarkbk19:21
clarkbWe have a few more topics to get through. Any other gerrit upgrade items before we move on?19:22
clarkb#topic Improving grafana management tooling19:23
clarkbThis topic was largely going to talk about the new grafyaml dashboard screenshotting jobs, but those have since merged.19:23
clarkbI guess maybe we should catch up on the current state of things and where we think we might be headed?19:24
clarkbpulling info from last meeting: what ianw has discovered is that grafyaml uses old APIs which can't properly express things like threshold levels for colors in graphs. This means success and failure graphs both show green in some cases19:24
ianwi'm still working on it all19:25
ianwbut in doing so i did find one issue19:25
ianw#link https://review.opendev.org/c/opendev/system-config/+/84787619:25
ianwthis is my fault, missing parts of the config when we converted to ansible19:26
ianwin short, we're not setting xFilesFactor to 0 for .wsp files created since the update.  this was something corvus fixed many years ago but the fix got reverted19:26
ianwas noted, i'll have to manually correct the files on-disk after we fix the configs19:27
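
As a rough sketch of the shape of that fix (the paths and catch-all pattern below are assumptions, not the exact production configuration): new whisper files pick up xFilesFactor from storage-aggregation.conf, while files that already exist have to be rewritten separately.

    # Sketch: default xFilesFactor to 0 for newly created metrics, then
    # rewrite the existing .wsp files created with the wrong value.
    # Paths and the whisper helper invocation are illustrative assumptions.
    - name: Default xFilesFactor to 0 for new whisper files
      copy:
        dest: /opt/graphite/conf/storage-aggregation.conf
        content: |
          [default]
          pattern = .*
          xFilesFactor = 0
          aggregationMethod = average

    - name: Rewrite existing whisper files with xFilesFactor 0
      shell: >-
        find /opt/graphite/storage/whisper -name '*.wsp'
        -exec whisper-set-xfilesfactor.py {} 0 \;
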
clarkbnoted. I've got that on my todo list for after the meeting and lunch19:27
clarkbreviewing the change I mean19:27
ianwi noticed this because i was not getting sensible results in the screenshots of the graphs we now create when we update graphs19:28
clarkbianw: jrosser_ also noted that the screenshots may be catching a spinning loading wheel in some cases. Is this related?19:29
clarkbthe info is in #opendev if you want to dig into that more19:29
ianwahh, ok, each screenshot waits 5 seconds, but that may not be long enough19:30
clarkbit may depend on the size of the dashboard. I think the OSA dashboards have a lot of content based on the change diff19:30
ianwit's quite difficult to tell if the page is actually loaded19:30
clarkbOnce we've got this fairly stable do we have an idea of what sorts of things we might be looking at to address the grafyaml deficiencies?19:31
clarkbor maybe too early to tell since bootstrapping testing has been the focus19:32
fungii wonder if there's a way to identify when the page has completed loading19:32
ianwmy proposal would be editing directly in grafana and committing the dashboards it exports, using the screenshots as a better way to review changes than trying to be human parsers19:32
ianwhowever, we are not quite at the point where I have a working CI example of that19:32
ianwso i'd like to get that POC 100%, and then we can move it to a concrete discussion19:33
clarkbgot it. Works for me19:33
corvuswill the screenshots show the actual metrics used?19:33
corvusby that, i mean the metrics names, formulas applied, etc?19:34
clarkbI think grafana can be convinced to show that info, but it may be equivalent to what is in the json (aka just the json backing)19:35
corvus(so that a reviewer can see that someone is adding a panel that, say, takes a certain metric and divides by 10 and not 100)19:35
corvusokay, so someone reviewing the change for accuracy would need to read the json?19:35
clarkbI'm looking at the prod dashboard and to see that info it currently does seem like you have to load the json version (it shows the actual data and stats separately but not how they were formulated)19:37
ianwyes, you would want to take a look at the json for say metric functions19:38
ianw"the json" looks something like https://review.opendev.org/c/openstack/project-config/+/833213/1/grafana/infra-prod-deployment.json19:38
corvusthe comment about reviewers not needing to be human parsers made me think that may no longer be the case, but i guess reviews still require reading the source (which will be json instead of yaml)19:39
corvusor maybe there's some other way to output that information19:40
clarkbone idea was to use a simpler translation tool between the json and yaml to help humans. But not try to encode logic as much as grafyaml does today as that seems to be part of what trips us up.19:41
clarkbBut I think we can continue to improve the testing. users have already said how helpful it is while using grafyaml so no harm in improving things this way19:42
clarkband we can further discuss the future of managing the dashboards as we've learned more about our options19:42
clarkbWe've got 18 minutes left in the meeting hour and a few more topics. Anything urgent on this subject before we continue on?19:42
ianwnope, thanks19:43
clarkb#topic URL Shortener Service19:43
clarkbfrickler: Any updates on this?19:43
fricklerstill no progress here, sorry19:43
clarkbno worries19:43
clarkb#topic Zuul job POST_FAILUREs19:43
clarkbStarting sometime last week openstack ansible and tripleo both noticed a higher rate of POST_FAILURE jobs19:44
clarkbfungi did a fair bit of digging last week and I've tried to help out more recently. It isn't easy to debug because these post failures appear related to log uploads which means we get no log URL and no log links19:44
clarkbWe strongly suspect that this is related to the executor -> swift upload process, with the playbook timing out during that step.19:45
clarkbWe suspect that it is also related to either the total number of log files, their size or some combo of the two since only OSA and tripleo seem to be affected and they log quite a bit compared to other users/jobs19:45
fungithe time required to upload to swift endpoints does seem to account for the majority of the playbook's time, and can vary significantly19:46
clarkbWe've helped them both identify places they might be able to trim their log content down. The categories largely boiled down to: no more ARA, reduce deep nesting of logs since nesting requires an index.html for each dir level, remove logs that are identical on every run (think /etc contents that are fixed and never change), and drop things like journald binary files.19:46
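
A minimal sketch of what that kind of pruning can look like in a job's post playbook; the subpaths under the staged log directory are assumptions about what a given job collects.

    # Sketch: prune known-noisy content from the staged log directory on the
    # test node before it is collected for upload. Paths are assumptions.
    - name: Remove binary journals and static /etc copies from staged logs
      file:
        path: "{{ ansible_user_dir }}/logs/{{ item }}"
        state: absent
      loop:
        - journal
        - etc
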
clarkbDoing this cleanup does appear to have helped but not completely removed the issue19:47
corvuswell, if hypothetically, under some circumstances it takes 4x time to upload, it may simply be that only those jobs are long enough that 4x time is noticeable?19:47
fungiyes19:47
corvus(but the real issue is surely that under some circumstances it takes 4x time, right?)19:47
clarkbcorvus: yup, I think that is what we are suspecting19:47
clarkbin the OSA case we've seen some jobs take ~2 minutes to upload logs, ~9 minutes, and ~22 minutes19:47
corvusso initial steps are good, and help reduce the pain, but underlying problem remains19:47
fungialso it's really only impacting tripleo and openstack-ansible changes, so seems to be something specific to their jobs (again, probably the volume of logs they collect)19:48
clarkbthe problem is we have very little insight into this due to how the issues occur. We lose a lot of info. Even on the executor log side the timeout happens and we don't get info about where we were uploading to19:48
fungiunfortunately a lot of the troubleshooting is hampered by blind spots due to ansible not outputting things when it gets aborted mid-task19:48
clarkbwe could add logging of that to the ansible role but then we set no_log: true which I think may break any explicit logging too19:48
fungiso it's hard to even identify which swift endpoint is involved in one of the post_failure results19:48
clarkbI think we've managed the bleeding, but now we're looking for ideas on how we might log this better going forward.19:49
clarkbOne great idea that fungi had was to do two passes of uploads. The first can upload the console log and console json and maybe the inventory content. Then a second pass can upload the job specific data19:49
ianwyeah -- when i started hitting this, it turned out to be a massive syslog due to a kernel bug that only hit in one cloud provider, flooding it with backtrace messages.  luckily in that case, some of the uploads managed to work, so we could see the massive file.  but it could be something obscure and unrelated to the job like this19:49
corvusthat would help us not have post_failures, but it wouldn't help us have logs, and it wouldn't help us know that we have problems uploading logs.19:50
clarkbThe problem with this is we generate a zuul manifest with all of the log files and record that for the zuul dashboard so we'd essentially need to upload those base logs twice to make that work19:50
corvusiow, it could sweep this under the rug but not actually make things better19:50
fungii think it's the opposite?19:50
clarkbcorvus: I don't think it would stop the post failures. The second upload pass would still cause that to happen19:50
fungiit wouldn't stop the post_failure results, but we'd have console logs and could inspect things in the dashboard19:50
corvusoh i see19:51
clarkbit would allow us to, in theory, know where we're slow to upload to19:51
clarkband some other info.19:51
fungibasically try to avoid leaving users with a build result that says "oops, i have nothing to show you, but trust me this broke"19:51
clarkbBut making that shift work in the way zuul's logging system currently works is not trivial19:51
corvusthat sounds good.  the other thing is that all the info is in the executor logs.  so if you want to write a script to parse it out, that could be an option.19:51
clarkbMostly calling this out here so people are aware of the struggles and also to brainstorm how we can log better19:51
clarkbcorvus: the info is only there if we don't time out the task though19:51
corvusi suggest that because even if you improve the log upload situation, it still doesn't really point at the problem19:52
fungiexcept a lot of the info we want for debugging this isn't in the executor logs, at least not that i can find19:52
clarkbcorvus: when we time out the task we kill the task before it can record anything in the executor logs19:52
fungithough we can improve that, yes19:52
clarkbat least that was my impression of the issue here19:52
corvusno i mean all the info is in the executor log.  /var/log/zuul/executor-debug.log19:52
fungibasically do some sort of a dry-run task before the task which might time out19:52
clarkbcorvus: yes that file doesn't get the info because the task is forcefully killed before it records the info19:52
fungiall of the info we have is in the executor log, but in these cases the info isn't much19:53
clarkbya another approach may be to do the random selection of target, record it in the log, then start the upload similar to the change jrosser wrote19:53
clarkbthen we'd at least have that information19:53
corvusyou're talking about the swift endpoint?19:53
clarkbcorvus: yes that is the major piece of info19:53
fungithat's one piece of data, yes19:53
clarkbpotentially also the files copied19:53
fungiit gets logged by the task when it ends19:53
fungiexcept in these cases, because it isn't allowed to end19:54
fungiinstead it gets killed by the timeout19:54
clarkbthe more I think about it the more I think a change like jrosser's could be a good thing here. Basically make random selection, record target, run upload. Then we record some of the useful info before the forceful kill19:54
fungiso we get a log that says the task timed out, and no other information that task would normally have logged (for example, the endpoint url)19:54
fungiwe can explicitly log those other things by doing it before we run that task though19:55
clarkbhttps://review.opendev.org/c/opendev/base-jobs/+/847780 that change19:55
corvusyeah that's a good change19:55
clarkbthat change was initially made so we can do two upload passes, but maybe we start with it just to record the info and do one upload19:55
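
A rough sketch of that shape, with upload-logs-swift being the zuul-jobs role that does the actual upload and the variable names here being assumptions: choose the endpoint first, log it, then run the upload against the already-selected target.

    # Sketch: select and record the swift target before the upload task so
    # the executor log captures it even if the task is later killed by the
    # timeout. Variable names and the cloud list are illustrative assumptions.
    - name: Select a random log upload target
      set_fact:
        _log_cloud: "{{ opendev_log_clouds | random }}"

    - name: Record the selected upload target
      debug:
        msg: "Uploading logs to {{ _log_cloud.name }}"

    - name: Upload logs to swift
      # the production role suppresses output with no_log; shown on the
      # include here purely for illustration
      no_log: true
      include_role:
        name: upload-logs-swift
      vars:
        zuul_log_cloud_config: "{{ _log_cloud }}"
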
fungianother idea was to temporarily increase the timeout for post-logs and then try to analyze builds which took longer in that phase than the normal timeout19:56
corvusthen you'll have everything in the debug log :)19:56
clarkbyup. Ok thats a good next step and we can take debugging from there as it may provide important info19:56
clarkbwe are almost out of time but do have one more agenda item I'd like to get to19:56
corvusin particular, you can look for log upload times by job, and see "normal" and "abnormal" times19:56
fungithe risk of temporarily increasing the timeout, of course, is that jobs may end up merging changes that make the situation worse in the interim19:56
clarkb#topic Bastion host19:57
clarkbianw put this on the agenda to discuss two items re bridge.19:57
corvusyeah, i wouldn't change any timeouts; i'd do the change to get all the data in the logs, then analyze the logs.  that's going to be a lot better than doing a bunch of spot checks anyway.19:57
clarkbThe first is whether or not we should put ansible and openstacksdk in a venv rather than global install19:57
ianwthis came out of19:57
ianw#link https://review.opendev.org/c/opendev/system-config/+/84770019:57
ianwwhich fixes the -devel job, which uses these from upstream19:58
ianwi started to take a different approach, moving everything into a non-privileged virtualenv, but then wondered if there was actually any appetite for such a change19:58
ianwdo we want to push on that?  or do we not care that much19:58
clarkbI think putting pip installs into a venv is a good idea simply because not doing that continues to break in fun ways over time19:59
clarkbThe major downsides are things are no longer automatically in $PATH but we can add them explicitly. And when python is upgraded you get really weird errors running stuff out of venvs19:59
ianwyeah they basically need to be regenerated20:00
fungii am 100% in favor of avoiding `sudo pip install` in favor of either distro packages or venvs, yes20:00
fungialso python upgrades on a stable distro shouldn't require regenerating venvs unless we do an in-place upgrade of the distro to a new major version20:00
clarkbianw: and config management makes that easy if we just rm or move the broken venv aside and let config management rerun (there is a chicken and egg here for ansible specifically though, but I think that is ok if we shift to venvs more and more)20:00
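
A minimal sketch of what that could look like with Ansible's pip module; the venv location and the choice of symlinked entry points are assumptions.

    # Sketch: install ansible and openstacksdk into a dedicated venv instead
    # of the system python, then expose the entry points on PATH explicitly.
    # The venv location is an illustrative assumption.
    - name: Install ansible and openstacksdk into a venv
      pip:
        name:
          - ansible
          - openstacksdk
        virtualenv: /usr/ansible-venv
        virtualenv_command: python3 -m venv

    - name: Link the venv entry points into /usr/local/bin
      file:
        src: "/usr/ansible-venv/bin/{{ item }}"
        dest: "/usr/local/bin/{{ item }}"
        state: link
      loop:
        - ansible
        - ansible-playbook
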
clarkbfungi: yes that is the next question :)20:01
clarkbThe next question is re upgrading bridge and whether or not we should do it in place or with a new host20:01
fungiand to be clear, an in-place upgrade of the distro is fine with me, we just need to remember to regenerate any venvs which were built for the old python versions20:01
clarkbI personally really like using new hosts when we can get away with it as it helps us start from a clean slate and delete old cruft automatically. But bridge's IP address might be important? In the past we explicitly allowed root ssh from its IP on hosts iirc20:01
ianwthis is a longer term thing.  one thing about starting a fresh host is that we will probably find all those bits we haven't quite codified yet20:02
clarkbI'm not sure if we still do that or not. If we do then doing an in-place upgrade is probably fine. But I have a small preference for a new host if we can get away with it20:02
corvusclarkb: i think that should be automatic for a replacement host20:02
clarkbcorvus: ya for all hosts that run ansible during the time frame we have both in the var/list/group20:02
clarkbmostly concerned that a host might get missed for one reason or another and get stranded but we can always manually update that too20:03
ianwok, so i'm thinking maybe we push on the virtualenv stuff for the tools in use on bridge first, and probably end up with the current bridge as a franken-host with things installed everywhere every which way20:03
clarkbAnyway no objection from me shifting ansible and openstacksdk (and more and more of our other tools) into venvs.20:03
fungisame here20:03
ianwhowever, we can then look at upgrade/replacement, and we should start fresh with more compartmentalized tools20:03
clarkbAnd preference for new host to do OS upgrade20:03
fungii also prefer a new host, all things being equal, but understand it's a pragmatic choice in some cases to do in-place20:05
clarkbWe are now a few minutes over time. No open discussion today, but feel free to bring discussion up in #opendev or on the mailing list. Last call for anything else on bastion work20:05
fungithanks clarkb!20:05
corvusthanks!20:07
clarkbSounds like that is it. Thank you everyone!20:07
clarkb#endmeeting20:07
opendevmeetMeeting ended Tue Jun 28 20:07:49 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:07
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.html20:07
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.txt20:07
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-06-28-19.01.log.html20:07
