Friday, 2022-06-24

BlaisePabon[m]How much work are we talking about? 00:38
BlaisePabon[m]I have a fedora nested kvm server with 20 cores sitting in my closet?00:38
BlaisePabon[m]You're welcome to throw jobs at it.00:38
opendevreviewIan Wienand proposed opendev/grafyaml master: [wip] test real import of graphs  https://review.opendev.org/c/opendev/grafyaml/+/84742100:38
ianwBlaisePabon[m]: we do obviously gratefully accept resources.  but it's also non-zero cost to setup and maintain them, so we have to find the balance00:39
ianwwe have a control-plane and a CI tenant, in the control plane we run a cloud-local mirror (so CI nodes don't have to go over the network).  00:40
ianwthe best situation is where somebody is motivated to maintain the cloud for other reasons ($$$$, probably :) and provides us resources.  so we help develop the software they use, and they contribute resources in return.  everyone wins when it works out00:42
opendevreviewIan Wienand proposed opendev/grafyaml master: [wip] test real import of graphs  https://review.opendev.org/c/opendev/grafyaml/+/84742100:45
fungii'm going to perform a controlled reboot of lists.o.o now to make sure the latest unpacked kernel is viable01:29
*** rlandy|bbl is now known as rlandy01:33
fungiugh, it's going straight into shutoff state01:37
fungii'm going to have to perform a rescue boot with it01:39
*** rlandy is now known as rlandy|out01:40
ianw:/ do you think it's the kernel or something else that's happened in the install?01:41
Clark[m]In the past it's been either xen can't find the kernel or the kernel wasn't extracted properly 01:41
Clark[m]You should be able to edit the menu.lst to put an old kernel back ?01:42
clarkbfungi: it was the vmlinuz that you extracted right? that's the file that needs to be uncompressed. Otherwise I got nothing01:48
fungiyeah, i edited the menu.lst and moved the entry for the kernel we were previously booted on to the top of the list, then unrescued, but it still immediately goes to shutoff state when i try to start it01:49
fungiit just picks the first one listed, right?01:50
clarkbyes that was what it seemed to do in the past01:50
clarkbI wonder if it is looking at grub.conf instead?01:50
clarkbthough I thought for us to do that we had to put an entry in menu.lst to boot the shim instead of the normal vmlinuz to chain load grub01:50
clarkband I think I tried to get that to work back when and couldn't get it to happen01:51
fungii'll try rolling it back in grub.cfg, yeah01:55
clarkbit's a lot more complicated there unfortunately01:56
clarkbI wonder if the console log shows anything that might indicate which kernel it is finding and attempting?01:57
clarkb(that seems like a long shot but may offer clues if it does)01:57
*** ysandeep|out is now known as ysandeep01:59
corvusoh hi01:59
clarkbhello01:59
corvusi was just noticing mailing list accessibility probs, and i see you're looking at it... catching up now.01:59
fungijust trying to test a reboot on a new unpacked kernel when it's less likely to impact anyone. glad i didn't do it at peak time02:00
clarkbcorvus: fungi was doing a controlled reboot to bump the kernel after extracting it (as is necessary because xen), but something is unhappy with that; almost certainly xen finding the unpacked kernel is the problem02:00
corvusanything another set of eyes/hands can do?02:01
fungiunfortunately switching the primary profile in grub.cfg to the old kernel doesn't seem to have solved it either02:02
clarkbI'm not sure. I'm still feeling pretty blah and not firing all cylinders so will defer to fungi on that02:02
fungii'll see if i can get a console log out of it, but my luck with getting boot console output in rackspace has not been good02:03
clarkbfungi: what is the suffix of the old kernel? -96?02:03
fungiyes02:03
fungithe new one is 12102:03
clarkbya that matches my memory from when I looked a few days ago02:03
corvusi've got a console up in my browser02:05
corvusi'll stand by and wait to see if fungi gets a working console; otherwise, we can reboot it and i can watch it for clues02:06
clarkbhttps://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy is the chainloading thing I mentioned earlier should we want to try that02:07
clarkbbasically you edit menu.lst so that xen's grub1 built in stuff chainloads grub2 which reads the grub2 configs02:07
clarkbBut if I remember correctly I was never able to get that to work as expected02:08
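For reference, the chainloading approach from that wiki page boils down to a minimal menu.lst whose single entry loads a grub2 image built for the Xen PV platform instead of a kernel. A rough sketch only (the /boot/xen-shim path is an assumption; the actual image name and location depend on the distro packaging):

    default 0
    timeout 3

    title Chainload GRUB 2
        root    (hd0,0)
        kernel  /boot/xen-shim

pvgrub-legacy then only has to parse this file, and the real grub2 it loads goes on to read the normal grub.cfg.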
fungii've put the configs back to what they were for the first reboot and unrescued02:08
corvushrm, console disconnected and i didn't get it back, so i guess the browser console doesn't work for this situation.02:10
clarkbI think it may disconnect you when you reboot which makes this sort of debugging more difficult?02:10
fungii've started the server now02:11
corvusweb ui says it's in "error" state02:11
fungioh, right after unrescue it seems to go into error02:12
fungii'll stop and start it again, but previously it would immediately go into shutoff when i tried to start it again02:12
fungiyeah, trying to start the server it remains in "shutoff" state according to the api02:13
clarkbJust off the top of my head things to consider: double check that vmlinuz was the extracted file. Double check that reextracting it results in the same file. Double check the default entry in menu.lst is the first (0th) item? It's possible xen is doing more parsing of the content than simply finding the first entry. Try the chainload grub2 thing from above02:15
fungiyeah, i think we're stuck blindly trying things from rescue mode02:15
fungii also shouldn't have tried this so late in my day, but i expected rolling back to the working kernel would be a viable way out if the new one wasn't working02:16
fungiputting it back into a rescue boot again now02:17
clarkbfungi: do you want to add my pubkey to the rescue node and get a second set of eyes on it?02:20
fungisure, just a sec02:21
fungii've added the public keys for you, corvus and ianw to the root account02:23
fungithe server's rootfs is mounted at /mnt02:23
fungichecksum of /mnt/root/kernel-stuff/vmlinuz-5.4.0-121-generic.extracted matches /mnt/boot/vmlinuz-5.4.0-121-generic which is roughly the same size as /mnt/boot/vmlinuz-5.4.0-96-generic02:24
clarkbmenu.lst default is 0 so that is confirmed that the first entry should be used02:28
fungiand /mnt/boot/vmlinuz-5.4.0-96-generic still has the same checksum as the extracted copy in kernel-stuff and has a january last modified date02:32
clarkbya I'm thinking it likely isn't an issue with the kernels but with finding them if that makes sense02:35
clarkbsince the old kernel should still boot if it is findable02:35
clarkbtalking out loud here: maybe we try a super simplified menu.lst to remove possibility that xen is being confused by something else in the file. And if that doesn't work try to chainload also using a super simplified version of the file?02:37
corvusmenu.lst_backup_by_grub2_prerm is interesting; it looks very similar to menu.lst02:37
corvusif that's a legit old working menu.lst, it would suggest that nothing significant changed other than version numbers02:38
corvusregardless, i do like clarkb's plan02:38
corvusstep 1: menu.lst with only the old kernel; step 2 https://wiki.xenproject.org/wiki/PvGrub2#Chainloading_from_pvgrub-legacy02:38
fungiyes, that sounds solid02:39
corvusi think (hd0,0) should be right for us (assuming those menu.lst files are roughly correct)02:39
clarkbcorvus: I think it may just be hd0 ?02:39
clarkbour existing and old menu.lsts all use hd0 not hd0,0 at least02:39
fungithough also that may be irrelevant if xen is reading the fs anyway02:40
corvuswell, the pvgrub example has a partition... and our device has a partition table and / is the first partition, so i'd assume (hd0,0) for that...02:41
clarkbcorvus: good point02:41
clarkb/root/clarkb-menu.lst on the rescue side has a simplified version of what I think it may want to look like. Though that still uses (hd0)02:41
clarkbmaybe if that looks correct to the others we move aside the current menu.lst and copy that in place?02:42
corvus(actually (and still just thinking ahead to pvgrub) should it be (hd0,1)?)02:42
fungithe current menu.lst and the menu.lst_backup_by_grub2_prerm from a year ago both use (hd0) in the root entries02:42
clarkbcorvus: because it is xvda1 ?02:43
corvusyeah i'm trying to remember whether the part arg is base 0 or 102:43
fungithe example in the comment block shows a dual-boot scenario with windows on 0,0 and linux on 0,102:44
corvuslegacy is base0 for partitions02:44
fungiand equates root=/dev/hda2 with (hd0,1)02:44
corvusgrub2 is base1 (but also has things like "(hd0, msdos1)")02:45
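Spelling out the numbering difference for the case at hand (/ on the first partition of the first disk, i.e. /dev/xvda1):

    /dev/xvda1  ->  grub legacy: (hd0,0)                  (partitions counted from 0)
    /dev/xvda1  ->  grub2:       (hd0,1) or (hd0,msdos1)  (partitions counted from 1)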
clarkbI guess the other thing to consider is if we have to try both new and old kernel with the slimmed down menu.lst as it may be the new kernel that has problems too? But one step at a time02:46
corvusso clarkb's file lgtm except do we want to do 96 instead of 121?02:46
clarkbjinx!02:46
corvus:)02:46
clarkbI'm happy to update to that version instead02:46
corvusstep1: simplified 121; step2: simplified 96; step3: chainload?  that sounds good to me02:46
fungiprobably try 96 and if it works plan another window to retry an upgraded boot02:46
clarkbok updating02:47
corvus++02:47
fungibut i'm okay trying 121 first if people don't mind extending the unscheduled window02:47
clarkbhows that02:47
clarkbfungi: I'll let you put it in place since you have control of the unrescue02:47
corvuslgtm02:48
corvusi've logged out02:48
clarkbI have too02:48
fungiyeah, lgtm02:48
fungii've moved the current menu.lst into /mnt/boot/grub/menu.lst_broken_2022-06-2402:50
fungiand swapped in clarkb's02:50
fungiunrescuing now02:50
fungiwent into error state after unrescuing02:52
fungii'll stop and start the server02:52
fungistill staying in shutoff state02:52
fungishall i put it back into rescue?02:53
clarkbya I think the other thing to try is the chainload and we need rescue for that02:53
clarkband if that doesn't work maybe we need to see if rax will give us some error logs so that we can divine where it is failing? that old kernel worked before and hasn't changed which really makes me suspect the grub interaction not the kernel itself02:55
ianwsorry stepped out for lunch at an exciting time; back now02:57
fungiokay, i have everyone's keys back on the rescue root02:58
clarkbI made another clarkb-menu.lst, this time for chainloading (it uses hd0,0; can change to hd0 if we prefer), if that is what we want to try next02:59
fungilgtm, i can put that in place next and see what happens03:02
fungiit's copied in03:02
clarkbif [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi <- just noticing that in grub.conf03:03
clarkbmakes me wonder if there are other things that need extracting03:03
fungilooks like you probably have an open file handle on /mnt03:03
clarkbyup I've dropped it03:03
clarkbwas looking at grub.conf and am done now03:04
fungiumounted and unrescuing03:04
fungiit went into an active state instead of error03:05
clarkbit is pinging for me too03:06
clarkband I can log in03:06
funginovnc shows a login prompt03:06
clarkband it is running the kernel03:07
clarkbwow so it was the grub stuff03:07
clarkbthe kernel was fine03:07
fungiindeed03:07
* fungi grumbles03:07
clarkbif we can continue to chainload I think that is preferable fwiw, then we can use grub2 almost like everything else03:07
fungithanks for working that out!03:08
clarkbbut we should probably double check that menu.lst isn't updated the wrong way with the next kernel update03:08
fungiyep03:08
corvusi wonder if a kernel failure would leave us in an active state; so it showing up in error points toward "grub issues"03:08
clarkblooks like 6 mailmanctls and some exim processes are running which implies to me that services started ok too03:08
fungi`ps auxww|grep -c ^list` returns 5403:09
fungi(9*6)03:09
fungiso that looks sane03:09
clarkbif corvus isn't ready to call it a day, sending through that email that was held up might be a good test?03:10
corvusi released ianw's zuul-announce message03:10
clarkbjinx again! :)03:10
corvusjinx03:10
clarkbI see it in my inbox too03:10
fungii've copied /boot/grub/menu.lst to /boot/grub/menu.lst_working_2022-06-24 just in case03:11
fungiyep, i too received the ensure-pip announcement03:11
clarkb++ on making that copy03:12
clarkbok I'm going to pop off now as it seems like things are likely working03:12
fungiyep thanks, and sorry about the fire drill! i anticipated that the reboot might not work, but did not expect to be unable to roll it back03:12
clarkbya not being able to roll back was definitely unexpected. I wonder what about the menu.lst we had before wasn't working. Could it be the (hd0)?03:13
clarkbnot the sort of thing I want to spend all day rescue and unrescuing to bisect though03:13
clarkbanyway good night!03:13
fungithanks again!03:13
corvusg'night!03:13
opendevreviewIan Wienand proposed openstack/project-config master: grafana: import graphs and take screenshots  https://review.opendev.org/c/openstack/project-config/+/84712903:56
opendevreviewIan Wienand proposed opendev/grafyaml master: Test with project-config graphs  https://review.opendev.org/c/opendev/grafyaml/+/84742104:00
*** undefined_ is now known as Guest311204:09
*** akahat is now known as akahat|ruck04:39
opendevreviewIan Wienand proposed opendev/system-config master: gerrit docs: cleanup and use shell-session  https://review.opendev.org/c/opendev/system-config/+/84506605:38
opendevreviewIan Wienand proposed opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove  https://review.opendev.org/c/opendev/system-config/+/84585305:38
*** undefined_ is now known as Guest311505:51
opendevreviewMerged opendev/system-config master: gerrit docs: cleanup and use shell-session  https://review.opendev.org/c/opendev/system-config/+/84506605:56
opendevreviewMerged opendev/system-config master: gerrit docs: add note that duplicate user may have email addresses to remove  https://review.opendev.org/c/opendev/system-config/+/84585306:03
*** arxcruz|rover is now known as arxcruz06:55
*** jpena|off is now known as jpena07:06
*** ysandeep is now known as ysandeep|afk07:11
*** ysandeep|afk is now known as ysandeep09:00
noonedeadpunkhey there! I was wondering whether you folks happen to run ara-api somewhere for opendev?09:46
noonedeadpunkAs I bet that ara-report is part of why we get POST_FAILURES from one of the providers09:47
noonedeadpunkBasically I downloaded 1 ara report for a random job and got 4805 files totalling 179M09:48
noonedeadpunkand am trying to find ways to optimize that09:50
noonedeadpunkseems like posting records to a remote API server is the most efficient of the available options. But then looking at the demo deployment I'm not sure how to search for specific job results...09:51
noonedeadpunkbut it's probably more of a question for dmsimard regarding how to mark a specific job/deployment to make it searchable and relate all the playbooks that were run within that job09:52
*** rlandy|out is now known as rlandy10:21
*** ysandeep is now known as ysandeep|brb11:01
*** ysandeep|brb is now known as ysandeep11:18
*** dviroel|out is now known as dviroel11:19
akahat|ruckHello 11:27
akahat|ruckI'm seeing POST_FAILURES in the jobs11:28
akahat|ruckhttps://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=011:28
akahat|ruckcan someone tell me why this is happening?11:28
*** rlandy is now known as rlandy|dr_appt11:48
funginoonedeadpunk: yes, we looked into the new centralized ara model, but if memory serves (it's been a while) it's not well-adapted to separate out test results, and it would be yet another service we'd need to care for and feed11:55
fungiakahat|ruck: are you referring specifically to tripleo jobs? because the only post_failure result in the openstack tenant's gate pipeline today was for a tripleo-ci-centos-9-containers-multinode build roughly 5 hours ago11:58
noonedeadpunkfungi: we have tons of them today fwiw11:59
funginoonedeadpunk: not in the gate pipeline11:59
fungithough there were 6 yesterday, most of which were for openstack-ansible-deploy jobs12:00
noonedeadpunknah, in check12:00
noonedeadpunkthat's why again started looking on how to reduce pressure on swift by our logs12:00
fungiharder to reason about post_failure results in check because there's no way to know that the change is ever able to pass its jobs, while in the gate pipeline it has at least succeeded once12:01
noonedeadpunkand ara is an outstanding leader...12:01
noonedeadpunkI bet we had in gates as well12:01
noonedeadpunklet me look for it12:02
noonedeadpunkhttps://zuul.opendev.org/t/openstack/build/128f27979b424d4f825036f6301053df12:02
fungias soon as i get a little more coffee in me, i'll try to run down the causes in the executor logs12:03
noonedeadpunkbut I bet for us it's swift upload timeouts again12:03
fungiprobably, but i'm more curious to see if it's the same thing for the tripleo jobs or something different12:04
*** ysandeep is now known as ysandeep|afk12:12
noonedeadpunkyup, I just caught that in the console12:12
fungithe upload timeout?12:14
Clark[m]Note some tripleo jobs do more than half an hour of log uploads. It's possible they ride right along the edge of what times out and what doesn't12:14
Clark[m](we've noticed this when doing zuul executor restarts and waiting on that one last job to finish)12:15
fungiyeah, you may be onto something with ara reports, they can create a massive number of files, so may significantly inflate swift upload times (since the files are uploaded one-by-one to the api) without exceeding the disk space limit we enforce on the workspace12:16
noonedeadpunkClark[m]: so we also see log uploads >30m now. But on average it takes <10m...12:16
noonedeadpunkI'm not sure if tripleo _always_ upload that long though12:16
Clark[m]Ya I think tripleo long uploads are consistent. But if things slow down then they may be very likely to exceed timeouts12:17
fungii wonder if it's impacted by rtt? jobs run in na providers uploading to rackspace and ovh-bhs1 go quickly, jobs run in eu providers uploading to ovh-gra1 go quickly...12:17
noonedeadpunkyah, ara spawns ridiculously many files12:17
Clark[m]Yes, it was a regression and part of why we stopped running it for jobs by default12:18
fungiso it could be the latency crossing the atlantic inflating the upload times to explain the 3x discrepancy you're observing12:18
Clark[m]fungi: and that could be made worse by internet activity12:18
fungiabsolutely12:18
fungior choked peering points between backbone providers along a preferred route...12:19
fungithe sorts of things that tend to silently (from our perspective) come and go12:19
noonedeadpunkwell, for projects that have ansible as a base, ara is quite an important source of information :(12:21
fungicould you tar it up?12:24
Clark[m]Or talk to dmsimard about adding the functionality back that was removed to store it in a SQLite db12:25
Clark[m]Or use a different tool like zuul did12:25
Clark[m]I wonder how hard it would be to point zuul's renderer at a different json log12:26
fungithough i get the impression that the file-backed implementation in ara is considered "legacy" and the intent is that users put multiple runs in a single database now12:26
noonedeadpunkfungi: well. then you need to download tar locally for each job, unpack, browse locally12:26
funginoonedeadpunk: yes, that's exactly what i'm suggesting12:26
noonedeadpunkClark[m]: oh well we had some talk back then and it was hardly achievable unfortunately...12:26
fungii know it would be less convenient than consuming it directly over the web12:26
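As a rough illustration of the tar-it-up idea, and only a sketch (the ara-report paths below are assumptions, not what any job currently does), the whole report tree could be collapsed into one archive before log collection so swift stores a single object instead of thousands:

    import tarfile

    # Hypothetical locations: wherever the job wrote its ara static report
    # and wherever collected logs are staged for upload.
    src = "logs/ara-report"
    dst = "logs/ara-report.tar.gz"

    with tarfile.open(dst, "w:gz") as tar:
        # Keep everything under a single top-level directory in the archive.
        tar.add(src, arcname="ara-report")

The trade-off is exactly the one discussed here: one cheap upload, but readers have to download and unpack it locally instead of browsing over the web.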
noonedeadpunkperfect situation would be having ara-api :)12:27
fungihave you tried feeding test results for multiple jobs into an ara-api server?12:27
noonedeadpunkgiven that dmsimard can help out with implementing some filtering/tagging per job12:27
noonedeadpunkyeah... dmsimard uploads kolla-ansible and ours results https://demo.recordsansible.org/12:28
noonedeadpunkquite impossible to understand where's what12:28
Clark[m]I'm going to say upfront that I don't think we should deploy another service for this. There are alternatives like zuul's that work with minimal overhead and no additional service work on our part. Adding tons of little services like this has only bitten us down the road when they need maintaining 12:28
fungiif you (or someone) wanted to run an ara-api service similar to how dpawlik is running a log indexing service for openstack, you could adjust jobs to submit results to that api12:28
fungibut yes, i agree with Clark[m] that the bar to add it to opendev's current service set would be pretty high12:29
fungior you could do a scraper similar to how ci-log-processing is set up, to query zuul and pull archived ara datasets from the recorded job logs/artifacts and feed those into an ara-api instance asynchronously12:32
noonedeadpunkI think the biggest problem with leveraging zuul is that we need to run our "own" ansible with ansible...12:33
fungithat service model has the added benefit that it wouldn't increase job runtime if the ara-api interface is slow or gets bogged down under extreme load12:33
noonedeadpunknot sure I fully got this tbh...12:34
noonedeadpunkprobably because I have no idea about how ci-log-processing is set12:35
Clark[m]Re using zuul's renderer, yes I think a small amount of work would be necessary to update zuul to optionally look at a different Ansible json output file.12:35
fungici-log-processing is the solution dpawlik developed to query the zuul builds table, retrieve build logs for each of the builds it's interested in, and then feed them into an opensearch backend12:36
Clark[m]And maybe that doesn't want to live in zuul directly but get extracted into something you can include in your job logs. I'm not sure what the best approach is there. I just know that part of the reason this exists in zuul is this very issue with ara12:36
noonedeadpunkfungi: ah, you mean if we host ara-api somewhere else I guess?12:36
fungiright, exactly how the new opensearch is working12:37
fungiand doing it asynchronously from the job running has the added benefit that it won't impact the jobs themselves if it stops working for some reason12:37
noonedeadpunkWell, we can only host that within someone's company, and I don't really like to make project logs dependent on any company...12:37
noonedeadpunkAnd I kind of like Clark[m] idea to leverage zuul renderer12:38
fungifor ci-log-processing we provided a plain unconfigured server instance in opendev's control plane tenant on one of our donor providers and gave the admin of the service access to ssh into it12:38
noonedeadpunkas it has basically everything we need12:39
fungibut yes, reusing zuul's task renderer would be awesome, maybe even possible to turn it into a library which zuul consumes in order to reduce the collective maintenance burden12:39
noonedeadpunkit sounds simpler and quite useful overall12:40
fungiand yeah, the idea is you take the ansible json output, and then interpret it browser-side with javascript12:41
fungithough i do wonder if it would scale well performance-wise to a dataset as large as what openstack-ansible builds produce12:41
fricklerdo we want to limit log item count in addition to log volume on the executor side? not sure though what a reasonable limit would look like12:42
noonedeadpunkthat is good question, if it's done in browser I totally can see where things can go south...12:42
* dpawlik reading12:42
fungifrickler: i was thinking the same thing, it's harder to know what a sane file limit might be, but also the reason we limit log size has more to do with not filling up the executor's disk and less to do with making sure builds will be able to upload results reliably12:43
funginoonedeadpunk: though maybe there's a reasonable way to shard by play or something12:43
Clark[m]The disk limit is more to protect the executor than to ensure jobs pass. I think we have much less concern about inodes12:44
funginoonedeadpunk: and have it only fetch the json for the play if the user expands it (you'd be trading responsiveness to user interaction in that case i guess)12:44
fricklerfungi: well we also have an inode limit on /var/lib/zuul. though for example we are at 22% inodes on ze01 currently vs. 33% space12:44
Clark[m]fungi: noonedeadpunk: you should be able to test it via web browser debugging tools and changing the json path location to load real osa data12:45
fricklerbut in theory a job with a huge number of tiny files could likely exhaust inodes12:45
noonedeadpunkfungi: yes, I should totally look at how that's done. I'm not that good at JS but hope I can figure out something12:45
fungiyes, i buy the inode argument. we could set a file count governor to limit to a percentage of our inode capacity similar to the percentage of our block capacity12:45
Clark[m]frickler: ok so less of an inode concern, but not as much less as I thought12:45
Clark[m]Mobile keyboards try to be too helpful sometimes 12:46
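A very rough sketch of the file count governor fungi mentions above, under the assumption that it would sit next to the existing disk-usage check on the executor (the /var/lib/zuul mount point and the 5% budget are illustrative, not anything implemented today):

    import os

    def object_count(path):
        # Every file and directory in the log tree consumes an inode on the
        # executor (and later becomes at least one swift object).
        total = 0
        for _root, dirs, files in os.walk(path):
            total += len(dirs) + len(files)
        return total

    def over_inode_budget(log_dir, mount_point="/var/lib/zuul", fraction=0.05):
        st = os.statvfs(mount_point)
        budget = int(st.f_files * fraction)  # f_files is the filesystem's total inode count
        return object_count(log_dir) > budget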
noonedeadpunkok, thanks for ideas!12:46
fungii need to step away from the keyboard for a few minutes, but will be back shortly and try to dig into today's tripleo post_failure build12:46
noonedeadpunkI have something to work on now :)12:46
*** rlandy|dr_appt is now known as rlandy12:51
*** ysandeep|afk is now known as ysandeep12:58
*** undefined__ is now known as rcastillo13:08
*** undefined_ is now known as rcastillo13:09
noonedeadpunkhm, there's also an elastic callback plugin to log directly into elasticsearch....13:11
*** dasm|off is now known as dasm13:33
*** ysandeep is now known as ysandeep|out15:00
*** dviroel is now known as dviroel|lunch15:19
clarkbnoonedeadpunk: that would require direct access to ES and that isn't available with the current setup aiui. But even if you had that our experience has shown that it is a good idea to decouple jobs from writing to ES as that can be quite slow and occasionally have errors. Also you probably need something to render the data out of ES once it is there15:55
*** jpena is now known as jpena|off16:02
fungiclarkb: revisiting the lists.o.o outage and grub chainloading solution from earlier, does that mean we can get away with no longer decompressing the kernels and letting grub do it in the second stage?16:02
fungiobviously it's something we would test in a scheduled window, but curious if it simplifies further kernel upgrades16:03
clarkbfungi: I think it does open that possibility. However, if you look in grub.conf there was that if condition for xen where it adds grub modules for things like xz; I think maybe we have to add lz4 too?16:03
clarkbbecause xen grub is reading our basic menu.lst which points to the grub2 shim loader elf which xen can read. Then that loads grub2 proper aiui which in theory can do the lz416:04
clarkbthat said I know I tested this before, but I think I was testing it with a compressed kernel and it didn't work then. So there is likely something like adding the extra mod that we need to do?16:05
fungiyeah, needing to add another module in the second stage grub config wouldn't surprise me16:11
fungiakahat|ruck: i looked at the executor's debug log for https://zuul.opendev.org/t/openstack/build/ffb35b4d680d48a8bd21b1964964019d and it seems superficially similar to the problem noonedeadpunk has been running down in the openstack-ansible-deploy jobs... TASK [upload-logs-swift : Upload logs to swift] ends in "Ansible timeout exceeded: 3600"16:15
fungii wonder how trivial it would be to put another task just before that which says how many files will be uploaded, so we can start to get a feel for what the upper bound is on these16:16
fungiand whether it's related to file count at all (even if that's only one of the variables feeding into the problem)16:16
clarkbwhen successful they are all listed already. This doesn't help if the problem is build specific due to some exceptional case, but I don't think that is the tripleo situation. They already log excessively and could probably take a critical eye to that16:17
clarkbusing known information without changing anything about the jobs to collect more info16:17
*** dviroel|lunch is now known as dviroel16:18
fungiyeah, looking at https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py i think we're relying on the sdk to explode paths/globs anyway, so there would be a bit of code involved to guess how the sdk will enumerate the input16:21
fungibut yes, my thought was to be able to tell whether the builds hitting upload timeouts had significantly more files/chunks/whatever than their successful counterparts16:22
clarkboh interesting they use their own log collection playbook so the one we provide via zuul jobs which logs better doesn't log anything :(16:23
clarkbso the response to my first suggestions is that it isn't as easy as I first thought because tripleo is special16:24
fungianother thought would be to split the uploading into two separate tasks, the first only uploading the console log, console json, and zuul info files16:24
fungithat way if the problem is uploading way too many log files, at least there's the basic logs and a working result dashboard16:24
clarkboh interesting I like that16:25
fungiit does mean more than one upload task, but hopefully the overhead of splitting it that way would be small16:26
fungihowever i'm not sure how to untangle it since a lot of that logic is in zuul-jobs16:27
clarkband for many jobs they don't log anymore than that16:27
clarkbI think the swift-upload-logs would haev to be run against two different sets of inputs16:27
clarkbso we may have to write those zuul specific log files to a new location and upload that then upload the normal logs afterwards from the existing location (this would be most backward compatible)16:28
fungior the upload-logs role could grow a priority list which it uploads only those matches in the first task and then excludes those from the list expanded in the second task16:28
fungithat might be unnecessarily complicated though16:28
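A minimal sketch of that priority-list idea (the patterns and helper here are hypothetical; nothing like this exists in the role today): expand the file list once, upload the high-priority matches in a first task, and everything else in a second one.

    import fnmatch

    # Hypothetical "upload these first" patterns; the real role would need to
    # agree on exactly which names count as high priority.
    PRIORITY = ["job-output.txt", "job-output.json", "zuul-info/*"]

    def split_by_priority(paths):
        first, rest = [], []
        for path in paths:
            if any(fnmatch.fnmatch(path, pattern) for pattern in PRIORITY):
                first.append(path)
            else:
                rest.append(path)
        return first, rest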
clarkbfungi: looking at tripleo logs one thing that they have is fairly deep dir structures and each dir is a swift object16:30
clarkbseveral actually as you need the index too16:30
clarkbthey may see improvements flattening the log structure16:30
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html for example16:30
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/opt/git/opendev.org/openstack/skyline-console/docs/zh/develop/images/form/index.html16:30
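To make the overhead concrete: if an index page is generated for each directory level, as the examples above suggest, then a single log file buried ten directories deep implies on the order of ten extra index objects on top of the file itself, so flattening deep trees cuts the object count even when no files are dropped.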
clarkbthere are also a couple of places where they seem to upload the same files16:31
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/tmp/dnf-zuul-gcsew2fm/index.html16:32
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/usr/tmp/dnf-zuul-gcsew2fm/index.html16:32
clarkbthat is unnecessary duplication which can be trimmed16:32
clarkbI'm also not sure we need to copy what feels like all of /etc https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/etc/index.html16:33
fungiakahat|ruck: ^ see above for some actionable suggestions which may help16:33
clarkbmuch of what is in /etc is consistent job to job because it is either distro defaults or because we set it the same way via job configs or dib for all jobs16:34
clarkbyou could selectively log information if it becomes necessary rather than uploading things that no one ever looks at every job16:34
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/workspace/.quickstart/usr/local/share/ansible/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html thats a much deeper nesting example16:35
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/src/opendev.org/openstack/tripleo-quickstart-extras/roles/validate-tempest/files/tempestmail/tests/fixtures/index.html seems to be a duplicate with one of the /opt/git/ paths above too so more16:36
clarkbduplicates to clean up16:36
clarkbI suspect that things like https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/environments/index.html are known or can be generated locally too? Not sure we need to log that? I mean we don't log16:38
clarkball our ansible that we are about to run. We know that it matches what is in our repo. But I would start with things like deep nesting, duplicates, and excessive /etc content that never changes16:38
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/home/zuul/tripleo-deploy/undercloud/tripleo-heat-installer-templates/tools/index.html looks like another good trim16:49
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_614/842073/10/gate/tripleo-ci-centos-9-undercloud-upgrade/6142441/logs/undercloud/var/lib/neutron/.cache/python-entrypoints/index.html anything like that too16:50
akahat|ruckfungi, clarkb thank you for taking a look into this.17:52
akahat|ruckTripleO collects lots of logs and there is some duplication also.17:52
akahat|ruckWe're also not sure about the file count.. adding a file count +1.17:53
fungiand that seems to be resulting in an increasing number of job failures, so needs to be reined in17:53
akahat|ruckand also i liked your idea about collecting logs where we can add log playbooks which collect specific files from the host.17:53
fungiwe can get file counts for the successful jobs, but we don't know what the file counts might be for the failing ones which are unable to upload all their logs before zuul gets bored of waiting and cuts them off17:54
akahat|ruck+1 for uploading console log, json and zuul info files at first and rest later17:54
fungiif you would like to help with improving the upload-logs roles in zuul-jobs that would be awesome, but at least please look into trimming the fat in tripleo's job log collection17:55
akahat|ruckfungi, yeah.. sure. I'll discuss this topic with my team and will check what specific files we can collect.17:57
clarkbkeep in mind the nesting makes each dir level count as another "file"17:57
clarkbso reducing nesting can also help17:57
akahat|ruckclarkb, noted.17:59
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging  https://review.opendev.org/c/openstack/diskimage-builder/+/84760018:51
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: DNM: Network Manager logging to Trace for Debugging  https://review.opendev.org/c/openstack/diskimage-builder/+/84760018:56
*** dviroel is now known as dviroel|out20:50
opendevreviewGage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments  https://review.opendev.org/c/openstack/project-config/+/84762121:45
opendevreviewGage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo  https://review.opendev.org/c/openstack/project-config/+/84741421:46
opendevreviewGage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo  https://review.opendev.org/c/openstack/project-config/+/84741421:47
opendevreviewGage Hugo proposed openstack/project-config master: End project gating for openstack-helm deployments  https://review.opendev.org/c/openstack/project-config/+/84762121:56
opendevreviewGage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo  https://review.opendev.org/c/openstack/project-config/+/84741421:56
opendevreviewGage Hugo proposed openstack/project-config master: Retire openstack-helm-deployments repo  https://review.opendev.org/c/openstack/project-config/+/84741421:56
