Friday, 2023-03-03

03:42 <opendevreview> Ian Wienand proposed openstack/project-config master: gerrit/acl : submit-requirements for deprecated NoOp function  https://review.opendev.org/c/openstack/project-config/+/875804
03:42 <opendevreview> Ian Wienand proposed openstack/project-config master: gerrit/acl : add submit requirements to NoBlock labels  https://review.opendev.org/c/openstack/project-config/+/875993
03:42 <opendevreview> Ian Wienand proposed openstack/project-config master: gerrit/acl : handle key / values with multiple =  https://review.opendev.org/c/openstack/project-config/+/875994
03:42 <opendevreview> Ian Wienand proposed openstack/project-config master: gerrit/acl : Update Review-Priority to submit-requirements  https://review.opendev.org/c/openstack/project-config/+/875995
03:42 <opendevreview> Ian Wienand proposed openstack/project-config master: gerrit/acl : Convert remaining AnyWithBlock to submit requirements  https://review.opendev.org/c/openstack/project-config/+/875996
09:27 <opendevreview> daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info  https://review.opendev.org/c/openstack/ci-log-processing/+/876260
09:28 <dpawlik> dansmith: hey, let me know if it is fine for you: https://review.opendev.org/c/openstack/ci-log-processing/+/876260
10:58 *** odyssey4me is now known as odyssey4me__
10:58 *** odyssey4me__ is now known as odyssey4me
11:00 *** jpena|off is now known as jpena
11:06 *** odyssey4me is now known as odyssey4me__
11:06 *** odyssey4me__ is now known as odyssey4me
12:02 *** odyssey4me is now known as odyssey4me__
14:23 <opendevreview> Jeremy Stanley proposed openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3""  https://review.opendev.org/c/openstack/project-config/+/876365
14:56 <ade_lee_> fungi, clarkb: gotta ask to hold a node yet again to figure out why the fips ubuntu tests are failing
14:57 <dansmith> dpawlik: questions inline, but yeah, sounds like that's what we need, thanks a lot :)
14:57 <ade_lee_> it looks like there is some failure to do iscsi things - specifically with chap algorithms
14:58 <ade_lee_> https://zuul.opendev.org/t/openstack/build/44e7d0b4a565456893f1c096f6b9da61/logs
14:58 <ade_lee_> fungi, clarkb ^^
14:59 <ade_lee_> we should be setting the chap algorithms correctly, but maybe that doesn't work in the same way for ubuntu
15:02 <dpawlik> dansmith: all right
15:03 <opendevreview> Merged openstack/project-config master: Revert "Revert "Temporarily stop booting nodes in inmotion iad3""  https://review.opendev.org/c/openstack/project-config/+/876365
15:04 <dpawlik> dansmith: I applied the change on the logscraper. You should now get more details for TIMED_OUT jobs
15:05 <dpawlik> dansmith: to see how opensearch keeps the field in the index, you can just click on "json" when you click on some document (Expanded document)
15:06 <dansmith> dpawlik: right. but are there any other list fields?
15:07 <dpawlik> you mean whether there are already some fields that use a list?
15:07 <dansmith> yeah, I was just curious how that's going to look in the query interface
15:08 <dansmith> obviously being able to see it in json is something.. I'm trying to get my search to refresh
15:08 <dpawlik> https://paste.openstack.org/show/bcm3M0hsQgJNLyrKGCuq/
15:08 <dpawlik> so there are a few fields that contain a list or dict
15:08 <dansmith> ah okay, tags for example
15:08 <dansmith> cool, yeah, that looks good then
15:15 <dpawlik> dansmith: try this one: https://opensearch.logs.openstack.org/_dashboards/app/visualize#/edit/21f18650-b9d6-11ed-a277-139f56dc2b08?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30m,to:now))&_a=(filters:!(),linked:!f,query:(language:kuery,query:''),uiState:(),vis:(aggs:!((enabled:!t,id:'1',params:(field:build_uuid.keyword),schema:metric,type:cardinality),(enabled:!t,id:'3',params:(field:hosts_region.keyword,missingBucket:!f,missingBucketLabel:Missing,order:desc,orderBy:'1',otherBucket:!f,otherBucketLabel:Other,size:5),schema:segment,type:terms),(enabled:!t,id:'4',params:(filters:!((input:(language:kuery,query:'build_status:%22TIMED_OUT%22'),label:''))),schema:split,type:filters)),params:(addLegend:!t,addTimeMarker:!f,addTooltip:!t,categoryAxes:!((id:CategoryAxis-1,labels:(filter:!t,show:!t,truncate:100),position:bottom,scale:(type:linear),show:!t,style:(),title:(),type:category)),grid:(categoryLines:!f),labels:(show:!f),legendPosition:right,row:!f,seriesParams:!((data:(id:'1',label:'Unique%20count%20of%20build_uuid.keyword'),drawLinesBetweenPoints:!t,lineWidth:2,mode:stacked,show:!t,showCircles:!t,type:histogram,valueAxis:ValueAxis-1)),thresholdLine:(color:%23E7664C,show:!f,style:full,value:10,width:1),times:!(),type:histogram,valueAxes:!((id:ValueAxis-1,labels:(filter:!f,rotate:0,show:!t,truncate:100),name:LeftAxis-1,position:left,scale:(mode:normal,type:linear),show:!t,style:(),title:(text:'Unique%20count%20of%20build_uuid.keyword'),type:value))),title:TIME_OUT-builds-region,type:histogram))
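[Editor's note] The dashboard URL above encodes an OpenSearch visualization: a unique count of build_uuid.keyword per hosts_region.keyword bucket, filtered to build_status:TIMED_OUT. The same aggregation can be expressed directly as a _search request body; below is a minimal sketch. The aggregation names ("regions", "unique_builds") and the idea of posting this to the cluster's _search endpoint are illustrative assumptions, not something stated in the log.

```python
# Sketch: the "TIME_OUT-builds-region" visualization as a plain OpenSearch
# query body (standard terms + cardinality aggregations). Hypothetical
# aggregation names; the target index pattern is not stated in the log.
import json


def timed_out_by_region_query(size=5):
    """Build a _search body: unique build_uuid count per hosts_region,
    restricted to builds whose status is TIMED_OUT."""
    return {
        "size": 0,  # aggregations only, no raw hits
        "query": {"match_phrase": {"build_status": "TIMED_OUT"}},
        "aggs": {
            "regions": {
                # top N regions by document count
                "terms": {"field": "hosts_region.keyword", "size": size},
                "aggs": {
                    # approximate distinct-build count per region
                    "unique_builds": {
                        "cardinality": {"field": "build_uuid.keyword"}
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(timed_out_by_region_query(), indent=2))
```

This body would be POSTed to the cluster's _search endpoint for the relevant index; the exact index name isn't given in the discussion.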
15:15 <opendevreview> daniel.pawlik proposed openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info  https://review.opendev.org/c/openstack/ci-log-processing/+/876260
15:16 <dansmith> dpawlik: it's not merged yet so we're not actually using the new rules yet, right?
15:19 <dpawlik> dansmith: I still haven't had enough time to automate service deployment after a change is merged
15:19 <dpawlik> one day when I have a few minutes, I will finally finish the release process and release logscraper v1.0.0
15:20 <dpawlik> and automate service deployment, etc.
15:20 <dpawlik> So far, it is a manual job...
15:20 <dansmith> okay, I'm not sure what you're saying.. so you're just manually applying the changes and this is already applied?
15:21 <dpawlik> so the container is created, I just changed the service container image. That's it.
15:22 <dpawlik> merging the change
15:22 <dpawlik> I would be really happy if there were more people to handle that
15:29 <dansmith> so, currently, every TIMED_OUT job I see is coming from rax-IAD
15:29 <dansmith> presumably this has to soak a bit to get a better view,
15:30 <dansmith> but I also wonder if the IAD hardware is all much older than others and our timeout problems come from longer jobs landing on those nodes
15:31 <dansmith> fungi: any idea what the distribution is of nodes in regions?
15:31 <JayF> dansmith: in #opendev, they just pushed some kind of change to remove RAX-iad from rotation, a mirror died or something like that? Not sure if you're tuned into that or not, but it might be related
15:31 <dansmith> I wish this discussion didn't have to be so fragmented
15:32 <dansmith> but mirror issues are probably not related to job timeouts
15:32 <dansmith> also looks like maybe that's not rax-IAD
15:33 <dpawlik> dansmith: did you check the visualization?
15:33 <dpawlik> https://paste.openstack.org/show/bRXYvgTvs3SPrDlZYk1G/
15:34 <dpawlik> as I see it, it is too early to say whether it's rax or ovh
15:34 <fungi> dansmith: it's not assumed that all the hardware in a given provider region is even the same, we document that here: https://docs.opendev.org/opendev/infra-manual/latest/testing.html#known-differences-to-watch-out-for "CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region."
15:35 <dansmith> fungi: oh I know
15:35 <fungi> JayF: what we disabled was inmotion-iad3, not rax-iad; totally different cloud
15:35 <JayF> fungi: I see that now, I think I might have merged two things in my head, thank you for the correction
15:35 <dansmith> fungi: I've just been trying to determine why we're suddenly hitting a ton of job timeouts, and if we slowly grew past the amount of things we can test on our slowest set of nodes, all the timeouts landing on one set of hardware would be an indicator
15:36 <dansmith> fungi: I'm just looking for clues
15:36 <fungi> yeah, I don't actually know how the hardware in different providers compares
15:36 <dansmith> fungi: yeah, I don't really expect we would be able to know that
15:37 <fungi> I don't think anyone's tried to do a survey, but because we can't even expect all hardware in a particular provider to be consistent, it would be a nontrivial exercise
15:37 <dansmith> fungi: fwiw, what I was asking above was if you knew something like "80% of our quota is in rax-IAD"
15:37 <fungi> oh, that. I think we have more effective quota in ovh than rax, but we have a dashboard with numbers, just a sec
15:37 <dansmith> okay
15:38 <dansmith> I think what dpawlik is trying to show is that all the timeouts we have recorded (so far) are spread between one ovh and one rax region
15:40 <dpawlik> after a few days the visualization "would say something more"
15:40 <dansmith> yeah
15:41 <fungi> these are the rackspace utilization charts: https://grafana.opendev.org/d/a8667d6647/nodepool-rackspace?orgId=1
15:41 <fungi> and these are ovh: https://grafana.opendev.org/d/2b4dba9e25/nodepool-ovh?orgId=1
15:42 <fungi> so yes, it looks like we have more quota in rackspace than ovh after all
15:42 <dansmith> ack
15:43 <dansmith> fungi: going back to your knowing things about the hardware in a region,
15:44 <dansmith> if we take a single fat job and can show that, if it times out, it almost always does so in a given region, then we can probably make some assumption about the speed of those nodes (either raw, or "throughput" with noisy neighbors)
15:45 <fungi> I suppose, with the caveat that "speed" is a multi-faceted thing. you can at least extrapolate it to "slower at running the same kinds of jobs as the ones which time out"
15:46 <fungi> lots of job timeouts are second-order symptoms of something like memory exhaustion, so you could actually be measuring "how well does this provider's disk handle swap thrash"
15:46 <dansmith> assuming a composite job mix, of course
15:46 <dansmith> but yeah
15:50 <fungi> unfortunately the different resources and underlying hardware aren't usually adjustable in isolation from one another, so while we do have some larger-memory flavors we could try to run the same jobs on for comparison, they're also going to be in a different provider on different hardware (where the preferred memory-to-cpu ratio is chosen by the provider to more efficiently pack their servers)
15:51 <fungi> so even if it ran faster on nodes with more memory, we'd be hard pressed to say for sure that the additional memory is why it ran faster
15:51 <opendevreview> Merged openstack/ci-log-processing master: Add job with 'timed_out' status to fetch; add hosts_region info  https://review.opendev.org/c/openstack/ci-log-processing/+/876260
15:51 <dansmith> in isolation, sure,
15:51 <ade_lee_> fungi, clarkb ?
15:51 <fungi> ade_lee_: yep, pulling up the build info so I can set an autohold for it
15:51 <ade_lee_> fungi, thanks
15:52 <dansmith> fungi: but if you run tens of thousands of jobs all well-distributed across the nodes, and you see a strong correlation of timeouts on one provider for one job, I think you can conclude that those nodes are "slower" for that workload
15:52 <dansmith> if you don't have a strong correlation then you can't, of course
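[Editor's note] The correlation check dansmith describes, whether timeouts for a given job concentrate in one provider beyond that provider's share of all runs, can be sketched as a simple ratio comparison. The function name, the (region, status) record shape, and the sample numbers below are all made up for illustration; real data would come from the OpenSearch index discussed above.

```python
# Sketch: per-region timeout concentration. A ratio well above 1.0 means a
# region accounts for a larger share of TIMED_OUT builds than its share of
# all builds would predict, i.e. it looks "slower" for this workload.
from collections import Counter


def timeout_concentration(runs):
    """runs: iterable of (region, build_status) pairs.
    Returns {region: timeout_share / run_share}."""
    total = Counter(region for region, _ in runs)
    timeouts = Counter(r for r, s in runs if s == "TIMED_OUT")
    all_runs = sum(total.values())
    all_touts = sum(timeouts.values()) or 1  # avoid division by zero
    return {
        region: (timeouts[region] / all_touts) / (total[region] / all_runs)
        for region in total
    }


if __name__ == "__main__":
    # Hypothetical sample: rax-iad has 1/3 of the runs but 8/10 of the timeouts.
    runs = (
        [("rax-iad", "TIMED_OUT")] * 8 + [("rax-iad", "SUCCESS")] * 12
        + [("ovh-bhs1", "TIMED_OUT")] * 2 + [("ovh-bhs1", "SUCCESS")] * 38
    )
    print(timeout_concentration(runs))  # rax-iad ≈ 2.4, ovh-bhs1 ≈ 0.3
```

As fungi notes next, a high ratio identifies where the slowness shows up, not what causes it.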
15:54 <fungi> zuul-client autohold --tenant=openstack --project=opendev.org/openstack/tempest --job=tempest-all-fips-focal --ref='refs/changes/97/873697/.*' --reason='ade_lee looking into fips iscsi chap errors'
15:54 <fungi> that's set now
15:54 <fungi> dansmith: yes, of course
15:55 <fungi> though not necessarily what to change in order to address the slowness
15:55 <dansmith> no, not unless you know stuff about what makes that job special (which we do in some cases)
15:56 <fungi> (for example, in some cases we're the reason the nodes seem "slow", thanks to being in an overcommit configuration that isn't tuned for our worloads)
15:56 <dansmith> workloads or warlords? :)
15:56 <fungi> s/worloads/workloads/
15:56 <fungi> both
15:57 <fungi> warring openstates
15:57 <dansmith> my theory is more that we continue to grow our list of tests (and probably our server software is also slower) and we're getting closer to the limit of what we can test in two hours
15:57 <dansmith> so I'm just looking for clues that suggest that's the case, and if not, maybe suggest what else might be the problem
15:57 <fungi> also not new. we had the same sort of discussion when devstack jobs started taking longer than 45 minutes ;)
15:58 <dansmith> and I don't know what else to do other than look at the data along different axes until I see something that correlates
15:58 <fungi> agreed
15:58 <fungi> or ask chatgpt. it can probably give you an explanation (not a correct one, but it will totally sound plausible)
15:59 <dansmith> I thought chatgpt has feelings now and we're not supposed to ask it hard questions that might cause it to need to seek therapy?
15:59 <dansmith> or is that bing?
15:59 <dansmith> maybe chatgpt could be the therapist for bing...
16:02 <fungi> oh, right, the bing-chat AI was the one they had to "lobotomize" after it started threatening users
16:18 <dansmith> yeah, I totally love that it took about two weeks for the "good AI" to get too creepy for human comfort
16:19 <ade_lee_> fungi, thanks -- I'll kick off a recheck now
17:21 *** jpena is now known as jpena|off
17:56 <opendevreview> Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos  https://review.opendev.org/c/openstack/project-config/+/876054
17:59 <ade_lee_> fungi, looks like the node already failed
18:00 <ade_lee_> fungi, https://zuul.opendev.org/t/openstack/build/aa9d89ea073a40a6b84895a019707d90
18:00 <fungi> ade_lee_: what ssh key do you want authorized for it?
18:00 <opendevreview> Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos  https://review.opendev.org/c/openstack/project-config/+/876054
18:01 <ade_lee_> fungi, https://paste.openstack.org/show/bADwloaOalEwhkusIxjF/
18:02 <fungi> ade_lee_: ssh root@173.231.255.77
18:03 <ade_lee_> fungi, thanks - in
18:03 <fungi> cool, let us know when you're done and we'll clean up the hold
18:03 <ade_lee_> fungi, will do
18:09 <opendevreview> Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant  https://review.opendev.org/c/openstack/project-config/+/876414
18:11 <opendevreview> Radosław Piliszek proposed openstack/project-config master: Add the main NebulOuS repos  https://review.opendev.org/c/openstack/project-config/+/876054
18:11 <opendevreview> Radosław Piliszek proposed openstack/project-config master: Add the NebulOuS tenant  https://review.opendev.org/c/openstack/project-config/+/876414

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!