Monday, 2021-04-12

*** tosky has quit IRC00:00
ianwwheels aren't releasing due to "Could not lock the VLDB entry for the volume 536871142."00:54
ianwi feel like i already fixed that at some point ...00:54
ianwhttp://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-03-29.log.html#t2021-03-29T04:08:3700:57
ianwindeed00:57
ianwVLDB entries for all servers which are locked:00:59
ianwTotal entries: 000:59
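For context, the output above is what a locked-entry check prints when nothing is locked; a rough sketch of that check, and of clearing a stuck lock, assuming it is run on an AFS db server with -localauth rather than a personal token:

  vos listvldb -locked -localauth      # list any VLDB entries that are still locked
  vos unlock -id 536871142 -localauth  # would clear the lock on the volume named in the earlier error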
*** brinzhang has joined #opendev01:02
*** iurygregory has quit IRC01:05
ianwok, this is a red herring01:21
ianwthe real problem is01:21
ianwhttps://zuul.openstack.org/build/6dd9f20e9b3a41d7a6e6dec8347b2a3d01:21
ianwwhich is openafs failing to install on centos7 which means it can't publish01:24
ianwfor a long time i've been meaning to reorganise these jobs to use the executor's afs client to copy the data ... but anyway01:25
ianwi feel like the last time this happened, it was because we hadn't updated centos nodes and the kernel had changed, and we couldn't get the headers01:26
ianwok, 785675,1 is stuck waiting for arm64 nodes01:27
ianwi'm not sure why it hasn't timed out01:27
ianwnl03 isn't responding for me and could explain this01:30
ianw... ignore that.  helps if you try nl01.openDEV.org (not stack)01:32
ianwkevinz: hrm, i think i see in scrollback you'd identified some bogus nodes right?01:35
kevinzianw: morning! There are 3 instances that were deleted but still have metadata remaining..01:36
kevinzI'm working on removing it from DB01:36
ianwok, cool01:36
ianwit almost looks to me like the launcher has somehow forgotten about the nodes being requested by zuul01:37
ianwit doesn't appear to be trying to satisfy any requests01:37
ianw2021-04-12 01:37:34,918 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: []01:37
ianwbut system-config-zuul-role-integration-bionic-arm64 has been queued for 54 hours01:38
kevinzianw: You mean that linaro-us isn't responding to any requests from Zuul?01:39
kevinzI saw that just 6 instances are currently running on the cluster01:39
ianwkevinz: no i don't think that's it.  it seems like nodepool has somehow lost a bunch of requests; it is not trying to satisfy them01:40
ianwi think linaro is responding ok01:40
kevinzianw: well, OK, what can I do to help? The first thing I think is to remove the existing "disappeared" vm instances from our cluster01:41
ianwkevinz: yeah, i don't know, this is a weird one01:44
kevinzianw: OK,  I will fix this first to see if things will be better01:45
ianwi feel like the node requests are not in zookeeper, so nodepool will never try to satisfy them.  but zuul clearly thinks they are01:46
ianwsomething happened about 58 hr 21 min ago01:46
ianw0a526f11-b784-416b-bd89-c5de47a9ba4c | debian-buster-arm64-linaro-us-0023946093 | BUILD  |                                                                        | debian-buster-arm64-1618117653 | os.large |01:47
ianwkevinz: ^ can you see anything interesting relating to that01:48
ianw2021-04-11 08:22:04,016 INFO nodepool.NodeLauncher: [e: 788535e8d4bc49919afbc414a1fcaa45] [node_request: 300-0013647916] [node: 0023946093] Node is ready01:49
ianwit seems to say the node is ready, but it's still showing "BUILDING"?01:49
ianwbut then "2021-04-11 09:45:19,783 INFO nodepool.NodeDeleter: Deleting ZK node id=0023946093, state=deleting, external_id=f5ee1b0f-107d-4965-a7b5-2375be42a30"01:50
ianwthis is all from yesterday01:50
ianwok, it's not correct that the requests aren't in zookeeper01:53
ianwhere is a log of a request in zookeeper and the NL related logs01:55
ianwhttp://paste.openstack.org/show/804376/01:55
ianwthis failed at01:55
ianwlauncher-debug.log.2021-04-09_15:2021-04-09 15:28:34,633 ERROR nodepool.NodeLauncher: [node_request: 300-0013634813] [node: 0023929201] Launch failed for node centos-8-stream-arm64-linaro-us-002301:56
ianwafter 3 attempts01:56
ianwright, nodepool request-list shows this too02:01
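For anyone following along, the request and node state being discussed comes from the nodepool CLI on the launcher host (nl03); the container name below is a placeholder, not the real one:

  docker exec -it <nodepool-launcher-container> nodepool request-list   # outstanding node requests
  docker exec -it <nodepool-launcher-container> nodepool list --detail  # node states (building, ready, deleting, ...)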
ianwi'm restarting nl03 container, i'm not sure what else to do02:04
ianwkevinz: i think there is a problem02:06
ianwi'm seeing a very helpful (not) message of02:06
ianwopenstack.exceptions.SDKException: Error in creating the server (no further information available)02:06
ianw2021-04-12 02:06:19,346 ERROR nodepool.NodeLauncher: [node_request: 300-0013634833] [node: 0023948459] Detailed node error: No valid host was found. There are not enough hosts available.02:07
ianwkevinz: ^ it might actually be that02:07
ianwkevinz: yeah, things are just going into ERROR state02:08
ianwyou can probably see that, nodepool is going crazy trying to create the nodes again :)02:09
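The cloud-side view of those failures can usually be pulled with the openstack client; the instance uuid below is a placeholder:

  openstack server list --status ERROR -c ID -c Name -c Status
  openstack server show <instance-uuid> -c status -c fault   # the fault field, when present, carries the scheduler error, e.g. "No valid host was found"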
kevinzianw: Yes I saw, quite a lot of instances are coming.  Several instances are building and others have failed due to no valid host02:11
kevinzianw: I think we can stop some UT tests since they put quite a load on the nodepool02:11
kevinzbtw, the 3 "disappeared" instances  have been removed already02:12
ianwyeah i would say this is trying to build too many nodes02:13
ianwwe've got max-servers: 40 ; i guess this is within limit02:15
ianwkevinz: is this thundering herd of starting instances killing the cloud?02:15
kevinzianw: yes, the limit is 40.  I will try to find one more node to join the cluster to relieve the overload02:17
kevinzianw: Yes the cloud is receiving a lot of creation requests, so it is slow now :-)02:18
ianwok, i can turn that down if it's gotten too high02:18
ianwi'm having a few authentication issues, but hopefully we'll have 15 nodes from OSU OSL coming online soon02:19
kevinzianw: That's fine actually, I see some instance creations have finished02:20
kevinzcool, you mean 15 nodes as in 15 vms, or bare metal machines?02:20
kevinzianw: it looks like the OSU OSL machines are newer and maybe have better performance :-)02:21
ianw15 vms :)02:26
kevinzOK,  nice02:27
ianwi'm going to grab some lunch and hopefully things will start moving now02:27
kevinzianw: OK, np02:43
*** cloudnull8 has quit IRC03:02
*** cloudnull8 has joined #opendev03:02
ianwhrm, something is still up03:08
ianwwe've got like 6 active nodes and nothing trying to build, but the queue is huge03:08
ianwkevinz: it still seems to go straight into error node03:11
ianws/node/mode03:11
ianw151a8028-569b-4178-b09f-8c8411cf6aa5 for example, can you see what happened with that?03:12
kevinzianw: I'm adding one new compute node to this cluster, and it is under operation now.  This instance happened to be scheduled to this new node03:14
ianwkevinz: oh, ok np.  lmn when things are stable03:14
kevinzianw: I saw https://zuul.openstack.org/stream/52a25bc0333c40febdebf9319c994321?logfile=console.log is running03:27
ianwkevinz: if you check https://zuul.openstack.org/status there's lots of things waiting for nodes03:28
kevinzianw: yes I see,03:28
ianwi've turned the max servers down to 10 for a little while you're working on it03:29
ianwas you say, it does seem some nodes are building now03:29
kevinzianw: how long does zuul wait for an instance creation?03:29
ianwthough that said, a bunch are in error03:29
ianwe.g. 98b36e90-52ec-47fa-a413-5b246e1705af just errored03:30
ianwkevinz: several days :)  that's the problem ...03:30
kevinzOK,  will check03:30
ianwkevinz: here's a big list http://paste.openstack.org/show/804377/03:30
kevinzI mean is there a timeout for waiting on instance launch, and a retry if it times out03:30
ianwyeah, i would have expected all these to fail with timeouts, but they haven't.  i think that's perhaps a separate, but related issue03:31
kevinzianw: OK, ack03:31
ianwsomething about the way things are failing isn't making zuul/nodepool give up03:31
kevinzianw:  98b36e90-52ec-47fa-a413-5b246e1705af : No valid host was found03:33
kevinzI think there are maybe some scheduler issues, always making the cloud report no valid host...03:34
kevinzWill finish adding the new host first anyway03:34
ianwok03:38
ianwi've found at least one issue, that leaked nodes are put in a DELETING state but with no other details, and this confuses the quota calculator03:45
*** brinzhang_ has joined #opendev04:00
*** brinzhang has quit IRC04:03
*** mkowalski_ has joined #opendev04:12
*** tristanC_ has joined #opendev04:15
*** jrosser has quit IRC04:20
*** tristanC has quit IRC04:20
*** mkowalski has quit IRC04:20
*** Alex_Gaynor has left #opendev04:21
kevinzianw: adding one more 44-core machine to the cluster04:33
kevinzadding finished and I've tested the instance creation04:34
kevinzianw: yes, I have often seen the DELETING state get blocked..04:34
*** jrosser has joined #opendev04:34
ianwkevinz: ok, cool, quota back up to 40 nodes in linaro.  i think the xxxlarge instances though will keep the number of running nodes more limited (hitting memory quota)04:35
kevinzianw: Yes.  That is another problem04:36
*** ysandeep|away is now known as ysandeep04:53
*** ykarel has joined #opendev04:56
*** marios has joined #opendev05:08
*** whoami-rajat_ has joined #opendev05:36
*** sboyron has joined #opendev05:52
*** ralonsoh has joined #opendev06:02
*** slaweq has joined #opendev06:09
*** eolivare has joined #opendev06:23
*** dmsimard has quit IRC06:48
openstackgerritMerged openstack/project-config master: Bump node version for publish-openstack-stackviz-element  https://review.opendev.org/c/openstack/project-config/+/78576806:49
*** dmsimard has joined #opendev06:51
*** amoralej|off is now known as amoralej06:52
openstackgerritMerged openstack/project-config master: nodepool elements: create suse boot rc directory  https://review.opendev.org/c/openstack/project-config/+/78100207:02
*** fressi has joined #opendev07:05
*** eolivare has quit IRC07:07
*** andrewbonney has joined #opendev07:08
*** eolivare has joined #opendev07:09
ianwfungi: ^ i think that wheel building is held up because openafs fails to install on centos7.  i think that's because our images have an out of date kernel, and the headers are not on the mirror any more.  and i think that's because it's stuck behind suse.  and that's what ^ fixes :)07:14
ianwjust a typical day in dependency land!07:14
*** dmsimard has quit IRC07:32
*** dmsimard has joined #opendev07:33
*** tosky has joined #opendev07:35
*** jpena|off is now known as jpena07:54
*** rpittau|afk is now known as rpittau08:04
*** ysandeep is now known as ysandeep|lunch08:11
*** gnuoy` has joined #opendev08:22
*** gnuoy has quit IRC08:26
*** brinzhang_ is now known as brinzhang08:55
*** dtantsur|afk is now known as dtantsur08:56
*** ysandeep|lunch is now known as ysandeep08:56
hrwmorning09:45
hrwI see that check-arm64 queue cleaned up09:45
*** whoami-rajat_ is now known as whoami-rajat10:33
*** brinzhang_ has joined #opendev11:05
*** brinzhang has quit IRC11:08
*** iurygregory has joined #opendev11:11
*** artom has joined #opendev11:19
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap  https://review.opendev.org/c/opendev/system-config/+/77144511:31
openstackgerritGuillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag  https://review.opendev.org/c/opendev/system-config/+/77179811:31
*** jpena is now known as jpena|lunch11:32
*** dhellmann_ has joined #opendev11:44
*** dhellmann has quit IRC11:45
*** dhellmann_ is now known as dhellmann11:45
fungiianw: it also sounds like you might have run into the same stuck node requests i've been trying to track down the cause of for a few weeks now11:52
fungiianw: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2021-04-03.log.html#t2021-04-03T16:39:4711:53
funginot sure if that sounds like some of what you saw too11:53
*** jpena|lunch is now known as jpena12:30
*** cloudnull8 is now known as cloudnull12:48
*** stephenfin has quit IRC12:49
*** amoralej is now known as amoralej|lunch12:52
*** stephenfin has joined #opendev13:08
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap  https://review.opendev.org/c/opendev/system-config/+/77144513:15
openstackgerritGuillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag  https://review.opendev.org/c/opendev/system-config/+/77179813:15
openstackgerritMerged openstack/diskimage-builder master: Add Debian Bullseye Zuul job  https://review.opendev.org/c/openstack/diskimage-builder/+/78379013:16
zigoHi there!13:20
zigoI was wondering, would there be a way to get, in gerrit, a direct link to a plain patch file?13:20
zigoI mean, no zip, tar.xz or base64...13:20
zigoIt'd be really helpful for me.13:20
hrwzigo: press 'DOWNLOAD' link13:22
hrwah. you were there already13:22
hrwzigo: curl patchlink|base64 --decode?13:22
zigohrw: Yeah, it has diff.base64, diff.zip, tgz, tar, tbz2, txz ...13:22
zigohrw: Yeah, I know, I can do that... :)13:23
zigoI'd prefer if I didn't have to.13:23
hrwzigo: I went that way in a CI job13:24
hrwas it was the easiest way to fetch patches without having a gerrit account13:24
zigohrw: It's not about CI or automation, it's that I very often pick up patches by hand, and that's always one more step to do ...13:25
*** amoralej|lunch is now known as amoralej13:25
hrwzigo: make an alias?13:26
*** fressi has left #opendev13:49
fungihttps://review.opendev.org/Documentation/rest-api-changes.html#get-patch explains the rest api call that download link represents13:57
fungii expect the reason for base64 encoding is that the diff could be of a binary file, and so trying to display that in a web browser would get weird13:58
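Putting those pieces together, a one-liner along these lines yields a plain patch file from the REST API; the change number is just an example from this log, and "current" selects the latest patchset:

  curl -s https://review.opendev.org/changes/785723/revisions/current/patch | base64 --decode > 785723.patch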
*** sboyron has quit IRC14:06
*** sboyron has joined #opendev14:07
*** ykarel has quit IRC14:21
*** ykarel has joined #opendev14:24
clarkbanother approach could be to use git fetch14:44
clarkbgit fetch && git show FETCH_HEAD > foo.patch14:44
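Concretely, run inside an existing clone of the project, that might look like the following; the patchset number (1) is illustrative, and per fungi's note below the same refs can also be fetched from the opendev.org gitea servers:

  git fetch https://review.opendev.org/opendev/git-review refs/changes/23/785723/1
  git show FETCH_HEAD > foo.patch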
*** snapdeal has joined #opendev14:48
fungiyep, you can even fetch those refs from the opendev.org gitea server farm14:49
fungiunfortunately gitea doesn't have a way to call named refs in its webui that i've been able to figure out (something i sorely miss from cgit and gitweb)14:50
clarkbfungi re https://review.opendev.org/c/opendev/git-review/+/785723 I'll get that installed after breakfast then try and remember to use it once or twice to push some actual code14:50
fungiyou could use the gerrit-provided gitweb to do it, i think, but you'd need to be authenticated first because of the way it's hooked up14:50
fungiclarkb: thanks!14:50
*** marios is now known as marios|call14:53
*** dpawlik has quit IRC14:58
*** marios|call is now known as marios15:01
*** ykarel is now known as ykarel|away15:19
clarkbI've received notice that git.airshipit.org and survey.openstack.org's ssl certs have 30 days of validity remaining. these are not LE certs. Do we want to bother renewing them?15:19
fungifor survey.openstack.org i expect we can just let it expire15:20
fungii was going to propose we take that out of service anyway15:21
fungifor git.airshipit.org it's probably a one-liner to add it to the other git redirect domains we already generate certs for15:21
fungijust need a corresponding cname for the acme stuff15:21
fungiclarkb: good news, the git.airshipit.org cert we're deploying is already generated with lets encrypt, so that can be ignored15:23
fungiSubject: CN = git.airshipit.org15:23
fungiIssuer: C = US, O = Let's Encrypt, CN = R315:23
clarkboh even better15:23
fungiNot After : May 18 05:31:57 2021 GMT15:23
clarkband ya I agree re survey15:23
clarkbI bet that got updated when the stuff serving files moved servers a while back15:24
fungiyup, was pretty sure we had done them all, which is why i double-checked15:25
clarkbfungi: do you have a sense for where the openstack release process is re final RCs? I'm starting to try and page the zk cluster rolling replacements back in and wonder if we need to be careful for their release still15:27
fungiclarkb: wednesday around 10:00 utc i think is when the final release versions will all be tagged15:30
clarkbcool in that case probably waiting for at least wednesday is fine15:30
clarkbI can find other items to occupy my time between now and then15:30
clarkbdo the gerrit 3.2.8 upgrade later this week too likely15:30
*** ykarel|away has quit IRC15:33
*** mlavalle has joined #opendev15:42
openstackgerritClark Boylan proposed opendev/system-config master: Add note about python -u to external id cleanup script  https://review.opendev.org/c/opendev/system-config/+/78590715:45
clarkbfungi: ^ that was pushed with `git review -v --no-thin` the -v helped me verify the command (and --no-thin was present as expected) and --no-thin appears to have functioned fine15:46
clarkbgiven that seems to work and the test didn't cause any problems I suspect we can land that15:47
fungivery cool! yeah, looking15:47
*** roman_g has joined #opendev15:47
*** amoralej is now known as amoralej|off15:49
clarkbLooking at the gerrit user account conflicts I see there are a small number of CI accounts that we can likely pretty safely untangle15:57
clarkbbasically remove the human identifying conflict from the CI accounts15:57
clarkband let the human account be the owner of that external id without conflict15:58
clarkbin some cases it is two different CI accounts conflicting with each other. In those cases I think we simply disable the one that least recently commented and clean it up15:58
clarkbbut I do need to review things because I think in some of these cases we don't actually want to retire any accounts as both are being used. We just want the CI system to have a CI system email addr and a human to have a human email addr without conflict16:00
fungialso i expect there are cases where multiple ci systems were created with the same e-mail address16:04
clarkbI'm also seeing a non zero set of accounts with ssh keys set but no username16:06
clarkbI think the only way that would really make sense is if those accounts had been merged previously?16:06
fungiyes, i think so. we often didn't remove old ssh keys from accounts we merged into other accounts16:08
clarkbya looking at more recently used timestamps and other attributes it seems that this is likely the case16:09
*** dtroyer has joined #opendev16:24
clarkbThere is one CI account that conflicts with human accounts for four other people. I suspect in a case like that we don't retire anything, but simply remove the conflicts from the CI account, but I need to look at the external ids for that account more closely16:31
clarkb3 of those conflicts are simple mailtos and can be cleaned up. The fourth conflict is between emails on openids between what may be a human account and the CI account16:35
clarkbthe human account hasn't been used since 2015 though, but the ci account has been used this year. I guess in that case we can "sacrifice" the human account?16:35
fungiyeah, i would16:40
fungiwe can always help them get a new account set up later if they come to us16:40
clarkbIt is interesting to see how different some of these accounts are from each other in terms of how they conflict16:47
clarkbI'm going through and trying to understand each one a little better16:47
*** hamalq has joined #opendev16:47
*** hamalq has quit IRC16:47
*** hamalq has joined #opendev16:48
fungiit seems like it provides a window into the history of our infrastructure and how users have interacted with it in the past16:51
*** dtroyer has quit IRC16:53
*** rpittau is now known as rpittau|afk16:59
*** ysandeep is now known as ysandeep|holiday17:01
*** marios is now known as marios|out17:03
*** dtantsur is now known as dtantsur|afk17:06
*** marios|out has quit IRC17:06
corvusinfra-root: the squiggly lines on cacti and grafana look good to me.  in particular, the memory line on cacti is not at all squiggly and is in fact horizontal.  i think we're probably good to cut a release of zuul (which will be a good checkpoint release we can roll back to if needed for the next bit of v5 work).  concurrences?  dissents?17:14
fungicorvus: i agree, the memory leak looks very much solved now. this would make for a good version to release17:15
*** sboyron has quit IRC17:21
*** sboyron has joined #opendev17:21
clarkbcorvus: sounds good to me17:23
clarkbfungi: can you look at review:~clarkb/gerrit_user_cleanups/next-cleanups-20210412.txt ? That is what I worked through based on the conversation above. If all looks well to you I'll be trying to run the retire step against the ones listed as retireable and then in a few days can do the external id cleanups17:24
*** jpena is now known as jpena|off17:48
*** ralonsoh has quit IRC17:52
*** eolivare has quit IRC18:00
openstackgerritClark Boylan proposed opendev/git-review master: Add option for disabling thin pushes  https://review.opendev.org/c/opendev/git-review/+/78572318:00
clarkbfungi: ^ manpage updated18:01
fungiclarkb: were you adding a release note too (see earlier review comment)?18:03
*** roman_g has quit IRC18:08
*** roman_g has joined #opendev18:08
*** roman_g has quit IRC18:09
*** roman_g has joined #opendev18:09
*** roman_g has quit IRC18:09
*** roman_g has joined #opendev18:10
*** roman_g has quit IRC18:10
*** roman_g has joined #opendev18:11
*** roman_g has joined #opendev18:12
*** roman_g has quit IRC18:12
*** roman_g has joined #opendev18:12
*** roman_g has quit IRC18:13
*** roman_g has joined #opendev18:13
*** roman_g has joined #opendev18:14
*** roman_g has quit IRC18:14
openstackgerritMerged opendev/system-config master: Add note about python -u to external id cleanup script  https://review.opendev.org/c/opendev/system-config/+/78590718:15
fungiclarkb: on the "skipping" comments in your new cleanup list, those apply to the line immediately following them?18:16
fungii guess you're not feeding the list directly to a script so it's fine not to comment them out18:16
fungialso the midokura account being dormant is a good call, pretty sure they're no longer involved, the midonet neutron driver has been retired due to lack of maintenance18:19
fungithe plan in the latter half of next-cleanups-20210412.txt looks good, spot checks of the various account classes reflect the states i would expect18:21
clarkbfungi: yup to the line below18:29
clarkbfungi: oh I missed the release note comment. I'll address that18:29
openstackgerritClark Boylan proposed opendev/git-review master: Add option for disabling thin pushes  https://review.opendev.org/c/opendev/git-review/+/78572318:38
clarkbfungi: how's that?18:38
clarkbfungi: and ya that's not the direct input to retire accounts. It takes a bit of massaging (I have to take the account ids and prefix them with refs/users/XY/ for account id ABXY)18:40
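As an illustration of that sharding, an account id of 1234 (a made-up example) would live under refs/users/34/1234 in the All-Users repo, so a spot check might look like:

  git -C All-Users.git show --stat refs/users/34/1234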
*** sboyron has quit IRC18:41
*** andrewbonney has quit IRC18:47
clarkbalright I'm going to start retiring the 56 accounts in that list18:54
fungisounds good, thanks!18:54
clarkbthat is done now and logs are in the normal location on review19:25
clarkbI'm going to look at what is necessary to do the manual surgery that I proposed for the subset of CI accounts that we cannot just turn off now19:25
*** mailingsam has joined #opendev19:41
*** whoami-rajat has quit IRC19:55
clarkbI have sent email to the third party account contacts for the accounts that need manual surgery. I gave them a week to respond, otherwise I would proceed with the corrective action I described in the email20:15
clarkbfungi was cc'd as well if questions come up and I am not around20:15
fungiyep, i'll keep an eye out for replies20:18
clarkbcontinues to feel like decent progress on the account conflict front20:19
fungivery much so20:21
*** slaweq has quit IRC20:38
*** slaweq has joined #opendev20:41
*** snapdeal has quit IRC20:46
*** hamalq has quit IRC21:11
*** hamalq has joined #opendev21:11
*** dmellado has quit IRC21:18
*** dmellado has joined #opendev21:21
*** jralbert has joined #opendev21:25
ianwclarkb/fungi: could you take a look at the ipv6 config @ https://review.opendev.org/c/opendev/system-config/+/785556 and make sure it's what we're thinking21:36
ianwfor review0221:36
ianwi would say linaro is not looing happy again :(21:38
ianwlooking even21:38
ianwkevinz: ^ we don't seem to have any active nodes21:38
fungiianw: not sure if you saw my comments earlier today, but were you possibly seeing stuck node requests?21:43
ianwfungi: yesterday, i would say yes.  there were node requests in zookeeper and nodepool did not appear to be trying to satisfy them21:44
*** sshnaidm|pto has quit IRC21:44
ianwright now, linaro has three requests21:47
ianw2021-04-12 21:44:51,362 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: ['300-0013664209', '300-0013664234', '300-0013664235']21:47
ianwand three servers in BUILD status21:47
ianwand no active servers21:48
fungiianw: 785556 looks fine other than the mac i think? where did you find that one?21:49
ianwohhhhh, i may have copy-pasted that wrong21:49
fungialso do we do something similar for the mirror server, or was that set by hand?21:49
ianwit seems the mirror might have been setup by hand21:50
fungigot it, so if this works w/ netplan and ansible, maybe we can copy the same mechanism at least21:50
ianwok, three linaro nodes have now gone active.  but the queue is large and we should have plenty of capacity for more21:51
fungiserver list reports only the active three as well21:51
*** slaweq has quit IRC21:53
fungiFile "/usr/local/lib/python3.7/site-packages/nodepool/driver/utils.py", line 280, in estimatedNodepoolQuotaUsed21:54
fungiif node.type[0] not in provider_pool.labels:21:54
fungiIndexError: list index out of range21:54
funginot sure what's causing that in the debug log21:54
ianwfungi: yeah, i debugged that yesterday -> https://review.opendev.org/c/zuul/nodepool/+/78582121:54
fungioh, immediately above it... ERROR nodepool.driver.openstack.OpenStackProvider: Couldn't consider invalid node21:54
ianwin short, the openstack driver puts in a dummy entry for leaked nodes that doesn't have a "type"; the quota calculator however sees the dummy node and tries to look up its config to see what flavor it is21:55
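A quick way to see those dummy entries being tripped over is to grep the launcher debug log for the error fungi quoted above; the log path here is an assumption:

  grep -A 4 "Couldn't consider invalid node" /var/log/nodepool/launcher-debug.log | tail -n 25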
fungii wonder if the launcher thinks we're at quota in linaro-us even though we're nowhere near21:56
ianw2021-04-12 21:56:38,670 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: []21:56
ianwnow it sees no active requests, despite the queue being large21:57
ianwthis is more or less what i saw yesterday.  an examination of zk would probably show a lot of node requests.  also restarting the launcher will probably pick them up21:57
fungiluckily restarting the launcher isn't especially disruptive, at worst it discards some building nodes... but yeah i'm not sure why it sometimes doesn't signal that it's satisfied or rejected a node request after considering it21:58
fungiafter *locking* and considering it, that is to say21:59
corvuswhy do you think the queue is large?22:00
corvusgrafana says it's 3, and nodepool request-list says 422:00
ianwcorvus: the check-arm64 queue has a lot of things waiting in the ui22:00
corvusbut not queued?22:01
corvuswhich tenant has a lot of items in check-arm64?22:02
corvusi see 0 in openstack22:02
ianwsigh, so do i after reloading the status page22:03
ianwit looks like auto-refresh stopped after some point when i left it22:03
ianwi agree sorry22:03
corvusmaybe the restart on friday22:03
fungiokay, so probably things are going okay in that provider for now22:03
corvusianw: no worries, a problem easily solved! :)22:04
openstackgerritJames E. Blair proposed opendev/system-config master: Add zuul keystore password  https://review.opendev.org/c/opendev/system-config/+/78598022:05
openstackgerritIan Wienand proposed opendev/system-config master: review02: pin ipv6 configuration  https://review.opendev.org/c/opendev/system-config/+/78555622:06
ianwon a related note, we've sorted out all the credentials for the OSU OSL arm64 resources.  i'll work on incorporating today22:07
ianwkevinz: ^ please ignore my ping too :)  it was -ESHOULDGETCUPOFTEABEFORECHECKINGTHINGS22:09
ianwit looks like the wheel volumes got released, which i am assuming was cleared up by fresh centos images22:18
fungiseemed that way22:34
ianwbtw i liked the suggestion of replacing openstack/openstack-planet with the OPML file in https://review.opendev.org/c/opendev/system-config/+/784191 so will do that22:35
fungiyeah, not a bad idea to at least keep the last state of that list22:36
ianwif no objections i'll make planet.openstack.org point to static and set that up just to redirect to the gitea page, where we can have a readme and the opml file22:36
fungiyep, similar to how openstack handles docs url redirects for retired deliverables22:38
openstackgerritMerged opendev/system-config master: Fix up openafs-client job matching  https://review.opendev.org/c/opendev/system-config/+/77835322:43
clarkbianw: yes that netplan config lgtm, fungi already approved it but I put a +2 as well. When we did the mirror I did a reboot to ensure the old state was cleared out too iirc22:44
ianwclarkb: yep, will do22:44
ianwjust changing locations here, back in about 20 min22:44
clarkbas a side note we can safely modify the cloud init file because we should be removing that package after first boot22:52
*** gothicserpent has quit IRC22:54
*** artom has quit IRC22:54
*** artom has joined #opendev22:54
*** gothicserpent has joined #opendev22:57
clarkbianw: I said I would double check with you before dropping the tarballs ord replica meeting agenda item. Should that be dropped now or would you like to keep it up?23:05
ianwclarkb: i think drop; we know it's taking a long time but i guess it doesn't really matter23:08
ianwit's being "vos release"23:08
clarkbk23:08
ianwi had some terse notes here on using "vos dump" | ssh cat that were working, for a recovery case23:09
ianwoh and ajaeger confirmed that docs-old can go.  i'll work on removing that too23:12
clarkbnice23:12
clarkbianw: not sure if you saw but I'm starting to run into cases where communicating with users is going to be necessary re gerrit accounts. No responses yet, though the first couple of sets are in china I think so I probably won't hear back until tomorrow morning at the earliest23:13
clarkbbut I did find another batch of ~56 that could be cleaned up (they had ssh keys but no usernames and often invalid openids or no openids, which implied to me they are likely accounts that had been previously merged)23:14
clarkband the agenda has been sent out23:14
ianwexcellent, thanks for keeping on the user cleanup!23:15
clarkbtomorrow I'm going to start putting some ptg stuff together as I suspect the later half of the week will be consumed with zk stuff23:16
clarkbif anyone has ideas or wants to get started on that before me feel free :)23:17
openstackgerritMerged opendev/system-config master: haproxy: write to container log files  https://review.opendev.org/c/opendev/system-config/+/78312023:17
openstackgerritMerged opendev/system-config master: Add OSUOSL cloud  https://review.opendev.org/c/opendev/system-config/+/78581323:18
openstackgerritMerged opendev/system-config master: review02: pin ipv6 configuration  https://review.opendev.org/c/opendev/system-config/+/78555623:18
clarkbianw: if you have a moment https://review.opendev.org/c/opendev/git-review/+/785723 is what should be a straightforward git-review update to allow for --no-thin pushes (to work around an annoying but infrequent pack file disagreement between jgit and c git)23:18
clarkbit will be easier for us to have people do git review --no-thin than git push --no-thin gerrit HEAD:refs/for/master23:19
ianwok, it sounds like a need a trip to the git man page :)23:20
ianwif i'm not back in 30 minutes, send a search party23:20
*** tosky has quit IRC23:24
clarkbha, there is a link to an old lp bug in the commit message too which may help23:26
ianwthe man page doesn't, because no-thin isn't documented23:27
clarkbianw: grep for thin in git push23:27
clarkbit is formatted as --[no-]thin23:27
ianwahhh23:28
*** artom has quit IRC23:29
ianwlgtm too i guess :)  i'm not sure how a normal person would figure this out ... is it something repeatable with certain trees?23:32
clarkbianw: it has been repeatable for certain users until they do something like rebase or do the --no-thin manually. And yes I don't think people will necessarily figure it out themselves, but we can point them to the flag if it comes up23:33
clarkbsince instructing someone to do git review --no-thin is easier than detailing how to push to gerrit without git review23:34
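For the record, the two forms being compared are roughly these; the first assumes a git-review release with change 785723 applied, and "gerrit" is the remote name git-review configures by default:

  git review --no-thin
  git push --no-thin gerrit HEAD:refs/for/master   # the manual equivalent, without git-review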
ianwsure.  i guess the rebase bit might be the key there.  it sort of makes sense i guess if we've repacked everything as we do and the client hasn't updated in a long time ... sort of.  i mean the idea should be that any object is reachable all the time you'd think23:35
clarkbyup, I think the issue is when you do a thin transaction each side takes the list of refs it knows about and the list of refs it wants the other side to know about and then builds an assumption about what objects need to be present in the tree23:36
clarkbbut it seems that sometimes jgit and c git disagree23:36
clarkband that is how the problem bubbles up23:36
clarkbalso it seems that protocol v2 vs v1 doesn't change the behavior23:37
clarkbwe managed to get the last person to hit this to check for us23:37
ianwit looks like arm64 isn't happy in some way23:38
ianwhttps://review.opendev.org/c/opendev/system-config/+/785675 all reported retry_limit23:38
ianwFailed to fetch https://mirror.regionone.linaro-us.opendev.org/ubuntu-ports/dists/bionic/universe/binary-arm64/Packages  403  Forbidden23:39
clarkblooks like mirror problems, ya23:39
ianwindeed, 403 is a odd one23:39
clarkbI'm having a hard time getting anything to load23:40
clarkbthe mirror root even23:40
ianwls: cannot open directory '/afs/openstack.org/': Connection timed out23:40
clarkbah that would do it23:40
ianwping: afs01.dfw.openstack.org: Temporary failure in name resolution23:41
ianwso ... yeah23:41
ianwit doesn't appear to have an ipv4 address23:42
clarkbdhcp should be serving those, did our lease run out and we just never got a new one?23:45
clarkbor is the other side of the NAT fine and our 1:1 to the fip broke?23:45
ianwlike "ip addr" doesn't show a ipv423:46
ianwi'm going to reboot it before i dig too much further23:52
ianwkevinz: ^ it still doesn't have an ipv4.  my current theory is that dhcp is not responding?23:59
