Monday, 2021-04-12

*** tosky has quit IRC00:00
ianwwheels aren't releasing due to "Could not lock the VLDB entry for the volume 536871142."00:54
ianwi feel like i already fixed that at some point ...00:54
ianwVLDB entries for all servers which are locked:00:59
ianwTotal entries: 000:59
*** brinzhang has joined #opendev01:02
*** iurygregory has quit IRC01:05
ianwok, this is a red herring01:21
ianwthe real problem is01:21
ianwwhich is openafs failing to install on centos7 which means it can't publish01:24
ianwfor a long time i've been meaning to reorganise these jobs to use the executor's afs client to copy the data ... but anyway01:25
ianwi feel like the last time this happened, it was because we hadn't updated centos nodes and the kernel had changed, and we couldn't get the headers01:26
ianwok, 785675,1 is stuck waiting for arm64 nodes01:27
ianwi'm not sure why it hasn't timed out01:27
ianwnl03 isn't responding for me and could explain this01:30
ianw... ignore that.  helps if you try (not stack)01:32
ianwkevinz: hrm, i think i see in scrollback you'd identified some bogus nodes right?01:35
kevinzianw: morning! There are 3 instances are deleted but still remaining metadata..01:36
kevinzI'm working on removing it from DB01:36
ianwok, cool01:36
ianwit almost looks to me like the launcher has somehow forgotten about the nodes being requested by zuul01:37
ianwit doesn't appear to be trying to satisfy any requests01:37
ianw2021-04-12 01:37:34,918 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: []01:37
ianwbut system-config-zuul-role-integration-bionic-arm64 has been queued for 54 hours01:38
kevinzianw: You means that linaro-us doesn't respond any requests from Zuul?01:39
kevinzI saw that just 6  instances  are currently running on the cluster01:39
ianwkevinz: no i don't think that's it.  it seems like nodepool has some how lost a bunch of requests; it is not trying to satisfy them01:40
ianwi think linaro is responding ok01:40
kevinzianw: well, OK,  what can I do to help? The first thing I think is to remove the existing "disappeared" vm instances from our cluster first01:41
ianwkevinz: yeah, i don't know, this is a weird one01:44
kevinzianw: OK,  I will fix this first to see if things will be better01:45
ianwi feel like the node requests are not in zookeeper, so nodepool will never try to satisfy them.  but zuul clearly thinks they are01:46
ianwsomething happened about 58 hr 21 min ago01:46
ianw0a526f11-b784-416b-bd89-c5de47a9ba4c | debian-buster-arm64-linaro-us-0023946093 | BUILD  |                                                                        | debian-buster-arm64-1618117653 | os.large |01:47
ianwkevinz: ^ can you see anything interesting relating to that01:48
ianw2021-04-11 08:22:04,016 INFO nodepool.NodeLauncher: [e: 788535e8d4bc49919afbc414a1fcaa45] [node_request: 300-0013647916] [node: 0023946093] Node is ready01:49
ianwit seems to say the node is ready, but it's still showing "BUILDING"?01:49
ianwbut then "2021-04-11 09:45:19,783 INFO nodepool.NodeDeleter: Deleting ZK node id=0023946093, state=deleting, external_id=f5ee1b0f-107d-4965-a7b5-2375be42a30"01:50
ianwthis is all from yesterday01:50
ianwok, it's not correct that the requests aren't in zookeeper01:53
ianwhere is a log of a request in zookeeper and the NL related logs01:55
ianwthis failed at01:55
ianwlauncher-debug.log.2021-04-09_15:2021-04-09 15:28:34,633 ERROR nodepool.NodeLauncher: [node_request: 300-0013634813] [node: 0023929201] Launch failed for node centos-8-stream-arm64-linaro-us-002301:56
ianwafter 3 attempts01:56
ianwright, nodepool request-list shows this too02:01
ianwi'm restarting nl03 container, i'm not sure what else to do02:04
ianwkevinz: i think there is a problem02:06
ianwi'm seeing a very helpful (not) message of02:06
ianwopenstack.exceptions.SDKException: Error in creating the server (no further information available)02:06
ianw2021-04-12 02:06:19,346 ERROR nodepool.NodeLauncher: [node_request: 300-0013634833] [node: 0023948459] Detailed node error: No valid host was found. There are not enough hosts available.02:07
ianwkevinz: ^ it might actually be that02:07
ianwkevinz: yeah, things are just going into ERROR state02:08
ianwyou can probably see that, nodepool is going crazy trying to create the nodes again :)02:09
kevinzianw: Yes I saw, quite a lot of instances are coming.  Several instances are building and others are failed due to no valid host02:11
kevinzianw: I think we can stop some UT test since it is quite overloaded to the nodepool02:11
kevinzbtw, the 3 "disappeared" instances  have been removed already02:12
ianwyeah i would say this is trying to build too many nodes02:13
ianwwe've got max-servers: 40 ; i guess this is within limit02:15
ianwkevinz: is this thundering herd of starting instances killing the cloud?02:15
kevinzianw: yes, the limitation is 40.  I will try to find one more node to join the cluster to release the overload02:17
kevinzianw: Yes the cloud is receiving a lot of creating requests,  so it is slow now :-)02:18
ianwok, i can turn that down if it's gotten too high02:18
ianwi'm having a few authentication issues, but hopefully we'll have 15 nodes from OSU OSL coming online soon02:19
kevinzianw: That's fine actually,  I see some instances creation is finished02:20
kevinzcool,  you mean 15 nodes are 15 vms or bare metal machines?02:20
kevinzianw: it looks that the OSU OSL machines are newer and maybe better performance :-)02:21
ianw15 vms :)02:26
kevinzOK,  nice02:27
ianwi'm going to grab some lunch and hopefully things will start moving now02:27
kevinzianw: OK, np02:43
*** cloudnull8 has quit IRC03:02
*** cloudnull8 has joined #opendev03:02
ianwhrm, something is still up03:08
ianwwe've got like 6 active nodes and nothing trying to build, but the queue is huge03:08
ianwkevinz: it still seems to go straight into error node03:11
ianw151a8028-569b-4178-b09f-8c8411cf6aa5 for example, can you see what happened with that?03:12
kevinzianw: I'm adding one new compute node to this cluster, and it is under operation now.  This instance happened to be scheduled to this new node03:14
ianwkevinz: oh, ok np.  lmn when things are stable03:14
kevinzianw: I saw is running03:27
ianwkevinz: if you check there's lots of things waiting for nodes03:28
kevinzianw: yes I see,03:28
ianwi've turned the max servers down to 10 for a little while you're working on it03:29
ianwas you say, it does seem some nodes are building now03:29
kevinzianw: how long of zuul waiting  for a instance creation?03:29
ianwthough that said, a bunch are in error03:29
ianwe.g. 98b36e90-52ec-47fa-a413-5b246e1705af just errored03:30
ianwkevinz: several days :)  that's the problem ...03:30
kevinzOK,  will check03:30
ianwkevinz: here's a big list
kevinzI mean is there a timeout time for waiting instance launch,  if timeout then retry03:30
ianwyeah, i would have expected all these to fail with timeouts, but they haven't.  i think that's perhaps a separate, but related issue03:31
kevinzianw: OK, ack03:31
ianwsomething about the way things are failing isn't making zuul/nodepool give up03:31
kevinzianw:  98b36e90-52ec-47fa-a413-5b246e1705af : No valid host was found03:33
kevinzI think there are some scheduler issues maybe,  always making the cloud report no valid host...03:34
kevinzWill fix the new host adding first anyway03:34
ianwi've found at least one issue, that leaked nodes are put in a DELETING state but with no other details, and this confuses the quota calculator03:45
*** brinzhang_ has joined #opendev04:00
*** brinzhang has quit IRC04:03
*** mkowalski_ has joined #opendev04:12
*** tristanC_ has joined #opendev04:15
*** jrosser has quit IRC04:20
*** tristanC has quit IRC04:20
*** mkowalski has quit IRC04:20
*** Alex_Gaynor has left #opendev04:21
kevinzianw: adding one more 44core machines to the cluster04:33
kevinzadding finished and I've tested the instance creation04:34
kevinzianw: yes I always saw that the DELETING state blocked..04:34
*** jrosser has joined #opendev04:34
ianwkevinz: ok, cool, quota back up to 40 nodes in linaro.  i think the xxxlarge instances though will keep the number of running nodes more limited (hitting memory quota)04:35
kevinzianw: Yes.  That is another problem04:36
*** ysandeep|away is now known as ysandeep04:53
*** ykarel has joined #opendev04:56
*** marios has joined #opendev05:08
*** whoami-rajat_ has joined #opendev05:36
*** sboyron has joined #opendev05:52
*** ralonsoh has joined #opendev06:02
*** slaweq has joined #opendev06:09
*** eolivare has joined #opendev06:23
*** dmsimard has quit IRC06:48
openstackgerritMerged openstack/project-config master: Bump node version for publish-openstack-stackviz-element
*** dmsimard has joined #opendev06:51
*** amoralej|off is now known as amoralej06:52
openstackgerritMerged openstack/project-config master: nodepool elements: create suse boot rc directory
*** fressi has joined #opendev07:05
*** eolivare has quit IRC07:07
*** andrewbonney has joined #opendev07:08
*** eolivare has joined #opendev07:09
ianwfungi: ^ i think that wheel building is held up because openafs fails to install on centos7.  i think that's because our images have an out of date kernel, and the headers are not on the mirror any more.  and i think that's because it's stuck behind suse.  and that's what ^ fixes :)07:14
ianwjust a typical day in dependency land!07:14
*** dmsimard has quit IRC07:32
*** dmsimard has joined #opendev07:33
*** tosky has joined #opendev07:35
*** jpena|off is now known as jpena07:54
*** rpittau|afk is now known as rpittau08:04
*** ysandeep is now known as ysandeep|lunch08:11
*** gnuoy` has joined #opendev08:22
*** gnuoy has quit IRC08:26
*** brinzhang_ is now known as brinzhang08:55
*** dtantsur|afk is now known as dtantsur08:56
*** ysandeep|lunch is now known as ysandeep08:56
hrwI see that check-arm64 queue cleaned up09:45
*** whoami-rajat_ is now known as whoami-rajat10:33
*** brinzhang_ has joined #opendev11:05
*** brinzhang has quit IRC11:08
*** iurygregory has joined #opendev11:11
*** artom has joined #opendev11:19
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap
openstackgerritGuillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag
*** jpena is now known as jpena|lunch11:32
*** dhellmann_ has joined #opendev11:44
*** dhellmann has quit IRC11:45
*** dhellmann_ is now known as dhellmann11:45
fungiianw: it also sounds like you might have run into the same stuck node requests i've been trying to track down the cause of for a few weeks now11:52
funginot sure if that sounds like some of what you saw too11:53
*** jpena|lunch is now known as jpena12:30
*** cloudnull8 is now known as cloudnull12:48
*** stephenfin has quit IRC12:49
*** amoralej is now known as amoralej|lunch12:52
*** stephenfin has joined #opendev13:08
openstackgerritGuillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap
openstackgerritGuillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag
openstackgerritMerged openstack/diskimage-builder master: Add Debian Bullseye Zuul job
zigoHi there!13:20
zigoI was wondering, would there be a way to get, in gerrit, a direct link to a plain patch file?13:20
zigoI mean, no zip, tar.xz or base64...13:20
zigoIt'd be really helpful for me.13:20
hrwzigo: press 'DOWNLOAD' link13:22
hrwah. you were there already13:22
hrwzigo: curl patchlink|base64 --decode?13:22
zigohrw: Yeah, it has diff.base64,, tgz, tar, tbz2, txz ...13:22
zigohrw: Yeah, I know, I can do that... :)13:23
zigoI'd prefer if I didn't have to.13:23
hrwzigo: I went that way in CI job13:24
hrwas it was easiest way to fetch patches without having gerrit account13:24
zigohrw: It's not about CI or automation, it's that I very often pick-up patches by hand, and that's always one more step to do ...13:25
*** amoralej|lunch is now known as amoralej13:25
hrwzigo: make an alias?13:26
*** fressi has left #opendev13:49
fungi explains the rest api call that download link represents13:57
fungii expect the reason for base64 encoding is that the diff could be of a binary file, and so trying to display that in a web browser would get weird13:58
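Note: the download link discussed above maps to Gerrit's REST endpoint for a revision's patch, which returns the diff base64-encoded. A minimal Python sketch of building that URL and decoding such a response might look like the following. The helper names and the example host are hypothetical, and the actual HTTP fetch is left out so the snippet stays self-contained.

```python
import base64


def patch_url(base_url: str, change: int, revision: str = "current") -> str:
    """Build the REST URL for a change's patch (base_url is an assumption)."""
    return f"{base_url.rstrip('/')}/changes/{change}/revisions/{revision}/patch"


def decode_patch(b64_body: str) -> str:
    """Decode the base64 body returned by the patch endpoint into text.

    Undecodable bytes (for example from binary diffs) are replaced rather
    than raising, since the diff may not be valid UTF-8.
    """
    return base64.b64decode(b64_body).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in for a server response: a tiny diff, base64-encoded locally.
    sample = "diff --git a/README b/README\n"
    body = base64.b64encode(sample.encode("utf-8")).decode("ascii")
    print(patch_url("https://gerrit.example.org", 785675))
    print(decode_patch(body), end="")
```

This is the same thing as the `curl ... | base64 --decode` suggestion, just wrapped so it can sit behind an alias or a small script.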
*** sboyron has quit IRC14:06
*** sboyron has joined #opendev14:07
*** ykarel has quit IRC14:21
*** ykarel has joined #opendev14:24
clarkbanother approach could be to use git fetch14:44
clarkbgit fetch && git show FETCH_HEAD > foo.patch14:44
*** snapdeal has joined #opendev14:48
fungiyep, you can even fetch those refs from the gitea server farm14:49
fungiunfortunately gitea doesn't have a way to call named refs in its webui that i've been able to figure out (something i sorely miss from cgit and gitweb)14:50
clarkbfungi re I'll get that installed after breakfast then try and remember to use it once or twice to push some actual code14:50
fungiyou could use the gerrit-provided gitweb to do it, i think, but you'd need to be authenticated first because of the way it's hooked up14:50
fungiclarkb: thanks!14:50
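Note: the git fetch approach relies on Gerrit publishing each patchset under a ref of the form refs/changes/NN/CHANGE/PATCHSET, where NN is the last two digits of the change number, zero-padded. A small Python sketch of deriving that ref and the matching command line is below. The function names, the remote name, and the output filename are illustrative assumptions, not anything from the discussion.

```python
def change_ref(change: int, patchset: int) -> str:
    """Return the Gerrit ref for a given change number and patchset."""
    shard = f"{change % 100:02d}"  # last two digits, zero-padded
    return f"refs/changes/{shard}/{change}/{patchset}"


def fetch_command(remote: str, change: int, patchset: int,
                  out: str = "foo.patch") -> str:
    """Compose the shell command that fetches the ref and writes a patch."""
    ref = change_ref(change, patchset)
    return f"git fetch {remote} {ref} && git show FETCH_HEAD > {out}"


if __name__ == "__main__":
    # 785675,1 is the change/patchset pair mentioned earlier in the log.
    print(change_ref(785675, 1))
    print(fetch_command("origin", 785675, 1))
```

The remote can be any server that mirrors the change refs, so no Gerrit account is needed for the fetch itself.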
*** marios is now known as marios|call14:53
*** dpawlik has quit IRC14:58
*** marios|call is now known as marios15:01
*** ykarel is now known as ykarel|away15:19
clarkbI've received notice that and's ssl certs have 30 days of validity remaining. these are not LE certs. Do we want to bother renewing them?15:19
fungifor i expect we can just let it expire15:20
fungii was going to propose we take that out of service anyway15:21
fungifor it's probably a one-liner to add it to the other git redirect domains we already generate certs for15:21
fungijust need a corresponding cname for the acme stuff15:21
fungiclarkb: good news, the cert we're deploying is already generated with lets encrypt, so that can be ignored15:23
fungiSubject: CN = git.airshipit.org15:23
fungiIssuer: C = US, O = Let's Encrypt, CN = R315:23
clarkboh even better15:23
fungiNot After : May 18 05:31:57 2021 GMT15:23
clarkband ya I agree re survey15:23
clarkbI bet that got updated when the stuff serving files moved servers a while back15:24
fungiyup, was pretty sure we had done them all, which is why i double-checked15:25
clarkbfungi: do you have a sense for where the openstack release process is re final RCs? I'm starting to try and page the zk cluster rolling replacements back in and wonder if we need to be careful for their release still15:27
fungiclarkb: wednesday around 10:00 utc i think is when the final release versions will all be tagged15:30
clarkbcool in that case probably waiting for at least wednesday is fine15:30
clarkbI can find other items to occupy my time between now and then15:30
clarkbdo the gerrit 3.2.8 upgrade later this week too likely15:30
*** ykarel|away has quit IRC15:33
*** mlavalle has joined #opendev15:42
openstackgerritClark Boylan proposed opendev/system-config master: Add note about python -u to external id cleanup script
clarkbfungi: ^ that was pushed with `git review -v --no-thin` the -v helped me verify the command (and --no-thin was present as expected) and --no-thin appears to have functioned fine15:46
clarkbgiven that seems to work and the test didn't cause any problems I suspect we can land that15:47
fungivery cool! yeah, looking15:47
*** roman_g has joined #opendev15:47
*** amoralej is now known as amoralej|off15:49
clarkbLooking at the gerrit user account conflicts I see there are a small number of CI accounts that we can likely pretty safely untangle15:57
clarkbbasically remove the human identifying conflict from the CI accounts15:57
clarkband let the human account be the owner of that external id without conflict15:58
clarkbin some cases it is two different CI accounts conflicting with each other. In those cases I think we simply disable the one that least recently commented and clean it up15:58
clarkbbut I do need to review things because I think in some of these cases we don't actually want to retire any accounts as both are being used. We just want the CI system to have a CI system email addr and a human to have a human email addr without conflict16:00
fungialso i expect there are cases where multiple ci systems were created with the same e-mail address16:04
clarkbI'm also seeing a non zero set of accounts with ssh keys set but no username16:06
clarkbI think the only way that would really make sense is if those accounts had been merged previously?16:06
fungiyes, i think so. we often didn't remove old ssh keys from accounts we merged into other accounts16:08
clarkbya looking at more recently used timestamps and other attributes it seems that this is likely the case16:09
*** dtroyer has joined #opendev16:24
clarkbThere is one CI account that conflicts with human accounts for four other people. I suspect in a case like that we don't retire anything, but simply remove the conflicts from the CI account, but I need to look at the external ids for that account more closely16:31
clarkb3 of those conflicts are simple mailtos and can be cleaned up. The fourth conflict is between emails on openids between what may be a human account and the CI account16:35
clarkbthe human account hasn't been used since 2015 though, but the ci account has been used this year. I guess in that case we can "sacrifice" the human account?16:35
fungiyeah, i would16:40
fungiwe can always help them get a new account set up later if they come to us16:40
clarkbIt is interesting to see how different some of these accounts are from each other in terms of how they conflict16:47
clarkbI'm going through and trying to understand each one a little better16:47
*** hamalq has joined #opendev16:47
*** hamalq has quit IRC16:47
*** hamalq has joined #opendev16:48
fungiit seems like it provides a window into the history of our infrastructure and how users have interacted with it in the past16:51
*** dtroyer has quit IRC16:53
*** rpittau is now known as rpittau|afk16:59
*** ysandeep is now known as ysandeep|holiday17:01
*** marios is now known as marios|out17:03
*** dtantsur is now known as dtantsur|afk17:06
*** marios|out has quit IRC17:06
corvusinfra-root: the squiggly lines on cacti and grafana look good to me.  in particular, the memory line on cacti is not at all squiggly and is in fact horizontal.  i think we're probably good to cut a release of zuul (which will be a good checkpoint release we can roll back to if needed for the next bit of v5 work).  concurrences?  dissents?17:14
fungicorvus: i agree, the memory leak looks very much solved now. this would make for a good version to release17:15
*** sboyron has quit IRC17:21
*** sboyron has joined #opendev17:21
clarkbcorvus: sounds good to me17:23
clarkbfungi: can you look at review:~clarkb/gerrit_user_cleanups/next-cleanups-20210412.txt ? That is what I worked through based on the conversation above. If all looks well to you I'll be trying to run the retire step against the ones listed as retireable and then in a few days can do the external id cleanups17:24
*** jpena is now known as jpena|off17:48
*** ralonsoh has quit IRC17:52
*** eolivare has quit IRC18:00
openstackgerritClark Boylan proposed opendev/git-review master: Add option for disabling thin pushes
clarkbfungi: ^ manpage updated18:01
fungiclarkb: were you adding a release note too (see earlier review comment)?18:03
*** roman_g has quit IRC18:08
*** roman_g has joined #opendev18:08
*** roman_g has quit IRC18:09
*** roman_g has joined #opendev18:09
*** roman_g has quit IRC18:09
*** roman_g has joined #opendev18:10
*** roman_g has quit IRC18:10
*** roman_g has joined #opendev18:11
*** roman_g has joined #opendev18:12
*** roman_g has quit IRC18:12
*** roman_g has joined #opendev18:12
*** roman_g has quit IRC18:13
*** roman_g has joined #opendev18:13
*** roman_g has joined #opendev18:14
*** roman_g has quit IRC18:14
openstackgerritMerged opendev/system-config master: Add note about python -u to external id cleanup script
fungiclarkb: on the "skipping" comments in your new cleanup list, those apply to the line immediately following them?18:16
fungii guess you're not feeding the list directly to a script so it's fine not to comment them out18:16
fungialso the midokura account being dormant is a good call, pretty sure they're no longer involved, the midonet neutron driver has been retired due to lack of maintenance18:19
fungithe plan in the latter half of next-cleanups-20210412.txt looks good, spot checks of the various account classes reflect the states i would expect18:21
clarkbfungi: yup to the line below18:29
clarkbfungi: oh I missed the release note comment. I'll address that18:29
openstackgerritClark Boylan proposed opendev/git-review master: Add option for disabling thin pushes
clarkbfungi: how's that?18:38
clarkbfungi: and ya that's not the direct input to retire accounts. It takes a bit of massaging (I have to take the account ids and prefix them with refs/users/XY/ for account id ABXY)18:40
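Note: the massaging described above follows Gerrit's NoteDb layout, where each account lives under refs/users/XY/ACCOUNT_ID and XY is the last two digits of the account id, zero-padded. A hedged Python sketch of that conversion is below. The function names and the sample ids are made up for illustration, and this is not the actual retirement tooling.

```python
def user_ref(account_id: int) -> str:
    """Return the NoteDb ref for a Gerrit account id."""
    shard = f"{account_id % 100:02d}"  # last two digits, zero-padded
    return f"refs/users/{shard}/{account_id}"


def refs_for_accounts(account_ids):
    """Map a list of account ids to their refs, preserving order."""
    return [user_ref(a) for a in account_ids]


if __name__ == "__main__":
    # Hypothetical account ids, not taken from the cleanup list.
    for ref in refs_for_accounts([1234, 7, 30100]):
        print(ref)
```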
*** sboyron has quit IRC18:41
*** andrewbonney has quit IRC18:47
clarkbalright I'm going to start retiring the 56 accounts in that list18:54
fungisounds good, thanks!18:54
clarkbthat is done now and logs are in the normal location on review19:25
clarkbI'm going to look at what is necessary to do the manual surgery that I proposed for the subset of CI accounts that we cannot just turn off now19:25
*** mailingsam has joined #opendev19:41
*** whoami-rajat has quit IRC19:55
