Wednesday, 2020-08-05

*** ryohayakawa has joined #opendev00:09
corvusclarkb: sn5 successfully hopped :)00:45
ianwi had it up, then there were comments it was scrubbed, so i closed it, and then it went00:55
*** xiaolin has joined #opendev01:23
*** hashar has joined #opendev01:29
*** mtreinish has joined #opendev02:39
*** hashar has quit IRC03:50
*** tkajinam has quit IRC03:51
*** tkajinam has joined #opendev03:52
*** fressi has joined #opendev04:27
donnydit looks like something is actually busted... so maybe my upgrade didn't go as smoothly as I thought...04:33
openstackgerritIan Wienand proposed opendev/system-config master: launch-node: get sshfp entries from the host
ianwfungi/clarkb: ^ one to try if launching a new node04:39
ianwdonnyd: thanks ... if it was easy everyone would do it :)04:40
donnydI want to say that it's possible we have an issue with the number of images being uploaded - it appears it's trying to upload 322 images at the moment04:41
donnydi am going to purge those that are stuck in queued04:44
donnydjust some rough maths - that would be about 3.2 TB worth of images... it may take some time to get caught up04:47
*** sgw2 has quit IRC04:49
*** raukadah is now known as chkumar|rover04:54
ianwumm, yes that is not correct :)04:54
ianw2020-08-03 16:16:45 UTC deleted corrupt znode /nodepool/images/fedora-31/builds/0000011944 to unblock image cleanup threads04:55
ianwthat might be related?04:56
ianwdonnyd: afaics nodepool doesn't think it's uploading to OE at the moment.  so might be collateral damage from prior issues04:58
openstackgerritOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml
donnydianw: it looks like the images are uploaded now06:25
donnydthere are still a couple saving.. but it does appear to "work"06:25
donnydand I was able to delete the instance that was stuck in error06:27
donnydI think what happened is I reduced the available memory for my DB nodes while I was shuffling things around because they weren't really in use - so i forgot to set them back to normal06:28
donnydand when I looked at the db nodes two of them were all but locked up.. took several minutes to even open an ssh session... so I am thinking they are sorted and we can maybe give it another swing06:29
donnydi have run a few tests to ensure the focal image that was uploaded does in fact boot and start06:39
*** ryohayakawa has quit IRC07:02
*** tosky has joined #opendev07:38
openstackgerritMerged openstack/project-config master: Normalize projects.yaml
*** DSpider has joined #opendev07:43
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:03
*** dtantsur|afk is now known as dtantsur08:04
*** tosky has quit IRC08:26
*** tosky has joined #opendev08:27
ianwdonnyd: ok, so it seems to boot, but i can't log in08:41
ianw[   29.577671] cloud-init[2011]: ci-info: no authorized SSH keys fingerprints found for user ubuntu.08:41
ianwi feel like it's not getting keys; meta-data problem?08:41
ianwb4622148-ea50-46a9-85a5-24f3c08d565d i've left in this state08:41
ianw(key is in /tmp from launch attempt on
*** sshnaidm|afk is now known as sshnaidm08:56
*** bolg has joined #opendev09:59
*** tkajinam has quit IRC10:15
*** hashar has joined #opendev11:08
openstackgerritAurelien Lourot proposed openstack/project-config master: Mirror keystone-kerberos and ceph-iscsi charms to GitHub
donnydianw: I rebooted the node and it seems to have gotten meta-data12:06
*** hashar has quit IRC12:13
donnydoh i see the error ianw12:48
donnydthe metadata service is in fact also busted12:48
donnydit's fixed now12:50
donnydI left a hosts entry in the edge machine that was trying to contact the internal metadata service on ipv6 and it was not very happy about this12:51
*** priteau has joined #opendev13:09
*** iurygregory has quit IRC13:12
*** sgw1 has joined #opendev13:15
*** iurygregory has joined #opendev13:18
openstackgerritMerged openstack/project-config master: Move non-voting neutron tempest jobs to separate graph
openstackgerritAurelien Lourot proposed openstack/project-config master: Mirror keystone-kerberos and ceph-iscsi charms to GitHub
openstackgerritOleksandr Kozachenko proposed openstack/project-config master: Add openstack/barbican in required project list of vexxhost
*** sshnaidm is now known as sshnaidm|afk14:24
*** redrobot has joined #opendev14:32
*** mlavalle has joined #opendev14:37
*** hashar has joined #opendev15:01
openstackgerritMerged openstack/project-config master: Add openstack/barbican in required project list of vexxhost
*** chkumar|rover is now known as raukadah15:26
*** tosky has quit IRC15:31
*** shtepanie has joined #opendev16:00
*** tosky has joined #opendev16:29
openstackgerritSean McGinnis proposed openstack/project-config master: Gerritbot: only comment on stable:follows-policy repos
*** dtantsur is now known as dtantsur|afk16:39
*** sshnaidm|afk is now known as sshnaidm17:19
*** priteau has quit IRC17:38
*** fressi has quit IRC17:42
*** shtepanie has quit IRC18:40
clarkbfungi: when pbr isn't distracting you I'd love your thoughts on and what testing looks like19:12
clarkbthat's the gerritbot change19:12
clarkbI'll sync up with ianw on his comments later today and try to get a new ps up that is mergeable19:14
*** hashar has quit IRC19:32
*** tosky has quit IRC19:39
*** DSpider has quit IRC20:08
clarkbfungi also following up on the sshfp stuff for review are we all set there? the zone update landed and cname was updated?20:09
clarkbthe nodepool upload workers update has applied so we should upload more quickly now20:10
* clarkb is catching up on yesterday's todo list20:10
fungioh, i haven't done the update yet, will get to that in just a sec20:13
openstackgerritOleksandr Kozachenko proposed openstack/project-config master: Add openstack/barbican-tempest-plugin to vexxhost
Open10K8SHI team20:40
Open10K8SPlease check this Needed-By:
*** smcginni1 has joined #opendev20:47
*** smcginnis has quit IRC20:50
*** smcginni1 is now known as smcginnis20:50
openstackgerritMerged openstack/project-config master: Add openstack/barbican-tempest-plugin to vexxhost
openstackgerritTristan Cacqueray proposed zuul/zuul-jobs master: shake-build: add shake_target variable
*** tosky has joined #opendev21:25
*** markmcclain has quit IRC21:53
ianwclarkb: my thought was that it has to listen to the event stream right?  so it has to connect to the remote host, so review-test needs a key installed?22:15
clarkbianw: it does, but it may be sufficient to test that the docker container starts?22:16
clarkband then just let it error?22:16
ianwyeah, that was the chicken egg i was thinking ... how do you get the authorized key on the remote side?  i can't see a way22:16
ianwit is probably enough that the container starts and we see a log file or something22:17
clarkbright, my question about key material was more do we want a real key there so that paramiko loads it successfully but not necessarily add it to the server so that it can then connect successfully22:19
clarkbI'm not sure where the useful line is22:19
ianwahh, ok; sorry yes i was thinking total end-to-end22:19
ianwi wouldn't mind; if it's easier to detect the daemon starting with a valid but useless key ("can't log in" v throwing an exception because the key isn't parsed, say) i'd go with the fake key :)22:21
clarkbcool I'll generate a fake key then22:22
*** qchris has quit IRC22:22
clarkbthough I've been distracted by pbr things22:24
ianwdonnyd: thanks, let me give it another go :)22:28
ianw(i'm assuming fungi/clarkb didn't as yet?)22:28
fungii have not, no22:29
fungistill somewhat consumed by personal post-storm activities/cleanup22:29
clarkbI didn't. Sorry totally sniped by the pbr thing. Hoping to fix it so that the same issue doesn't pop up every month22:29
ianwi will have to read backscroll on pbr ... do i want to? ;)22:29
fungimost discussion was in #openstack-infra with some in #openstack-oslo22:30
clarkbianw: tl;dr is people keep trying to drop python2 support in pbr22:30
clarkbbut since pbr is a setup_requires we can't really do that without breaking stable branches that still run on python222:30
fungitl;dr is that pbr's testing was neglected for far too long and suffered excessive bitrot22:30
ianwfungi: bummer ... all ok?  i have nothing but time.  we're on a 6 week lockdown where you're supposed to go out within a 5km radius for 1 hour a day, and a total curfew from 8pm-5am22:30
clarkband ya it's really just the pbr testing that bit rotted and needs fixing. pbr itself is fine when running22:30
fungiianw: yeah, we didn't get hit that hard, but now have to undo all our storm prep (until the next one)22:31
fungilots of stuff in the living space which normally goes downstairs in the entry or garage, things moved to the non-windward side of the house need to be moved back, and so on22:32
clarkbfungi: have you considered converting your home into a boat?22:33
clarkbthen with a good anchor you can float through the storms?22:33
fungiit's halfway there, but too leaky22:33
fungiit would sink22:33
fungimuch of our disposable income goes toward making it increasingly seaworthy22:34
*** qchris has joined #opendev22:35
fungier, i meant stuff we moved from the leeward side of the house to the windward side has to be moved back, those always trip me up22:35
* fungi is clearly not a career sailor22:35
fungiit's always confused me that "windward" is the side away from the wind22:36
fungier, no, i had it right the first time22:38
fungii guess the confusion is that if you're moving windward you're moving with the wind but the windward side is what faces the wind22:39
clarkbya it's direction related22:40
fungionce we finish sealing the hull and weigh anchor i'm sure i'll figure it out22:42
donnydyea we should be good this time around.. but I am sure i have said that before22:48
ianw... ok ... trying the mirror bringup22:58
* donnyd crosses fingers his cloud isn't still busted22:59
ianwit's up, logged in, ipv4 & ipv6 working ... looking good!22:59
* donnyd jumps in excitement that he didn't ask ianw to keep working on a busted cloud23:00
*** tkajinam has joined #opendev23:00
mnaserinfra-root: we just upgraded nova in the montreal datacenter and it _looks_ like nodepool is spinning through vms very quickly (almost as if it's deciding "this vm is not ok, scrap it" rate of speed)23:00
mnaseri'm seeing them disappear before i can even check the console23:01
donnydi ran some tests earlier launching 30 instances at a time and making sure they came up properly - nothing was failing.. but it normally doesn't fail on the local network23:01
mnasernow it also looks like they're not pingable so i wonder if this is some glean thing23:01
mnasercause we can boot our images (which use cloud-init) just fine23:02
donnydis there a particular OS or is it all of them mnaser23:02
ianwmnaser: umm ... config-drive related maybe?  no glean changes afaik23:02
mnaserdonnyd: i havent looked if its a specific os23:02
ianwlet me see if i can pull anything from logs23:02
donnydwhen we had glean issues on OE it was just a single OS that was angry23:03
donnydI want to say it was centos, but I don't remember TBH23:04
donnydthat is not good at all23:04
mnaserthat'll do it23:04
donnydis that all images or just one type? any way for you to check?23:05
mnaseri'm seeing other vms going up just fine23:05
mnaserit's pretty hard because i have like23:05
mnaser2 minutes before nodepool kills it23:05
clarkbwe have had corrupted images before23:06
clarkbthere is no hash verification anywhere23:06
clarkbis it possible that upgrading nova happened during an upload and that truncated things in a bad way?23:06
clarkbmnaser: you can boot the image instead to debug rather than rely on nodepool?23:06
clarkbif you'd prefer I can do that too23:07
donnyddo you use ceph for the glance backend mnaser ?23:07
mnaserclarkb: that might be a case, and it might be easier for you to do it than me.  i'd have to give myself access to the tenant and all that23:07
clarkbok I'll spin one up. this is mtl right?23:08
clarkbya-cmq-1 or whatever23:08
ianwi'm just looking at the logs on nl03 now to see what i can ... glean23:08
clarkbmnaser: do you know what image the boots are failing on?23:09
clarkbname or uuid is good for me23:09
mnaserclarkb: i think majority seem to be `ubuntu-bionic-1596656540`23:09
*** mlavalle has quit IRC23:09
ianwi see 149 failed launch attempts there23:10
mnaseryeah, the thing is i can launch vms normally just fine (and using our images)23:10
donnydmnaser: I think I have that image - let me fire one off and see what it does on my end23:10
mnaserianw: are you able to tell when that image was uploaded? ^23:11
clarkbmnaser: clarkb-test1 has started it has uuid 834d1d5e-f28e-4136-a1fe-89761643db7a23:11
clarkbmnaser: | updated_at       | 2020-08-05T21:17:03Z for that image23:12
*** markmcclain has joined #opendev23:12
clarkbabout 2 hours ago23:12
ianwmnaser: ok, i agree, all of the bad launches are on bionic23:12
donnydnope, I only have ubuntu-bionic-159665654123:12
ianwactually no ... sorry23:12
clarkbthat image is "only" 13GB or so23:13
clarkbI'm trying to figure out what the source image size is23:13
mnaserthe image on disk is23:13
fungithat's some amazing compression you've got23:13
mnaserlet me check what the size is in ceph23:13
ianw# grep -f bad-launch.txt launcher-debug.log | grep 'from image' | grep vexxhost-ca-ymq-1 | awk '{print $20}' | sort | uniq -c23:13
ianw     24 centos-723:13
ianw    155 centos-823:13
ianw    324 ubuntu-bionic23:13
ianw     15 ubuntu-focal23:13
ianwthat maybe just reflects the distribution of our testing23:14
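[Editor's note] A minimal sketch of ianw's point that the tally may simply mirror how often each image is requested. It only re-expresses the four counts pasted above as shares of the total; the variable names are illustrative and nothing here reads the real launcher log.

```python
# Counts copied from the `sort | uniq -c` output pasted in the channel.
counts = {
    "centos-7": 24,
    "centos-8": 155,
    "ubuntu-bionic": 324,
    "ubuntu-focal": 15,
}

total = sum(counts.values())

# Print each image's count and its share of all counted launch attempts.
for image, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{image:15s} {n:4d} {n / total:6.1%}")
```

A high share for one image is only suspicious if it is out of line with that image's share of normal job demand, which these numbers alone cannot show.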
donnydmnaser: share your compression secrets with the world... you may save someone a billion or two23:14
*** tosky has quit IRC23:14
ianws/vexxhost/pied piper/23:14
fungistore the extra bits in a quantum pocket universe23:14
clarkb17921081344 is the correct size according to nodepool's fs23:14
clarkbso ya I think that's a corrupted upload23:14
corvusoh i'm around if there's an urgent thing; was deep into k8s stuff in another window23:15
mnaser`rbd -p images info 05fbc2ad-8aa6-4cd6-bd2b-eee942034cf9` => size 13 GiB in 1615 objects23:15
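[Editor's note] The size comparison being made here can be sketched as follows. The two figures are the ones quoted in the channel (17921081344 bytes on nodepool's filesystem, "13 GiB" from rbd); the helper name is hypothetical, and the rbd figure is rounded, so this is a rough sanity check rather than an exact byte comparison.

```python
GIB = 1024 ** 3  # bytes per GiB


def looks_truncated(local_bytes: int, remote_bytes: int) -> bool:
    """Return True when the stored copy is smaller than the local file."""
    return remote_bytes < local_bytes


local_size = 17921081344   # bytes, from nodepool's filesystem (per clarkb)
remote_size = 13 * GIB     # "size 13 GiB" as reported by rbd (rounded)

print(f"local:  {local_size / GIB:.2f} GiB")
print(f"remote: {remote_size / GIB:.2f} GiB")
print("truncated" if looks_truncated(local_size, remote_size) else "sizes match")
```

The gap of several GiB is far larger than any rounding in the rbd output, which is consistent with the conclusion that the upload was cut short.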
clarkbthe fix is to delete the upload in vexxhost and have it reupload23:15
clarkbcorvus: I think its not that urgent23:15
corvusk; ping me if you need me23:15
clarkbcorvus: or at least we're close to figuring it out / fixing it23:15
mnaserso i guess it's _maybe_ not related to the upgrade23:15
clarkbmnaser: unless the upgrade somehow truncated the image?23:15
fungiunless the upgrade somehow interrupted the upload23:15
clarkbI think we should delete the upload and have it reupload and go from there23:16
mnaserglance has been on ussuri since a few days ago23:16
fungiin a way that nodepool didn't recognize as a failure23:16
clarkbfungi: well glance doesn't check hashes or anything23:16
fungiclarkb: i concur23:16
mnaserthis was only nova, and i think ianw confirmed more than 1 failure23:16
clarkbfungi: it's really hard for nodepool to know if that fails :(23:16
donnydthat is pretty interesting - i have 6.23gb image size for ubuntu-bionic-159665654123:16
fungiand yes, glance's checksums are really just cosmetic23:16
clarkbdonnyd: yours is likely qcow2?23:16
clarkbianw: oh does your list show failures outside of bionic too?23:16
donnydoh so 13G is the raw size for the cephs23:16
fungiwe upload raw to vexxhost because boot from volume23:17
clarkbdonnyd: its actually 17GB23:17
ianwclarkb: hrmmm, hang on, that might be *all* launch attempts.  let me fiddle the grep ...23:17
clarkbdonnyd: which is why I suspect that is the problem23:17
fungiand also if memory serves, because the on-the-fly conversion to raw via glance tasks was... tasking vexxhost's infrastructure23:17
mnaseryes, i remember that, very well23:17
mnaseri was going to say we technically could go back to qcow223:18
mnaserbecause those new systems have local storage included + bfv if needed23:18
donnydI was literally just about to ask why not have it convert on the other end23:18
ianwno i think it's right23:18
donnydbut I guess that has already been done before23:18
ianwcat launcher-debug.log | grep  'Launch attempt '  | awk '{print $8}' | sed 's/]//' | sort | uniq  > bad-launch.txt ... gets me a list of failed launch attempt id's23:18
clarkbmnaser: clarkb-test1 does not ping and console log show shows no console log so I expect it has failed similarly23:18
clarkbianw: in that case we've either corrupted all uploads or its something else23:18
mnaserclarkb: yeah, even opening novnc console shows a "no hard disk found"23:18
ianwthen the above command pulls on the vexxhost-ca-ymq-1 ones23:19
fungiclarkb: likely due to most of its bytes missing23:19
donnydthose pesky bits and their need to be included23:19
mnaserthe nova build log really doesn't seem unhappy at all, unless the image download just broke out early23:19
donnyddo you have the newest version of the bionic image mnaser ?23:20
ianw(btw the OE mirror updated, but getting the sshfp records automatically still remains annoying)23:20
mnaserdonnyd: the opendev bionic image? i'm not sure23:20
mnaserbut using our images work just fine and other vms in the cloud are launching just fine23:20
mnaseranyways, we've *confirmed* that the ubuntu image is not clean23:21
mnaserperhaps the others are not either, so maybe we can start with that23:21
mnaseroh you know what, let me check something23:21
fungias clarkb said, if we delete the uploads via nodepool's rpc client, then it should reupload (presumed good) copies23:21
ianwi have windows open so lmk and i can delete images via nodepool23:22
fungiyeah, maybe once mnaser double-checks whatever it is he's double-checking23:22
donnydfwiw I had nodepool try to upload 322 images last night23:23
donnydso maybe there is something there23:23
clarkbdonnyd: it will try repeatedly until it has success23:23
clarkbdonnyd: I expect it thought each one failed23:23
donnydclarkb: these were all different names23:23
clarkbdonnyd: yes the name is based on when it uploads to you, not the source name23:24
donnydso it seems like it may have tried to upload every image it has ever built23:24
mnaserok so the base image is 13G for that disk23:25
clarkbwe build one image with a name based on a serial counter: ubuntu-bionic-0000001.qcow2. Then we upload that to each provider as ubuntu-bionic-$epochtime based on when the upload starts23:25
donnydthere is a great need for a sarcasm font23:25
clarkbdonnyd: heh. Anyway I think it tried to upload eg 0000001.qcow2 322 times because the first 321 failed23:25
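[Editor's note] The naming scheme clarkb describes (one build, uploaded to each provider as `<image>-<epoch seconds at upload start>`) can be illustrated with a small sketch. The function names are hypothetical and this is not nodepool's code; it only shows how the numeric suffix maps to a UTC time.

```python
from datetime import datetime, timezone


def upload_name(image: str, upload_started: datetime) -> str:
    """Build a provider-side name from the image name and upload start time."""
    return f"{image}-{int(upload_started.timestamp())}"


def upload_start_time(name: str) -> datetime:
    """Recover the upload start time from the trailing epoch suffix."""
    epoch = int(name.rsplit("-", 1)[1])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


started = upload_start_time("ubuntu-bionic-1596656540")
print(started.isoformat())  # 2020-08-05T19:42:20+00:00
print(upload_name("ubuntu-bionic", started))  # ubuntu-bionic-1596656540
```

This also explains why donnyd saw a different suffix (`...541`) for the same build: each provider's upload starts at a slightly different second.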
mnaserianw: please delete it and let's see what happens23:25
clarkbdonnyd: we should look into it but I expect they are separate issues23:25
mnaserit certainly seems the wrong size23:25
mnaserwe know _that_ is broken23:26
ianw| 0000114333 | 0000000001 | vexxhost-ca-ymq-1   | ubuntu-bionic        | ubuntu-bionic-1596656540        | 05fbc2ad-8aa6-4cd6-bd2b-eee942034cf9 | ready     | 00:02:09:09  |23:26
mnaserare you saying there is a vm that went up?23:27
ianwno sorry that's the image i will delete23:27
clarkbianw: yes that image looks correct to me23:27
mnaseroh, yes, fair :)23:27
mnaseryes, that is the image23:27
clarkbdonnyd:  0000114332 | 0000000118 | openedge-us-east <- that says image 0000114332 was uploaded 118 times to open edge23:27
clarkbdonnyd: the 118th succeeded so it stopped23:27
donnydAh I see23:27
donnydprobably from my DB being busted last night then23:28
clarkbdonnyd: what we might want to try is a backoff23:28
clarkbdonnyd: I think right now it retries immediately23:28
clarkbbut we could do exponential backoff until some reasonable period (one hour?)23:28
clarkbcorvus: ^ that may interest you specifically nodepool's insistence on reuploading immediately23:29
ianwok, vexxhost bionic image deleting23:29
donnydclarkb: I am happy with the pummelings nodepool issues TBH - I want to know if something doesn't work right and being kind to a cloud only gives one false hope23:30
clarkbdonnyd: ok :) I think it is good feedback that if the first 10 fail maybe the 11th will too and we can slow down23:30
donnydits probably not a bad idea23:31
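[Editor's note] A minimal sketch of the retry policy clarkb floats: exponential backoff capped at roughly one hour. The 60-second starting delay is an assumption made for illustration, as is the function name; nodepool's actual retry behaviour at the time was immediate retry, per the discussion.

```python
from typing import Iterator


def backoff_delays(attempts: int, base: float = 60.0,
                   cap: float = 3600.0) -> Iterator[float]:
    """Yield the wait in seconds before each retry, doubling up to a cap."""
    delay = base
    for _ in range(attempts):
        yield delay
        delay = min(cap, delay * 2)


delays = list(backoff_delays(8))
print(delays)
```

With these assumed values the wait reaches the one-hour cap by the seventh retry, so a run of over a hundred consecutive failures like the one described above would be spread across days rather than retried back to back.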
ianwok, vexxhost is showing two images uploading, a xenial one and a buster one23:32
donnydianw: hopefully the mirror is going a little better this time around23:32
ianw| 0000118348 | 0000000001 | vexxhost-ca-ymq-1   | ubuntu-xenial        | None                            | None                                 | uploading | 00:04:11:34  |23:32
ianwdonnyd: yep, it's up :)  i just need to add it to dns/inventory now23:32
ianw| 0000145461 | 0000000001 | vexxhost-ca-ymq-1   | debian-buster        | None                            | None                                 | uploading | 00:00:48:15  |23:32
ianwmnaser: do you see those in flight? ^23:33
mnaserunfortunately with no id it's kinda hard to identify anything23:33
ianwhrm, ok23:34
ianwfirst thing we should be able to see if the old bionic image is booting i guess23:34
donnydianw: I am going to pop out and do that thing where I sit on my couch and stare at the TV. Is there anything else I can do for now before I make my way to the living room23:35
ianwdonnyd: no, thank you! :)  i'll get it into our systems and then we should be gtg23:35
mnaserho boy23:36
mnaser`2020-08-05 23:28:58.500 11 WARNING glance.api.v2.images [req-84af7fef-cba3-48bb-bc7d-80026f95908d de81838458254d87a9ef66cc89e22308 86bbbcfa8ad043109d2d7af530225c72 - default default] After upload to backend, deletion of staged image data has failed because it cannot be found at /tmp/staging//05fbc2ad-8aa6-4cd6-bd2b-eee942034cf9`23:36
mnaseris this glance staging image uploads23:38
mnaserare we by any chance trying to import images23:39
clarkbwe're doing whatever the code that was shade does in openstacksdk23:40
clarkbI think its the v2 post23:40
clarkbopenstack.connection.Connection.create_image() is what we call23:41
clarkbI always get lost in sdk23:42
openstackgerritIan Wienand proposed opendev/system-config master: Add OE mirror to inventory
*** yoctozepto3 has joined #opendev23:44
mnaserok found a potential issue with image uploads i guess23:44
mnaser"OSError: timeout during read(8388608) on wsgi.input"23:44
*** yoctozepto has quit IRC23:45
*** yoctozepto3 is now known as yoctozepto23:45
clarkbmnaser: the condition in sdk seems to be if filename or data then self._upload_image23:48
clarkbI think that implies to me that it really shouldn't be using import23:48
clarkb(because our images are local files so it will do the direct upload)23:48
mnaserclarkb: i think it is direct upload, but it seems the glance exception handler always calls unstage23:48
mnaserwhich tries to rm the staged file, even if its not being imported23:48
mnaserso that's more of a fallout23:48
clarkbgotcha and potentially a glance bug?23:48
mnasermaybe -- haven't grasped the whole thing, but i do see some wsgi save timeouts23:49
mnaserFailed to upload image data due to internal error: OSError: timeout during read(8388608) on wsgi.input23:49
openstackgerritIan Wienand proposed opendev/ master: Add replacement OE mirror
clarkbwow we have 4 different ssh host key types?23:50
mnaserthe upload time does correspond to the same failure time though23:51
mnaserso likely that23:51
ianwclarkb: interesting actually, that's with ssh-keygen -r on the host, the others i generated via -D ... i wonder if it's different external v internal23:57
ianwthe others are missing "2"23:58
clarkbcould that be old dsa keys that remote queries just ignore?23:58
