Monday, 2021-11-01

*** ykarel is now known as ykarel|lunch09:47
*** sshnaidm is now known as sshnaidm|afk10:09
*** rlandy is now known as rlandy|ruck10:33
*** ykarel|lunch is now known as ykarel11:23
*** dviroel|out is now known as dviroel|rover11:35
*** sshnaidm|afk is now known as sshnaidm11:39
stephenfinWe're seeing regular failures on the nodepool-build-image-siblings job in openstacksdk. Looks like opensuse have done something funky with their mirrors and there's a checksum failure occurring as a result? https://zuul.opendev.org/t/openstack/build/a21d7820d59c4416a5c6171491927192/log/job-output.txt#2409011:40
stephenfinAh, never mind, looks like this was fixed with cce7dbc669cea619766b4bbe7e310f75b9461d2d11:46
stephenfinsorry, Ica62392ebf4a665a04cd65458dda9e0a7545ccc811:46
fungistephenfin: so it's solved? just making sure i don't need to dig into that13:34
*** ykarel is now known as ykarel|away13:41
clarkbya that change should fix it. We don't know what was up with that OBS repo but realized we didn't need to use it anymore now that we were updated to bullseye15:51
*** Guest3739 is now known as diablo_rojo_phone16:35
*** diablo_rojo_phone is now known as Guest462516:36
*** Guest4625 is now known as diablo_rojo_phone16:37
afaranhaHi timburke, fungi for the patch on swift that you guys are reviewing https://review.opendev.org/c/openstack/swift/+/796057 , I was trying to identify where tests were failing, I was able to track to this line https://opendev.org/openstack/swift/src/branch/master/swift/obj/diskfile.py#L265 on the "xattr.setxattr". Do you folks know what could be the issue with the set xattr? 17:46
clarkbthat setxattr is what runs out of space?17:52
fungiafaranha: without digging into write_metadata() calls, it's hard to guess what the side of those pickled structures would be17:52
fungiclarkb: yeah, at least from the job log it's evident that xattr calls are involved17:52
fungis/side/size/17:53
afaranhafungi, one thing is that if you check the test https://opendev.org/openstack/swift/src/branch/master/test/functional/s3api/test_object.py#L361 when I try the header with only "!" it fails17:53
clarkbya ok so possible that you're hitting a size limit on the xattrs and not actual fs storage (which would explain why the fs percentage used is minimal)17:53
afaranhaI think some characters are making this go south17:53
fungiwhat's curious is that it only seems to be happening when the server is rebooted into fips compliant mode17:54
afaranhaI tried creating a file and testing on it what the write_metadata does, as fd I put a file name I created with "touch" and the metadata I copied from the tests, I put some prints and copied it17:54
afaranhaI can try making some changes to this file "test", but I don't know what I could do17:55
afaranha"OSError: [Errno 28] No space left on device: b'test'"17:55
afaranhaclarkb, for my test I created with plenty of space, but the issue was the file itself17:55
clarkbright xattrs for files are separate from the normal disk space storage17:56
afaranhawhat restrictions does xattr have?17:56
clarkbI want to say the reason swift uses xfs is related to the size of them in the first place (other fses don't allow as much data to be set)17:56
clarkbhttps://man7.org/linux/man-pages/man7/xattr.7.html17:57
clarkb"In the JFS, XFS, and Reiserfs filesystem implementations, the limit on bytes used in an EA value is the ceiling imposed by the VFS."17:58
clarkbattribute names are 255 bytes, attribute values limited to 64kb17:58
clarkbIs it possible that fips is changing the VFS limitations?17:59
fungior possible fips-compliant algorithms are changing the length of the metadata itself?17:59
clarkbthat also seems possible18:00
clarkbyou can probably try to write 64kb -1 bytes, then 64kb bytes, then 64kb +1 bytes and see if it works18:00
fungilike is something somewhere switching from an md5 hash to sha2-256?18:00
clarkbthen binary search for a limit if none of those work18:00
clarkbthen if 64kb is the actual limit you need to determine why you are writing more than 64kb which could be a different hash output as fungi suggests18:01
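The probing approach clarkb describes can be sketched in Python. This is a minimal illustration, not something run in the channel: `find_max_size` and `xattr_probe` are hypothetical helper names, the attribute name `user.probe` is an assumption, and the search assumes writes succeed for every size up to the limit and fail above it.

```python
import os


def find_max_size(can_write, lo=1, hi=1 << 20):
    """Return the largest size in [lo, hi] for which can_write(size) is True.

    Assumes can_write is monotonic (succeeds up to some limit, fails above).
    Returns None if even the smallest size fails.
    """
    if not can_write(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always shrinks
        if can_write(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def xattr_probe(path, name="user.probe"):
    """Build a probe that tries to store `size` bytes in one xattr on `path`.

    The attribute name is an assumption chosen for illustration (Linux only).
    """
    def can_write(size):
        try:
            os.setxattr(path, name, b"x" * size)
            return True
        except OSError:
            return False
    return can_write
```

On a Linux test node this could be called as `find_max_size(xattr_probe("test"))` against a scratch file, which would report the largest single value the filesystem accepted for that file.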
fungibut yeah, it could also be that the xfs driver changes the internal representation of the xattr blobs, effectively reducing how long they can be18:02
timburkefwiw, that xattr_size kwarg is never used (outside of unit tests, presumably), so every setxattr() call during func tests will be sending at most 64k18:04
timburkethe overall length of metastr is most likely in the 100s of kb; for vanilla swift it's usually well below the 64k chunking that's going on, but having encryption enabled will significantly increase the metadata sizes18:04
timburkei did get around to bringing up a fips-enabled VM locally and verifying that 64k xattrs were still allowed, but i don't think i'd put *that* many of them on any single file18:05
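The 64k chunking timburke mentions can be illustrated roughly as follows. This is a sketch of the general idea only, not Swift's actual `write_metadata()`: the key prefix, the pickle protocol, and the function name `chunk_metadata` are assumptions made for the example, and nothing is written to disk.

```python
import pickle

# Assumed key prefix, used here only for illustration.
METADATA_KEY = "user.swift.metadata"


def chunk_metadata(metadata, xattr_size=65536, protocol=2):
    """Pickle `metadata` and split it into (xattr_key, chunk) pairs.

    Each chunk holds at most `xattr_size` bytes. The first key has no
    numeric suffix; later keys get 1, 2, ... appended.
    """
    metastr = pickle.dumps(metadata, protocol)
    chunks = []
    index = 0
    while metastr:
        key = METADATA_KEY + (str(index) if index else "")
        chunks.append((key, metastr[:xattr_size]))
        metastr = metastr[xattr_size:]
        index += 1
    return chunks
```

With this layout a small object produces a single attribute, and each `setxattr()` call never carries more than `xattr_size` bytes, which matches the "at most 64k" remark above.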
fungiso should the test(s) triggering that condition be adjusted to write less data into it?18:06
timburkeno -- the test is within expectations for production data, so if something about the combination of xfs and fips mode means we can't write it all, it's an indication that swift cannot be run under fips mode in production yet18:08
fungigot it18:09
timburkeswift definitely prefers xfs in part due to the larger allowed xattr sizes -- iirc (some versions of?) ext3/4 would have limits down closer to 4k, and performance would suffer because of the relatively large xattrs we want to write18:10
clarkb"The list of attribute names that can be returned is also limited to 64 kB (see BUGS in listxattr(2))." I wouldn't expect this to be the issue if you are getting errors on writes. But if it was a problem with incomplete data coming back that might be related18:10
clarkbtimburke: ext* limits you to one block18:10
clarkbso depends on the fs block size18:10
afaranhaclarkb, fungi I'm not sure it's something limited to the bytes on the metadata, I tried just setting "!" (from the test) on the metadata, and it doesn't work18:11
clarkband that limit applies to all the metadata for a file. So ya fairly limiting18:11
clarkbafaranha: right but is that the actual data written?18:11
fungiyeah, the error seen in the job log is basically enospc, it's indistinguishable from a full filesystem except that it crops up when calling xattr18:11
clarkbafaranha: fungi is suggesting that encryption or similar could be tripping it18:11
afaranhaI can check the metadata value before the calling of setxattr, do you suspect this is being changed on this method call or before?18:13
fungii don't have any precise suspicions, merely vague guesses as to what it could possibly be18:14
afaranhawait a minute18:14
fungimost of what i know about xfs's extended attributes came from the discussion here in the past few minutes ;)18:14
fungisame for swift's use of them18:15
afaranhaI just ran it 2 times, first time it didn't work, last time it worked fine18:15
afaranhalet me try again18:15
afaranhafor reference, I'm running using this command: tox -e func-encryption-py3 test.functional.s3api.test_object.TestS3ApiObject.test_put_object_weird_metadata18:15
stephenfinfungi: yup, all good (the failures on the nodepool-build-image-siblings job in openstacksdk)18:15
afaranhaand test was modified to this: https://paste.openstack.org/show/810316/18:16
afaranhaI ran it again, and it got stuck as before18:16
afaranhaMETADATA: {'X-Timestamp': '1635790594.21114', 'Content-Type': 'binary/octet-stream', 'Content-Length': '10', 'ETag': '7d721f6bd24977788449b41a0b7ac912', 'X-Object-Sysmeta-S3Api-Acl': '{"Owner":"test:tester","Grant":[{"Permission":"FULL_CONTROL","Grantee":"test:tester"}]}', 'X-Object-Transient-Sysmeta-Crypto-Meta-!': 'aQ==; swift_meta=%7B%22cipher%22%3A+%22AES_CTR_256%22%2C+%22iv%22%3A+%22w6REfMVgZRLLWFdVG86U2w%3D%3D%22%7D', 'X-Object-Transient-18:16
afaranhaSysmeta-Crypto-Meta': '%7B%22cipher%22%3A+%22AES_CTR_256%22%2C+%22key_id%22%3A+%7B%22path%22%3A+%22%2FAUTH_test%2Fbucket%2Fobject%22%2C+%22v%22%3A+%222%22%7D%7D'}18:16
clarkbI think the next debugging step is find out what exactly is being written to the metadata and do it out of band by hand18:20
clarkband try to reproduce it18:20
afaranhaafter running the: metastr = pickle.dumps(_encode_metadata(metadata), PICKLE_PROTOCOL); metastr value is https://paste.openstack.org/show/810318/18:20
afaranhaack18:20
timburkepretty short, <2k18:21
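To check a size estimate like the "<2k" above without eyeballing a paste, the pickled length can be measured directly. This is a minimal sketch: it pickles a plain dict with an assumed protocol of 2 and skips whatever encoding Swift applies first, so the number is only an approximation of what reaches `setxattr()`. `pickled_size` is a hypothetical helper name.

```python
import pickle


def pickled_size(metadata, protocol=2):
    """Return the number of bytes produced by pickling `metadata`."""
    return len(pickle.dumps(metadata, protocol))


# Shortened sample in the shape of the metadata pasted in the channel.
sample = {
    "Content-Type": "binary/octet-stream",
    "Content-Length": "10",
    "ETag": "7d721f6bd24977788449b41a0b7ac912",
}
print(pickled_size(sample))
```

Comparing this number between a passing and a failing run would show whether the payload size varies at all from run to run.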
timburkewas that value one that failed, or succeeded?18:21
clarkbalso I'm not sure anything here is special to opendev or infra. This is likely to be whatever platform you are on (centos 7/8?) + fips + swift related18:22
afaranhait failed18:22
afaranhacentos8 + fips; But I tried creating and running a Centos8 VM without fips, and the tests passed, then I enabled fips and the tests passed again18:23
afaranhawe were only able to reproduce the issue so far, on the CI server that fungi reserved for this investigation18:23
clarkbafaranha: right we know the tests pass generally without fips enabled. When you did it locally did you use a file based block device for the filesystem?18:24
clarkbI suppose it is possibly related to that implementation detail somehow as well18:24
afaranhalocally I just run the tests using tox, I don't know yet how it does it18:24
fungii think the relevant tests get skipped if that file isn't created and mounted?18:25
clarkbI think this may actually be it18:27
clarkbtools/test-setup.sh runs to create the xfs filesystem18:27
afaranhais there a way to force it to be run?18:27
afaranhaor can I just run tools/test-setup.sh and then tox?18:27
fungiby... running it18:27
clarkbthe CI jobs then set TMPDIR: /home/zuul/xfstmp. But I'm not sure that tox passenvs' TMPDIR by default18:27
fungiour ci jobs install any packages bindep indicates should be installed, run any tools/test-setup.sh script which is present in the repo, and then call tox with the specified env18:28
clarkbit's possible that it is failing because we're using ext4 and that is limited to 204818:28
clarkb(or whatever our block size is)18:28
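The "whatever our block size is" question can be answered from Python with `os.statvfs`. This is a rough sketch: `fs_block_size` and `fits_in_one_block` are hypothetical helpers, and the `overhead` allowance for xattr headers and names is an assumed round number, not a figure taken from the ext4 on-disk format.

```python
import os


def fs_block_size(path):
    """Return the block size reported by the filesystem holding `path`."""
    return os.statvfs(path).f_bsize


def fits_in_one_block(xattr_bytes, block_size, overhead=64):
    """Rough check that all xattr data for a file fits in one block.

    `overhead` is an assumed allowance for headers and attribute names.
    """
    return xattr_bytes + overhead <= block_size
```

For example, `fits_in_one_block(1900, fs_block_size("/tmp"))` gives a quick hint whether a payload of about 1900 bytes is near the one-block ceiling on the filesystem backing `/tmp`.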
clarkbhttps://tox.wiki/en/latest/config.html#conf-passenv I think that may be it18:29
fungiafaranha: note that tools/test-setup.sh needs to be run with root permissions, but then tox should be run as your testing account rather than root18:29
clarkband fips is pushing over the limit18:29
clarkbit works locally because you run it on a centos8 with an xfs filesystem and the special file doesn't need to exist18:29
clarkbwhen run on our CI system on ext4 you need to create the filesystem and use it but I think the tests may not be using it18:29
afaranhaack18:29
fungioh, right, if centos uses xfs for its rootfs then those tests won't be skipped18:29
clarkbtimburke: ^ fyi this may be a more general problem18:30
clarkbfungi: centos default is xfs but we ext4 everything by default with dib (for sanity)18:30
fungiright, that's what i meant18:30
fungias to why the tests weren't being skipped on the manually set up test server, it likely used xfs for its rootfs18:31
clarkbalso TMPDIR is used more broadly isn't it? I think you should maybe use a different variable to set that path?18:31
clarkbfungi: well it seems the tests aren't being skipped on our CI machines either18:31
clarkb(otherwise why do they fail?)18:31
fungiclarkb: so why would passenv be posing a problem only for the fips runs and not the rest of the time?18:31
clarkbfungi: because fips is using bigger encryption data18:32
fungii didn't mean to imply the tests were being skipped in ci18:32
clarkbfungi: I'm saying that i don't think the xfs block file setup in upstream CI is being used18:32
clarkbfungi: because we are not properly passing the env var through. Unless swift just goes for it18:32
clarkbthat implies we are not skipping any tests based on the underlying fs18:33
timburkefwiw, the encryption job should be using the same algos whether fips is enabled or not -- it shouldn't impact the metadata size18:33
fungiand somehow in non-fips mode that ext4 fs has enough room to pass the tests anyway?18:33
clarkbfungi: right18:33
fungicould ext4's behavior be possibly changing under fips mode?18:33
clarkbanother possibility: swift is using TMPDIR to set this implying it relies on whatever is in /tmp (by default) or the override18:33
clarkbmaybe when you boot in fips mode /tmp is a different fs type18:34
clarkbfrom ext4 -> tmpfs or vice versa type of thing18:34
afaranhait just got complicated to me now D=18:34
clarkblet me get some links to explain my suspicion18:34
timburkefwiw, from https://tox.wiki/en/latest/config.html#conf-setenv -- "Some variables are always passed through to ensure the basic functionality of standard library functions or tooling like pip ... Others (e.g. UNIX, macOS): TMPDIR"18:35
clarkbhttps://opendev.org/openstack/swift/src/branch/master/tools/test-setup.sh#L13 this is where test-setup.sh mounts the xfs filesystem at $HOME/xfstmp18:35
clarkbtimburke: ah ok. tox does the right thing then?18:36
timburkeshould18:36
clarkbnext theory: is the reboot for fips happening before or after we mount that filesystem?18:36
clarkbit isn't being put in fstab so if we reboot after the mount then you'll lose the mount and be on ext418:36
clarkbis there a link to the failing job log?18:37
afaranhaclarkb, https://zuul.opendev.org/t/openstack/build/8fefe2da3d754c9484f2cdd2090eb484/logs18:38
clarkbyup we run test-setup.sh first then reboot so the mount is lost18:39
fungiaha, so need to reorder the roles being included there18:39
fungigood find!18:39
clarkbhttps://zuul.opendev.org/t/openstack/build/8fefe2da3d754c9484f2cdd2090eb484/console shows this. unittests/pre.yaml runs test-setup.sh and enable-fips.yaml happens later18:39
clarkbfungi: or we should add that fs to fstab (that might be a bit heavy handed if it runs on say your laptop)18:40
fungithis would probably have sorted itself out soon anyway, since at/after the ptg we talked through making the fips setup role run much earlier, because of similar issues with stateless multinode setup18:40
clarkband I don't think the tests are skipped on ext4. They are run and fail. Would be curious to know if they fail without fips too (they probably do)18:41
fungilike if the setup fips role was just replaced by a simple reboot18:41
clarkbafaranha: on your test node you can mount the xfs filesystem and set TMPDIR to that path and run tests to make sure it works18:41
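Before rerunning the tests, it can help to confirm which filesystem actually backs the TMPDIR path, since a lost mount silently falls back to the parent filesystem. The sketch below parses text in the `/proc/mounts` format and picks the longest matching mount point. `fs_type_for` is a hypothetical helper; it does not resolve symlinks or decode escaped characters in mount paths.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format: device, mount point,
    type, options, ... The longest matching mount point wins.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type
```

On the held node this could be called with the contents of `/proc/mounts` and the value of `TMPDIR`. A result of `ext4` rather than `xfs` for `/home/zuul/xfstmp` would be consistent with the mount having been lost on reboot.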
afaranhaI'm trying to follow, but let me ask something, if it reboots after mounting the xfs block, that means the test will be skipped?18:42
afaranhaso we shouldn't see any issue with fips?18:42
fungiapparently no, the test gets run anyway, just tries to write metadata into ext4 instead of xfs18:43
afaranhabut why, if I write "b" as the metadata in the test, does it work, but "!" doesn't?18:43
clarkbluck of encryption size?18:44
afaranhaand it worked once with "!"18:44
clarkbif you are close to the limit a few bytes either way encrypting things with non-deterministic length outputs could do it18:44
afaranhaby luck, you mean the encryption resulted in small metadata?18:44
clarkbyes or large when it fails18:44
afaranhaokay18:44
afaranhalet me try the setenv on the tox then18:45
fungiregardless, the fips setup will have to happen earlier in the job, because the reboot could clear away any number of stateless things done as part of the job setup18:45
fungijust like we saw with the multinode jobs losing their network configs18:45
clarkbright, mostly just thought confirming this was the issue really quickly would be good then we can delete the held node and fix this by reordering steps18:45
clarkbside note to the fips stuff: I don't think it is an issue yet but we should be wary of creating two sets of CI testing for projects that are entirely identical but for fips18:48
clarkbWe can probably get away with asserting if it works under fips then it will work without fips and drop half the jobs18:48
clarkbbut I'd need to think that through a bit more18:48
clarkbor do targeted testing (and focus on functional testing?) of fips18:49
afaranha[testenv:func-encryption-py3]18:49
afaranha[...]18:49
afaranhapassenv = TMPDIR=/home/zuul/xfstmp18:49
afaranhalike this right?18:49
clarkbafaranha: no, timburke  pointed out that TMPDIR is automatically passed through by default. You need to mount the filesystem to /home/zuul/xfstmp then when you run tox you need to do it like: TMPDIR=/home/zuul/xfstmp tox -e py36 -- something18:50
clarkbafaranha: https://opendev.org/openstack/swift/src/branch/master/tools/test-setup.sh#L13 that is the mount command that test-setup.sh uses18:51
afaranhaokay, passed 3 times, let me try again18:53
afaranha:O18:53
afaranhaI think it's fixed18:53
clarkbcool so ya you need to reorder the steps the job takes then it should work18:56
afaranhathank you all o/18:57
afaranhaI'll leave now (EMEA timezone) hopefully we can send a patch tomorrow to have the tests for swift working :D18:58
*** dviroel|rover is now known as dviroel|out21:57
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly for some security updates, but should return to service momentarily22:10
