Monday, 2020-06-15

fungiahh, yeah, so it's taking a long time even when rsync hasn't run00:00
ianwhttps://lists.openafs.org/pipermail/openafs-info/2019-September/042865.html includes a link to afs audit logs from an rsync run00:00
fungiahh, right, this is ringing a bell now00:02
fungibut https://review.opendev.org/681367 didn't actually solve it?00:02
ianwfungi: it seems not, the graph is showing every release takes ~8 hours00:05
ianwthis is what led to the path of doing the release via ssh and -localauth00:07
ianwfungi: perhaps we should leave mirror-update off for a bit and investigate again?00:09
fungiyeah, not a bad idea00:10
ianwfor a start, when fedora gets in sync, we could turn on file auditing and run a "vos release" with a zero-delta and see what happens00:10
openstackgerritMerged openstack/project-config master: Drop pip-and-virtualenv from images  https://review.opendev.org/73442800:29
openstackgerritMerged openstack/project-config master: Use https apt mirrors for image builds  https://review.opendev.org/73536200:30
auristorianw fungi: afs vice partitions should be noatime but that won't alter the contents of the incremental dumps.00:48
auristorA third fileserver should be added so that there is always a redundant clone in case of a failure of afs01.dfw00:49
*** Meiyan has joined #opendev01:01
*** xiaolin has joined #opendev01:04
ianwauristor: it's probably a bit of a moot point though when it's basically in a constant state of "vos release" (i.e. the next one starts immediately after the previous one finishes)01:06
*** xiaolin has quit IRC01:11
auristornot really.   The point is that while afs01 is updating afs02, there is no valid copy on afs02.   the only consistent copy is on afs01.   which is at 100% network capacity so sending fetches there from clients only makes things slower.   If afs03 existed, then either afs02 or afs03 would be online with a self consistent copy while the release was taking place.01:22
ianwwe have afs01.ord too, i'm not sure if it's deliberately or it's just an accident of history01:26
auristoris anything replicated to it?  mirror.centos for example is not01:27
clarkbianw: ord was used until we hit the window sizing issues01:28
clarkbthe idea was to be offsite for resiliency but that meant copies took forever01:28
ianwhrm, i don't remember that but ok; that probably explains the odd mix of replications we have01:29
auristordocs, docs.dev, mirror, project, project.airship, root.afs, root.cell, and user.corvus01:29
auristorif throughput to ord is a problem, then I suggest standing up a afs03.dfw.01:30
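For reference, adding a read-only site on a third fileserver would follow the usual OpenAFS pattern sketched below; afs03.dfw does not exist yet, so the server name and partition are assumptions, and the step would be repeated per replicated volume:

    # declare the new RO site for the volume, then push a fresh clone to it
    vos addsite -server afs03.dfw.openstack.org -partition a -id mirror.fedora
    vos release mirror.fedora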
auristorI really wish we could figure out some way that auristorfs could be used to host this cell01:30
ianwwe have updating to bionic and 1.8 as an increasingly insistent todo01:31
auristoropenafs 1.8 will help a bit with rx issues but it isn't going to fix most of the underlying issues01:32
ianwauristor: it's still true we shouldn't mix 1.6 and 1.8 servers?  i think that's the assumption we've been working under01:34
auristorabsolutely not01:34
ianwauristor: sorry, we absolutely should not mix them, or it's ok to? :)01:36
auristorthere are no data format or wire protocol changes between 1.6 and 1.8.  mix and match to your hearts content.01:44
auristorwhat command line options are passed to rsync?01:52
ianwauristor: rsync -rltDiz01:53
ianwhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L10101:53
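For readers following along, -rltDiz is shorthand for --recursive, --links (copy symlinks as symlinks), --times (preserve mtimes), -D (preserve devices/specials), --itemize-changes and --compress; a rough sketch of the shape of the invocation in that script, with placeholder paths rather than the exact arguments:

    rsync -rltDiz <upstream-rsync-module>/ /afs/.openstack.org/mirror/fedora/<path>/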
ianwlooks like everything has released now02:04
auristorvos status afs02.dfw reports no transactions02:05
ianwi can try a release on fedora now and see what happens02:06
ianwsince the update server is shutdown, nothing has written to it02:06
ianwif we want i can restart with audit logging02:08
auristorI don't think there is any interesting audit logging for the release.   its the rsync that is interesting from my perspective.02:08
fungiyep, confirmed, the fedora and opensuse volume releases did finally complete some time in the last few minutes02:11
auristorAs we discussed many months ago, the vos release is going to send all directories and any files that changed from five minutes before the last release time.   The last release time was 2s after the last update time.02:11
auristors/five minutes/fifteen minutes/02:15
ianwauristor: yeah, that's why we put in the sleep https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L15202:21
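A rough sketch of the sequence the update script follows, per the links above; exact commands and options live in fedora-mirror-update, and the release-via-ssh detail comes from the earlier -localauth discussion:

    rsync -rltDiz <upstream-rsync-module>/ /afs/.openstack.org/mirror/fedora/<path>/   # write into the RW volume
    sleep 1200                                                                         # stay clear of the ~15 minute pre-release window
    ssh afs01.dfw.openstack.org vos release mirror.fedora -localauth                   # publish to the RO sites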
ianwistr we did try that experiment, running multiple releases02:23
auristorianw: instead of performing a "vos release" that will require network bandwidth and taking the afs02.dfw volume offline, could you execute02:23
auristor  vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 15:04"02:23
ianwVolume: 53687100702:24
ianwdump_size: 30672504164602:24
auristorand remove the -time switch and parameter02:25
auristorThat is effectively the entire volume02:26
ianwVolume: 53687100702:26
ianwdump_size: 30684082258202:26
ianwecho $(( 115780936 / 8 / 1024 / 1024))02:27
ianw1302:27
ianw~13 gb difference ?02:27
auristorwhy dividing by 8?02:28
ianwoh it's bytes02:28
auristor110MB difference which is nothing02:29
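For clarity, the dump sizes above are byte counts, so the delta works out like this (no division by 8 needed):

    echo $(( (306840822582 - 306725041646) / 1024 / 1024 ))
    # 110   -> roughly 110 MB between the full dump and the -time limited estimate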
auristorif you specify the time as "2020-06-14" what do you get?02:29
ianwVolume: 53687100702:30
ianwdump_size: 1561318702:30
auristorthe times listed by vos examine are local times.  So I'm giving you EDT.   Use vos examine mirror.fedora from the machine the vos size command is being executed on and use that time02:31
auristorLast Update time02:31
ianwall, all the hosts run in UTC02:31
auristorvos doesn't02:32
ianwi'm doing this on afs0102:32
auristorI'm not on afs01.  So my Last Update Sat Jun 13 15:04:11 202002:32
ianwLast Update Sat Jun 13 19:04:11 202002:33
auristorProvide that time to vos size02:33
ianwianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 19:04:11"02:34
ianwVolume: 53687100702:34
ianwdump_size: 1561326602:34
auristor14MB which will be the size of the directories02:34
auristorsubtract 15m from that time and what do you get?02:35
ianw$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 18:45"02:35
ianwVolume: 53687100702:35
ianwdump_size: 1561326602:35
auristorthe problem isn't the incremental dump02:36
auristorrsync the content from mirror.fedora.readonly to mirror.fedora.    That should be "no change"    Then perform the "vos size with -time "2020-06-13 18:45"" again02:38
ianwumm, ok, i want to be very careful i don't destroy things with an errant command :)02:40
auristoryou can copy mirror.fedora to a new volume02:40
auristorvos copy -id mirror.fedora -fromserver 104.130.138.161 -frompart a -toname test.fedora -toserver 104.130.138.161 -topart a02:43
auristorthen mount test.fedora so you can rsync to it02:43
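A minimal sketch of mounting the copy once vos copy finishes; the mount point path is an assumption, any writable directory in the RW tree would do:

    fs mkmount /afs/.openstack.org/mirror/test-fedora test.fedora
    # and to clean up afterwards:
    fs rmmount /afs/.openstack.org/mirror/test-fedora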
ianwok, i just have a dry-run going anyway to see what it thinks about things02:44
ianwrsync -avz  --dry-run /afs/openstack.org/mirror/fedora/ /afs/.openstack.org/mirror/fedora/ reports nothing to do02:45
auristorthose aren't the rsync options you indicated earlier02:46
auristorof -rltDiz the most interesting is -t02:47
ianwhttps://static.opendev.org/mirror/logs/rsync-mirrors/fedora.log02:50
ianwdoes have verbose logging on that should show if rsync touches anything02:50
ianwthat's the itemize changes (-i) which will show why it updated files02:51
auristorthe behavior I observed was that rsync didn't update the data but it set the last update time on files it didn't modify02:52
ianwthe vos copy i guess will take a while02:55
auristorsadly its performed via rx over loopback02:55
ianwi can strace the rsync to see exactly what it touches02:56
auristorthe fileserver audit log would tell as well02:56
ianwright, i'm pretty sure that's what i got @ http://people.redhat.com/~iwienand/fedora-mirror-11-09-2019.tar.gz02:57
auristorI wonder if this is the problem with the openafs client02:58
auristor    ip->i_mtime.tv_sec = vp->va_mtime.tv_sec;02:58
auristor    /* Set the mtime nanoseconds to the sysname generation number.02:58
auristor     * This convinces NFS clients that all directories have changed02:58
auristor     * any time the sysname list changes.02:58
auristor     */02:58
auristor    ip->i_mtime.tv_nsec = afs_sysnamegen;02:58
auristorin other words, the nsec component of the mtime reported by the openafs client is not going to match the nsec time that rsync obtains from the source02:59
auristorif the data hasn't changed, rsync won't rewrite it. but with -t it will try to fix the mtime03:00
auristorIn the FileAuditLog you are looking for AFS_SRX_StStat events03:01
ianwi feel like that would show in itemized-changes03:02
auristorAFS_SRX_StStat events for a FID without a AFS_SRX_StData event03:02
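To illustrate the mismatch being described, with a hypothetical file and illustrative output: the sub-second part of the mtime seen through the OpenAFS client is the sysname generation number rather than the stored value, so it can never equal what rsync reads on the source side:

    stat -c '%y' somefile.rpm                                    # on the rsync source
    # 2018-12-07 10:58:03.202155000 +0000
    stat -c '%y' /afs/.openstack.org/mirror/fedora/somefile.rpm  # through the OpenAFS client
    # 2018-12-07 10:58:03.000000001 +0000   (nanoseconds come from afs_sysnamegen)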
ianwi think maybe if i bring mirror-update back online, and get in there fast and take the update lock, then i should be able to run the exact rsyncs under strace03:04
ianwthat seems the lowest impact way to get data right now03:04
auristorok03:06
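A minimal sketch of that kind of tracing run; limiting strace to the interesting syscalls keeps the output manageable, and the module/path placeholders are illustrative rather than the script's real arguments:

    strace -f -tt -e trace=utimensat,openat -o /tmp/rsync.trace \
        rsync -rltDiz <upstream-rsync-module>/ /afs/.openstack.org/mirror/fedora/<path>/
    grep utimensat /tmp/rsync.trace | head    # mtime-only updates show up here without matching data writes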
ianwok, i've commented out the cron run and will update the script and run manually03:11
ianwit's running in a screen on mirror-update03:16
ianwlogging to ~ianw/rsync-run03:17
ianwlstat("Modular/x86_64/os/Packages/p/perl-Time-Piece-1.31-415.module_2570+32b47dc0.x86_64.rpm", {st_mode=S_IFREG|0644, st_size=43780, ...}) = 003:17
ianwutimensat(AT_FDCWD, "Modular/x86_64/os/Packages/p/perl-Time-Piece-1.31-415.module_2570+32b47dc0.x86_64.rpm", [UTIME_NOW, {tv_sec=1544180283, tv_nsec=202155000} /* 2018-12-07T10:58:03.202155000+0000 */], AT_SYMLINK_NOFOLLOW) = 003:17
ianwis basically it03:17
ianwthis isn't a zero delta, it's bringing in a bunch of stuff from upstream03:21
ianwok, it's into that "+ sleep 1200" period03:22
ianwianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871006 -dump03:42
ianwVolume: 53687100603:42
ianwdump_size: 30681606228503:42
ianwianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871006 -dump -time "2020-06-15 03:00"03:42
ianwVolume: 53687100603:42
ianwdump_size: 30670028134903:42
ianwi don't know if that is right, but that's 110mb difference from before and now03:42
auristor-time "2020-06-13 18:45"03:44
ianw$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 18:45"03:46
ianwVolume: 53687100703:46
ianwdump_size: 1561326603:46
auristoryou want the incremental dump of the RW03:47
ianwwell the release has started03:49
ianwi've put mirror-update in emergency so the cron job doesn't come back03:52
*** ykarel|away is now known as ykarel03:55
auristorI'm done for the night.03:59
ianwauristor: thanks, i think if we do some manual tracing of zero-delta updates we can get some more info to go off04:01
AJaegerianw: what kind of cleanup is needed after https://review.opendev.org/735301?05:47
openstackgerritMerged openstack/project-config master: Add github sync job for tricircle  https://review.opendev.org/73541705:54
AJaegerianw: I see you left the plain ones in - ok, so no need for cleanup *yet*.05:55
ianwAJaeger: yeah, i'll get rid of everything after it's settled06:00
ianwdelete the zuul-jobs testing, then the nodes can go06:01
AJaegerok06:07
openstackgerritFelix Edel proposed zuul/zuul-jobs master: Return upload_results in upload-logs-swift role  https://review.opendev.org/73356406:19
openstackgerritFelix Edel proposed zuul/zuul-jobs master: Return upload_results in test-upload-logs-swift role  https://review.opendev.org/73550306:19
*** ysandeep is now known as ysandeep|afk06:31
*** priteau has joined #opendev06:34
AJaegerinfra-root, I just saw "Could not connect to mirror.mtl01.inap.opendev.org:443 (198.72.125.6), connection timed " ;(06:50
AJaegerLog: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_91d/735494/5/check/tempest-full-py3/91dbf66/job-output.txt06:50
AJaegerhappens in https://b7511727a7deb59d79f6-083f3205a01a368b196dd0a8486413e5.ssl.cf2.rackcdn.com/735494/5/check/neutron-tempest-linuxbridge/cfb66a5/job-output.txt as well06:51
ianwAJaeger: hrm, it's up and i can talk to it06:51
AJaegerianw: I cannot from here06:52
ianwyeah, apache not talking but the host is06:52
ianwit's been up 200+ days, i'm going to reboot it06:53
ianwthere's nothing in dmesg for over a month06:53
AJaegerthanks06:53
ianwok responding now06:55
ianw#status log rebooted mirror.mtl01.inap.opendev.org due to unresponsive apache processes06:56
openstackstatusianw: finished logging06:56
*** ykarel is now known as ykarel|afk06:56
ianwfungi/auristor: i think the nanosecond comment is honing in on the problems; there's constant calls to utimensat() on no-op rsyncs06:57
ianwmirror-update.opendev.org:~ianw/rsync-run/rsync.3911 is an example06:58
*** ykarel|afk is now known as ykarel07:00
*** iurygregory has joined #opendev07:11
*** tosky has joined #opendev07:28
*** DSpider has joined #opendev07:40
-openstackstatus- NOTICE: uWSGI made a new release that breaks devstack, please refrain from rechecking until a devstack fix is merged.07:41
*** rpittau|afk is now known as rpittau08:00
*** moppy has quit IRC08:01
*** moppy has joined #opendev08:01
*** ykarel is now known as ykarel|lunch08:04
ianwfungi/auristor: i think that's the smoking gun -- http://paste.openstack.org/show/794754/ -- that just uses utimensat to update the mtime.  it's always "1".  i have to think about the implications08:09
ianwis it as easy as dropping "-t"?08:10
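For concreteness, the variant being floated is simply the same flag set without -t (so -rlDiz), sketched below with placeholder paths; per the rsync documentation, dropping -t also disables the mtime quick-check, so whether this is a net win is exactly the open question here:

    rsync -rlDiz <upstream-rsync-module>/ /afs/.openstack.org/mirror/fedora/<path>/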
frickler#status log force-merged https://review.opendev.org/735517 and https://review.opendev.org/577955 to unblock devstack and all its consumers after a new uwsgi release08:15
openstackstatusfrickler: finished logging08:15
hrwmorning08:34
*** ysandeep|afk is now known as ysandeep08:41
*** ykarel|lunch is now known as ykarel08:49
*** priteau has quit IRC09:11
*** priteau has joined #opendev09:21
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540209:25
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540209:35
hrw4 days of weekend were great. but had to end.09:46
hrwhttp://mirror.regionone.linaro-us.opendev.org/ feels weird. does not list anything anymore (did in past). something changed?09:49
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540209:50
ykarellooks like centos mirrors are gone again https://mirror.ca-ymq-1.vexxhost.opendev.org/ or it was not fixed for the provider09:54
ykareljust seen in a job https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_552/727200/73/check/tripleo-ci-centos-8-containers-multinode/5524ccf/job-output.txt09:54
AJaegerinfra-root, any idea? Looks good on https://mirror.mtl01.inap.opendev.org/centos/09:56
hrwhm. looks like mirrors are in a weird state or sth.09:58
hrwlinaro-us one feels empty09:58
priteauIs Zuul a bit slow today? It took 6 minutes between W+1 and starting gate jobs on https://review.opendev.org/#/c/734040/10:01
*** rpittau is now known as rpittau|bbl10:03
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540210:03
*** Meiyan has quit IRC10:05
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540210:13
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540210:28
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540210:35
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540210:49
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:02
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:12
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:18
*** hashar has joined #opendev11:20
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:28
*** ykarel is now known as ykarel|afk11:30
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:35
*** nautics889 has joined #opendev11:45
*** nautics889 has quit IRC11:55
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540211:58
*** rpittau|bbl is now known as rpittau12:04
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540212:07
*** ysandeep is now known as ysandeep|afk12:07
ianwhrw: i dunno ... ls on /afs/openstack.org times out12:07
ianwthere's a lot of messages in there about dropped connections12:08
AJaegerargh ;(12:11
ianwit's annoying and i just rebooted it ... another afs issue to investigate longer term :/12:12
AJaegerthanks, ianw12:19
hrwianw: thanks12:35
*** ysandeep|afk is now known as ysandeep12:45
*** ykarel|afk is now known as ykarel12:46
auristorianw: I'm just returning to my desk. From my reading of the rsync repository the nanosec comparison is a fairly recent addition and -t sends the timestamp to the remote for time optimization.  If -t is not set, then the timestamp comparison optimization is ignored and comparison of the data contents is used exclusively.  In the case of rsync and /afs the timestamp comparison doesn't work anyway, so I think leaving it off is the right call.12:52
*** priteau has quit IRC12:54
*** priteau has joined #opendev12:55
fungiianw: should i check all the mirror frontends to make sure there's not more of them hung, or have you already?13:10
hrwhttps://review.opendev.org/#/c/730331 got refreshed so Kolla now uses wheel cache first and then pypi mirror as a fallback.13:24
mordredhrw: cool13:26
*** hashar has quit IRC13:33
fungiinfra-root: seems there are some jobs failing on afs writes from the zuul executors. i'm going through and checking them one by one, so far i've shutdown the zuul-executor service on ze0113:39
hrwchecking build time difference now13:39
fungier, sorry, on ze0413:39
fungiokay, ze04 seems to have been the only one which couldn't ls /afs/.openstack.org/docs/13:40
corvusfungi: i'm around - need anything?13:41
fungisimilar to the mirrors ianw was looking at, `ls /afs/.openstack.org/` on ze04 is empty13:41
fungicorvus: sanity checks maybe13:41
fungistill just cleaning up from the afs01.dfw outage late saturday utc13:42
auristorfs checkservers -all13:42
openstackgerritAlbin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image  https://review.opendev.org/73540213:42
auristorfs checkvolumes13:42
fungiauristor: sadly, those give me "All servers are running." and "All volumeID/name mappings checked." but `ls /afs/.openstack.org/` is still coming back empty13:43
fungi(on this particular client that is)13:43
fungiinterestingly dmesg there doesn't report any "lost contact" log entries from around or after the outage13:45
*** ysandeep is now known as ysandeep|afk13:46
fungii have a feeling if i restarted afsd and possibly also did an rmmod/modprobe of the openafs lkm, this would go back to normal13:47
fungirebooting the other clients which exhibited similar issues with the ro replicas seemed to solve it, but unfortunately doesn't tell us much about what the actual problem was13:48
openstackgerritDavid Moreau Simard proposed openstack/project-config master: Create a new project for recordsansible/ara-collection  https://review.opendev.org/73543913:49
fungithough this particular client is one out of a redundant cluster of a dozen servers, so we can more easily keep it like this for a bit to poke around13:49
fungiinterestingly it sees the read-only tree under /afs/openstack.org/ just not the read/write tree under /afs/.openstack.org/13:50
openstackgerritDrew Walters proposed openstack/project-config master: Add missing project to Airship doc job  https://review.opendev.org/73487413:52
corvusfungi: i don't have any other ideas13:54
corvusfungi: i agree that a client restart may be in order13:55
fungibeing down one out of twelve executors for a bit is likely fine, so i'm happy leaving it like this in case there are other ideas of things we want to check first13:57
corvusfungi: i ran 'fs flush /afs/.openstack.org' and things have improved14:00
fungicorvus: oh, indeed, that seems to now be returning expected content14:00
fungiso was it possible it cached an empty state for the cell root?14:00
corvusthat's what it looks like14:01
corvusauristor: ^ fyi14:01
openstackgerritJeremy Stanley proposed opendev/system-config master: Forward user-committee ML to openstack-discuss  https://review.opendev.org/73367314:04
corvusdocs volume under that looks fine14:05
fungi`ls /afs/.openstack.org/mirror/` on ze04 is taking several minutes to complete so far14:05
hrw0:05:21.497262 versus 0:23:49.306074 is nice improvement14:07
fungihrw: is that the speedup from using prebuilt wheels?14:08
hrwfungi: yes14:08
fungisignificant!14:08
hrwwe have two images which suck time. waiting for second one14:08
corvusfungi: well, that might call for a reboot :/14:09
fungicorvus: yeah, it's still blocking...14:10
fungii mean, technically the executor shouldn't need to write to /afs/.openstack.org/mirror/ at the moment (though when we get the wheel builder jobs reworked it will)14:11
fungii'm just more worried it's indicative of deeper problems14:11
*** priteau has quit IRC14:12
corvusfungi: agreed.  at this point, i'd suggest we restart the client or reboot (reboot since it's more thorough and no less disruptive)14:14
fungiit just now returned14:16
fungiafter spitting out "ls: cannot access '/afs/.openstack.org/mirror/fedora': Resource temporarily unavailable"14:16
AJaegeralso: I presented something when I visited Amundi in February. Do you need anything else?14:16
AJaegerfungi, https://mirror.ord.rax.opendev.org/centos/7/os/x86_64/Packages/virt-what-1.18-4.el7.x86_64.rpm is failing to download14:17
AJaegergives a forbidden ;(14:17
AJaeger(ignore my first pasto :(14:17
fungifungi@mirror01:~$ ls /afs/openstack.org/mirror/centos/14:18
fungils: cannot access '/afs/openstack.org/mirror/centos/': Connection timed out14:18
corvus'fs checkservers' is unhappy here14:19
fungicheckservers on mirror01.ord.rax.opendev.org is taking a while14:20
corvusThese servers unavailable due to network or server problems:  mirror01.ord.rax.opendev.org.14:20
corvusslighly counterintuitive message :/14:20
fungithat looks like the 127.0.1.1 problem showing up again14:20
corvusiiuc, that was a volume which had its vldb entry set to 127.0.1.1 ?14:21
corvusthose were all fixed, right?14:21
fungithat's what i thought14:21
corvusdmesg says:  [Jun13 20:45] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server)14:22
corvusand no "back up" message14:22
fungithat's when afs01.dfw hung, yeah14:22
corvusmaybe we should go with a reboot here too?14:22
corvusor afsd restart14:23
fungii can give that a shot first14:23
corvusk14:23
hrwshould I use http://mirror.regionone.linaro-us.opendev.org:8080/wheel/debian-10-aarch64/ or http://mirror.regionone.linaro-us.opendev.org/wheel/debian-10-aarch64/ on CI?14:24
hrw:8080 gives 403 ;(14:24
openstackgerritMonty Taylor proposed opendev/zone-opendev.org master: Add review-test  https://review.opendev.org/73560014:25
fungihrw: the wheel cache is served over 80 and 443, 8080 is a proxy14:26
hrwfungi: thanks. was not sure14:26
hrwupdated patch14:27
fungicorvus: i ended up rebooting it because afsd wouldn't stop14:27
corvusi suspected as much :)14:28
fungi#status log rebooted mirror01.ord.rax.opendev.org to clear hung openafs client state14:28
openstackstatusfungi: finished logging14:28
openstackgerritMonty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on  https://review.opendev.org/73560214:28
hrwfungi: need to check whether requirements-tox-py3x-check-uc* jobs in openstack/requirements use the cache too14:30
mordredcorvus, fungi: those two patches ^^ should help me finish standing up review-test so that I can rsync / mysqldump the existing prod content over. I made a private hostvars file for it with what I think is the bare minimum of secrets (we don't need a bunch of the prod ones for this) - and I moved group_vars/review.yaml to host_vars/review01.openstack.org.yaml14:31
corvusmordred: i guess we want to keep review-dev around for testing without production-copy data, which is why this is a new server and not repurposing that?14:32
fungii was about to hard reboot mirror01.ord.rax.opendev.org via api, but oob console just showed it finally giving up waiting on [something] to terminate14:32
AJaegerconfig-core, please review https://review.opendev.org/#/c/734874/ - the starlingx team needs this to prepare for the election14:33
mordredcorvus: yeah - although I think we could also consider merging the two ideas at some point - now that we don't replicate to github, I think we could move gtest to the production gerrit and then have a review-dev like the one I'm setting up for review-test that gets a periodic data rsync from review14:34
mordredbut I didn't want to block upgrade testing on getting that done14:34
corvus++14:34
fungimirror01.ord.rax.opendev.org is back online now and i can `ls /afs/openstack.org/mirror/centos/` successfully14:35
hrwlooks like I will have a change which touches all jobs14:35
openstackgerritMarcin Juszkiewicz proposed zuul/zuul-jobs master: pip.conf: use wheel cache first and fallback to pypi mirror  https://review.opendev.org/73560614:40
hrwcan config-core take a look at ^^?14:40
hrwI hope that commit message is clear enough14:40
fungihrw: it's not clear to me why that's necessary. pypi doesn't try things in sequence, it pulls all the indices and then decides what to download14:42
fungiextra-index-url isn't a "fallback" it's just yet another index it incorporates14:42
fungiotherwise our wheel cache wouldn't work for any architecture14:43
hrwah. so maybe I mixed it up with :8080 being used for the cache at the same time14:44
hrwdropped14:45
fungiyeah, our "pypi_mirror" is a caching proxy, our "wheel_mirror" is served directly by apache14:45
hrwthanks fungi14:45
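For reference, the pip invocation Kolla ends up with is roughly of this shape; pip merges every configured index rather than trying them in order, so "fallback" is really "additional index". The host, port and paths below are illustrative, patterned on the mirror URLs seen in this log, and would normally come from pip.conf via the mirror variables:

    pip install setproctitle \
        --index-url http://mirror.regionone.linaro-us.opendev.org:8080/pypi/simple/ \
        --extra-index-url http://mirror.regionone.linaro-us.opendev.org/wheel/debian-10-aarch64/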
openstackgerritMerged openstack/project-config master: Add missing project to Airship doc job  https://review.opendev.org/73487414:46
hrwINFO:kolla.common.utils.openstack-base:  Downloading http://mirror.regionone.linaro-us.opendev.org/wheel/debian-10-aarch64/setproctitle/setproctitle-1.1.10-cp37-cp37m-linux_aarch64.whl (37 kB)14:47
*** sgw has quit IRC14:47
hrwyes ;D14:47
mnaserhi friends -- appreciate reviews on https://review.opendev.org/#/c/735478/14:48
fungihrw: yeah, if you want to see the details, the mirror servers use this vhost configuration: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j214:48
*** sgw has joined #opendev14:50
openstackgerritJames E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support  https://review.opendev.org/72030214:50
fungihrw: according to that config, we actually make the pypi proxy available over 80/443/8080/4443 (because the proxy statements for it are included in both the basemirror and proxymirror macros)14:53
fungithough we likely set the mirror vars to use 8080/4443 because the 80/443 proxies are just backward compatibility from when we used to host our own mirror of pypi (before it got far too large)14:54
hrwhttps://7a40b7c4a1adb2feec0f-f29e759a440a8c469e5909803b48c54b.ssl.cf1.rackcdn.com/735599/1/check-arm64/requirements-tox-py38-check-uc-aarch64/3038d7d/tox/py38-check-uc-1.log works lovely ;D14:55
fungiexcellent14:55
clarkbfungi: corvus I don't think centos was one with 127.0.1.1. The centos wheel mirror for arm64 was. The centos wheel mirror for x86 was not but it was accidentally cleaned up and recreated14:55
auristorfungi: sorry, I had to step away.   I wonder if the location server list for the cell became corrupted.14:55
*** ysandeep|afk is now known as ysandeep14:56
*** ykarel is now known as ykarel|away14:56
fungiclarkb: yes. entirely possible checkservers was still trying to find 127.0.1.1 though even though we deleted and recreated those volumes14:56
mnaserthanks corvus and AJaeger :D14:56
clarkbfungi: but that volume was never part of the 127.0.1.1 problem?14:56
hrwfungi: does RETRY_LIMIT on https://zuul.openstack.org/builds?job_name=requirements-tox-py38-check-uc-aarch64 mean 'we need more hosts'?14:56
clarkbor do you think that could have affected other volumes somehow?14:57
fungiclarkb: right, that volume wasn't, i was just speculating on why the checkservers command was reporting the local hostname for the client as unavailable14:57
clarkbgot it14:57
funginot necessarily related to the volume access issue14:57
fungiauristor: how do i query the location server list?14:57
* fungi checks docs14:57
AJaegerhrw: RETRY_LIMIT normally means: pre-playbook failed, was retried and Zuul gave up after three tries14:58
fungiahh, the vldb14:58
openstackgerritMerged opendev/zone-opendev.org master: Add review-test  https://review.opendev.org/73560014:59
hrwAJaeger: thx14:59
fungithe sites listed for mirror.fedora look correct (rw and ro on afs01.dfw.openstack.org, ro on afs02.dfw.openstack.org)14:59
auristorafs clients do not forget fileservers addresses once they've been told about them.   only a restart will clear the known fileserver list14:59
auristorthere is no 127.0.1.1 fileserver entry in the VLDB at this time15:00
fungiauristor: got it, that likely explains the checkservers error hanging around15:00
auristorfs checkvolumes should discard the known volume to fileserver address bindings.15:00
auristorif /afs/.openstack.org/ is not accessible that sounds like a bug in the dynamic root logic.   fs flush /afs or fs flush /afs/.openstack.org might clear it.15:03
auristorI don't remember if "fs lsm /afs/.openstack.org" works for OpenAFS on dynamic root entries.15:03
fungiauristor: yes, `fs flush /afs/.openstack.org` did clear it according to corvus15:03
openstackgerritMerged openstack/project-config master: Add vexxhost/atmosphere  https://review.opendev.org/73547815:05
*** mlavalle has joined #opendev15:05
*** sgw1 has joined #opendev15:06
auristorThat sounds like corruption of the dynamic root entry15:08
openstackgerritMerged zuul/zuul-jobs master: Add namespace in the collect-k8s-logs role  https://review.opendev.org/73131915:08
openstackgerritMonty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test  https://review.opendev.org/73561015:10
mordredcorvus: ^^ does that seem like a sane sync playbook?15:11
mordredcorvus: my thinking is that if we shut down gerrit, sync the git repos, the indexes and the caches, apply the most recent mysqldump - we should be in a pretty equivalent state, yeah?15:16
mordredso we can then do a test migration, see how it goes, then just do a state sync15:16
mordredand do it again15:17
mordred(I was originally thinking about using cloud snapshots - but I think that's too complicated honestly - because rebooting into a snapshot does stuff with ephemeral, so we'd need to invent some automation around launch-node tasks that would need to be re-done - and I think rsync will do it)15:17
corvusmordred: things may be a little out of sync in terms of the mysqldump being behind the current prod git repo state.  do you think that would be a problem?  i think it would be really important to have the 2 in sync for the notedb migration, but maybe just going to 2.16 it's not as important?15:25
corvusmordred: if we do think it's important, we could shut down prod gerrit briefly, take a mysql dump, and do a final incremental rsync.  outage should only be a few minutes?15:26
clarkbcorvus: mordred: maybe as a first step having an in sync point in time we can restore is sufficient?15:26
clarkbthen once we'd decided if upgrading in sequence with online upgrades or doing one major upgrade is better we can refine that specific option with more up to date data?15:26
corvusclarkb: sorry, i'm not following -- i'm wondering whether we need to have the mysqldb and the git repos in sync on review-test, or if having a db that's slightly older than the git repos is okay15:27
clarkbcorvus: ya I was more addressing the automation around launch node. eg we don't need a full proper sync each time we launch a new review-test. We only need one that we can copy and restore15:28
clarkbassuming that we decide a full sync is necessary15:28
corvusclarkb: oh yeah, i think mordred intends to keep review-test persistent; i think that playbook is an ad-hoc playbook15:29
clarkbah15:29
corvusi think mordred's approach is probably okay, but we're going to have change refs for changes that aren't in the db, so pushing up new changes is almost certainly a bad idea.  but just to test/time re-indexing, etc, it's probably sufficient.15:32
mordredcorvus: yeah - that. I think we could also create a point-in-time snapshot like you suggest15:43
mordredcorvus: perhaps once we're happy with an upgrade procedure we can create a consistent snapshot to test and upgrade that as a more final test before we go - so that we can test pushing up changes and stuff15:44
corvusmordred: sounds good15:53
corvusan outage for a PIT should be fairly short.15:53
mordredyah15:54
openstackgerritJames E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support  https://review.opendev.org/72030216:01
openstackgerritMonty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test  https://review.opendev.org/73561016:04
mordredcorvus: ^^ I lost to the whitespace gods16:05
*** sshnaidm_ is now known as sshnaidm16:05
fungithe only way to win is not to play16:06
*** ysandeep is now known as ysandeep|away16:13
*** rpittau is now known as rpittau|afk16:17
mnaserhi friends16:27
mnaseris there any chance that zuul logging is borked because of some ooms?16:27
fungii can look16:28
mnaserhttps://zuul.opendev.org/t/vexxhost/status <= my jobs here when clicking go straight to END OF STREAM16:28
mnaser(could also be something else, but i can't tell really)16:28
fungimost recent oom on any executor was 2020-03-29 on ze0216:31
fungiwell, on any running executor (there was one from april on ze04 but it's currently down for evaluation)16:32
clarkbthere is a period of time between the node being assigned and the job actualyl starting on the remote node where there is no stream content16:32
fungiwe restarted our executor services on 2020-05-26 so i don't think any log streamers have been sacrificed in an oom event since then16:33
*** diablo_rojo has joined #opendev16:34
clarkbmnaser: fungi both jobs seem to have content now16:35
clarkbI think the period between node assignment and job starting enough to have a streamer running is likely the cause here16:36
*** diablo_rojo has quit IRC16:39
clarkbcorvus: did you see my question on https://review.opendev.org/#/c/730929/6 ?16:44
clarkbalso I'm double checking that we merged all the changes from friday's renaming and it appears we have. If you've got any still open please let me/us know16:44
openstackgerritMonty Taylor proposed opendev/system-config master: Don't install puppet modules when we don't need them  https://review.opendev.org/73564216:46
mordredclarkb: ^^ I just noticed that when looking at a test run that timed out - we're installing all of the puppet modules from git in every job even when those jobs don't run puppet16:46
mordred(it's only taking 2 minutes - but still, that's 2 completely wasted minutes in most of our jobs)16:47
corvusclarkb: ah yeah, looks like a rebase snafu16:56
openstackgerritJames E. Blair proposed opendev/system-config master: Fake zuul_connections for gate  https://review.opendev.org/73092916:57
*** diablo_rojo has joined #opendev16:59
mordredcorvus: stop using backend hostname should be safe to land yes?17:00
mordred(I mean, it looks that way, just checking to make sure)17:01
corvusmordred: yeah, i think it's all good up to the WIP zookeeper change17:01
mordredcool17:01
clarkbhttps://review.opendev.org/#/c/734711/ is an easy puppet code deletion if anyone has a quick moment17:01
clarkband https://review.opendev.org/#/c/734647/ will update a number of docker images, but helps make our python3 auditing cleaner17:02
mordredclarkb: done on both17:02
mordredclarkb: did we just switch out nodes to ones without virtualenv pre-installed?17:03
clarkbgitea's 1.12.0 milestone is down to a single issue without an open PR. The other issue has an open PR that passes testing and needs review17:04
clarkbmordred: we did17:04
mordredbecause I just got a failure on system-config-legacy-logstash-filters: https://zuul.opendev.org/t/openstack/build/876b22c1c06649ea8aaea5f0733a793717:04
mordredAWESOME17:04
clarkbmordred: ianw did that during australia monday17:04
mordredI'll get up a fix17:04
clarkbthanks17:04
openstackgerritMonty Taylor proposed opendev/system-config master: Use python3 -m venv instead of virtualenv  https://review.opendev.org/73564317:06
mordredinfra-root: ^^ fix gate break17:06
mordredclarkb: I'm excited we're close to 1.1217:07
clarkbmordred: hrm for the venv fix I think that may still not work on xenial because xenial's pip isn't able to handle our wheel mirror config? I could be wrong about that (testing should tell us)17:07
clarkbif it does fail due to the wheel mirror being present we can just add the ensure-virtualenv role to the job17:07
fungimordred: shouldn't that use -m ?17:09
fungiat least testing locally, `python3 -v venv foo` doesn't seem to create a venv17:10
fungi"python3: can't open file 'venv': [Errno 2] No such file or directory"17:10
clarkbfungi: yes, the commit message got it right17:11
fungiindeed, seems so17:12
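For clarity, the difference under discussion: -v only turns on verbose mode and leaves "venv" to be treated as a script name, while -m runs the venv module; a quick way to sanity-check the result is shown on the last line:

    python3 -v venv foo     # wrong: python looks for a file called "venv" and fails
    python3 -m venv foo     # right: creates the virtualenv, pulling pip in via ensurepip
    foo/bin/pip --version   # confirms pip actually landed in the venv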
clarkboh neat looks like the other issue associated with 1.12 is maybe not a bug17:13
clarkbI wonder if this means we could have a 1.12.0 release this week17:13
fungithat would be exciting17:13
openstackgerritMonty Taylor proposed opendev/system-config master: Use python3 -m venv instead of virtualenv  https://review.opendev.org/73564317:13
mordredfungi, clarkb: yup. I can't type :)17:14
fungino worries, me neither17:14
fungihalf the time i'm lucky i can even read17:14
mordredfungi: I think it's unreasonable to expect a single person to be able to both read AND write17:16
fungisometimes i can append, does that count?17:17
corvusas i read this conversation, the word 'truncate' comes to mind17:18
mordredcorvus: that sounds like truculence to me17:28
openstackgerritMonty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on  https://review.opendev.org/73560217:30
openstackgerritMonty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test  https://review.opendev.org/73561017:30
corvusmordred: is that when a big-rig driver .... nevermind17:31
mordredcorvus: yes17:34
dmsimardmordred: would love a refresh of your +2 on https://review.opendev.org/#/c/735439/ <317:42
hrwhttps://marcin.juszkiewicz.com.pl/2020/06/15/opendev-ci-speed-up-for-aarch64/17:44
mordreddmsimard: done17:46
dmsimard\o/ thanks17:46
AJaegerhrw: thanks, nice numbers on speed improvement!17:53
mordredhrw: nice!17:53
hrw2020-05/#opendev:22 14:47 < hrw> I should probably find it 2-3 years ago ;D17:56
hrw;D17:56
*** hashar has joined #opendev17:57
fungiexcellent article17:58
openstackgerritMerged openstack/project-config master: Create a new project for recordsansible/ara-collection  https://review.opendev.org/73543918:00
hrwthx18:00
hrwshould have some links in it but I care less about seo than before ;D18:01
clarkbit is always great to see how changes we've made help18:01
mordredclarkb, fungi: *wat* - https://zuul.opendev.org/t/openstack/build/aee168e0d87c4dbf9337c4bee692104b18:18
mordreddoes python3 -m venv not produce a venv with a working pip in it?18:18
corvus\o/  zuul with zk tls started!  https://zuul.opendev.org/t/openstack/build/dbff561b77214db19a05d9711a09634a/log/zuul01.openstack.org/debug.log18:19
clarkbmordred: ya I think that was what I was trying to describe earlier18:19
clarkbmordred: you may need to use ensure-virtualenv on xenial to work around python sillyness on ubuntu18:19
mordredclarkb: ok. I'm going to do that18:19
fungii'm dubious that's the cause, but doing some local testing now18:20
clarkbthe problem I remember had to do with it using old pip18:20
clarkbI would've expected a pip in the virtualenv though18:20
clarkbpossible that site packages changes the behavior there18:20
clarkband if you don't have a python3 pip installed in the system you get no pip in the venv?18:21
openstackgerritJames E. Blair proposed opendev/system-config master: Add Zookeeper TLS support  https://review.opendev.org/72030218:21
fungiyeah, i don't get the behavior there. on debian/ubuntu with distro packaged python3, either you have python3-venv installed which depends on a wheel bundle including pip, or you get an error about ensurepip failing18:21
fungii thought maybe there was a chance --system-site-packages changed that behavior, but it doesn't seem to for me18:22
openstackgerritMonty Taylor proposed opendev/system-config master: Use ensure-virtualenv in legacy puppet jobs  https://review.opendev.org/73564318:22
fungithere is a --without-pip option to the venv module18:22
fungimaybe somehow it's defaulting on18:22
fungimore testing18:23
* mordred isn't going to lose a lot of sleep on it - these jobs need to diaf anyway18:23
fungiat least in debian/sid it's installing pip into the venv for me even using distro-packaged python3-venv18:24
mordredfungi: maybe it's clarkb's thing - if - you don't have python3-pip installed do you wind up with no pip?18:25
fungii did not install python3-pip18:26
fungiand did not have it installed18:26
mordredyeah. I agree - I just did that locally too18:26
fungipython3-venv pulls in python3.8-venv and python-pip-whl, the latter has wheel bundles for stuff including pip18:26
mordredand I happily have pip in the venv18:26
openstackgerritGhanshyam Mann proposed openstack/project-config master: Retire Tricircle projects: finish infra todo  https://review.opendev.org/72890218:27
mordredfungi: *WEIRD*18:27
fungithis failed on xenial though18:28
fungiso maybe it's older behavior?18:28
clarkbfungi: I was just going to ask python3.8 isn't on xenial18:28
clarkbfungi: yes that is my hunch18:28
clarkbianw discovered xenial to be weird18:28
fungiyeah, i was testing on debian/sid since it's what i have locally18:28
fungii thought this was how the python3-venv package had worked for a while, but perhaps not so long as xenial18:29
fungithough it looks the same from a deps standpoint18:30
fungipython3-venv on xenial depends on python3.5-venv which depends on python-pip-whl18:30
fungiand it in turn only depends on ca-certificates, no python3-pip or python3.8-pip or anything of the sort18:31
* mordred just tried it in a xenial container18:31
mordredand it worked just fine18:31
fungiand python-pip-whl only installs .whl files under /usr/share/python-wheels/ nothing directly importable or executable18:31
clarkbmordred: ya I just did that too18:31
fungihttps://packages.ubuntu.com/xenial/all/python-pip-whl/filelist18:31
fungiso that build failure is *very* puzzling18:32
mordredI'm almost interested in holding a node18:32
fungicould the python3 -m venv call have failed but not returned an error somehow?18:32
mordredmaybe?18:33
clarkbfungi: that seems plausible18:33
openstackgerritMonty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on  https://review.opendev.org/73560218:33
openstackgerritMonty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test  https://review.opendev.org/73561018:33
clarkbif venv wasn't installed we'd get an error (just tested this on xenial container)18:34
clarkbso venv needs to be there in some capacity to have it be silent like that18:34
clarkbare we only serving the arm64 wheels on the arm64 mirror?18:40
clarkbI guess that kinda makes sense18:40
clarkbbut with things like zuuls cross arch docker image builds we may want to put the contents for all the arches in all the mirrors18:40
mordredclarkb: that's a good point18:43
mordredalthough won't that require some logistical reworking?18:43
clarkbmordred: I don't think so since everything is path scoped by arch already18:44
clarkbmordred: I think it may just be a matter of having the correct symlinks on disk and apache config?18:44
mordrednod18:44
mnaserhas anyone looked at system-config-legacy-logstash-filters or not yet? :>18:57
mnaseri can try myself at fixing it if there's no one at it18:58
clarkbmnaser: mordred is18:58
clarkbmnaser: https://review.opendev.org/735643 that change18:58
mnaserok, cool -- /me can help if needed18:59
mordredmnaser: it _should_ be fixed by that19:02
mordredmnaser: and one day I'll get around to killing that job19:02
mnaser\o/19:03
fungimordred: i also noticed over the weekend pbr's unit and devstack/tempest jobs are busted too, though i haven't had time to dig into that yet19:04
clarkbfungi: the python2 failure is using the stestr constraint for version 3.0.1 which is python3 only19:07
fungiyeah, i'm unsurprised there19:07
clarkband python3 failed on some virtualenv thing which may be related to new images? though the timestamp is such that I don't think so19:07
clarkbAttributeError: module 'virtualenv' has no attribute 'create_environment'19:08
clarkbpossible that is related to virtualenv 3 updates?19:08
fungioh, maybe19:13
mordredclarkb: feel like a +A on https://review.opendev.org/#/c/735602 ?19:15
clarkbI'll take alook after lunch19:15
openstackgerritMonty Taylor proposed opendev/system-config master: Don't install puppet modules when we don't need them  https://review.opendev.org/73564219:35
openstackgerritMonty Taylor proposed opendev/system-config master: Install pip3 on codesearch  https://review.opendev.org/73566819:35
mordredclarkb, fungi: more fallout from new nodes ^^19:35
openstackgerritMerged opendev/system-config master: Use ensure-virtualenv in legacy puppet jobs  https://review.opendev.org/73564319:42
openstackgerritMonty Taylor proposed opendev/system-config master: Add bit more info on disabling ansible runs  https://review.opendev.org/73524619:42
mordredfungi: ^^ I rebased that on the logstash filters fix and added reference to disable-ansible script19:42
openstackgerritMonty Taylor proposed opendev/system-config master: Switch prep-apply.sh to use python3  https://review.opendev.org/72954319:43
clarkbcentos 8.2 has released. Another thing to keep an eye on if/when failures happen19:46
clarkbmordred: is review-test a full size node?20:04
clarkbalso looking at it we don't use the review group for group vars. We use hostvars and you've trimmed the hostvars down for review-test. Is that sufficient to ensure that things like gerritbot and launchpad syncing won't try to run in both places at once?20:06
clarkb(we want to prevent that and want to make sure we've considered it and I think the split host vars does that?)20:06
corvusclarkb, mordred, fungi: https://review.opendev.org/720302 zk tls is ready -- do we want to think about doing that on friday?20:07
clarkbcorvus: I'll be around and able to help20:08
corvusi'll add an item to the mtg agenda20:08
fungiyeah, i can do friday, no problem20:09
openstackgerritMerged opendev/system-config master: Add tool to export Rackspace DNS domains to bind format  https://review.opendev.org/72873920:10
mordredclarkb: yes20:13
mordredclarkb: as is the rax db I made20:13
clarkbmordred: cool so the 48g heap size won't cause problems then. What about the other thing?20:14
mordredclarkb: well - before I did private hostvar surgery on bridge, we actually used group_vars  for review for settings20:14
mordredclarkb: but - I believe with the secrets being in host-specific files we will not be putting any secrets on review-test that would allow those services to operate20:15
clarkbcool20:15
clarkbthat was my read of it too, just double checking20:15
clarkbwhat about email20:15
mordred(I'm pretty sure this first ansible run won't even finish because it'll be missing some required secrets)20:15
clarkbare we concerend about gerrit sending people email?20:15
mordredhrm. that's a good question20:15
mordredit should really only send mail on patchset upload right?20:15
clarkbya I think upload and merge20:16
clarkbas long as we avoid updating random changes we're probably fine20:16
mordredlike - as long as we're not pushing changes to or merging changes there it _SHOULD_ be fine?20:16
mordredyeah20:16
clarkbI've +2'd the change though zuul is unhappy with it20:16
mordred\o/20:16
clarkbpossibly due to the host vars20:16
mordredlet's see what's broken this time20:16
*** hashar has quit IRC20:17
mordredData could not be sent to remote host "198.72.124.215". Make sure this host can be reached over ssh: ssh: connect to host 198.72.124.215 port 22: No route to host20:17
mordredclarkb: it seems to have been unhappy trying to talk to fake review-dev20:17
clarkbah so maybe just a recheck?20:18
mordredclarkb: yeah - I'll try that20:18
mordredclarkb: oh - also - if you have a sec ...20:18
mordredclarkb: check the most two recent commits in private hostvars and make sure I didn't derp?20:18
clarkbmordred: it looks right to me. You renamed group_vars/review.yaml to host_vars/review01.openstack.org.yaml and added host_vars/review-test.opendev.org.yaml with minimal content20:20
mordredclarkb: \o/20:27
clarkbI did git log -2 -p fwiw20:28
clarkbcorvus: I had the zuul tls changes under the CD topic. Would you like me to drop it there and discuss it as a separate item or collapse under that heading?20:40
clarkbI'm getting ready to send the agenda out and want to make sure its got the proper attention20:40
corvusclarkb: your choice20:40
corvussorry i missed it was already there20:41
clarkbno worries20:41
clarkbyou added more info :)20:41
openstackgerritMerged opendev/system-config master: Add bit more info on disabling ansible runs  https://review.opendev.org/73524620:41
openstackgerritMerged opendev/system-config master: Switch prep-apply.sh to use python3  https://review.opendev.org/72954320:41
openstackgerritMerged opendev/system-config master: Install pip3 on codesearch  https://review.opendev.org/73566820:44
*** rchurch has quit IRC20:44
*** rchurch has joined #opendev20:45
mordredcorvus, fungi: if either of you have a sec: https://review.opendev.org/#/c/735642/ is easy20:46
*** mlavalle has quit IRC20:47
mordredinfra-root: the ensure-virtualenv patch landed and jobs work again - I have rechecked the system-config patches that had failed due to that20:48
clarkbmordred: thanks20:48
clarkb(I had a couple get caught by it)20:48
mordredyeah20:49
mordredit was actually quite the carnage - there are 7 changes in recheck right now20:49
clarkbI'm around but will need to transition to dadops in about half an hour to run kids' remote class thing20:49
clarkb(as general heads up)20:49
mordredI'm around and unlikely to go anywhere for a bit as I am beset on all sides by a pile of sleeping kittens20:50
clarkbmordred: just noticed you updated https://review.opendev.org/735246 thanks for catching that20:57
fungimordred: i've approved it, in case you survive burial by kitten20:57
mordredclarkb: sure nuff20:58
mordredfungi: \o/20:58
*** mlavalle has joined #opendev20:59
clarkbinfra-root I've discovered https://review.opendev.org/#/c/686237/ in my spring cleaning and wonder if I should either abandon that because we don't want the behavior or push a new patchset to do that with ansible now that we use ansible to deploy zuul?21:07
mordredclarkb: well - maybe we should just land teh "use docker for executors" patch21:08
mordredclarkb: https://review.opendev.org/#/c/733967/21:08
clarkbmordred: ++ I'll abandon my change21:09
clarkbmordred: also I think that old change not being merge conflicted implies we can clean up some puppet things?21:10
clarkbI'll look into that while dealing with kids' school stuff21:10
mordredclarkb: ++21:10
openstackgerritMerged opendev/system-config master: Forward user-committee ML to openstack-discuss  https://review.opendev.org/73367321:14
openstackgerritMerged opendev/system-config master: Change launch scripts to python3 shebangs  https://review.opendev.org/73434521:14
corvusi have a zuul enqueue-ref command that is hung; something fishy may be going on21:16
corvusum.  the gearman certificate has expired.21:22
corvusha, it's the ca cert that expired21:24
corvusthe client/server certs have 10-year lives21:24
corvusthe ca only 321:24
corvusistr we lost our ca infrastructure somewhere along the line21:24
corvusbut i can use zk-ca.sh to make new certs easily21:25
corvushowever, we'll need a full system restart to use them21:25
mordredcorvus: yes - I believe we decided we didn't need to replace the ca infrastructure with LE and zk-ca21:25
corvusmordred: well, we decided that after it was removed, but yes :)21:26
mordredcorvus: yeah21:26
mordredit wasn't like an _active_ choice ;)21:26
corvusmordred, clarkb, fungi: perhaps we should go ahead and merge https://review.opendev.org/720302 and do the zk tls and gearman rekey all at once?21:27
clarkbthe existing connections are fine?21:27
corvusclarkb: yeah, as long as they aren't interrupted21:27
corvuswe probably won't be bringing any of those offline ze's back online till then though21:27
mordredcorvus: wcpgw?21:28
clarkbcorvus: I'm not opposed to bunlding those changes since they are related (they share a CA right?)21:28
clarkbbut I'm not really able to help at this very moment21:29
corvusclarkb: the thing they *most* share in common is they need a full restart21:29
clarkbcorvus: got it21:29
mordredcorvus: so - I'm game21:29
openstackgerritMerged opendev/system-config master: Don't install puppet modules when we don't need them  https://review.opendev.org/73564221:29
openstackgerritMerged opendev/system-config master: uwsgi-base: drop packages.txt  https://review.opendev.org/73547321:29
mordredcorvus: while we're full system restarting - should we land the z-e docker change?21:29
mordred(that might be a bit much though - and we really can do that one executor at a time to make sure)21:30
corvusmordred: let's not -- we can restart executors one-at-a-time and reduce risk there21:30
mordredcorvus: oh headdesk. there's another puppet failure in the stack. looking21:31
corvusoy i just saw that21:31
mordredcorvus: I think it's unrelated21:31
mordredhttps://zuul.opendev.org/t/openstack/build/42cb6c4d6188486eae6dd2b7a05a6b5c/log/applytest/puppetapplytest07.final.out.FAILED21:31
mordredis the failure21:31
corvusunfortunately, that means we'll have a full re-run cycle for that21:31
mordredyeah21:31
mordredwe could cheat21:31
corvusmordred: and?21:31
corvuswe could force-merge all 3 changes21:32
mordredyeah21:32
corvuswe can't enqueue-to-gate though because the zuul cli is out of commission21:32
mordredyea. given that I don't think we want to spend _hours_ in the current situation21:32
mordredand we do have green runs of the jobs that are actually relevant21:33
corvusbut these do all have good check results, so seems like good risk/reward.21:33
mordredyah21:33
corvusclarkb: are you reviewing https://review.opendev.org/720302 ?21:33
mordredclarkb: ?21:33
clarkbI can review but not help with the change landing itself /me looks21:33
fungii've resurfaced from making/consuming evening sustenance... catching up but can definitely help with a zuul restart for new certs21:34
mordredcorvus: the zk patch isn't going to fix the gearman cert though21:34
clarkbI'm reviewing the change21:34
mordredcorvus: but that's just a zk-ca and updating private hostvars, right?21:34
corvusmordred: correct21:34
corvusi can do that now so that it gets incorporated into the next run21:35
fungialso the executor daemon for ze04 is still stopped, we haven't rebooted that server yet. i didn't know if ianw might want to take a look, but we should either avoid restarting the executor on it or reboot the server21:35
mordredcool. so - yeah - I think the sequence would be land all the patches, shut down zuul, update hostvars, run service-zuul and service-zk and then re-start yes?21:35
mordredI suppose we can update the hostvars before doing the shutdown21:36
mordredin fact, you could probably go ahead and update the hostvars21:36
corvusi'd like to update hostvars; merge patches, wait for playbook completion, then restart21:36
mordredyes21:36
mordredI was just about to write that same thing21:36
mordredI think it's correct - I blame the clowder of kittens for making me take a while to reach that conclusion21:37
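To make the sequence being agreed on concrete: once the new certs are in the private hostvars and the changes merge, the zookeeper and zuul service playbooks re-deploy the configs, and then the services are restarted so they pick up the new TLS settings. A rough sketch of the equivalent manual steps on bridge, where the playbook names and paths are assumptions about the system-config layout and, in the normal flow, these playbooks are triggered automatically by the deploy pipeline after merge:

    # sketch only -- playbook names/paths are assumptions
    cd /opt/system-config && git pull
    ansible-playbook -v playbooks/service-zookeeper.yaml
    ansible-playbook -v playbooks/service-zuul.yaml
    # then stop/start the zuul scheduler, mergers and executors so they
    # reconnect to zookeeper and gearman with the new certificates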
fungi720302 is safe to merge, it just won't take effect until restarts, yeah?21:38
corvusfungi: probably? :)21:38
fungibut yeah, i agree with hostvars first21:38
corvusif it breaks, that's our signal to restart anyway :)21:39
fungino argument there ;)21:39
corvusinstalling new certs now21:39
fungi$CAROOT looks like a hipster vegetable21:39
ianwfungi: ze04 having the same afs issues?21:39
ianwas linaro last night i mean21:39
ianws/night/your local time/ :)21:40
fungiianw: yes, i left it stopped since it was something we could safely leave broken to evaluate21:40
fungithe rax-ord mirror also needed a reboot21:40
openstackgerritMerged opendev/system-config master: Cleanup old puppet management of release-volumes.py  https://review.opendev.org/73471121:40
fungifor similar reasons21:40
mordredcorvus: oh - fwiw - executor on ze01 is stopped because I was doing the afs+docker testing - but I think it's fine to restart when we do the restart21:40
mordredI do not think we need it to remain stopped21:40
fungiianw: fs flush got the cell root browseable again, but trying to look at some subpaths of the tree timed out read operations21:41
ianwfungi: yeah, i did poke around on the linaro mirror and didn't see anything other than a lot of disconnection/connection logs21:41
fungiianw: the difference with ze04 is it was having trouble getting to the rw tree rather than the ro tree21:41
fungialso dmesg on rax-dfw mirror showed the loss of connectivity with afs01.dfw but never logged it coming back into service21:42
fungier, rax-ord mirror i mean21:42
clarkbcorvus: couple of questions on https://review.opendev.org/#/c/720302/17 but lgtm otherwise21:42
fungiianw: anyway, i spotted ze04 because a bunch of publication jobs were failing rsync calls21:42
ianwfungi: yeah, not sure i have anything else i know to look at21:43
fungiianw: in that case i guess we can just make sure to reboot ze04 when we're restarting the rest of the executors21:44
corvusclarkb: replied21:44
fungioh, and i did try restarting afsd on the rax-ord mirror, but it got stuck stopping21:44
fungiand was unkillable21:44
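For context, the poking described here (flushing cached state, checking which fileservers the cache manager thinks are down) is done with the OpenAFS fs utility plus the kernel log; a rough sketch, with the path shown as an example rather than a value confirmed in this log:

    # typical client-side checks when an AFS mirror volume misbehaves
    fs checkservers                       # servers the cache manager believes are down
    fs flush /afs/openstack.org/mirror    # drop cached data/status for a path
    dmesg | grep -i afs                   # look for lost-contact / back-up messages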
clarkbcorvus: rgr +221:45
ianwfungi: so i'm just getting reset, not sure if you saw the scrollback about the utimensat() calls and openafs not updating the ns for files and "-t"21:45
corvus#status log re-keyed gearman tls certs (they expired)21:45
openstackstatuscorvus: finished logging21:45
ianwfungi: yeah, i've never had any luck with anything but rebooting21:45
fungiianw: yep, i followed that. mismatch in timestamp expectations between openafs and rsync sounds plausible. did you try dropping -t?21:46
clarkbianw: I tried to understand what that meant for us, do we update our rsync flags?21:46
ianwfungi: i plan to do a manual run under strace without the "-t" to rsync and see what happens21:46
corvusmordred, fungi: i will force-merge now21:46
fungiianw: cool, i notice the cronjob has been running all day21:46
ianwthe fedora cron job should be commented out ... i hope at least21:47
fungiianw: also should we go ahead and turn mirror-update.openstack.org back on too before the reprepro mirrors fall too far behind?21:47
openstackgerritMerged opendev/system-config master: Stop using backend hostname in zuul testinfra tests  https://review.opendev.org/73340921:47
openstackgerritMerged opendev/system-config master: Fake zuul_connections for gate  https://review.opendev.org/73092921:47
ianwfungi: oh yeah, i think so, i saw your initial work to migrate that which is great too21:47
fungiit's way incomplete, i need to find time to make progress on it21:48
openstackgerritMerged opendev/system-config master: Add Zookeeper TLS support  https://review.opendev.org/72030221:48
mordredcorvus: woot!21:48
fungiianw: oh, and somebody said a new centos 8.x release dropped today, so... probably a lot of rsync going on for that too21:48
clarkbfungi: ya 8.2 (was me)21:48
fungithanks clarkb! today has been a complete blur21:48
fungii may declare wednesday as sunday and make myself scarce ;)21:49
mordredfungi: wednesday fednesday right?21:49
* fungi boots mirror-update.ostack.o back up21:49
fungimordred: something like that, yep21:49
corvusmordred, fungi: looks like there's a deploy backlog; i'm going to afk for 30m21:50
fungicorvus: cool, i'll be around for a while still when you get back21:51
mordredcorvus: kk21:53
openstackgerritJeremy Stanley proposed opendev/system-config master: Cleanup from ZK TLS transition  https://review.opendev.org/73574021:55
*** sgw has quit IRC21:59
*** sgw has joined #opendev22:01
fungijust the firewall rules for now, but we can dogpile other cleanup into that if anyone knows of more22:02
fungiwe still need to update the ports in the nodepool confs in project-config, right?22:05
fungior is there a separate change already up to do that?22:05
clarkbfungi: I think that is in the change that merged, it loads the file, edits it, then writes it back out again22:05
fungi#status log started mirror-update01.openstack.org22:05
openstackstatusfungi: finished logging22:05
fungiclarkb: oh! so directly modifies the configs at runtime, okay22:06
clarkbfungi: yes I think so22:06
fungiclarkb: i'm not finding anywhere in 720302 which modifies the nodepool configs, i may be overlooking something22:08
clarkbfungi: https://review.opendev.org/#/c/720302/17/playbooks/roles/nodepool-base/tasks/main.yaml line 6922:09
fungioh, in playbooks/roles/nodepool-base/tasks/main.yaml there's a task to "Overwrite zookeeper-servers" and another from a previous change to "Write nodepool config"22:10
clarkband https://review.opendev.org/#/c/720302/17/playbooks/roles/nodepool-base/library/make_nodepool_zk_hosts.py22:10
fungiso i guess we don't directly write out the nodepool configs from the project-config repo22:10
mnaserhi all.  Could we add me (or tc members) to https://review.opendev.org/#/admin/groups/441,members to help with the retirement of tricircle ?22:10
mordredmnaser: done22:11
mnasermordred: thanks!22:12
fungiclarkb: aha, i see we've actually had that implemented since april via https://review.opendev.org/72070922:12
fungii wonder if we should either update the configs in project-config for the new ports to avoid confusion, or better yet remove the zookeeper servers from it entirely and substitute a comment saying that we inject them with ansible now22:13
fungihaving zk connection details in those configs when we're not actually relying on them is just begging someone to make updates in the wrong place down the road22:14
clarkbfungi: ya that seems reasonable. Or maybe go back to consuming it from project-config once the dust settles on this22:14
fungiclarkb: well, the change went in originally to support production-like test environments for our integration testing jobs22:15
fungiso i expect we'd want to keep the capability22:15
clarkbfungi: with /etc/hosts being written now we may be able to do that without editing the configs?22:15
clarkbthough we probably only do a single zk server I guess22:16
clarkb(rather than 3)22:16
mnasermordred: i'm sorry, do you mind adding me to https://review.opendev.org/#/admin/groups/1706,members too?22:25
mnaserseems like client core != project core22:25
mnaser(or any other infra admin around)22:26
ianwmnaser: done22:26
ianwfungi: ok, i've re-run manually our fedora rsync without the "-t"22:26
mnaserthank you ianw22:27
ianwit's got all the lstats, but none of the utimensat() calls22:28
fungiianw: that's promising... any noticeable difference in rsync runtime (like is it significantly slower without -t?)22:29
ianw2020-06-15 22:22:06  | Running rsync for releases/31..22:31
ianw2020-06-15 22:22:30  | Running rsync for updates/30...22:31
mnaserlast request in helping retire, trivial change: https://review.opendev.org/#/c/728902/422:31
ianwlike 24 seconds22:31
ianw2020-06-08 06:43:34  | Running rsync for updates/30...22:32
clarkbany concern landing mnaser's chnage ^? specifically the zuul config update at https://review.opendev.org/#/c/728902/4/zuul/main.yaml while we fix zuul things?22:32
corvusback22:32
ianw2020-06-08 06:44:11  | Running rsync for updates/31...22:33
mnaseryeah we may hold off on that then, it's not _that_ urgent but worth deferring for later if things are going on22:33
ianwfungi: so yeah, in the noise22:34
fungiianw: in that case, sounds like we should just drop -t from our rsync calls22:35
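Concretely, the experiment ianw describes is the existing fedora mirror rsync invocation minus -t (preserve modification times), which is what generates the utimensat() calls; a sketch of the before/after, assuming the -rltDiz flag set used by the fedora mirror script, with the source module and destination path shown here only as illustrations:

    # as historically run (source/destination are illustrative):
    rsync -rltDiz rsync://dl.fedoraproject.org/fedora-enchilada/linux/ \
        /afs/.openstack.org/mirror/fedora/
    # the experiment: same flags minus -t, which removed the utimensat()
    # calls from the strace with no obvious change in runtime
    rsync -rlDiz rsync://dl.fedoraproject.org/fedora-enchilada/linux/ \
        /afs/.openstack.org/mirror/fedora/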
*** ysandeep|away is now known as ysandeep22:43
corvusmordred, fungi: it might be worth a bit of analysis to find out why https://review.opendev.org/733673 is running all the jobs22:49
corvusit's been running for 1.5 hours, and is maybe 1/3 through the list22:49
corvusthen there are 4 changes after it, then finally the 3 changes we need for zuul :/22:50
mordredcorvus: because it touches inventory22:51
corvusmordred: do we need to adjust that matcher after the recent reorg?22:51
corvusor is that intentional?22:51
mordredcorvus: yes - but I think we still have work to do there22:51
corvusbecause any job can reference the inventory hostvars of any group....22:52
mordredyeah. I think we're defaulting to safe currently22:52
corvusif we're adding inventory/ to everything because of that ^ then i think maybe we'd be better off just running one job that does everything, because everything is going to touch inventory22:52
mordredcorvus: I actually thought we had some smaller matchers already22:53
corvusbut i think we can maybe narrow that down22:53
corvuslike, service-zuul should be able to say "inventory/zuul" + "inventory/zookeeper" or whatever22:53
mordredyes - that22:53
ianwfungi: i think dropping -t means that it doesn't detect non-size changing updates?22:54
mordredcorvus: yeah - I think we have better matchers on the CI jobs ... but haven't done the same for the prod jobs22:54
mordredcorvus: s/I think//22:55
mordredcorvus: it's totally all inventory/ in the prod jobs22:55
corvusmordred: ok, well the good news is that none of the other changes before the zuul changes touch inventory (though one will probably run puppet-else);  our final change does touch inventory though22:55
mordredcorvus: so - I think we go through and match up the file matchers we use for CI jobs with the prod versions22:55
corvusmordred: ++22:55
corvusthis is just a swag, we might be looking at another 5 hours before the zuul change is deployed22:56
corvusi will not be in a position to help then22:56
corvusmordred: perhaps we should disable something and come back tomorrow?22:56
mordredcorvus: well ... we could touch disable-ansible - that'll block all future runs22:57
mordredso as soon as the current job is done there will be no more ansible running22:57
mordredand we could just do a git pull and then run the relevant playbooks22:57
corvusand then rely on -hourly to catch up whatever else was in the queue?22:58
mordredyeah22:58
corvussounds like a plan22:58
mordredthe currently queued jobs will time out after an hour iirc22:58
mordredwell - each one will block for an hour22:58
mordredoh - but then we'll restart zuul - so they will go away22:58
clarkbyou're going to restart zuul22:58
mordredyeah22:58
clarkbya22:58
mordredso yeah - I think that'll totally work22:59
mordredwant me to run disable-ansible now?22:59
corvuswhere's our docs for disabling ansible?22:59
corvushttps://docs.openstack.org/infra/system-config/sysadmin.html#disable-enable-puppet22:59
corvusthat's all i've found :/22:59
mordredclarkb just updated them22:59
mordredbut I think the patch is one of the ones landing22:59
clarkbthey are in bridge's page23:00
mordredcorvus: https://review.opendev.org/#/c/735246/23:00
clarkbI added them to our sysadmins page as that is where we have preexisting stuff23:00
mordredyeah - so you would find them there eventually23:00
corvuswhere's bridge's page?23:00
mordredcorvus: oh - you're looking at openstack docs23:01
mordredcorvus: https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#disable-enable-ansible23:01
corvuswe moved that without a redirect or delete?23:01
mordredit certainly seems that way, yes. we should fix that23:02
mordredanyway - clarkb's change still hasn't published there - so the proper instructions are still missing23:02
mordredbut they reduce to "run the disable-ansible script"23:02
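The script being referred to reduces to a flag-file pattern: an operator drops a marker file on bridge, and the wrapper that launches the periodic deploy playbooks refuses to run while it exists. A minimal sketch of that pattern, with the file name, path, and wrapper logic being assumptions rather than the exact system-config implementation:

    # minimal sketch of the flag-file pattern (name/path are assumptions)
    touch /home/zuul/DISABLE-ANSIBLE          # operator: block future runs
    # ...and in the wrapper that kicks off the deploy playbooks:
    if [ -f /home/zuul/DISABLE-ANSIBLE ]; then
        echo "ansible runs disabled; remove /home/zuul/DISABLE-ANSIBLE to re-enable"
        exit 0
    fi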
corvusi wonder how we could add a redirect?23:02
fungii think we can add one to the htaccess file in openstack-manuals23:03
fungilooking23:03
corvus#status log disabled ansible on bridge due to 5+ hour backlog with potentially breaking change at end23:03
openstackstatuscorvus: finished logging23:03
*** mlavalle has quit IRC23:04
mordredcorvus: cool23:04
fungicorvus: looks like we did it for infra-manual thusly: https://opendev.org/openstack/openstack-manuals/src/branch/master/www/.htaccess#L263-L26623:04
*** tosky has quit IRC23:04
fungithere's also a corresponding ci test for that redirect23:04
fungii'll propose a similar one for system-config now23:04
corvusfungi: thanks!23:04
mordredfungi: ++23:05
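The infra-manual precedent fungi links is a mod_alias rule in the openstack-manuals www/.htaccess, and the system-config redirect he goes on to propose (735747) would have roughly the same shape; the exact regex and target below are a guess at that rule, not copied from the change:

    # rough shape of the proposed redirect; exact pattern/target in 735747 may differ
    cat >> www/.htaccess <<'EOF'
    redirectmatch 301 "^/infra/system-config/?(.*)$" "https://docs.opendev.org/opendev/system-config/latest/$1"
    EOF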
fungias soon as i finish cloning that massive repo23:05
fungiso... slow...23:06
fungiand cloned23:13
fungiwow23:13
openstackgerritJames E. Blair proposed opendev/system-config master: Make disable-ansible fancier  https://review.opendev.org/73574523:15
corvusfungi, mordred: ^ that's the result of a mental simulation i just performed about possible outcomes from leaving DISABLE-ANSIBLE in place overnight.23:15
mordredcorvus: yes.23:19
fungi23:27 <openstackgerrit> Jeremy Stanley proposed openstack/openstack-manuals master: Redirect infra/system-config to docs.opendev.org  https://review.opendev.org/73574723:28
fungithere was some sitemap cleanup to do at the same time23:28
corvusfungi: thanks!23:35
openstackgerritMerged opendev/system-config master: Be explicit about using python3 in docker images  https://review.opendev.org/73464723:37
*** DSpider has quit IRC23:38
clarkbare we restarting services or leaving ansible disabled then picking it up tomorrow?23:52
openstackgerritMerged opendev/system-config master: Make disable-ansible fancier  https://review.opendev.org/73574523:54
