Monday, 2020-06-15

fungi	ahh, yeah, so it's taking a long time even when rsync hasn't run	00:00
ianw	https://lists.openafs.org/pipermail/openafs-info/2019-September/042865.html includes a link to afs audit logs from an rsync run	00:00
fungi	ahh, right, this is ringing a bell now	00:02
fungi	but https://review.opendev.org/681367 didn't actually solve it?	00:02
ianw	fungi: it seems not, the graph is showing every release takes ~8 hours	00:05
ianw	this is what led to the path of doing the release via ssh and -localauth	00:07
ianw	fungi: perhaps we should leave mirror-update off for a bit and investigate again?	00:09
fungi	yeah, not a bad idea	00:10
ianw	for a start, when fedora gets in sync, we could turn on file auditing and run a "vos release" with a zero-delta and see what happens	00:10
openstackgerrit	Merged openstack/project-config master: Drop pip-and-virtualenv from images https://review.opendev.org/734428	00:29
openstackgerrit	Merged openstack/project-config master: Use https apt mirrors for image builds https://review.opendev.org/735362	00:30
auristor	ianw fungi: afs vice partitions should be noatime but that won't alter the contents of the incremental dumps.	00:48
auristor	A third fileserver should be added so that there is always a redundant clone in case of a failure of afs01.dfw	00:49
*** Meiyan has joined #opendev		01:01
*** xiaolin has joined #opendev		01:04
ianw	auristor: it's probably a bit of a moot point though when it's basically in a constant state of "vos release" (i.e. the next one starts immediately after the previous one finishes)	01:06
*** xiaolin has quit IRC		01:11
auristor	not really. The point is that while afs01 is updating afs02, there is no valid copy on afs02. the only consistent copy is on afs01. which is at 100% network capacity so sending fetches there from clients only makes things slower. If afs03 existed, then either afs02 or afs03 would be online with a self consistent copy while the release was taking place.	01:22
ianw	we have afs01.ord too, i'm not sure if it's deliberately or it's just an accident of history	01:26
auristor	is anything replicated to it? mirror.centos for example is not	01:27
clarkb	ianw: ord was used until we hit the window sizing issues	01:28
clarkb	the idea was to be offsite for resiliency but that meant copies took forever	01:28
ianw	hrm, i don't remember that but ok; that probably explains the odd mix of replications we have	01:29
auristor	docs, docs.dev, mirror, project, project.airship, root.afs, root.cell, and user.corvus	01:29
auristor	if throughput to ord is a problem, then I suggest standing up a afs03.dfw.	01:30
auristor	I really wish we could figure out some way that auristorfs could be used to host this cell	01:30
ianw	we have updating to bionic and 1.8 as a increasingly insistent todo	01:31
auristor	openafs 1.8 will help a bit with rx issues but it isn't going to fix most of the underlying issues	01:32
ianw	auristor: it's still true we shouldn't mix 1.6 and 1.8 servers? i think that's the assumption we've been working under	01:34
auristor	absolutely not	01:34
ianw	auristor: sorry, we absolutely should not mix them, or it's ok to? :)	01:36
auristor	there are no data format or wire protocol changes between 1.6 and 1.8. mix and match to your hearts content.	01:44
auristor	what command line options are passed to rsync?	01:52
ianw	auristor: rsync -rltDiz	01:53
ianw	https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L101	01:53
ianw	looks like everything has released now	02:04
auristor	vos status afs02.dfw reports no transactions	02:05
ianw	i can try a release on fedora now and see what happens	02:06
ianw	since the update server is shutdown, nothing has written to it	02:06
ianw	if we want i can restart with audit logging	02:08
auristor	I don't think there is any interesting audit logging for the release. its the rsync that is interesting from my perspective.	02:08
fungi	yep, confirmed, the fedora and opensuse volume releases did finally complete some time in the last few minutes	02:11
auristor	As we discussed many months ago, the vos release is going to send all directories and any files that changed from five minutes before the last release time. The last release time was 2s after the last update time.	02:11
auristor	s/five minutes/fifteen minutes/	02:15
ianw	auristor: yeah, that's why we put in the sleep https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L152	02:21
ianw	istr we did try that experiment, running multiple releases	02:23
auristor	ianw: instead of performing a "vos release" that will require network bandwidth and taking the afs02.dfw volume offline, could you execute	02:23
auristor	vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 15:04"	02:23
ianw	Volume: 536871007	02:24
ianw	dump_size: 306725041646	02:24
auristor	and remove the -time switch and parameter	02:25
auristor	That is effectively the entire volume	02:26
ianw	Volume: 536871007	02:26
ianw	dump_size: 306840822582	02:26
ianw	echo $(( 115780936 / 8 / 1024 / 1024))	02:27
ianw	13	02:27
ianw	~13 gb difference ?	02:27
auristor	why dividing by 8?	02:28
ianw	oh it's bytes	02:28
auristor	110MB difference which is nothing	02:29
auristor	if you specify the time as "2020-06-14" what do you get?	02:29
ianw	Volume: 536871007	02:30
ianw	dump_size: 15613187	02:30
auristor	the times listed by vos examine are local times. So I'm giving you EDT. Use vos examine mirror.fedora from the machine the vos size command is being executed on and use that time	02:31
auristor	Last Update time	02:31
ianw	all, all the hosts run in UTC	02:31
auristor	vos doesn't	02:32
ianw	i'm doing this on afs01	02:32
auristor	I'm not on afs01. So my Last Update Sat Jun 13 15:04:11 2020	02:32
ianw	Last Update Sat Jun 13 19:04:11 2020	02:33
auristor	Provide that time to vos size	02:33
ianw	ianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 19:04:11"	02:34
ianw	Volume: 536871007	02:34
ianw	dump_size: 15613266	02:34
auristor	14MB which will be the size of the directories	02:34
auristor	subtract 15m from that time and what do you get?	02:35
ianw	$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 18:45"	02:35
ianw	Volume: 536871007	02:35
ianw	dump_size: 15613266	02:35
auristor	the problem isn't the incremental dump	02:36
auristor	rsync the content from mirror.fedora.readonly to mirror.fedora. That should be "no change" Then perform the "vos size with -time "2020-06-13 18:45"" again	02:38
ianw	umm, ok, i want to be very careful i don't destroy things with an errant command :)	02:40
auristor	you can copy mirror.fedora to a new volume	02:40
auristor	vos copy -id mirror.fedora -fromserver 104.130.138.161 -frompart a -toname test.fedora -toserver 104.130.138.161 -topart a	02:43
auristor	then mount test.fedora so you can rsync to it	02:43
ianw	ok, i just have a dry-run going anyway to see what it thinks about things	02:44
ianw	rsync -avz --dry-run /afs/openstack.org/mirror/fedora/ /afs/.openstack.org/mirror/fedora/ reports nothing to do	02:45
auristor	those aren't the rsync options you indicated earlier	02:46
auristor	of -rltDiz the most interesting is -t	02:47
ianw	https://static.opendev.org/mirror/logs/rsync-mirrors/fedora.log	02:50
ianw	does have verbose logging on that should show if rsync touches anything	02:50
ianw	that's the itemize changes (-i) which will show why it updated files	02:51
auristor	the behavior I observed was that rsync didn't update the data but it set the last update time on files it didn't modify	02:52
ianw	the vos copy i guess will take a while	02:55
auristor	sadly its performed via rx over loopback	02:55
ianw	i can strace the rsync to see exactly what it touches	02:56
auristor	the fileserver audit log would tell as well	02:56
ianw	right, i'm pretty sure that's what i got @ http://people.redhat.com/~iwienand/fedora-mirror-11-09-2019.tar.gz	02:57
auristor	I wonder if this is the problem with the openafs client	02:58
auristor	ip->i_mtime.tv_sec = vp->va_mtime.tv_sec;	02:58
auristor	/* Set the mtime nanoseconds to the sysname generation number.	02:58
auristor	* This convinces NFS clients that all directories have changed	02:58
auristor	* any time the sysname list changes.	02:58
auristor	*/	02:58
auristor	ip->i_mtime.tv_nsec = afs_sysnamegen;	02:58
auristor	in other words, the nsec component of the mtime reported by the openafs client is not going to match the nsec time that rsync obtains from the source	02:59
auristor	if the data hasn't changed, rsync won't rewrite it. but with -t it will try to fix the mtime	03:00
auristor	In the FileAuditLog you are looking for AFS_SRX_StStat events	03:01
ianw	i feel like that would show in itemized-changes	03:02
auristor	AFS_SRX_StStat events for a FID without a AFS_SRX_StData event	03:02
ianw	i think maybe if i bring mirror-update back online, and get in there fast and take the update lock, then i should be able to run the exact rsyncs under strace	03:04
ianw	that seems the lowest impact way to get data right now	03:04
auristor	ok	03:06
ianw	ok, i've commented out the cron run and will update the script and run manually	03:11
ianw	it's running in a screen on mirror-update	03:16
ianw	logging to ~ianw/rsync-run	03:17
ianw	lstat("Modular/x86_64/os/Packages/p/perl-Time-Piece-1.31-415.module_2570+32b47dc0.x86_64.rpm", {st_mode=S_IFREG\|0644, st_size=43780, ...}) = 0	03:17
ianw	utimensat(AT_FDCWD, "Modular/x86_64/os/Packages/p/perl-Time-Piece-1.31-415.module_2570+32b47dc0.x86_64.rpm", [UTIME_NOW, {tv_sec=1544180283, tv_nsec=202155000} /* 2018-12-07T10:58:03.202155000+0000 */], AT_SYMLINK_NOFOLLOW) = 0	03:17
ianw	is basically it	03:17
ianw	this isn't a zero delta, it's bringing in a bunch of stuff from upstream	03:21
ianw	ok, it's into that "+ sleep 1200" period	03:22
ianw	ianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871006 -dump	03:42
ianw	Volume: 536871006	03:42
ianw	dump_size: 306816062285	03:42
ianw	ianw@afs01:~$ vos size -server afs01.dfw.openstack.org -part a -id 536871006 -dump -time "2020-06-15 03:00"	03:42
ianw	Volume: 536871006	03:42
ianw	dump_size: 306700281349	03:42
ianw	i don't know if that is right, but that's 110mb difference from before and now	03:42
auristor	-time "2020-06-13 18:45"	03:44
ianw	$ vos size -server afs01.dfw.openstack.org -part a -id 536871007 -dump -time "2020-06-13 18:45"	03:46
ianw	Volume: 536871007	03:46
ianw	dump_size: 15613266	03:46
auristor	you want the incremental dump of the RW	03:47
ianw	well the release has started	03:49
ianw	i've put mirror-update in emergency so the cron job doesn't come back	03:52
*** ykarel\|away is now known as ykarel		03:55
auristor	I'm done for the night.	03:59
ianw	auristor: thanks, i think if we do some manual tracing of zero-delta updates we can get some more info to go off	04:01
AJaeger	ianw: what kind of cleanup is needed after https://review.opendev.org/735301?	05:47
openstackgerrit	Merged openstack/project-config master: Add github sync job for tricircle https://review.opendev.org/735417	05:54
AJaeger	ianw: I see you left the plain ones in - ok, so no need for cleanup yet.	05:55
ianw	AJaeger: yeah, i'll get rid of everything after it's settled	06:00
ianw	delete the zuul-jobs testing, then the nodes can go	06:01
AJaeger	ok	06:07
openstackgerrit	Felix Edel proposed zuul/zuul-jobs master: Return upload_results in upload-logs-swift role https://review.opendev.org/733564	06:19
openstackgerrit	Felix Edel proposed zuul/zuul-jobs master: Return upload_results in test-upload-logs-swift role https://review.opendev.org/735503	06:19
*** ysandeep is now known as ysandeep\|afk		06:31
*** priteau has joined #opendev		06:34
AJaeger	infra-root, I just saw "Could not connect to mirror.mtl01.inap.opendev.org:443 (198.72.125.6), connection timed " ;(	06:50
AJaeger	Log: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_91d/735494/5/check/tempest-full-py3/91dbf66/job-output.txt	06:50
AJaeger	happens in https://b7511727a7deb59d79f6-083f3205a01a368b196dd0a8486413e5.ssl.cf2.rackcdn.com/735494/5/check/neutron-tempest-linuxbridge/cfb66a5/job-output.txt as well	06:51
ianw	AJaeger: hrm, it's up and i can talk to it	06:51
AJaeger	ianw: I cannot from here	06:52
ianw	yeah, apache not talking but the host is	06:52
ianw	it's been up 200+ days, i'm going to reboot it	06:53
ianw	there's nothing in dmesg for over a month	06:53
AJaeger	thanks	06:53
ianw	ok responding now	06:55
ianw	#status log rebooted mirror.mtl01.inap.opendev.org due to unresponsive apache processes	06:56
openstackstatus	ianw: finished logging	06:56
*** ykarel is now known as ykarel\|afk		06:56
ianw	fungi/auristor: i think the nanosecond comment is honing in on the problems; there's constant calls to utimensat() on no-op rsyncs	06:57
ianw	mirror-update.opendev.org:~ianw/rsync-run/rsync.3911 is an example	06:58
*** ykarel\|afk is now known as ykarel		07:00
*** iurygregory has joined #opendev		07:11
*** tosky has joined #opendev		07:28
*** DSpider has joined #opendev		07:40
-openstackstatus- NOTICE: uWSGI made a new release that breaks devstack, please refrain from rechecking until a devstack fix is merged.		07:41
*** rpittau\|afk is now known as rpittau		08:00
*** moppy has quit IRC		08:01
*** moppy has joined #opendev		08:01
*** ykarel is now known as ykarel\|lunch		08:04
ianw	fungi/auristor: i think that's the smoking gun -- http://paste.openstack.org/show/794754/ -- that just uses utimensat to update the mtime. it's always "1". i have to think about the implications	08:09
ianw	is it as easy as dropping "-t"?	08:10
frickler	#status log force-merged https://review.opendev.org/735517 and https://review.opendev.org/577955 to unblock devstack and all its consumers after a new uwsgi release	08:15
openstackstatus	frickler: finished logging	08:15
hrw	morning	08:34
*** ysandeep\|afk is now known as ysandeep		08:41
*** ykarel\|lunch is now known as ykarel		08:49
*** priteau has quit IRC		09:11
*** priteau has joined #opendev		09:21
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	09:25
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	09:35
hrw	4 days of weekend were great. but had to end.	09:46
hrw	http://mirror.regionone.linaro-us.opendev.org/ feels weird. does not list anything anymore (did in past). something changed?	09:49
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	09:50
ykarel	looks like centos mirrors are gone again https://mirror.ca-ymq-1.vexxhost.opendev.org/ or it was not fixed for the provider	09:54
ykarel	just seen in a job https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_552/727200/73/check/tripleo-ci-centos-8-containers-multinode/5524ccf/job-output.txt	09:54
AJaeger	infra-root, any idea? Looks good on https://mirror.mtl01.inap.opendev.org/centos/	09:56
hrw	hm. looks like mirrors are weird state or sth.	09:58
hrw	linaro-us one feels empty	09:58
priteau	Is Zuul a bit slow today? It took 6 minutes between W+1 and starting gate jobs on https://review.opendev.org/#/c/734040/	10:01
*** rpittau is now known as rpittau\|bbl		10:03
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	10:03
*** Meiyan has quit IRC		10:05
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	10:13
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	10:28
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	10:35
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	10:49
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:02
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:12
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:18
*** hashar has joined #opendev		11:20
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:28
*** ykarel is now known as ykarel\|afk		11:30
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:35
*** nautics889 has joined #opendev		11:45
*** nautics889 has quit IRC		11:55
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	11:58
*** rpittau\|bbl is now known as rpittau		12:04
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	12:07
*** ysandeep is now known as ysandeep\|afk		12:07
ianw	hrw: i dunno ... ls on /afs/openstack.org times out	12:07
ianw	there's a lot of messages in there about dropped connections	12:08
AJaeger	argh ;(	12:11
ianw	it's annoying and i just rebooted it ... another afs issue to investigate longer term :/	12:12
AJaeger	thanks, ianw	12:19
hrw	ianw: thanks	12:35
*** ysandeep\|afk is now known as ysandeep		12:45
*** ykarel\|afk is now known as ykarel		12:46
auristor	ianw: I'm just returning to my desk. From my reading of the rsync repository the nanosec comparison is a fairly recent addition and -t sends the timestamp to the remote for time optimization. If -t is not set, then the timestamp comparison optimization is ignored and comparison of the data contents is used exclusively. In the case, of rsync and /afs the timestamp comparison doesn't work anyway so I think leaving it off is the	12:52
auristor	.	12:52
*** priteau has quit IRC		12:54
*** priteau has joined #opendev		12:55
fungi	ianw: should i check all the mirror frontends to make sure there's not more of them hung, or have you already?	13:10
hrw	https://review.opendev.org/#/c/730331 got refreshed so Kolla now uses wheel cache first and then pypi mirror as a fallback.	13:24
mordred	hrw: cool	13:26
*** hashar has quit IRC		13:33
fungi	infra-root: seems there are some jobs failing on afs writes from the zuul executors. i'm going through and checking them one by one, so far i've shutdown the zuul-executor service on ze01	13:39
hrw	checking build time difference now	13:39
fungi	er, sorry, on ze04	13:39
fungi	okay, ze04 seems to have been the only one which couldn't ls /afs/.openstack.org/docs/	13:40
corvus	fungi: i'm around - need anything?	13:41
fungi	similar to the mirrors ianw was looking at, `ls /afs/.openstack.org/` on ze04 is empty	13:41
fungi	corvus: sanity checks maybe	13:41
fungi	still just cleaning up from the afs01.dfw outage late saturday utc	13:42
auristor	fs checkservers -all	13:42
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add tests for upload-docker-image https://review.opendev.org/735402	13:42
auristor	fs checkvolumes	13:42
fungi	auristor: sadly, those give me "All servers are running." and "All volumeID/name mappings checked." but `ls /afs/.openstack.org/` is still coming back empty	13:43
fungi	(on this particular client that is)	13:43
fungi	interestingly dmesg there doesn't report any "lost contact" log entries from around or after the outage	13:45
*** ysandeep is now known as ysandeep\|afk		13:46
fungi	i have a feeling if i restarted afsd and possibly also did an rmmod/modprobe of the openafs lkm, this would go back to normal	13:47
fungi	rebooting the other clients which exhibited similar issues with the ro replicas seemed to solve it, but unfortunately doesn't tell us much about what the actual problem was	13:48
openstackgerrit	David Moreau Simard proposed openstack/project-config master: Create a new project for recordsansible/ara-collection https://review.opendev.org/735439	13:49
fungi	though this particular client is one out of a redundant cluster of a dozen servers, so we can more easily keep it like this for a bit to poke around	13:49
fungi	interestingly it sees the read-only tree under /afs/openstack.org/ just not the read/write tree under /afs/.openstack.org/	13:50
openstackgerrit	Drew Walters proposed openstack/project-config master: Add missing project to Airship doc job https://review.opendev.org/734874	13:52
corvus	fungi: i don't have any other ideas	13:54
corvus	fungi: i agree that a client restart may be in order	13:55
fungi	being down one out of twelve executors for a bit is likely fine, so i'm happy leaving it like this in case there are other ideas of things we want to check first	13:57
corvus	fungi: i ran 'fs flush /afs/.openstack.org' and things have improved	14:00
fungi	corvus: oh, indeed, that seems to now be returning expected content	14:00
fungi	so was it possible it cached an empty state for the cell root?	14:00
corvus	that's what it looks like	14:01
corvus	auristor: ^ fyi	14:01
openstackgerrit	Jeremy Stanley proposed opendev/system-config master: Forward user-committee ML to openstack-discuss https://review.opendev.org/733673	14:04
corvus	docs volume under that looks fine	14:05
fungi	`ls /afs/.openstack.org/mirror/` on ze04 is taking several minutes to complete so far	14:05
hrw	0:05:21.497262 versus 0:23:49.306074 is nice improvement	14:07
fungi	hrw: is that the speedup from using prebuilt wheels?	14:08
hrw	fungi: yes	14:08
fungi	significant!	14:08
hrw	we have two images which suck time. waiting for second one	14:08
corvus	fungi: well, that might call for a reboot :/	14:09
fungi	corvus: yeah, it's still blocking...	14:10
fungi	i mean, technically the executor shouldn't need to write to /afs/.openstack.org/mirror/ at the moment (though when we get the wheel builder jobs reworked it will)	14:11
fungi	i'm just more worried it's indicative of deeper problems	14:11
*** priteau has quit IRC		14:12
corvus	fungi: agreed. at this point, i'd suggest we restart the client or reboot (reboot since it's more thorough and no less disruptive)	14:14
fungi	it just now returned	14:16
fungi	after spitting out "ls: cannot access '/afs/.openstack.org/mirror/fedora': Resource temporarily unavailable"	14:16
AJaeger	also: I presented something when I visited Amundi in February. Do you need anything else?	14:16
AJaeger	fungi, https://mirror.ord.rax.opendev.org/centos/7/os/x86_64/Packages/virt-what-1.18-4.el7.x86_64.rpm is failing to download	14:17
AJaeger	gives a forbidden ;(	14:17
AJaeger	(ignore my first pasto :(	14:17
fungi	fungi@mirror01:~$ ls /afs/openstack.org/mirror/centos/	14:18
fungi	ls: cannot access '/afs/openstack.org/mirror/centos/': Connection timed out	14:18
corvus	'fs checkservers' is unhappy here	14:19
fungi	checkservers on mirror01.ord.rax.opendev.org is taking a while	14:20
corvus	These servers unavailable due to network or server problems: mirror01.ord.rax.opendev.org.	14:20
corvus	slighly counterintuitive message :/	14:20
fungi	that looks like the 127.0.1.1 problem showing up again	14:20
corvus	iiuc, that was a volume which had its vldb entry set to 127.0.1.1 ?	14:21
corvus	those were all fixed, right?	14:21
fungi	that's what i thought	14:21
corvus	dmesg says: [Jun13 20:45] afs: Lost contact with file server 104.130.138.161 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server)	14:22
corvus	and no "back up" message	14:22
fungi	that's when afs01.dfw hung, yeah	14:22
corvus	maybe we should go with a reboot here too?	14:22
corvus	or afsd restart	14:23
fungi	i can give that a shot first	14:23
corvus	k	14:23
hrw	should I use http://mirror.regionone.linaro-us.opendev.org:8080/wheel/debian-10-aarch64/ or http://mirror.regionone.linaro-us.opendev.org/wheel/debian-10-aarch64/ on CI?	14:24
hrw	:8080 gives 403 ;(	14:24
openstackgerrit	Monty Taylor proposed opendev/zone-opendev.org master: Add review-test https://review.opendev.org/735600	14:25
fungi	hrw: the wheel cache is served over 80 and 443, 8080 is a proxy	14:26
hrw	fungi: thanks. was not sure	14:26
hrw	updated patch	14:27
fungi	corvus: i ended up rebooting it because afsd wouldn't stop	14:27
corvus	i suspected as much :)	14:28
fungi	#status log rebooted mirror01.ord.rax.opendev.org to clear hung openafs client state	14:28
openstackstatus	fungi: finished logging	14:28
openstackgerrit	Monty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on https://review.opendev.org/735602	14:28
hrw	fungi: need to check does requirements-tox-py3x-check-uc* jobs in openstack/requirements use cache too	14:30
mordred	corvus, fungi: those two patches ^^ should help me finish standing up review-test so that I can rsync / mysqldump the existing prod content over. I made a private hostvars file for it with what I think is the bare minimum of secrets (we don't need a bunch of the prod ones for this) - and I moved group_vars/review.yaml to host_vars/review01.openstack.org.yaml	14:31
corvus	mordred: i guess we want to keep review-dev around for testing without production-copy data, which is why this is a new server and not repurposing that?	14:32
fungi	i was about to hard reboot mirror01.ord.rax.opendev.org via api, but oob console just showed it finally giving up waiting on [something] to terminate	14:32
AJaeger	config-core, please review https://review.opendev.org/#/c/734874/ - the starlingx team needs this to prepare for the election	14:33
mordred	corvus: yeah - although I think we could also consider merging the two ideas at some point - now that we don't replicate to github, I think we could move gtest to the production gerrit and then have a review-dev like the one I'm setting up for review-test that gets a periodic data rsync from review	14:34
mordred	but I didn't want to block upgrade testing on getting that done	14:34
corvus	++	14:34
fungi	mirror01.ord.rax.opendev.org is back online now and i can `ls /afs/openstack.org/mirror/centos/` successfully	14:35
hrw	looks like I will have a change which touch all jobs	14:35
openstackgerrit	Marcin Juszkiewicz proposed zuul/zuul-jobs master: pip.conf: use wheel cache first and fallback to pypi mirror https://review.opendev.org/735606	14:40
hrw	can config-core take a look at ^^?	14:40
hrw	I hope that commit message is clear enough	14:40
fungi	hrw: it's not clear to me why that's necessary. pypi doesn't try things in sequence, it pulls all the indices and then decides what to download	14:42
fungi	extra-index-url isn't a "fallback" it's just yet another index it incorporates	14:42
fungi	otherwise our wheel cache wouldn't work for any architecture	14:43
hrw	ah. so maybe I messed it with :8080 used for cache at same time	14:44
hrw	dropped	14:45
fungi	yeah, our "pypi_mirror" is a caching proxy, our "wheel_mirror" is served directly by apache	14:45
hrw	thanks fungi	14:45
openstackgerrit	Merged openstack/project-config master: Add missing project to Airship doc job https://review.opendev.org/734874	14:46
hrw	INFO:kolla.common.utils.openstack-base: Downloading http://mirror.regionone.linaro-us.opendev.org/wheel/debian-10-aarch64/setproctitle/setproctitle-1.1.10-cp37-cp37m-linux_aarch64.whl (37 kB)	14:47
*** sgw has quit IRC		14:47
hrw	yes ;D	14:47
mnaser	hi friends -- appreciate reviews on https://review.opendev.org/#/c/735478/	14:48
fungi	hrw: yeah, if you want to see the details, the mirror servers' use this vhost configuration: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j2	14:48
*** sgw has joined #opendev		14:50
openstackgerrit	James E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support https://review.opendev.org/720302	14:50
fungi	hrw: according to that config, we actually make the pypi proxy available over 80/443/8080/4443 (because the proxy statements for it are included in both the basemirror and proxymirror macros)	14:53
fungi	though we likely set the mirror vars to use 8080/4443 because the 80/443 proxies are just backward compatibility from when we used to host our own mirror of pypi (before it got far too large)	14:54
hrw	https://7a40b7c4a1adb2feec0f-f29e759a440a8c469e5909803b48c54b.ssl.cf1.rackcdn.com/735599/1/check-arm64/requirements-tox-py38-check-uc-aarch64/3038d7d/tox/py38-check-uc-1.log works lovely ;D	14:55
fungi	excellent	14:55
clarkb	fungi: corvus I don't think centos was one with 127.0.1.1. The centos wheel mirror for arm64 was. The centos wheel mirror for x86 was not but it was accidentally cleaned up and recreated	14:55
auristor	fungi: sorry, I had to step away. I wonder if the location server list for the cell became corrupted.	14:55
*** ysandeep\|afk is now known as ysandeep		14:56
*** ykarel is now known as ykarel\|away		14:56
fungi	clarkb: yes. entirely possible checkservers was still trying to find 127.0.1.1 though even though we deleted and recreated those volumes	14:56
mnaser	thanks corvus and AJaeger :D	14:56
clarkb	fungi: but that volume was never part of the 127.0.1.1 problem?	14:56
hrw	fungi: is RETRY_LIMIT on https://zuul.openstack.org/builds?job_name=requirements-tox-py38-check-uc-aarch64 means 'we need more hosts'?	14:56
clarkb	or do you think that could have affected other volumes somehow?	14:57
fungi	clarkb: right, that volume wasn't, i was just speculating on why the checkservers command was reporting the local hostname for the client as unavailable	14:57
clarkb	got it	14:57
fungi	not necessarily related to the volume access issue	14:57
fungi	auristor: how do i query the location server list?	14:57
* fungi checks docs		14:57
AJaeger	hrw: RETRY_LIMIT normally means: pre-playbook failed, was retried and Zuul gave up after three tries	14:58
fungi	ahh, the vldb	14:58
openstackgerrit	Merged opendev/zone-opendev.org master: Add review-test https://review.opendev.org/735600	14:59
hrw	AJaeger: thx	14:59
fungi	the sites listed for mirror.fedora look correct (rw and ro on afs01.dfw.openstack.org, ro on afs02.dfw.openstack.org)	14:59
auristor	afs clients do not forget fileservers addresses once they've been told about them. only a restart will clear the known fileserver list	14:59
auristor	there is no 127.0.1.1 fileserver entry in the VLDB at this time	15:00
fungi	auristor: got it, that likely explains the checkservers error hanging around	15:00
auristor	fs checkvolumes should discard the known volume to fileserver address bindings.	15:00
auristor	if /afs/.openstack.org/ is not accessible that sounds like a bug in the dynamic root logic. fs flush /afs or fs flush /afs/.openstack.org might clear it.	15:03
auristor	I don't remember if "fs lsm /afs/.openstack.org" works for OpenAFS on dynamic root entries.	15:03
fungi	auristor: yes, `fs flush /afs/.openstack.org` did clear it according to corvus	15:03
openstackgerrit	Merged openstack/project-config master: Add vexxhost/atmosphere https://review.opendev.org/735478	15:05
*** mlavalle has joined #opendev		15:05
*** sgw1 has joined #opendev		15:06
auristor	That sounds like corruption of the dynamic root entry	15:08
openstackgerrit	Merged zuul/zuul-jobs master: Add namespace in the collect-k8s-logs role https://review.opendev.org/731319	15:08
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test https://review.opendev.org/735610	15:10
mordred	corvus: ^^ does that seem like a sane sync playbook?	15:11
mordred	corvus: my thinking is that if we shut down gerrit, sync the git repos, the indexes and the caches, apply the most recent mysqldump - we should be in a pretty equivilent state, yeah?	15:16
mordred	so we can then do a test migration, see how it goes, then just do a state sync	15:16
mordred	and do it again	15:17
mordred	(I was originally thinking about using cloud snapshots - but I think that's too complicated honestly - because rebooting into a snapshot does stuff with ephemeral, so we'd need to invent some automation around launch-node tasks that would need to be re-done - and I think rsync will do it)	15:17
corvus	mordred: things may be a little out of sync in terms of the mysqldump being behind the current prod git repo state. do you think that would be a problem? i think it would be really important to have the 2 in sync for the notedb migration, but maybe just going to 2.16 it's not as important?	15:25
corvus	mordred: if we do think it's important, we could shut down prod gerrit briefly, take a mysql dump, and do a final incremental rsync. outage should only be a few minutes?	15:26
clarkb	corvus: mordred: maybe as a first step having an in sync point in time we can restore is sufficient?	15:26
clarkb	then once we'd decided if upgrading in sequence with online upgrades or doing one major upgrade is better we can refine that specific option with more up to date data?	15:26
corvus	clarkb: sorry, i'm not following -- i'm wondering whether we need to have the mysqldb and the git repos in sync on review-test, or if having a db that's slightly older than the git repos is okay	15:27
clarkb	corvus: ya I was more addressing the automation around launch node. eg we don't need a full proper sync each time we launch a new review-test. We only need one that we can copy and restore	15:28
clarkb	assuming that we decide a full sync is necessary	15:28
corvus	clarkb: oh yeah, i think mordred intends to keep review-test persistent; i think that playbook is an ad-hoc playbook	15:29
clarkb	ah	15:29
corvus	i think mordred's approach is probably okay, but we're going to have change refs for changes that aren't in the db, so pushing up new changes is almost certainly a bad idea. but just to test/time re-indexing, etc, it's probably sufficient.	15:32
mordred	corvus: yeah - that. I think we could also create a point-in-time snapshot like you suggest	15:43
mordred	corvus: perhaps once we're happy with an upgrade procedure we can create a consisent snapshot to test and upgrade that as a more final test before we go - so that we can test pushing up changes and stuff	15:44
corvus	mordred: sounds good	15:53
corvus	an outage for a PIT should be fairly short.	15:53
mordred	yah	15:54
openstackgerrit	James E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support https://review.opendev.org/720302	16:01
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test https://review.opendev.org/735610	16:04
mordred	corvus: ^^ I lost to the whitespace gods	16:05
*** sshnaidm_ is now known as sshnaidm		16:05
fungi	the only way to win is not to play	16:06
*** ysandeep is now known as ysandeep\|away		16:13
*** rpittau is now known as rpittau\|afk		16:17
mnaser	hi friends	16:27
mnaser	is there any chance that zuul logging is borked because of some ooms?	16:27
fungi	i can look	16:28
mnaser	https://zuul.opendev.org/t/vexxhost/status <= my jobs here when clicking go straight to END OF STREAM	16:28
mnaser	(could also be something else, but i can't tell really)	16:28
fungi	most recent oom on any executor was 2020-03-29 on ze02	16:31
fungi	well, on any running executor (there was one from april on ze04 but it's currently down for evaluation)	16:32
clarkb	there is a period of time between the node being assigned and the job actualyl starting on the remote node where there is no stream content	16:32
fungi	we restarted our executor services on 2020-05-26 so i don't think any log streamers have been sacrificed in an oom event since then	16:33
*** diablo_rojo has joined #opendev		16:34
clarkb	mnaser: fungi both jobs seem to have content now	16:35
clarkb	I think the period between node assignment and job starting enough to have a streamer running is likely the cause here	16:36
*** diablo_rojo has quit IRC		16:39
clarkb	corvus: did you see my question on https://review.opendev.org/#/c/730929/6 ?	16:44
clarkb	also I'm double checking that we merged all the changes from friday's renaming and it appears we have. If you've got any still open please let me/us know	16:44
openstackgerrit	Monty Taylor proposed opendev/system-config master: Don't install puppet modules when we don't need them https://review.opendev.org/735642	16:46
mordred	clarkb: ^^ I just noticed that when looking at a test run that timed out - we're installing all of the pupept modules from git in every job even when those jobs don't run puppet	16:46
mordred	(t	16:47
mordred	it's only taking 2 minutes - but still, that's 2 completely wasted minutes in most of our jobs)	16:47
corvus	clarkb: ah yeah, looks like a rebase snafu	16:56
openstackgerrit	James E. Blair proposed opendev/system-config master: Fake zuul_connections for gate https://review.opendev.org/730929	16:57
*** diablo_rojo has joined #opendev		16:59
mordred	corvus: stop using backend hostname should be safe to land yes?	17:00
mordred	(I mean, it looks that way, just checking to make sure)	17:01
corvus	mordred: yeah, i think it'll all good up to WIP zookeepe	17:01
mordred	cool	17:01
clarkb	https://review.opendev.org/#/c/734711/ is an easy puppet code deletion if anyone has a quick moment	17:01
clarkb	and https://review.opendev.org/#/c/734647/ will update a number of docker images, but helps make our python3 auditing cleaner	17:02
mordred	clarkb: done on both	17:02
mordred	clarkb: did we just switch out nodes to ones without virtualenv pre-installed?	17:03
clarkb	gitea's 1.12.0 milestone is down to a single issue without an open PR. The other issue has an open PR that passes testing and needs review	17:04
clarkb	mordred: we did	17:04
mordred	because I just got a failure on system-config-legacy-logstash-filters: https://zuul.opendev.org/t/openstack/build/876b22c1c06649ea8aaea5f0733a7937	17:04
mordred	AWESOME	17:04
clarkb	mordred: ianw did that during australia monday	17:04
mordred	I'll get up a fix	17:04
clarkb	thanks	17:04
openstackgerrit	Monty Taylor proposed opendev/system-config master: Use python3 -m venv instead of virtualenv https://review.opendev.org/735643	17:06
mordred	infra-root: ^^ fix gate break	17:06
mordred	clarkb: I	17:07
mordred	'I'm excited we're close to 1.12	17:07
clarkb	mordred: hrm for the venv fix I think that may still nto work on xenial beacuse xenial's pip isn't able to handle our wheel mirror config? I could be wrong about that (testing should tell us)	17:07
clarkb	if it does fail due to the wheel mirror being present we can just add the ensure-virtualenv role to the job	17:07
fungi	mordred: shouldn't that use -m ?	17:09
fungi	at least testing locally, `python3 -v venv foo` doesn't seem to create a venv	17:10
fungi	"python3: can't open file 'venv': [Errno 2] No such file or directory"	17:10
clarkb	fungi: yes, the commit message got it right	17:11
fungi	indeed, seems so	17:12
clarkb	oh neat looks like the other issue assocaited with 1.12 is maybe not a bug	17:13
clarkb	I wonder if this means we could have a 1.12.0 release this week	17:13
fungi	that would be exciting	17:13
openstackgerrit	Monty Taylor proposed opendev/system-config master: Use python3 -m venv instead of virtualenv https://review.opendev.org/735643	17:13
mordred	fungi, clarkb: yup. I can't type :)	17:14
fungi	no worries, me neither	17:14
fungi	half the time i'm lucky i can even read	17:14
mordred	fungi: I think it's unreasonable to expect a single person to be able to both read AND write	17:16
fungi	sometimes i can append, does that count?	17:17
corvus	as i read this conversation, the word 'truncate' comes to mind	17:18
mordred	corvus: that sounds like truculence to me	17:28
openstackgerrit	Monty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on https://review.opendev.org/735602	17:30
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test https://review.opendev.org/735610	17:30
corvus	mordred: is that when a big-rig driver .... nevermind	17:31
mordred	corvus: yes	17:34
dmsimard	mordred: would love a refresh of your +2 on https://review.opendev.org/#/c/735439/ <3	17:42
hrw	https://marcin.juszkiewicz.com.pl/2020/06/15/opendev-ci-speed-up-for-aarch64/	17:44
mordred	dmsimard: done	17:46
dmsimard	\o/ thanks	17:46
AJaeger	hrw: thanks, nice numbers on speed improvement!	17:53
mordred	hrw: nice!	17:53
hrw	2020-05/#opendev:22 14:47 < hrw> I should probably find it 2-3 years ago ;D	17:56
hrw	;D	17:56
*** hashar has joined #opendev		17:57
fungi	excellent article	17:58
openstackgerrit	Merged openstack/project-config master: Create a new project for recordsansible/ara-collection https://review.opendev.org/735439	18:00
hrw	thx	18:00
hrw	should have some links in it but I care less about seo than before ;D	18:01
clarkb	it is always great to see how changes we've made help	18:01
mordred	clarkb, fungi: wat - https://zuul.opendev.org/t/openstack/build/aee168e0d87c4dbf9337c4bee692104b	18:18
mordred	does python3 -m venv not produce a venv with a working pip in it?	18:18
corvus	\o/ zuul with zk tls started! https://zuul.opendev.org/t/openstack/build/dbff561b77214db19a05d9711a09634a/log/zuul01.openstack.org/debug.log	18:19
clarkb	mordred: ya I think that was what I was trying to describe earlier	18:19
clarkb	mordred: you may need to use ensure-virtualenv on xenial to work around python sillyness on ubuntu	18:19
mordred	clarkb: ok. I'm going to do that	18:19
fungi	i'm dubious that's the cause, but doing some local testing now	18:20
clarkb	the problem I remember had to do with it using old pip	18:20
clarkb	I would've expected a pip in the virtualenv though	18:20
clarkb	possible that site pacakges changes the behavior there	18:20
clarkb	and if you don't have a python3 pip installed in the system you get no pip in the venv?	18:21
openstackgerrit	James E. Blair proposed opendev/system-config master: Add Zookeeper TLS support https://review.opendev.org/720302	18:21
fungi	yeah, i don't get the behavior there. on debian/ubuntu with distro packaged python3, either you have python3-venv installed which depends on a wheel bundle including pip, or you get an error about ensurepip failing	18:21
fungi	i thought maybe there was a chance --system-site-packages changed that behavior, but it doesn't seem to for me	18:22
openstackgerrit	Monty Taylor proposed opendev/system-config master: Use ensure-virtualenv in legacy puppet jobs https://review.opendev.org/735643	18:22
fungi	there is a --without-pip option to the venv module	18:22
fungi	maybe somehow it's defaulting on	18:22
fungi	more testing	18:23
* mordred isn't going to lose a lot of sleep on it - these jobs need to diaf anyway		18:23
fungi	at least in debian/sid it's installing pip into the venv for me even using distro-packaged python3-venv	18:24
mordred	fungi: maybe it's clarkb's thing - if - you don't have python3-pip installed do you wind up with no pip?	18:25
fungi	i did not install python3-pip	18:26
fungi	and did not have it installed	18:26
mordred	yeah. I agree - I just did that locally too	18:26
fungi	python3-venv pulls in python3.8-venv and python-pip-whl, the latter has wheel bundles for stuff including pip	18:26
mordred	and I happily have pip in the venv	18:26
openstackgerrit	Ghanshyam Mann proposed openstack/project-config master: Retire Tricircle projects: finish infra todo https://review.opendev.org/728902	18:27
mordred	fungi: WEIRD	18:27
fungi	this failed on xenial though	18:28
fungi	so maybe it's older behavior?	18:28
clarkb	fungi: I was just going to ask python3.8 isn't on xenial	18:28
clarkb	fungi: yes that is my hunch	18:28
clarkb	ianw discovered xenial to be weird	18:28
fungi	yeah, i was testing on debian/sid since it's what i have locally	18:28
fungi	i thought this was how the python3-venv package had worked for a while, but perhaps not so long as xenial	18:29
fungi	though it looks the same from a deps standpoint	18:30
fungi	python3-venv on xenial depends on python3.5-venv which depends on python-pip-whl	18:30
fungi	and it in turn only depends on ca-certificates, no python3-pip or python3.8-pip or anything of the sort	18:31
* mordred just tried it in a xenial container		18:31
mordred	and it worked just fine	18:31
fungi	and python-pip-whl only installs .whl files under /usr/share/python-wheels/ nothing directly importable or executable	18:31
clarkb	mordred: ya I just did that too	18:31
fungi	https://packages.ubuntu.com/xenial/all/python-pip-whl/filelist	18:31
fungi	so that build failure is very puzzling	18:32
mordred	I'm almost interested in holding a node	18:32
fungi	could the python3 -m venv call have failed but not returned an error somehow?	18:32
mordred	maybe?	18:33
clarkb	fungi: that seems plausible	18:33
openstackgerrit	Monty Taylor proposed opendev/system-config master: Make a review-test that we run ansible on https://review.opendev.org/735602	18:33
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add playbook for syncing state from review to review-test https://review.opendev.org/735610	18:33
clarkb	if venv wasn't installed we'd get an error (just tested this on xenial container)	18:34
clarkb	so venv needs to be there in some capacity to have it be silent like that	18:34
clarkb	are we only serving the arm64 wheels on the arm64 mirror?	18:40
clarkb	I guess that kinda makes sense	18:40
clarkb	but with things like zuuls cross arch docker image builds we may want to put the contents for all the arches in all the mirrors	18:40
mordred	clarkb: that's a good point	18:43
mordred	although won't that require some logistical reworking?	18:43
clarkb	mordred: I don't think so since everything is path scoped by arch already	18:44
clarkb	mordred: I think it may just be a matter of having the correct symlinks on disk and apache config?	18:44
mordred	nod	18:44
mnaser	has anyone looked at system-config-legacy-logstash-filters or not yet? :>	18:57
mnaser	i can try myself at fixing it if there's no one at it	18:58
clarkb	mnaser: mordred is	18:58
clarkb	mnaser: https://review.opendev.org/735643 that change	18:58
mnaser	ok, cool -- /me can help if needed	18:59
mordred	mnaser: it _should_ be fixed by that	19:02
mordred	mnaser: and one day I'll get around to killing that job	19:02
mnaser	\o/	19:03
fungi	mordred: i also noticed over the weekend pbr's unit and devstack/tempest jobs are busted too, though i haven't had time to dig into that yet	19:04
clarkb	fungi: the python2 failure is using the stestr constraint for version 3.0.1 which is python3 only	19:07
fungi	yeah, i'm unsurprised there	19:07
clarkb	and python3 failed on some virtualenv thing which may be related to new images? though the timestamp is such that I don't think so	19:07
clarkb	AttributeError: module 'virtualenv' has no attribute 'create_environment'	19:08
clarkb	possible that is related to virtualenv 3 updates?	19:08
fungi	oh, maybe	19:13
mordred	clarkb: feel like a +A on https://review.opendev.org/#/c/735602 ?	19:15
clarkb	I'll take alook after lunch	19:15
openstackgerrit	Monty Taylor proposed opendev/system-config master: Don't install puppet modules when we don't need them https://review.opendev.org/735642	19:35
openstackgerrit	Monty Taylor proposed opendev/system-config master: Install pip3 on codesearch https://review.opendev.org/735668	19:35
mordred	clarkb, fungi: more fallout from new nodes ^^	19:35
openstackgerrit	Merged opendev/system-config master: Use ensure-virtualenv in legacy puppet jobs https://review.opendev.org/735643	19:42
openstackgerrit	Monty Taylor proposed opendev/system-config master: Add bit more info on disabling ansible runs https://review.opendev.org/735246	19:42
mordred	fungi: ^^ I rebased that on the logstash filters fix and added reference to disable-ansible script	19:42
openstackgerrit	Monty Taylor proposed opendev/system-config master: Switch prep-apply.sh to use python3 https://review.opendev.org/729543	19:43
clarkb	centos 8.2 has released.Anothet thing to keep an eye on of/when failures happen	19:46
clarkb	mordred: is review-test a full size node?	20:04
clarkb	also looking at it we don't use the review group for group vars. We use hostvars and you've trimmed the hostvars down for review-test. Is that sufficient to ensure that things like gerritbot and launchpad syncing won't try to run in both places at once?	20:06
clarkb	(we want to rpevent that and want to make sure we've considered it and I think the split host vars does that?)	20:06
corvus	clarkb, mordred, fungi: https://review.opendev.org/720302 zk tls is ready -- do we want to think about doing that on friday?	20:07
clarkb	corvus: I'll be around and able to help	20:08
corvus	i'll add an item to the mtg agenda	20:08
fungi	yeah, i can do friday, no problem	20:09
openstackgerrit	Merged opendev/system-config master: Add tool to export Rackspace DNS domains to bind format https://review.opendev.org/728739	20:10
mordred	clarkb: yes	20:13
mordred	clarkb: as is the rax db I made	20:13
clarkb	mordred: cool so the 48g heap size won't cause problems then. What about the other thing?	20:14
mordred	clarkb: well - before I did private hostvar surgery on bridge, we actually used group_vars for review for settings	20:14
mordred	clarkb: but - I believe with the secrets being in host-specific files we will not be putting any secrets on review-test that would allow those services to operate	20:15
clarkb	cool	20:15
clarkb	that was my read of it too, just double checking	20:15
clarkb	what about email	20:15
mordred	(I'm pretty sure this first ansible run won't even finish because it'll be missing some required secrets)	20:15
clarkb	are we concerend about gerrit sending people email?	20:15
mordred	hrm. that's a good question	20:15
mordred	it should really only send mail on patchset upload right?	20:15
clarkb	ya I think upload and merge	20:16
clarkb	as long as we avoid updating random changes we're probably fine	20:16
mordred	like - as long as we're not pushing changes to or merging changes there is _SHOULD_ be fine?	20:16
mordred	yeah	20:16
clarkb	I've +2'd the change though zuul is unhappy with it	20:16
mordred	\o/	20:16
clarkb	possibly due to the host vars	20:16
mordred	let's see what's broken this time	20:16
*** hashar has quit IRC		20:17
mordred	Data could not be sent to remote host "198.72.124.215". Make sure this host can be reached over ssh: ssh: connect to host 198.72.124.215 port 22: No route to host	20:17
mordred	clarkb: it seems to have been unhappy trying to talk to fake review-dev	20:17
clarkb	ah so maybe just a recheck?	20:18
mordred	clarkb: yeah - I'll try that	20:18
mordred	clarkb: oh - also - if you have a sec ...	20:18
mordred	clarkb: check the most two recent commits in private hostvars and make sure I didn't derp?	20:18
clarkb	mordred: it looks right to me. You renamed group_vars/review.yaml to host_vars/review01.openstack.org.yaml and added host_vars/review-test.opendev.org.yaml with minimal content	20:20
mordred	clarkb: \o/	20:27
clarkb	I did git log -2 -p fwiw	20:28
clarkb	corvus: I had the zuul tls changes under CD topic. Would you like me to drop ti there and discuss it as a separate item or collapse under that heading?	20:40
clarkb	I'm getting ready to send the agenda out and want to make sure its got the proper attention	20:40
corvus	clarkb: your choice	20:40
corvus	sorry i missed it was already there	20:41
clarkb	no worries	20:41
clarkb	you added mroe info :)	20:41
openstackgerrit	Merged opendev/system-config master: Add bit more info on disabling ansible runs https://review.opendev.org/735246	20:41
openstackgerrit	Merged opendev/system-config master: Switch prep-apply.sh to use python3 https://review.opendev.org/729543	20:41
openstackgerrit	Merged opendev/system-config master: Install pip3 on codesearch https://review.opendev.org/735668	20:44
*** rchurch has quit IRC		20:44
*** rchurch has joined #opendev		20:45
mordred	corvus, fungi: if either of you have a sec: https://review.opendev.org/#/c/735642/ is easy	20:46
*** mlavalle has quit IRC		20:47
mordred	infra-root: the ensure-virtualenv patch landed and jobs work again - I have rechecked the system-config patches that had failed due to that	20:48
clarkb	mordred: thanks	20:48
clarkb	(I had a couple get caught by it)	20:48
mordred	yeah	20:49
mordred	it was actually quite the carnage - there are 7 changes in recheck right now	20:49
clarkb	I'm around but will need to transition to dadops in about half an hour to run kids' remote class thing	20:49
clarkb	(as general heads up)	20:49
mordred	I'm around and unlikely to go anywhere for a bit as I am beset on all sides by a pile of sleeping kittens	20:50
clarkb	mordred: just noticed you update https://review.opendev.org/735246 thanks for catching that	20:57
fungi	mordred: i've approved it, in case you survive burial by kitten	20:57
mordred	clarkb: sure nuff	20:58
mordred	fungi: \o/	20:58
*** mlavalle has joined #opendev		20:59
clarkb	infra-root I've discovered https://review.opendev.org/#/c/686237/ in my spring cleaning and wonder if I should either abandon that because we dno't want the behavior or push a new patchset to do that with ansible now that we use ansible to deploy zuul?	21:07
mordred	clarkb: well - maybe we should just land teh "use docker for executors" patch	21:08
mordred	clarkb: https://review.opendev.org/#/c/733967/	21:08
clarkb	mordred: ++ I'll abandon my change	21:09
clarkb	mordred: also I think that old change not being merge conflicted implies we can clean up some puppet things?	21:10
clarkb	I'll look into that while dealing with kids' school stuff	21:10
mordred	clarkb: ++	21:10
openstackgerrit	Merged opendev/system-config master: Forward user-committee ML to openstack-discuss https://review.opendev.org/733673	21:14
openstackgerrit	Merged opendev/system-config master: Change launch scripts to python3 shebangs https://review.opendev.org/734345	21:14
corvus	i have a zuul enqueue-ref command that is hung; something fishy may be going on	21:16
corvus	um. the gearman certificate has expired.	21:22
corvus	ha, it's the ca cert that expired	21:24
corvus	the client/server certs have 10-year lives	21:24
corvus	the ca only 3	21:24
corvus	istr we lost our ca infrastructure somewhere along the line	21:24
corvus	but i can use zk-ca.sh to make new certs easily	21:25
corvus	however, we'll need a full system restart to use them	21:25
mordred	corvus: yes - I believe we decided we didn't need to replace the ca infrastructure with LE and zk-ca	21:25
corvus	mordred: well, we decided that after it was removed, but yes :)	21:26
mordred	corvus: yeah	21:26
mordred	it wasn't like an _active_ choice ;)	21:26
corvus	mordred, clarkb, fungi: perhaps we should go ahead and merge https://review.opendev.org/720302 and do the zk tls and gearman rekey all at once?	21:27
clarkb	the existing connections are fine?	21:27
corvus	clarkb: yeah, as long as they aren't interrupted	21:27
corvus	we probably won't be brining any of those offline ze's back online till then though	21:27
mordred	corvus: wcpgw?	21:28
clarkb	corvus: I'm not opposed to bunlding those changes since they are related (they share a CA right?)	21:28
clarkb	but I'm not really able to help at this very moment	21:29
corvus	clarkb: the thing they most share in common is they need a full restart	21:29
clarkb	corvus: got it	21:29
mordred	corvus: so - I	21:29
mordred	I'm game	21:29
openstackgerrit	Merged opendev/system-config master: Don't install puppet modules when we don't need them https://review.opendev.org/735642	21:29
openstackgerrit	Merged opendev/system-config master: uwsgi-base: drop packages.txt https://review.opendev.org/735473	21:29
mordred	corvus: while we're full system restarting - should we land the z-e docker change?	21:29
mordred	(that might be a bit much though - and we really can do that executor at a time to make sure)	21:30
corvus	mordred: let's not -- we can restart executors one-at-a-time and reduce risk there	21:30
mordred	corvus: oh headdesk. there's another puppet failure in the stack. looking	21:31
corvus	oy i just saw that	21:31
mordred	corvus: I think it's unrelated	21:31
mordred	https://zuul.opendev.org/t/openstack/build/42cb6c4d6188486eae6dd2b7a05a6b5c/log/applytest/puppetapplytest07.final.out.FAILED	21:31
mordred	is the failure	21:31
corvus	unfortunately, that means we'll have a full re-run cycle for that	21:31
mordred	yeah	21:31
mordred	we could cheat	21:31
corvus	mordred: and?	21:31
corvus	we could force-merge all 3 changes	21:32
mordred	yeah	21:32
corvus	we can't enqueue-to-gate though because the zuul cli is out of commission	21:32
mordred	yea. given that I don't think we want to spend _hours_ in the current situation	21:32
mordred	and we do have green runs of the jobs that are actually relevant	21:33
corvus	but these do all have good check results, so seems like good risk/reward.	21:33
mordred	yah	21:33
corvus	clarkb: are you reviewing https://review.opendev.org/720302 ?	21:33
mordred	clarkb: ?	21:33
clarkb	I can review but not help with the change landing itself /me looks	21:33
fungi	i've resurfaced from making/consuming evening sustenance... catching up but can definitely help with a zuul restart for new certs	21:34
mordred	corvus: the zk patch isn't going to fix the gearman cert though	21:34
clarkb	I'm reviewing the change	21:34
mordred	corvus: but that's just a zk-ca and updating private hostvars, right?	21:34
corvus	mordred: correct	21:34
corvus	i can do that now so that it gets incorporated into the next run	21:35
fungi	also the executor daemon for ze04 is still stopped, we haven't rebooted that server yet. i didn't know if ianw might want to take a look, but we should either avoid restarting the executor on it or reboot the server	21:35
mordred	cool. so - yeah - I think the sequence would be land all the patches, shut down zuul, update hostvars, run service-zuul and service-zk and then re-start yes?	21:35
mordred	I suppose we can update the hostvars before doing the shutdown	21:36
mordred	in fact, you could probably go ahead and update the hostvars	21:36
corvus	i'd like to update hostvars; merge patches, wait for playbook completion, then restart	21:36
mordred	yes	21:36
mordred	I was just about to write that same thing	21:36
mordred	I think ti's correct - I blame the clowder of kittens for making me take a while to reach that conclusion	21:37
fungi	720302 is safe to merge, it just won't take effect until restarts, yeah?	21:38
corvus	fungi: probably? :)	21:38
fungi	but yeah, i agree with hostvars first	21:38
corvus	if it breaks, that's our signal to restart anyway :)	21:39
fungi	no argument there ;)	21:39
corvus	installing new certs now	21:39
fungi	$CAROOT looks like a hipster vegetable	21:39
ianw	fungi: ze04 having the same afs issues?	21:39
ianw	as linaro last night i mean	21:39
ianw	s/night/your local time/ :)	21:40
fungi	ianw: yes, i left it stopped since it was something we could safely leave broken to evaluate	21:40
fungi	the rax-ord mirror also needed a reboot	21:40
openstackgerrit	Merged opendev/system-config master: Cleanup old puppet management of release-volumes.py https://review.opendev.org/734711	21:40
fungi	for similar reasons	21:40
mordred	corvus: oh - fwiw - executor on ze01 is stopped because I was doing the afs+docker testing - but I think it's fine to restart when we do the restart	21:40
mordred	I do not think we need it to remain stopped	21:40
fungi	ianw: fs flush got the cell root browseable again, but trying to look at some subpaths of the tree timed out read operations	21:41
ianw	fungi: yeah, i did poke around on the linaro mirror and didn't see anything other than a lot of disconnection/connection logs	21:41
fungi	ianw: the difference with ze04 is it was having trouble getting to the rw tree rather than the ro tree	21:41
fungi	also dmesg on rax-dfw mirror showed the loss of connectivity with afs01.dfw but never logged it coming back into service	21:42
fungi	er, rax-ord mirror i mean	21:42
clarkb	corvus: couple of questions on https://review.opendev.org/#/c/720302/17 but lgtm otherwise	21:42
fungi	ianw: anyway, i spotted ze04 because a bunch of publication jobs were failing rsync calls	21:42
ianw	fungi: yeah, not sure i have anything else i know to look at	21:43
fungi	ianw: in that case i guess we can just make sure to reboot ze04 when we're restarting the rest of the executors	21:44
corvus	clarkb: replied	21:44
fungi	oh, and i did try restarting afsd on the rax-ord mirror, but it got stuck stopping	21:44
fungi	and was unkillable	21:44
clarkb	corvus: rgr +2	21:45
ianw	fungi: so i'm just getting reset, not sure if you saw the scrollback about the utimensat() calls and openafs not updating the ns for files and "-t"	21:45
corvus	#status log re-keyed gearman tls certs (they expired)	21:45
openstackstatus	corvus: finished logging	21:45
ianw	fungi: yeah, i've never had any luck with anything but rebooting	21:45
fungi	ianw: yep, i followed that. mismatch in timestamp expectations between openafs and rsync sounds plausible. did you try dropping -t?	21:46
clarkb	ianw: I tried to understand what that meant for us, do we update our rsync flags?	21:46
ianw	fungi: i plan to do a manual run under strace without the "-t" to rsync and see what happens	21:46
corvus	mordred, fungi: i will force-merge now	21:46
fungi	ianw: cool, i notice it's the cronjob has been running all day	21:46
ianw	the fedora cron job should be commented out ... i hope at least	21:47
fungi	ianw: also should we go ahead and turn mirror-update.openstack.org back on too before the reprepro mirrors fall too far behind?	21:47
openstackgerrit	Merged opendev/system-config master: Stop using backend hostname in zuul testinfra tests https://review.opendev.org/733409	21:47
openstackgerrit	Merged opendev/system-config master: Fake zuul_connections for gate https://review.opendev.org/730929	21:47
ianw	fungi: oh yeah, i think so, i saw your initial work to migrate that which is great too	21:47
fungi	it's way incomplete, i need to find time to make progress on it	21:48
openstackgerrit	Merged opendev/system-config master: Add Zookeeper TLS support https://review.opendev.org/720302	21:48
mordred	corvus: woot!	21:48
fungi	ianw: oh, and somebody said a new centos 8.x release dropped today, so... probably a lot of rsync going on for that too	21:48
clarkb	fungi: ya 8.2 (was me	21:48
fungi	thanks clarkb! today has been a complete blur	21:48
fungi	i may declare wednesday as sunday and make myself scarce ;)	21:49
mordred	fungi: wednesday fednesday right?	21:49
* fungi boots mirror-update.ostack.o back up		21:49
fungi	mordred: something like that, yep	21:49
corvus	mordred, fungi: looks like there's a deploy backlog; i'm going to afk for 30m	21:50
fungi	corvus: cool, i'll be around for a while still when you get back	21:51
mordred	corvus: kk	21:53
openstackgerrit	Jeremy Stanley proposed opendev/system-config master: Cleanup from ZK TLS transition https://review.opendev.org/735740	21:55
*** sgw has quit IRC		21:59
*** sgw has joined #opendev		22:01
fungi	just the firewall rules for now, but we can dogpile other cleanup into that if anyone knows of more	22:02
fungi	we still need to update the ports in the nodepool confs in project-config, right?	22:05
fungi	or is there a separate change already up to do that@	22:05
fungi	?	22:05
clarkb	fungi: I think that is in the change that merged, it does a load file and then edit then write back out again	22:05
fungi	#status log started mirror-update01.openstack.org	22:05
openstackstatus	fungi: finished logging	22:05
fungi	clarkb: oh! so directly modifies the configs at runtime, okay	22:06
clarkb	fungi: yes I think so	22:06
fungi	clarkb: i'm not finding anywhere in 720302 which modifies the nodepool configs, i may be overlooking something	22:08
clarkb	fungi: https://review.opendev.org/#/c/720302/17/playbooks/roles/nodepool-base/tasks/main.yaml line 69	22:09
fungi	oh, in 0playbooks/roles/nodepool-base/tasks/main.yaml there's a task to "Overwrite zookeeper-servers" and another from a previous change to "Write nodepool config"	22:10
clarkb	and https://review.opendev.org/#/c/720302/17/playbooks/roles/nodepool-base/library/make_nodepool_zk_hosts.py	22:10
fungi	so i guess we don't directly write out the nodepool configs from the project-config repo	22:10
mnaser	hi all. Could we add me (or tc members) to https://review.opendev.org/#/admin/groups/441,members to help with the retirement of tricircle ?	22:10
mordred	mnaser: done	22:11
mnaser	mordred: thanks!	22:12
fungi	clarkb: aha, i see we've actually had that implemented since april via https://review.opendev.org/720709	22:12
fungi	i wonder if we should either update the configs in project-config for the new ports to avoid confusion, or better yet remove the zookeeper servers from it entirely and substitute a comment saying that we inject them with ansible now	22:13
fungi	having zk connection details in those configs when we're not actually relying on them is just begging someone to make updates in the wrong place down the road	22:14
clarkb	fungi: ya that seems reasonable. Or maybe go back to consuming it from project-config once the dust settles on this	22:14
fungi	clarkb: well, the change went in originally to support production-like test environments for our integration testing jobs	22:15
fungi	so i expect we'd want to keep the capability	22:15
clarkb	fungi: with /etc/hosts being written now we may be able to do that without editing the configs?	22:15
clarkb	though we probably only do a single zk server I guess	22:16
clarkb	(rather than 3)	22:16
mnaser	mordred: i'm sorry, do you mind adding me to https://review.opendev.org/#/admin/groups/1706,members too?	22:25
mnaser	seems like client core != project core	22:25
mnaser	(or any other infra admin around)	22:26
ianw	mnaser: done	22:26
ianw	fungi: ok, i've re-run manually our fedora rsync without the "-t"	22:26
mnaser	thank you ianw	22:27
ianw	it's got all the lstats, but none of the utimensat() calls	22:28
fungi	ianw: that's promising... any noticeable difference in rsync runtime (like is it significantly slower without -t?)	22:29
ianw	2020-06-15 22:22:06 \| Running rsync for releases/31..	22:31
ianw	2020-06-15 22:22:30 \| Running rsync for updates/30...	22:31
mnaser	last request in helping retire, trivial change: https://review.opendev.org/#/c/728902/4	22:31
ianw	like 24 seconds	22:31
ianw	2020-06-08 06:43:34 \| Running rsync for updates/30...	22:32
clarkb	any concern landing mnaser's chnage ^? specifically the zuul config update at https://review.opendev.org/#/c/728902/4/zuul/main.yaml while we fix zuul things?	22:32
corvus	back	22:32
ianw	2020-06-08 06:44:11 \| Running rsync for updates/31...	22:33
mnaser	yeah we may hold off on that then, it's not _that_ urgent but worth deferring for later if things are goin on	22:33
ianw	fungi: so yeah, in the noise	22:34
fungi	ianw: in that case, sounds like we should just drop -t from our rsync calls	22:35
*** ysandeep\|away is now known as ysandeep		22:43
corvus	mordred, fungi: it might be worth a bit of analysis to find out why https://review.opendev.org/733673 is running all the jobs	22:49
corvus	it's been running for 1.5 hours, and is maybe 1/3 through the list	22:49
corvus	then there are 4 changes after it, then finally the 3 changes we need for zuul :/	22:50
mordred	corvus: because it touches inventory	22:51
corvus	mordred: do we need to adjust that matcher after the recent reorg?	22:51
corvus	or is that intentional?	22:51
mordred	corvus: yes - but I think we still have work to do there	22:51
corvus	because any job can reference the inventory hostvars of any group....	22:52
mordred	yeah. I think we're defaulting to safe currently	22:52
corvus	if we're adding inventory/ to everything because of that ^ then i think maybe we'd be better off just running one job that does everything, because everything is going to touch inventory	22:52
mordred	corvus: I actually thought we had some smaller matchers already	22:53
corvus	but i think we can maybe narrow that down	22:53
corvus	like, service-zuul should be able to say "inventory/zuul" + "inventory/zookeeper" or whatever	22:53
mordred	yes - that	22:53
ianw	fungi: i think dropping -t means that it doesn't detect non-size changing updates?	22:54
mordred	corvus: yeah - I think we have better matchers on the CI jobs ... but haven't done the same for the prod jobs	22:54
mordred	corvus: s/I think//	22:55
mordred	corvus: it's totally all inventory/ in the prod jobs	22:55
corvus	mordred: ok, well the good news is that none of the other changes before the zuul changes touch inventory (though one will probably run puppet-else); our final change does touch inventory though	22:55
mordred	corvus: so - I think we go through and match up the file matchers we use for CI jobs with the prod versions	22:55
corvus	mordred: ++	22:55
corvus	this is just a swag, we might be looking at another 5 hours before the zuul change is deployed	22:56
corvus	i will not be in a position to help then	22:56
corvus	mordred: perhaps we should disable something and come back tomorrow?	22:56
mordred	corvus: well ... we could touch disable-ansible - that'll block all future runs	22:57
mordred	so as soon as the current job is done there will be no more ansible running	22:57
mordred	and we could just do a git piull and then run the relevant playbooks	22:57
corvus	and then rely on -hourly to catch up whatever else was in the queue?	22:58
mordred	yeah	22:58
corvus	sounds like a plan	22:58
mordred	the currently queued jobs will time out after an hour iirc	22:58
mordred	well - each one will block for an hour	22:58
mordred	o - but then we'll restart zuul - so they will go away	22:58
clarkb	you're going to restart zuul	22:58
mordred	yeah	22:58
clarkb	ya	22:58
mordred	so yeah - I think that'll totally work	22:59
mordred	want me to run disable-ansible now?	22:59
corvus	where's our docs for disabling ansible?	22:59
corvus	https://docs.openstack.org/infra/system-config/sysadmin.html#disable-enable-puppet	22:59
corvus	that's all i've found :/	22:59
mordred	clarkb just updated them	22:59
mordred	but I thin kthe patch is one of teh ones landing	22:59
clarkb	they are in bridge's page	23:00
mordred	corvus: https://review.opendev.org/#/c/735246/	23:00
clarkb	I added them to our sysadmins page as that is where we have preexisting stuff	23:00
mordred	yeah - so you would find them there eventually	23:00
corvus	where's bridge's page?	23:00
mordred	corvus: oh - you're looking at openstack docs	23:01
mordred	corvus: https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#disable-enable-ansible	23:01
corvus	we moved that without a redirect or delete?	23:01
mordred	it certainly seems that way, yes. we should fix that	23:02
mordred	anyway - clarkb's change still hasn't published there - so the proper instructions are still missing	23:02
mordred	but they reduce to "run the disable-ansible script"	23:02
corvus	i wonder how we could add a redirect?	23:02
fungi	i think we can add one to the htaccess file in openstack-manuals	23:03
fungi	looking	23:03
corvus	#status log disabled ansible on bridge due to 5+ hour backlog with potentially breaking change at end	23:03
openstackstatus	corvus: finished logging	23:03
*** mlavalle has quit IRC		23:04
mordred	corvus: cool	23:04
fungi	corvus: looks like we did it for infra-manual thusly: https://opendev.org/openstack/openstack-manuals/src/branch/master/www/.htaccess#L263-L266	23:04
*** tosky has quit IRC		23:04
fungi	there's also a corresponding ci test for that redirect	23:04
fungi	i'll propose a similar one for system-config now	23:04
corvus	fungi: thanks!	23:04
mordred	fungi: ++	23:05
fungi	as soon as i finish cloning that massive repo	23:05
fungi	so... slow...	23:06
fungi	and cloned	23:13
fungi	wow	23:13
openstackgerrit	James E. Blair proposed opendev/system-config master: Make disable-ansible fancier https://review.opendev.org/735745	23:15
corvus	fungi, mordred: ^ that's the result of a mental simulation i just performed about possible outcomes from leaving DISABLE-ANSIBLE in place overnight.	23:15
mordred	corvus: yes.	23:19
fungi	23:27 <openstackgerrit> Jeremy Stanley proposed openstack/openstack-manuals master: Redirect infra/system-config to docs.opendev.org https://review.opendev.org/735747	23:28
fungi	there was some sitemap cleanup to do at the same time	23:28
corvus	fungi: thanks!	23:35
openstackgerrit	Merged opendev/system-config master: Be explicit about using python3 in docker images https://review.opendev.org/734647	23:37
*** DSpider has quit IRC		23:38
clarkb	are we restarting services ir leaving ansible disabled then picking it up tomorrow?	23:52
openstackgerrit	Merged opendev/system-config master: Make disable-ansible fancier https://review.opendev.org/735745	23:54

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!