Tuesday, 2021-08-17

fungioh, good point, the packaging doesn't know to remove the per-site initscripts00:08
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back  https://review.opendev.org/c/openstack/diskimage-builder/+/80400005:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec  https://review.opendev.org/c/openstack/diskimage-builder/+/80400205:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules  https://review.opendev.org/c/openstack/diskimage-builder/+/80481605:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9  https://review.opendev.org/c/openstack/diskimage-builder/+/80481705:37
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader  https://review.opendev.org/c/openstack/diskimage-builder/+/80481805:37
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF  https://review.opendev.org/c/openstack/diskimage-builder/+/80481905:37
*** ysandeep|away is now known as ysandeep06:32
*** jpena|off is now known as jpena07:32
*** rpittau|afk is now known as rpittau07:54
*** ykarel is now known as ykarel|lunch09:27
*** diablo_rojo is now known as Guest460210:34
*** ykarel|lunch is now known as ykarel10:38
*** dviroel|out is now known as dviroel|ruck11:26
*** jpena is now known as jpena|lunch11:39
*** jpena|lunch is now known as jpena12:42
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389712:46
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389712:53
*** ykarel is now known as ykarel|away14:41
*** jpena is now known as jpena|off15:26
*** jpena|off is now known as jpena15:27
*** jpena is now known as jpena|off15:48
*** ysandeep is now known as ysandeep|dinner16:02
clarkbfungi: can you think of other testing we might want to do on that server? should we manually run newlist maybe and see if your mail server receives it? (I expect mine wont)16:05
clarkbs/that server/the test lists.kc.io server/16:05
fungiwe'll need to turn mailman and exim on, but sure can give that a shot. should be safe16:05
fungii'm heating up some lunch right now but can test stuff shortly16:06
clarkbno rush, just trying to think of additional checks we can safely do16:06
fungistill need to dig up logs for the suspected zuul configuration cache bug reported in #openstack-infra today as well16:06
clarkbfungi: did you also check with /etc/hosts override that the archives are visible?16:07
fungiyes, seemed fine16:07
clarkboh yes I meant to read up on that more closely now that meetings are done16:07
*** marios is now known as marios|out16:08
*** ysandeep|dinner is now known as ysandeep|out16:21
*** rpittau is now known as rpittau|afk16:31
clarkbcorvus: fungi: I think there may be another issue with running parallel CD jobs on bridge and that is the system-config repo start?16:48
clarkbs/start/state/ every job is going to try and update the repo16:48
clarkbIn theory they will all be trying to update to the same thing but I could see git lock failures?16:49
fungimaybe we could do a lbyl on it?16:49
fungithough i suppose there's still an initial race16:49
clarkblbyl?16:50
fungilook before you leap16:50
fungiif two builds start at the same time and both see system-config is behind and want to update it16:50
clarkbah yup.16:50
fungithen they would still potentially collide16:50
clarkbI'm not sure what a good approach would be yet. Was just thinking through other points of conflict and realized this is likely one16:50
fungithe lbyl idea was check whether system-config is up to date, and then only try to update it if not16:50
fungiwe could also add a wait lock around updates to that checkout, to deal with the race16:51
clarkbanother approach would be to have the locking job at the start of the buildset update git and then pause16:51
clarkbthen the subsequent jobs do not touch git16:51
fungiquickly getting to be complicated coordination though16:51
fungiyeah, maybe if all the other builds need to depend on the base playbook anyway, then we just update there and not any other time. though it makes testing those playbooks outside the deploy scope more complex since we always need a prerequisite system-config update before running16:52
clarkbfungi: we already have it split out of the service playbooks16:53
clarkbbut it runs in every job currently16:53
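For context, the guard fungi and clarkb are discussing could look roughly like the sketch below: a look-before-you-leap check of the checkout, falling back to a blocking flock so concurrent deploy jobs serialize the update instead of colliding on git's own lock files. The repository path, lock file location, and remote ref here are illustrative assumptions, not what the bridge deployment actually uses.

```python
import fcntl
import subprocess

# Illustrative paths; the real bridge checkout and lock location may differ.
REPO = "/opt/system-config"                      # assumed checkout path
LOCKFILE = "/var/run/system-config.update.lock"  # assumed lock file
REMOTE_REF = "origin/master"


def head(repo, ref="HEAD"):
    """Return the commit sha the given ref points at."""
    return subprocess.check_output(
        ["git", "-C", repo, "rev-parse", ref], text=True).strip()


def update_checkout():
    subprocess.check_call(["git", "-C", REPO, "fetch", "origin"])
    # LBYL: skip the update entirely if we are already at the remote tip.
    if head(REPO) == head(REPO, REMOTE_REF):
        return
    # Otherwise take a blocking lock so parallel jobs queue up rather than
    # racing each other on the working tree.
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        # Re-check under the lock; another job may have updated it already.
        if head(REPO) != head(REPO, REMOTE_REF):
            subprocess.check_call(
                ["git", "-C", REPO, "reset", "--hard", REMOTE_REF])
```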
clarkbin good news we did cut out almost 20 minutes per hourly deploy by moving the cloud launcher16:54
clarkbare we actively using zuul preview? I'm wondering if that is another potential optimisation point16:55
fungiwhat's the context? i mean, zuul-preview is used by jobs17:03
clarkbfungi: we run its playbook hourly along with nodepool, zuul, and the docker registry jobs17:04
clarkbbut I thought we had to turn it off17:04
clarkbI'm wondering if there is any value to running the job currently as a result17:04
fungithe server is running and responding17:07
fungihttp://zuul-preview.opendev.org/17:07
clarkbyes I think the server is up but we stopped returning it in zuul artifacts17:07
fungioh, got it17:07
clarkband I thought the apache might have been stopped?17:07
funginah, apache is running and listening on 80/443 (though we only allow 80 through iptables)17:08
clarkbI don't recall details but I want to say it shouldn't be17:08
fungibut yeah, codesearch says this is the only master branch use of it currently: https://opendev.org/inaugust/inaugust.com/src/branch/master/.zuul.yaml#L1917:10
mordredyeah - I use it :) 17:24
mordredoh - but even that is wrong17:25
mordredbecause success-url is not the right way to do that anymore17:25
clarkbyes17:27
fungii found where it was originally disabled17:29
fungihttps://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2019-08-08.log.html#t2019-08-08T15:02:2817:29
fungithe suggestion was there were signs of use as an open proxy in its logs17:29
clarkbI think faeda1ab850f11da0ed7df4fac985ff9e96454b3 may have fixed that in zuul-preview17:31
clarkbnow the question goes back to "do we need hourly deploys of this service if nothing is (correctly) using it right now?"17:32
fungilooks like https://review.opendev.org/717870 was the expected solution17:32
fungiconfiguration for it doesn't need to update that frequently, i expect17:32
fungimaybe the hourly was to catch new zuul-preview container images more quickly, but that also seems like daily would be plenty17:33
clarkbfungi: yup hourly jobs in most of these cases are to update container images in a reasonable amount of time17:33
clarkbthe cloud launcher and the puppet jobs are the exceptions to that rule.17:33
fungialso, as some reassurance the open proxy situation was resolved, i've analyzed the past month worth of logs and don't see any successful requests through it17:34
fungi(also proof that it's not really being used at all though)17:34
clarkbfungi: checking on prod lists.kc.io ua status still shows it is enrolled17:44
clarkbI suspect they may check things like IP addrs and uuids and such17:44
clarkbwhich is good because it means things are happy on our end :)17:44
clarkbLooking at remote-puppet-else more closely I think we can probably improve the files list on that one a bit and run it in deploy and daily only17:52
clarkbWe're really not expecting puppet side effects unless we update the puppet directly. The reason for this is puppet things largely only deal with packages and not containers17:52
clarkbThen we should be able to trim the hourly stuff to be nodepool and zuul which are things that tend to be under active development and have external updates we don't have good triggers for17:53
fungialso the puppeted things are, on the whole, not getting a lot of updates these days17:54
clarkbya I think the reason it is in there is we may update the various puppet modules then want those updates to hit production17:56
clarkbbut reality is we don't make a lot of updates to the external puppet modules anymore17:56
clarkband if we had a one off we can always wait for the daily or manually run the playbook17:56
clarkbthe thing that sets zuul apart is it might get many updates every day all week17:56
clarkbPure brainstorm: we could also drop the hourly pipeline entirely and if we want things quicker than daily either land a change (could be a noop change) to force jobs to run or manually run playbooks17:59
fungicorvus: clarkb: so on the stale project pipeline config for the openstack/glance_store stable/victoria branch, i can't find evidence that zuul saw the change-merged event for that (i see plenty of other change-merged events logged, but none for 804606,2 when it merged). i'm going to check the gerrit log for any errors around that time18:27
clarkbfungi: you might also look for ssh connection errors in the zuul scheduler log since those events are ingested via the ssh event stream18:28
fungithanks, good idea18:29
fungicorvus: clarkb: so given the timing, i think this exception could be related to the missed change-merged event, but there's not enough context in the log to be certain (it occurs roughly 19 seconds after the submit action at 17:08:26,569)18:47
fungihttp://paste.openstack.org/show/808156/18:47
clarkbinteresting. maybe we failed to process the merge event18:49
clarkband that is the exception related to that?18:50
clarkbwe only get the event over the ssh event stream, then everything else is http18:50
fungii can't find any smoking gun in the gerrit error log either18:51
fungianyway, i'm inclined to say this doesn't point to a problem with config caching, but seems to be related to a lost stream event18:51
fungiworth noting, there are 17 "Exception moving Gerrit event" lines in yesterday's scheduler log18:52
clarkbthough the behavior indicates the cache is affected18:52
clarkbbecause pushing a new change without the history of the old change didn't run the job that was removed18:53
fungiand 16 so far in today's log18:53
clarkbit does seem like the cache should try and be resilient to these issues if possible?18:53
fungiturns out it wasn't "pushing a new change without the history of the old change" so much as "pushing a change which edits the zuul config"18:54
clarkbah18:54
fungiwhich makes sense, in that case zuul creates a speculative layout based on the change18:54
clarkbbecause that forces a cache refresh18:54
clarkbyup18:54
fungibut it doesn't actually refresh the cache, seems like, because changes which don't alter the zuul config remain brokenly running the removed job18:55
clarkbya it will only apply to the change's speculative state until it merges18:55
fungiso seems more like the cache is avoided when a speculative layout is created18:55
clarkbI think it still caches it18:55
clarkbfor the next time that change runs jobs18:55
fungier, or it's caching separately right18:55
fungibut not updating the cache of the branch state config18:56
corvusi'm not really here yet -- but if we missed an event, it's not a cache issue -- that's why i suggested that's the place to look18:56
corvusit's just that zuul is running with an outdated config; this would have happened in zuul v3 too18:57
clarkbcorvus: I don't know that we missed the event as much as threw an exception processing the event based on what fungi pasted18:57
clarkbfunctionally it is like missing an event18:57
corvusokay, i'll dig in more later18:57
corvuswell, okay, i seem to be toggling back and forth on understanding what you're saying...18:58
fungias i said, i don't think we have sufficient context in the log to know that the exception there was related to the lost change-merged event, it's 19 seconds after the submit, which makes me suspect it might not be18:58
corvusquestion #1 for debugging this is: "did the zuul scheduler's main loop process a change-merged event and therefore correctly determine that it should update its config?"18:59
fungibut that in conjunction with me not finding any log of seeing the change-merged event for it makes it enough of a possibility i'm not ruling it out18:59
clarkb2021-08-16 17:08:45,780 ERROR zuul.GerritEventConnector: Exception moving Gerrit event: <- that is what happened in fungi's paste18:59
corvusif the answer to that is "yes" then we look at the new zk config cache stuff.  if the answer is "no" (for whatever reason -- TCP error, exception moving event, etc) then we don't look at the config cache, we look at the event processing.19:00
corvusanyway, sorry, i haven't caught up yet, and i'm still working on ansible stuff... i was just trying to help avoid dead ends.  like i said, i can pitch in after lunch.19:00
fungiyes, i think it's event processing. the config cache states are simply explaining the behavior we saw, i don't see any reason to suspect they're part of the problem19:01
corvusmerging another config change to that file or performing a full reconfig should fix the immediate issue19:02
corvus(in case it's becoming an operational issue)19:02
fungiit's not, i don't think, but yes i already assumed those were the workarounds. thanks for confirming!19:03
yoctozeptofungi: as a side note to matrix discussion: I will be advocating for more flexibility - otherwise we will have a hard time migrating ever ;/19:19
fungiyes, i think it's good to acknowledge that people will communicate however is convenient for them to do so, and it's better that we not pretend they're active and available somewhere they aren't19:20
yoctozepto++19:20
yoctozeptothough is mainland China happy with matrix? it might be seen as overly encrypted, no?19:21
fungielement/matrix.org homeservers are apparently not generally accessible from behind the great firewall, no19:22
yoctozeptomeh19:22
fungihowever there was mention of the existence of a matrix wechat bridge, someone still needs to look into viability19:22
fungii think that wouldn't work for getting to the zuul matrix channel from wechat, but might make some wechat channels available via matrix19:23
clarkbalso the beijing linux user group has matrix instructions19:24
clarkbwe might be able to reach out to groups like that to get an informed opinion on usability19:24
yoctozepto++ on that19:24
yoctozeptowould help openstack tc as well19:24
yoctozeptoaaand is the zuul matrix-native channel on matrix.org or elsewhere?19:25
clarkbit's on a small opendev homeserver that we don't intend on using for accounts, just rooms19:25
yoctozeptoack19:26
opendevreviewIan Wienand proposed opendev/system-config master: borg-backup: randomise time on a per-server basis  https://review.opendev.org/c/opendev/system-config/+/80491619:26
fungiyoctozepto: also we're not running the homeserver ourselves, the foundation is paying element to host one for us19:26
yoctozeptoso, there could be a chance that this one can be reached via the great firewall19:26
fungibut we could host it on our own servers later if we decide that's warranted19:26
yoctozeptoack19:26
yoctozeptothanks for the insights19:26
fungii think with element hosting the homeserver for us, it's unlikely to work directly from mainland china19:27
yoctozeptoduh19:27
ianwsorry, i got distracted on the UA redirect20:01
fungiianw: oh, don't apologize, i was meaning to work on it too. i've commented on the change with the ip address of the held server, feel free to fiddle with the vhost config on it if you like20:02
corvusfungi, clarkb: looking at the traceback, it does seem likely that the event was not processed due to the connection error.  i think a retry loop in queryChange should help (and would apply equally to http and ssh query methods).20:04
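The retry loop corvus suggests would roughly take the shape below. This is a generic sketch rather than Zuul's actual queryChange code; the wrapper name, attempt count, and delay are illustrative assumptions, and `query_fn` stands in for whichever ssh or http query method is in use.

```python
import logging
import time

log = logging.getLogger("example.gerrit")


def query_with_retries(query_fn, change, attempts=3, delay=5):
    """Call a Gerrit query function, retrying on transient failures.

    Retries the same number of times regardless of whether the underlying
    transport is ssh or http, then re-raises if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return query_fn(change)
        except Exception:
            log.exception("Gerrit query failed (attempt %d/%d)",
                          attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)
```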
opendevreviewIan Wienand proposed opendev/system-config master: [wip] redirect pastebinit  https://review.opendev.org/c/opendev/system-config/+/80491820:23
marc-vorwerk /msg NickServ REGISTER <password> <e-mail>20:38
*** dviroel|ruck is now known as dviroel|ruck|out21:15
corvusinfra-root: i'd like to restart zuul now to pick up some bugfixes21:56
fungicorvus: fine by me, status looks calm and i don't see any openstack release changes in-flight21:58
fungii'll give #openstack-release a heads up21:58
corvusre-enqueing now22:13
fungithanks!22:13
corvuscomplete22:20
corvus#status log restarted all of zuul on commit 6eb84eb4bd475e09498f1a32a49e92b81421894222:20
opendevstatuscorvus: finished logging22:20
opendevreviewPaul Belanger proposed opendev/system-config master: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492222:40
pabelangero/22:46
pabelangerI pushed up a change to add mitogen support to ansible deploys22:47
pabelangershould make things a little faster22:47
pabelangerwill check back soon to debug the change failure if any22:47
fungithanks pabelanger!22:48
clarkbthanks. It will be interesting to see how that performs compared to the default22:51
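For reference, enabling mitogen for Ansible is typically done through its strategy plugin in ansible.cfg, along the lines of the snippet below; the plugin path is an assumption about where a mitogen checkout would live, and (as the discussion notes a bit later) mitogen only supports older Ansible releases.

```ini
[defaults]
# Path to a local mitogen checkout; adjust to wherever mitogen is installed.
strategy_plugins = /opt/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
```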
opendevreviewClark Boylan proposed opendev/system-config master: Run infra-prod-service-zuul-preview daily instaed of hourly  https://review.opendev.org/c/opendev/system-config/+/80492522:58
opendevreviewClark Boylan proposed opendev/system-config master: Run remote-puppet-else daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80492622:58
opendevreviewClark Boylan proposed opendev/system-config master: Stop requiring puppet things for afs, eavesdrop, and nodepool  https://review.opendev.org/c/opendev/system-config/+/80492722:58
clarkbinfra-root ^ more attempts at cleaning up the current deploy job situation22:59
pabelangerclarkb: ah, you are on ansible-core 2.12. Sadly, mitogen doesn't support it yet23:10
pabelangerI think they only support 2.1023:10
pabelangerI'm still on 2.923:10
clarkbpabelanger: ok I was worried about that. You can set the ansible back to 2.9 or 2.10 in the change too just for comparison23:10
clarkbI don't know that we will downgrade in production but having the data would still be useful I think23:10
fungiis that actually what we've got in production?23:11
clarkbfungi: ya we upgraded a couple months ago iirc23:11
pabelangerk, let me do it for the test job23:11
fungiindeed, --version says ansible [core 2.11.1]23:12
fungiso >2.10 anyway23:12
opendevreviewPaul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492223:13
pabelangerhow come https://zuul.opendev.org/t/openstack/status/change/804922,2 is in the openstack tenant and not opendev tenant?23:14
fungilong story23:15
pabelangerguessing legacy reasons?23:15
fungisome23:15
fungialso our main zuul config is in openstack/project-config still23:16
corvusShort version: No one has moved everything that needs to be moved23:17
clarkbit is getting easier and easier to move as we remove less dependencies on a bunch of repos23:19
clarkbthe ansible work is consolidating a lot of stuff into system-config which makes this simpler23:19
opendevreviewPaul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492223:30
clarkbianw: left a note on https://review.opendev.org/c/opendev/system-config/+/804918 if that didn't need a new patchset after testing I'd probably leave it as is but since a new patchset is necessary anyway...23:59

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!