Tuesday, 2021-08-17

fungioh, good point, the packaging doesn't know to remove the per-site initscripts00:08
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back  https://review.opendev.org/c/openstack/diskimage-builder/+/80400005:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec  https://review.opendev.org/c/openstack/diskimage-builder/+/80400205:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules  https://review.opendev.org/c/openstack/diskimage-builder/+/80481605:36
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9  https://review.opendev.org/c/openstack/diskimage-builder/+/80481705:37
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader  https://review.opendev.org/c/openstack/diskimage-builder/+/80481805:37
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF  https://review.opendev.org/c/openstack/diskimage-builder/+/80481905:37
*** ysandeep|away is now known as ysandeep06:32
*** jpena|off is now known as jpena07:32
*** rpittau|afk is now known as rpittau07:54
*** ykarel is now known as ykarel|lunch09:27
*** diablo_rojo is now known as Guest460210:34
*** ykarel|lunch is now known as ykarel10:38
*** dviroel|out is now known as dviroel|ruck11:26
*** jpena is now known as jpena|lunch11:39
*** jpena|lunch is now known as jpena12:42
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389712:46
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch  https://review.opendev.org/c/opendev/elastic-recheck/+/80389712:53
*** ykarel is now known as ykarel|away14:41
*** jpena is now known as jpena|off15:26
*** jpena|off is now known as jpena15:27
*** jpena is now known as jpena|off15:48
*** ysandeep is now known as ysandeep|dinner16:02
clarkbfungi: can you think of other testing we might want to do on that server? should we manually run newlist maybe and see if your mail server receives it? (I expect mine wont)16:05
clarkbs/that server/the test lists.kc.io server/16:05
fungiwe'll need to turn mailman and exim on, but sure can give that a shot. should be safe16:05
fungii'm heating up some lunch right now but can test stuff shortly16:06
clarkbno rush, just trying to think of additional checks we can safely do16:06
fungistill need to dig up logs for the suspected zuul configuration cache bug reported in #openstack-infra today as well16:06
clarkbfungi: did you also check with /etc/hosts override that the archives are visible?16:07
fungiyes, seemed fine16:07
clarkboh yes I meant to read up on that more closely now that meetings are done16:07
*** marios is now known as marios|out16:08
*** ysandeep|dinner is now known as ysandeep|out16:21
*** rpittau is now known as rpittau|afk16:31
clarkbcorvus: fungi: I think there may be another issue with running parallel CD jobs on bridge and that is the system-config repo start?16:48
clarkbs/start/state/ every job is going to try and update the repo16:48
clarkbIn theory they will all be trying to update to the same thing but I could see git lock failures?16:49
fungimaybe we could do a lbyl on it?16:49
fungithough i suppose there's still an initial race16:49
clarkblbyl?16:50
fungilook before you leap16:50
fungiif two builds start at the same time and both see system-config is behind and want to update it16:50
clarkbah yup.16:50
fungithen they would still potentially collide16:50
clarkbI'm not sure what a good approach would be yet. Was just thinking through other points of conflict and realized this is likely one16:50
fungithe lbyl idea was check whether system-config is up to date, and then only try to update it if not16:50
fungiwe could also add a wait lock around updates to that checkout, to deal with the race16:51
clarkbanother approach would be to have the locking job at the start of the buildset update git and then pause16:51
clarkbthen the subsequent jobs do not touch git16:51
fungiquickly getting to be complicated coordination though16:51
fungiyeah, maybe if all the other builds need to depend on the base playbook anyway, then we just update there and not any other time. though it makes testing those playbooks outside the deploy scope more complex since we always need a prerequisite system-config update before running16:52
clarkbfungi: we already have it split out of the service playbooks16:53
clarkbbut it runs in every job currently16:53
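For context, the guard fungi and clarkb are discussing could look roughly like the sketch below: a look-before-you-leap check of the checkout, falling back to a blocking flock so concurrent deploy jobs serialize the update instead of colliding on git's own lock files. The repository path, lock file location, and remote ref here are illustrative assumptions, not what the bridge deployment actually uses.

```python
import fcntl
import subprocess

# Illustrative paths; the real bridge checkout and lock location may differ.
REPO = "/opt/system-config"                      # assumed checkout path
LOCKFILE = "/var/run/system-config.update.lock"  # assumed lock file
REMOTE_REF = "origin/master"


def head(repo, ref="HEAD"):
    """Return the commit sha the given ref points at."""
    return subprocess.check_output(
        ["git", "-C", repo, "rev-parse", ref], text=True).strip()


def update_checkout():
    subprocess.check_call(["git", "-C", REPO, "fetch", "origin"])
    # LBYL: skip the update entirely if we are already at the remote tip.
    if head(REPO) == head(REPO, REMOTE_REF):
        return
    # Otherwise take a blocking lock so parallel jobs queue up rather than
    # racing each other on the working tree.
    with open(LOCKFILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        # Re-check under the lock; another job may have updated it already.
        if head(REPO) != head(REPO, REMOTE_REF):
            subprocess.check_call(
                ["git", "-C", REPO, "reset", "--hard", REMOTE_REF])
```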
clarkbin good news we did cut out almost 20 minutes per hourly deploy by moving the cloud launcher16:54
clarkbare we actively using zuul preview? I'm wondering if that is another potential optimisation point16:55
fungiwhat's the context? i mean, zuul-preview is used by jobs17:03
clarkbfungi: we run its playbook hourly along with nodepool, zuul, and the docker registry jobs17:04
clarkbbut I thought we had to turn it off17:04
clarkbI'm wondering if there is any value to running the job currently as a result17:04
fungithe server is running and responding17:07
fungihttp://zuul-preview.opendev.org/17:07
clarkbyes I think the server is up but we stopped returning it in zuul artifacts17:07
fungioh, got it17:07
clarkband I thought the apache might have been stopped?17:07
funginah, apache is running and listening on 80/443 (though we only allow 80 through iptables)17:08
clarkbI don't recall details but I want to say it shouldn't be17:08
fungibut yeah, codesearch says this is the only master branch use of it currently: https://opendev.org/inaugust/inaugust.com/src/branch/master/.zuul.yaml#L1917:10
mordredyeah - I use it :) 17:24
mordredoh - but even that is wrong17:25
mordredbecause success-url is not the right way to do that anymore17:25
clarkbyes17:27
fungii found where it was originally disabled17:29
fungihttps://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2019-08-08.log.html#t2019-08-08T15:02:2817:29
fungithe suggestion was there were signs of use as an open proxy in its logs17:29
clarkbI think faeda1ab850f11da0ed7df4fac985ff9e96454b3 may have fixed that in zuul-preview17:31
clarkbnow the question goes back to "do we need hourly deploys of this service if nothing is (correctly) using it right now?"17:32
fungilooks like https://review.opendev.org/717870 was the expected solution17:32
fungiconfiguration for it doesn't need to update that frequently, i expect17:32
fungimaybe the hourly was to catch new zuul-preview container images more quickly, but that also seems like daily would be plenty17:33
clarkbfungi: yup hourly jobs in most of these cases are to update container images in a reasonable amount of time17:33
clarkbthe cloud launcher and the puppet jobs are the exceptions to that rule.17:33
fungialso, as some reassurance the open proxy situation was resolved, i've analyzed the past month worth of logs and don't see any successful requests through it17:34
fungi(also proof that it's not really being used at all though)17:34
clarkbfungi: checking on prod lists.kc.io ua status still shows it is enrolled17:44
clarkbI suspect they may check things like IP addrs and uuids and such17:44
clarkbwhich is good because it means things are happy on our end :)17:44
clarkbLooking at remote-puppet-else more closely I think we can probably improve the files list on that one a bit and run it in deploy and daily only17:52
clarkbWe're really not expecting puppet side effects unless we update the puppet directly. The reason for this is puppet things largely only deal with packages and not containers17:52
clarkbThen we should be able to trim the hourly stuff to be nodepool and zuul which are things that tend to be under active development and have external updates we don't have good triggers for17:53
fungialso the puppeted things are, on the whole, not getting a lot of updates these days17:54
clarkbya I think the reason it is in there is we may update the various puppet modules then want those updates to hit production17:56
clarkbbut reality is we don't make a lot of updates to the external puppet modules anymore17:56
clarkband if we had a one off we can always wait for the daily or manually run the playbook17:56
clarkbthe thing that sets zuul apart is it might get many updates every day all week17:56
clarkbPure brainstorm: we could also drop the hourly pipeline entirely and if we want things quicker than daily either land a change (could be a noop change) to force jobs to run or manually run playbooks17:59
fungicorvus: clarkb: so on the stale project pipeline config for the openstack/glance_store stable/victoria branch, i can't find evidence that zuul saw the change-merged event for that (i see plenty of other change-merged events logged, but none for 804606,2 when it merged). i'm going to check the gerrit log for any errors around that time18:27
clarkbfungi: you might also look for ssh connection errors in the zuul scheduler log since those events are ingested via the ssh event stream18:28
fungithanks, good idea18:29
fungicorvus: clarkb: so given the timing, i think this exception could be related to the missed change-merged event, but there's not enough context in the log to be certain (it occurs roughly 19 seconds after the submit action at 17:08:26,569)18:47
fungihttp://paste.openstack.org/show/808156/18:47
clarkbinteresting. maybe we failed to process the merge event18:49
clarkband that is the exception related to that?18:50
clarkbwe only get the event over the ssh event stream, then everything else is http18:50
fungii can't find any smoking gun in the gerrit error log either18:51
fungianyway, i'm inclined to say this doesn't point to a problem with config caching, but seems to be related to a lost stream event18:51
fungiworth noting, there are 17 "Exception moving Gerrit event" lines in yesterday's scheduler log18:52
clarkbthough the behavior indicates the cache is affected18:52
clarkbbecause pushing a new change without the history of the old change didn't run the job that was removed18:53
fungiand 16 so far in today's log18:53
clarkbit does seem like the cache should try and be resilient to these issues if possible?18:53
fungiturns out it wasn't "pushing a new change without the history of the old change" so much as "pushing a change which edits the zuul config"18:54
clarkbah18:54
fungiwhich makes sense, in that case zuul creates a speculative layout based on the change18:54
clarkbbecause that forces a cache refresh18:54
clarkbyup18:54
fungibut it doesn't actually refresh the cache, seems like, because changes which don't alter the zuul config remain brokenly running the removed job18:55
clarkbya it will only apply to the change's speculative state until it merges18:55
fungiso seems more like the cache is avoided when a speculative layout is created18:55
clarkbI think it still caches it18:55
clarkbfor the next time that change runs jobs18:55
fungier, or it's caching separately right18:55
fungibut not updating the cache of the branch state config18:56
corvusi'm not really here yet -- but if we missed an event, it's not a cache issue -- that's why i suggested that's the place to look18:56
corvusit's just that zuul is running with an outdated config; this would have happened in zuul v3 too18:57
clarkbcorvus: I don't know that we missed the event as much as threw an exception processing the event based on what fungi pasted18:57
clarkbfunctionally it is like missing an event18:57
corvusokay, i'll dig in more later18:57
corvuswell, okay, i seem to be toggling back and forth on understanding what you're saying...18:58
fungias i said, i don't think we have sufficient context in the log to know that the exception there was related to the lost change-merged event, it's 19 seconds after the submit, which makes me suspect it might not be18:58
corvusquestion #1 for debugging this is: "did the zuul scheduler's main loop process a change-merged event and therefore correctly determine that it should update its config?"18:59
fungibut that in conjunction with me not finding any log of seeing the change-merged event for it makes it enough of a possibility i'm not ruling it out18:59
clarkb2021-08-16 17:08:45,780 ERROR zuul.GerritEventConnector: Exception moving Gerrit event: <- that is what happened in fungi's paste18:59
corvusif the answer to that is "yes" then we look at the new zk config cache stuff.  if the answer is "no" (for whatever reason -- TCP error, exception moving event, etc) then we don't look at the config cache, we look at the event processing.19:00
corvusanyway, sorry, i haven't caught up yet, and i'm still working on ansible stuff... i was just trying to help avoid dead ends.  like i said, i can pitch in after lunch.19:00
fungiyes, i think it's event processing. the config cache states are simply explaining the behavior we saw, i don't see any reason to suspect they're part of the problem19:01
corvusmerging another config change to that file or performing a full reconfig should fix the immediate issue19:02
corvus(in case it's becoming an operational issue)19:02
fungiit's not, i don't think, but yes i already assumed those were the workarounds. thanks for confirming!19:03
yoctozeptofungi: as a side note to matrix discussion: I will be advocating for more flexibility - otherwise we will have a hard time migrating ever ;/19:19
fungiyes, i think it's good to acknowledge that people will communicate however is convenient for them to do so, and it's better that we not pretend they're active and available somewhere they aren't19:20
yoctozepto++19:20
yoctozeptothough is mainland China happy with matrix? it might be seen as overly encrypted, no?19:21
fungielement/matrix.org homeservers are apparently not generally accessible from behind the great firewall, no19:22
yoctozeptomeh19:22
fungihowever there was mention of the existence of a matrix wechat bridge, someone still needs to look into viability19:22
fungii think that wouldn't work for getting to the zuul matrix channel from wechat, but might make some wechat channels available via matrix19:23
clarkbalso the beijing linux user group has matrix instructions19:24
clarkbwe might be able to reach out to groups like that to get an informed opinion on usability19:24
yoctozepto++ on that19:24
yoctozeptowould help openstack tc as well19:24
yoctozeptoaaand is the zuul matrix-native channel on matrix.org or elsewhere?19:25
clarkbit's on a small opendev homeserver that we don't intend on using for accounts, just rooms19:25
yoctozeptoack19:26
opendevreviewIan Wienand proposed opendev/system-config master: borg-backup: randomise time on a per-server basis  https://review.opendev.org/c/opendev/system-config/+/80491619:26
fungiyoctozepto: also we're not running the homeserver ourselves, the foundation is paying element to host one for us19:26
yoctozeptoso, there could be a chance that this one can be reached via the great firewall19:26
fungibut we could host it on our own servers later if we decide that's warranted19:26
yoctozeptoack19:26
yoctozeptothanks for the insights19:26
fungii think with element hosting the homeserver for us, it's unlikely to work directly from mainland china19:27
yoctozeptoduh19:27
ianwsorry, i got distracted on the UA redirect20:01
fungiianw: oh, don't apologize, i was meaning to work on it too. i've commented on the change with the ip address of the held server, feel free to fiddle with the vhost config on it if you like20:02
corvusfungi, clarkb: looking at the traceback, it does seem likely that the event was not processed due to the connection error.  i think a retry loop in queryChange should help (and would apply equally to http and ssh query methods).20:04
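The retry loop corvus suggests would roughly take the shape below. This is a generic sketch rather than Zuul's actual queryChange code; the wrapper name, attempt count, and delay are illustrative assumptions, and `query_fn` stands in for whichever ssh or http query method is in use.

```python
import logging
import time

log = logging.getLogger("example.gerrit")


def query_with_retries(query_fn, change, attempts=3, delay=5):
    """Call a Gerrit query function, retrying on transient failures.

    Retries the same number of times regardless of whether the underlying
    transport is ssh or http, then re-raises if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return query_fn(change)
        except Exception:
            log.exception("Gerrit query failed (attempt %d/%d)",
                          attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)
```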
opendevreviewIan Wienand proposed opendev/system-config master: [wip] redirect pastebinit  https://review.opendev.org/c/opendev/system-config/+/80491820:23
marc-vorwerk /msg NickServ REGISTER <password> <e-mail>20:38
*** dviroel|ruck is now known as dviroel|ruck|out21:15
corvusinfra-root: i'd like to restart zuul now to pick up some bugfixes21:56
fungicorvus: fine by me, status looks calm and i don't see any openstack release changes in-flight21:58
fungii'll give #openstack-release a heads up21:58
corvusre-enqueing now22:13
fungithanks!22:13
corvuscomplete22:20
corvus#status log restarted all of zuul on commit 6eb84eb4bd475e09498f1a32a49e92b81421894222:20
opendevstatuscorvus: finished logging22:20
opendevreviewPaul Belanger proposed opendev/system-config master: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492222:40
pabelangero/22:46
pabelangerI pushed up a change to add mitogen support to ansible deploys22:47
pabelangershould make things a little faster22:47
pabelangerwill check back soon to debug the change failure if any22:47
fungithanks pabelanger!22:48
clarkbthanks. It will be interesting to see how that performs compared to the default22:51
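For reference, enabling mitogen for Ansible is typically done through its strategy plugin in ansible.cfg, along the lines of the snippet below; the plugin path is an assumption about where a mitogen checkout would live, and (as the discussion notes a bit later) mitogen only supports older Ansible releases.

```ini
[defaults]
# Path to a local mitogen checkout; adjust to wherever mitogen is installed.
strategy_plugins = /opt/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
```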
opendevreviewClark Boylan proposed opendev/system-config master: Run infra-prod-service-zuul-preview daily instaed of hourly  https://review.opendev.org/c/opendev/system-config/+/80492522:58
opendevreviewClark Boylan proposed opendev/system-config master: Run remote-puppet-else daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80492622:58
opendevreviewClark Boylan proposed opendev/system-config master: Stop requiring puppet things for afs, eavesdrop, and nodepool  https://review.opendev.org/c/opendev/system-config/+/80492722:58
clarkbinfra-root ^ more attempts at cleaning up the current deploy job situation22:59
pabelangerclarkb: ah, you are on ansible-core 2.12. Sadly, mitogen doesn't support it yet23:10
pabelangerI think they only support 2.1023:10
pabelangerI'm still on 2.923:10
clarkbpabelanger: ok I was worried about that. You can set the ansible back to 2.9 or 2.10 in the change too just for comparison23:10
clarkbI don't know that we will downgrade in production but having the data would still be useful I think23:10
fungiis that actually what we've got in production?23:11
clarkbfungi: ya we upgraded a couple months ago iirc23:11
pabelangerk, let me do it for the test job23:11
fungiindeed, --version says ansible [core 2.11.1]23:12
fungiso >2.10 anyway23:12
opendevreviewPaul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492223:13
pabelangerhow come https://zuul.opendev.org/t/openstack/status/change/804922,2 is in the openstack tenant and not opendev tenant?23:14
fungilong story23:15
pabelangerguessing legacy reasons?23:15
fungisome23:15
fungialso our main zuul config is in openstack/project-config still23:16
corvusShort version: No one has moved everything that needs to be moved23:17
clarkbit is getting easier and easier to move as we remove less dependencies on a bunch of repos23:19
clarkbthe ansible work is consolidating a lot of stuff into system-config which makes this simpler23:19
opendevreviewPaul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible  https://review.opendev.org/c/opendev/system-config/+/80492223:30
clarkbianw: left a note on https://review.opendev.org/c/opendev/system-config/+/804918 if that didn't need a new patchset after testing I'd probably leave it as is but since a new patchset is necessary anyway...23:59

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!