Wednesday, 2022-12-14

clarkbinfra-root ok draft is largely completed at https://etherpad.opendev.org/p/V-YLkq0iEJyhBi4hHsFU for an OpenDev update if you have a moment to read it over00:01
clarkbwant to make sure I didn't forget anything important etc00:01
clarkboh ze01 restarted earlier than I thought. I was looking at the wrong job. It restarted about 40 minutes ago00:06
corvuseach of the executors is only running a few builds right now...00:13
corvuswhat do folks think about me running a manual pause command on a handful of them to speed up the rolling restart?00:13
clarkbcorvus: give me a sec to double check that won't interfere with the playbook but that sounds great to me00:13
corvusi think when the playbook runs it, it should be a noop, but yeah, that would be the main concern i think :)00:13
ianwlgtm, the message00:14
corvusclarkb: message lgtm too00:15
clarkbcorvus: reading zuul-executor's graceful tasks I think we do the right thing00:15
clarkbcorvus: we already handle the case where the container has exited previously. So the only other concern would be if pausing or gracefulling an already paused/graceful executor is a problem and I don't think it is00:15
corvusoh i was thinking of just running zuul-executor pause00:16
clarkbthank you for looking at that message, I'm going to send it out shortly00:16
clarkbcorvus: oh ya pause won't exit like graceful will. But I think both are fine actually00:16
corvusthen letting the playbook run graceful (which will do another pause first, which will noop, then basically immediately exit since no jobs run)00:16
clarkbso ya I think that is fine00:16
corvusokay.  i agree you're probably right about graceful being safe to run too.  somehow i feel like i want to do pause instead, just to try to keep it a light touch.00:17
clarkbwfm00:17
corvushow many should i do?  we have the periodic jobs coming up soon.. so maybe 6?00:21
clarkbya half is probably a good count00:21
corvusok... actually, i'll do 5 so that's a total of 6 that are paused00:22
clarkbcorvus: the swift arm job that is queued on a node is something I keep meaning to look at too. If you are worried that will impact the restart we can dequeue it manually00:22
clarkbbut I think it won't cause a problem because it hasn't gotten far enough to need an executor yet00:23
clarkb(it is still waiting on a node request)00:23
corvusclarkb: yeah shouldn't affect the restart00:23
corvusokay ze02 is paused by the playbook, and ze03-ze07 were manually paused by me00:23
corvusbasically, every executor is either running 4 or 5 jobs right now00:24
Clark[m]corvus looked like pausing worked. It is onto 08 now01:46
opendevreviewYoshi Kadokawa proposed openstack/project-config master: Add Cinder Huawei charm  https://review.opendev.org/c/openstack/project-config/+/86758903:10
opendevreviewIan Wienand proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets  https://review.opendev.org/c/openstack/diskimage-builder/+/86757903:17
opendevreviewIan Wienand proposed openstack/diskimage-builder master: tox jobs: pin to correct nodesets  https://review.opendev.org/c/openstack/diskimage-builder/+/86757904:18
*** yadnesh|away is now known as yadnesh04:27
*** Tengu_ is now known as Tengu07:01
*** ysandeep is now known as ysandeep|lunch08:30
*** jpena|off is now known as jpena08:38
*** ysandeep|lunch is now known as ysandeep10:05
obondarevHi folks, it seems my organisation public IP address was banned by https://review.opendev.org/ - can someone please help to remove the ban?10:31
*** yadnesh is now known as yadnesh|afk11:04
*** dviroel|out is now known as dviroel|rover11:12
*** rlandy|out is now known as rlandy11:12
*** yadnesh|afk is now known as yadnesh11:42
*** ysandeep is now known as ysandeep|brb11:51
*** dasm|off is now known as dasm12:18
opendevreviewMariusz Karpiarz proposed openstack/project-config master: Add the "api-ref-jobs" template to CloudKitty  https://review.opendev.org/c/openstack/project-config/+/86765112:31
*** ysandeep|brb is now known as ysandeep12:31
fungiobondarev: we generally only block ip addresses if they're repeatedly trying to open connections and failing to authenticate, so whatever's at that ip address seems to have broken authentication configured. have you corrected it?12:37
fungior at least turned off whatever is repeatedly trying to connect from that address?12:38
obondarev@fungi hmm, that's an uplink IP address used by many employees12:39
obondarevNAT address I mean12:40
obondarevthat is 176.74.218.10612:42
fungiobondarev: you'll need to check your router's sessions to see what's failing to connect. we believe it to probably be https://wiki.openstack.org/wiki/ThirdPartySystems/Seagate_CI12:47
fungiif you can confirm you've turned that off we'll allow connections from the ip address again12:48
fungiianw attempted to contact the person listed as responsible for that system to notify them of the problem12:49
obondarevfungi: ok, I'll check with IT guys and get back here, thank you very much!12:49
fungiobondarev: you're welcome12:49
obondarevfungi: and where is the person responsible for that system listed?12:52
fungiobondarev: there's an e-mail address on the "Contact Information" row of the table on that https://wiki.openstack.org/wiki/ThirdPartySystems/Seagate_CI page13:28
obondarevah, right, sorry13:29
fungiobondarev: oh, though that was for a different ip address. it's possible ianw blocked more than one address that day, let me double-check13:30
obondarevfungi: yeah, that tristero.net should not be related to Mirantis13:31
fungiobondarev: yep, my mistake. ianw blocked three addresses for ssh key negotiation failures, that one we couldn't find any contact info for other than general eudc.cloud administrator addresses13:34
fungithat address was the source of hundreds of "Unable to negotiate key" errors in gerrit's error log13:35
obondarevfungi: can I provide a contact email for that one so it could be unblocked?13:35
fungii can remove the block temporarily, but please try to figure out what is failing to authenticate13:35
obondarevfungi: cool, thanks!13:35
fungiobondarev: i've deleted that rule, please try connecting again13:39
obondarevfungi: great, thanks, I'll talk to IT 13:40
fungiyou're welcome13:45
*** ysandeep is now known as ysandeep|dinner14:30
*** dviroel|rover is now known as dviroel|rover|lunch15:52
clarkbThe Zuul restart appears to have completed successfully. We are running nodepool and zuul on python3.11 now16:07
*** dviroel|rover|lunch is now known as dviroel|rover16:38
opendevreviewJeremy Stanley proposed openstack/project-config master: Add the "api-ref-jobs" template to CloudKitty  https://review.opendev.org/c/openstack/project-config/+/86765116:41
*** marios is now known as marios|out16:47
corvusclarkb: the executors are split, and they do have a behavior difference: https://review.opendev.org/86490317:20
corvusi think that behavior difference is unlikely to be hit by opendev in production, but it's something to be aware of17:20
clarkbya we do run a cleanup but it is small enough I'm not too worried17:21
corvusif we see any problems related to that, we'll just want to check the executor version first17:22
clarkbI guess if you wanted to have a clearer signal on whether or not that is working as intended we could restart the executors on the older version. But ya I'm not worried about opendev itself17:23
clarkbthe restart in a couple of days should resynchronize things17:23
corvusyeah, i think it's good enough for now as long as we know the caveat17:23
*** yadnesh is now known as yadnesh|away17:27
*** jpena is now known as jpena|off17:35
fungiclarkb: checking the held mm3 node i used for our final import testing, i see the same thing there as in production so i don't think i accidentally changed anything by looking17:54
fungihyperkitty properly filters the mailing list names but postorius doesn't17:54
fungiand both the admin interface and django webui are consistent with what we see in production (two mail hosts but sharing a common lists.opendev.org mail host)17:55
fungiand hyperkitty is showing the lists.opendev.org name in the corner of the lists.zuul-ci.org site17:55
fungiso i guess the next thing to do is try adding/associating separate mailman web hosts (what django calls "sites") with the existing mail hosts17:56
fungiwe seem to have duplicate copies of web-settings.py in docker/mailman/web/mailman-web and playbooks/roles/mailman3/files18:02
fungiboth set SITE_ID = 1 which i guess is a default the new list domains all inherit?18:02
Clark[m]fungi: the file in playbooks overrides a number of settings18:12
Clark[m]They should not be identical18:13
fungidiff says they're the same18:13
*** ysandeep|dinner is now known as ysandeep18:13
funginevermind!18:13
fungii was idiotically running `git diff foo bar` rather than `diff foo bar`18:14
*** ysandeep is now known as ysandeep|out18:14
fungiyes they're definitely different18:14
Clark[m]The one in the docker file is kept in sync with upstream to avoid forking too hard. Then we overlay some edits via our deployment 18:14
Clark[m]The one actually used is the one in playbooks18:14
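A minimal sketch of the content comparison fungi meant to run with plain `diff`, written in Python with the standard library only. The helper name `settings_diff` and the temporary file contents are hypothetical; the real files are the two copies of web-settings.py mentioned above. The empty output from `git diff foo bar` was presumably because git treated both arguments as paths to limit its usual working-tree comparison to, rather than as two files to compare against each other.

```python
import difflib
import tempfile
from pathlib import Path


def settings_diff(path_a, path_b):
    """Return unified-diff lines between two files; an empty list means identical content."""
    a = Path(path_a).read_text().splitlines(keepends=True)
    b = Path(path_b).read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(a, b, fromfile=str(path_a), tofile=str(path_b)))


# Demo with throwaway files standing in for the two copies (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    upstream = Path(tmp) / "upstream-web-settings.py"
    deployed = Path(tmp) / "deployed-web-settings.py"
    upstream.write_text("SITE_ID = 1\nDEBUG = False\n")
    deployed.write_text("SITE_ID = 1\nDEBUG = False\nEXAMPLE_OVERRIDE = True\n")
    demo_diff = settings_diff(upstream, deployed)
    for line in demo_diff:
        print(line, end="")
```

Any non-empty result means the deployed copy carries overrides that the upstream-synced copy does not.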
fungiso anyway, i was able to get help output from the manage.py script in the container but it isn't immediately obvious that it supports adding sites/web hosts18:15
fungithere is a django_extensions set_default_site subcommand but that's the only obvious match18:16
fungiahh, maybe i want postorius's mmclient subcommand for manage.py18:19
fungiwhoa... apparently SITE_ID = 0 is magic sauce: https://docs.mailman3.org/en/latest/faq.html#the-domain-name-displayed-in-hyperkitty-shows-example-com-or-something-else18:27
fungi"setting SITE_ID = 0 in Django’s settings will cause HyperKitty to display the DISPLAY NAME for the domain whose DOMAIN NAME matches the accessing domain. However, do not set SITE_ID = 0 in a new installation without any existing Sites as this will cause an issue in applying migrations. Only set SITE_ID = 0 after there are domains defined in the Django admin Sites view."18:27
fungiguess i'll give that a try on the held node and see what happens18:28
Clark[m]Interesting 18:34
fungidoesn't seem to solve it as far as i can tell18:41
clarkbfungi: is there a way to add a second site via the django admin webpage? I wonder if that is what we should try next?18:43
clarkbor maybe as admin we can change the web_host value?18:44
clarkbactually I think we should look at that first because iirc we found something saying that web_host needs to be unique for the vhosting and currently they are not?18:44
clarkbI think that was in the email that corvus found18:44
fungiyes, the problem is that the "web host" seems to be synonymous with the django "site" and currently we only have one (lists.opendev.org)18:47
clarkbI see18:47
clarkbso maybe SITE_ID=0 needs to be done in conjunction with multiple sites?18:47
fungithe sites are numbered, and the SITE_ID in settings.py seems to refer to the django site (therefore web host)18:47
fungiand yeah, just setting SITE_ID=0 and restarting hasn't solved the display problem, but i'm trying adding sites and associating mail hosts (mailman 3 "mail domains" in the django admin ui) with them18:49
clarkbI wonder if we can just insert a record into the db18:49
clarkbhttps://docs.djangoproject.com/en/4.1/ref/contrib/sites/ seems to be the underlying implementation bits?18:49
fungioh, my bad. i didn't actually change it18:49
fungii edited the file on disk with the container down, not merely stopped18:50
clarkboh ya if you down it then when you up it you get it back in a clean state18:50
fungione better. i was editing the copy of that file inside the container file tree, but we bindmount our own into the running container instead18:55
fungiso i was making all of this far more complicated than it needed to be18:55
fungiso good news and bad news...19:02
fungithe good news is that if i create a lists.zuul-ci.org site and associate the lists.zuul-ci.org mail host with it, then with SITE_ID=0 in settings.py it seems to show the correct domain name on the hyperkitty pages now19:04
fungithe bad news is that it seems to break the ability to access the interface by any domains not configured for the site but which may resolve to it (for example "localhost" via my ssh socket)19:04
clarkbI don't think you need to access it as localhost19:05
clarkbyou just need to originate from localhost but can hit the external interface19:05
clarkbdoes it filter the lists properly too?19:05
fungiwell, i do at least need to override my dns resolution to go to the forwarded port on my local machine19:06
fungifor accessing the django admin interface19:06
fungijust referring to 127.0.0.1 or localhost won't set the right host header in my browser request19:06
clarkbya would need to use SOCKS or similar for what I describe I guess19:06
fungipoint is, with SITE_ID=0 any access to the mailman web content except by a domain name it knows about will now show an error page19:08
fungibut that's probably acceptable19:08
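A toy model of the behavior described above, not the actual Django or HyperKitty code: with a fixed non-zero SITE_ID every request maps to that one site, while with SITE_ID = 0 the site is chosen by the request's Host header and unknown hosts produce an error. The function `resolve_site`, the `UnknownHost` exception, and the site table are all hypothetical names for illustration.

```python
class UnknownHost(Exception):
    """Raised when no configured site matches the requested host (the 'error page' case)."""


def resolve_site(site_id, sites, host_header):
    """Pick the site domain for a request.

    sites maps a numeric site id to its domain name.
    A non-zero site_id pins every request to that site;
    site_id == 0 selects the site whose domain equals the Host header.
    """
    if site_id != 0:
        return sites[site_id]
    for domain in sites.values():
        if domain == host_header:
            return domain
    raise UnknownHost(host_header)


# Assumed state of the held node after the extra site was created.
sites = {1: "lists.opendev.org", 2: "lists.zuul-ci.org"}

print(resolve_site(1, sites, "lists.zuul-ci.org"))  # pinned to site 1 regardless of host
print(resolve_site(0, sites, "lists.zuul-ci.org"))  # resolved from the Host header
```

Under this model a request for "localhost" with SITE_ID = 0 raises `UnknownHost`, which lines up with needing to override DNS or use a SOCKS proxy to reach the admin interface.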
clarkbwell it did that before too (there is a domains list with a filter but localhost was explicitly added)19:09
clarkbThe change here really only applies to localhost access I think. And ya that seems fine. We can either drop the admin local requirement or document using socks or something19:09
clarkbfungi: does it filter the lists properly too? And what is the process of creating a site like?19:10
fungiand no, this doesn't solve the list filtering in postorius, just the site name displayed by hyperkitty19:10
clarkb(I'm wondering if we can automate this somehow especially since we can't flip the 1 to a 0 until after we deploy...)19:10
clarkbfungi: I guess you have to associate individual lists with the site too?19:11
clarkbvia the web_host setting however that is expressed19:11
fungithe workaround hinted at in that one ml discussion for the "not till after deploy" problem seems to be to create a dummy site you're not going to use19:11
clarkbfungi: we might be able to have the initial pass create everything then update the file and restart containers and not change the content of that file once there is a site present19:12
clarkbit's probably doable, just really clunky19:12
clarkbprobably want to look at the fixes in aggregate though once we've sorted out all the steps we want to take19:15
clarkbfungi: now that you created the site is there a way to associate it as the web host to the individual mailing lists? I'm just wondering if that is what we need to filter them on the pages19:27
funginot that i'm able to find so far19:30
fungithey're filtered on the hyperkitty pages just not the postorius pages19:30
clarkbhttps://gitlab.com/mailman/hyperkitty/-/blob/master/hyperkitty/views/index.py#L60-79 is how hyperkitty does it19:31
clarkbfungi: note in prod it isn't filtered on either of them19:31
clarkbhttps://lists.opendev.org/archives/ shows both sets of lists19:31
clarkbis it possible that our archives/ vs hyperkitty/ url is to blame somehow? I swear that this was working in prod though19:32
clarkbfungi: ok lists.zuul-ci.org is filtering with hyperkitty in prod but not lists.opendev.org19:32
clarkband then postorius for both doesn't filter19:33
fungioh, that may be due to the sites on prod19:33
fungifor the hyperkitty not filtering on the opendev site but doing it on the zuul site19:33
clarkbah because we've only got the one19:34
fungiyeah, if i look on the held node, the lists.opendev.org/archives page now lists all the imported mailing lists except the zuul ones19:34
fungipresumably because i associated them with the lists.zuul-ci.org site19:34
clarkbya I'm looking at the hyperkitty filter and I think that might be right19:35
fungior, rather, because i associated the lists.zuul-ci.org mail host with the lists.zuul-ci.org web host (django site) i added19:35
fungiso all the lists which used that mail host are now being associated with the newly added site19:35
fungirather than the default19:35
clarkbhrm though actually I think the filtering should work as is19:36
clarkbbecause it is looking at the requests value and the mail domains19:36
clarkbnothing seems to look at the django content19:36
clarkboh maybe it is this line https://gitlab.com/mailman/hyperkitty/-/blob/master/hyperkitty/views/index.py#L6919:36
fungithose values are kept in a django registry of some sort, the django admin page simply gives a way to edit them in one place i think?19:37
clarkbthat looks up the maildomain then checks site.domain so maybe that converts it to the django site19:37
fungithey may also be sharing that data through the db, i'm not sure19:37
clarkbI think it is all db19:37
clarkbthe django admin stuff makes use of what django knows about its db models that sites use to let you introspect the db19:38
clarkbhttps://gitlab.com/mailman/postorius/-/blob/master/src/postorius/views/list.py#L1038-1063 for postorius I think. Though I'm having a hard time parsing that19:40
clarkb"if there is only one mail_host for this web_host" but we have many mail_hosts19:40
clarkbI suspect https://gitlab.com/mailman/postorius/-/blob/master/src/postorius/views/list.py#L1046 is iterating over all the domains that django knows about?19:43
clarkbbut if that were the case your addition of the site to match the mail_host for zuul-ci.org would've corrected this for postorius I think.19:44
fungiclarkb: oh, actually yes it does look like it did fix filtering for postorius19:45
fungi104.130.140.226 is the held server if you want to point your /etc/hosts there19:45
clarkbfungi: yup trying now19:46
fungihttps://lists.zuul-ci.org/mailman3/lists/ shows just the zuul lists19:46
*** rlandy is now known as rlandy|dr_appt19:46
fungiif you try to look at lists.opendev.org there you'll see all the imported lists for the other domains that aren't the zuul one19:46
fungiso this does seem to cover all the bases, question is how best to automate19:47
clarkbya so this isn't quite working19:47
fungioh?19:47
clarkbwell no we don't want all the lists shown at lists.opendev.org19:47
clarkbwe should only see opendev lists19:47
fungiright, we would if all the other lists were associated with other domains than the lists.opendev.org one19:48
clarkbfungi: well the zuul lists still show up there too19:48
clarkband those do have a separate domain19:48
clarkbdo we need to add those sites on the test node to see if they are all properly associated if that fixes it?19:49
fungichecking again. i have to keep flipping back and forth between the public address and my ssh tunnel19:49
clarkbI agree lists.zuul-ci.org looks correct now19:49
fungiyeah, okay so maybe that's due to it being treated as the "default site"19:51
fungiwonder if we need a separate default site which isn't any of the actual sites19:51
clarkbI'm trying to reconcile that with what I see in the code and not seeing it yet19:51
clarkbit seems to only take the first entry if the list length is 119:52
clarkbhttps://gitlab.com/mailman/postorius/-/blob/master/src/postorius/views/list.py#L1049-1050 the filtration there would have to return the lists.zuul-ci.org lists when web_host is lists.opendev.org which makes no sense to me19:54
clarkbthe mail_host for lists.zuul-ci.org lists is lists.zuul-ci.org not lists.opendev.org so how do we get that back?19:54
fungiwhat do you mean get it back?19:56
clarkbMailDomain.objects.get() is doing a data lookup filtering by domain mail_host values. Then if the site.domain for the results matches web_host which is the url we hit we add the mail_host to the list of mail hosts19:58
clarkbwhen I hit lists.opendev.org web_host should be lists.opendev.org so how does the == .site.domain comparison match lists.zuul-ci.org?19:58
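A rough sketch of the filtering logic as clarkb reads it here, using plain Python data instead of the real Django models. It is not the postorius or hyperkitty source; the `MailDomain` dataclass, both helper functions, and the list names are hypothetical stand-ins. Each mail domain is tied to a site, and a mail host is kept only when its site's domain equals the web host being requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailDomain:
    mail_domain: str  # mail host the lists live on
    site_domain: str  # domain of the Django site (web host) it is associated with


def mail_hosts_for(web_host, mail_domains):
    """Keep only the mail hosts whose associated site domain equals the requested web host."""
    return [d.mail_domain for d in mail_domains if d.site_domain == web_host]


def visible_lists(web_host, mail_domains, lists):
    """lists maps a list address to its mail host; return the addresses shown for web_host."""
    hosts = set(mail_hosts_for(web_host, mail_domains))
    return sorted(name for name, mail_host in lists.items() if mail_host in hosts)


# Assumed held-node state after the lists.zuul-ci.org site was added and associated.
domains = [
    MailDomain("lists.opendev.org", "lists.opendev.org"),
    MailDomain("lists.zuul-ci.org", "lists.zuul-ci.org"),
]
# Made-up list addresses for illustration.
lists = {
    "example-a@lists.opendev.org": "lists.opendev.org",
    "example-z@lists.zuul-ci.org": "lists.zuul-ci.org",
}

print(visible_lists("lists.opendev.org", domains, lists))
print(visible_lists("lists.zuul-ci.org", domains, lists))
```

If the real code behaved exactly like this model, the zuul lists would not appear under lists.opendev.org, which is why the observed postorius output is surprising.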
fungii associated the lists.zuul-ci.org mail host with the lists.zuul-ci.org site (web host) that i created19:59
clarkbright and that makes lists.zuul-ci.org return the reduced list. But how does that allow lists.opendev.org to continue to return lists.zuul-ci.org lists?19:59
clarkbI suppose it could be cached?19:59
fungioh, possibly. i can try restarting the containers20:00
clarkbit looks like hyperkitty is filtering the lists.zuul-ci.org results out of the index it produces20:00
fungidoing that now, though it takes a few minutes to start back up so we should be mindful of that when doing it in production20:00
clarkbpostorius isn't so ya maybe it is just caching20:00
clarkbthe filtering code between the two codebases is very similar20:00
clarkbI guess we can look for differences between them if the behavior continues to differ20:01
clarkbbut ya I think this fixes like 95% of the problem :)20:01
fungithe containers are restarted but the sites will return error pages for a few minutes until everything's running20:01
clarkbI think it does a lot of migration checks on startup20:01
fungiyeah, and the log complains about some not yet applied too, we might want to check production for the same when we restart it20:02
clarkbthey still show up in postorius after the restart20:02
fungi"Your models in app(s): 'django_mailman3', 'hyperkitty', 'postorius' have changes that are not yet reflected in a migration, and so won't be applied. Run 'manage.py makemigrations' to make new migrations, and then re-run 'manage.py migrate' to apply them."20:03
clarkbthat's weird, we've only ever run the one version?20:03
fungithe log also says this: "Setting lists.opendev.org as the default domain ..."20:03
fungii wonder if being "default" conveys extra behaviors in postorius?20:03
clarkbmaybe?20:03
clarkbthis might be worth an email to the mm3 list?20:04
clarkbwe can describe what we've done how it has fixed hyperkitty but not postorius and see if they come back and say "its the default domain" or something20:04
fungithough truth be told, i don't mind lists.opendev.org showing a cross-domain view of all the mailing lists we're hosting20:04
clarkbya I'm not sure it is the end of the world. But I think it would be nice to understand at least20:05
clarkbreally my main concern is that the behavior seemed to change randomly20:05
fungiand also, if it is due to the default domain and we deem the behavior untidy, we could add another domain to serve solely as the "default" for the full list of lists20:06
clarkbhttp://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_089/866632/1/gate/system-config-run-lists3/089ad87/bridge99.opendev.org/screenshots/ those are the screenshots for the site-owner change and they all show filtering20:07
clarkbwithout setting site id to 0 or adding content to django20:07
clarkbso something is changing in the running state of the system that trips this behavior. Understanding how/why is probably the most important thing if we decide to keep the lists.opendev.org listing as a global overview20:07
clarkbLunch now. Back in a bit20:08
Clark[m]Maybe we hold a new node to cross compare?20:12
Clark[m]And see if doing the admin thing trips it?20:12
fungiprobably a good idea. the only thing i can think of is that it seems like there were no mail hosts defined on the server until i logged in as admin and went to the domain management page. like it decided that was the time to populate them all20:13
fungispecifically, visiting https://lists.opendev.org/mailman3/domains/20:14
fungito test that theory, i deleted all 7 mail domains and the zuul site entry in the django admin interface, downed the containers, set SITE_ID back to 1 in settings.py and upped the containers again. seems to be back to the original behavior from before20:19
funginow i've visited https://lists.opendev.org/mailman3/domains/ as admin and then checked the django webui and all 7 mail domains have reappeared20:21
fungiso i think that's the trigger20:21
Clark[m]Ok cool so we understand the trigger now we can apply the workaround. Probably still worth an email to their list to ask why postorius differs in behavior?20:23
Clark[m]I half wonder if this is a bug too since logging in as admin shouldn't change behavior 20:23
Clark[m]I mean that's a bit crazy to me :)20:23
fungiyes, that's why i didn't want to believe what i thought i had witnessed originally20:24
fungilike, there's no way that just logging in and looking at the domains list writes new configuration to the database, right?20:24
clarkbfungi: the only other thought I've got before sending upstream messages is maybe we add sites for all of the domains in case the issue is having lists without domains that match sites21:02
clarkbbut I don't expect that to help after looking at the code21:02
fungiyeah, i've been trying to draft a post at the bottom of https://etherpad.opendev.org/p/mm3migration but am struggling with how to word it due to the overloaded and mismatched terminology between postorius and django21:06
clarkbI'm not sure I'm much help after taking a look. I think you grok the differences there better than any of us right now :)21:10
*** tosky_ is now known as tosky21:12
fungiokay, i feel like i've got it "summarized" but it still seems like information overload21:18
clarkbfungi: one last note and ya I think this is a bit convoluted but I suspect the mailman devs will be able to parse through it better than we can21:20
fungiokay, i think i've addressed your last suggestion21:25
*** dasm is now known as dasm|off21:26
clarkbya that looks good21:27
clarkbI think worst case they ask us for more clarification21:27
*** dviroel|rover is now known as dviroel|out21:31
clarkbholiday party planning details sent out21:31
fungiseems my first post to mailman-users is moderated. i guess merely subscribing isn't sufficient21:42
clarkbya that happened to me too, they get through it quickly iirc21:42
*** rlandy|dr_appt is now known as rlandy21:51
*** rlandy is now known as rlandy|bbl22:29
clarkbIn the gerrit mailing list I've found evidence that custom submit requirements do end up in the change list summary view23:45
clarkbI'm pretty confident our supposed workaround will work23:45
clarkbThe 3.6 to 3.7 upgrade is not easily downgradeable because they convert label copyValue settings to copyConditions23:50
clarkbcopyCondition appears to be supported by 3.6. I suspect that the very first thing we should be doing is converting to copyCondition on all our labels?23:51
clarkbAnd maybe at the same time sort out what the submit requirements changes are that zuul quickstart ran into and do those conversions too23:52
clarkbprobably the first step is building 3.7 images and updating the upgrade job. But we shouldn't do that until we're confident we won't revert to 3.5. At this point this seems unlikely; I'll poke at that tomorrow maybe23:54
clarkbBut then ya I suspect we need to do widespread updates to labels/submit-requirements to accommodate 3.7 early. I worry that the required offline reindexing and schema updates might be very slow...23:55
clarkbtesting that without a copy of the installations is hard, but we can probably benchmark it with artificial data and see if we think it represents an issue that deserves further testing23:56
