Monday, 2021-08-16

*** gibi_pto is now known as gibi06:08
*** iurygregory_ is now known as iurygregory06:31
*** jpena|off is now known as jpena07:42
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespace  https://review.opendev.org/c/openstack/project-config/+/80466909:39
opendevreviewchzhang8 proposed openstack/project-config master: bring tricircle under x namespace  https://review.opendev.org/c/openstack/project-config/+/80466910:01
*** sshnaidm|pto is now known as sshnaidm10:30
*** sshnaidm is now known as sshnaidm|pto10:31
*** jpena is now known as jpena|lunch11:16
*** dviroel|out is now known as dviroel|ruck11:26
*** diablo_rojo is now known as Guest449111:39
*** jpena|lunch is now known as jpena12:16
clarkbyoctozepto: fwiw I cannot reproduce the behavior clicking on the cherry picks link when logged in on firefox15:34
clarkbI wonder if it has to do with being the owner for the change15:34
*** ysandeep is now known as ysandeep|away15:40
*** diablo_rojo__ is now known as diablo_rojo15:43
yoctozeptoclarkb: ack, no problem; it is the first time I have hit such an issue15:48
*** jpena is now known as jpena|off15:58
*** marios is now known as marios|out16:01
clarkbour meeting agenda is surprisingly empty after taking a first pass at updating it this morning. I guess good news there is it means we've just put a bunch of work behind us :)16:28
clarkblet me know if I'm missing anything obvious that should be on there though.16:28
opendevreviewKendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot  https://review.opendev.org/c/opendev/system-config/+/80319016:49
clarkbfungi: re the lists.kc.io snapshot, I'll try to boot that after lunch since that seems to be the most likely scheduling for it. Then upgrade it all the way through to focal, taking notes?16:53
clarkbfungi: ^ are there any gotchas or things you think we should keep an eye out for on that?16:53
clarkbone thing is the esm registration on that snapshot I guess16:54
clarkbMaybe step zero is to disable that?16:54
clarkbthough we're under our quota for that so having a test server boot up with it isn't the end of the world I guess16:54
clarkbthough maybe it is safest to disable it to prevent the other server getting unregistered16:55
* clarkb is not really sure how that gets accounted16:55
fungiwill disabling it disable the production server too? i guess we can check16:57
fungiwondering if it loads a unique key onto the machine on registering16:57
clarkbya I have no idea17:00
clarkband ya I guess we can check it and resolve manually if necessary17:01
clarkbfungi: do you think it would be safer to leave it as is on the new instance or disable esm on the new instance?17:01
fungii would disable, and then re-register the production server if necessary17:02
clarkbok17:05
opendevreviewKendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/80479017:08
clarkbdiablo_rojo: ^ note on that one. I need to page more of the plan there back in to be sure of my comment on that but wanted to point out the issue either way17:11
opendevreviewKendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479117:16
*** timburke__ is now known as timburke17:16
opendevreviewKendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/80479017:18
opendevreviewKendall Nelson proposed opendev/zone-opendev.org master: Add CNAME for ptgbot.opendev.org  https://review.opendev.org/c/opendev/zone-opendev.org/+/80479017:19
diablo_rojoclarkb, makes sense. Hopefully I fixed it correctly. 17:22
diablo_rojoI also figure the letsencrypt cert had to be set up first? and that this should be dependent on that? 17:22
diablo_rojoBut I can remove that if it's wrong17:22
clarkbin testing we use the staging LE servers and I'm not sure whether they properly verify against DNS or not17:24
diablo_rojoOkay so your guess is only marginally better than mine lol. 17:27
clarkbdiablo_rojo: what I'm not sure about looking at these changes is where the apache config is. I think you may need a "run the ptg site" change somewhere?17:31
clarkbya I think https://review.opendev.org/c/opendev/system-config/+/780942 was that but then the puppet got removed.17:32
clarkbdiablo_rojo: that means your letsencrypt change likely needs to also configure the apache config as well17:32
opendevreviewKendall Nelson proposed opendev/system-config master: Setup Letsencrypt for ptgbot site  https://review.opendev.org/c/opendev/system-config/+/80479117:47
corvusfungi, clarkb: i think the issue with the semaphore is that we didn't choose one CD strategy, we chose two, and they are not working well together17:48
clarkbcorvus: I'm not sure I completely agree with that. Having periodic catch ups seems like a reasonable safety net even if you want to do direct deploys.17:49
*** dviroel|ruck is now known as dviroel|out17:49
clarkbYes their approaches are different, but I don't think that users should be forced into only one or the other17:49
corvusif we can really deploy things when changes merge, that should be the primary strategy, and the periodic should be a backup.  and it should run less often and be quicker so it doesn't interfere with the first.17:49
clarkbcorvus: fwiw I believe the reason we have an hourly deploy in addition to a daily is that some services like zuul and nodepool get image updates we want to apply more quickly than daily17:49
clarkbwe could address that by building our own zuul images when necessary similar to how we do for other services. But I think that the zuul produced images work well and redoing that effort seems wrong17:51
corvusclarkb: well, in practice, we always manually pull zuul images when restarting anyway because that can't be relied upon.17:51
corvusclarkb: i also think our semaphore is too coarse17:52
corvuswe should be able to run all those jobs at the same time17:52
clarkbThe other big thing in hourly is remote-puppet-else but I think we can configure that job to run whenever puppet related files change in system-config rather than blindly doing an hourly update17:52
clarkbIt would also make a huge difference if ansible addressed their performance regressions around forking tasks. It is unfortunately quite slow now :( but mordred says upstream isn't interested in reverting or changing that behavior (I can appreciate that it is likely complicated and changes there could produce worse unexpected side effects)17:53
corvusyep17:53
corvushere's my thinking: zuul should provide tools to help people CD, but our case is not a good one to model -- we have conflicting requirements that just plain cannot be satisfied.  we should resolve that before we try to ask for more complexity from zuul.17:55
clarkbI worry that opendev's situation is more common than that assertion expects though and we're likely to produce similar problems for other CD users17:56
corvusit's possible, but we know that our playbooks are not designed for this.  we haven't even finished implementing the system we originally designed.17:56
opendevreviewClark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80479517:57
clarkbI think ^ is an easy change we can make.17:57
clarkbIt won't solve everything but that job isn't fast and we run it hourly when we almost never actually need the updates encoded in it (and it runs in the deploy pipeline when we do need it)17:57
corvusthe fact that the periodic run takes >1hr is just not a good starting point -- honestly, if we're okay with that, we should drop deploy anyway and go back to "it will be deployed within 0-2 hours".17:59
corvusif we want to make immediate deployment primary, then we need to get the reconciliation path out of the way17:59
clarkbI don't think we're ok with it, but to fix it we either need to stop using ansible, run jobs in parallel, or run fewer jobs17:59
clarkbrun jobs in parallel was the original expectation iirc18:00
corvusyes, i agree.  i think that's the starting point though.18:00
clarkbyup definitely improving hourly throughput would make a huge difference18:00
clarkb804795 above should make a good starting dent in that18:00
corvusi'm not sure why all the other jobs can't run after base?  is it because we don't want them to run at the same time as any deployment pipeline job, and the semaphore doesn't let us express that?18:01
clarkbcorvus: I think there is some ordering implied as well. Like nodepool should update before zuul (and the registry as well?)18:02
corvusclarkb: maybe it's just a matter of making a new parent job for the periodic pipeline, have that one hold the semaphore, run base, then run everything else.  then also have the deployment pipeline jobs individually hold the semaphore so they exclude each other and the entire periodic pipeline (which is now faster)?18:02
clarkbEavesdrop should be able to run whenever I expect18:03
corvuswe should still be able to accomodate that18:03
clarkbbut ya I don't think it is as easy as letting everything run in parallel there is some implied ordering in services. Gitea before gerrit (for replication), nodepool before zuul for image changes, and so on18:04
clarkbI think puppet runs last because in the past we had a bunch of stuff doing puppet that wanted to be in the ordering. But now we may be able to run puppet whenever18:04
clarkbIf order doesn't matter for storyboard (I suspect it doesn't) then we can run pupept in any order based on my read of the site.pp18:05
corvusactually, the deploy pipeline should probably hold the lock for the entire buildset too18:05
corvusclarkb: basically like this: https://paste.opendev.org/show/808121/18:06
clarkband lock-holding-job is paused?18:06
corvusclarkb: yes18:06
corvuszero-node job18:06
clarkbya I think at the very least that will help us express the dependencies properly which will allow us to optimize further18:07
clarkbbasically that might not end up being faster but it will help us understand better to then make things faster18:07
corvusyeah, it should theoretically be faster  ;)18:07
corvuswe should be able to use the same job tree in periodic and deploy18:07
clarkbcorvus: when you say periodic do you mean daily or hourly or both?18:08
clarkb(we have two periodic pipelines currently and they have different jobs, see https://review.opendev.org/c/opendev/system-config/+/804795 for an example)18:08
corvushrm, i wasn't aware of the difference18:09
clarkbbasically hourly is there for things we want to update quickly because we may not have a good trigger for them18:10
corvusi wonder why eavesdrop is in there?18:10
clarkblike zuul and nodepool image updates18:10
clarkbcorvus: eavesdrop also consumes images from other repos (gerritbot for example)18:10
corvusclarkb: but it's pinned18:10
corvusit won't update without a corresponding system-config update18:11
clarkbcorvus: matrix gerritbot is but not irc gerritbot iirc18:11
corvuswhere does the gerritbot image come from?18:11
clarkbcorvus: from the gerritbot repo18:11
clarkbhttps://opendev.org/opendev/gerritbot/src/branch/master/Dockerfile18:12
corvuswould we be sad if it took 24h to update?18:12
corvusanyway, parallelism should help there18:12
clarkbIf we are trying to fix a bug we can always manually pull18:12
corvusi'm surprised the hourly takes >1 hour with those jobs18:13
clarkbcorvus: ~20 minutes is the cloud launcher which is why I have proposed moving it. But also ansible is really really slow :/18:13
clarkbA big part of the cloud launcher slowness is processing all of that native ansible task stuff to interact with the clouds18:13
clarkbit would probably take a minute or two if written as a python script18:14
corvusclarkb: looking at a recent buildset, if we parallelize that (after merging your cloud launcher move), we would have 4m for bridge + 8m for zuul (the longest playbook)18:14
corvusso we should be able to get a n hourly run down to 12m with this approach -- without doing any deeper optimization18:15
clarkbcorvus: but nodepool and zuul registry and zuul would need to run serially? I agree the puppet and the eavesdrop jobs can run in parallel18:15
corvusclarkb: i'm not sure they do?18:15
clarkbcorvus: it's possible they don't. I thought that order was intentional for the zuul and nodepool services though. To ensure that labels show up in the right order or similar18:16
clarkbbut I guess that is all happening in zookeeper now and can be lazy?18:16
corvusclarkb: (and incidentally, the cumulative runtime of all the current hourly jobs is 55m - so that assumption we've been working from is correct)18:16
corvusclarkb: if we're talking about adding a nodepool label, i don't think we expect them to be immediately available anyway (image build/upload time, etc)18:17
corvusclarkb: i think typically nodepool-provides-label and zuul-uses-label would be different changes anyway18:17
clarkbAnother tricky thing is that we use the same jobs in deploy and the hourly and periodic pipelines so we can't just convert the hourly pipeline without converting everything?18:17
clarkbthough maybe we can use a variant of the job to override semaphores in pipelines so we can do a bit at a time18:18
corvusclarkb: by convert do you mean change the semaphore usage?  i think the right thing to do is to apply this to all 3 pipelines.18:18
clarkbcorvus: yes. That is the right thing to do but I'm concerned that the scope of it is quite large and we can't minimize risk for out of order problems if we do all three at once18:19
corvusall 3 should have a lock-holding-job to make the semaphore apply to the whole buildset so interruptions don't happen.  then, if it's okay to parallelize in one, it should be okay to parallelize in all.18:19
clarkbcorvus: yes, except that periodic and deploy run many many more jobs than hourly. Which means we would have to sort out all of those ordering concerns at the same time (much more risk)18:19
clarkbmaking the hourly deploy parallel is much smaller in scope as far as determining what the order graph is18:20
clarkbBut maybe we start by trying to do all 3 together and if it gets unwieldy then we can attempt something smaller in scope18:21
fungithe eavesdrop deploy job was also handling meeting time changes at one point, right? but very recently that's been switched to use a more typical static site publication job?18:21
corvusclarkb: my understanding is that the original intent was that each of these jobs (aside from base and bridge) should be independent, so i hope that there aren't many instances of us assuming the opposite.  but you could stage this by using 2 semaphores.  one as the new lock-holding-job for the buildset -- all 3 pipelines need to use it.  then a second semaphore to make the jobs within a pipeline mutually exclusive.  that will keep things slow18:22
corvus(like today) until the second one is removed.18:22
clarkbfungi: yes, and I guess those meeting times were listed in a different repo too so would rely on hourly updates18:22
fungier, i guess it wasn't the eavesdrop deploy job before that, it was puppet18:22
clarkbcorvus: I'm pretty sure we still have ordering between the jobs. I don't know that we sufficiently untangled that yet18:23
clarkbthings like gitea-lb running before gitea18:24
clarkb(maybe that should be one job?)18:24
fungibut it would be good to at least identify and codify specific cases like that using dependencies18:24
fungior making it one job18:24
clarkbnameserver before letsencrypt (though we don't encode that order today)18:25
clarkbbecause we need the nameserver job to create the zones before letsencrypt attempts to add records to them18:25
fungiis there a specific letsencrypt job though?18:25
clarkbfungi: yes18:25
corvusclarkb: yeah, gitea sounds like maybe that should be one playbook18:25
fungiahh, okay, i'm likely thinking of the individual handlers in cert management of various services18:26
clarkbletsencrypt before all the webservers18:26
clarkbThey definitely exist, and once we've bootstrapped things sufficiently the order tends to matter less18:26
fungithough also the nameserver/letsencrypt ordering is primarily a zone bootstrapping problem right? if we're not adding a new domain it doesn't matter?18:26
clarkbbut if you bootstrap from scratch the other is important and not encoded beyond the order of the jobs and running serially18:27
fungioh, as far as making sure cname records are deployed18:27
fungiokay, i've got it18:27
fungiso any time we're bootstrapping a new service really18:27
clarkbfungi: no I'm talking about the server bootstrapping in this case not the CNAME addition (though that order also matters)18:27
clarkbbasically if we went fully parallel we couldn't safely deploy a new name server and attempt to update LE certs18:28
clarkb99% of that time that doesn't matter, until it does18:28
clarkbsimilarly with the various webservers that all need LE certs. We rely on the le job running early before they happen to properly bootstrap new webservers18:29
clarkbCurrently that is only encoded in the pipeline def order with the assumption things will run serially18:29
clarkbAll of this is fixable, but it isn't as simple as making it parallel18:29
fungiwe wouldn't add the nameserver into production (list it through the domain registrars) until it was serving the right records though, yeah?18:29
fungithat's a manual step18:29
fungiso letsencrypt shouldn't try to use it for lookups anyway18:30
corvusthat was an unfortunate oversight; the deploy pipeline is supposed to have explicit dependencies (after all, it does a lot of file matchers)18:30
corvusseveral jobs have an explicit dep on letsencrypt18:30
corvus(codesearch, etherpad, grafana)18:30
clarkbfungi: the issue isn't on the resolution side it is on the ansible side. We run the playbooks on both but it will fail to add records to a zone which doesn't exist if you get the order wrong18:30
clarkbcorvus: ah yup looks like some do have the right annotations but not all18:31
corvus(wow codesearch is listed in the pipeline twice :/ )18:31
clarkbfungi: basically the ansible will fail and then we won't get new certs for anything. It's not that the dns lookup will fail, we never get to that point18:31
fungiclarkb: we make the zone exist by adding records to it though, right?18:31
fungiwe're just installing zonefiles into the fs i thought, not using an api like dynamic dns updates or something?18:32
corvusif it's a case we want to handle, letsencrypt depending on nameserver is reasonable (though right now, nameserver runs after letsencrypt)18:33
corvussorry i have to run18:34
fungilooks like we inject lines into existing files like /var/lib/bind/zones/acme.opendev.org/zone.db so maybe the problem is that we're not writing out the whole file (because we want to avoid invalidating other entries in it which might still be in use)18:34
clarkbfungi: sort of. the acme stuff is dynamic but I'm not sure what triggers it yet looking deeper18:34
clarkbfungi: but https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/master-nameserver/tasks/main.yaml is part of service-nameserver and that installs bind which ensures there are dirs to write to18:35
fungiif that's the actual place it's breaking down, we could just make sure to always create the file before writing to it18:35
clarkbfungi: but then you don't get file perms right because bind isn't even installed yet18:35
clarkband then https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/letsencrypt-install-txt-record/tasks/main.yaml runs in the letsencrypt job18:36
clarkbthere is an implicit dependency between them today that the service-nameserver playbook and job run before we try to set up any LE certs18:36
clarkbit hasn't been a problem because we haven't tried to replace any nameservers recently18:37
clarkband if we are careful when replacing the server we don't need to encode the dependency, but it does exist18:37
fungiso in theory we could just include the nameserver setup role there before trying to create files18:37
fungiwhich ideally should be a quick no-op if the server is already configured18:37
clarkbexcept ansible is slow so not quite, but yes that would be an option18:38
fungi"quick" in ansible terms then ;)18:38
clarkbthat becomes similar to the gitea-lb vs gitea job though18:38
clarkbwe would basically collapse into a single job to encode the dependency and delete the other job18:38
clarkbhappy to do that, but we still need to make changes like that before parallelizing things is safe18:39
clarkbI'm beginning to think we have a couple of first tasks we need to do here. We can do some trivial updates like my cloud launcher change to speed things up immediately. But we should also do a large scale graphing exercise of dependencies in a human readable format if possible.18:40
clarkbThen once we've got that graph we can either encode deps as zuul job deps or collapse jobs together as it makes sense18:40
clarkbthen we can switch to being parallel18:40
fungii concur18:41
opendevreviewClark Boylan proposed opendev/system-config master: Run the cloud launcher daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80479518:43
opendevreviewClark Boylan proposed opendev/system-config master: Remove extra service-codesearch job in deploy  https://review.opendev.org/c/opendev/system-config/+/80479618:43
clarkbThere are a couple of easy cleanups based on some of what we discovered during this discussion18:43
clarkbI'll add this as a meeting agenda item as I think it deserves discussion around my next steps above and then actually tackling them18:43
fungithanks!18:46
clarkbok the wiki should have my condensation of all that ^ in it now. Feel free to add additional notes or edits18:49
clarkbI'm going to grab lunch now. But plan to work on the kata lists test server next18:51
clarkbfungi: re disabling esm I'll have to do that after the snapshot boots unless we want to disable it on prod, snapshot again, then reenable on prod. I suspect the safest thing is after boot since the daily package updates won't run for a while anyway18:52
fungiyeah, i assumed after boot18:53
fungiat one point we also had problems with too many ansible runs at the same time causing oom events on bridge.o.o, right?18:57
fungithinking back to the parallel deploy jobs thing18:57
clarkbfungi: yes, but that is tied to connectivity errors piling up ansible processes18:57
clarkbI agree that is a risk though. We might need a semaphore with say only 5 jobs allowed to run at once18:57
clarkbto temper that18:58
fungiand i guess if we did run into that... yeah18:58
fungiexactly what i was thinking, since semaphores can be limited to more than 118:58
clarkbthe risk there is we would want the semaphores to all live within one pipeline I think18:58
clarkbexclusive to a pipeline but >1 job in a pipeline18:58
corvusIf we have a builder semaphore one secondary semaphore is ok19:00
corvusThe other pipes can't get it19:00
corvus* If we have a buildset semaphore one secondary semaphore is ok19:00
fungiyeah, coupling those sounds reasonable19:01
corvusOne semaphore for pipeline exclusion (max 1). One semaphore for job exclusion (max N).19:02
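A minimal sketch of that two-semaphore arrangement as it might be written in system-config (file path, semaphore names and max values here are made up for illustration, not the actual opendev configuration):

    cat > zuul.d/infra-prod-semaphores.yaml <<'EOF'
    # buildset-wide lock: only one of the deploy/hourly/daily pipelines
    # may run its infra-prod jobs at any given time
    - semaphore:
        name: infra-prod-pipeline-lock
        max: 1

    # per-job lock: up to N playbooks within the active pipeline run in
    # parallel, capping concurrent ansible processes on bridge
    - semaphore:
        name: infra-prod-playbook
        max: 5
    EOF

The idea being that a paused lock-holding parent job would hold the first semaphore for the whole buildset while the individual jobs hold the second.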
clarkbaha19:02
opendevreviewMerged opendev/system-config master: Run the cloud launcher daily instead of hourly  https://review.opendev.org/c/opendev/system-config/+/80479519:22
clarkbfungi: can you check my comment on line 195 of https://etherpad.opendev.org/p/listserv-inplace-upgrade-testing-2021 before I boot the test instance? I assume there is some way to do what I describe there but I'm not sure I know what it is19:51
clarkbfungi: also do you know what you called the snapshot? I'm not able to list the snapshot currently but am trying to figure that out19:59
clarkb--os-volume-api 1 volume list using that works to get the volumes20:01
clarkbbut not snapshots20:01
clarkbaha because it isn't a volume snapshot it is a server image20:03
clarkbI need image list. Found it20:03
clarkbfungi: ok I think I'm ready to boot off the snapshotted image if you can take a look at line 195 on the etherpad above20:05
fungiclarkb: yep, sorry, was out taming the yard, looking at the pad now20:18
fungiand sorry i wasn't clear earlier, it's definitely a server image not a volume snapshot (the latter would assume cinder and bfv i think?)20:18
fungiclarkb: so... package maintainer scripts are supposed to obey disabled state for services20:22
fungiit's maybe possible they'll break, but i think the restarts are supposed to be a successful no-op when services are disabled, per debian policy (and these packages are generally inherited from debian anyway)20:23
clarkbfungi: testing with the zuul test node servers indicated this wasn't the case. I suspect because with an upgrade it is different?20:24
clarkbfungi: the services were definitely running after the upgrade20:24
clarkbfungi: the statement there about it starting services is based on previous experimental data20:24
fungimmm, how were those services disabled?20:25
fungisystemctl disable or some other way?20:25
clarkbya using systemctl disable20:25
*** artom_ is now known as artom20:26
clarkblooking at my notes it seems mailman started after the first upgrade but maybe not exim? then maybe exim started after the second one. I wish I had better specifics about that now20:27
clarkbfungi: after we boot the snapshot can we check for anything spooled up and clear that out if it exists?20:27
clarkbif so I can boot the snapshot now and we can take a look and see if this is even a concern20:27
fungiyeah, worth double-checking them20:29
clarkbok should I boot the server now then?20:29
fungican always clear out the dirs under /var/spool/exim20:29
fungiyeah, i'm on as long a break from yardwork as we need, go ahead and i'll check it20:30
clarkbok proceeding20:30
fungilmk the ip address it gives you20:30
fungiand then i'll start a root screen session on it20:31
clarkbfungi: did you want to run through the upgrades with me too? Not sure how interested in that bit you are. My plan was to record the steps like I did with the zuul test servers on that same etherpad20:31
fungii can, sure20:32
fungiserver is responding to ping but ssh doesn't work yet20:37
clarkbya20:37
fungithough the server has a static ip configuration in /etc/network/interfaces... i wonder if that's getting properly reset20:39
clarkboh it does?20:40
clarkbhrm I want to say we've run into this before and had to do a rescue instance? ugh this gets really annoying really fast20:41
clarkbya I suspect that may be the issue20:41
clarkbbecause we uninstall cloud init20:41
clarkb:(20:41
clarkbmaybe we should stop doing that20:42
fungiwell, to be fair, we almost never do server images20:42
fungi(or in-place upgrades)20:42
clarkbya but when we do it is because the other options are bad20:42
clarkbI'm not really sure how the whole rescue instance thing works. Is that something I can do via osc?20:43
fungii've only ever tried it through the dashboard20:43
fungibut in essence it boots a replacement vm from some standard image and then attaches the server's volume as another block device20:44
clarkband from that we can mount and edit /etc/network/interfaces20:44
clarkblooks like there is an openstack server rescue command20:44
clarkbI'll try that20:45
clarkbhrm I didn't set a key name on the original boot so I'm not sure there will be keys set on the rescue vm20:46
clarkbwhy isn't that part of what the rescue api takes20:46
clarkbfungi: I suspect that I may need to unrescue then delete my test instance20:47
clarkbthen start over and set key name appropriately20:47
clarkbyup I get an ssh connection now but it wants to fallback to password20:47
clarkbI'll unrescue, delete the instance, boot again with a key set then rescue again20:47
funginormally when you do a rescue through the dashboard it tells you the root password20:48
clarkbhuh it did not do that here20:48
clarkbthough it looks like I can set a rescue password20:48
clarkbI guess I can try that before deleting and starting over20:48
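Roughly the osc flow being attempted here (server name is a placeholder; --password sets the rescue instance's root password so a console or ssh password login works without a keypair):

    openstack server rescue --password "$RESCUE_PASS" lists-kc-upgrade-test
    # log in as root, inspect or repair the original disk, which shows up
    # as a second block device (/dev/xvdb in this provider)
    openstack server unrescue lists-kc-upgrade-test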
clarkbyup that seems to have worked. The good news is the rescue image has an /etc/network/interfaces that I can refer to as well20:55
fungipita but at least it's a way forward to get the thing usable. i suppose you could chroot into it and disable esm too if you wanted20:56
clarkbbut now I'm confused because it seems the /etc/network/interfaces on the other device is the same as the one on this device20:58
clarkbthey are clearly different devices if I look at /etc/hostname and /mnt/etc/hostname or /etc/hosts etc20:59
fungiso maybe on boot the rax-agent or whatever it is does config file injection then?20:59
fungiin which case there could be other reasons for sshd not accepting connections, i guess20:59
fungimaybe it was taking a long time gathering enough entropy to generate new host keys?20:59
clarkbI see a .bak file with the old content I expected. I half suspect that it just didn't restart networking and that if I unrescue at this point maybe it will work?21:00
clarkbfrom 20:34 today21:00
clarkbI'll try that I guess. I haven't changed anything via the rescue so if that works then hax21:00
fungioh yeah maybe21:00
fungidoes it have host keys?21:00
clarkboh I've already dropped out I didn't check21:01
clarkbI guess thats the next thing to check if it continues to fail21:01
fungino worries, something to check if it happens again, yeah21:01
fungiit's pinging again21:03
clarkbbut still no ssh. I'm finding this very confusing. Also doesn't debian's ssh init generate ssh host keys?21:04
clarkbwhat if sshd is not starting at all for some reason?21:04
fungiyes21:05
fungiwhich needs entropy, which is hard to come by on a vm21:05
fungiand looks like haveged is not installed on it21:05
clarkbah I see what you mean earlier it could just be waiting on that?21:05
fungialso quite likely no compatible host entropy kernel module21:05
fungiyeah21:05
clarkbdo you think it is worth waiting to see if this chagnes or should I rescue it again?21:06
fungiit likely says on the console if it's generating keys21:06
clarkbya but for the console i have to login to the dashboard :P I was hoping to avoid that, but maybe I should21:06
clarkb*shouldn't21:06
fungiopenstack console url show <uuid>21:07
fungiit seems to want the root password for maintenance21:08
clarkbdoes that work with rax? I know just dumping the console doesn't. I'm logged in21:08
fungimaybe fsck of the rootfs failed?21:08
fungiyeah, console url show works with rac21:08
fungirax21:08
clarkbtil21:09
fungii wouldn't be surprised if there are fs inconsistencies since it was imaged while mounted and likely writing21:09
clarkbI agree it wants a root login. Should I delete this instance and try again? Then check the console of the new instance? maybe the snapshot isn't so happy?21:09
clarkbif it fails again we do over?21:09
fungii would rescue boot and fsck the block device while unmounted21:09
clarkbok I'll try that21:09
fungii suspect imaging any running server could result in this situation21:10
fungii've had more luck imaging servers while they're shutdown21:10
fungimore frightening though if rackspace did file injection on a filesystem which wasn't clean21:11
fungithe power of cloud21:11
clarkbfungi: just `fsck /dev/xvdb` ?21:13
fungisure, though you might have to hit 'y' a lot21:13
fungiyou could add -y21:13
fungiit's unlikely you'd tell it to do anything other than try to repair anyway21:13
clarkbmy manpage doesn't show -y as a valid option21:14
fungii may be confusing with bsd ;)21:14
clarkbactually it probably wants xvdb1 since that is the partition with an fs on it21:14
fungimy fsck manpage has -y documented21:14
clarkbhuh mine does not21:14
clarkbmaybe I need to look at fsck.ext321:15
fungithe fsck manpage on lists.k.i also has it21:15
toskymaybe it's specific to the fsck.foo you use21:15
clarkbya has to be a specific fsck21:15
clarkbjust fsck doesn't document it21:15
fungiit is specific to the fsck.foo, but the general fsck manpage also says that about it21:15
clarkbneat mine doesn't on suse nor does the debian rescue image I'm on21:16
clarkbI'm running that fsck now21:16
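The repair step, approximately as run from the rescue instance against the unmounted partition identified above:

    # -y auto-answers yes to repair prompts; fsck dispatches to the
    # filesystem-specific checker (fsck.ext4 here)
    fsck -y /dev/xvdb1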
fungi"-y r  some filesystem-specific checkers, the -y option will cause the fs-specific fsck to  always  attempt  to  fix..."21:16
clarkbthat was quick it is done21:16
fungidid it say it repaired anything?21:16
clarkbit updated free inode and block counts21:16
toskyboth Debian 11 (ok, testing) and Fedora 34 don't document it (same manpage apparently, last update February 2009, from util-linux)21:16
clarkbcloudimg-rootfs: clean, 303705/5120000 files, 1247631/10485499 blocks21:16
clarkbdoesn't seem to have done much21:17
clarkbI'll mount it now and check for ssh host keys21:17
clarkbit has its preexisting keys from the snapshot21:17
clarkbfungi: ^ anything else you want me to try before giving it another reboot?21:18
fungitosky: interestingly, the ubuntu 16.04 fsck manpage i quoted from above claims to be from "util-linux February 2009"21:18
fungimaybe they patched it21:18
fungiclarkb: nah, i strongly suspect it was fsck failing at boot which caused the behavior we saw21:18
fungigive it another try now21:18
clarkbok21:18
fungitosky: i agree my newer debian machines don't document -y in the general fsck manpage but do in the manpage for, e.g., fsck.ext2 and so on21:20
toskyweird21:21
clarkbthe console shows the boot splash. I wish bootsplashes went away on server images21:21
clarkbI suspect this may fail as it is taking a long time again21:22
fungitosky: i guess they decided to axe a number of entries in the general manpage which just said "this normally does x but behavior depends on the backend chosen"21:22
clarkbyup it is in emergency mode again21:22
clarkbbut the boot splash prevented us from seeing why21:22
fungidid it say why?21:22
clarkbno because boot splash21:22
fungiugh21:22
clarkbI guess if I rescue it again there might be something in the kernel or syslog?21:23
fungimaybe, and worst case we can disable the bootsplash in the grub.config or whatever21:23
clarkbfungi: ya but doesn't grub require complicated reconfiguration when you do that now?21:23
clarkbI wonder if that will work using debian tools against the ubuntu image21:23
funginot if you're editing the config in /boot21:23
clarkbah ok let me rescue it again21:24
fungithere's a fancy run-parts tree in /etc which you can use to build a grub.cfg file, but that's all it really is. you can edit the built config directly as long as you don't care that rerunning the config builder will undo your changes in it21:25
clarkbor that if you get it wrong grub may fail? which isn't much worse than what we're doing now21:25
fungishould just be able to edit /boot/grub/grub.cfg and take out the "splash quiet" from the kernel command line21:26
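A sketch of that edit; note that rerunning update-grub would regenerate the file and undo it, which is fine for this throwaway test server:

    # strip the splash/quiet options from the generated kernel command lines
    sed -i -e 's/ quiet//g' -e 's/ splash//g' /boot/grub/grub.cfg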
fungiand yes, splashscreens on virtual machines (or servers in general) are beyond silly21:28
clarkbdoing that now21:28
clarkbok that is done. going to quickly check kern.log and syslog type logs21:29
clarkbthose don't have anything from today in them implying we aren't getting that far21:30
fungiright, if they don't get far enough to mount the rootfs, that's what i'd expect21:30
clarkbfungi: I think it might possibly be the swap device in fstab21:31
clarkbI'm going to comment those out21:31
fungioh, quite likely21:31
fungiin fact, yes, i bet we set swap to a partition on the ephemeral disk which on the new server isn't partitioned/formatted21:32
fungigood call21:32
clarkbyup21:32
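Approximately what that amounts to from the rescue instance, with the snapshot's root filesystem mounted at /mnt (mount point assumed):

    # comment out the swap entry pointing at the old ephemeral disk,
    # which is not partitioned/formatted on a server booted from the image
    sed -i '/swap/ s/^/#/' /mnt/etc/fstab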
clarkbit is unrescuing now21:32
fungiof course, if it weren't for the splashscreen, we'd have known that an hour ago ;)21:33
clarkbindeed21:33
clarkbalright it is up finally21:34
fungiand i can ssh in21:34
fungii have a root screen session established on it21:34
fungiexim4 and mailman did not start on boot, that's good21:35
clarkbyup I didn't expect them to start booting from the snapshot, just after the upgrade(s)21:35
clarkbfungi: looks like exim may have some stuff spooled21:35
fungiagreed, just getting a baseline so we know later21:35
fungiokay, cleared out the exim4 spooldirs21:37
fungithat way if it does start, it shouldn't try to deliver any duplicate messages21:37
clarkbthanks. you just rm'd the dir contents for input etc?21:38
clarkbI'm getting my notes together on the etherpad then will proceed with upgrade things21:38
fungii did, yes21:39
fungi/var/spool/exim4/*/* basically21:39
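Roughly the check-and-clear described here, so a later exim start cannot deliver stale queued mail (the -bpc count flag assumes the Debian/Ubuntu exim4 packaging):

    exim4 -bpc                    # how many messages are currently queued
    rm -rf /var/spool/exim4/*/*   # clear input/, msglog/, db/, etc.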
clarkbalright the next step on my etherpad is to unenroll from esm. I'll do that21:40
fungiand then we want to check esm status on the production server21:40
clarkbyup I just did that and it says it is still enrolled. I'll make a note to check it again tomorrow in case it takes time for that accounting to happen21:41
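A sketch of the unenrollment step; the exact client and subcommands depend on which ubuntu-advantage-tools version the xenial snapshot carries:

    ua status    # see what the snapshot believes it is attached to
    ua detach    # unregister this copy so it stops counting against the quota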
clarkbas expected upgrading is a noop21:41
clarkbmy notes say I should reboot, but I think I can skip that because no package updates occurred21:42
fungiyes21:42
fungithat's fine21:42
fungieffectively we did *just* "reboot" it anyway21:42
clarkbyup exactly.21:42
clarkbfungi: the next step is actually doing an upgrade. I'm going to take a short break here as I need some water. But will get back to it and start a root screen and update my notes on the etherpad as I go through answering questions if you want to follow along21:43
fungiare you using the root screen session i started (and if not, do you want to?)21:43
clarkboh not yet sorry I didn't realize there was one but you said you would start one21:43
clarkbI'll use that one going forward21:43
fungiand yeah, i'll do a quick round with the leaf blower, brb21:43
clarkbI'll start, the beginning of this isn't too interesting21:48
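The upgrade itself is the stock Ubuntu path, run inside the root screen session; a sketch of what follows (run once per hop, xenial to bionic and then bionic to focal):

    apt-get update && apt-get -y dist-upgrade   # be current before crossing releases
    do-release-upgrade                          # produces the conffile prompts discussed below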
clarkbfungi: you ready for me to accept the new packages? This is where it gets fun and you have to sort out keeping old files or accepting new ones. Though I have a bit of a cheatsheet from earlier testing21:53
* clarkb goes for it. we can always redo testing again if necessary21:55
fungii'm back, sorry21:56
clarkbfungi: if you look at the etherpad we are at line 21621:56
fungiand yeah, the choices are less interesting than the results21:56
clarkbfungi: so this step is one that wasn't curses before, it was just a prompt, but also one I had questions about21:58
clarkbcurrently we only support a subset of languages in our lists, but I figure the safest thing is to select all of them?21:59
clarkbfungi: do you think it is worthwhile to only do the subset?21:59
fungiyeah, just take the default there. it won't really chew up that much additional space nor time generating locales for things21:59
clarkbwell the default was none iirc21:59
funginot even english? interesting21:59
clarkband ya it is relatively tiny and not that much time so I figured lets just pick all of them21:59
clarkbI'll proceed with all selected22:00
fungithat may be important for multilingual support in mailman anyway22:00
clarkbI'm going to select no for saving the iptables rules because I added the 1022 rule that we want to go away later22:01
fungioh, what was rule 1022 for?22:01
fungii see, temp ssh server22:02
clarkbyup22:02
fungias long as you're okay with that vanishing on reboot22:02
clarkbfungi: it was going to anyway as the upgrade doesn't restart that sshd aiui22:04
fungiand yeah, the logind.conf change looks non-impacting for us22:05
fungiis the current plan just to upgrade from xenial to bionic, or are we going all the way through to focal?22:08
clarkbI think we can go all the way to focal22:08
clarkbalso I just added a question to the etherpad. Do we need to uninstall puppet first or can we do that after.22:08
clarkbWe'll test if uninstalling it after works on this run I guess22:08
fungiif we're not puppeting anything on those servers any longer, i'd uninstall puppet first before doing any upgrading22:09
clarkbthat works too22:09
clarkband yes we are not puppeting anymore with the move to ansible for lists22:09
fungifewer unknowns that way22:09
clarkbwfm22:09
fungiotherwise there's every chance the puppet packages will conflict with distro packages somewhere22:10
fungilike by insisting on a capped version of some dependency, or something22:10
clarkbfungi: the current pop up is unexpected as we should have that list shouldn't we?22:10
fungiyes, we should probably investigate it22:10
clarkbthere is a /var/lib/mailman/lists/mailman list22:11
clarkbI wonder if this is just dpkg being extra verbose22:11
clarkbfungi: I'm going to select 'OK' to continue if you think that is good considering /var/lib/mailman/lists/mailman exists22:12
fungiyeah, i'm not sure why it's not finding that, but maybe we have a path getting overridden somewhere22:13
clarkbdid you want to check anything else before I hit ok?22:13
fungino, go ahead and okay it22:13
fungiwe technically don't rely on the mailman list anyway, as we disable password reminders22:14
fungiwe'll want to keep the modified templates i think? because we install them with ansible (though we may need to upgrade the ones in system-config once we're done upgrading everything)22:15
clarkbyup I think this is a keep22:15
fungithe maintainer versions will be added to the directory with a .dpkg-new extension on them for reference anyway22:16
fungiso we can compare later after the upgrade is done22:16
clarkbok I guess just keep all of these for now then22:17
fungiespecially if they're files in our ansible, since for the production upgrade ansible will just replace them anyway22:17
fungilooks like the new ntp.conf is ~equivalent to the one we were installing with puppet. do we no longer do that?22:20
clarkbfungi: ya we don't use puppet for that anymore and no puppet on this server. This is an "install the package maintainer's version" case iirc22:20
fungisounds good22:21
clarkbya that is what I have on my notes from using the zuul test nodes. I'm going to select 'Y' here22:21
fungiahh, okay, that looked like the daily ntp cronjob in your notes22:21
fungiwe do manage the sshd config with ansible though, right?22:22
clarkbyes we do22:23
clarkbthis is a keep ours22:23
fungiahh, yeah your notes say yes22:23
clarkbfungi: the etherpad has the notes ya22:24
fungibefore the reboot step, i want to open a second window in screen and check the exim4/mailman service states22:24
clarkbok22:25
clarkbfungi: you are clear to check things.22:27
fungiopening a new screen window now, if you're ready22:27
clarkbyup I'm ready22:28
fungiit did indeed start mailman but not exim22:28
fungiwhat's sad is there's still /etc/rc2.d/K01mailman22:29
fungias added by `systemctl disable mailman` earlier22:29
clarkbthey are units too. Maybe we should do another systemctl stop mailman && systemctl disable mailman ?22:29
clarkbs/units too/units now/22:29
fungialso /etc/rc2.d/K01exim422:29
clarkbI think it may ignore the compat stuff when there are valid units22:30
clarkbthat was based on testing at some point22:30
fungiRemoved /etc/systemd/system/mailman-qrunner.service.22:30
fungiso that's why indeed22:30
clarkbI added that step to my notes. Do you want to do the same with exim4 to see if it removes anything?22:31
fungixenial to bionic upgrade switched from sysv-compat to systemd and didn't honor the existing service state22:31
clarkblooks like you just did I'll put that in the notes too22:31
fungii did exim4 and it didn't remove anything22:31
clarkbyup22:31
clarkbdoesn't hurt to run it again though22:31
fungiexim4.service is not a native service, redirecting to systemd-sysv-install.22:31
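A sketch of re-asserting the disabled state after the upgrade swaps the sysv-compat units for native ones, roughly what was just run:

    systemctl stop mailman exim4
    systemctl disable mailman exim4
    systemctl is-enabled mailman exim4   # check the resulting state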
clarkbshould I select y to reboot now?22:32
fungiwhen the exim4 service switches from sysv-compat to systemd i expect we'll see the same behavior for it22:32
fungiyep, go for it22:32
clarkbya that may happen between bionic and focal and explain why I remember it causing trouble when tested previously22:32
clarkbit is rebooting now22:32
clarkbit is up22:33
fungilogged in and root screen session started again23:01
clarkbok we can do the sanity checks there22:34
clarkbthen continue to the focal upgrade22:34
fungiit did not restart those services on reboot22:34
clarkbyup22:35
fungipuppet uninstall command lgtm22:35
clarkbfungi: does that puppet removal look correct to you? we didn't do it prior to upgrading to bionic but can do that on the prod server. I figure we should purge it out now22:35
clarkbcool running it22:36
fungido we want to autoremove as well?22:36
fungiyes please22:36
clarkb++22:36
fungiyou can also do a clean after that to clear out the previous downloads and free up some space in /var22:37
clarkblike that?22:38
clarkboh I guess autoclean != clean. Do you think we should do a clean here?22:38
fungisure, clean would also remove packages which still exist on the mirrors22:38
clarkbdoesn't seem to make a difference here?22:39
fungiautoclean only removes local downloaded copies of things which are no longer on the mirrors22:39
fungi(in theory anyway)22:39
fungialso possible the do-release-upgrade script already did that for us22:39
clarkbya it may have done so22:39
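A sketch of the pre-upgrade cleanup settled on here (package names are a guess; the production servers may have different puppet packages installed):

    apt-get purge -y puppet puppet-common
    apt-get autoremove --purge -y
    apt-get clean    # drop cached .debs to free space in /var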
clarkbAlright the next step on the etherpad is upgrading to focal unless there is other sanity checking you want to do22:40
funginothing i can think of22:40
fungihopefully we'll get fewer debconf prompts from this upgrade22:44
clarkbfungi: from my previous notes I selected yes here but do you think we should select no to maybe avoid things like exim and mailman getting started?22:45
fungiwell, i think it's worth trying to see. if they do start it's not the end of the world since we cleaned things out22:45
clarkbfungi: select yes then?22:45
fungiand in production upgrade scenarios it won't matter22:45
fungiyeah, go for the yes i guess22:45
clarkbok I'll select yes22:45
fungiideally restarts would only be triggered for already running services22:46
clarkbI think the text reported it was restarting exim but I don't see it running22:46
clarkbso it must be smart about that22:46
clarkbfor server upgrade planning I'm thinking we can do something like thursday: Stop services on lists.kc.io then shut it down and snapshot it. Then start it again without services running and go through this upgrade process. If that all checks out do similar for lists in a week or two?22:50
clarkb*do similar for lists.o.o in a week or two22:50
clarkbkeeping snmpd.conf because it is ansible managed22:52
fungiyeah, noting that lists.o.o imaging will take hours even if shut down22:52
clarkbhrm ya maybe we need to think about that first22:52
clarkbshouldn't be a problem to proceed with lists.kc.io as it snapshots quickly22:52
fungiwe have backups of lists.o.o22:52
fungifor the most part these debconf prompts should be the same ones we kept our versions of in the previous upgrade22:53
clarkbyup just fewer of them if previous testing is an indication22:54
fungithat was reasonably quick22:57
fungiwant to pause again at the reboot step so i can check running exim and mailman services22:58
clarkbyup22:58
fungineither is running nor enabled22:59
fungiexim4 is still via sysv-compat too22:59
fungishould be safe to reboot22:59
clarkbok it must have only been mailman that was a problem last time22:59
clarkbrebooting22:59
fungii'm ssh'd in again with a new root screen started23:01
clarkbjoining23:01
fungii'm paranoid about iptables rules getting wiped out due to circular symlinks after that one time where the wiki server ended up with an exposed elasticsearch23:04
fungibut lgtm23:04
clarkbif you talk to apache it says there are no mailing lists23:04
clarkbbut the lists are listed in /var/lib/mailman/lists23:05
fungialso `list_lists` spits them out23:06
fungibut seems /usr/lib/cgi-bin/mailman/listinfo isn't finding them23:06
fungihowever, this part we can probably troubleshoot tomorrow, if you want a break23:06
opendevreviewMerged opendev/system-config master: Remove extra service-codesearch job in deploy  https://review.opendev.org/c/opendev/system-config/+/80479623:09
clarkbya I'm thinking this has been a number of hours of mailman upgrade stuff so far. Definitely want to figure this out but maybe we can do that tomorrow23:09
clarkbI need to get our meeting agenda out today too23:09
fungicool, in that case i'm a get back to yardwork before i run out of sunlight23:09
clarkbfungi: do you think we should stop apache or shutdown the test server for now?23:09
clarkbor just leave it be?23:09
ianwclarkb/fungi: did you have any thoughts on restricting the redirect for paste to specific UAs in https://review.opendev.org/c/opendev/system-config/+/804539 ?  23:15
ianwi don't really mind if we want to just leave the http site up, just seemed like an option23:15
clarkbianw: does the paste command use a UA that we can key off of23:17
clarkb?23:17
ianwclarkb: i linked to the change that i think implements it, added in ~2014 23:19
clarkbI'd be ok with redirecting for everything else23:20
fungiianw: yeah, saw the comment, seems like a fine idea, i just haven't had time to work out the details and test today23:21
ianwno probs, i can have a look since i broke it :)23:21
clarkbI think the issue has to do with mailman's vhosting. lists know what url they belong to and if you look them up from the wrong url it doesn't work23:27
clarkbusing /etc/hosts to override locally seems to fix it for me. fungi you can confirm tomorrow23:27
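A minimal way to reproduce that check against the test server (the IP is a placeholder and the host name assumes lists.kc.io is lists.katacontainers.io):

    echo '203.0.113.10 lists.katacontainers.io' >> /etc/hosts
    curl -s http://lists.katacontainers.io/cgi-bin/mailman/listinfo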
clarkbI'm going to context switch to meeting agenda stuff now23:28
clarkband can pick this up tomorrow23:28
fungiclarkb: oh, yep, great guess, that does seem to be the answer23:48
clarkbanother thing I found when digging into that is you can set the python version explicitly which might need to be done more carefully with our multisite mailman since we do config things there23:50
clarkbsomething to check out23:50
fungimy bigger concern with the multi-site mailman is adapting the systemd service unit to it23:52
clarkbI don't think that is an initial issue as the sysv stuff should keep working23:53
clarkbit did on the zuul test nodes23:53
clarkbwe leave the systemd unit as disabled, then can followup with unit conversions if we like23:53
clarkbagenda is sent23:56
