19:01:06 #startmeeting infra
19:01:06 Meeting started Tue Apr 21 19:01:06 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:09 The meeting name has been set to 'infra'
19:01:16 o/
19:01:16 #link http://lists.opendev.org/pipermail/service-discuss/2020-April/000010.html Our Agenda
19:01:19 o/
19:01:41 #topic Announcements
19:01:46 o/
19:02:00 I wanted to call out here that splitting opendev into its own comms channels seems to be working for getting more people to engage
19:02:09 o/
19:02:27 welcome! to all those people (not sure if any are here in this channel now but we've seen more traffic on the mailing list)
19:03:00 we're up to 80 nicks currently in the #opendev channel
19:03:38 (still a far cry from the 250+ in #openstack-infra, but many of those may be zombies for all intents and purposes)
19:04:38 #topic Actions from last meeting
19:04:38 also 20 subscribers to service-discuss and 25 to service-announce
19:04:44 #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-04-14-19.01.txt minutes from last meeting
19:04:51 there were no actions.
19:04:57 #topic Priority Efforts
19:05:03 #topic Update Config Management
19:05:28 maybe mordred can update on his activity here then ianw?
19:07:16 maybe we lost mordred
19:07:30 my understanding of it is that we've continued to push towards zuul driven CD of things
19:07:48 in particular we are now looking at cleaning up puppetry as and where necessary
19:07:56 does anyone know what the status of containerized zuul is?
19:08:01 heya - sorry
19:08:07 yes!
19:08:23 so - three things going on
19:08:26 (i'd like to proceed with the tls work, so catching up on that would be helpful for me)
19:09:06 first - I'm still working through the followup from the gerrit rollout - next on that list is gerritbot - this led me to eavesdrop which has turned in to reorganizing how we run puppet a bit
19:09:18 so - sorry for that rabbithole - but I think it'll be worth it
19:09:36 https://review.opendev.org/#/q/topic:puppet-apply-jobs
19:09:40 that's the topic related to that
19:09:48 second and third are nodepool and zuul
19:10:21 https://review.opendev.org/#/q/topic:container-zuul
19:10:26 (i think that rabbit hole -- getting the puppet jobs down to size -- is great and worth it)
19:10:40 nodepool-launcher is ready to go and I think now safe to land: https://review.opendev.org/#/c/720527/
19:10:56 it won't restart containers in prod, so I think we can land it then do a manual rolling restart of the launchers
19:11:20 if people are happy with what we did there with starting vs. not starting docker-compose ... I can apply the same thing to the zuul patch:
19:11:30 https://review.opendev.org/#/c/717620/
19:12:08 (issue being we don't necessarily want ansible to run docker-compose up every time it runs - but we DO want that to happen in the gate)
19:12:34 I believe once I update that patch with the start boolean - it'll also be ready to go
19:12:42 and I think also safe to land
19:12:56 but - since that's nodepool and zuul - please review with an eye to "is this safe to land"
19:13:05 we probably could start nodepool-launcher every time
19:13:39 corvus: maybe we land first with nothing starting - because we have to stop the systemd stuff
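A rough sketch of the manual rolling restart of the launchers mentioned at 19:10:56; the host names and the compose-file location are assumptions, not the actual deployment layout:

    # restart one launcher at a time so nodepool capacity never drops to zero;
    # /etc/nodepool-docker is a placeholder for wherever docker-compose.yaml lives
    for host in nl01.openstack.org nl02.openstack.org nl03.openstack.org nl04.openstack.org; do
        ssh root@"$host" 'cd /etc/nodepool-docker && docker-compose pull && docker-compose up -d'
        sleep 120  # let the restarted launcher settle before moving on
    done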
19:13:42 yeah
19:13:49 i'm okay with starting conservative there
19:13:50 and then land a patch to flip the var on the things where we're happy to do it every time
19:14:04 what's the thinking on nodepool builders? i didn't get the full story yesterday
19:14:18 also ianw discovered that debootstrap (used by dib to make debian/ubuntu images) needs a couple of patches to work from a container, so has published a custom build of it in a ppa and confirmed that's working from a container
19:14:19 (is nb04 broken? or what?)
19:14:21 ianw is further with diagnosing the issue - I think he's got a working build
19:14:25 we debated switching to something newer like mmdebstrap, but don't want dib to break for users of older platforms where those newer tools aren't yet shipped as part of the distro
19:14:29 but it involves two unlanded merge requests
19:14:44 corvus: nb04 is broken for debuntu builds
19:14:48 so they have been removed from it
19:14:53 yes, a few things in progress
19:14:58 oh good - it's ianw
19:15:03 specifically because debootstrap in docker containers explodes the next thing that runs in the container?
19:15:19 yes, it likes to unmount /proc
19:15:19 https://review.opendev.org/721394
19:15:41 it sounds like we can't really run our builders or executors in containers at the moment
19:15:54 i'm a little worried that the zuul tls work is starting to collide with this
19:16:13 the zuul patch has the executors running outside of containers
19:16:40 should we rethink what we're doing with the builders? or can we get them into a consistent state soon?
19:16:42 corvus: well - I think we can get to full ansible
19:16:43 i am working to get our dib functional tests converted to building from the container
19:16:57 it sounds like we should be able to run builders from containers with a patched debootstrap
19:17:00 corvus: which would be the part of the story that would most impact tls work, yes?
19:17:06 mordred: yeah
19:17:26 so we're looking at having 3 builders run from ansible+puppet, and 1 from ansible+containers?
19:17:35 so - yeah - let's give ianw a little bit to see if we can get a solid container story for the builder with patched debootstrap
19:17:37 or 3 from just "ansible"
19:17:49 please don't forget there is an arm64 builder which has not had a lot of attention, but i would not like to drop
19:17:53 I think just ansible if we can't get the container build going
19:18:01 ianw: I have thoughts on that - let's come back to arm
19:18:19 what kind of time are we talking about there, cause it sounds like ianw is working on a rabbit hole of his own with the container functional testing?
19:18:36 basically, we're holding a zuul release on opendev being able to test this stuff
19:18:59 well - it sounds like the patched debootstrap works - so now it's about updating testing to prove that it works and make sure we don't regress, yes?
19:19:29 (and working out the arm story)
19:19:36 so i think we need to either get the system into a place where we can realistically land a coordinated configuration change to the whole system in a day or two, or else sever the dependency between opendev and zuul releases (at least, temporarily)
19:19:57 ok. so - there are a couple of options for that
19:20:40 we can work on an ansible+pip install (I can work on that right now) - based on the current ansible+docker install and similar to how we did zuul-executors in the zuul patch
19:20:50 we'll need focal nodes for them to be new enough
19:21:03 mordred: why do we need focal for that?
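As background on the debootstrap problem described above (19:14:18, 19:15:19), a minimal reproduction sketch; the image, suite, and paths are illustrative rather than the builder's real configuration:

    # run an unpatched debootstrap inside a privileged container, then check
    # whether the container's /proc survived; losing it is what breaks the next
    # thing the builder tries to run
    docker run --privileged --rm ubuntu:bionic bash -c '
        apt-get update -qq && apt-get install -y -qq debootstrap
        debootstrap --variant=minbase bionic /tmp/chroot
        mountpoint -q /proc && echo "/proc still mounted" || echo "/proc was unmounted"
    '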
19:21:14 everything is pip installed so shouldn't depend on focal?
19:21:22 because of the reasons we're using the containers in the first place - the rpm helper tools on bionic are old or missing
19:21:36 oh for builders specifically. Got it
19:21:38 yeah
19:21:38 mordred: if you mean, just use pip on a plain host to install, i.e. replicating the puppet in ansible, i have a patch that does that
19:21:44 for focal, we need to merge https://review.opendev.org/#/c/720718/ to mirror it - and stop mirroring trusty
19:21:48 I think it's not unreasonable to upload a focal base image
19:22:15 yes
19:22:27 and that would be good so that we can have integration test jobs
19:22:38 but - I think we can work those in parallel
19:22:48 also, arm only builds xenial/buster/bionic/centos atm. we don't need the updated tools which are required for fedora, as of right now
19:22:53 fungi: i don't understand your comment in 720718
19:23:06 and get a focal base image uploaded to rax-dfw and boot a nb on it that we can use for fedora builds
19:23:09 fungi: i don't know what the differences between those two hosts are
19:23:15 as ianw says - we only need that for fedora builds
19:23:30 corvus: its a response to my comment
19:23:48 corvus: we haven't done the work to move reprepro from puppet to ansible yet
19:23:52 clarkb: i understand that. i don't understand how mirror-update.opendev.org and mirror-update.openstack.org are different
19:23:53 so we can boot the other ansible-based builders on bionic
19:24:12 corvus: mirror-update.opendev.org is ansible managed and only does rsync based mirror updates currently
19:24:21 the opendev.org server is the new one cron jobs are being migrated too, off the older openstack.org server
19:24:27 corvus: mirror-update.openstack.org does all the other mirror updates (reprepro and maybe other tools too)
19:24:30 s/too/to/
19:24:45 but they're both afs heads?
19:25:03 yes, both write into afs
19:25:35 I think we should not tie this to reworking anything about how reprepro and old mirror-update works - really just upping the quota and doing a manual release should be fine to get this moving, yes?
19:25:43 mordred: yes
19:25:43 sorry, this is proving a distraction. i still don't understand fungi's comment and the implications, but i'll just follow up later.
19:26:07 basically my comment was calling out that you need to bump the quota and do the manual release
19:26:19 if you do that it should all be fine
19:26:21 and fungi said something isn't necessary, but i don't know what.
19:26:35 great - so I think tasks would be: get focal mirroring going, get nodepool building focal nodes, build a manual focal-minimal to upload as a base image into rax-dfw, get a pure-ansible port of nodepool-builder
19:26:53 most of those can be done in parallel
19:27:04 corvus: oh, because mirror-update.opendev.org gets vos release run remotely by ansible and uses localauth to avoid timeouts
19:27:12 so do we want to switch all of the nb nodes to pure-ansible, retiring the current nb04?
19:27:14 I'm happy to take the pure-ansible port since I'm cranking on that stuff - can someone else help drive the mirror update?
19:27:24 sorry, i had to page all that back in
19:27:55 then make the container switch later after there's lots more testing?
19:27:55 mordred: bionic is sufficient to build fedora. in fact, i already did all of that, let me find the patch
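A sketch of the "bump the quota and do the manual release" step from 19:25:35 and 19:26:07; the volume name, the quota value, and running vos release with -localauth on a fileserver to avoid token-lifetime timeouts are assumptions based on this conversation:

    # raise the AFS quota on the mirror volume so the new focal content fits
    fs setquota -path /afs/.openstack.org/mirror/ubuntu -max 650000000  # example value, in KB
    # then release the volume by hand; -localauth (run as root on a server
    # holding the cell KeyFile) avoids the ticket expiring mid-release
    vos release mirror.ubuntu -localauth -verbose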
19:27:55 my comment was specifically in response to clarkb's
19:27:56 yeah - I think that's a sane thing to do for now - although I do think that continuing the container debugging and testing work is important
19:28:05 ianw: but not suse
19:28:09 ianw: because it doesn't have zypper
19:28:25 in response to clarkb's "Its possible this is no longer a concern..."
19:29:23 so - I think we should operate under the assumption that having at least one focal node would be beneficial - and that we also might need at least one bionic node because arm. hopefully we can coalesce on only focal once we can prove out that it works fine for arm
19:29:24 mordred: ianw I think the only major risk with the focal plan is focal + arm64. But it sounds like we can maybe keep that on xenial or bionic for a bit longer
19:29:29 so i was saying initial vos release timeouts *are* a concern for anything added to mirror-update.openstack.org (like reprepro-based mirroring which is still there for the moment) but not for things mirrored using mirror-update.opendev.org (like rsync-based stuff)
19:29:30 yeah
19:29:39 I don't think the ansible differences between bionic and focal are likely large
19:29:45 corvus: does that answer your question?
19:29:46 we don't have big things like systemd vs sysvinit
19:30:28 fungi: does that change apply to opendev or openstack?
19:30:57 mordred: that sounds reasonable
19:31:02 corvus: openSTACK because it's an ubuntu mirror
19:31:14 so vos release timeouts are still a concern for that change
19:31:36 i probably should have quoted clarkb's comment in my reply, but thought it was obvious what i was replying to (clearly it wasn't, sorry!)
19:32:04 well, the part you were referring to with "this" would have been helpful
19:32:13 I'm happy to work on the ansible nodepool-builder (which ianw may have already done) and the focal-minimal image in rax-dfw - can someone else drive the steps needed to get the vos release done safely?
19:32:16 https://review.opendev.org/#/c/692924/
19:32:20 ianw: awesome, thanks!
19:32:28 ianw: ^ does that seem reasonable? I think we can switch to containers on top of ansible-without-containers easily enough when we have that working reliably
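A sketch of the "build a focal-minimal image and upload it to rax-dfw by hand" task from 19:32:13; the element list is ordinary diskimage-builder usage, but the cloud name, region flag, and image format are assumptions:

    # build a minimal Ubuntu Focal image with diskimage-builder
    export DIB_RELEASE=focal
    disk-image-create -o focal-minimal ubuntu-minimal vm simple-init growroot

    # upload it to the target region; Rackspace may require converting the
    # qcow2 output to VHD first
    openstack --os-cloud rax --os-region-name DFW image create \
        --disk-format qcow2 --container-format bare \
        --file focal-minimal.qcow2 focal-minimal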
19:33:10 tbh this is what i was proposing about 6 months ago :)
19:33:17 ianw: although I think a lot of the current code can stay as it is in container-builder - so it'll be about picking out the appropriate things that we need for pip nodepool
19:33:25 mordred: I can start a root screen on mirror-update and grab the lock to get this started
19:33:42 then we need to update the quota, merge the change, and manually trigger the update
19:33:50 clarkb: ++
19:34:07 but first I need to load my ssh key
19:34:19 ianw: well - I think we would have been golden with the container work you did -- if there wasn't this crazy debootstrap bug :)
19:34:39 of all the things to derail this - I wasn't expecting _that_ ;)
19:34:57 it's just a lot of uncharted territory all around
19:35:25 anyway, i'll keep working on it
19:35:51 ianw: cool - and yes - I'd love for us to be able to switch back to pure-containers there
19:35:57 mordred: if we can swing back to gerrit things, we have a third party ci operator that is using devstack-gate and discovered that we are still not replicating to review.o.o/p/ properly
19:36:04 although there is also an arm thing we have to solve
19:36:18 mordred: I triggered replication of openstack/requirements (the out of date repo) just in case that was something that didn't get rereplicated after config updates, and no change
19:36:19 I've got thoughts on the arm thing - it's solvable
19:36:46 http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-04-19.log.html#t2020-04-19T15:04:32 <-- basic summary of what we need to do to build arm and x86 images when we build images
19:36:50 I also noticed we have replication things failing for the local /opt/git thing there
19:37:01 clarkb: ok - we need to investigate that - it should be fixed
19:37:18 can we swing back to that after the meeting?
19:37:55 mordred: yup
19:38:02 weird, a quick check shows that checks out for me
19:38:19 corvus: you can clone it but you end up on an older commit aiui
19:39:04 clarkb: i'm saying i don't see an older commit
19:39:25 oh - I know what it is
19:39:27 but anyway, mordred requested we defer this
19:39:55 well - I looked anyway
19:40:09 the issue is a few missing repos that we created while the mount wasn't properly in place
19:40:19 so we didn't actually create them on the real filesystem
19:40:22 (my local test case is opendev/system-config, because that's the ancient checkout i noticed the problem with)
19:40:33 starlingx/kernel would be one I believe
19:41:32 actually - it's owned by root
19:41:54 anywho - we can fix that
19:42:06 we should figure out if we're running manage-projects as the wrong user
19:42:19 and thus creating the local replication target repos as the wrong user
19:42:47 k anything else on this subject? as a time check we have 18 minutes left and a few other things to get to (but this was also a huge chunk of change last week so want to make sure we get through it)
19:43:32 I'm good
19:43:40 all clear here
19:44:47 #topic OpenDev
19:45:01 As mentioned we seem to be picking up some new traffic which is good
19:45:26 Fungi has proposed that openstack-infra become a SIG and the openstack TC is on board with that
19:45:32 I think it makes sense too
19:46:12 why not fold it into qa?
19:46:13 #link http://lists.openstack.org/pipermail/openstack-discuss/2020-April/014296.html Forming a Testing and Collaboration Tools (TaCT) SIG
19:46:16 The TaCT SIG - I love the proposal ;)
19:46:28 frickler: I think gmann is concerned they don't have the knowledge to manage some of the tools officially like that
19:46:37 frickler: members/leaders of the openstack qa team have expressed concern over suddenly becoming responsible for "more stuff"
19:46:50 frickler: I think long term that may make sense but for now this gives us the ability to keep it a thing that is more tightly scoped and work with the qa team as necessary
19:47:02 (also i could see qa shutting down as a team and folding into the same sig eventually)
19:47:17 oh, maybe that direction might work, too
19:47:28 or split of some things like devstack
19:47:31 o.k.
19:47:38 s/of/off
19:48:52 we could stand some volunteers to serve as chairs for that sig, if it's something folks are generally in favor of forming
19:49:21 If you'd like to volunteer to be service coordinator for opendev now is the time to do so
19:49:39 I mentioned I'd be happy to continue but also think new involvement is good as well
19:49:49 keep in mind that there is little responsibility as a sig chair, mostly just be aware of what's going on generally within the sig and be able to serve as a representative for it
19:50:08 ya we'll need a sig chair as well but that's less involved I expect
19:50:22 #topic General Topics
19:50:28 It is PTG planning time
19:50:58 we have been given some constraints in an effort to have collaboration happen between projects and keep hours sane for attendees
19:51:04 i have a link i was going to paste with sig chair responsibilities, but maybe i'll just follow up to that ml post with it
19:51:12 so is the vPTG to run completely on meetpad? or something else like zoom or bj?
19:51:16 (since we're short on time)
19:51:23 the result of that is a giant ethercalc where we need to sign up for time
19:51:33 frickler: not determined yet
19:51:35 frickler: I think that is still being sorted out. I'd like to be able to make meetpad an option
19:51:41 so we should keep pushing on it
19:52:03 from that ethercalc I've identified 3 two hour blocks that I think work with our global presence
19:52:04 i gather there's communication going out in the next day or two from the event planners to try and nail down requirements for collaboration software
19:52:05 i'll see about working out what the deal is with the python version there
19:52:13 corvus: I think mordred pushed a fix for it
19:52:14 o.k., but that'd need some big push IMO. I'll try to get some time allocated for that
19:52:18 oh awesome
19:52:30 corvus: the root cause was fun
19:52:34 Monday 1300-1500 UTC, Monday 2300-0100 UTC, Wednesday 0400-0600 UTC
19:52:40 corvus: https://review.opendev.org/#/c/721707/
19:52:49 those are the blocks I think will work so that we can each attend ~2 out of ~3 without too much pain
19:53:31 if there isn't any immediate objection to those blocks I can go ahead and sign us up for them (and tweak later if necessary)
19:54:09 (and yes it will mean an early morning or a late night for many of us if you intend to hit 2 of the 3)
19:54:28 (but it seemed to be an equitable distribution when I wrote out the times in a table)
19:55:12 +1 from me
19:55:23 yeah, i'll make myself available whenever
19:55:42 LGTM
19:55:44 maybe i can even swing all three with appropriate quantities of caffeine coursing through my veins
19:55:48 cool. I probably won't get to signing up for those times until later today so let me know if there is a major conflict
19:55:57 Next up is the wiki update but I think we can skip it due to time
19:56:00 which takes us to etherpad
19:56:10 etherpad is dead, long live etherpad?
19:56:19 as part of mordred's container/ansible/cd work the old etherpad servers are gone including etherpad-dev
19:56:29 we haven't replaced that server and will instead rely on system-config end to end testing
19:56:45 the idea mordred and I had was if we need to verify UI behavior we can hold a test node and use it to manually verify off what zuul built
19:57:02 I expect this will work reasonably well as a tool we can leverage for various services
19:57:10 yeah - if we find that sucks - we can always spin up a new etherpad-dev
19:57:18 Wanted to call this out as a separate agenda item because if it isn't working well then that feedback would be good to hear
19:57:19 mordred: ++
19:57:46 #topic Open Discussion
19:57:52 alright we have a few minutes for anything else
19:59:26 Sounds like that might be it. Thank you everyone!
19:59:30 #endmeeting
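Appended for reference: the "hold a test node and verify off what zuul built" idea from 19:56:45 could look roughly like the following; the tenant, job name, and exact autohold syntax are assumptions from around this era of Zuul, so check the current client before relying on it:

    # ask the scheduler to hold the nodes from the next matching build so the
    # deployed test service (for example an etherpad test deployment) can be
    # inspected manually
    zuul autohold --tenant openstack --project opendev/system-config \
        --job system-config-run-etherpad --reason "manually verify etherpad UI" \
        --count 1
    # once the build has run, list the holds to find the node to log into
    zuul autohold-list --tenant openstack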