19:01:06 #startmeeting infra
19:01:07 Meeting started Tue Mar 6 19:01:06 2018 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:11 * hrw just in case
19:01:11 o/
19:01:15 o/
19:01:17 * dmsimard adds a topic to agenda
19:01:23 #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting
19:01:38 #topic Announcements
19:02:23 hello
19:02:30 The PTG was last week. Despite being snowed in and the conference kicking us out a day early I think things went reasonably well
19:02:49 * dmsimard added topic to agenda
19:03:20 * mordred waves at the nice people
19:03:26 But keep in mind that people may still be traveling (I was traveling less than 12 hours ago still) and don't be surprised if planned discussions didn't happen because people were trying to find their way home or just find a Guinness
19:03:42 * tobiash waves back
19:03:49 I heard that someone got rebooked from Friday to Thursday
19:04:01 worth noting, the _conference_ didn't kick us out a day early, just the venue
19:04:04 conference went on
19:04:04 hrw: wow
19:04:21 fungi: right
19:04:27 a colleague spent 2 days at the airport for nothing :(
19:04:40 many of us did still meet Thursday afternoon and Friday, but in a far more ad hoc fashion
19:04:41 the airline just kept delaying the flights before eventually cancelling them
19:05:10 basically all that to say expect a slower than normal resumption of dev activities
19:05:17 * mordred would like to thank all scandinavian people for their role in getting him away from the snowstorm
19:05:27 #topic Quick PTG Recap
19:05:42 #link https://etherpad.openstack.org/p/infra-rocky-ptg
19:05:57 Despite this the Infra team did manage to get through quite a few of its PTG topics
19:06:21 If you are curious about how discussions went we kept notes on that etherpad
19:06:44 i was less available for infra discussions than i'd hoped
19:06:50 Also I think the helproom days went a bit better than in Denver. More communication and explicit scheduling, along with zuulv3 being a reality now, were probably the reason for that
19:06:57 ++
19:07:01 #action fungi generate rocky cycle artifact signing key
19:07:16 I wasn't in as many helproom hours as I would have otherwise liked, but I felt I was in a good position to help when I was
19:07:20 * fungi didn't get around to going through that with ptg attendees
19:07:21 clarkb: agree, the extra comms we did as a team helped a lot.
19:08:15 If there is anything in particular people are interested in, I'm happy to talk about that now, but I should also send a recap to the infra list and we can have proper discussion there too
19:09:02 ok we'll take it to the mailing list then
19:09:06 #topic Actions from last meeting
19:09:17 #link http://eavesdrop.openstack.org/meetings/infra/2018/infra.2018-02-20-19.01.txt Minutes from last meeting
19:09:22 #action clarkb clean up old specs
19:09:31 I may actually finally get around to that now that the ptg is over
19:10:46 #topic Priority Efforts
19:10:53 #topic Zuul v3
19:11:29 really quickly before we get to the gerrit/arm/ara stuff, is there anything urgent to go over with zuul?
19:11:54 umm, it's currently in the emergency file, want to talk about that?
19:12:04 let's
19:12:08 ++
19:12:23 zuul-web changed and it needs new puppet in https://review.openstack.org/#/c/549608/ to deploy
19:12:26 (also I don't think zuul01 ever got rebooted to address the systemd sadness from the week before the ptg)
19:12:29 i saw some of the discussion but didn't have time to work out the situation
19:12:34 * mordred is working on fixing the bits that are needed for 549608 to work
19:12:50 something about zuul-web static files again, right?
19:13:08 clarkb: sorry, jumping back to PTG: Is there an update from some of the help sessions as well? I'm especially interested in the handling of jobs and irrelevant-files
19:13:26 yep, location moved and apache config just needs to be updated too
19:13:28 clarkb: can do via email as well
19:13:47 AJaeger: I unfortunately was not part of that discussion but definitely something we should follow up with corvus and AJaeger on
19:13:49 er
19:13:50 we are at 25GB of ram, so a reboot / restart would help bring that down again (zuul)
19:13:51 the main issue remaining is that the building/publishing of the javascript content isn't actually working (and is also broken for storyboard, fwiw)
19:13:53 corvus and andreaf
19:13:57 http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all
19:14:09 for 549608 two things for review ... doesn't seem we need the documentroot any more? since we have a .* redirect?
19:14:11 mordred: even after updating the job it is still broken?
19:14:19 clarkb what's up?
19:14:22 clarkb: yes - there are a few more issues - I just found one more
19:14:25 clarkb: but we're VERY close
19:14:41 and dmsimard, backups of status? do they actually need to be exported via www?
19:14:45 andreaf: one sec, two discussions happening. We'll come back to the thing you can help with in a few
19:14:56 ianw: I thought they needed to, but it turns out they don't
19:15:11 ianw: see the confusion in https://review.openstack.org/#/c/536622/
19:15:21 ianw: moving them outside of www is fine
19:15:27 ok, that's good; if they can just live in /var/lib/zuul/backups for admins to use, that's easy
19:15:28 yeah, we should just use them directly from the filesystem and not worry about serving copies of them through apache
19:15:44 ++
19:15:50 I've done that a few times
19:15:53 I'll convert 536622 into a documentation patch to explain examples of restoring from file://
19:15:55 they only get used by tools run from the command line on the same server anyway
19:16:47 so seems like fixing all that up is in flight, so seems ok to me
19:16:52 fwiw - for everyone's edification (I'll be updating job descriptions with this...)
19:16:53 if the dump script doesn't support a file:/// url or something similar today, we should just fix that and not work around it
19:16:57 i'm around all day if i can help
19:16:59 publish-openstack-javascript-tarball is about making and publishing a source tarball
19:17:18 publish-openstack-javascript-content is about making and publishing a bundle of the built html/javascript
19:17:46 fungi: I didn't know file:// could be used with the dump script so the confusion came from that
19:18:26 i don't know that it can either; just suggesting that since it's software we wrote, we should make it support what we need rather than adding complexity in other places to work around a lack of a feature we haven't implemented
19:18:37 It can, pabelanger has used it before
19:18:52 perfect. problem solved! ;)
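
To make the file:// discussion concrete, here is a minimal sketch of the local backup workflow (the status URL, paths and file names are illustrative, and the dump tooling's file:// support is taken from the discussion above rather than verified):

    # snapshot the live status into the local backup directory admins can read
    curl -sf https://zuul.openstack.org/status.json \
      -o /var/lib/zuul/backups/status-$(date -u +%Y%m%dT%H%M%S).json
    # the existing queue dump/re-enqueue tooling can then be pointed at the
    # saved copy instead of the live API, e.g.
    #   file:///var/lib/zuul/backups/status-20180306T191500.json
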
19:18:57 sounds like the general plan then should be something like 1) fix js publishing jobs 2) fix zuul puppet and remove zuul01 from the emergency file 3) reboot zuul01 4) update queue saving tooling if necessary to use local backups ?
19:18:57 mordred: note during debugging, i ran the tools/install_javascript.sh thing on zuul01, so i guess it has yarn & npm installed now. i ran a manual pip reinstall after that to get the setup hooks to build it
19:19:23 then i realised the puppet needed to change, so that's when i went for the revert solution
19:19:33 i'll clean that up
19:19:50 ianw: cool ...
19:19:58 mordred: ianw ^ have I captured the list of things above reasonably well?
19:20:13 ++
19:20:13 clarkb: ++
19:20:44 anything else zuul related?
19:21:47 sounds like no
19:21:53 #topic General Topics
19:22:09 ianw: Github replication woes
19:22:21 ahh, yes
19:22:31 more people than you maybe think seem to like github
19:22:45 #link http://lists.openstack.org/pipermail/openstack-infra/2018-March/005842.html
19:23:00 they all come crawling out of the knotholes as soon as replication begins to fall behind
19:23:12 that is the debugging i've done, which seems to suggest something has changed and the nova-specs corruption is holding up the github replication thread
19:23:27 * dmsimard has a feeling fungi REALLY doesn't like GitHub
19:23:27 AFAICT, mostly it tries to push and raises an exception, and things move on
19:23:29 worth noting we had to restart gerrit during the ptg as it got really slow
19:23:36 and I think this started after that
19:23:55 load was really high on review.o.o too
19:23:57 but it seems that it can get stuck, not raise an exception, and then things just bunch up
19:23:59 (but we didn't update gerrit so behavior shouldn't have changed, but github did update their ssh stuff recently)
19:24:23 dmsimard: i'm just not a fan of proprietary software/services, really. nothing particularly unique to github
19:24:28 anyway, i've had a proposal out for a while to fix it, described in
19:24:30 #link http://lists.openstack.org/pipermail/openstack-dev/2017-June/119166.html
19:24:39 but the question is whether we want to re-index after we do that?
19:25:03 we should be able to perform reindexing online
19:25:25 clarkb: ah yeah, RDO got bitten by GitHub removing deprecated ciphers -- I didn't notice any issues with the upstream gerrit for that though
19:25:31 however, gerrit replication and other tasks will lag for some 8-12 hours while online reindexing is performed
19:26:20 if we don't reindex gerrit will still think it has those refs around?
19:26:35 ianw: that upstream bug hasn't really gotten any attention afaict :( https://bugs.chromium.org/p/gerrit/issues/detail?id=6622
19:26:55 clarkb: i'm not sure, seeing as it's all corrupt who knows if it thinks they're there or not?
19:27:06 clarkb: yeah, not sure what might happen with queries for nova-specs changes if we don't reindex
19:27:21 dmsimard: i'm not surprised, i didn't have a replication for it. i mostly put it there hoping if someone else sees it, we can collaborate
19:28:09 so a) does someone want to check my work (jump into the logs i guess) and make sure they can't see anything else causing this, other than nova-specs?
19:28:36 b) should i copy out the repo for a backup, run the recovery and trigger a reindex soon?
19:28:51 ianw: if memory serves, the corruption seems to have possibly coincided with changes/patch sets which were pushed when gerrit was out of memory
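
A quick integrity check along these lines can confirm whether anything beyond nova-specs is affected (item a above); the repository path is an assumption based on a typical Gerrit site layout:

    # run against Gerrit's bare copy of the repository (path is an assumption)
    cd /home/gerrit2/review_site/git/openstack/nova-specs.git
    git fsck --full    # reports missing, dangling or corrupt objects
    # the same check can be repeated on the git.openstack.org copies to see
    # whether any corruption was replicated there
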
19:29:05 ianw: does that mean the repo on git.o.o is corrupted as well ?
19:29:42 we had a few events around that timeframe and during the first couple we checked the repos associated with any of the write errors gerrit logged, but that one seems to have occurred overnight for most of the admins and i don't think anyone checked repository integrity before restarting
19:29:58 for a) I know that in the past nova-specs was the only repo doing it, I'm not sure if that is still the case
19:30:00 dmsimard: i guess so, but it doesn't get rejected on replication. i think it's github's implementation that notices
19:30:46 for b) it may be the jetlag or just failing at reading comprehension but I'm not sure I understand the steps you intend to take. Will gerrit be stopped?
19:31:09 yeah, worst case with git.o.o we can simply blow away the contents on the servers for that repo and then explicitly re-replicate it via the api
19:31:32 i would stop gerrit, copy out the repo for backup, run the steps to get rid of the bad objects, restart gerrit, reindex?
19:31:39 if we're worried there's lingering corruption there after we repair the canonical copy
19:32:11 ianw: ok, that's what I thought but wasn't sure based on the email. I think that is a reasonable approach
19:32:28 The total gerrit outage should be reasonably short in that case, but the reindexing may take a while
19:33:00 Maybe we just quickly ask the release team if that will hold them up and then do it today since people are otherwise recovering from fun travel?
19:33:11 ianw has the special ability of being on the other side of the planet when things tend to be quieter
19:33:24 "outage" insofar as the service being offline will be short, but the effects of ongoing reindexing will be lengthy
19:33:45 we should make sure, for example, that the release team knows not to approve new releases of stuff until reindexing concludes
19:33:48 clarkb: yeah I'm mostly concerned about the cycle-with-trailing projects (not sure if I got that tag name right)
19:33:56 it's also unknown if this will still have issues pushing to github, but i think this is the first step in finding out
19:34:07 dmsimard: ya but at least one of them was complaining about non-working github so maybe it's ok :)
19:34:29 wfm
19:34:30 If the process is 8-12 hours, maybe start around 20:00 UTC, with the intent of having things online by 8:00 UTC (which is almost but not quite near the end of the day for many of the most easterly of folk)
19:35:09 another option is to say no github until the weekend and then do it late on friday?
19:35:10 ianw: I'm not saying it's a solution but out of curiosity, do we have the ability to take nova-specs out of the github replication until this is sorted out so the remainder of the projects can sync ?
19:35:19 yeah, that's probably the least impactful timing given the curve we see on our activity graphs
19:35:23 that is less good for ianw though because of timezones (I guess we could do it "sunday")
19:36:03 i need to take a storyboard outage soonish for some database changes as well and was hoping to do that lateish on friday too
19:36:07 clarkb: Unless I miscalculate, 20:00+11 = 7:00 (so later than now). Not ideal for AU, but less bad than many other times.
19:36:35 usually that's ok, but *this* particular weekend i will be moving house and have a very uncertain internet situation
19:36:54 ianw: in that case I think maybe we should try and start today and just work with the release team
19:36:59 wfm
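
A rough outline of the agreed sequence, for reference (service commands, paths and the reindex invocation are illustrative assumptions, not a tested runbook):

    sudo service gerrit stop                     # or whatever the init setup uses
    cp -a /home/gerrit2/review_site/git/openstack/nova-specs.git \
          /home/gerrit2/nova-specs.git.bak-$(date -u +%Y%m%d)
    cd /home/gerrit2/review_site/git/openstack/nova-specs.git
    git fsck --full                              # identify the bad objects
    # ...remove/repair the corrupt objects per the mailing list post linked above...
    sudo service gerrit start
    # then kick off the change reindex; the offline form would be
    #   java -jar gerrit.war reindex -d /home/gerrit2/review_site
    # while the online approach discussed above keeps Gerrit up but lags
    # replication and other tasks for several hours
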
19:37:08 We can't temporarily take nova-specs out of the replication to let the remainder of the projects sync ?
19:37:24 So that only nova-specs is out of sync
19:37:28 oh, worth noting, the release ptl is in an apac tz this week
19:37:33 dmsimard: not in any easy way
19:37:41 so, yeah, will require some careful notification
19:37:53 * andreaf leaving for dinner now
19:37:55 dmsimard: we do a wildcard replication, don't we?
19:38:06 mordred: hm, in RDO's implementation (no jeepyb, mind you) we use regexes to control what we replicate (or not)
19:38:10 ya we do wildcard replication
19:38:20 we might be able to configure gerrit to exclude nova-specs
19:38:35 jeepyb is not involved with replication
19:38:43 replication is entirely controlled by the gerrit server config
19:39:12 yeah I found http://git.openstack.org/cgit/openstack-infra/puppet-gerrit/tree/templates/replication.config.erb
19:39:22 if the release team says it would be a major burden on them to do this work nowish, we can work on an exclusion for nova-specs instead as that will just require a short server restart and no reindex
19:39:25 ianw: ^ that work?
19:39:30 changing the replication config also requires a gerrit restart, fwiw
19:39:52 i'll chat with tonyb (after breakfast time :), and if there are issues then look at the alternatives
19:39:56 fungi: yeah, this would only allow us to delay the reindexing until it is more convenient for us
19:40:24 ok sounds like we have a plan
19:40:31 Shall we move on to the arm64 update?
19:40:54 check with smcginnis as well, he should hopefully be waking up soon (though will presumably be busy at the ops meetup)
19:41:40 fungi: that may mean it's an excellent time for the work
19:41:52 potentially
19:41:59 okdokie
19:42:46 Now for some arm64 updating
19:42:56 sounds like we ran a job or jobs?
19:43:23 yes, i forget where the last update was, but we have nodes and they work
19:43:39 currently, AJaeger pointed out late my time last night the builder has disconnected or something http://nl01.openstack.org/dib-image-list
19:43:49 i will look into that, but builds are working
19:43:52 excellent news, I have asked my team to work with jeffrey4l on adding some kolla jobs
19:43:55 https://review.openstack.org/546466 merged, and ianw started a job to create the ubuntu-ports mirror. last I checked, it was published over AFS, but the contents aren't accessible yet.
19:44:01 persia started on some jobs
19:44:11 \o/
19:44:18 TIL /dib-image-list is a thing
19:44:30 first issue was i'd named the mirror wrong (mirror.cn1 rather than mirror.regionone...)
19:44:33 that fixed pip
19:45:17 persia also updated reprepro for ubuntu-ports
19:45:39 i created the volume and started running a sync, but overnight it's run out of quota so i need to look into that
19:45:52 we may need some more disk on the afs servers, i will check
19:46:09 hopefully today i will push changes to make the mirror setup use ubuntu-ports when appropriate
19:46:11 ianw: where did regionone come from?
19:46:28 that's the actual region name in like clouds.yaml config
19:46:44 ah, I see it http://logs.openstack.org/19/549319/1/check/storyboard-tox-pep8/bb91d84/zuul-info/inventory.yaml
19:46:46 right we build the name from the nodepool cloud info
19:46:53 but the horizon is cn1.linaro.cloud and i've been calling it "cn1"
19:46:55 kk, so we need to rebuild the mirror? or just update dns
19:46:56 since that is what ends up in the ansible inventory
19:47:30 i just modified the cname, for now. we could rebuild, not sure it matters
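
Once the CNAME and the ubuntu-ports volume are in place, a quick sanity check along these lines should confirm that jobs can reach arm64 packages (the mirror name comes from the discussion above; the ubuntu-ports path is an assumption modeled on the existing x86 ubuntu mirror layout):

    dig +short mirror.regionone.linaro.openstack.org CNAME
    curl -sI http://mirror.regionone.linaro.openstack.org/ubuntu-ports/dists/xenial/Release
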
19:48:24 the problem for jobs atm is that the mirror setup overwrites things so it tries to get at the x86 repo, ergo no packages can install
19:48:29 for now it's probably not a major thing but could be once new regions are up if they are also called regionone
19:48:40 mirror.regionone.linaro.o.o is the new CNAME?
19:48:41 so once we have that sorted ... i think jobs will work!
19:48:42 if they are also called regionone it will be a problem
19:48:46 clarkb: we can change the name
19:48:51 pabelanger: yep
19:48:52 pabelanger: Yes.
19:48:56 cool
19:49:04 gema: I think as long as the new regions have distinct names it will be fine (don't have to change existing cloud)
19:49:07 mordred: yah, thinking that too
19:49:14 It is possible to run a working job now, but the job has to not require any extra packages.
19:49:20 clarkb: ack, no problem, will let niedbalski know
19:50:45 Outstanding items include: wheel mirrors (to speed up jobs), working around some of the other mirrors not having the right architectures, etc.
19:50:46 (to be fair, we *can* deal with two regions named regionone - it'll just mean each one will get their own cloud name in clouds.yaml)
19:50:57 sounds like progress and no major hurdles?
19:51:05 * mordred thinks this is super-cool fwiw
19:51:18 that's it, i think ... thanks to persia who has been super helpful getting things moving!
19:51:38 ++
19:51:42 ++
19:51:52 thank you to everyone getting this going, I too think this is super cool
19:51:52 clarkb: As we're now in a place where we can start running things, I think we're about to find the hurdles. Lots of naming assumptions, need to do manual things (like AFS volume creation), etc.
19:51:56 lots of excitement at PTG for arm64
19:52:20 dmsimard had an ara topic as well which we have a little time left over for
19:52:37 ohai
19:52:38 My guess is that we're about a month out from reliably running a significant number of jobs (and I'm hoping gema, niedbalski, and others can find more resources for then)
19:53:11 I just wanted to mention that https://review.openstack.org/#/q/topic:ara-sqlite-middleware would let us enable ara reports for all jobs without having to generate (or store) HTML
19:53:14 pabelanger: and that's a good thing
19:53:18 #link https://review.openstack.org/#/q/topic:ara-sqlite-middleware
19:53:22 If I didn't screw up anything, that is
19:53:37 dmsimard: this will drastically cut down on the number of required inodes per job right? allowing us to go back to having ara enabled for all jobs?
19:53:48 (currently we only generate ara reports on failed jobs)
19:54:08 clarkb: yeah, I explain the delta with an example openstack-ansible job here: https://ara.readthedocs.io/en/latest/advanced.html
19:54:20 clarkb: tl;dr, one small file instead of thousands of larger files
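
To illustrate that delta, a sketch of what a job's log collection step would do with the middleware in place ($LOG_DIR stands in for wherever the job collects its logs; the default ~/.ara database location and the exact report path are assumptions to check against the ARA docs linked above):

    mkdir -p "$LOG_DIR/ara-report"
    # before: ara generate html "$LOG_DIR/ara-report"   -> thousands of small files
    # after: ship only the sqlite database; the middleware on the log server
    # renders any .../ara-report/ansible.sqlite on demand when it is browsed
    cp ~/.ara/ansible.sqlite "$LOG_DIR/ara-report/ansible.sqlite"
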
19:54:30 some jobs like ansible and tripleo are still generating their own ara reports, could we update them to use sqlite first too?
19:54:42 as another data point of it working
19:55:04 pabelanger: they would have to update their jobs but I would expect that our apache server would do the right thing for them too
19:55:06 clarkb: another advantage is that we don't have to generate the HTML (which can take >1 min for *large* runs) and we also don't need to rsync that to the log server
19:55:18 pabelanger: yes, so the goal here is to try it on logs-dev.o.o first
19:55:38 yah, logs-dev.o.o wfm
19:55:43 pabelanger: and technically, ".*/ara-report/ansible.sqlite" should work for any project
19:55:59 I mean, logs.o.o/some/tripleo/job/logs/foo/ara-report/ansible.sqlite
19:56:02 my experience suggests that rsync time is dictated more by inode count than block count too
19:56:24 sounds like it will be a good improvement. Please review if you have time :)
19:56:26 I share that experience with rsync
19:56:30 so overall this could be a significant chunk of time savings
19:56:46 #topic Open Discussion
19:56:55 Really quickly before our hour is up, anything else?
19:56:56 If you're curious, I have a standalone hosted test for the middleware
19:57:06 Review request: https://review.openstack.org/#/c/546700/ Make gerritbot install from git master
19:57:21 to the earlier discussion of possibly filtering nova-specs out of gh replication, i'm not immediately seeing how to do that from https://gerrit.googlesource.com/plugins/replication/+/stable-2.13/src/main/resources/Documentation/config.md
19:57:28 #link https://review.openstack.org/#/c/546700/ install gerritbot from git master instead of latest release on pypi
19:57:34 And there are 4 (of my) patches up for review for gerritbot: https://review.openstack.org/#/c/545607/
19:57:40 Thanks :)
19:57:58 I still could use some help with the subunit2sql check db stuff:
19:58:17 basically just need reviews and someone to help drive setting things up after they merge
19:58:23 * hrw out
19:58:34 i've also been working with zaro_ in #openstack-storyboard to try to determine why story comments via gerrit's its-storyboard plugin have stopped happening... seems like the timing may be related to our 2.13 upgrade
19:58:46 if anybody has an interest in helping with that. lmk
19:58:57 https://review.openstack.org/#/q/status:open+topic:subunit2sql-check-data
19:59:29 fungi: reading the gerrit docs really quickly maybe we want to set max retries and a timeout
19:59:44 fungi: and just set that for the github replication and maybe that will get things moving again
19:59:48 oh, and we have a new donor cloud ready to be brought up. if any newer infra-roots want to give that a try, i have the credentials for the accounts and am happy to pass that activity along
20:00:07 fungi: \o/
20:00:14 fungi: I think i'd like to sign up for that
20:00:17 mtreinish: i have some familiarity with that, i will take a look
20:00:19 I'm largely going to be out today. My brain is unworking and I haven't had a proper meal in over a day
20:00:32 dmsimard: awesome--i'll get up with you in #openstack-infra later
20:00:34 ianw: ok cool, thanks
20:00:37 I'll follow along on irc and help as I can but should be here properly tomorrow
20:00:47 and with that we are out of time
20:00:50 thank you everyone
20:00:54 #endmeeting