Wednesday, 2021-01-20

*** tosky has quit IRC		00:02
ianw	ok, now we have a problem that the backup volume is full in vexxhost so i can't create the new user/home for wiki backup	00:02
ianw	i'm going to try that ethercalc prune (noop first)	00:03
*** iurygregory has quit IRC		00:06
ianw	OSError: [Errno 28] No space left on device	00:07
ianw	hrmmm	00:07
clarkb	ianw: I wonder if we should've been setting the additional_free_space setting	00:09
ianw	i'm moving borg-translate01 (22g) to /opt directly to free up some space temporarily	00:09
clarkb	ok	00:10
ianw	clarkb: i guess http://paste.openstack.org/show/801749/ looks about right?	00:14
ianw	i feel like give it a go and see how much gets freed	00:14
clarkb	ianw: ya that looks about right. my only other thought is --keep-monthly 12 would probably be nice	00:15
clarkb	but unlikely to have much effect here since borg is recent	00:15
clarkb	(its backwards to the output from borg list so took me a second to reverse sort)	00:15
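The prune plan discussed above (dry run first, then keep a limited set of archives, with clarkb's suggested `--keep-monthly 12`) can be sketched as a small command builder. This is only an illustration: the daily and weekly counts and the repository path are hypothetical, since the contents of the paste are not shown in the log, and the command is built but never executed here.

```python
def prune_argv(repo, *, daily, weekly, monthly, dry_run=True):
    """Build a `borg prune` command line without running it."""
    argv = ["borg", "prune", "--list"]
    if dry_run:
        # "noop first": only report what would be pruned
        argv.append("--dry-run")
    argv += [
        f"--keep-daily={daily}",
        f"--keep-weekly={weekly}",
        f"--keep-monthly={monthly}",
        repo,
    ]
    return argv


# Hypothetical retention values and repository path, for illustration only.
print(" ".join(prune_argv("/path/to/borg-ethercalc02/backup",
                          daily=7, weekly=4, monthly=12)))
```

Running with `dry_run=True` first and reading the `--list` output is the cautious path; dropping `--dry-run` is what actually deletes archives.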
*** artom has quit IRC		00:16
ianw	106G borg-ask01	00:16
ianw	1.8G borg-ethercalc02	00:16
ianw	188G borg-etherpad01	00:16
ianw	203G borg-gitea01	00:16
ianw	5.2G borg-lists	00:16
ianw	4.9G borg-review-dev01	00:16
ianw	446G borg-review01	00:16
ianw	29G borg-storyboard01	00:16
*** artom has joined #opendev		00:16
ianw	0 borg-translate01	00:16
ianw	4.3G borg-zuul01	00:16
ianw	for reference	00:16
ianw	etherpad seems too large, we probably should look at exclusions more closely there	00:17
clarkb	ianw: its probably due to the large database backups there	00:17
clarkb	and ya maybe we can make that better by not keeping as many local db backups	00:17
mordred	is it backing up any old historical backups?	00:17
clarkb	or instruct borg to only backup the most recent db backup	00:17
mordred	yeah	00:17
mordred	I think that	00:17
mordred	backing up the rotated backups is wasteful	00:18
clarkb	++ I like having the local db backups if we can keep them and telling borg to only look at the most recent one is a good workaround to that I guess	00:18
mordred	++ yeah - local rotated backups is super helpful for ease of use	00:18
ianw	i'm going to try that prune on ethercalc, even though it's small, now	00:18
mordred	maybe an exclusion with the .gz$ or [0-9].gz or whatever	00:18
clarkb	ianw: ++	00:18
mordred	++	00:18
clarkb	mordred: ianw ya keep in mind though that I think logrotate makes it weird where we end up with a 0 byte file and its the .1.gz that is most recent	00:19
clarkb	but ya I assume we can do it with a matcher of some sort	00:19
mordred	++	00:19
ianw	heh, so it pruned to ... 1.8G borg-ethercalc02	00:20
ianw	i guess its deltas are very efficient	00:20
clarkb	that service might be a bad test because ya that	00:20
clarkb	ianw: it sounds like there is also a --compress option to the backup step	00:20
clarkb	are we using that? If not I bet that would help with space usage too	00:20
clarkb	ianw: ask and gitea01 also do local db backups with rotation so may be good indicators	00:22
clarkb	(review too)	00:22
ianw	5.6G .	00:22
ianw	/var/backups/etherpad-mariadb# du -h	00:22
ianw	but i wonder if having it as .gz files destroys more effective delta updates?	00:23
clarkb	ianw: likely yes	00:23
clarkb	I wonder if --compress is smart about that somehow	00:24
ianw	it would probably be better to have the latest uncompressed, then have logrotate compress and rotate that locally	00:24
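The concern above, that compressed dumps defeat chunk-level de-duplication, can be illustrated with a toy experiment. This is a sketch under stated assumptions: the chunker below is a simplified content-defined chunker written for this example (not borg's actual algorithm), the "dump" is synthetic data, and `zlib.compress` stands in for gzip since both use DEFLATE.

```python
import random
import zlib

WINDOW = 16   # trailing bytes examined to decide a chunk boundary
MASK = 0x3FF  # low bits that must be zero: a boundary about every 1 KiB


def chunks(data):
    """Cut wherever the checksum of the trailing WINDOW bytes ends in zeros."""
    start = 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) & MASK == 0:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]


def shared_fraction(old, new):
    """Fraction of the bytes in `new` held in chunks already present in `old`."""
    seen = set(chunks(old))
    return sum(len(c) for c in chunks(new) if c in seen) / len(new)


def fake_dump(rows):
    """Synthetic stand-in for a SQL dump: one INSERT statement per row."""
    return "".join(
        f"INSERT INTO store VALUES ('{k}', '{v}');\n" for k, v in rows
    ).encode()


rng = random.Random(0)
rows = [(f"pad:{i:06d}", f"{rng.getrandbits(160):040x}") for i in range(4000)]
day1 = fake_dump(rows)

# Day 2: the same rows plus ten new ones scattered through the dump.
rows2 = list(rows)
for n in range(10):
    rows2.insert(100 + n * 350, (f"new:{n:06d}", f"{rng.getrandbits(160):040x}"))
day2 = fake_dump(rows2)

raw_shared = shared_fraction(day1, day2)
gz_shared = shared_fraction(zlib.compress(day1), zlib.compress(day2))
print(f"plain dumps share {raw_shared:.0%} of bytes, compressed share {gz_shared:.0%}")
```

In the plain dumps an inserted row only disturbs the chunk around it, while in the compressed stream everything after the first change tends to differ, which is the effect being worried about in the chat.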
clarkb	we may not have enough disk space for that though on etherpad	00:24
fungi	okay, doing mirror.epel now	00:24
clarkb	but ya that is a potential thing we could try	00:24
clarkb	also are gitea backups doing another set of git backups I wonder	00:25
clarkb	between review and gitea I mean	00:25
clarkb	not the worst thing but maybe another place we can prune	00:25
ianw	yeah i wasn't sure if we needed gitea at all	00:25
clarkb	ianw: on gitea we want the db backups as that preserves our redirects in the database	00:26
clarkb	ianw: but I don't think we need anything else from it	00:26
ianw	yeah i think we're definitely getting the git trees ...	00:26
*** DSpider has quit IRC		00:28
ianw	clarkb: so do you think we can exclude /var/gitea?	00:33
clarkb	ianw: I think so, if not the entirety of that dir at least /var/gitea/data/git (I think that is the path but going from memory there)	00:34
ianw	97M .	00:34
clarkb	since we're backing those up on the gerrit side	00:34
ianw	/var/backups/gitea-mariadb# du -h	00:34
clarkb	ya the gitea db is very small	00:34
clarkb	its largely just we have these projects and redirects since we don't do issues and wiki and users	00:34
ianw	yeah i can't see anything under /var/gitea that isn't covered by config mgmt	00:35
clarkb	ssl certs may be the only thing?	00:35
ianw	access logs	00:35
clarkb	oh ya ++ to those	00:35
*** iurygregory has joined #opendev		00:43
*** artom has quit IRC		00:45
fungi	ssl certs are presumably not valuable because le will just issue more automatically, right?	01:00
*** stevebaker has quit IRC		01:03
*** mlavalle has quit IRC		01:05
ianw	it seems like you can't do "--exclude /var/lib/gitea --include /var/lib/gitea/logs"	01:07
ianw	# The file '/home/user/cache/important' is not backed up:	01:07
ianw	$ borg create -e /home/user/cache/ backup / /home/user/cache/important	01:07
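The limitation quoted above is that a plain exclude wins even when a path inside it is named explicitly. One way to reason about the alternative is an ordered rule list where the first matching rule decides, which is how borg's `--pattern` option with `+` and `-` prefixes is assumed to behave here. The matcher below is a toy model for illustration, not borg's real pattern engine, and the rule list is hypothetical.

```python
# Ordered rules: the include for the logs must come before the broader exclude.
RULES = [
    ("+", "/var/lib/gitea/logs"),
    ("-", "/var/lib/gitea"),
]


def is_backed_up(path, rules=RULES):
    """First matching rule wins; paths matching no rule are backed up."""
    for action, prefix in rules:
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return action == "+"
    return True


for candidate in ("/var/lib/gitea/logs/access.log",
                  "/var/lib/gitea/data/git/repo.git",
                  "/etc/hosts"):
    print(candidate, "->", "kept" if is_backed_up(candidate) else "excluded")
```

Swapping the two rules reproduces the behaviour ianw ran into: the exclude matches first and the logs are dropped.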
ianw	the etherpad dump is 15905104958 bytes	01:20
ianw	16gb	01:20
*** hamalq has quit IRC		01:26
fungi	full release of mirror.epel finished and we're already well into the catch-up pass across the volumes. once they're done i'll remove my locks and we can look at getting ianw's release serialization change deployed i think	02:02
ianw	fungi: where is the content to backup; in /srv/mediawiki?	02:07
fungi	it's scattered all throughout there. in the puppeted version i've extracted the stateful data away from the deployed software and configuration, but on that production server it's quite comingled	02:11
fungi	and honestly, since the deployment and configuration aren't well understood yet, we probably need to be backing them up there anyway	02:12
fungi	oh, maybe i misunderstood your question, yes we should back up (all of) /srv/mediawiki	02:13
ianw	hrm, srv is 11G, but I guess fairly stable?	02:13
fungi	yeah, images are the main thing which change on it (that's where uploaded files wind up)	02:13
fungi	and the lucene index lives in there so it changes when it's regenerated	02:13
ianw	i have everything deployed but we're going to need to free up some space or get some more	02:13
ianw	(backup space)	02:14
fungi	i was half following, sorry, were you able to work out the pruning?	02:19
ianw	fungi: umm, sort of. i think we've uncovered a number of things	02:31
ianw	pruning down to weekly, monthly we can do on command line	02:31
ianw	the space-efficiency trade-off of gzipping the database dumps, which removes borg's ability to de-dup, is something to think about	02:32
ianw	(i.e. gzipping the daily database dumps)	02:32
ianw	and we can prune a bunch of directories from gitea at least	02:32
*** stevebaker has joined #opendev		02:37
auristor	ianw: I see that the afs01.dfw volserver is idle. just to note in case it was missed that the "docs" and "mirror.fedora" RO volumes are still new on afs01 and old on afs02. "docs" is also locked which might mean a release was in flight when afs01 died.	02:44
ianw	auristor: thanks for looking in! :) it looks like fungi has dropped locks and the fedora mirror process is running now, so that's expected	02:46
fungi	well, i'm still holding a (non-afs) lock which prevents our normal mirror content updates, and have been steadily going through them in a serialized fashion until we get them caught up to present	02:46
fungi	which i'm hoping will be in the next hour or two	02:47
ianw	fungi: sorry, i just unlocked the docs volume, but somewhat accidentally pasted in the release command too	02:49
ianw	i can kill it or just let it run; i think i'm tending to the latter	02:51
ianw	i don't know why it failed to release but given all the recent commotion nothing would surprise me	02:51
*** openstackgerrit has joined #opendev		02:58
openstackgerrit	Ian Wienand proposed opendev/system-config master: borg-backup: prune after successful backup https://review.opendev.org/c/opendev/system-config/+/771531	02:58
*** hemanth_n has joined #opendev		03:12
fungi	ianw: i wasn't holding any lock for the docs volume, just the mirror volumes	03:52
fungi	and now i've released them all as the updates indicate having all completed	03:53
ianw	fungi: thanks! great to be back. i'm just letting the docs one run now	03:56
openstackgerrit	Ian Wienand proposed opendev/system-config master: gitea backup: prune some large directories https://review.opendev.org/c/opendev/system-config/+/771534	05:02
*** ykarel|away has joined #opendev		05:03
*** hemanth_n has quit IRC		05:07
*** hemanth_n has joined #opendev		05:07
openstackgerrit	Ian Wienand proposed opendev/system-config master: borg-backup: fix logrotate name https://review.opendev.org/c/opendev/system-config/+/771557	05:12
*** iurygregory has quit IRC		05:33
ianw	ok, i've run rdiff on the two gzipped mysql dumps and the delta is as large as the file itself	05:40
ianw	from etherpad	05:41
ianw	it looks like we can actually create a borg archive from stdin. i.e. dump the db directly into borg as a separate archive.	05:57
ianw	i think that's going to be better; hosts can still dump their db's to disk but we can just ignore that in the backups	05:58
ianw	that'll be tomorrow, if clarkb doesn't beat me to it :)	05:59
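The idea above, streaming the database dump straight into a separate borg archive instead of backing up the on-disk dump, could be sketched as a two-process pipeline. The database name, repository path and archive name are hypothetical, the `mysqldump` options are an assumption, and the stdin name `dump` follows the command ianw quotes later in the log; nothing is executed unless `run_pipeline` is called.

```python
import subprocess


def pipeline_argv(database, repo, archive, stdin_name="dump"):
    """Return the two command lines for `mysqldump | borg create ... -`."""
    dump = ["mysqldump", "--single-transaction", database]
    # A trailing "-" tells borg to read the archive content from stdin.
    borg = ["borg", "create", "--stdin-name", stdin_name,
            f"{repo}::{archive}", "-"]
    return dump, borg


def run_pipeline(dump, borg):
    """Connect the two processes; True only if both exit successfully."""
    producer = subprocess.Popen(dump, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(borg, stdin=producer.stdout)
    producer.stdout.close()  # so the producer gets SIGPIPE if borg exits early
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc == 0 and consumer_rc == 0


# Hypothetical names, printed rather than executed.
for argv in pipeline_argv("etherpad", "/path/to/repo", "etherpad-db-2021-01-20"):
    print(" ".join(argv))
```

Checking both exit codes matters here: a failed dump that still produces partial output would otherwise be stored as if it were a good backup.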
*** zbr5 has joined #opendev		06:04
*** zbr has quit IRC		06:06
*** zbr5 is now known as zbr		06:06
*** ykarel_ has joined #opendev		06:17
*** ykarel|away has quit IRC		06:19
*** marios has joined #opendev		06:22
*** ykarel_ is now known as ykarel		06:26
*** slaweq has joined #opendev		07:04
*** slaweq has quit IRC		07:30
openstackgerrit	Rico Lin proposed openstack/project-config master: Add ubuntu-bionic-arm64-xlarge https://review.opendev.org/c/openstack/project-config/+/771565	07:30
*** eolivare has joined #opendev		07:31
openstackgerrit	Daniel Blixt proposed zuul/zuul-jobs master: Use urlencoded filenames in test fixtures https://review.opendev.org/c/zuul/zuul-jobs/+/771566	07:38
*** slaweq has joined #opendev		08:00
*** hashar has joined #opendev		08:03
*** fressi has joined #opendev		08:07
*** andrewbonney has joined #opendev		08:09
*** sboyron_ has joined #opendev		08:12
*** rpittau|afk is now known as rpittau		08:17
*** sboyron__ has joined #opendev		08:38
*** sboyron_ has quit IRC		08:41
*** hemanth_n has quit IRC		08:41
*** stevebaker has quit IRC		08:41
*** hemanth_n has joined #opendev		08:41
*** akahat|rover is now known as akahat|lunch		08:46
*** tosky has joined #opendev		08:47
*** DSpider has joined #opendev		08:48
*** jpena|off is now known as jpena		08:54
*** raukadah has quit IRC		09:18
*** tristanC has quit IRC		09:18
*** raukadah has joined #opendev		09:20
*** tristanC has joined #opendev		09:20
*** brinzhang_ has quit IRC		09:34
*** ysandeep is now known as ysandeep|afk		09:45
*** brinzhang has joined #opendev		09:52
*** klonn has joined #opendev		10:07
*** akahat|lunch is now known as akahat|rover		10:09
*** ysandeep|afk is now known as ysandeep		10:17
*** rpittau is now known as rpittau|bbl		10:20
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: bindep: remove set_fact usage when converting string to list https://review.opendev.org/c/zuul/zuul-jobs/+/771585	10:24
*** priteau has joined #opendev		10:24
*** sshnaidm|afk is now known as sshnaidm|ruck		10:43
*** dtantsur|afk is now known as dtantsur		10:44
*** hashar has quit IRC		10:50
*** rpittau|bbl is now known as rpittau		11:24
*** iurygregory has joined #opendev		11:27
*** sboyron__ has quit IRC		11:30
*** klonn has quit IRC		11:31
openstackgerrit	Guillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445	11:47
*** jpena is now known as jpena|lunch		12:29
*** sboyron has joined #opendev		12:45
*** klonn has joined #opendev		12:47
openstackgerrit	Radosław Piliszek proposed opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule https://review.opendev.org/c/opendev/irc-meetings/+/771642	12:49
openstackgerrit	Merged opendev/git-review master: Drop support for py27 https://review.opendev.org/c/opendev/git-review/+/770556	13:04
openstackgerrit	Merged opendev/git-review master: Assure git-review works with py37 and py38 https://review.opendev.org/c/opendev/git-review/+/770641	13:05
*** artom has joined #opendev		13:22
*** ysandeep is now known as ysandeep|afk		13:24
auristor	ianw fungi: the "docs" volume has still not released properly. Looking more carefully, its second RO site is afs01.ord not afs02.dfw and afs01.ord is not responding.	13:25
*** jpena|lunch is now known as jpena		13:28
*** whoami-rajat___ has joined #opendev		13:31
*** brinzhang has quit IRC		13:37
openstackgerrit	Merged opendev/irc-meetings master: Move the Masakari meeting to the weekly schedule https://review.opendev.org/c/opendev/irc-meetings/+/771642	13:38
*** michael-mcaleer has joined #opendev		13:43
*** sboyron has quit IRC		13:48
*** brinzhang has joined #opendev		13:49
*** brinzhang has quit IRC		13:51
*** sboyron has joined #opendev		13:51
*** brinzhang has joined #opendev		13:51
openstackgerrit	Guillaume Chauvel proposed opendev/system-config master: Increase comment log text width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445	14:13
*** hemanth_n has quit IRC		14:20
*** zoharm has joined #opendev		14:37
fungi	auristor: interesting, i agree vos status says it's not reachable. when i ssh into it afsd is running and the openafs lkm is loaded, i'll have to dig deeper into it after some morning errands and meetings	14:38
fungi	thanks for the heads up!	14:38
openstackgerrit	Albin Vass proposed zuul/zuul-jobs master: Add policy about overriding role input variables https://review.opendev.org/c/zuul/zuul-jobs/+/771655	15:00
*** whoami-rajat___ is now known as whoami-rajat__		15:02
*** hashar has joined #opendev		15:04
*** klonn has quit IRC		15:06
*** d34dh0r53 has quit IRC		15:12
*** d34dh0r53 has joined #opendev		15:19
*** slaweq has quit IRC		15:21
*** slaweq has joined #opendev		15:23
*** ysandeep|afk is now known as ysandeep		15:31
clarkb	that is the one we upgraded to 1.8 right?	15:32
clarkb	maybe the key conversion thing didn't go properly?	15:32
*** fressi has quit IRC		15:39
*** sboyron has quit IRC		15:45
*** klonn has joined #opendev		15:50
*** sboyron has joined #opendev		16:02
*** ykarel has quit IRC		16:21
*** mlavalle has joined #opendev		16:26
auristor	fungi: afsd is the client not the servers	16:37
fungi	oh, right	16:37
auristor	the servers are bosserver, dafileserver, davolserver, dasalvageserver	16:37
fungi	clarkb: they're all upgraded to 1.8	16:37
clarkb	oh thats already done? /me so far behind	16:37
fungi	auristor: yep, i think those are what's not running. maybe they didn't get started automatically at boot, i'll be able to fiddle with it in a couple hours	16:38
fungi	er, nevermind, bad grep. bosserver, dafileserver, davolserver are all in the process table (no dasalvageserver though)	16:39
fungi	in a couple more hours i should be in a position to be able to start digging in logs	16:40
auristor	firewall rules?	16:40
fungi	unlikely any of that has changed, and they should be consistent across afs01.dfw.openstack.org, afs02.dfw.openstack.org and afs01.ord.openstack.org, but i'll compare them all once i have a moment	16:49
*** fbo has quit IRC		16:51
*** fbo has joined #opendev		16:52
*** artom has quit IRC		17:15
*** michael-mcaleer has quit IRC		17:23
*** dtantsur is now known as dtantsur|afk		17:23
*** rpittau is now known as rpittau|afk		17:26
*** ysandeep is now known as ysandeep|away		17:26
*** marios is now known as marios|out		17:27
openstackgerrit	Sorin Sbârnea proposed opendev/git-review master: Allow the default of notopic to be configurable https://review.opendev.org/c/opendev/git-review/+/697448	17:44
openstackgerrit	Sorin Sbârnea proposed opendev/git-review master: Fix bug in git_credentials() https://review.opendev.org/c/opendev/git-review/+/753946	17:44
openstackgerrit	Sorin Sbârnea proposed opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded https://review.opendev.org/c/opendev/git-review/+/399779	17:44
*** artom has joined #opendev		17:47
*** artom has quit IRC		17:47
openstackgerrit	Sorin Sbârnea proposed opendev/git-review master: Support spaces and other characters in topic https://review.opendev.org/c/opendev/git-review/+/681906	17:47
*** artom has joined #opendev		17:47
*** ralonsoh has quit IRC		17:58
*** klonn has quit IRC		18:07
*** cloudnull has quit IRC		18:19
*** cloudnull has joined #opendev		18:20
*** eolivare has quit IRC		18:21
*** cloudnull5 has joined #opendev		18:26
*** cloudnull has quit IRC		18:27
*** cloudnull5 is now known as cloudnull		18:27
*** jpena is now known as jpena|off		18:32
openstackgerrit	Merged opendev/git-review master: Allow the default of notopic to be configurable https://review.opendev.org/c/opendev/git-review/+/697448	18:41
openstackgerrit	Merged opendev/git-review master: Fix "git-review -d" erases work directory if on the same branch as the change downloaded https://review.opendev.org/c/opendev/git-review/+/399779	18:41
*** marios|out has quit IRC		18:43
*** sboyron has quit IRC		18:44
*** hashar is now known as hasharAway		19:00
*** andrewbonney has quit IRC		19:09
*** akrpan-pure has joined #opendev		19:18
akrpan-pure	If I'm having an issue with the devstack-gate-wrap (openstack) script in third party CI, is there a good channel to go to? #openstack-third-party-ci is pretty dead it seems like	19:19
clarkb	akrpan-pure: devstack-gate is effectively dead at this point	19:20
clarkb	your best bet is likely to migrate away from it	19:21
fungi	akrpan-pure: devstack-gate is effectively unmaintained these days, upstream jobs parent to a zuul v3 native "devstack" job in the openstack/devstack repository	19:21
akrpan-pure	Urkkkkk, I guess I should've expected that at this point	19:27
akrpan-pure	Alright, I'll continue down the longer path of updating to those jobs too. Thanks!	19:28
*** zoharm has quit IRC		19:40
ianw	did we get to the bottom of the ORD issue ... looking now	19:43
ianw	Wed Jan 20 08:43:59 2021 fssync: breaking all call backs for volume 536870992	19:46
ianw	Starting transaction on cloned volume 536870992... done	19:47
ianw	Deleting extant RO_DONTUSE site on afs01.ord.openstack.org... done	19:47
ianw	Creating new volume 536870992 on replication site afs01.ord.openstack.org: done	19:47
ianw	This will be a full dump: previous release failed	19:47
ianw	Starting ForwardMulti from 536870992 to 536870992 on afs01.ord.openstack.org (entire volume).	19:47
ianw	Failed to set correct names and ids: Possible communication failure	19:47
ianw	Could not end transaction on a ro volume: Possible communication failure	19:47
clarkb	ianw: no sorry, gerrit account issue is current distraction	19:47
ianw	fun	19:49
fungi	ianw: no, i haven't looked deeper other than to confirm the server uptime and which services are running	19:51
ianw	there's stuff in here about the volume being salvaged Tue Jan 19 02:45:57 2021 fileserver requested salvage of clone 536870992; scheduling salvage of volume group 536870991...	19:51
auristor	ianw: rxdebug to afs01.ord on ports 7000, 7005, and 7007 all fail to receive a response.	19:51
fungi	i expect you're on the money with it being a firewall issue. looks like we may have reverted iptables to our basic ruleset (ssh and snmp)	19:52
fungi	so the next question is why	19:52
ianw	ohhhhhhh	19:53
auristor	icmp reply destination unreachable - host administratively prohibited. so definitely firewall rules	19:53
ianw	i bet it's ansible	19:53
fungi	looks like /etc/iptables/rules.* were last updated today at 06:23z	19:53
ianw	i'd say we relied on puppet. looking into it.	19:54
fungi	so yes, i think we should focus there first	19:54
ianw	inventory/service/group_vars/afs.yaml:iptables_extra_public_udp_ports:	19:54
fungi	probably just a matter of adding the ports to our group vars for those servers	19:54
ianw	yeah, i changed the group name to afs-1.8	19:55
fungi	yeah, that	19:55
fungi	mmm	19:55
ianw	ok, that should be changed back, let me see where that got to (the group name)	19:55
clarkb	should be able to copy the group vars for afs to afs-1.8 to address that	19:55
clarkb	and/or switch everything back to afs if we are ready now	19:55
fungi	the change for that is up, maybe not merged yet	19:55
* fungi checks		19:55
auristor	not reachable yet	19:56
ianw	https://review.opendev.org/c/opendev/system-config/+/771293	19:56
ianw	it has a linter error on the group matching bits, let me fix	19:57
fungi	yeah, looks like we can just merge that then	19:57
fungi	(once linters are passing)	19:57
ianw	WARNING Couldn't open /home/iwienand/programs/openstack-infra/system-config/playbooks/roles/letsencrypt-create-certs/roles/letsencrypt-create-certs/handlers/restart_gitea.yaml - No such file or directory [try:2]	20:07
ianw	i'm not sure why ansible-lint looks for stuff there, and now not sure why it tries to open the non-existent file 3 times :/	20:07
clarkb	ianw: that looks buggy there is extra pathing in the middle there	20:08
clarkb	like maybe its assuming it knows where the location of handlers are and doing so poorly	20:08
clarkb	I wonder if we should just disable it	20:08
openstackgerrit	Ian Wienand proposed opendev/system-config master: Remove afs-1.8 group https://review.opendev.org/c/opendev/system-config/+/771293	20:08
openstackgerrit	Ian Wienand proposed opendev/system-config master: Manage afsdb servers with Ansible https://review.opendev.org/c/opendev/system-config/+/771340	20:08
openstackgerrit	Ian Wienand proposed opendev/system-config master: Remove AFS puppet https://review.opendev.org/c/opendev/system-config/+/771342	20:08
ianw	it's only a warning, but it seems to try to find it and then sleep for (maybe?) a second and try it again x 3, which kind of adds up when it's doing 3 times for about 7 handlers	20:11
openstackgerrit	Kendall Nelson proposed openstack/project-config master: Remove Karbor projects from infra https://review.opendev.org/c/openstack/project-config/+/767057	20:12
ianw	clarkb: if and when you get this gerrit issue sorted, a few pruning things @ https://review.opendev.org/q/topic:%22backup-prune%22 from yesterday	20:12
ianw	we're still space constrained and need space if we're going to get wiki backed up, so still working on it	20:13
clarkb	ya I just sent email to gerrit upstream about the account thing. I can review those next	20:15
clarkb	but then i need to find lunch	20:15
zbr	ianw: the 3 retries no longer happens on newer versions.	20:16
clarkb	ianw: that topic lgtm	20:18
ianw	zbr: do you know why it's constructing the wrong path for the handler?	20:19
ianw	clarkb: not sure if you saw, but what i'm thinking of doing is piping the output of mysqldump directly into borg as a separate archive, via its stdin reader	20:20
ianw	in theory, we then only keep incremental db updates that should deduplicate	20:21
clarkb	ianw: ya I saw some thoughts on that but wasn't sure if I fully grokked them. You mean do something like tee it into borg and onto disk and then stop borg from looking at the on disk stuff?	20:21
ianw	more like "mysqldump | borg create --stdin-name dump"	20:22
clarkb	and just do the local copies separately?	20:22
ianw	we can keep a local dump too; but not put that in the backups	20:23
clarkb	ya got it	20:23
clarkb	and then because its plain text we'd get better incrementalness	20:23
ianw	yeah, and the local dumps can be compressed for size	20:23
clarkb	ianw: also did you see that bup supports a compressed backups option. Not sure if we are doing that or if it does it by default	20:23
ianw	that's the theory anyway	20:23
clarkb	but that may be another option available to us, I think bup was compressing by default so maybe that explains the difference in growth	20:23
clarkb	or some of it anyway	20:24
clarkb	ok I'm told lunch is waiting for me, back in a bit	20:24
ianw	hrm, i don't think we are; might be an option. i generally worry a little with things like that if it can turn a small corruption into a big corruption	20:24
clarkb	ya, just calling it out as I'm 95% sure bup was doing it due to its git like packfiles (git compresses packfiles)	20:25
ianw	yeah, very true, and that format was very "interlinked" as well (yes you can pull things out of corrupt git trees, sort of, but not something anyone wants to do)	20:27
fungi	there is such a thing as "diffable compression" but just not compressing is likely easier	20:28
*** klonn has joined #opendev		20:29
fungi	also if borg used a copy-on-write scheme it could theoretically have deduplicated differential/incremental backups where the most recent data is de facto complete, but i expect there are reasons it doesn't	20:32
ianw	alright, getting some breakfast, will push that ord fix and monitor as soon as it passes. i'm just leaving it rather than messing up the iptables state by doing something by hand	20:34
fungi	it's not urgent so long as some untoward incident doesn't knock afs01.dfw offline	20:35
* fungi gives rax a long sideways look		20:35
*** zimmerry has joined #opendev		20:35
zbr	ianw: likely the unsupported repo layout with nested roles directory may be involved. Afaik, include paths works fine for official layout: only one roles/ folder at root. But I may be wrong.	20:37
fungi	roles directory parallel to the location of the playbook is no longer supported?	20:40
*** tosky has quit IRC		20:41
*** fbo has quit IRC		20:42
*** tosky has joined #opendev		20:42
*** raukadah has quit IRC		20:42
*** fbo has joined #opendev		20:42
*** stevebaker has joined #opendev		20:42
*** raukadah has joined #opendev		20:43
zbr	i need to check tomorrow, remind me if i forget	20:56
zbr	the guessing inside the linter is a bit of a mess, i wanted to work on it but never got enough time	20:57
clarkb	fungi: ianw what do you think about approving https://review.opendev.org/c/opendev/system-config/+/769226 now? are all the fires sufficiently contained?	21:03
clarkb	that is the gitea 1.13.1 upgrade change.	21:03
fungi	i think it should be safe to move forward there, yeah	21:06
ianw	++ agree	21:06
clarkb	alright I'm approving it now then	21:06
ianw	i'm going to try the mysql dump to borg archive on etherpad manually, maybe run it again manually tomorrow and see if it gets us the de-duplication we hope for	21:35
ianw	btw, we're using lz4 compression with borg, so it does have higher compression options but we have something	21:36
clarkb	oh cool	21:37
openstackgerrit	Merged opendev/system-config master: Remove afs-1.8 group https://review.opendev.org/c/opendev/system-config/+/771293	21:39
*** whoami-rajat__ has quit IRC		21:51
ianw	infra-prod-base is running which should hopefully restore the iptables rules for ord	21:53
clarkb	heh its also gonna do all the things because it affected groups	21:55
clarkb	so will be a little while for the gitea upgrade once it lands (I should still be around for a number of hours today so not a big deal)	21:56
clarkb	hrm I wonder if it is possible that we'll get ordering slightly wrong though	21:58
clarkb	if the gitea image updates when the change lands, then the old system-config version does a pull and compose down then up it will restart on the new version but without the template updates?	21:58
clarkb	oh wait no the template updates are all in the image	21:59
clarkb	so the only issue would be https://review.opendev.org/c/opendev/system-config/+/769226/3/playbooks/roles/gitea/templates/app.ini.j2 ?	21:59
clarkb	thats probably minor enough that we'll be fine	21:59
clarkb	may just need to roll through and restart things again once app.ini updates	21:59
clarkb	I thought containers were supposed to fix all these problems :P	22:00
fungi	containers == magic pixie dust	22:01
mordred	you're a container	22:02
ianw	ok, afs01.ord is back with the right iptables rules	22:05
ianw	i guess i'll try the docs update again	22:05
ianw	actually the cron job seems to be running it	22:11
*** hasharAway has quit IRC		22:14
clarkb	I think it runs every 5 minutes or so	22:14
fungi	yup, along with the rest of the static site updates	22:14
clarkb	once that finishes can we switch back to using the RO path for static/	22:14
fungi	i switched us back to that already over the weekend	22:15
fungi	i think i status logged it	22:16
clarkb	oh cool	22:16
clarkb	it does look like base and le failed so all the things behind them skipped too fwiw	22:17
fungi	ahh, didn't status log, but https://review.opendev.org/770857 deployed 2021-01-16 23:25:14	22:17
fungi	so saturday	22:17
clarkb	nb03 is unreachable	22:18
clarkb	and nb01 and nb02 both failed in ansible	22:18
* fungi checks if the mirror there is also		22:18
clarkb	that appears to be why the LE playbook failed	22:18
clarkb	fungi: can you reboot nb03 if necessary?	22:18
fungi	gladly	22:18
clarkb	nb01 and nb02 have full /opts	22:19
fungi	mirror02.regionone.linaro-us.opendev.org is up for 5 days	22:19
fungi	the gentoo images may be filling disk when they fail?	22:19
fungi	i have a change up to pause them again until we can get a new dib release	22:19
clarkb	its possible. I think I'll start by stopping nodepool-builder on both, disabling the service, then rebooting and see what has leaked?	22:19
fungi	or has that already happened?	22:19
clarkb	gentoo pause is false	22:20
fungi	yeah, https://review.opendev.org/771104 if we want them to stop again	22:20
fungi	i proposed that when it was clear they were still broken, but we were hip-deep in other fire	22:21
fungi	i was like "i'll just put this over here with the rest of the fire"	22:21
clarkb	Failed to stop nodepool-builder.service: Unit nodepool-builder.service not loaded	22:21
clarkb	systemctl list-units -a shows it knows nothing about nodepool	22:22
clarkb	oh right I'm a derp	22:22
clarkb	its docker compose now	22:22
fungi	anyway, i think i approved all prometheanfire's gentoo element fixes for dib, but we still need a dib release before we'll use them on the builders	22:22
clarkb	both are rebooting now, then we can see what leaked in /opt and trim	22:23
clarkb	fungi: if they aren't expected to build then pausing them makes sense to me	22:23
corvus	ianw, fungi, clarkb, mordred: if ansible-lint is continuing to have more problems with the contents of system-config, maybe we should get more consensus on disabling it for that repo: https://review.opendev.org/733406	22:23
corvus	3 people in favor of that, but i'd love for ianw and clarkb to weigh in	22:24
openstackgerrit	Merged opendev/system-config master: Update gitea to 1.13.1 https://review.opendev.org/c/opendev/system-config/+/769226	22:24
*** hamalq has joined #opendev		22:25
fungi	console log show nb03.opendev.org says "Guest does not have a console available." and server list shows the instance in SHUTOFF state. booting it now	22:27
ianw	kevinz: ^ i think you made some scheduler changes?	22:27
clarkb	/opt/dib_tmp did leak dib_build* dib_image* and profiledirs on both servers. I'm cleaning those up first to see what that frees up	22:28
clarkb	gitea should be upgrading nowish	22:28
ianw	fungi: oh, i got totally distracted on a dib release. i got into a state, i can do a release now. but still quite a lag as we need to push into nodepool and update images	22:29
fungi	ianw: yeah, we may still want to re-pause the gentoo image builds	22:30
fungi	i was hesitant to tag dib without some more eyeballs on the changes which went in or may be pending	22:31
ianw	yeah, i went through the queue, thanks for looking in on it too :) pushed 3.6.0	22:32
*** slaweq has quit IRC		22:32
fungi	thanks!	22:32
clarkb	https://gitea01.opendev.org:3000/ has updated	22:33
fungi	prometheanfire: ^ we still need to get that into nodepool container images and deploy them, but closer at least	22:33
clarkb	looks good to me at first glance. I'll follow it as it goes through the list	22:33
prometheanfire	fungi: do I need to do anything?	22:34
fungi	prometheanfire: i don't think so yet. once we get it deployed you'll want to take another look at gentoo image build logs	22:34
clarkb	I may need to put nb01 and nb02 in the emergency file as their hourly deploy is queued up to happen soon	22:35
clarkb	I'll go ahead and do that now	22:35
clarkb	and done	22:35
prometheanfire	cool	22:36
clarkb	cleaning up /opt/dib_tmp on nb01 freed 67GB which is unlikely to be sufficient for very long	22:37
clarkb	I'll look at any leaked images in /opt/nodepool_dib once nb02's dib_tmp is cleaned up	22:37
clarkb	fungi: I notice we're still building stretch images. Any idea if those are used by anything?	22:39
fungi	not without digging in codesearch, no	22:39
prometheanfire	the gentoo image does try and cache binpkgs, for quicker (re)builds	22:40
fungi	we probably eventually need a better way to answer questions like that	22:40
clarkb	found two leaked focal images on nb01. Will clean those up. Likely need to look through all the other images and see if they have leaked too	22:42
*** cloudnull8 has joined #opendev		22:44
*** cloudnull has quit IRC		22:46
*** cloudnull8 is now known as cloudnull		22:46
ianw	This archive: 15.92 GB 4.17 GB 208.56 MB	22:49
ianw	clarkb: ^ that's a more-or-less back-to-back run of dumping the etherpad db directly, so it looks like an incremental is ~208MB	22:49
clarkb	which seems to support your theory that we'd be better off doing it that way	22:50
clarkb	rather than ~4GB compressed each time or whatever it is (I think it is in that range)	22:50
ianw	yeah, about 5gb	22:50
clarkb	all 8 giteas have upgraded now	22:50
clarkb	the zuul/zuul frontpage loads for me	22:51
clarkb	and things look generally correct	22:51
clarkb	nb01 now has 157GB of disk free after cleaning two leaked nb01 images and two intermediate.bak files from old builds in nodepool_dib	22:51
clarkb	all other images in nodepool_dib look legit	22:51
clarkb	cleaning up the dib_build.* on nb02 freed about 100GB and cleaning dib_image.* freed another 260GB or so	22:53
clarkb	I'm checking nb02 for stale content in nodepool_dib now	22:53
clarkb	hrm I think nb02 hasn't built an image in a long while	22:54
clarkb	dib-image-list | grep nb02 shows that everything has failed there except for gentoo forever ago	22:54
clarkb	I'll clean up the stale images there except for gentoo then maybe we start it alone for a bit and let it take some of the load off of nb01?	22:54
clarkb	also as a side note the gentoo images that we have attributed to nb02 in zk don't appear to be on disk	22:57
clarkb	ok nb02 is cleaned up. I will start its builder now	23:01
clarkb	I'll remove nb02 from the emergency file but keep nb01 in it so that nb02 can pick up some of the slack for a bit	23:02
clarkb	#status Log Upgraded gitea to 1.13.1	23:03
openstackstatus	clarkb: finished logging	23:03
openstackgerrit	Merged opendev/system-config master: borg-backup: prune after successful backup https://review.opendev.org/c/opendev/system-config/+/771531	23:03
clarkb	#status log Cleaned up /opt on nb01 and nb02 to remove stale image build data from dib_tmp and nodepool_dib. nb02's builder has been started as it has much more free space and we want it to "steal" builds from nb01.	23:04
openstackstatus	clarkb: finished logging	23:04
ianw	hrm i went through all the builders just before christmas	23:04
clarkb	ianw: most of these appeared stale since mid november	23:04
clarkb	but maybe they were active in december and only recently rolled out?	23:04
ianw	i think the failure case, where one fills up and guarantees the other will then fill up is something to think about	23:04
clarkb	agreed. One thing I've thought about is having them weight their job grabs based on how full their disk is	23:05
clarkb	which should trend toward sharing the load over time	23:05
ianw	i need to get back to https://review.opendev.org/c/zuul/nodepool/+/764280	23:05
ianw	that will refuse to start a build if it knows it's going to run out of disk	23:05
clarkb	one simple way to do the weight thing I was thinking of is to do a sleep before grabbing a new build based on how much free space there is	23:06
clarkb	but I think that may fail if the sleep is less than a typical image build runtime	23:06
clarkb	ianw: maybe at the end of your day you can check how many images nb02 has built and if it is in the range of say 4 start up nb01? otherwise I can start nb01 tomorrow morning?	23:07
clarkb	actually nb02 cannot take over more than half of the images since we keep the current and previous image	23:08
clarkb	in that case it should be safe to let it run for 24 hours before starting nb01	23:08
clarkb	I'll start nb01 tomorrow given ^	23:08
*** lbragstad has quit IRC		23:13
*** lbragstad_ has joined #opendev		23:13
*** bodgix has joined #opendev		23:14
*** bodgix_ has quit IRC		23:14
clarkb	heh i took nb02 out of emergency which caused it to restart a couple of minutes ago when ansible ran against it	23:16
clarkb	took me a minute to figure out why the centos 7 image build it was doing just disappeared	23:16
clarkb	ianw: ^ fwiw that does appear to have leaked a build in dib_tmp	23:16
clarkb	ianw: I think nb02:/opt/dib_tmp/dib_build.dVZ8L3kD dib_image.igFXlwm2 and profiledir.MhYYhz belonged to the build that was aborted due to a restart	23:17
clarkb	for about 5.6GB of disk use	23:17
clarkb	I'm going to watch it closer and see if the current build stuff goes away when that image build finishes and if so I think I can be confident I've found the correct leaked files and manually clean them up	23:18
*** brinzhang has quit IRC		23:20
*** brinzhang has joined #opendev		23:29
*** tosky has quit IRC		23:44
