Wednesday, 2017-10-11

*** pleia2 has quit IRC02:43
*** pleia2 has joined #openstack-infra-incident02:50
*** tumbarka has quit IRC07:02
-openstackstatus- NOTICE: The CI system will be offline starting at 11:00 UTC (in just under an hour) for Zuul v3 rollout: http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html10:08
-openstackstatus- NOTICE: Due to unrelated emergencies, the Zuul v3 rollout has not started yet; stay tuned for further updates13:05
*** rosmaita has joined #openstack-infra-incident15:03
jeblairfungi, clarkb, mordred: looking into the random erroneous merge conflict error, i found some interesting information.17:04
fungiall ears17:04
jeblairwhen i warmed the caches on the mergers and executors, i cloned from git.o.o17:04
jeblairthat left the origin as git.o.o17:05
jeblairif *zuul* clones a repo for its merger, it clones from gerrit, and leaves the origin as gerrit17:05
jeblairour random merge failures are because we're pulling changes from the git mirrors before they have updated17:05
fungiahh. so the warming process should have used gerrit as the origin?17:05
jeblairthis also explains why we're seeing git timeouts only in v317:05
jeblairfungi: yes, or at least switched the origin after cloning17:06
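A minimal sketch of what switching the origins back might look like, assuming the merger/executor repo caches live under something like /var/lib/zuul/git and Zuul fetches from Gerrit over SSH on the standard port (the paths and user here are assumptions, not the actual commands run):

    # Point every cached repo's origin at Gerrit instead of the git.o.o mirrors.
    for repo in /var/lib/zuul/git/*/*; do
        git -C "$repo" rev-parse --git-dir >/dev/null 2>&1 || continue
        project=${repo#/var/lib/zuul/git/}
        git -C "$repo" remote set-url origin "ssh://zuul@review.openstack.org:29418/${project}"
    done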
fungibut yes, i agree, that does provide a great explanation for the git timeout situation17:06
clarkbjeblair: you think the timeout is because it tries to fetch a nonexistent ref from git.o.o?17:06
AJaegerjeblair: good detective work!17:06
jeblairclarkb: i don't know what is causing the timeout, but it's https to a different server, versus ssh to gerrit17:06
jeblairclarkb: so there's *lots* of variables that are different :)17:07
fungiwell, at least the work around improving git remote operation robustness isn't a waste17:07
clarkbgot it17:07
jeblairclarkb: the *merge failure* i tracked down though was because the ref had not updated yet17:07
jeblairfungi: oh, yeah, i still think that's good stuff17:07
jeblairanyway, i think we have two choices here:17:07
jeblair1) update the origins to review.o.o17:08
jeblair2) make using git.o.o more reliable17:08
fungii'm in favor of #1 for now... that's more or less what the mergers for v2 were doing right?17:09
clarkbgerrit does emit replication events now that could possibly be used, but we'd have to have some logic around "are all the git backends updated" which may get gross in zuul (which shouldn't really need to know the dirty details of replication)17:09
jeblair1) is easy, and returns us to v2 status-quo -- mostly.  the downside to that is that v3 is much more intensive about fetching stuff from remotes (it merges things way more than necessary) so we are likely to see increased load on gerrit.17:09
fungii mean, having the option of offloading that to git.o.o would be nice, but not a regression over v217:09
jeblair2) clarkb just pointed out what would be involved in 2.  it's some non-straightforward zuul coding.17:09
fungii agree the increased gerrit traffic is something to keep an eye on, but not necessarily a problem17:10
fungias an aside, how many sessions to the ssh api are we likely to open in parallel from a single ip address?17:11
fungissh api and/or ssh jgit17:11
clarkbI think just one because we'll only grab a single gearman job at a time?17:11
jeblairthere are 2 things that make it inefficient -- we merge once for every check pipeline, and then we merge once for every build.  both of those things can be improved, but not easily.  though, having fewer 'check' pipelines will help.  v3 has 3 now.17:12
jeblairclarkb, fungi: yes, one per ip.17:12
fungijust wanting to keep in mind that we're currently testing a connlimit protection on review.o.o to help keep runaway ci systems under 100 concurrent connections, but sounds like this wouldn't run afoul of it anyway17:13
jeblairso it's a total of 14 from v3 in our current setup.  likely 18 when we're fully moved over.  compared to 4 now (but 8 when we were at full strength v2)17:13
fungiso around a 2x bump17:13
fungiseems safe enough17:14
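For context, the connlimit protection fungi describes would presumably look something like the following iptables rule (a sketch only, assuming the limit applies per source address on the Gerrit SSH port; the real rule on review.o.o may differ):

    # Reset new connections from any single IP that already holds 100
    # concurrent connections to the Gerrit SSH API.
    iptables -A INPUT -p tcp --syn --dport 29418 \
        -m connlimit --connlimit-above 100 --connlimit-mask 32 \
        -j REJECT --reject-with tcp-reset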
*** efried has joined #openstack-infra-incident17:15
mordredit seems to me like trying 1) while in our current state (we won't see gate merge load - but with the check pipelines we should still see a lot)17:15
efriedjeblair o/  Sorry, didn't even know this channel existed17:15
jeblairefried: no i'm sorry; i should have mentioned i was switching17:15
mordredwould give us an idea of whether or not it's workable to do that17:16
clarkbmordred: ++17:16
jeblairmordred: ya good point17:16
mordredhowever - it also seems like since we have a git farm, even if 1 is workable we still may want to consider putting 2 on the backlog17:16
jeblairmordred: we'll actually see the full load17:16
mordredjeblair: oh - good point17:16
mordredthat's great17:16
jeblairya i didn't even think about that till you mentioned it17:16
mordredI think the results of 1 will tell us how urgent 2 is17:17
jeblairthis scales with the size of the zuul cluster, not anything else (as long as it's not idle)17:17
jeblairokay, so i'll just go ahead and update all the origins17:17
jeblairand, um, if gerrit stops then i'll stop zuulv3 :)17:17
efriedCool beans guys, thanks for tracking this down!17:18
mordredjeblair: cool17:19
mordredjeblair: if gerrit stops, maybe just update the origins all back while we re-group :)17:19
fungiplan sounds good17:21
AJaegerefried: the channel is published on eavesdrop.openstack.org17:30
efriedAJaeger Yup, thanks, I'm caught up.17:30
*** rosmaita has quit IRC17:50
*** rosmaita has joined #openstack-infra-incident17:51
*** efried is now known as efried_nomnom18:17
dmsimardfungi: So just to wrap something up that worried me earlier, because I'm involved in this to some extent... As far as I can understand, "rh1 closing next week" is a misunderstanding. It is not closing next week. Soon, but not next week.18:19
dmsimardfungi: Soon, like, a matter of weeks, not months.18:19
dmsimardThe bulk of the work is already done and many different jobs have already been running off of RDO's Zuul18:20
fungidmsimard: yeah, the irc discussion in #-dev more or less confirmed that as well, but "soon" at least18:34
*** efried_nomnom is now known as efried18:58
* mordred waves to fungi and jeblair20:42
jeblairokay yeah, i thought we've run into the inode thing before20:43
jeblairi guess we reformatted?20:43
fungiso the manpage for mkfs.ext4 (shared by ext2 and ext3) says this about the -i bytes-per-inode value: "Be warned that it is not possible to change this ratio on a filesystem after it is created, so be careful deciding the correct value for this parameter. Note that resizing a filesystem changes the number of inodes to maintain this ratio."20:43
jeblairi checked the infra status log but did not find anything :(20:43
*** Shrews has joined #openstack-infra-incident20:44
fungiso if we grew the size of the filesystem, we'd get more inodes, but short of that...20:44
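For reference, checking how close a filesystem is to its inode ceiling, and the creation-time knob being discussed, looks roughly like this (a sketch; the device name matches the df output later in the log, and the mkfs line is illustrative only, since running it would destroy the filesystem):

    # Inode usage and totals for the logs filesystem.
    df -i /srv/static/logs
    tune2fs -l /dev/mapper/main-logs | grep -i inode

    # The bytes-per-inode ratio can only be set at creation time, e.g. one
    # inode per 4096 bytes instead of the usual default of one per 16384:
    #   mkfs.ext4 -i 4096 /dev/mapper/main-logs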
jeblairyeah, if only we could, right?20:45
fungijeblair: looking back through my infra channel log now. found an incident from 2013-12-0820:45
fungioh, docs-draft in 2014-03-0320:46
fungizm03 filled to max inodes for its rootfs on 2015-10-1220:47
fungiran out of inodes for /home/gerrit2 on review.o.o on 2016-02-1320:50
fungiconcluded at the time that it's unfixable for the fs, and so moved everything to a new cinder volume20:51
clarkbtripleo is still logging all of /etc in places last I checked20:52
clarkbdo we think its stuff like that using all the inodes?20:52
clarkbalso ara is tons of little files20:52
fungihttp://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-02-13.log.html#t2016-02-13T14:47:2220:53
mordredthe find is doing the gzip pass as well as the prune pass ... should we perhaps do a find pass that doesn't do gzipping ... and maybe add something to prune crazy /etc dirs?20:53
dmsimardclarkb: yes, an ara static report is not large but indeed lots of smaller files20:54
dmsimardThat's why I would like to explore other means of providing the report20:54
jeblairfungi: thanks, that archeology helps :)20:54
fungiyeah, in past emergencies i've made a version of the log maintenance script which omits the random delay, the docs-draft bit, and drops the compression stanza so it's just a deleter20:55
fungiand usually significantly dropped the retention timeframe while at it20:55
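That stripped-down variant would presumably boil down to something like the following (a sketch, not the actual contents of ~fungi/log_archive_maintenance.sh; the lock file path and 14-day retention are assumptions):

    #!/bin/bash
    # Emergency pruner: no random sleep, no docs-draft handling, no gzip
    # pass -- just delete old files, then empty directories, while holding
    # the same lock the regular cron job uses so the two cannot race.
    flock -n /var/run/log_archive_maintenance.lock sh -c '
        find /srv/static/logs -type f -mtime +14 -delete
        find /srv/static/logs -mindepth 1 -type d -empty -delete
    '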
mordredfungi: do you happen to have any of those versions around still?20:55
fungimordred: static.o.o:~fungi/log_archive_maintenance.sh20:56
mordredyah - ~fungi/log_archive_maintenance.sh exists20:56
fungiheh, it's like i'm predictable20:56
mordred:)20:56
mordredinfra-root: shall we stop the current cleaner script and run fungi's other version? or do we want to investigate other options?20:58
fungishall i #status alert something for now?20:58
fungimordred: yeah, i would stop it20:58
jeblairmordred: wfm20:58
jeblairfungi: ++20:58
fungirunning more than one at a time is just more i/o traffic slowing both down20:58
mordredjeblair, fungi: ok. do I need to do anything special to stop it other than kill?20:58
fungialso make sure something else like an mlocate.db update isn't running and having similar performance impact20:59
ianwcan someone give a 2 sentence summary of what's wrong for those of us who might not have been awake (i.e. me ;)20:59
jeblairianw: logs is full20:59
fungimordred: you can just kill the parent shell process and then kill the find20:59
fungiand that way it should wrap up without trying to run any subsequent commands20:59
mordredfungi: cool20:59
*** ChanServ changes topic to "logs volume is full"20:59
fungiand it should release the flock on its own that way21:00
mordredfungi: I'm going to start a screen session called repair_logs21:00
fungigood idea21:00
pabelangerspeaking of full volumes, this is from afs01.dfw.o.o: /dev/mapper/main-vicepa  3.0T  3.0T   30G 100% /vicepa21:00
jeblairmordred, fungi: and then maybe grab the flock in the custom script?21:00
jeblairpabelanger: want to throw another volume at it?21:00
mordredjeblair: it grabs the flock21:00
fungijeblair: the custom script already flocks the same file21:00
jeblairpabelanger: we can expand vicepa21:00
jeblairmordred, fungi: cool21:00
pabelangerjeblair: ya, let me see if there is anything to clean up first21:01
jeblairpabelanger: just the usual lvm stuff21:01
jeblairpabelanger: oh, maybe the docs backup volume?21:01
jeblairold-docs or whatever21:01
mordredfungi: bash script killed - next kill the find, not the flock?21:01
ianwpabelanger: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-10-09.log.html#t2017-10-09T06:15:55 ... that didn't take long :/21:01
fungimordred: yeah21:01
mordredfungi: and let flock just exist21:01
pabelangerjeblair: k, let me see21:01
mordredk21:01
fungithe flock should terminate21:01
fungion its own21:01
jeblairianw, pabelanger: i read that we still have 30G, right?21:01
jeblairso, i mean, probably several more hours!21:02
pabelangerYah :)21:02
mordredok. I have started ~fungi/log_archive_maintenance.sh21:02
pabelangerjeblair: okay, so docs-old can be deleted?21:02
ianwjeblair: yeah, so it was 45gb when i posted, and 30gb now21:02
jeblairpabelanger: i think so, but let's check with AJaeger or other docs folks first.  i think that would only give us 10G anyway.21:03
fungii checked iotop, and it looks like that find doesn't have major competition (unless we want to also stop apache and lock down the jenkins/zuul accounts)21:03
ianwpabelanger / jeblair : i'll take an action item to add a volume to that, after current crisis21:03
pabelangerianw: wfm21:03
jeblairianw: ack, thx21:03
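The "usual lvm stuff" for growing /vicepa once a new cinder volume is attached would go roughly like this (a sketch; the device name is an assumption, and if the partition is xfs the last step would be xfs_growfs instead of resize2fs):

    # Fold the new block device into the existing volume group and grow the LV.
    pvcreate /dev/xvdc
    vgextend main /dev/xvdc
    lvextend -l +100%FREE /dev/main/vicepa
    resize2fs /dev/main/vicepa    # grow the filesystem to match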
fungibut yeah, if the last find/delete/compress pass ran for 11 days and counting, i expect we've significantly increased inode count beyond just normal block count increases21:04
pabelangerI'm going to add mirror-update to emergency, and first manually apply https://review.openstack.org/502316/21:04
pabelangerthat will free up some room with the removal of opensuse-42.221:04
jeblairpabelanger, ianw: i went ahead and asked in #openstack-doc, but ajaeger is gone for the day so we shouldn't expect a complete answer about docs-old until tomorrow21:05
pabelangerk21:05
ianwok, yeah one less distro will help21:06
jeblairpabelanger, ianw: dhellmann asks that we *do* keep docs-old around for a while longer.  so let's not do anything with that volume, and just expand via lvm.21:08
pabelanger++21:09
dhellmannideally we'll be able to delete docs-old by the end of the cycle. I've made a note to coordinate with you all about that21:09
pabelanger/dev/mapper/main-vicepa  3.0T  2.9T   80G  98% /vicepa21:09
pabelangerback up to 80GB with removal of opensuse 42.221:09
pabelangershould be enough room until ianw gets the other volume21:09
ianwexcellent, that's breathing room21:09
pabelangerI'll clean up 502316 and get that approved21:10
mordred/dev/mapper/main-logs          768M  768M     0  100% /srv/static/logs21:10
mordredwe are not removing them faster than we are making them - at least not yet21:10
pabelangerYah, last time I ran it, took a little bit to get ahead of the curve21:11
mordredwelp - nothing I like more than watching a find command sit there and churn :)21:11
clarkbvroom vroom21:12
jeblairmordred: on the plus side -- it doesn't look like inodes are a problem now!21:12
jeblairdhellmann: thanks!21:13
pabelangermordred: I see 95% now :D21:14
mordredjeblair: oh - that was df -hi21:15
jeblairmordred: oh i never thought to use -h with -i :)21:18
jeblairmordred: though, clearly 768M should have clued me in21:18
pabelangerwe're trying to cleanup old logs, I see some movement21:26
mordredinfra-root: I'm finally seeing non-zero numbers of free inodes21:59
mordred/dev/mapper/main-logs          768M  768M   11K  100% /srv/static/logs21:59
mordredwell - that was short lived21:59
mordredseriously - something just ate 11k inodes21:59
clarkbwe should probably count the ara inode count and tripleo logs inode count21:59
clarkbas I think it likely those two are at least partially to blame22:00
pabelangeryah, ps shows tripleo jobs currently using the logs folder22:01
fungishould be able to adapt some of our earlier analysis scripts to do average inodes per job et cetera22:01
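One rough way to adapt that analysis -- counting files (and therefore inodes) per build directory and listing the worst offenders -- might be (a sketch; adjust the path depth to match the logs tree layout):

    # Inode count per build directory, biggest first.
    find /srv/static/logs -mindepth 6 -maxdepth 6 -type d |
        while read -r d; do
            printf '%s %s\n' "$(find "$d" | wc -l)" "$d"
        done | sort -rn | head -20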
pabelanger/srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f5422:01
mordredmordred@static:/srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54$ sudo find . | wc -l22:02
mordred2137422:02
mordredso that's 21k files per job22:02
mordredso ...22:03
mordred./logs/undercloud/tmp/ansible/lib64/python2.7/site-packages/ansible/modules/network/f5/.~tmp~22:03
dmsimardmordred: in their defense, there's probably like 3 ara reports in there22:03
mordrednope - that's not it22:03
mordredan ara report is 445 files22:03
dmsimardmordred: 1) devstack gate 2) from oooq-ara and from zuul v322:04
dmsimardokay22:04
pabelangerUm22:04
pabelanger/srv/static/logs/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp22:04
pabelangerdu -h22:04
pabelangerthat is glorious22:04
pabelangerthey are copying back ansible tmp files22:04
mordredyah22:05
mordredtons of things like logs/subnode-2/etc/pki/ca-trust/source/.~tmp~22:05
pabelangeryup22:05
pabelangerI have to run, but can help out when I get back22:05
pabelangerEmilienM: ^ might want to prepare for incoming logging work again22:06
dmsimardWhat I do know, and I saw that recently22:06
EmilienMhi22:06
dmsimardis that they glob the entirety of /var/log/**22:06
* EmilienM reads context22:06
mordredEmilienM: we're out of inodes on logs.o.o22:06
pabelangerhttp://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp22:06
EmilienMoh dear22:06
pabelangercontains a ton of ansible tmp files22:06
EmilienMwho added tmp dir22:07
EmilienMi noticed that recently, let me look22:07
pabelangerokay, have to run now.22:07
pabelangerbbiab22:07
mordredwe should put in a filter to not copy over anything that has '.~tmp~' in the path22:07
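On the collection side, that filter could be as simple as an extra exclude on whatever copies the logs off the node, e.g. with rsync (a sketch; "testnode" is a placeholder host, and the tripleo jobs' actual log collection tooling may use a different mechanism):

    # Skip anything named .~tmp~ (and everything under it) when pulling logs.
    rsync -avz --exclude='.~tmp~' testnode:/var/log/ ./logs/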
*** weshay|ruck has joined #openstack-infra-incident22:07
weshay|ruckhello22:07
mordred$ sudo find logs | grep -v '.~tmp~' | wc -l22:08
mordred336922:08
mordred$ sudo find logs | grep '.~tmp~' | wc -l22:08
mordred1800422:08
weshay|ruckdid tripleo fill up the log server?22:08
dmsimardweshay|ruck: can we cherry-pick what we want from /var/log instead of globbing everything ? https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/collect-logs/defaults/main.yml#L522:08
EmilienMweshay|ruck: since when we have tmp? http://logs.openstack.org//27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp22:08
mordred18k of the files are things with .~tmp~ in the path22:08
EmilienMweshay|ruck: out of inodes on logs.o.o22:08
mordredso if we can just stop uploading those I think we'll be in GREAT shape22:08
dmsimardweshay|ruck: tripleo did not single handedly fill up the log server, but contributes to the problem22:08
weshay|ruckwe cherry pick quite a bit afaik22:08
weshay|ruckbut can look further22:08
mordredseriously - just filter '.~tmp~' and I think we'll be golden22:09
dmsimardweshay|ruck: I think the ansible tmpdirs is the most important part to fix22:09
dmsimardweshay|ruck: there's no value in logging those directories at all http://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/daa71f9/logs/undercloud/tmp/22:10
weshay|ruckthat's odd.. I don't remember tmp being there22:10
weshay|ruckk.. sec I'll add it to our explicit exclude22:10
weshay|ruckthat's new though22:10
EmilienMyeah22:11
EmilienMI'm still going through the git log now22:11
mordredthing is- apache won't even show the files with .~tmp~ in them22:12
mordredhttp://logs.openstack.org/27/505827/14/gate/gate-tripleo-ci-centos-7-containers-multinode/8bd0f54/logs/undercloud/tmp/ansible/lib64/python2.7/encodings/ is an example ...22:12
EmilienMis that https://review.openstack.org/#/c/483867/4/toci-quickstart/config/collect-logs.yml ?22:12
mordredthere's 210 files in that dir22:12
dmsimardEmilienM: that sounds like a good suspect22:12
EmilienMyeah i'm not sure, let me keep digging now22:13
dmsimard"Collect the files under /tmp/ansible, which are useful to debug mistral executions. This dir contains the generated inventories, the generated playbooks, ssh_keys, etc."22:13
EmilienMright22:14
EmilienMif you read comment history we were not super happy with this patch22:14
EmilienMwe can easily revert it and fast approve22:14
EmilienMI'm just making sure it's this one22:14
EmilienMmordred, dmsimard, weshay|ruck : ok confirmed. I'm reverting and merging22:15
weshay|ruckya.. you guys found the same commit22:16
EmilienMhttps://review.openstack.org/#/c/511347/22:16
dmsimardEmilienM: flaper mentions the possibility of making mistral write the interesting things elsewhere so it looks like it can be worked around and is not critical22:16
weshay|ruckdmsimard, we need to remove it.. I see it there twice22:16
EmilienMdmsimard: it's not critical at all22:16
EmilienMweshay|ruck: twice?22:17
EmilienMmordred: do whatever you can to promote https://review.openstack.org/#/c/511347 if possible22:17
weshay|ruckwell /tmp/*.yml /tmp/*.yaml AND /tmp/ansible22:17
EmilienMweshay|ruck: /tmp/*.yml /tmp/*.yaml would be another patch22:18
weshay|ruckok22:18
EmilienMto keep proper git history I prefer a revert + separate patch for /tmp/*.yml /tmp/*.yaml22:18
EmilienMweshay|ruck: are you able to send patch for /tmp/*.yml /tmp/*.yaml ? otherwise I'll look when I can, I'm in mtg now22:19
dmsimard /tmp/*.yml and /tmp/*.yaml is certainly not as big of a deal22:19
weshay|ruckEmilienM, ya.. I'll post one22:20
EmilienMk22:20
EmilienMTBH I don't see why we do that22:20
EmilienMlike why do we collect /tmp...22:20
EmilienMit's not in my books :)22:20
pabelangermade it up to 60k free inodes, something ate it up23:38
clarkbcould be tripleo jobs that started before the fix merged23:38
pabelangerya,23:39
clarkbthere is likely going to be a period of time where that happens23:39
clarkbsince those jobs take up to ~3 hours23:39
pabelangerI've found a few large tripleo patches and am manually purging the directories23:39
pabelangerseems to be helping, up to 108K now23:39
pabelanger/dev/mapper/main-logs          768M  768M  205K  100% /srv/static/logs23:43
pabelangerheading in the right direction23:43
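For watching the recovery, repeatedly checking inode usage is enough (the numbers above come from manual df runs; this just automates the same check):

    # Refresh inode usage for the logs filesystem every minute.
    watch -n 60 df -i /srv/static/logs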
