Saturday, 2014-01-18

clarkblifeless: isn't it your weekend? :P01:35
devanandahey lifeless, got a few minutes?01:35
clarkbwe have a long weekend this side of the planet01:35
devanandaclarkb: what's monday again?01:35
clarkbdevananda: mlk jr day01:35
devanandaah right01:35
derekhlifeless: got a node running the jjb stuff again, now rebuilding the TE before I clock off01:49
openstackgerritDerek Higgins proposed a change to openstack-infra/tripleo-ci: Switch test environment users
derekhlifeless: looks like TE host on the baremetal cloud can't contact the broker, have updated the etherpad with more up-to-date notes and will pick it back up tomorrow02:55
derekhroot@testenv-testenvconfig-lieghn64l4vq:/home/heat-admin# ping
derekh13 packets transmitted, 0 received, +7 errors, 100% packet loss, time 12065ms02:56
openstackgerritTzu-Mainn Chen proposed a change to openstack/tuskar-ui: Add unstyled overcloud resource category page
*** rushiagr has joined #tripleo07:32
*** akuznetsov has joined #tripleo09:10
*** derekh has joined #tripleo11:50
*** derekh has quit IRC13:34
*** CaptTofu has joined #tripleo17:37
lifelessmore crickets!19:21
lifelessto the theme of 'more cowbell!'19:21
SpamapSI got a fever20:21
lifelessI think we're going to have to debug this network performance thing asap20:39
lifeless12kBps is too slow20:39
phschwartzWhat type of network issue. I have some cycles while testing of a new release here is going on and I can take a look20:39
lifelessphschwartz: on, which is a regular cd-overcloud just a different name, so we have a stable base for infra to run in20:40
lifelessphschwartz: instances are getting 12kbps from the internet20:40
lifelessphschwartz: gre overlay network20:40
phschwartzgre+ovs I take it20:40
phschwartzI have had this issue a few times. What version of ovs is installed?20:41
lifelessml2 driver20:41
lifelessI just tried clamping the mtu of an instance down, no discernable effect20:41
lifelesslet me log into the plumbing and I'll answer the ovs question20:41
lifelessphschwartz: ovs-vsctl --version20:42
lifelessovs-vsctl (Open vSwitch) 1.10.220:42
lifelessCompiled Sep 23 2013 14:53:1320:42
lifelesson the network node20:42
phschwartzNo, that won't do it. One of the older ovs installs had an issue with gre networking that caused its in memory datastore for ovs+gre routing to eat cpu and ram. It would clean the ram, but would leave cpu usage high causing a reduction in traffic routing compute which in turn slows down throughput20:43
lifelesswhich is able to pull 20MB/s from the host I was testing against20:43
phschwartzLet me check to see if that is the version with the issue or not20:43
lifelesssame version on the compute node20:43
phschwartzok, that is the one I had the issue with that over time would have the same problem. I was running the default ubuntu installed 1.10.2 on 13.04.20:45
lifeless1.10.2 is bad?20:45
phschwartzCompiling for my local 1.11.0 fixed the issue. The other thing that helped was moving from using the python wrapper for root commands20:45
phschwartzI found it to be with gre20:46
phschwartzWorks good with nvp20:46
lifelessok, that's super useful. Thanks!20:46
lifelessphschwartz: is nvp open source?20:46
phschwartzI found this before I started with Rax, but I think someone in Rax found the same as they moved to nvp and that I know still run 1.10.220:46
phschwartzno, it is not.20:46
lifelessah :)20:46
lifelessok, so we need to replace the openvswitch packages too20:47
lifelessI can see us just building everything from scratch :/20:47
phschwartzThat was what I did for the fix. Built my own and made a local repo for install20:47
phschwartzI need to look at the bug backlog and work on a few when I have time like this. Haven't had much time lately.20:47
lifelessI'm going to poke deeper on this, as there isn't a CPU problem today, just a throughput problem20:49
phschwartzI think what I found on the ovs mailing lists when I had the issue was that it would eat ram and cpu, then kill the gre threads, and it would severely limit them when it respawns them and that is why it has the issue.20:50
lifelessyeah but this is right from first vm on cloud ever20:50
lifelessit would have to eat them spectacularly fast...20:50
phschwartzI found it to happen very fast20:52
lifelessorder of minutes?20:52
phschwartzIf he is on, kbringard in #openstack had the same issue in the ovs+gre setup that at&t was using and helped me locate the issue. He might have more in depth info still.20:53
phschwartzyes, a matter of minutes20:53
lifelessok, cool20:53
lifelessso - replacing the package version is going to be a little tricky right now, but will dig into it20:53
phschwartzI would get the slow down starting within 2-5 min of quantum bringing up networking as a whole for my env.20:53
phschwartzIt will be in this case. I had the benefit of a small cluster at the time with no impact of stopping to redo it.20:54
lifelessso one thing that's odd20:54
lifelesswhen I wget from the instance to the world - slow20:54
lifelesswhen I rsync up the same content from my home to the instance - fast20:54
lifelessphschwartz: would restarting openvswitch temporarily fix things?20:55
phschwartzdefn the same issue that I had then. It was slowness in the computing of routing in the ovs namespaces that were using gre.20:55
phschwartzThat would work sometimes, but usually needed a host reboot.20:55
lifelessrighto, from the ip router netns I get 160Mbps of throughput to a static file in the UK20:56
lifelesswhich isn't brilliant but is tolerable20:57
phschwartzI would see not even to the net, but between external networks in the datacenter hits where I would get 50-60kbps, and the core network for the env was 160gb and the clusters interconnect was 8 10g ports aggregated with 2 10g aggregated on each host.21:02
phschwartzYou can never be 100% positive, but defn sounds like the same issue I was having21:02
phschwartzWhen I would get rid of namespacing it would improve, but that defeats the purpose21:03
lifelesstrusty has 2.021:03
lifelessthat might be easier21:03
phschwartzhere is a mail list thread from OS that someone had the same issue.
phschwartzIn their case, the only fix was setting up a proxy to get around the issue with the gre namespacing21:04
lifelessfamily time, shall dig in in detail this evening21:06
lifelessthanks for the pointers21:06
phschwartzJust had a network eng from LexisNexis (where I used to work) remind me that we also had to turn GRO off on the hardware side as the offloading made the problem happen a lot faster.21:06
phschwartzno problem at all21:06
*** derekh has joined #tripleo21:41
derekhneed a bigger VM - Out of memory23:27
